Bug 364 - When joined to an ADS domain the sequence number becomes DISCONNECTED after 30 or so minutes of idle time.
Status: CLOSED FIXED
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: winbind
Version: 3.0.0preX
Hardware: Other other
Importance: P1 critical
Target Milestone: 3.0.0rc3
Assignee: Gerald (Jerry) Carter (dead mail address)
Reported: 2003-08-28 09:09 UTC by Marc Kaplan
Modified: 2005-08-24 10:25 UTC

Attachments
Fix sequence_number timeout (1.08 KB, patch)
2003-08-29 10:04 UTC, Ken Cross
Full log of Winbind in action @ level 10. (50MB when inflated) (890.80 KB, application/octet-stream)
2004-08-01 18:47 UTC, Greg Folkert

Description Marc Kaplan 2003-08-28 09:09:24 UTC
This is a showstopper IMO, since it basically causes a denial of service for 
those in ADS domains.

Email from Ken Cross:
"Samba-folk:

The nsswitch/winbindd_ads.c module uses "ads_cached_connection" to maintain
an open connection to an LDAP server.  I have found at least one instance
where that isn't working too well.

When viewing the sequence number with wbinfo --sequence, I noticed that
servers would periodically show up "DISCONNECTED", then later (usually)
they'd be OK.  The problem seems to be in using the cached connection.  The
call to ads_USN in the sequence_number routine was returning "Can't contact
LDAP server".

I added a retry in the sequence_number function (blowing away the cached
connection) and the retry always seems to succeed.

I'm speculating here, but is it really valid to leave a connection open
indefinitely?  There can be long intervals between updating sequence
numbers.

Also, if the sequence_number routine fails at times, could other routines
using ads_cached_connection (11 of them) be having problems occasionally?
If so, a more general solution needs to be considered.

Ken"
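
The retry Ken describes can be sketched roughly as follows. This is only an illustration in Python, not the actual patch: `CachedConnection`, `StaleConnectionError`, `sequence_number` and `fake_query` are hypothetical names standing in for the cached-connection logic in winbindd_ads.c, and the returned value is just the A-DOMAIN number from the output below.

```python
class StaleConnectionError(Exception):
    """Raised when a cached connection turns out to be dead (hypothetical)."""


class CachedConnection:
    """Toy stand-in for a cached LDAP connection; not Samba's real API."""

    def __init__(self, connect):
        self._connect = connect  # callable that opens a fresh connection
        self._conn = None        # the cached connection, if any

    def get(self):
        # Reuse the cached connection when there is one, else open a new one.
        if self._conn is None:
            self._conn = self._connect()
        return self._conn

    def invalidate(self):
        # "Blow away" the cached connection so the next get() reconnects.
        self._conn = None


def sequence_number(cache, query_usn):
    """Return the sequence number, retrying once on a stale connection."""
    try:
        return query_usn(cache.get())
    except StaleConnectionError:
        cache.invalidate()
        return query_usn(cache.get())


def fake_query(conn):
    # Simulated query: the stale connection fails, the fresh one answers.
    if conn == "stale":
        raise StaleConnectionError("Can't contact LDAP server")
    return 789953


# Demo: the first cached connection has gone stale, the second one works.
connections = iter(["stale", "fresh"])
cache = CachedConnection(lambda: next(connections))
print(sequence_number(cache, fake_query))  # prints 789953
```

The sketch retries exactly once, so a server that is genuinely down still surfaces as an error instead of looping.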

So one moment I have this (LOTHLORIEN is an NT4 domain, the rest are ADS):
[root@ThunderBird lib]# wbinfo --sequence
LOTHLORIEN : 444
C-DOMAIN : 332974
B-DOMAIN : 423905
A-DOMAIN : 789953

15 minutes or so later, I have:
[root@ThunderBird lib]# wbinfo --sequence
LOTHLORIEN : 444
C-DOMAIN : DISCONNECTED
B-DOMAIN : DISCONNECTED
A-DOMAIN : DISCONNECTED
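
For anyone who wants to watch for this state automatically, here is a minimal sketch that parses output in the format shown above. It assumes the `DOMAIN : value` layout seen in this report; `parse_sequence_output` is a hypothetical helper, not part of Samba.

```python
def parse_sequence_output(text):
    """Map each domain to its sequence number, or None if DISCONNECTED."""
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank lines and shell prompts
        domain, _, value = line.partition(":")
        value = value.strip()
        if value == "DISCONNECTED":
            result[domain.strip()] = None
            continue
        try:
            result[domain.strip()] = int(value)
        except ValueError:
            continue  # not a sequence line; ignore it
    return result


sample = """LOTHLORIEN : 444
C-DOMAIN : DISCONNECTED
B-DOMAIN : DISCONNECTED
A-DOMAIN : DISCONNECTED
"""

parsed = parse_sequence_output(sample)
disconnected = [name for name, seq in parsed.items() if seq is None]
print(disconnected)  # prints ['C-DOMAIN', 'B-DOMAIN', 'A-DOMAIN']
```

In practice the text would come from running `wbinfo --sequence`; the sample string here is copied from the output above so the sketch runs without winbindd.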

Here's a log snippet @ debug level 10:
When I first join, and I request sequence, I see:
process_request: request fn SHOW_SEQUENCE
[  861]: show sequence
refresh_sequence_number: A-DOMAIN time ok
refresh_sequence_number: A-DOMAIN seq number is now 788398


What's happening is that I'm getting:
process_request: request fn SHOW_SEQUENCE
[16992]: show sequence
refresh_sequence_number: A-DOMAIN time ok
refresh_sequence_number: A-DOMAIN seq number is now -1

There's not too much other interesting info in the log about this.
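
Since the only visible symptom in the log is the sequence number becoming -1, a small scanner like the one below can pick those events out of a large level 10 log. It is a sketch that assumes the log line format quoted above and reads -1 as the failed-refresh marker, which is how it appears in this report; `failed_refreshes` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   refresh_sequence_number: A-DOMAIN seq number is now -1
SEQ_LINE = re.compile(
    r"refresh_sequence_number: (\S+) seq number is now (-?\d+)"
)


def failed_refreshes(log_text):
    """Return the domains whose logged sequence number was -1."""
    return [
        match.group(1)
        for match in SEQ_LINE.finditer(log_text)
        if int(match.group(2)) == -1
    ]


sample_log = """process_request: request fn SHOW_SEQUENCE
[  861]: show sequence
refresh_sequence_number: A-DOMAIN time ok
refresh_sequence_number: A-DOMAIN seq number is now 788398
process_request: request fn SHOW_SEQUENCE
[16992]: show sequence
refresh_sequence_number: A-DOMAIN time ok
refresh_sequence_number: A-DOMAIN seq number is now -1
"""

print(failed_refreshes(sample_log))  # prints ['A-DOMAIN']
```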
Comment 1 Ken Cross 2003-08-29 10:04:09 UTC
Created attachment 113
Fix sequence_number timeout

FWIW, here's the patch I did to retry the sequence_number call.  However, I am
still concerned that other routines using ads_cached_connection could be having
similar problems that are more subtle and harder to detect.
Comment 2 Marc Kaplan 2003-08-30 09:50:03 UTC
Yes, it's not just the sequence number that is affected. Authentication to a 
Samba server in this state fails, which is a big problem, since it can't succeed
again until winbindd is restarted.
Comment 3 Gerald (Jerry) Carter (dead mail address) 2003-09-04 21:47:16 UTC
I'm checking in a version of Ken's retry patch for 
all major winbindd_ads operations.  I've yet to get a 
disconnected message but please check.  

I think this should be fixed now.  If not, then reopen.
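
As a rough picture of what applying the retry to every operation could look like, here is a sketch using a Python decorator. It is not the checked-in change: `LdapUnavailable`, `retry_with_fresh_connection`, `reset_connection`, `query_user_list` and `lookup_groups` are hypothetical names, and the returned values are sample data only.

```python
import functools


class LdapUnavailable(Exception):
    """Hypothetical error for "Can't contact LDAP server"."""


def retry_with_fresh_connection(reset):
    """Wrap an operation so one failure triggers a reset and a single retry."""
    def decorate(operation):
        @functools.wraps(operation)
        def wrapper(*args, **kwargs):
            try:
                return operation(*args, **kwargs)
            except LdapUnavailable:
                reset()  # drop the cached connection and reconnect
                return operation(*args, **kwargs)
        return wrapper
    return decorate


# Simulated connection state shared by all operations in this demo.
state = {"connected": False, "resets": 0}


def reset_connection():
    state["connected"] = True
    state["resets"] += 1


@retry_with_fresh_connection(reset_connection)
def query_user_list():
    if not state["connected"]:
        raise LdapUnavailable("Can't contact LDAP server")
    return ["hholkey", "jmonchek"]


@retry_with_fresh_connection(reset_connection)
def lookup_groups():
    if not state["connected"]:
        raise LdapUnavailable("Can't contact LDAP server")
    return ["Domain Users", "Domain Computers"]


print(query_user_list())  # first call fails once, resets, then succeeds
print(lookup_groups())    # connection is already fresh, so no extra reset
print(state["resets"])    # prints 1
```

Keeping the retry in one wrapper means each operation gets the same behaviour without repeating the reconnect logic.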
Comment 4 Greg Folkert 2004-08-01 18:47:32 UTC
Created attachment 592
Full log of Winbind in action @ level 10. (50MB when inflated)

This log did not capture the timeout event, but it did catch the restart and the
8 hours before it.

winbind reset happened at 2004/07/31 22:29:01 and at about 23:06 (for a config
change)
Comment 5 Greg Folkert 2004-08-01 19:02:23 UTC
This is a major, major annoyance. It can be fixed relatively quickly when it
happens, but as the contracted consultant brought in to finish this, I carry a
bit more "responsibility" for it at the large firm I am working for.

Just a note to say this *IS* still happening, even in 3.0.5 (the security
release version, not a bugfix/enhancement release candidate).

This is on RHES v3.0 Service release 2, with MIT Kerberos updated to v1.3.4,
samba v3.0.5, pam v0.77, pam_smb v1.1.7, pam_krb5 v2.0.10, e2fsprogs v


Winbind settings in smb.conf:

        realm = NETWORKMCS.COM
        security = ADS
        auth methods = winbind, sam
        obey pam restrictions = Yes
        username level = 3
        lanman auth = No
        ntlm auth = No
        client NTLMv2 auth = Yes
        client lanman auth = No
        client plaintext auth = No
        smb ports = 445
        disable netbios = Yes
        server signing = auto
        socket options = IPTOS_LOWDELAY TCP_NODELAY

        idmap uid = 10000-40000
        idmap gid = 10000-40000
        template homedir = /lf/data/home/%D/%U
        template shell = /bin/bash
        winbind separator = +
        winbind cache time = 20
        winbind nested groups = Yes



-----SNIPPET of console----
[root@mash drop_off]# l /lf/data/home/NETWORKCCI/
total 24
drwxr-xr-x    4 10847    10216        4096 Jul 26 12:48 cbungard01_
drwxr-xr-x    4 10635    10008        4096 Jul 26 13:09 hholkey
drwxr-xr-x    4 10849    10216        4096 Jul 26 12:46 hholkey1_
drwxr-xr-x    4 10659    10008        4096 Jul 26 12:50 jmonchek
drwxr-xr-x    4 10850    10216        4096 Jul 26 12:51 jmonchek01_
drwxr-xr-x    4 10722    10008        4096 Jul 26 12:43 mstewart


[root@mash drop_off]# wbinfo -p
Ping to winbindd succeeded on fd 4

[root@mash drop_off]# wbinfo --sequence
CC3 : DISCONNECTED
NETWORKMG : 74731
NETWORKCCI : 19309
NETWORKMCS : 18305
NETWORKDMC : 1255
CCGROUPNET : DISCONNECTED


(added by greg@gregfolkert.net: Date 2004/07/31 22:29:01)
[root@mash drop_off]#  /etc/init.d/winbind  restart

Shutting down Winbind services:                            [  OK  ]
Starting Winbind services:                                 [  OK  ]


[root@mash drop_off]# wbinfo --sequence
CC3 : 1
NETWORKMG : 74731
NETWORKCCI : 19309
NETWORKMCS : 18305
NETWORKDMC : 1255
CCGROUPNET : 6524523



[root@mash drop_off]# l /lf/data/home/NETWORKCCI/
total 24
drwxr-xr-x    4 NETWORKCCI+CBUNGARD01$ NETWORKCCI+Domain Computers     4096 Jul 26 12:48 cbungard01_
drwxr-xr-x    4 NETWORKCCI+hholkey NETWORKCCI+Domain Users     4096 Jul 26 13:09 hholkey
drwxr-xr-x    4 NETWORKCCI+HHOLKEY1$ NETWORKCCI+Domain Computers     4096 Jul 26 12:46 hholkey1_
drwxr-xr-x    4 NETWORKCCI+jmonchek NETWORKCCI+Domain Users     4096 Jul 26 12:50 jmonchek
drwxr-xr-x    4 NETWORKCCI+JMONCHEK01$ NETWORKCCI+Domain Computers     4096 Jul 26 12:51 jmonchek01_
drwxr-xr-x    4 NETWORKCCI+mstewart NETWORKCCI+Domain Users     4096 Jul 26 12:43 mstewart
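
The two listings differ only in whether the numeric owner IDs resolve to names, so a check along these lines could flag the broken state without eyeballing `ls` output. This is a sketch, assuming a Unix host where Python's `pwd` module goes through the same name service as `ls`; `unresolved_uids` and `fake_lookup` are hypothetical helpers, and the demo uses a fake lookup so it runs without winbindd.

```python
import pwd


def unresolved_uids(uids, lookup=pwd.getpwuid):
    """Return the uids that the name service cannot map to a user name."""
    missing = []
    for uid in uids:
        try:
            lookup(uid)
        except KeyError:
            # pwd.getpwuid raises KeyError when no entry exists for the uid.
            missing.append(uid)
    return missing


# Demo with a fake lookup: only one of the two uids is known.
known = {10635: "NETWORKCCI+hholkey"}


def fake_lookup(uid):
    return known[uid]  # raises KeyError for unknown uids, like pwd.getpwuid


print(unresolved_uids([10635, 10847], lookup=fake_lookup))  # prints [10847]
```

Called without the `lookup` argument, the function asks the real name service, which is where winbind would normally supply the answer.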


Comment 6 Greg Folkert 2004-08-01 19:04:45 UTC
Also, please either re-open this bug, or add one for me.

This is the deal exactly. Same as reported the first time.
Comment 7 Gerald (Jerry) Carter (dead mail address) 2005-02-07 09:06:10 UTC
originally reported against one of the 3.0.0rc[1-4] releases.
Cleaning up non-production versions.
Comment 8 Gerald (Jerry) Carter (dead mail address) 2005-08-24 10:25:40 UTC
sorry for the same, cleaning up the database to prevent unnecessary reopens of bugs.