This is a showstopper IMO, since it basically causes a denial of service for those in ADS domains.

Email from Ken Cross:

"Samba-folk:

The nsswitch/winbindd_ads.c module uses "ads_cached_connection" to maintain an open connection to an LDAP server. I have found at least one instance where that isn't working too well.

When viewing the sequence number with wbinfo --sequence, I noticed that servers would periodically show up "DISCONNECTED", then later (usually) they'd be OK.

The problem seems to be in using the cached connection. The call to ads_USN in the sequence_number routine was returning "Can't contact LDAP server". I added a retry in the sequence_number function (blowing away the cached connection) and the retry always seems to succeed.

I'm speculating here, but is it really valid to leave a connection open indefinitely? There can be long intervals between updating sequence numbers.

Also, if the sequence_number routine fails at times, could other routines using ads_cached_connection (11 of them) be having problems occasionally? If so, a more general solution needs to be considered.

Ken"

So one moment I have this (LOTHLORIEN is an NT4 domain, the rest are ADS):

[root@ThunderBird lib]# wbinfo --sequence
LOTHLORIEN : 444
C-DOMAIN : 332974
B-DOMAIN : 423905
A-DOMAIN : 789953

15 minutes or so later, I have:

[root@ThunderBird lib]# wbinfo --sequence
LOTHLORIEN : 444
C-DOMAIN : DISCONNECTED
B-DOMAIN : DISCONNECTED
A-DOMAIN : DISCONNECTED

Here's a log snippet @ debug level 10. When I first join, and I request sequence, I see:

process_request: request fn SHOW_SEQUENCE
[ 861]: show sequence
refresh_sequence_number: A-DOMAIN time ok
refresh_sequence_number: A-DOMAIN seq number is now 788398

What's happening is that I'm getting:

process_request: request fn SHOW_SEQUENCE
[16992]: show sequence
refresh_sequence_number: A-DOMAIN time ok
refresh_sequence_number: A-DOMAIN seq number is now -1

There's not too much other interesting info in the log about this.
Created attachment 113
Fix sequence_number timeout

FWIW, here's the patch I did to retry the sequence_number. However, I am still concerned that other routines using ads_cached_connection could be having similar problems that are more subtle and harder to detect.
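For readers without access to the attachment, the retry idea described above can be sketched roughly as follows. This is not the attached patch and it does not use Samba's real types or functions: `fake_conn`, `fake_domain`, `query_usn` and the other names are hypothetical stand-ins that simulate a cached connection going stale. The only details taken from this report are the approach (on failure, throw away the cached connection and try once more) and the -1 value that shows up in the log when the lookup fails.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cached LDAP connection (not Samba's ADS_STRUCT). */
struct fake_conn {
    bool alive;               /* false once the server has dropped the link */
};

/* Hypothetical per-domain state that owns the cached connection. */
struct fake_domain {
    struct fake_conn *cached; /* NULL means nothing is cached */
    long server_usn;          /* value the simulated server would report */
    int connects;             /* number of (re)connects performed so far */
};

#define SEQ_DISCONNECTED (-1L) /* matches "seq number is now -1" in the log */

/* Return the cached connection, opening a new one if none is cached. */
static struct fake_conn *cached_connection(struct fake_domain *d)
{
    if (d->cached == NULL) {
        d->cached = malloc(sizeof(*d->cached));
        if (d->cached == NULL) {
            return NULL;
        }
        d->cached->alive = true;
        d->connects++;
    }
    return d->cached;
}

/* "Blow away" the cached connection so the next call reconnects. */
static void drop_cached_connection(struct fake_domain *d)
{
    free(d->cached);
    d->cached = NULL;
}

/* Simulated query: fails when the connection has gone stale. */
static bool query_usn(const struct fake_domain *d,
                      const struct fake_conn *c, long *seq)
{
    if (!c->alive) {
        return false;
    }
    *seq = d->server_usn;
    return true;
}

/* Try the cached connection first; on failure, reconnect and retry once. */
long sequence_number_with_retry(struct fake_domain *d)
{
    long seq = SEQ_DISCONNECTED;

    for (int attempt = 0; attempt < 2; attempt++) {
        struct fake_conn *c = cached_connection(d);
        if (c == NULL) {
            break;
        }
        if (query_usn(d, c, &seq)) {
            return seq;
        }
        drop_cached_connection(d);
    }
    return SEQ_DISCONNECTED;
}
```

The retry is deliberately limited to a single extra attempt, so a server that is really down still fails quickly instead of looping.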
Yes, it's not just the sequence number that is affected. Authentication to a Samba server in this state fails, which is a big problem, since it can't succeed again until winbindd is restarted.
I'm checking in a version of Ken's retry patch for all major winbindd_ads operations. I've yet to get a disconnected message, but please check. I think this should be fixed now. If not, then reopen.
Created attachment 592
Full log of Winbind in action @ level 10. (50MB when inflated)

This log did not capture the timeout event, but it did catch the restart and the 8 hours before it. Winbind resets happened at 2004/07/31 22:29:01 and at about 23:06 (for a config change).
This is a major, major annoyance. It can be fixed relatively quickly, but being the contracted consultant hired to finish this gives me a bit more "responsibility" toward the large firm I am working for.

Just a note to say this *IS* still happening, even in 3.0.5 (the security release version, not a bugfix/enhancement release candidate).

This is on RHES v3.0 Service release 2, with updated MIT kerberos v1.3.4, samba v3.0.5, pam v0.77, pam_smb v1.1.7, pam_krb5 v2.0.10, e2fsprogs v

Winbind settings in smb.conf:

realm = NETWORKMCS.COM
security = ADS
auth methods = winbind, sam
obey pam restrictions = Yes
username level = 3
lanman auth = No
ntlm auth = No
client NTLMv2 auth = Yes
client lanman auth = No
client plaintext auth = No
smb ports = 445
disable netbios = Yes
server signing = auto
socket options = IPTOS_LOWDELAY TCP_NODELAY
idmap uid = 10000-40000
idmap gid = 10000-40000
template homedir = /lf/data/home/%D/%U
template shell = /bin/bash
winbind separator = +
winbind cache time = 20
winbind nested groups = Yes

-----SNIPPET of console----
[root@mash drop_off]# l /lf/data/home/NETWORKCCI/
total 24
drwxr-xr-x 4 10847 10216 4096 Jul 26 12:48 cbungard01_
drwxr-xr-x 4 10635 10008 4096 Jul 26 13:09 hholkey
drwxr-xr-x 4 10849 10216 4096 Jul 26 12:46 hholkey1_
drwxr-xr-x 4 10659 10008 4096 Jul 26 12:50 jmonchek
drwxr-xr-x 4 10850 10216 4096 Jul 26 12:51 jmonchek01_
drwxr-xr-x 4 10722 10008 4096 Jul 26 12:43 mstewart
[root@mash drop_off]# wbinfo -p
Ping to winbindd succeeded on fd 4
[root@mash drop_off]# wbinfo --sequence
CC3 : DISCONNECTED
NETWORKMG : 74731
NETWORKCCI : 19309
NETWORKMCS : 18305
NETWORKDMC : 1255
CCGROUPNET : DISCONNECTED
(added by greg@gregfolkert.net: Date 2004/07/31 22:29:01)
[root@mash drop_off]# /etc/init.d/winbind restart
Shutting down Winbind services: [ OK ]
Starting Winbind services: [ OK ]
[root@mash drop_off]# wbinfo --sequence
CC3 : 1
NETWORKMG : 74731
NETWORKCCI : 19309
NETWORKMCS : 18305
NETWORKDMC : 1255
CCGROUPNET : 6524523
[root@mash drop_off]# l /lf/data/home/NETWORKCCI/
total 24
drwxr-xr-x 4 NETWORKCCI+CBUNGARD01$ NETWORKCCI+Domain Computers 4096 Jul 26 12:48 cbungard01_
drwxr-xr-x 4 NETWORKCCI+hholkey NETWORKCCI+Domain Users 4096 Jul 26 13:09 hholkey
drwxr-xr-x 4 NETWORKCCI+HHOLKEY1$ NETWORKCCI+Domain Computers 4096 Jul 26 12:46 hholkey1_
drwxr-xr-x 4 NETWORKCCI+jmonchek NETWORKCCI+Domain Users 4096 Jul 26 12:50 jmonchek
drwxr-xr-x 4 NETWORKCCI+JMONCHEK01$ NETWORKCCI+Domain Computers 4096 Jul 26 12:51 jmonchek01_
drwxr-xr-x 4 NETWORKCCI+mstewart NETWORKCCI+Domain Users 4096 Jul 26 12:43 mstewart
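Since the failure only shows up as "DISCONNECTED" in the wbinfo --sequence output, anyone trying to reproduce this may want to check for it automatically instead of by eye. Below is a minimal sketch of a line check, assuming the "DOMAIN : value" layout seen in the console output in this report. The function name `sequence_line_disconnected` is made up for illustration and is not part of Samba or wbinfo.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: inspect one line of `wbinfo --sequence` output,
 * assumed to look like "A-DOMAIN : 789953" or "A-DOMAIN : DISCONNECTED".
 * Returns true and copies the domain name into `domain` when the line
 * reports DISCONNECTED; returns false for any other line.
 */
bool sequence_line_disconnected(const char *line, char *domain,
                                size_t domain_size)
{
    const char *colon = strchr(line, ':');
    if (colon == NULL || domain_size == 0) {
        return false;
    }

    /* Domain name: text before the colon, surrounding blanks trimmed. */
    const char *start = line;
    while (*start == ' ' || *start == '\t') {
        start++;
    }
    const char *end = colon;
    while (end > start && (end[-1] == ' ' || end[-1] == '\t')) {
        end--;
    }
    size_t len = (size_t)(end - start);
    if (len == 0) {
        return false;
    }
    if (len >= domain_size) {
        len = domain_size - 1; /* truncate rather than overflow */
    }

    /* Value: text after the colon, leading blanks skipped. */
    const char *value = colon + 1;
    while (*value == ' ' || *value == '\t') {
        value++;
    }
    if (strncmp(value, "DISCONNECTED", strlen("DISCONNECTED")) != 0) {
        return false;
    }

    memcpy(domain, start, len);
    domain[len] = '\0';
    return true;
}
```

Feeding each output line through this check and logging a timestamp for every hit would give a record of when each domain dropped, which is the piece of information the level 10 log above failed to capture.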
Also, please either re-open this bug, or add one for me. This is the deal exactly. Same as reported the first time.
Originally reported against one of the 3.0.0rc[1-4] releases. Cleaning up non-production versions.
Sorry for the same, cleaning up the database to prevent unnecessary reopens of bugs.