364 – When joined to an ADS domain the sequence number becomes DISCONNECTED after 30 or so minutes of idle time.

Status:	CLOSED FIXED

Alias:	None

Product:	Samba 3.0
Classification:	Unclassified
Component:	winbind
Version:	3.0.0preX
Hardware:	Other other

Importance:	P1 critical
Target Milestone:	3.0.0rc3
Assignee:	Gerald (Jerry) Carter (dead mail address)
QA Contact:

URL:
Keywords:

Depends on:
Blocks:

Reported:	2003-08-28 09:09 UTC by Marc Kaplan
Modified:	2005-08-24 10:25 UTC
CC List:	2 users

See Also:

Attachments
Fix sequence_number timeout (1.08 KB, patch) 2003-08-29 10:04 UTC, Ken Cross
Full log of Winbind in action @ level 10. (50MB when inflated) (890.80 KB, application/octet-stream) 2004-08-01 18:47 UTC, Greg Folkert

Description Marc Kaplan 2003-08-28 09:09:24 UTC

This is a showstopper IMO, since it basically causes a denial of service for 
those in ADS domains.

Email from Ken Cross:
"Samba-folk:

The nsswitch/winbindd_ads.c module uses "ads_cached_connection" to maintain
an open connection to an LDAP server.  I have found at least one instance
where that isn't working too well.

When viewing the sequence number with wbinfo --sequence, I noticed that
servers would periodically show up "DISCONNECTED", then later (usually)
they'd be OK.  The problem seems to be in using the cached connection.  The
call to ads_USN in the sequence_number routine was returning "Can't contact
LDAP server".

I added a retry in the sequence_number function (blowing away the cached
connection) and the retry always seems to succeed.

I'm speculating here, but is it really valid to leave a connection open
indefinitely?  There can be long intervals between updating sequence
numbers.

Also, if the sequence_number routine fails at times, could other routines
using ads_cached_connection (11 of them) be having problems occasionally?
If so, a more general solution needs to be considered.

Ken"
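
The retry Ken describes can be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the patch in attachment 113: `fake_conn`, `cached_connection`, `drop_cached_connection`, `query_usn`, `sequence_number_with_retry` and `SEQ_NONE` are hypothetical stand-ins for the cached LDAP connection and the ads_USN call, and the server dropping an idle connection is simulated with a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached LDAP connection. */
struct fake_conn {
    bool open;   /* do we hold a connection object at all? */
    bool stale;  /* has the server silently dropped it? */
};

#define SEQ_NONE ((uint32_t)-1)  /* assumed "no sequence number" sentinel */

static struct fake_conn cache = { false, false };
static uint32_t server_usn = 788398;  /* value borrowed from the log in this report */

/* Return the cached connection, opening a fresh one if none is cached. */
struct fake_conn *cached_connection(void)
{
    if (!cache.open) {
        cache.open = true;
        cache.stale = false;
    }
    return &cache;
}

/* Discard the cached connection so the next lookup reconnects. */
void drop_cached_connection(void)
{
    cache.open = false;
}

/* Simulate the server closing a connection that sat idle too long. */
void simulate_idle_timeout(void)
{
    cache.stale = true;
}

/* Fetch the sequence number; fails on a stale connection. */
bool query_usn(struct fake_conn *conn, uint32_t *usn)
{
    assert(conn != NULL);
    if (!conn->open || conn->stale) {
        return false;  /* analogous to "Can't contact LDAP server" */
    }
    *usn = server_usn;
    return true;
}

/* Try the cached connection once; on failure, reconnect and retry once. */
uint32_t sequence_number_with_retry(void)
{
    uint32_t usn;

    if (query_usn(cached_connection(), &usn)) {
        return usn;
    }
    drop_cached_connection();
    if (query_usn(cached_connection(), &usn)) {
        return usn;
    }
    return SEQ_NONE;
}
```

Calling `sequence_number_with_retry()` after `simulate_idle_timeout()` still yields the sequence number, because the first failure only costs one reconnect. A single retry keeps the added delay bounded when the server really is unreachable.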

So one moment I have this (LOTHLORIEN is an NT4 domain, the rest are ADS):
[root@ThunderBird lib]# wbinfo --sequence
LOTHLORIEN : 444
C-DOMAIN : 332974
B-DOMAIN : 423905
A-DOMAIN : 789953

15 minutes or so later, I have:
[root@ThunderBird lib]# wbinfo --sequence
LOTHLORIEN : 444
C-DOMAIN : DISCONNECTED
B-DOMAIN : DISCONNECTED
A-DOMAIN : DISCONNECTED

Here's a log snippet @ debug level 10:
When I first join, and I request sequence, I see:
process_request: request fn SHOW_SEQUENCE
[  861]: show sequence
refresh_sequence_number: A-DOMAIN time ok
refresh_sequence_number: A-DOMAIN seq number is now 788398


Later, once the domains show DISCONNECTED, I'm getting:
process_request: request fn SHOW_SEQUENCE
[16992]: show sequence
refresh_sequence_number: A-DOMAIN time ok
refresh_sequence_number: A-DOMAIN seq number is now -1

There's not too much other interesting info in the log about this.
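
The "-1" in the log and the "DISCONNECTED" in the wbinfo output look like the same value seen two ways: an unsigned sequence number set to an all-ones sentinel, which reads as -1 when logged as a signed integer. A sketch of that mapping, assuming a 32-bit sentinel; `SEQ_NONE` and `format_sequence` are illustrative names, not Samba's own code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SEQ_NONE ((uint32_t)-1)  /* assumed "no sequence number" sentinel */

/*
 * Format one line of sequence output into buf.
 * Returns the number of characters snprintf would have written.
 */
int format_sequence(char *buf, size_t len, const char *domain, uint32_t seq)
{
    assert(buf != NULL && domain != NULL);
    if (seq == SEQ_NONE) {
        /* The lookup failed, so there is no number to show. */
        return snprintf(buf, len, "%s : DISCONNECTED", domain);
    }
    return snprintf(buf, len, "%s : %u", domain, (unsigned int)seq);
}
```

Under this assumption, any failed refresh surfaces as DISCONNECTED even when the domain controller itself is reachable, which matches the symptoms reported here.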

Comment 1 Ken Cross 2003-08-29 10:04:09 UTC

Created attachment 113 [details]
Fix sequence_number timeout

FWIW, here's the patch I did to retry the sequence_number.  However, I am still
concerned that other routines using ads_cached_connection could be having
similar problems, but may be more subtle and harder to detect.

Comment 2 Marc Kaplan 2003-08-30 09:50:03 UTC

Yes, it's not just the sequence number that is affected. Authentication to a
Samba server in this state fails, which is a big problem, since it can't succeed
again until winbindd is restarted.

Comment 3 Gerald (Jerry) Carter (dead mail address) 2003-09-04 21:47:16 UTC

I'm checking in a version of Ken's retry patch for 
all major winbindd_ads operations.  I have yet to see a 
DISCONNECTED message, but please check.  

I think this should be fixed now.  If not, then reopen.
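
One way to apply the same retry to every operation, rather than to sequence_number alone, is a single wrapper that takes the operation as a function pointer. This is a sketch under stated assumptions, not the change that was checked in: `fake_conn`, `conn_op`, `ensure_connected`, `run_with_retry` and `count_users` are hypothetical names, and the idle drop is simulated with a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cached connection; 'stale' simulates a server-side idle drop. */
struct fake_conn {
    bool open;
    bool stale;
    int reconnects;  /* how many times a fresh connection was opened */
};

/* Any directory operation: returns true on success. */
typedef bool (*conn_op)(struct fake_conn *conn, void *arg);

/* Open the connection if needed and hand it back. */
struct fake_conn *ensure_connected(struct fake_conn *conn)
{
    assert(conn != NULL);
    if (!conn->open) {
        conn->open = true;
        conn->stale = false;
        conn->reconnects++;
    }
    return conn;
}

/* Run op; if it fails, discard the connection and run it once more. */
bool run_with_retry(struct fake_conn *conn, conn_op op, void *arg)
{
    if (op(ensure_connected(conn), arg)) {
        return true;
    }
    conn->open = false;  /* blow away the cached connection */
    return op(ensure_connected(conn), arg);
}

/* Example operation for the demo; fails if the connection is stale. */
bool count_users(struct fake_conn *conn, void *arg)
{
    if (conn->stale) {
        return false;
    }
    *(int *)arg = 6;  /* arbitrary demo value */
    return true;
}
```

With this shape, each lookup is wrapped once and a second failure is passed to the caller unchanged, so a domain controller that is really down still reports an error instead of looping.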

Comment 4 Greg Folkert 2004-08-01 18:47:32 UTC

Created attachment 592
Full log of Winbind in action @ level 10. (50MB when inflated)

This log did not capture the timeout event, but it did catch the restart and the
8 hours before it.

The winbind restart happened at 2004/07/31 22:29:01, and again at about 23:06
(for a config change).

Comment 5 Greg Folkert 2004-08-01 19:02:23 UTC

This is a major annoyance. It can be worked around relatively quickly, but as
the contracted consultant finishing this deployment I carry a bit more
"responsibility" toward the large firm I am working for.

Just a note to say this *IS* still happening, even in 3.0.5 (the security
release, not a bugfix/enhancement release candidate).

This is on RHES v3.0 Service Release 2, with MIT Kerberos updated to v1.3.4,
samba v3.0.5, pam v0.77, pam_smb v1.1.7, pam_krb5 v2.0.10, e2fsprogs v


Winbind settings in smb.conf:

        realm = NETWORKMCS.COM
        security = ADS
        auth methods = winbind, sam
        obey pam restrictions = Yes
        username level = 3
        lanman auth = No
        ntlm auth = No
        client NTLMv2 auth = Yes
        client lanman auth = No
        client plaintext auth = No
        smb ports = 445
        disable netbios = Yes
        server signing = auto
        socket options = IPTOS_LOWDELAY TCP_NODELAY

        idmap uid = 10000-40000
        idmap gid = 10000-40000
        template homedir = /lf/data/home/%D/%U
        template shell = /bin/bash
        winbind separator = +
        winbind cache time = 20
        winbind nested groups = Yes



-----SNIPPET of console----
[root@mash drop_off]# l /lf/data/home/NETWORKCCI/
total 24
drwxr-xr-x    4 10847    10216        4096 Jul 26 12:48 cbungard01_
drwxr-xr-x    4 10635    10008        4096 Jul 26 13:09 hholkey
drwxr-xr-x    4 10849    10216        4096 Jul 26 12:46 hholkey1_
drwxr-xr-x    4 10659    10008        4096 Jul 26 12:50 jmonchek
drwxr-xr-x    4 10850    10216        4096 Jul 26 12:51 jmonchek01_
drwxr-xr-x    4 10722    10008        4096 Jul 26 12:43 mstewart


[root@mash drop_off]# wbinfo -p
Ping to winbindd succeeded on fd 4

[root@mash drop_off]# wbinfo --sequence
CC3 : DISCONNECTED
NETWORKMG : 74731
NETWORKCCI : 19309
NETWORKMCS : 18305
NETWORKDMC : 1255
CCGROUPNET : DISCONNECTED


(added by greg@gregfolkert.net: Date 2004/07/31 22:29:01)
[root@mash drop_off]#  /etc/init.d/winbind  restart

Shutting down Winbind services:                            [  OK  ]
Starting Winbind services:                                 [  OK  ]


[root@mash drop_off]# wbinfo --sequence
CC3 : 1
NETWORKMG : 74731
NETWORKCCI : 19309
NETWORKMCS : 18305
NETWORKDMC : 1255
CCGROUPNET : 6524523



[root@mash drop_off]# l /lf/data/home/NETWORKCCI/
total 24
drwxr-xr-x    4 NETWORKCCI+CBUNGARD01$ NETWORKCCI+Domain Computers     4096 Jul 26 12:48 cbungard01_
drwxr-xr-x    4 NETWORKCCI+hholkey NETWORKCCI+Domain Users     4096 Jul 26 13:09 hholkey
drwxr-xr-x    4 NETWORKCCI+HHOLKEY1$ NETWORKCCI+Domain Computers     4096 Jul 26 12:46 hholkey1_
drwxr-xr-x    4 NETWORKCCI+jmonchek NETWORKCCI+Domain Users     4096 Jul 26 12:50 jmonchek
drwxr-xr-x    4 NETWORKCCI+JMONCHEK01$ NETWORKCCI+Domain Computers     4096 Jul 26 12:51 jmonchek01_
drwxr-xr-x    4 NETWORKCCI+mstewart NETWORKCCI+Domain Users     4096 Jul 26 12:43 mstewart

Comment 6 Greg Folkert 2004-08-01 19:04:45 UTC

Also, please either re-open this bug, or add one for me.

This is the deal exactly. Same as reported the first time.

Comment 7 Gerald (Jerry) Carter (dead mail address) 2005-02-07 09:06:10 UTC

Originally reported against one of the 3.0.0rc[1-4] releases.
Cleaning up non-production versions.

Comment 8 Gerald (Jerry) Carter (dead mail address) 2005-08-24 10:25:40 UTC

Sorry for the same, cleaning up the database to prevent unnecessary reopens of bugs.