Bug 4009 - Samba 3.0.23[ab] winbind leaves sockets in CLOSE_WAIT; leaks file-descriptors
Status: RESOLVED FIXED
Product: Samba 3.0
Classification: Unclassified
Component: winbind
Version: 3.0.23b
Hardware: Sparc Solaris
Importance: P3 critical
Target Milestone: none
Assignee: Guenther Deschner
QA Contact: Samba QA Contact
Duplicates: 3834
Reported: 2006-08-08 21:19 UTC by William Charles
Modified: 2007-03-14 07:22 UTC
CC List: 2 users



Attachments
do not rely on ads_do_search_retry to do the unbind after the ads_USN() (1.79 KB, patch)
2006-08-10 14:59 UTC, Guenther Deschner

Description William Charles 2006-08-08 21:19:13 UTC
I have observed what looks to be a file-descriptor leak in Samba 3.0.23[ab] 'winbindd'. LDAP sockets are being left in 'CLOSE_WAIT', and the logfile repeatedly shows the following; I suspect the two are related:

[2006/07/26 07:13:48, 1] libads/ldap_utils.c:ads_do_search_retry_internal(97)
  ads reopen failed after error Operations error
[2006/07/26 07:23:48, 1] libads/ldap_utils.c:ads_do_search_retry_internal(97)
  ads reopen failed after error Operations error
[2006/07/26 07:33:48, 1] libads/ldap_utils.c:ads_do_search_retry_internal(97)
  ads reopen failed after error Operations error
[2006/07/26 07:43:48, 1] libads/ldap_utils.c:ads_do_search_retry_internal(97)
  ads reopen failed after error Operations error
[2006/07/26 07:53:48, 1] libads/ldap_utils.c:ads_do_search_retry_internal(97)
  ads reopen failed after error Operations error
[2006/07/26 08:03:48, 1] libads/ldap_utils.c:ads_do_search_retry_internal(97)
  ads reopen failed after error Operations error
[2006/07/26 08:13:48, 1] libads/ldap_utils.c:ads_do_search_retry_internal(97)

This LDAP 'Operations error' is being returned by our Windows 2003 DC to a Solaris 8 host (I will try to confirm or rule out other UNIX platforms as soon as I can).


The following LDAP query does work:

ldapsearch -h <domain controller> -b "" -x -s base "(objectclass=*)" highestCommittedUSN


But, to quote Guenther:

Ok, I can reproduce that behaviour. Just add "-Epr=1000" (for newer OpenLDAP libs) and I get:

"result: 1 Operations error
text: 00000000: LdapErr: DSID-0C090627, comment: In order to perform this
operation a successful bind must be completed on the connection., data 0, vece"

So we should either not do that query anonymously, or not do it using
paged results (which is silly here anyway). I'll see if I can prepare a quick
fix for that.


Changing the code in 'libads/ldap.c' to not use paged results fixes the immediate issue with 'highestCommittedUSN', but other winbind LDAP queries still return an 'Operations error'; for example, 'wbinfo --domain-users' and 'wbinfo --user-info' both fail.

FYI, this has been observed using OpenLDAP 2.3.24 and 2.3.25.
Comment 1 Gerald (Jerry) Carter (dead mail address) 2006-08-09 07:05:40 UTC
Guenther still doesn't understand why you get the operations
error.  No one else is reporting this.  I'm turning it over to 
him.
Comment 2 Guenther Deschner 2006-08-10 14:59:04 UTC
Created attachment 2090 [details]
do not rely on ads_do_search_retry to do the unbind after the ads_USN()

Can you please test this patch?
Comment 3 William Charles 2006-08-10 22:00:41 UTC
OK, the patch certainly seems to have improved things: I can now query for the USN without seeing an 'Operations error'. But it's not all good; I still see errors returned for paged searches. For example, if I run 'wbinfo --domain-groups' I get this logged:

[2006/08/11 12:55:53, 3] libads/ldap.c:ads_connect(288)
  Connected to LDAP server 10.179.8.49
[2006/08/11 12:55:53, 3] libads/sasl.c:ads_sasl_spnego_bind(210)
  ads_sasl_spnego_bind: got OID=1 2 840 48018 1 2 2
[2006/08/11 12:55:53, 3] libads/sasl.c:ads_sasl_spnego_bind(210)
  ads_sasl_spnego_bind: got OID=1 2 840 113554 1 2 2
[2006/08/11 12:55:53, 3] libads/sasl.c:ads_sasl_spnego_bind(210)
  ads_sasl_spnego_bind: got OID=1 2 840 113554 1 2 2 3
[2006/08/11 12:55:53, 3] libads/sasl.c:ads_sasl_spnego_bind(210)
  ads_sasl_spnego_bind: got OID=1 3 6 1 4 1 311 2 2 10
[2006/08/11 12:55:53, 3] libads/sasl.c:ads_sasl_spnego_bind(219)
  ads_sasl_spnego_bind: got server principal name =sydeswdbp1$@XXX.YYY.ZZZ.COM
[2006/08/11 12:55:53, 3] libsmb/clikrb5.c:ads_cleanup_expired_creds(488)
  ads_cleanup_expired_creds: Ticket in ccache[MEMORY:winbind_ccache] expiration Fri, 11 Aug 2006 22:51:05 EST
[2006/08/11 12:55:53, 3] libads/ldap.c:ads_do_paged_search_args(580)
  ads_do_paged_search_args: ldap_search_with_timeout((&(objectCategory=group)(&(groupType:dn:1.2.840.113556.1.4.803:=-2147483648)(!(groupType:dn:1.2.840.113556.1.4.803:=1))))) -> Operations error


I have yet, however, to see a socket stuck in CLOSE_WAIT...
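One way to watch for the leak described in this report is to count LDAP sockets left in CLOSE_WAIT on the winbind host. The sketch below is a hypothetical helper (`count_ldap_close_wait` is not part of Samba or of any OS tool). It assumes the DC listens on the standard LDAP port 389 and that `netstat -an` prints the TCP state as the last column, which is typical of both Solaris (`addr.port` notation) and Linux (`addr:port` notation) output.

```shell
# Hypothetical helper: count TCP sockets in CLOSE_WAIT that involve port 389.
# Reads `netstat -an` output on stdin. Assumes the state is the last field
# and that ports are written as addr.port (Solaris) or addr:port (Linux).
count_ldap_close_wait() {
    awk '$NF == "CLOSE_WAIT" && /[.:]389[ \t]/ { n++ } END { print n + 0 }'
}
```

For example, running `netstat -an | count_ldap_close_wait` periodically; a count that keeps growing while winbindd runs would point at the leak. To see the descriptors held by winbindd itself, `pfiles <pid>` on Solaris or `ls /proc/<pid>/fd` on Linux can be used.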
Comment 4 William Charles 2006-08-11 00:57:24 UTC
FYI, I have just managed to compile and briefly test 3.0.23b on SuSE Linux (SLES8), and as far as I can tell I'm getting the exact same LDAP 'Operations error' symptoms previously observed on Solaris. I'll try your patch on this platform too...
Comment 5 William Charles 2006-08-28 19:30:51 UTC
A quick Google search turned up this, which looks similar and specifically describes a change Microsoft made in Windows 2003 SP1 domain controllers:

http://blog.joeware.net/2006/03/15/259/

To quote a response from Microsoft:

This is not a bug, this is intentional. We do not allow almost all LDAP controls to be used before you bind (the only control we do allow is for extended DNs). It has nothing to do with you querying RootDSE, only to do with the fact that you are doing it before you have authenticated. So if you auth, then you can do this.
We did this first in SP1.
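Microsoft's explanation suggests two ways to read highestCommittedUSN from an SP1 DC: anonymously with no controls at all, or with the paged-results control only after binding. The sketch below wraps the `ldapsearch` command from the description in a hypothetical shell function; `usn_query` and the `LDAPSEARCH` override are illustrative and not part of Samba or OpenLDAP. It assumes OpenLDAP 2.3's `ldapsearch`, where `-E pr=1000` requests paged results (the same control Guenther used as `-Epr=1000`), and treats the bind DN as a placeholder; `-W` prompts for its password.

```shell
# Hypothetical wrapper: usn_query DC [BINDDN]
# Without BINDDN: anonymous base search of the RootDSE, no LDAP controls.
# With BINDDN: simple bind first, then the paged-results control is allowed.
usn_query() {
    dc=$1
    if [ -n "${2:-}" ]; then
        # Bound: per Microsoft, SP1 DCs accept LDAP controls after a bind.
        set -- -x -D "$2" -W -E pr=1000
    else
        # Anonymous: send only the plain base search, no paged-results control.
        set -- -x
    fi
    "${LDAPSEARCH:-ldapsearch}" -h "$dc" -b "" -s base "$@" \
        "(objectclass=*)" highestCommittedUSN
}
```

`usn_query <domain controller>` runs the anonymous variant; `usn_query <domain controller> user@REALM` binds first. Setting `LDAPSEARCH=echo` prints the command line instead of running it, which allows checking the function without a DC.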
Comment 6 Guenther Deschner 2007-02-01 09:11:25 UTC
Finally got that reproduced. Very bad, working on it.
Comment 7 Guenther Deschner 2007-02-08 11:05:50 UTC
Fixed with r21240.

Please reopen if you still see an issue.
Comment 8 Guenther Deschner 2007-02-09 17:46:16 UTC
*** Bug 3834 has been marked as a duplicate of this bug. ***