Bug 6279 - Winbind daily crashes, attached patch to fix bug
Summary: Winbind daily crashes, attached patch to fix bug
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.2
Classification: Unclassified
Component: Winbind (show other bugs)
Version: 3.2.11
Hardware: Other Linux
: P3 normal
Target Milestone: ---
Assignee: Samba Bugzilla Account
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-04-20 08:35 UTC by Francis Brosnan Blázquez
Modified: 2009-04-23 02:24 UTC (History)
1 user (show)

See Also:


Attachments
Possible patch. (351 bytes, patch)
2009-04-20 09:43 UTC, Jeremy Allison
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Francis Brosnan Blázquez 2009-04-20 08:35:54 UTC
Using winbind component to retrieve users from an ADS server causes winbind to daily crash. We have bypassed the problem by restarting the server at night. However, this does not solve the issue since it is usually required an another restart.

Attached it is the backtrace of the winbind server:
winbindd: error.c:272: ldap_parse_result: Assertion `r != ((void *)0)' failed.
[2009/04/05 02:11:01,  0] lib/fault.c:fault_report(40)
  ===============================================================
[2009/04/05 02:11:01,  0] lib/fault.c:fault_report(41)
  INTERNAL ERROR: Signal 6 in pid 20777 (3.2.5)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2009/04/05 02:11:01,  0] lib/fault.c:fault_report(43)
  
  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2009/04/05 02:11:01,  0] lib/fault.c:fault_report(44)
  ===============================================================
[2009/04/05 02:11:01,  0] lib/util.c:smb_panic(1663)
  PANIC (pid 20777): internal error
[2009/04/05 02:11:01,  0] lib/util.c:log_stack_trace(1767)
  BACKTRACE: 25 stack frames:
   #0 /usr/sbin/winbindd(log_stack_trace+0x1c) [0x4e1cbd]
   #1 /usr/sbin/winbindd(smb_panic+0x5b) [0x4e1dcb]
   #2 /usr/sbin/winbindd [0x4d1064]
   #3 /lib/libc.so.6 [0x7fd2c5d61f60]
   #4 /lib/libc.so.6(gsignal+0x35) [0x7fd2c5d61ed5]
   #5 /lib/libc.so.6(abort+0x183) [0x7fd2c5d633f3]
   #6 /lib/libc.so.6(__assert_fail+0xe9) [0x7fd2c5d5adc9]
   #7 /usr/lib/libldap_r-2.4.so.2 [0x7fd2c66b690d]
   #8 /usr/sbin/winbindd [0x5fa099]
   #9 /usr/sbin/winbindd(ads_do_search_all_args+0x87) [0x5fa4a9]
   #10 /usr/sbin/winbindd [0x5ffdc4]
   #11 /usr/sbin/winbindd(ads_do_search_retry+0x13) [0x600808]
   #12 /usr/sbin/winbindd(ads_search_retry+0x1e) [0x600851]
   #13 /usr/sbin/winbindd [0x478838]
   #14 /usr/sbin/winbindd [0x4638ea]
   #15 /usr/sbin/winbindd(winbindd_dual_list_users+0x6e) [0x47eb4b]
   #16 /usr/sbin/winbindd [0x47be1f]
   #17 /usr/sbin/winbindd [0x47bfcd]
   #18 /usr/sbin/winbindd(async_request+0x18a) [0x47d133]
   #19 /usr/sbin/winbindd(async_domain_request+0x49) [0x47d27d]
   #20 /usr/sbin/winbindd [0x45b018]
   #21 /usr/sbin/winbindd(rescan_trusted_domains+0x46) [0x45b358]
   #22 /usr/sbin/winbindd(main+0xc43) [0x452595]
   #23 /lib/libc.so.6(__libc_start_main+0xe6) [0x7fd2c5d4e1a6]
   #24 /usr/sbin/winbindd [0x4504f9]

The problem was first reported at:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522907

The problem is located at ads_do_paged_search_args (samba-3.2.11/source/libads/ldap.c:811):

   	if (rc) {
		DEBUG(3,("ads_do_paged_search_args: 
                 ldap_search_with_timeout(%s) -> %s\n", expr,
			 ldap_err2string(rc)));
		goto done;
	}
	rc = ldap_parse_result(ads->ldap.ld, *res, NULL, NULL, NULL,
					NULL, &rcontrols,  0);

This condition is not complete because it does not check res variable to be not NULL (returned by previous ldap_search_with_timeout), before calling to ldap_parse_result function.

The former function asserts if the result to be parsed is a null reference causing winbind instances to crash. 

The obvious solution is to improve the check to:

       	if (rc || res == NULL) {

Looking at the problem it is found that ldap_search_ext_s function may return a NULL LDAPMessage reference (res for our case) especially under timeout scenarios (which returns 0). This causes the expression "if (rc) {" to evaluate as proper result, causing a call to ldap_parse_result with a NULL res (LDAPMessage) at ldap.c:820.
Comment 1 Jeremy Allison 2009-04-20 09:27:08 UTC
If ldap_search_ext_s() can return LDAP_SUCCESS with a NULL *res then we're broken in many more ways than just this specific problem. What LDAP libraries are you using that can return LDAP_SUCCESS without a result pointer ?
Jeremy.
Comment 2 Jeremy Allison 2009-04-20 09:39:53 UTC
FYI. I've got a query into the OpenLDAP people to try and find out if ldap_search_ext_s() can return LDAP_SUCCESS with a null result pointer. If the server timed out it should be returning LDAP_TIMELIMIT_EXCEEDED. Why would it return LDAP_SUCCESS in this case (which is what you're claiming) ?
Jeremy.
Comment 3 Jeremy Allison 2009-04-20 09:43:22 UTC
Created attachment 4075 [details]
Possible patch.

Can you check this patch to see if it fixes the issue for you. This looks to be a better place to add the check, so ldap_search_with_timeout() always returns LDAP_TIMELIMIT_EXCEEDED in the error case.
Jeremy.
Comment 4 Francis Brosnan Blázquez 2009-04-20 10:53:46 UTC
Hi Jeremy,

Thanks for taking your time. 

>> What LDAP libraries are you using that can return LDAP_SUCCESS without 
>> a result pointer ?

We are using stable openldap debian version (libldap-2.4-2 2.4.11-1).

>> Why would it return LDAP_SUCCESS in this case (which is what you're 
>> claiming) ?

Ok, maybe it is a bug only applicable to openldap. It seems openldap is returning 0 (LDAP_SUCCESS were it is not) and secondly ldap_parse_result do an assert in the case of NULL pointer received (which is an unacceptable/fanatic behaviour for a library). Here is the relevant piece of code of ldap_search_ext_s from the debian version we are using:

[...]openldap_2.4.11.orig/libraries/libldap/search.c:140

	rc = ldap_result( ld, msgid, LDAP_MSG_ALL, timeout, res );

	if( rc <= 0 ) {
		/* error(-1) or timeout(0) */
		return( ld->ld_errno );
	}
[...]

As it is showed in the code, under timeout condition,  is returned ld->ld_errno found which may be not configured to a other value than 0 (in the case of error). I think this is what is causing the bug. 

Comment 5 Jeremy Allison 2009-04-20 10:56:24 UTC
So this does look like a bug in the openldap libraries. They should never return LDAP_SUCCESS (0) in the case of a timeout.

Can you confirm that the proposed patch I posted works to fix this ? If so I'll add it to all active Samba branches.

Thanks,

Jeremy.
Comment 6 Francis Brosnan Blázquez 2009-04-20 11:00:13 UTC
Hi Jeremy,

Ok, I'll forward your patch to debian samba maintainers. At this moment I have no other machine to test this configuration (build sources, etc). But taking the patch you provide, it is clear that returned NULL values won't cause problems anymore (no matter which value contains rc).

Cheers!
Comment 7 Jeremy Allison 2009-04-21 02:35:39 UTC
Any feedback on this bug ? If it's a real fix I'd like to get it into the source trees asap.

Jeremy.
Comment 8 Debian samba package maintainers (PUBLIC MAILING LIST) 2009-04-21 02:41:03 UTC
Jeremy,

I just asked Francis (in the relevant Debian bug log) to check with his setup if your patch fixes the problem as I think he's the best placed to do this. One should note that he was initially reporting daily crashes so it might take him at least one day to try reproducing the bug or not (plus the time to recompile packages and/or binaries).

For Debian packages, I'm considering your patch for inclusion in our stable release (ie patching our 3.2.5) as that seems important enough for being worth fixing.

As you mentioned that you were surprised by the behaviour of openldap here, do you think it would be worth a bug report to OpenLDAP as well?

--
Christian Perrier (from SambaXP conference hall)
Comment 9 Debian samba package maintainers (PUBLIC MAILING LIST) 2009-04-21 04:53:54 UTC
Francis built deb packages and installed them on his production server.

As of now, the last crash was yesterday:
20-04-2009 - 12:41:01: winbind check failed (error code=1), restarting..

but that was before applying the patch..:)

As francis said, we now need to wait for 1 or 2 days to check whether crashes are still experienced
Comment 10 Francis Brosnan Blázquez 2009-04-22 03:38:37 UTC
Hi,

Since yesterday (11:41:01) until now (10:36:00) we got winbind running without problems (nearly 23 hours without crashes). I think the patch fixes the problem. 

Maybe we should wait at least one day more to really confirm but having the data exposed by the bug report I find pretty obvious the bug is now solved.

Cheers!
Comment 11 Debian samba package maintainers (PUBLIC MAILING LIST) 2009-04-22 04:52:08 UTC
As Karolin is preparing 3.3.4 for...today, supposedly, it would be nice if it ends up in there.

I'll try to prod Jeremy about pushing it. I think that Francis not having winbind crashes anymore is probably enough for pushing the patch. After all, that's a very simple one.
Comment 12 Jeremy Allison 2009-04-22 05:14:58 UTC
Pushed to all branches.
Jeremy.
Comment 13 Francis Brosnan Blázquez 2009-04-23 02:24:51 UTC
Hi All,

Just to confirm definititely that winbind is running since two days ago without problems. 

Cheers!