Using winbind component to retrieve users from an ADS server causes winbind to daily crash. We have bypassed the problem by restarting the server at night. However, this does not solve the issue since it is usually required an another restart. Attached it is the backtrace of the winbind server: winbindd: error.c:272: ldap_parse_result: Assertion `r != ((void *)0)' failed. [2009/04/05 02:11:01, 0] lib/fault.c:fault_report(40) =============================================================== [2009/04/05 02:11:01, 0] lib/fault.c:fault_report(41) INTERNAL ERROR: Signal 6 in pid 20777 (3.2.5) Please read the Trouble-Shooting section of the Samba3-HOWTO [2009/04/05 02:11:01, 0] lib/fault.c:fault_report(43) From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf [2009/04/05 02:11:01, 0] lib/fault.c:fault_report(44) =============================================================== [2009/04/05 02:11:01, 0] lib/util.c:smb_panic(1663) PANIC (pid 20777): internal error [2009/04/05 02:11:01, 0] lib/util.c:log_stack_trace(1767) BACKTRACE: 25 stack frames: #0 /usr/sbin/winbindd(log_stack_trace+0x1c) [0x4e1cbd] #1 /usr/sbin/winbindd(smb_panic+0x5b) [0x4e1dcb] #2 /usr/sbin/winbindd [0x4d1064] #3 /lib/libc.so.6 [0x7fd2c5d61f60] #4 /lib/libc.so.6(gsignal+0x35) [0x7fd2c5d61ed5] #5 /lib/libc.so.6(abort+0x183) [0x7fd2c5d633f3] #6 /lib/libc.so.6(__assert_fail+0xe9) [0x7fd2c5d5adc9] #7 /usr/lib/libldap_r-2.4.so.2 [0x7fd2c66b690d] #8 /usr/sbin/winbindd [0x5fa099] #9 /usr/sbin/winbindd(ads_do_search_all_args+0x87) [0x5fa4a9] #10 /usr/sbin/winbindd [0x5ffdc4] #11 /usr/sbin/winbindd(ads_do_search_retry+0x13) [0x600808] #12 /usr/sbin/winbindd(ads_search_retry+0x1e) [0x600851] #13 /usr/sbin/winbindd [0x478838] #14 /usr/sbin/winbindd [0x4638ea] #15 /usr/sbin/winbindd(winbindd_dual_list_users+0x6e) [0x47eb4b] #16 /usr/sbin/winbindd [0x47be1f] #17 /usr/sbin/winbindd [0x47bfcd] #18 /usr/sbin/winbindd(async_request+0x18a) [0x47d133] #19 /usr/sbin/winbindd(async_domain_request+0x49) [0x47d27d] #20 /usr/sbin/winbindd [0x45b018] #21 /usr/sbin/winbindd(rescan_trusted_domains+0x46) [0x45b358] #22 /usr/sbin/winbindd(main+0xc43) [0x452595] #23 /lib/libc.so.6(__libc_start_main+0xe6) [0x7fd2c5d4e1a6] #24 /usr/sbin/winbindd [0x4504f9] The problem was first reported at: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522907 The problem is located at ads_do_paged_search_args (samba-3.2.11/source/libads/ldap.c:811): if (rc) { DEBUG(3,("ads_do_paged_search_args: ldap_search_with_timeout(%s) -> %s\n", expr, ldap_err2string(rc))); goto done; } rc = ldap_parse_result(ads->ldap.ld, *res, NULL, NULL, NULL, NULL, &rcontrols, 0); This condition is not complete because it does not check res variable to be not NULL (returned by previous ldap_search_with_timeout), before calling to ldap_parse_result function. The former function asserts if the result to be parsed is a null reference causing winbind instances to crash. The obvious solution is to improve the check to: if (rc || res == NULL) { Looking at the problem it is found that ldap_search_ext_s function may return a NULL LDAPMessage reference (res for our case) especially under timeout scenarios (which returns 0). This causes the expression "if (rc) {" to evaluate as proper result, causing a call to ldap_parse_result with a NULL res (LDAPMessage) at ldap.c:820.
If ldap_search_ext_s() can return LDAP_SUCCESS with a NULL *res then we're broken in many more ways than just this specific problem. What LDAP libraries are you using that can return LDAP_SUCCESS without a result pointer ? Jeremy.
FYI. I've got a query into the OpenLDAP people to try and find out if ldap_search_ext_s() can return LDAP_SUCCESS with a null result pointer. If the server timed out it should be returning LDAP_TIMELIMIT_EXCEEDED. Why would it return LDAP_SUCCESS in this case (which is what you're claiming) ? Jeremy.
Created attachment 4075 [details] Possible patch. Can you check this patch to see if it fixes the issue for you. This looks to be a better place to add the check, so ldap_search_with_timeout() always returns LDAP_TIMELIMIT_EXCEEDED in the error case. Jeremy.
Hi Jeremy, Thanks for taking your time. >> What LDAP libraries are you using that can return LDAP_SUCCESS without >> a result pointer ? We are using stable openldap debian version (libldap-2.4-2 2.4.11-1). >> Why would it return LDAP_SUCCESS in this case (which is what you're >> claiming) ? Ok, maybe it is a bug only applicable to openldap. It seems openldap is returning 0 (LDAP_SUCCESS were it is not) and secondly ldap_parse_result do an assert in the case of NULL pointer received (which is an unacceptable/fanatic behaviour for a library). Here is the relevant piece of code of ldap_search_ext_s from the debian version we are using: [...]openldap_2.4.11.orig/libraries/libldap/search.c:140 rc = ldap_result( ld, msgid, LDAP_MSG_ALL, timeout, res ); if( rc <= 0 ) { /* error(-1) or timeout(0) */ return( ld->ld_errno ); } [...] As it is showed in the code, under timeout condition, is returned ld->ld_errno found which may be not configured to a other value than 0 (in the case of error). I think this is what is causing the bug.
So this does look like a bug in the openldap libraries. They should never return LDAP_SUCCESS (0) in the case of a timeout. Can you confirm that the proposed patch I posted works to fix this ? If so I'll add it to all active Samba branches. Thanks, Jeremy.
Hi Jeremy, Ok, I'll forward your patch to debian samba maintainers. At this moment I have no other machine to test this configuration (build sources, etc). But taking the patch you provide, it is clear that returned NULL values won't cause problems anymore (no matter which value contains rc). Cheers!
Any feedback on this bug ? If it's a real fix I'd like to get it into the source trees asap. Jeremy.
Jeremy, I just asked Francis (in the relevant Debian bug log) to check with his setup if your patch fixes the problem as I think he's the best placed to do this. One should note that he was initially reporting daily crashes so it might take him at least one day to try reproducing the bug or not (plus the time to recompile packages and/or binaries). For Debian packages, I'm considering your patch for inclusion in our stable release (ie patching our 3.2.5) as that seems important enough for being worth fixing. As you mentioned that you were surprised by the behaviour of openldap here, do you think it would be worth a bug report to OpenLDAP as well? -- Christian Perrier (from SambaXP conference hall)
Francis built deb packages and installed them on his production server. As of now, the last crash was yesterday: 20-04-2009 - 12:41:01: winbind check failed (error code=1), restarting.. but that was before applying the patch..:) As francis said, we now need to wait for 1 or 2 days to check whether crashes are still experienced
Hi, Since yesterday (11:41:01) until now (10:36:00) we got winbind running without problems (nearly 23 hours without crashes). I think the patch fixes the problem. Maybe we should wait at least one day more to really confirm but having the data exposed by the bug report I find pretty obvious the bug is now solved. Cheers!
As Karolin is preparing 3.3.4 for...today, supposedly, it would be nice if it ends up in there. I'll try to prod Jeremy about pushing it. I think that Francis not having winbind crashes anymore is probably enough for pushing the patch. After all, that's a very simple one.
Pushed to all branches. Jeremy.
Hi All, Just to confirm definititely that winbind is running since two days ago without problems. Cheers!