(Note: this may be same as #3545 but I am not sure. I have more information however). Problem ======= I run a stable Debian Sarge installation but using Samba 3.0.22. Using LDAP as backend. I have recently migrated from NT PDC to Samba PDC. But I am experiencing problems with samba hangs which causes the entire network (NT) to hang. The problem is intermittent. But I can now reproduce it by issuing smbclient commands more or less simultaneously from two processes. The problem is causing severe problems when it occures. How I reproduce =============== By doing the following commands I can reproduce the fault every time. /etc/init.d/samba restart # Test that all is ok smbclient -L presul1a -Uid%pw # In separate shell start a contious loop while true do; smbclient -L presul1a -Uid%pw; done # Trigg the hanging by doing one or a few concurrently # Doing this will hang smbd smbclient -L presul1a -Uid%pw # smbd not completely dead it spawns new smbd processes # and can work annonumusly like this smbclient -L presul1a -U% How to get it up running again ============================== * Do a kill -6 on all PIDS except for the two initially spawned processes. * /etc/init.d/slapd restart The kill -6 does not work all the time, I think it is due to that there is a lot of connections tried all the time and the same fault occurs again directly once the fault situation has been cleared. Strange side effect of slapd restart ------------------------------------ The restart of ldap server has a very interresting side effect; I cannot reproduce the fault after having done this! In fact, this procedure seems to prevent the fault in my tests: /etc/init.d/samba restart smbclient -L presul1a -Uid%pw /etc/init.d/slapd restart # In separate shell start a contious loop while true do; smbclient -L presul1a -Uid%pw; done # Seems to not cause the fault this time smbclient -L presul1a -Uid%pw More in depth logs and analysis =============================== When doing a single command like this smbclient -L presul1a -Uid%pw then I get a trace that looks like this (part of it): ------------- [2006/05/04 19:00:46, 3] smbd/sec_ctx.c:set_sec_ctx(288) setting sec ctx (0, 0) - sec_ctx_stack_ndx = 1 [2006/05/04 19:00:46, 5] auth/auth_util.c:debug_nt_user_token(433) NT user token: (NULL) [2006/05/04 19:00:46, 5] auth/auth_util.c:debug_unix_user_token(454) UNIX token of user 0 Primary group is 0 and contains 0 supplementary groups [2006/05/04 19:00:46, 10] lib/system_smbd.c:sys_getgrouplist(167) sys_getgrouplist: user [petos] [2006/05/04 19:00:46, 10] lib/system_smbd.c:sys_getgrouplist(176) sys_getgrouplist(): disabled winbindd for group lookup [user == petos] [2006/05/04 19:00:46, 3] smbd/sec_ctx.c:push_sec_ctx(256) push_sec_ctx(0, 0) : sec_ctx_stack_ndx = 2 [2006/05/04 19:00:46, 3] smbd/uid.c:push_conn_ctx(393) push_conn_ctx(0) : conn_ctx_stack_ndx = 1 [2006/05/04 19:00:46, 3] smbd/sec_ctx.c:set_sec_ctx(288) setting sec ctx (0, 0) - sec_ctx_stack_ndx = 2 [2006/05/04 19:00:46, 5] auth/auth_util.c:debug_nt_user_token(433) NT user token: (NULL) [2006/05/04 19:00:46, 5] auth/auth_util.c:debug_unix_user_token(454) UNIX token of user 0 Primary group is 0 and contains 0 supplementary groups [2006/05/04 19:00:46, 8] lib/system_smbd.c:remove_duplicate_gids(49) remove_duplicate_gids: Enter 15 gids [2006/05/04 19:00:46, 8] lib/system_smbd.c:remove_duplicate_gids(67) remove_duplicate_gids: Exit 14 gids [2006/05/04 19:00:46, 3] smbd/sec_ctx.c:pop_sec_ctx(386) pop_sec_ctx (0, 0) - sec_ctx_stack_ndx = 1 [2006/05/04 19:00:46, 3] smbd/sec_ctx.c:push_sec_ctx(256) push_sec_ctx(0, 0) : sec_ctx_stack_ndx = 2 [2006/05/04 19:00:46, 3] smbd/uid.c:push_conn_ctx(393) ------------- When the fault occures (hanging) I get this once I also do a pkill -6 smbd: ------------- [2006/05/04 18:20:28, 3] smbd/sec_ctx.c:set_sec_ctx(288) setting sec ctx (0, 0) - sec_ctx_stack_ndx = 1 [2006/05/04 18:20:28, 5] auth/auth_util.c:debug_nt_user_token(433) NT user token: (NULL) [2006/05/04 18:20:28, 5] auth/auth_util.c:debug_unix_user_token(454) UNIX token of user 0 Primary group is 0 and contains 0 supplementary groups [2006/05/04 18:20:28, 10] lib/system_smbd.c:sys_getgrouplist(167) sys_getgrouplist: user [petos] [2006/05/04 18:20:28, 10] lib/system_smbd.c:sys_getgrouplist(176) sys_getgrouplist(): disabled winbindd for group lookup [user == petos] [2006/05/04 18:20:28, 3] smbd/sec_ctx.c:push_sec_ctx(256) push_sec_ctx(0, 0) : sec_ctx_stack_ndx = 2 [2006/05/04 18:20:28, 3] smbd/uid.c:push_conn_ctx(393) push_conn_ctx(0) : conn_ctx_stack_ndx = 1 [2006/05/04 18:20:28, 3] smbd/sec_ctx.c:set_sec_ctx(288) setting sec ctx (0, 0) - sec_ctx_stack_ndx = 2 [2006/05/04 18:20:28, 5] auth/auth_util.c:debug_nt_user_token(433) NT user token: (NULL) [2006/05/04 18:20:28, 5] auth/auth_util.c:debug_unix_user_token(454) UNIX token of user 0 Primary group is 0 and contains 0 supplementary groups [2006/05/04 18:20:39, 0] lib/fault.c:fault_report(36) =============================================================== [2006/05/04 18:20:39, 0] lib/fault.c:fault_report(37) INTERNAL ERROR: Signal 6 in pid 15115 (3.0.22) Please read the Trouble-Shooting section of the Samba3-HOWTO [2006/05/04 18:20:39, 0] lib/fault.c:fault_report(39) From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf [2006/05/04 18:20:39, 0] lib/fault.c:fault_report(40) =============================================================== [2006/05/04 18:20:39, 0] lib/util.c:smb_panic2(1554) PANIC: internal error [2006/05/04 18:20:39, 0] lib/util.c:smb_panic2(1562) BACKTRACE: 42 stack frames: #0 /usr/sbin/smbd(smb_panic2+0x111) [0x8212181] #1 /usr/sbin/smbd(smb_panic+0x1a) [0x821206a] #2 /usr/sbin/smbd [0x81fea14] #3 [0xffffe420] #4 /usr/lib/libldap_r.so.2 [0x4002acef] #5 /lib/libnss_ldap.so.2 [0x40745f8a] #6 /lib/libnss_ldap.so.2 [0x407465b2] #7 /lib/libnss_ldap.so.2 [0x40747093] #8 /lib/libnss_ldap.so.2 [0x407494f5] #9 /lib/libnss_ldap.so.2 [0x4074936a] #10 /lib/libnss_ldap.so.2 [0x40746552] #11 /lib/libnss_ldap.so.2 [0x40747093] #12 /lib/libnss_ldap.so.2(_nss_ldap_initgroups_dyn+0x14c) [0x407496cc] #13 /lib/tls/libc.so.6 [0x4022eb54] #14 /lib/tls/libc.so.6(initgroups+0x6b) [0x4022ea4b] #15 /usr/sbin/smbd [0x81f0bb4] #16 /usr/sbin/smbd [0x81f0d84] #17 /usr/sbin/smbd [0x81f0ed6] #18 /usr/sbin/smbd(pdb_default_enum_group_memberships+0x29) [0x81f1039] #19 /usr/sbin/smbd [0x81e1cf5] #20 /usr/sbin/smbd [0x81d4134] #21 /usr/sbin/smbd(pdb_enum_group_memberships+0x4e) [0x81d5e7e] #22 /usr/sbin/smbd [0x824f202] #23 /usr/sbin/smbd [0x824f35e] #24 /usr/sbin/smbd(make_server_info_sam+0x140) [0x824f650] #25 /usr/sbin/smbd [0x824a5f4] #26 /usr/sbin/smbd [0x824a8f1] #27 /usr/sbin/smbd [0x8247d9f] #28 /usr/sbin/smbd [0x8251a30] #29 /usr/sbin/smbd [0x8119d81] #30 /usr/sbin/smbd(ntlmssp_update+0x19b) [0x81192bb] #31 /usr/sbin/smbd(auth_ntlmssp_update+0x3d) [0x8251d5d] #32 /usr/sbin/smbd [0x80b7446] #33 /usr/sbin/smbd [0x80b78e6] #34 /usr/sbin/smbd(reply_sesssetup_and_X+0xd8e) [0x80b89fe] #35 /usr/sbin/smbd [0x80e0d18] #36 /usr/sbin/smbd [0x80e0f60] #37 /usr/sbin/smbd(process_smb+0x8c) [0x80e116c] #38 /usr/sbin/smbd(smbd_process+0x155) [0x80e2055] #39 /usr/sbin/smbd(main+0x563) [0x82a03d3] #40 /lib/tls/libc.so.6(__libc_start_main+0xf4) [0x4019d974] #41 /usr/sbin/smbd [0x807ee91] ----------------- Comparing the two logs shows that they are identical up to the point of panic (caused by the kill -6). The log from the fully functioning invocation continues like this: [2006/05/04 19:00:46, 8] lib/system_smbd.c:remove_duplicate_gids(49) remove_duplicate_gids: Enter 15 gids Together with surrounding the log entries and study of the code one may thus conclude the following functions is involved: sys_getgrouplist() getgrouplist_internals() initgroups() remove_duplicate_gids() The call to initgroups() is apperently never returning for some reason - the process hangs. The backtrace when issuing a kill -6 shows that we indeed are involved in the initgroups() call which takes the path via libnss_ldap.so to do its work. As we know, doing a slapd restart resolves the problem for us, but then also killing the smbd processes involved in the hanging. Also the peculiar side effect that the problem is not (or very hard) to reproduce once slapd is restarted and samba is kept running. Can it for this reason be that different forked smbd processes shares a common file handle to slapd and after the slapd restart each process gets its own handle....??? regards Petter
I think I have located the bug to the nss_ldap package (file libnss_ldap-2.3.2.so). I have registered in BugZilla at www.padl.com with #271 for this issue. I do not know if the fault is limited to debian/sarge. The bug is causes the smbd child processes to share the parents (and thus each others) nss_ldap socket to the LDAP server. And thus when two smbd child processes is forked close to each other the fault is likely to happen, making the clients to mess up the connection to the ldap server. The bug is that ldap-nss.c should have registered fork hooks via the pthread_atfork() function and close the child's socket after the fork. The reason for this function is not called is due to a week reference to the function, __pthread_atfork(), is used instead. But at least on debian/sarge this weak reference is NULL unless some code actually made a reference to the pthread_atfork() symbol. And thus no fork hooks is registered. This fault also explans the numerous (every smbd fork) log entries: smbd[18358]: nss_ldap: reconnecting to LDAP server... smbd[18358]: nss_ldap: reconnected to LDAP server after 1 attempt(s) Since the message implies exactly that confusion has occured, a search has been attempted but failed. A new reopen of conection succeeds. Possible workarounds: * Restart slapd after samba has been started. This closes the ldap connection in the master smbd process so that clients cannot share the socket, * or rebuild the libnss_ldap source package, like this (e.g:) LIBS=-lpthread apt-src build libnss-ldap This makes the macro HAVE_PTHREAD_ATFORK to be defined and then ldap-nss.c will explicitly call pthread_atfork(). This fix also takes care of the nss-ldap log entries!
closing since this is PADL's bug.