Hi, I have the following problem at a customer site: Symptoms: - I can do authentification, but logging onto a share hangs it, until timeout - I have 188 open smbd processes, with only 5-10 clients logged in - Almost every smbd process hangs in fcntl in tdb_brlock (waiting for a lock to disappear?) - approximatly 300 locks are assigned (attaching smbstatus output) - Happens every day or every other day. - Tracing the main daemon shows it runs out of new smbd processes to spawn (as far as I understand it) - Killing all smbd processes fixes it for a day or two. On child (with the lowest pid) hangs differently: #0 0xb7eaf189 in pthread_setcanceltype () from /lib/tls/libc.so.6 #1 0xb7e528a8 in adjtime () from /lib/tls/libc.so.6 #2 0xb7e5335e in tzset () from /lib/tls/libc.so.6 #3 0xb7e58406 in strftime_l () from /lib/tls/libc.so.6 #4 0xb7e5835f in strftime () from /lib/tls/libc.so.6 #5 0x081adf25 in timestring (hires=0) at lib/time.c:722 #6 0x081a9484 in dbghdr (level=0, file=0x8270f80 "lib/fault.c", func=0x8270f73 "fault_report", line=36) at lib/debug.c:947 #7 0x081aa34e in sig_fault (sig=6) at lib/fault.c:36 #8 <signal handler called> #9 0xb7df97c7 in raise () from /lib/tls/libc.so.6 #10 0xb7dfb06b in abort () from /lib/tls/libc.so.6 #11 0xb7e30545 in __fsetlocking () from /lib/tls/libc.so.6 #12 0xb7e36b97 in malloc_usable_size () from /lib/tls/libc.so.6 #13 0xb7e37032 in free () from /lib/tls/libc.so.6 #14 0x0817f5b1 in get_share_modes (conn=0x8342958, dev=2050, inode=2622823, pp_shares=0xbfcc30f4) at locking/locking.c:523 #15 0x080b7c01 in open_mode_check (conn=0x8342958, fname=0xbfcc3660 "Ambrasoft/Tafel Totaal/TSchlMenu.exe", dev=2050, inode=2622823, desired_access=<error type>, share_mode=64, p_flags=0xbfcc31d0, p_oplock_request=0xbfcc320c, p_all_current_opens_are_level_II=0xbfcc31cc) at smbd/open.c:611 #16 0x080b964f in open_file_shared1 (conn=0x8342958, fname=0xbfcc3660 "Ambrasoft/Tafel Totaal/TSchlMenu.exe", psbuf=0xbfcc3a60, desired_access=<error type>, share_mode=64, ofun=1, new_dos_mode=<error type>, oplock_request=3, Access=0xbfcc3ad0, paction=0xbfcc3acc) at smbd/open.c:1181 #17 0x08096fae in reply_ntcreate_and_X (conn=0x8342958, inbuf=0xb7afd008 "", outbuf=0xb79f5008 "", length=164, bufsize=131072) at smbd/nttrans.c:844 #18 0x080cb324 in switch_message (type=0, inbuf=0xb7afd008 "", outbuf=0xb79f5008 "", size=164, bufsize=131072) at smbd/process.c:968 #19 0x080cb6e1 in process_smb (inbuf=0xb7afd008 "", outbuf=0xb79f5008 "") at smbd/process.c:998 #20 0x080cc3e3 in smbd_process () at smbd/process.c:1558 #21 0x0821e473 in main (argc=2, argv=0xbfcc4ff4) at smbd/server.c:951 Most other childs: #0 0xb7e9255e in fcntl () from /lib/tls/libc.so.6 #1 0x081cd4bf in tdb_brlock (tdb=0x8324980, offset=<error type>, rw_type=0, lck_type=14, probe=0) at tdb/tdb.c:253 #2 0x081cd68b in tdb_lock (tdb=0x8324980, list=106, ltype=0) at tdb/tdb.c:307 #3 0x081ce551 in tdb_find_lock_hash (tdb=0x8324980, key={dptr = 0x82a39a0 "DOE", dsize = <error type>}, hash=<error type>, locktype=0, rec=0xbfcc2c24) at tdb/tdb.c:1045 #4 0x081ce660 in tdb_fetch (tdb=0x8324980, key={dptr = 0x82a39a0 "DOE", dsize = <error type>}) at tdb/tdb.c:1127 #5 0x0818081d in get_deferred_opens (conn=0x832da10, dev=2049, inode=1096225, pp_de_entries=0xbfcc3114) at locking/locking.c:975 #6 0x080b8d4d in delete_defered_open_entry_record (conn=0x832da10, dev=2049, inode=1096225) at smbd/open.c:803 #7 0x080b9998 in open_file_shared1 (conn=0x832da10, fname=0xbfcc3660 "Sjablonen/prf92.tmp", psbuf=0xbfcc3a60, desired_access=<error type>, share_mode=16, ofun=16, new_dos_mode=<error type>, oplock_request=3, Access=0xbfcc3ad0, paction=0xbfcc3acc) at smbd/open.c:1491 #8 0x08096fae in reply_ntcreate_and_X (conn=0x832da10, inbuf=0xb7afd008 "", outbuf=0xb79f5008 "", length=130, bufsize=131072) at smbd/nttrans.c:844 #9 0x080cb324 in switch_message (type=-512, inbuf=0xb7afd008 "", outbuf=0xb79f5008 "", size=130, bufsize=131072) at smbd/process.c:968 #10 0x080cb6e1 in process_smb (inbuf=0xb7afd008 "", outbuf=0xb79f5008 "") at smbd/process.c:998 #11 0x080cc3e3 in smbd_process () at smbd/process.c:1558 #12 0x0821e473 in main (argc=2, argv=0xbfcc4ff4) at smbd/server.c:951 However, this is a production environment and the customer is going to kill us if we don't fix it fast.. so I might have to kill the smbd processes again within a few hours.
Created attachment 1927 [details] Output of smbstatus
Oh, forgot to tell, backtrace is of samba 3.0.14 (have identical problem with 3.0.22, but this was easier for debugging right now)
I had to kill the process which allegedly caused the problem, now 231 smbd processes, but auth works again. Still a lot fcntl processes left. I'll make backtraces of all smbd processes and attach the file. Ohyeah, no relevant log messages found as it just hangs after a while (tried debuglevel 10)
Created attachment 1928 [details] Backtraces of all smbd processes
I've just killed 13022 which had the same problem (see full backtrace attachment) and all is running fine again, back to 9 smbd processes and a handful of locks. Downside is I just lost myself the testcase, 'til day after tomorrow I'm afraid..
This is probably caused by linking with the pthreads library. Do you know how that got linked in?
libpthread is linked in through libldap_r.so.2.
I do use the passdb ldap backend. Any suggestions?
Created attachment 1959 [details] Backtraces of all processes without /lib/tls
This is probably the relevant backtrace (without /lib/tls): 0xb7d35e94 in __pthread_sigsuspend () from /lib/libpthread.so.0 #0 0xb7d35e94 in __pthread_sigsuspend () from /lib/libpthread.so.0 #1 0xb7d34d68 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0xb7d37716 in __pthread_alt_lock () from /lib/libpthread.so.0 #3 0xb7d3428c in pthread_mutex_lock () from /lib/libpthread.so.0 #4 0xb7e5f9c2 in free () from /lib/libc.so.6 #5 0xb7e79428 in adjtime () from /lib/libc.so.6 #6 0xb7e79ecb in tzset () from /lib/libc.so.6 #7 0xb7e7ebe3 in strftime_l () from /lib/libc.so.6 #8 0xb7e7ea68 in strftime () from /lib/libc.so.6 #9 0x081adf25 in timestring (hires=0) at lib/time.c:722 #10 0x081a9484 in dbghdr (level=0, file=0x8270f80 "lib/fault.c", func=0x8270f73 "fault_report", line=36) at lib/debug.c:947 #11 0x081aa34e in sig_fault (sig=6) at lib/fault.c:36 #12 0xb7d38d26 in __pthread_sighandler () from /lib/libpthread.so.0 #13 <signal handler called> #14 0xb7e236f1 in kill () from /lib/libc.so.6 #15 0xb7d36030 in pthread_kill () from /lib/libpthread.so.0 #16 0xb7d363fb in raise () from /lib/libpthread.so.0 #17 0xb7e23464 in raise () from /lib/libc.so.6 #18 0xb7e249d8 in abort () from /lib/libc.so.6 #19 0xb7e5911f in __fsetlocking () from /lib/libc.so.6 #20 0xb7e5efb2 in malloc_usable_size () from /lib/libc.so.6 #21 0xb7e5f978 in free () from /lib/libc.so.6 #22 0x0817f5b1 in get_share_modes (conn=0x832e490, dev=2050, inode=6358381, pp_shares=0xbfed7c64) at locking/locking.c:523 #23 0x080b7c01 in open_mode_check (conn=0x832e490, fname=0xbfed81d0 "Ambrasoft/RekenSOM/rschlmenu.exe", dev=2050, inode=6358381, desired_access=<error type>, share_mode=32800, p_flags=0xbfed7d40, p_oplock_request=0xbfed7d7c, p_all_current_opens_are_level_II=0xbfed7d3c) at smbd/open.c:611 #24 0x080b964f in open_file_shared1 (conn=0x832e490, fname=0xbfed81d0 "Ambrasoft/RekenSOM/rschlmenu.exe", psbuf=0xbfed85d0, desired_access=<error type>, share_mode=32800, ofun=1, new_dos_mode=<error type>, oplock_request=3, Access=0xbfed8640, paction=0xbfed863c) at smbd/open.c:1181 #25 0x08096fae in reply_ntcreate_and_X (conn=0x832e490, inbuf=0xb7ae9008 "", outbuf=0xb79e1008 "", length=156, bufsize=131072) at smbd/nttrans.c:844 #26 0x080cb324 in switch_message (type=0, inbuf=0xb7ae9008 "", outbuf=0xb79e1008 "", size=156, bufsize=131072) at smbd/process.c:968 #27 0x080cb6e1 in process_smb (inbuf=0xb7ae9008 "", outbuf=0xb79e1008 "") at smbd/process.c:998 #28 0x080cc3e3 in smbd_process () at smbd/process.c:1558 #29 0x0821e473 in main (argc=2, argv=0xbfed9b34) at smbd/server.c:951
Turns out we have the same problem at another customer, attaching another backtrace, in case it's of any use... This may mean that there's a more serious problem when using the LDAP library. I administer 108 linux servers atm, of which 5 in this LDAP setup and about 60 with the MySQL setup, the rest in other setups. I currently only notice this problem when using the ldap passdb backend. on the other boxes libldap is linked in and thus also libpthread.so.0, but we don't use the ldap library there. Hope this helps give a bit more insight into the problem.
Created attachment 1961 [details] Backtrace of all smbd processes, at another customer site
If this really comes down to LDAP, something inside the chain nss_ldap->openldap->tls has to activate pthreads. Your smbds seem to block in free() that itself is calling pthread stuff. As the debian developers said this is not happening anywhere, we are looking at a miracle. Samba definitely does not call any pthread routines, I just grepped for pthread and found a single include line in ./nsswitch/winbind_nss_hpux.h, hardly used on your system :-) Please don't get me wrong, I'd _really_ love to see this fixed, but this would involve walking through a good piece of source code of your system that has nothing to do with Samba. Closing this as invalid, as I don't see how we inside the Samba Team can fix this. Sorry, Volker
Strangely, after removing pthreads from libldap2 and libnss-ldap and installing nscd, it gets stuck in a loop here: #0 0xb7e47dfe in write () from /lib/libc.so.6 #1 0x00000009 in ?? () #2 0x08334a70 in ?? () #3 0x081b2158 in x_fflush (f=0x8334a70) at lib/xfile.c:239 #4 0x081b2237 in x_fwrite (p=0x832ec20, size=<error type>, nmemb=<error type>, f=0x8334a70) at lib/xfile.c:168 #5 0x081b2302 in x_vfprintf (f=0x8334a70, format=0x8270cfc "[%s, %d%s] %s:%s(%d)\n", ap=0xbf870fc4 " K*\b") at lib/xfile.c:203 #6 0x081a9a5d in Debug1 (format_str=0x8270cfc "[%s, %d%s] %s:%s(%d)\n") at lib/debug.c:789 #7 0x081a94a9 in dbghdr (level=0, file=0x8270f80 "lib/fault.c", func=0x8270f73 "fault_report", line=36) at lib/debug.c:947 #8 0x081aa34e in sig_fault (sig=6) at lib/fault.c:36 #9 <signal handler called> #10 0xb7dbf6f1 in kill () from /lib/libc.so.6 #11 0xb7dbf455 in raise () from /lib/libc.so.6 #12 0xb7dc09d8 in abort () from /lib/libc.so.6 #13 0xb7df511f in __fsetlocking () from /lib/libc.so.6 #14 0xb7dfafb2 in malloc_usable_size () from /lib/libc.so.6 #15 0xb7dfb978 in free () from /lib/libc.so.6 #16 0x0817f5b1 in get_share_modes (conn=0x8333ba8, dev=2050, inode=3130783, pp_shares=0xbf871f64) at locking/locking.c:523 #17 0x080b7c01 in open_mode_check (conn=0x8333ba8, fname=0xbf8724d0 "Ambrasoft/RealSpeak/lhsp01.dll", dev=2050, inode=3130783, desired_access=<error type>, share_mode=32800, p_flags=0xbf872040, p_oplock_request=0xbf87207c, p_all_current_opens_are_level_II=0xbf87203c) at smbd/open.c:611 ---Type <return> to continue, or q <return> to quit--- #18 0x080b964f in open_file_shared1 (conn=0x8333ba8, fname=0xbf8724d0 "Ambrasoft/RealSpeak/lhsp01.dll", psbuf=0xbf8728d0, desired_access=<error type>, share_mode=32800, ofun=1, new_dos_mode=<error type>, oplock_request=3, Access=0xbf872940, paction=0xbf87293c) at smbd/open.c:1181 #19 0x08096fae in reply_ntcreate_and_X (conn=0x8333ba8, inbuf=0xb7759008 "", outbuf=0xb770c008 "", length=152, bufsize=131072) at smbd/nttrans.c:844 #20 0x080cb324 in switch_message (type=0, inbuf=0xb7759008 "", outbuf=0xb770c008 "", size=152, bufsize=131072) at smbd/process.c:968 #21 0x080cb6e1 in process_smb (inbuf=0xb7759008 "", outbuf=0xb770c008 "") at smbd/process.c:998 #22 0x080cc3e3 in smbd_process () at smbd/process.c:1558 #23 0x0821e473 in main (argc=2, argv=0xbf873e34) at smbd/server.c:951
Another quick question: Is the locking.tdb on a reiser filesystem? Volker
No, ext3 filesystem.
You say it does loop in that backtrace? So this means it continually spits out debug statements? Or is it fully stuck? Volker
No, write gets called over and over again. (I set the breakpoint to write there and it kept breaking) Now at least I get some better results, and a log message in my syslog: Signal 6, this message two times in the log: smbd[24936]: [2006/06/15 13:34:27, 0] lib/fault.c:fault_report(36) smbd[24936]: =============================================================== smbd[24936]: [2006/06/15 13:34:27, 0] lib/fault.c:fault_report(37) smbd[24936]: INTERNAL ERROR: Signal 6 in pid 24936 (3.0.14a-Debian) smbd[24936]: Please read the appendix Bugs of the Samba HOWTO collection smbd[24936]: [2006/06/15 13:34:27, 0] lib/fault.c:fault_report(39) smbd[24936]: =============================================================== smbd[24936]: [2006/06/15 13:34:27, 0] lib/util.c:smb_panic2(1474) smbd[24936]: smb_panic: clobber_region() last called from [get_share_modes(501)] smbd[24936]: [2006/06/15 13:34:27, 0] lib/util.c:smb_panic2(1485) smbd[24936]: smb_panic(): calling panic action [/usr/share/samba/panic-action 24936] smbd[24936]: [2006/06/15 13:34:28, 0] lib/util.c:smb_panic2(1493) smbd[24936]: smb_panic(): action returned status 0 smbd[24936]: [2006/06/15 13:34:28, 0] lib/util.c:smb_panic2(1495) smbd[24936]: PANIC: internal error smbd[24936]: [2006/06/15 13:34:28, 0] lib/util.c:smb_panic2(1503) smbd[24936]: BACKTRACE: 18 stack frames: smbd[24936]: #0 /usr/sbin/smbd(smb_panic2+0x9d) [0x81bc0bd] smbd[24936]: #1 /usr/sbin/smbd(smb_panic+0x11) [0x81bc2b1] smbd[24936]: #2 /usr/sbin/smbd [0x81aa3dd] smbd[24936]: #3 [0xffffe420] smbd[24936]: #4 /lib/libc.so.6(abort+0x108) [0xb7dc09d8] smbd[24936]: #5 /lib/libc.so.6 [0xb7df511f] smbd[24936]: #6 /lib/libc.so.6 [0xb7dfafb2] smbd[24936]: #7 /lib/libc.so.6(__libc_free+0x78) [0xb7dfb978] smbd[24936]: #8 /usr/sbin/smbd(get_share_modes+0xc1) [0x817f5b1] smbd[24936]: #9 /usr/sbin/smbd [0x80b7c01] smbd[24936]: #10 /usr/sbin/smbd(open_file_shared1+0x58f) [0x80b964f] smbd[24936]: #11 /usr/sbin/smbd(reply_ntcreate_and_X+0xa8e) [0x8096fae] smbd[24936]: #12 /usr/sbin/smbd [0x80cb324] smbd[24936]: #13 /usr/sbin/smbd(process_smb+0x171) [0x80cb6e1] smbd[24936]: #14 /usr/sbin/smbd(smbd_process+0x1a3) [0x80cc3e3] smbd[24936]: #15 /usr/sbin/smbd(main+0x7e3) [0x821e473] smbd[24936]: #16 /lib/libc.so.6(__libc_start_main+0x9e) [0xb7dab3be] smbd[24936]: #17 /usr/sbin/smbd [0x8077991] smbd[24936]:
Hmm. The only real chance I see to further debug this is to compile with debugging symbols (-g) and run smbd under valgrind. I know this will probably not work in production, because valgrind usually slows stuff down by a factor of more than 10, but if you have a reproducable case this might show the real problem. Volker
This line : #13 0xb7df511f in __fsetlocking () from /lib/libc.so.6 in the backtrace shows something strange. Why is this being called from free() ? Do you still have pthreads linked in ? If not, looks like memory corruption to me. Run with the glibc MALLOC_CHECK_ turned up high (man malloc for details) and try again. Jeremy,
Yes, this was without pthreads linked in (checked it thoroughly). I've tried upgrading to 3.0.23rc2 for now, retrying with that one, see what happens. In the weekend I'll set up another test case see if I can reproduce it and capture what's going on.. give valgrind a go..
This weekend I've done some testing.. made the system reproducable. It seemed to crash before the NT Create AndX response (as we can see in the backtrace also). It only happened when starting a particular software package with 8 computers at the same time (as happens in classrooms, number may vary). We started by making it reproducable with a batch file, waiting for a start.txt file to appear and then start the software package. This crashed samba in 80% of the cases. Tried 3.0.14a, 3.0.22 and 3.0.23rc3 of which 3.0.14a and 3.0.23rc3 with and without pthread library. Though you'd like to know that this hang happens with 3.0.14a both with and without pthreads. With 3.0.22 I only tried with pthreads, it also hanged. With 3.0.23rc3 both with and without pthreads no hangs occur (at least with my testing), so it appears that it is fixed. Thanks for your efforts.
Okay, if you got it reproducable, can you upload a full sniff of it? Does it also happen with debug level 10? If yes, please also upload one. Maybe with the sniff we can get it down to some very few operations. Volker P.S: Thank you for persisting.... :-) As I said, I'd like to nail that one. If it does not happen with 3.0.23, I want to know *why* :-)
Created attachment 1973 [details] Level 10 debug log w/ 3.0.14a and pthread
Created attachment 1974 [details] Backtrace corresponding with the log
Created attachment 1975 [details] libpcap format network dump file (binary)
Just uploaded three files: * Network dump * Level 10 smbd debug log * Backtrace corresponding with the dump and log Note that calls to ambra.bat and start.txt should be ignored, this is my torture batchfile. The batch-file Ambra.bat: @echo off :10 if exist N:\start.txt goto 20 goto 10 :20 S: cd "\ambrasoft\tafel totaal" :30 start /w tschlmenu.exe goto 10 Logs and traces made with: touch start.txt;(tail -f /var/log/samba/samba.log >log.smbd.hang &);tethereal -F libpcap -i eth1 -w netdump.log port not 22;killall tail;nice -n -20 ./genbacktrace.sh; rm start.txt With this I tried to make the network dump and logs as relevant as possible. Good luck! Wilco