nmbd core dumps during runtime, there are beside nmbd.tdb/dat some zero-sized files (f.e wins.24436) left in var/locks
Created attachment 1991 [details] nmbd log
Created attachment 1992 [details] core dump
happend again, WINS is dead till restart of nmbd [2006/06/28 11:34:43, 0] lib/util.c:log_stack_trace(1699) BACKTRACE: 12 stack frames: #0 /server/samba/sbin/nmbd(log_stack_trace+0x2e) [0x8009900e] #1 /server/samba/sbin/nmbd(smb_panic+0x57) [0x80099147] #2 /server/samba/sbin/nmbd [0x80084dff] #3 [0xffffe420] #4 /lib/tls/libc.so.6(malloc+0x94) [0x40115364] #5 /server/samba/sbin/nmbd [0x80034ef3] #6 /server/samba/sbin/nmbd [0x8003921c] #7 /server/samba/sbin/nmbd(tdb_traverse+0x10d) [0x800aedcd] #8 /server/samba/sbin/nmbd(initiate_wins_processing+0x8d) [0x80039bcd] #9 /server/samba/sbin/nmbd(main+0x57c) [0x8001e82c] #10 /lib/tls/libc.so.6(__libc_start_main+0xe0) [0x400c64b0] #11 /server/samba/sbin/nmbd [0x8001cf32]
This looks like a core dump in the malloc libraries. Can you do a couple of things : 1). Give me the output of ldd run against the nmbd binary (the easy part). 2). Run under valgrind - this will definately find the problem but might take too long. In addition I need to see your smb.conf and also the configure line you used (or which rpm you downloaded). Jeremy.
linux-gate.so.1 => (0xffffe000) libcrypt.so.1 => /lib/libcrypt.so.1 (0x4001e000) libresolv.so.2 => /lib/libresolv.so.2 (0x4004f000) libnsl.so.1 => /lib/libnsl.so.1 (0x40061000) libdl.so.2 => /lib/libdl.so.2 (0x40076000) libldap-2.2.so.7 => /server/openldap/lib/libldap-2.2.so.7 (0x40079000) liblber-2.2.so.7 => /server/openldap/lib/liblber-2.2.so.7 (0x400a4000) libc.so.6 => /lib/tls/libc.so.6 (0x400b1000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000) exactly the same ouput as the nmbd of 3.0.22 (running fine)
another crash with level 10 enabled: [2006/06/28 17:23:26, 4] nmbd/nmbd_packets.c:send_mailslot(1921) send_mailslot: Sending to mailslot \MAILSLOT\NET\GETDC501 from SERVER<00> IP 192.168.100.100 to GENERAL<00> IP 192.168.100. send_mailslot: Sending to mailslot \MAILSLOT\NET\GETDC501 from SERVER<00> IP 192.168.100.100 to GENERAL<00> IP 192.168.100. =============================================================== [2006/06/28 17:23:26, 0] lib/fault.c:fault_report(42) INTERNAL ERROR: Signal 11 in pid 7601 (3.0.23rc3) Please read the Trouble-Shooting section of the Samba3-HOWTO [2006/06/28 17:23:26, 0] lib/fault.c:fault_report(44) From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf [2006/06/28 17:23:26, 0] lib/fault.c:fault_report(45) =============================================================== [2006/06/28 17:23:26, 0] lib/util.c:smb_panic(1592) PANIC (pid 7601): internal error [2006/06/28 17:23:26, 0] lib/util.c:log_stack_trace(1699) BACKTRACE: 15 stack frames: #0 /server/samba/sbin/nmbd(log_stack_trace+0x2e) [0x8009900e] #1 /server/samba/sbin/nmbd(smb_panic+0x57) [0x80099147] #2 /server/samba/sbin/nmbd [0x80084dff] #3 [0xffffe420] #4 /lib/tls/libc.so.6(__libc_free+0x8b) [0x4011361b] #5 /server/samba/sbin/nmbd(x_vfprintf+0x6b) [0x8008d06b] #6 /server/samba/sbin/nmbd(Debug1+0x119) [0x80084279] #7 /server/samba/sbin/nmbd [0x800842d9] #8 /server/samba/sbin/nmbd(dbgtext+0xd1) [0x80083cb1] #9 /server/samba/sbin/nmbd(send_mailslot+0x3fd) [0x800301ad] #10 /server/samba/sbin/nmbd(process_logon_packet+0x6db) [0x8003092b] #11 /server/samba/sbin/nmbd(run_packet_queue+0xf7d) [0x8002ed0d] #12 /server/samba/sbin/nmbd(main+0x4f9) [0x8001e7a9] #13 /lib/tls/libc.so.6(__libc_start_main+0xe0) [0x400c64b0] #14 /server/samba/sbin/nmbd [0x8001cf32]
configuration line CFLAGS="-O2 -march=pentium3" CPPFLAGS=-I/server/openldap/include LDFLAGS=-L/server/openldap/lib ./configure --prefix=/server/samba --with-mandir=/usr/local/man --without-winbind --without-libsmbclient --without-libsmbsharemodes --without-libmsrpc --with-acl-support
Global part of smb.conf [global] workgroup = I-BN security = user encrypt passwords = yes wins support = yes domain logons = yes logon script = user.cmd logon path = \\server\Profiles\%U logon home = logon drive = domain master = yes time server = yes passdb backend = ldapsam ldap...
Ok, if you're building this yourself can you please build with -g, not with -O and then add the following to the global section of your smb.conf : panic action = "/bin/sleep 90000" If nmbd then panics it will hang around waiting for the sleep to terminate. IOnce you spot a "sleep" process, find the parent (nmbd) and attach to it using gdb and then type : gdb /usr/local/samba/sbin/nmbd ....startup stuff... attach <pid> bt to get a backtrace. With -g this will give all the local variables so we can get a better idea of what's going wrong. In addition, as this seems to be a malloc issue, you might try setting the following environment variable before running nmbd (with the panic action set). export MALLOC_CHECK_=2 /usr/local/samba/sbin/nmbd this will cause an immediate panic once memory corruption is detected (but isn't as accurate as valgrind). Thanks a lot for testing this for us. Jeremy
Can you check for me if your wins tdb is corrupted by running tdbdump over wins.tdb and see if it can read all the records correctly ? You can ensure it's in a known good state by running tdbbackup over it, and replacing it qith the .bak file. Jeremy.
first valgrind shoot without re-compilation ==26739== ==26739== Use of uninitialised value of size 4 ==26739== at 0x355B3: remove_name_from_wins_namelist (in /server/samba/sbin/nmbd) ==26739== by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd) ==26739== by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd) ==26739== by 0x39BCC: initiate_wins_processing (in /server/samba/sbin/nmbd) ==26739== by 0x1E82B: main (in /server/samba/sbin/nmbd) ==26739== ==26739== Invalid write of size 4 ==26739== at 0x355B3: remove_name_from_wins_namelist (in /server/samba/sbin/nmbd) ==26739== by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd) ==26739== by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd) ==26739== by 0x39BCC: initiate_wins_processing (in /server/samba/sbin/nmbd) ==26739== by 0x1E82B: main (in /server/samba/sbin/nmbd) ==26739== Address 0x434B325B is not stack'd, malloc'd or (recently) free'd ==26739== ==26739== ERROR SUMMARY: 134 errors from 6 contexts (suppressed: 27 from 2) ==26739== malloc/free: in use at exit: 548,383 bytes in 338 blocks. ==26739== malloc/free: 274,477 allocs, 274,139 frees, 73,567,087 bytes allocated. ==26739== For counts of detected errors, rerun with: -v ==26739== searching for pointers to 338 not-freed blocks. ==26739== checked 475,936 bytes. ==26739== ==26739== ==26739== 1 bytes in 1 blocks are definitely lost in loss record 1 of 27 ==26739== at 0x481B6D6: malloc (vg_replace_malloc.c:149) ==26739== by 0x49137E5: vasprintf (in /lib/tls/libc.so.6) ==26739== by 0x8D02B: x_vfprintf (in /server/samba/sbin/nmbd) ==26739== by 0x84278: Debug1 (in /server/samba/sbin/nmbd) ==26739== by 0x842D8: bufr_print (in /server/samba/sbin/nmbd) ==26739== by 0x84306: dbgflush (in /server/samba/sbin/nmbd) ==26739== by 0x85272: dump_core (in /server/samba/sbin/nmbd) ==26739== by 0x991F4: smb_panic (in /server/samba/sbin/nmbd) ==26739== by 0x84DFE: sig_fault (in /server/samba/sbin/nmbd) ==26739== by 0x48E07F7: (within /lib/tls/libc.so.6) ==26739== by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd) ==26739== by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd) ==26739== ==26739== ==26739== 559 bytes in 1 blocks are definitely lost in loss record 17 of 27 ==26739== at 0x481B6D6: malloc (vg_replace_malloc.c:149) ==26739== by 0x4982AE3: backtrace_symbols (in /lib/tls/libc.so.6) ==26739== by 0x9901B: log_stack_trace (in /server/samba/sbin/nmbd) ==26739== by 0x99146: smb_panic (in /server/samba/sbin/nmbd) ==26739== by 0x84DFE: sig_fault (in /server/samba/sbin/nmbd) ==26739== by 0x48E07F7: (within /lib/tls/libc.so.6) ==26739== by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd) ==26739== by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd) ==26739== by 0x39BCC: initiate_wins_processing (in /server/samba/sbin/nmbd) ==26739== by 0x1E82B: main (in /server/samba/sbin/nmbd) ==26739== ==26739== LEAK SUMMARY: ==26739== definitely lost: 560 bytes in 2 blocks. ==26739== possibly lost: 0 bytes in 0 blocks. ==26739== still reachable: 547,823 bytes in 336 blocks. ==26739== suppressed: 0 bytes in 0 blocks. ==26739== Reachable blocks (those to which a pointer was found) are not shown. ==26739== To see them, rerun with: --show-reachable=yes
I will check later, but I suppose tdb corruption too, because 3 other servers run stable. If I delete wins.tdb, the same problem with the new created wins.tdb occures again. So I supposed it has to be specific to the wins.dat, but the only obviously difference to other servers is a very long line with many IP addresses for an entry.
Dammit I'd *kill* for line number info on that valgrind trace.... Can you recomile with -g (and no -O) and reproduce it ? Jermey.
Got it ! Found the bug..... Expect a fix shortly ! Thanks for this.... Jeremy
Created attachment 1996 [details] Patch. This should fix the double-free and uninitialized access. Please test and let me know ! Jeremy.
==19937== Use of uninitialised value of size 4 ==19937== at 0x3A0D4: remove_name_from_wins_namelist (nmbd_winsserver.c:296) ==19937== by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106) ==19937== by 0xC9E20: tdb_traverse (tdb.c:1403) ==19937== by 0x3F082: initiate_wins_processing (nmbd_winsserver.c:2194) ==19937== by 0x1E7C7: process (nmbd.c:512) ==19937== by 0x1F32B: main (nmbd.c:801) ==19937== ==19937== Invalid write of size 4 ==19937== at 0x3A0D4: remove_name_from_wins_namelist (nmbd_winsserver.c:296) ==19937== by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106) ==19937== by 0xC9E20: tdb_traverse (tdb.c:1403) ==19937== by 0x3F082: initiate_wins_processing (nmbd_winsserver.c:2194) ==19937== by 0x1E7C7: process (nmbd.c:512) ==19937== by 0x1F32B: main (nmbd.c:801) ==19937== Address 0x450048 is not stack'd, malloc'd or (recently) free'd ==19937== ==19937== ERROR SUMMARY: 139 errors from 6 contexts (suppressed: 27 from 2) ==19937== malloc/free: in use at exit: 548,605 bytes in 340 blocks. ==19937== malloc/free: 260,191 allocs, 259,851 frees, 69,696,131 bytes allocated. ==19937== For counts of detected errors, rerun with: -v ==19937== searching for pointers to 340 not-freed blocks. ==19937== checked 471,912 bytes. ==19937== ==19937== ==19937== 1 bytes in 1 blocks are definitely lost in loss record 1 of 27 ==19937== at 0x481B6D6: malloc (vg_replace_malloc.c:149) ==19937== by 0x49137E5: vasprintf (in /lib/tls/libc.so.6) ==19937== by 0xA252D: x_vfprintf (xfile.c:226) ==19937== by 0x987AA: Debug1 (debug.c:825) ==19937== by 0x98817: bufr_print (debug.c:846) ==19937== by 0x98937: dbgflush (debug.c:908) ==19937== by 0x992F1: dump_core (fault.c:177) ==19937== by 0xAFF31: smb_panic (util.c:1611) ==19937== by 0x98D81: fault_report (fault.c:47) ==19937== by 0x98D92: sig_fault (fault.c:70) ==19937== by 0x48E07F7: (within /lib/tls/libc.so.6) ==19937== by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106) ==19937== ==19937== ==19937== 641 bytes in 1 blocks are definitely lost in loss record 18 of 27 ==19937== at 0x481B6D6: malloc (vg_replace_malloc.c:149) ==19937== by 0x4982AE3: backtrace_symbols (in /lib/tls/libc.so.6) ==19937== by 0xAFF7A: log_stack_trace (util.c:1696) ==19937== by 0xAFD8C: smb_panic (util.c:1593) ==19937== by 0x98D81: fault_report (fault.c:47) ==19937== by 0x98D92: sig_fault (fault.c:70) ==19937== by 0x48E07F7: (within /lib/tls/libc.so.6) ==19937== by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106) ==19937== by 0xC9E20: tdb_traverse (tdb.c:1403) ==19937== by 0x3F082: initiate_wins_processing (nmbd_winsserver.c:2194) ==19937== by 0x1E7C7: process (nmbd.c:512) ==19937== by 0x1F32B: main (nmbd.c:801) ==19937== ==19937== LEAK SUMMARY: ==19937== definitely lost: 642 bytes in 2 blocks. ==19937== possibly lost: 0 bytes in 0 blocks. ==19937== still reachable: 547,963 bytes in 338 blocks. ==19937== suppressed: 0 bytes in 0 blocks.
Just for completeness here again with -g Thank you, I'll test the patch and let you know. Daniel
Thanks - that confirms the two changes I made (zeroing out the malloc'ed structure and removing the double-free) - they would cause the valgrind errors you posted. Jeremy.
After 10h hours of runtime I got no crash, so I think the bug is resolved. We use RC3 on 4 production servers for 2 days now, runs very smoothly, no oplock breaks till now, smbstatus orphans (incl. byte range locking) gone away, smells like 3.0.23 is going to become a good release, crossing fingers. Daniel