Bug 3876 - nmbd core dump with RC3
Summary: nmbd core dump with RC3
Status: CLOSED FIXED
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: nmbd (show other bugs)
Version: 3.0.23
Hardware: x86 Linux
: P3 normal
Target Milestone: none
Assignee: Samba Bugzilla Account
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-06-28 03:05 UTC by Daniel Beschorner (dead mail address)
Modified: 2006-06-29 04:19 UTC (History)
0 users

See Also:


Attachments
nmbd log (13.31 KB, text/plain)
2006-06-28 03:05 UTC, Daniel Beschorner (dead mail address)
no flags Details
core dump (59.25 KB, application/x-gzip)
2006-06-28 03:05 UTC, Daniel Beschorner (dead mail address)
no flags Details
Patch. (549 bytes, patch)
2006-06-28 15:37 UTC, Jeremy Allison
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Beschorner (dead mail address) 2006-06-28 03:05:05 UTC
nmbd core dumps during runtime, there are beside nmbd.tdb/dat some zero-sized files (f.e wins.24436) left in var/locks
Comment 1 Daniel Beschorner (dead mail address) 2006-06-28 03:05:30 UTC
Created attachment 1991 [details]
nmbd log
Comment 2 Daniel Beschorner (dead mail address) 2006-06-28 03:05:50 UTC
Created attachment 1992 [details]
core dump
Comment 3 Daniel Beschorner (dead mail address) 2006-06-28 04:56:39 UTC
happend again, WINS is dead till restart of nmbd

[2006/06/28 11:34:43, 0] lib/util.c:log_stack_trace(1699)
  BACKTRACE: 12 stack frames:
   #0 /server/samba/sbin/nmbd(log_stack_trace+0x2e) [0x8009900e]
   #1 /server/samba/sbin/nmbd(smb_panic+0x57) [0x80099147]
   #2 /server/samba/sbin/nmbd [0x80084dff]
   #3 [0xffffe420]
   #4 /lib/tls/libc.so.6(malloc+0x94) [0x40115364]
   #5 /server/samba/sbin/nmbd [0x80034ef3]
   #6 /server/samba/sbin/nmbd [0x8003921c]
   #7 /server/samba/sbin/nmbd(tdb_traverse+0x10d) [0x800aedcd]
   #8 /server/samba/sbin/nmbd(initiate_wins_processing+0x8d) [0x80039bcd]
   #9 /server/samba/sbin/nmbd(main+0x57c) [0x8001e82c]
   #10 /lib/tls/libc.so.6(__libc_start_main+0xe0) [0x400c64b0]
   #11 /server/samba/sbin/nmbd [0x8001cf32]
Comment 4 Jeremy Allison 2006-06-28 10:26:56 UTC
This looks like a core dump in the malloc libraries. Can you do a couple of things :

1). Give me the output of ldd run against the nmbd binary (the easy part).
2). Run under valgrind - this will definately find the problem but might take too long.

In addition I need to see your smb.conf and also the configure line you used (or which rpm you downloaded).

Jeremy.
Comment 5 Daniel Beschorner (dead mail address) 2006-06-28 10:39:38 UTC
linux-gate.so.1 =>  (0xffffe000)
libcrypt.so.1 => /lib/libcrypt.so.1 (0x4001e000)
libresolv.so.2 => /lib/libresolv.so.2 (0x4004f000)
libnsl.so.1 => /lib/libnsl.so.1 (0x40061000)
libdl.so.2 => /lib/libdl.so.2 (0x40076000)
libldap-2.2.so.7 => /server/openldap/lib/libldap-2.2.so.7 (0x40079000)
liblber-2.2.so.7 => /server/openldap/lib/liblber-2.2.so.7 (0x400a4000)
libc.so.6 => /lib/tls/libc.so.6 (0x400b1000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

exactly the same ouput as the nmbd of 3.0.22 (running fine)
Comment 6 Daniel Beschorner (dead mail address) 2006-06-28 10:41:46 UTC
another crash with level 10 enabled:

[2006/06/28 17:23:26, 4] nmbd/nmbd_packets.c:send_mailslot(1921)
  send_mailslot: Sending to mailslot \MAILSLOT\NET\GETDC501 from SERVER<00> IP 192.168.100.100 to GENERAL<00> IP 192.168.100.
  send_mailslot: Sending to mailslot \MAILSLOT\NET\GETDC501 from SERVER<00> IP 192.168.100.100 to GENERAL<00> IP 192.168.100.
===============================================================
[2006/06/28 17:23:26, 0] lib/fault.c:fault_report(42)
  INTERNAL ERROR: Signal 11 in pid 7601 (3.0.23rc3)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2006/06/28 17:23:26, 0] lib/fault.c:fault_report(44)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2006/06/28 17:23:26, 0] lib/fault.c:fault_report(45)
  ===============================================================
[2006/06/28 17:23:26, 0] lib/util.c:smb_panic(1592)
  PANIC (pid 7601): internal error
[2006/06/28 17:23:26, 0] lib/util.c:log_stack_trace(1699)
  BACKTRACE: 15 stack frames:
   #0 /server/samba/sbin/nmbd(log_stack_trace+0x2e) [0x8009900e]
   #1 /server/samba/sbin/nmbd(smb_panic+0x57) [0x80099147]
   #2 /server/samba/sbin/nmbd [0x80084dff]
   #3 [0xffffe420]
   #4 /lib/tls/libc.so.6(__libc_free+0x8b) [0x4011361b]
   #5 /server/samba/sbin/nmbd(x_vfprintf+0x6b) [0x8008d06b]
   #6 /server/samba/sbin/nmbd(Debug1+0x119) [0x80084279]
   #7 /server/samba/sbin/nmbd [0x800842d9]
   #8 /server/samba/sbin/nmbd(dbgtext+0xd1) [0x80083cb1]
   #9 /server/samba/sbin/nmbd(send_mailslot+0x3fd) [0x800301ad]
   #10 /server/samba/sbin/nmbd(process_logon_packet+0x6db) [0x8003092b]
   #11 /server/samba/sbin/nmbd(run_packet_queue+0xf7d) [0x8002ed0d]
   #12 /server/samba/sbin/nmbd(main+0x4f9) [0x8001e7a9]
   #13 /lib/tls/libc.so.6(__libc_start_main+0xe0) [0x400c64b0]
   #14 /server/samba/sbin/nmbd [0x8001cf32]
Comment 7 Daniel Beschorner (dead mail address) 2006-06-28 10:44:02 UTC
configuration line

CFLAGS="-O2 -march=pentium3" CPPFLAGS=-I/server/openldap/include LDFLAGS=-L/server/openldap/lib ./configure --prefix=/server/samba --with-mandir=/usr/local/man --without-winbind --without-libsmbclient --without-libsmbsharemodes --without-libmsrpc --with-acl-support
Comment 8 Daniel Beschorner (dead mail address) 2006-06-28 10:49:33 UTC
Global part of smb.conf

[global]
   workgroup = I-BN
   security = user
   encrypt passwords = yes
   wins support = yes
   domain logons = yes
   logon script = user.cmd
   logon path = \\server\Profiles\%U
   logon home =
   logon drive =
   domain master = yes
   time server = yes
   passdb backend = ldapsam
   ldap...
Comment 9 Jeremy Allison 2006-06-28 12:05:52 UTC
Ok, if you're building this yourself can you please build with -g, not with -O and then add the following to the global section of your smb.conf :

panic action = "/bin/sleep 90000"

If nmbd then panics it will hang around waiting for the sleep to terminate. IOnce you spot a "sleep" process, find the parent (nmbd) and attach to it using gdb and then type :

gdb /usr/local/samba/sbin/nmbd
....startup stuff...
attach <pid>
bt

to get a backtrace. With -g this will give all the local variables so we can get a better idea of what's going wrong.

In addition, as this seems to be a malloc issue, you might try setting the following environment variable before running nmbd (with the panic action set).

export MALLOC_CHECK_=2
/usr/local/samba/sbin/nmbd

this will cause an immediate panic once memory corruption is detected (but isn't as accurate as valgrind).

Thanks a lot for testing this for us.

Jeremy
Comment 10 Jeremy Allison 2006-06-28 14:15:43 UTC
Can you check for me if your wins tdb is corrupted by running tdbdump over wins.tdb and see if it can read all the records correctly ?

You can ensure it's in a known good state by running tdbbackup over it, and replacing it qith the .bak file.

Jeremy.
Comment 11 Daniel Beschorner (dead mail address) 2006-06-28 14:16:56 UTC
first valgrind shoot without re-compilation

==26739==
==26739== Use of uninitialised value of size 4
==26739==    at 0x355B3: remove_name_from_wins_namelist (in /server/samba/sbin/nmbd)
==26739==    by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd)
==26739==    by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd)
==26739==    by 0x39BCC: initiate_wins_processing (in /server/samba/sbin/nmbd)
==26739==    by 0x1E82B: main (in /server/samba/sbin/nmbd)
==26739==
==26739== Invalid write of size 4
==26739==    at 0x355B3: remove_name_from_wins_namelist (in /server/samba/sbin/nmbd)
==26739==    by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd)
==26739==    by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd)
==26739==    by 0x39BCC: initiate_wins_processing (in /server/samba/sbin/nmbd)
==26739==    by 0x1E82B: main (in /server/samba/sbin/nmbd)
==26739==  Address 0x434B325B is not stack'd, malloc'd or (recently) free'd
==26739==
==26739== ERROR SUMMARY: 134 errors from 6 contexts (suppressed: 27 from 2)
==26739== malloc/free: in use at exit: 548,383 bytes in 338 blocks.
==26739== malloc/free: 274,477 allocs, 274,139 frees, 73,567,087 bytes allocated.
==26739== For counts of detected errors, rerun with: -v
==26739== searching for pointers to 338 not-freed blocks.
==26739== checked 475,936 bytes.
==26739==
==26739==
==26739== 1 bytes in 1 blocks are definitely lost in loss record 1 of 27
==26739==    at 0x481B6D6: malloc (vg_replace_malloc.c:149)
==26739==    by 0x49137E5: vasprintf (in /lib/tls/libc.so.6)
==26739==    by 0x8D02B: x_vfprintf (in /server/samba/sbin/nmbd)
==26739==    by 0x84278: Debug1 (in /server/samba/sbin/nmbd)
==26739==    by 0x842D8: bufr_print (in /server/samba/sbin/nmbd)
==26739==    by 0x84306: dbgflush (in /server/samba/sbin/nmbd)
==26739==    by 0x85272: dump_core (in /server/samba/sbin/nmbd)
==26739==    by 0x991F4: smb_panic (in /server/samba/sbin/nmbd)
==26739==    by 0x84DFE: sig_fault (in /server/samba/sbin/nmbd)
==26739==    by 0x48E07F7: (within /lib/tls/libc.so.6)
==26739==    by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd)
==26739==    by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd)
==26739==
==26739==
==26739== 559 bytes in 1 blocks are definitely lost in loss record 17 of 27
==26739==    at 0x481B6D6: malloc (vg_replace_malloc.c:149)
==26739==    by 0x4982AE3: backtrace_symbols (in /lib/tls/libc.so.6)
==26739==    by 0x9901B: log_stack_trace (in /server/samba/sbin/nmbd)
==26739==    by 0x99146: smb_panic (in /server/samba/sbin/nmbd)
==26739==    by 0x84DFE: sig_fault (in /server/samba/sbin/nmbd)
==26739==    by 0x48E07F7: (within /lib/tls/libc.so.6)
==26739==    by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd)
==26739==    by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd)
==26739==    by 0x39BCC: initiate_wins_processing (in /server/samba/sbin/nmbd)
==26739==    by 0x1E82B: main (in /server/samba/sbin/nmbd)
==26739==
==26739== LEAK SUMMARY:
==26739==    definitely lost: 560 bytes in 2 blocks.
==26739==      possibly lost: 0 bytes in 0 blocks.
==26739==    still reachable: 547,823 bytes in 336 blocks.
==26739==         suppressed: 0 bytes in 0 blocks.
==26739== Reachable blocks (those to which a pointer was found) are not shown.
==26739== To see them, rerun with: --show-reachable=yes
Comment 12 Daniel Beschorner (dead mail address) 2006-06-28 14:21:19 UTC
I will check later, but I suppose tdb corruption too, because 3 other servers run stable.
If I delete wins.tdb, the same problem with the new created wins.tdb occures again.
So I supposed it has to be specific to the wins.dat, but the only obviously difference to other servers is a very long line with many IP addresses for an entry.
Comment 13 Jeremy Allison 2006-06-28 14:25:05 UTC
Dammit I'd *kill* for line number info on that valgrind trace....

Can you recomile with -g (and no -O) and reproduce it ?

Jermey.
Comment 14 Jeremy Allison 2006-06-28 15:07:27 UTC
Got it ! Found the bug.....
Expect a fix shortly !

Thanks for this....

Jeremy
Comment 15 Jeremy Allison 2006-06-28 15:37:54 UTC
Created attachment 1996 [details]
Patch.

This should fix the double-free and uninitialized access.
Please test and let me know !
Jeremy.
Comment 16 Daniel Beschorner (dead mail address) 2006-06-28 18:29:16 UTC
==19937== Use of uninitialised value of size 4
==19937==    at 0x3A0D4: remove_name_from_wins_namelist (nmbd_winsserver.c:296)
==19937==    by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106)
==19937==    by 0xC9E20: tdb_traverse (tdb.c:1403)
==19937==    by 0x3F082: initiate_wins_processing (nmbd_winsserver.c:2194)
==19937==    by 0x1E7C7: process (nmbd.c:512)
==19937==    by 0x1F32B: main (nmbd.c:801)
==19937==
==19937== Invalid write of size 4
==19937==    at 0x3A0D4: remove_name_from_wins_namelist (nmbd_winsserver.c:296)
==19937==    by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106)
==19937==    by 0xC9E20: tdb_traverse (tdb.c:1403)
==19937==    by 0x3F082: initiate_wins_processing (nmbd_winsserver.c:2194)
==19937==    by 0x1E7C7: process (nmbd.c:512)
==19937==    by 0x1F32B: main (nmbd.c:801)
==19937==  Address 0x450048 is not stack'd, malloc'd or (recently) free'd
==19937==
==19937== ERROR SUMMARY: 139 errors from 6 contexts (suppressed: 27 from 2)
==19937== malloc/free: in use at exit: 548,605 bytes in 340 blocks.
==19937== malloc/free: 260,191 allocs, 259,851 frees, 69,696,131 bytes allocated.
==19937== For counts of detected errors, rerun with: -v
==19937== searching for pointers to 340 not-freed blocks.
==19937== checked 471,912 bytes.
==19937==
==19937==
==19937== 1 bytes in 1 blocks are definitely lost in loss record 1 of 27
==19937==    at 0x481B6D6: malloc (vg_replace_malloc.c:149)
==19937==    by 0x49137E5: vasprintf (in /lib/tls/libc.so.6)
==19937==    by 0xA252D: x_vfprintf (xfile.c:226)
==19937==    by 0x987AA: Debug1 (debug.c:825)
==19937==    by 0x98817: bufr_print (debug.c:846)
==19937==    by 0x98937: dbgflush (debug.c:908)
==19937==    by 0x992F1: dump_core (fault.c:177)
==19937==    by 0xAFF31: smb_panic (util.c:1611)
==19937==    by 0x98D81: fault_report (fault.c:47)
==19937==    by 0x98D92: sig_fault (fault.c:70)
==19937==    by 0x48E07F7: (within /lib/tls/libc.so.6)
==19937==    by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106)
==19937==
==19937==
==19937== 641 bytes in 1 blocks are definitely lost in loss record 18 of 27
==19937==    at 0x481B6D6: malloc (vg_replace_malloc.c:149)
==19937==    by 0x4982AE3: backtrace_symbols (in /lib/tls/libc.so.6)
==19937==    by 0xAFF7A: log_stack_trace (util.c:1696)
==19937==    by 0xAFD8C: smb_panic (util.c:1593)
==19937==    by 0x98D81: fault_report (fault.c:47)
==19937==    by 0x98D92: sig_fault (fault.c:70)
==19937==    by 0x48E07F7: (within /lib/tls/libc.so.6)
==19937==    by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106)
==19937==    by 0xC9E20: tdb_traverse (tdb.c:1403)
==19937==    by 0x3F082: initiate_wins_processing (nmbd_winsserver.c:2194)
==19937==    by 0x1E7C7: process (nmbd.c:512)
==19937==    by 0x1F32B: main (nmbd.c:801)
==19937==
==19937== LEAK SUMMARY:
==19937==    definitely lost: 642 bytes in 2 blocks.
==19937==      possibly lost: 0 bytes in 0 blocks.
==19937==    still reachable: 547,963 bytes in 338 blocks.
==19937==         suppressed: 0 bytes in 0 blocks.
Comment 17 Daniel Beschorner (dead mail address) 2006-06-28 18:30:40 UTC
Just for completeness here again with -g
Thank you, I'll test the patch and let you know.

Daniel
Comment 18 Jeremy Allison 2006-06-28 18:54:23 UTC
Thanks - that confirms the two changes I made (zeroing out the malloc'ed structure and removing the double-free) - they would cause the valgrind errors you posted.
Jeremy.
Comment 19 Daniel Beschorner (dead mail address) 2006-06-29 04:19:32 UTC
After 10h hours of runtime I got no crash, so I think the bug is resolved.

We use RC3 on 4 production servers for 2 days now, runs very smoothly, no oplock breaks till now, smbstatus orphans (incl. byte range locking) gone away, smells like 3.0.23 is going to become a good release, crossing fingers.

Daniel