3876 – nmbd core dump with RC3

Bug 3876 - nmbd core dump with RC3

Summary: nmbd core dump with RC3

Status:	CLOSED FIXED

Alias:	None

Product:	Samba 3.0
Classification:	Unclassified
Component:	nmbd (show other bugs)
Version:	3.0.23
Hardware:	x86 Linux

Importance:	P3 normal
Target Milestone:	none
Assignee:	Samba Bugzilla Account
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-06-28 03:05 UTC by Daniel Beschorner (dead mail address)
Modified:	2006-06-29 04:19 UTC (History)
CC List:	0 users

See Also:

Attachments
nmbd log (13.31 KB, text/plain) 2006-06-28 03:05 UTC, Daniel Beschorner (dead mail address)	no flags	Details
core dump (59.25 KB, application/x-gzip) 2006-06-28 03:05 UTC, Daniel Beschorner (dead mail address)	no flags	Details
Patch. (549 bytes, patch) 2006-06-28 15:37 UTC, Jeremy Allison	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Daniel Beschorner (dead mail address) 2006-06-28 03:05:05 UTC

nmbd core dumps during runtime, there are beside nmbd.tdb/dat some zero-sized files (f.e wins.24436) left in var/locks

Comment 1 Daniel Beschorner (dead mail address) 2006-06-28 03:05:30 UTC

Created attachment 1991 [details]
nmbd log

Comment 2 Daniel Beschorner (dead mail address) 2006-06-28 03:05:50 UTC

Created attachment 1992 [details]
core dump

Comment 3 Daniel Beschorner (dead mail address) 2006-06-28 04:56:39 UTC

happend again, WINS is dead till restart of nmbd

[2006/06/28 11:34:43, 0] lib/util.c:log_stack_trace(1699)
  BACKTRACE: 12 stack frames:
   #0 /server/samba/sbin/nmbd(log_stack_trace+0x2e) [0x8009900e]
   #1 /server/samba/sbin/nmbd(smb_panic+0x57) [0x80099147]
   #2 /server/samba/sbin/nmbd [0x80084dff]
   #3 [0xffffe420]
   #4 /lib/tls/libc.so.6(malloc+0x94) [0x40115364]
   #5 /server/samba/sbin/nmbd [0x80034ef3]
   #6 /server/samba/sbin/nmbd [0x8003921c]
   #7 /server/samba/sbin/nmbd(tdb_traverse+0x10d) [0x800aedcd]
   #8 /server/samba/sbin/nmbd(initiate_wins_processing+0x8d) [0x80039bcd]
   #9 /server/samba/sbin/nmbd(main+0x57c) [0x8001e82c]
   #10 /lib/tls/libc.so.6(__libc_start_main+0xe0) [0x400c64b0]
   #11 /server/samba/sbin/nmbd [0x8001cf32]

Comment 4 Jeremy Allison 2006-06-28 10:26:56 UTC

This looks like a core dump in the malloc libraries. Can you do a couple of things :

1). Give me the output of ldd run against the nmbd binary (the easy part).
2). Run under valgrind - this will definately find the problem but might take too long.

In addition I need to see your smb.conf and also the configure line you used (or which rpm you downloaded).

Jeremy.

Comment 5 Daniel Beschorner (dead mail address) 2006-06-28 10:39:38 UTC

linux-gate.so.1 =>  (0xffffe000)
libcrypt.so.1 => /lib/libcrypt.so.1 (0x4001e000)
libresolv.so.2 => /lib/libresolv.so.2 (0x4004f000)
libnsl.so.1 => /lib/libnsl.so.1 (0x40061000)
libdl.so.2 => /lib/libdl.so.2 (0x40076000)
libldap-2.2.so.7 => /server/openldap/lib/libldap-2.2.so.7 (0x40079000)
liblber-2.2.so.7 => /server/openldap/lib/liblber-2.2.so.7 (0x400a4000)
libc.so.6 => /lib/tls/libc.so.6 (0x400b1000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

exactly the same ouput as the nmbd of 3.0.22 (running fine)

Comment 6 Daniel Beschorner (dead mail address) 2006-06-28 10:41:46 UTC

another crash with level 10 enabled:

[2006/06/28 17:23:26, 4] nmbd/nmbd_packets.c:send_mailslot(1921)
  send_mailslot: Sending to mailslot \MAILSLOT\NET\GETDC501 from SERVER<00> IP 192.168.100.100 to GENERAL<00> IP 192.168.100.
  send_mailslot: Sending to mailslot \MAILSLOT\NET\GETDC501 from SERVER<00> IP 192.168.100.100 to GENERAL<00> IP 192.168.100.
===============================================================
[2006/06/28 17:23:26, 0] lib/fault.c:fault_report(42)
  INTERNAL ERROR: Signal 11 in pid 7601 (3.0.23rc3)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2006/06/28 17:23:26, 0] lib/fault.c:fault_report(44)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2006/06/28 17:23:26, 0] lib/fault.c:fault_report(45)
  ===============================================================
[2006/06/28 17:23:26, 0] lib/util.c:smb_panic(1592)
  PANIC (pid 7601): internal error
[2006/06/28 17:23:26, 0] lib/util.c:log_stack_trace(1699)
  BACKTRACE: 15 stack frames:
   #0 /server/samba/sbin/nmbd(log_stack_trace+0x2e) [0x8009900e]
   #1 /server/samba/sbin/nmbd(smb_panic+0x57) [0x80099147]
   #2 /server/samba/sbin/nmbd [0x80084dff]
   #3 [0xffffe420]
   #4 /lib/tls/libc.so.6(__libc_free+0x8b) [0x4011361b]
   #5 /server/samba/sbin/nmbd(x_vfprintf+0x6b) [0x8008d06b]
   #6 /server/samba/sbin/nmbd(Debug1+0x119) [0x80084279]
   #7 /server/samba/sbin/nmbd [0x800842d9]
   #8 /server/samba/sbin/nmbd(dbgtext+0xd1) [0x80083cb1]
   #9 /server/samba/sbin/nmbd(send_mailslot+0x3fd) [0x800301ad]
   #10 /server/samba/sbin/nmbd(process_logon_packet+0x6db) [0x8003092b]
   #11 /server/samba/sbin/nmbd(run_packet_queue+0xf7d) [0x8002ed0d]
   #12 /server/samba/sbin/nmbd(main+0x4f9) [0x8001e7a9]
   #13 /lib/tls/libc.so.6(__libc_start_main+0xe0) [0x400c64b0]
   #14 /server/samba/sbin/nmbd [0x8001cf32]

Comment 7 Daniel Beschorner (dead mail address) 2006-06-28 10:44:02 UTC

configuration line

CFLAGS="-O2 -march=pentium3" CPPFLAGS=-I/server/openldap/include LDFLAGS=-L/server/openldap/lib ./configure --prefix=/server/samba --with-mandir=/usr/local/man --without-winbind --without-libsmbclient --without-libsmbsharemodes --without-libmsrpc --with-acl-support

Comment 8 Daniel Beschorner (dead mail address) 2006-06-28 10:49:33 UTC

Global part of smb.conf

[global]
   workgroup = I-BN
   security = user
   encrypt passwords = yes
   wins support = yes
   domain logons = yes
   logon script = user.cmd
   logon path = \\server\Profiles\%U
   logon home =
   logon drive =
   domain master = yes
   time server = yes
   passdb backend = ldapsam
   ldap...

Comment 9 Jeremy Allison 2006-06-28 12:05:52 UTC

Ok, if you're building this yourself can you please build with -g, not with -O and then add the following to the global section of your smb.conf :

panic action = "/bin/sleep 90000"

If nmbd then panics it will hang around waiting for the sleep to terminate. IOnce you spot a "sleep" process, find the parent (nmbd) and attach to it using gdb and then type :

gdb /usr/local/samba/sbin/nmbd
....startup stuff...
attach <pid>
bt

to get a backtrace. With -g this will give all the local variables so we can get a better idea of what's going wrong.

In addition, as this seems to be a malloc issue, you might try setting the following environment variable before running nmbd (with the panic action set).

export MALLOC_CHECK_=2
/usr/local/samba/sbin/nmbd

this will cause an immediate panic once memory corruption is detected (but isn't as accurate as valgrind).

Thanks a lot for testing this for us.

Jeremy

Comment 10 Jeremy Allison 2006-06-28 14:15:43 UTC

Can you check for me if your wins tdb is corrupted by running tdbdump over wins.tdb and see if it can read all the records correctly ?

You can ensure it's in a known good state by running tdbbackup over it, and replacing it qith the .bak file.

Jeremy.

Comment 11 Daniel Beschorner (dead mail address) 2006-06-28 14:16:56 UTC

first valgrind shoot without re-compilation

==26739==
==26739== Use of uninitialised value of size 4
==26739==    at 0x355B3: remove_name_from_wins_namelist (in /server/samba/sbin/nmbd)
==26739==    by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd)
==26739==    by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd)
==26739==    by 0x39BCC: initiate_wins_processing (in /server/samba/sbin/nmbd)
==26739==    by 0x1E82B: main (in /server/samba/sbin/nmbd)
==26739==
==26739== Invalid write of size 4
==26739==    at 0x355B3: remove_name_from_wins_namelist (in /server/samba/sbin/nmbd)
==26739==    by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd)
==26739==    by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd)
==26739==    by 0x39BCC: initiate_wins_processing (in /server/samba/sbin/nmbd)
==26739==    by 0x1E82B: main (in /server/samba/sbin/nmbd)
==26739==  Address 0x434B325B is not stack'd, malloc'd or (recently) free'd
==26739==
==26739== ERROR SUMMARY: 134 errors from 6 contexts (suppressed: 27 from 2)
==26739== malloc/free: in use at exit: 548,383 bytes in 338 blocks.
==26739== malloc/free: 274,477 allocs, 274,139 frees, 73,567,087 bytes allocated.
==26739== For counts of detected errors, rerun with: -v
==26739== searching for pointers to 338 not-freed blocks.
==26739== checked 475,936 bytes.
==26739==
==26739==
==26739== 1 bytes in 1 blocks are definitely lost in loss record 1 of 27
==26739==    at 0x481B6D6: malloc (vg_replace_malloc.c:149)
==26739==    by 0x49137E5: vasprintf (in /lib/tls/libc.so.6)
==26739==    by 0x8D02B: x_vfprintf (in /server/samba/sbin/nmbd)
==26739==    by 0x84278: Debug1 (in /server/samba/sbin/nmbd)
==26739==    by 0x842D8: bufr_print (in /server/samba/sbin/nmbd)
==26739==    by 0x84306: dbgflush (in /server/samba/sbin/nmbd)
==26739==    by 0x85272: dump_core (in /server/samba/sbin/nmbd)
==26739==    by 0x991F4: smb_panic (in /server/samba/sbin/nmbd)
==26739==    by 0x84DFE: sig_fault (in /server/samba/sbin/nmbd)
==26739==    by 0x48E07F7: (within /lib/tls/libc.so.6)
==26739==    by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd)
==26739==    by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd)
==26739==
==26739==
==26739== 559 bytes in 1 blocks are definitely lost in loss record 17 of 27
==26739==    at 0x481B6D6: malloc (vg_replace_malloc.c:149)
==26739==    by 0x4982AE3: backtrace_symbols (in /lib/tls/libc.so.6)
==26739==    by 0x9901B: log_stack_trace (in /server/samba/sbin/nmbd)
==26739==    by 0x99146: smb_panic (in /server/samba/sbin/nmbd)
==26739==    by 0x84DFE: sig_fault (in /server/samba/sbin/nmbd)
==26739==    by 0x48E07F7: (within /lib/tls/libc.so.6)
==26739==    by 0x392E4: wins_processing_traverse_fn (in /server/samba/sbin/nmbd)
==26739==    by 0xAEDCC: tdb_traverse (in /server/samba/sbin/nmbd)
==26739==    by 0x39BCC: initiate_wins_processing (in /server/samba/sbin/nmbd)
==26739==    by 0x1E82B: main (in /server/samba/sbin/nmbd)
==26739==
==26739== LEAK SUMMARY:
==26739==    definitely lost: 560 bytes in 2 blocks.
==26739==      possibly lost: 0 bytes in 0 blocks.
==26739==    still reachable: 547,823 bytes in 336 blocks.
==26739==         suppressed: 0 bytes in 0 blocks.
==26739== Reachable blocks (those to which a pointer was found) are not shown.
==26739== To see them, rerun with: --show-reachable=yes

Comment 12 Daniel Beschorner (dead mail address) 2006-06-28 14:21:19 UTC

I will check later, but I suppose tdb corruption too, because 3 other servers run stable.
If I delete wins.tdb, the same problem with the new created wins.tdb occures again.
So I supposed it has to be specific to the wins.dat, but the only obviously difference to other servers is a very long line with many IP addresses for an entry.

Comment 13 Jeremy Allison 2006-06-28 14:25:05 UTC

Dammit I'd *kill* for line number info on that valgrind trace....

Can you recomile with -g (and no -O) and reproduce it ?

Jermey.

Comment 14 Jeremy Allison 2006-06-28 15:07:27 UTC

Got it ! Found the bug.....
Expect a fix shortly !

Thanks for this....

Jeremy

Comment 15 Jeremy Allison 2006-06-28 15:37:54 UTC

Created attachment 1996 [details]
Patch.

This should fix the double-free and uninitialized access.
Please test and let me know !
Jeremy.

Comment 16 Daniel Beschorner (dead mail address) 2006-06-28 18:29:16 UTC

==19937== Use of uninitialised value of size 4
==19937==    at 0x3A0D4: remove_name_from_wins_namelist (nmbd_winsserver.c:296)
==19937==    by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106)
==19937==    by 0xC9E20: tdb_traverse (tdb.c:1403)
==19937==    by 0x3F082: initiate_wins_processing (nmbd_winsserver.c:2194)
==19937==    by 0x1E7C7: process (nmbd.c:512)
==19937==    by 0x1F32B: main (nmbd.c:801)
==19937==
==19937== Invalid write of size 4
==19937==    at 0x3A0D4: remove_name_from_wins_namelist (nmbd_winsserver.c:296)
==19937==    by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106)
==19937==    by 0xC9E20: tdb_traverse (tdb.c:1403)
==19937==    by 0x3F082: initiate_wins_processing (nmbd_winsserver.c:2194)
==19937==    by 0x1E7C7: process (nmbd.c:512)
==19937==    by 0x1F32B: main (nmbd.c:801)
==19937==  Address 0x450048 is not stack'd, malloc'd or (recently) free'd
==19937==
==19937== ERROR SUMMARY: 139 errors from 6 contexts (suppressed: 27 from 2)
==19937== malloc/free: in use at exit: 548,605 bytes in 340 blocks.
==19937== malloc/free: 260,191 allocs, 259,851 frees, 69,696,131 bytes allocated.
==19937== For counts of detected errors, rerun with: -v
==19937== searching for pointers to 340 not-freed blocks.
==19937== checked 471,912 bytes.
==19937==
==19937==
==19937== 1 bytes in 1 blocks are definitely lost in loss record 1 of 27
==19937==    at 0x481B6D6: malloc (vg_replace_malloc.c:149)
==19937==    by 0x49137E5: vasprintf (in /lib/tls/libc.so.6)
==19937==    by 0xA252D: x_vfprintf (xfile.c:226)
==19937==    by 0x987AA: Debug1 (debug.c:825)
==19937==    by 0x98817: bufr_print (debug.c:846)
==19937==    by 0x98937: dbgflush (debug.c:908)
==19937==    by 0x992F1: dump_core (fault.c:177)
==19937==    by 0xAFF31: smb_panic (util.c:1611)
==19937==    by 0x98D81: fault_report (fault.c:47)
==19937==    by 0x98D92: sig_fault (fault.c:70)
==19937==    by 0x48E07F7: (within /lib/tls/libc.so.6)
==19937==    by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106)
==19937==
==19937==
==19937== 641 bytes in 1 blocks are definitely lost in loss record 18 of 27
==19937==    at 0x481B6D6: malloc (vg_replace_malloc.c:149)
==19937==    by 0x4982AE3: backtrace_symbols (in /lib/tls/libc.so.6)
==19937==    by 0xAFF7A: log_stack_trace (util.c:1696)
==19937==    by 0xAFD8C: smb_panic (util.c:1593)
==19937==    by 0x98D81: fault_report (fault.c:47)
==19937==    by 0x98D92: sig_fault (fault.c:70)
==19937==    by 0x48E07F7: (within /lib/tls/libc.so.6)
==19937==    by 0x3EB61: wins_processing_traverse_fn (nmbd_winsserver.c:2106)
==19937==    by 0xC9E20: tdb_traverse (tdb.c:1403)
==19937==    by 0x3F082: initiate_wins_processing (nmbd_winsserver.c:2194)
==19937==    by 0x1E7C7: process (nmbd.c:512)
==19937==    by 0x1F32B: main (nmbd.c:801)
==19937==
==19937== LEAK SUMMARY:
==19937==    definitely lost: 642 bytes in 2 blocks.
==19937==      possibly lost: 0 bytes in 0 blocks.
==19937==    still reachable: 547,963 bytes in 338 blocks.
==19937==         suppressed: 0 bytes in 0 blocks.

Comment 17 Daniel Beschorner (dead mail address) 2006-06-28 18:30:40 UTC

Just for completeness here again with -g
Thank you, I'll test the patch and let you know.

Daniel

Comment 18 Jeremy Allison 2006-06-28 18:54:23 UTC

Thanks - that confirms the two changes I made (zeroing out the malloc'ed structure and removing the double-free) - they would cause the valgrind errors you posted.
Jeremy.

Comment 19 Daniel Beschorner (dead mail address) 2006-06-29 04:19:32 UTC

After 10h hours of runtime I got no crash, so I think the bug is resolved.

We use RC3 on 4 production servers for 2 days now, runs very smoothly, no oplock breaks till now, smbstatus orphans (incl. byte range locking) gone away, smells like 3.0.23 is going to become a good release, crossing fingers.

Daniel