Bug 3617 - nmbd crashes with signal 11 if another server announces itself as a local master browser
nmbd crashes with signal 11 if another server announces itself as a local mas...
Status: RESOLVED WORKSFORME
Product: Samba 3.0
Classification: Unclassified
Component: nmbd
3.0.22
x86 Linux
: P3 major
: none
Assigned To: Samba Bugzilla Account
Samba QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2006-03-20 05:29 UTC by Sebastian Held
Modified: 2008-06-10 08:39 UTC (History)
1 user (show)

See Also:


Attachments
logfile (7.06 KB, text/plain)
2006-03-20 06:44 UTC, Sebastian Held
no flags Details
nmbd errorlog (5.50 KB, text/plain)
2006-04-05 01:59 UTC, Sebastian Held
no flags Details
nmbd errorlog (6.07 KB, text/plain)
2006-06-20 00:57 UTC, Sebastian Held
no flags Details
useless backtrace (2.95 KB, text/plain)
2007-09-04 04:08 UTC, Sebastian Held
no flags Details
nmbd errorlog (2.80 KB, text/plain)
2007-09-04 04:10 UTC, Sebastian Held
no flags Details
useless backtrace (core file) (3.59 KB, text/plain)
2007-09-04 04:15 UTC, Sebastian Held
no flags Details
log.nmbd (3.13 KB, text/plain)
2007-12-04 04:06 UTC, Sebastian Held
no flags Details
valgrind logfile (212.68 KB, application/octet-stream)
2007-12-21 05:24 UTC, Sebastian Held
no flags Details
proposed patch (921 bytes, patch)
2007-12-21 06:34 UTC, Volker Lendecke
no flags Details
Belt and braces (929 bytes, patch)
2007-12-21 13:34 UTC, Jeremy Allison
no flags Details
valgrind report with two patches applied (52.30 KB, application/octet-stream)
2008-01-02 07:51 UTC, Sebastian Held
no flags Details
valgrind report and backtrace (6.84 KB, text/plain)
2008-02-29 06:16 UTC, Sebastian Held
no flags Details
Patch (1.83 KB, patch)
2008-02-29 07:47 UTC, Jeremy Allison
no flags Details
Proper patch. (1.81 KB, patch)
2008-02-29 07:50 UTC, Jeremy Allison
no flags Details
Patch for the invalid 4 byte reads. (864 bytes, patch)
2008-02-29 08:52 UTC, Jeremy Allison
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Sebastian Held 2006-03-20 05:29:54 UTC
WINS service stops working, because nmbd gets signal 11 segment violation.
This seems to be related to another server announces itself as the local master browser. See attached log.
How can help with this issue? The backtrace seems not very informative... (but I've installed samba3-debuginfo-3.0.21c-26.rpm)
Comment 1 Sebastian Held 2006-03-20 06:44:02 UTC
Created attachment 1812 [details]
logfile
Comment 2 Sebastian Held 2006-04-05 01:57:01 UTC
same with 3.0.22
Comment 3 Sebastian Held 2006-04-05 01:59:55 UTC
Created attachment 1844 [details]
nmbd errorlog
Comment 4 Mark Condic 2006-05-24 17:44:04 UTC
Using 3.0.22 (13.1 & 14.1) with two masters running, samba stops responding.  There are no error entry's in any logs.
Comment 5 Jeremy Allison 2006-05-26 10:55:49 UTC
Is there a chance you can try 3.0.23RC1 to see if this fixes the problem ? There have been some changes in this area.
Jeremy.
Comment 6 Jeremy Allison 2006-05-26 11:01:01 UTC
The other option is to add the line :

panic action = /bin/sleep 9999999

to the [global] section of the smb.conf. When nmbd crashes it will then stay around waiting for the child process sleep to finish, allowing you to attach to it with gdb and get a decent backtrace.

Thanks,

Jeremy.
Comment 7 Sebastian Held 2006-06-02 04:38:58 UTC
(In reply to comment #6)
> panic action = /bin/sleep 9999999
I did that, but since my last posting I'm not able to reproduce the failure (currently running 3.0.22)
Comment 8 Sebastian Held 2006-06-20 00:53:36 UTC
backtrace using samba3-3.0.22-26 (SerNet):
(gdb) bt
#0  0xb7d8cbfe in __waitpid_nocancel () from /lib/tls/libc.so.6
#1  0xb7d3983a in do_system () from /lib/tls/libc.so.6
#2  0xb7e26ded in system () from /lib/tls/libpthread.so.0
#3  0x080db9f0 in smb_panic2 (why=0xfffffe00 <Address 0xfffffe00 out of bounds>, decrement_pid_count=1) at util.c:1545
#4  0x080db965 in smb_panic (why=0xfffffe00 <Address 0xfffffe00 out of bounds>) at util.c:1506
#5  0x080c8afb in fault_report (sig=-512) at fault.c:42
#6  0x080c8b70 in sig_fault (sig=-512) at fault.c:65
#7  <signal handler called>
#8  0x0807e06f in remove_response_record (subrec=0x81f1b08, rrec=0x8187cd8) at nmbd_responserecordsdb.c:84
#9  0x0807bd59 in retransmit_or_expire_response_records (t=1150738393) at nmbd_packets.c:1637
#10 0x0806e1f0 in unbecome_local_master_browser (subrec=0x81f1b08, work=0x81f3e18, force_new_election=0) at nmbd_become_lmb.c:315
#11 0x0806e5c2 in become_local_master_fail2 (subrec=0x81f1b08, rrec=0x81b8218, fail_name=0x81f4f14) at nmbd_become_lmb.c:429
#12 0x08077294 in register_name_response (subrec=0x81f1b08, rrec=0x81b8218, p=0x81f4860) at nmbd_nameregister.c:157
#13 0x0807bb74 in process_nmb_response (p=0x81f4860) at nmbd_packets.c:1553
#14 0x0807bbca in run_packet_queue () at nmbd_packets.c:1573
#15 0x0806c343 in process () at nmbd.c:431
#16 0x0806cd09 in main (argc=0, argv=0xbffffd84) at nmbd.c:834
(gdb) 

The nmbd log will be attached. Now that I'm sure the error is not due to hardware changes, I will give 3.0.23RC a try.
Comment 9 Sebastian Held 2006-06-20 00:57:53 UTC
Created attachment 1969 [details]
nmbd errorlog
Comment 10 Jeremy Allison 2006-06-20 01:17:26 UTC
Why is nmbd linked with pthread ? What linux distro are you using.
Please add the output from ldd /usr/sbin/nmbd so I can see if this is linked with pthread.so.
Jeremy.
Comment 11 Sebastian Held 2006-06-20 01:24:36 UTC
rom:/var/log/samba # ldd /usr/sbin/nmbd
        linux-gate.so.1 =>  (0xffffe000)
        libcrypt.so.1 => /lib/libcrypt.so.1 (0x40030000)
        libresolv.so.2 => /lib/libresolv.so.2 (0x40062000)
        libnsl.so.1 => /lib/libnsl.so.1 (0x40076000)
        libdl.so.2 => /lib/libdl.so.2 (0x4008c000)
        librt.so.1 => /lib/tls/librt.so.1 (0x40090000)
        libpopt.so.0 => /usr/lib/libpopt.so.0 (0x40098000)
        libcom_err.so.2 => /lib/libcom_err.so.2 (0x400a0000)
        libcrypto.so.0.9.7 => /usr/lib/libcrypto.so.0.9.7 (0x400a3000)
        libldap-2.2.so.7 => /usr/lib/libldap-2.2.so.7 (0x40195000)
        liblber-2.2.so.7 => /usr/lib/liblber-2.2.so.7 (0x401c8000)
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0x401d5000)
        libc.so.6 => /lib/tls/libc.so.6 (0x401e6000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
        libsasl2.so.2 => /usr/lib/libsasl2.so.2 (0x402fc000)
        libssl.so.0.9.7 => /usr/lib/libssl.so.0.9.7 (0x40312000)

rom:/var/log/samba # cat /etc/SuSE-release
SuSE Linux 9.2 (i586)
VERSION = 9.2

I'm using SerNet packages, but this one is recompiled on this computer. But recompilation was started only due to the above mentioned bug...
Comment 12 Jeremy Allison 2006-06-20 10:48:07 UTC
This :

libpthread.so.0 => /lib/tls/libpthread.so.0 (0x401d5000)

is a problem I think. It shouldn't be there. I wonder what is pulling it in ?

Jeremy.
Comment 13 Sebastian Held 2006-06-21 00:44:09 UTC
why is this a problem?

other libs depend on pthread, too:
rom:~ # ldd /lib/tls/librt.so.1
  [...]
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0x40146000)
  [...]
rom:~ #  

Is there anything I can do, to shed some light on why it links to pthreads?
Comment 14 Jeremy Allison 2006-06-21 10:46:23 UTC
How did you do the ./configure ? What options ?
Jeremy.
Comment 15 Sebastian Held 2006-06-22 01:17:42 UTC
using samba3-3.0.22-26.src.rpm from SerNet:

./configure --enable-cups --libdir=/usr/lib/samba --localstatedir=/var/lib/samba --mandir=/usr/share/man --prefix=/usr --sbindir=/usr/sbin --sysconfdir=/etc/samba --with-acl-
support --with-aio-support --with-automount --with-configdir=/etc/samba --with-lockdir=/var/lib/samba --with-logfilebase=/var/log/samba --with-libdir=/usr/lib/samba --with-msdfs
--with-pam --with-pam_smbpass --with-piddir=/var/run/samba --with-privatedir=/etc/samba --with-profiling-data --with-quotas --with-smbmount --with-swatdir=/usr/share/samba/swat -
-with-syslog --with-tdbsam --with-utmp --with-vfs --with-winbind --with-ads --with-krb5 --with-smbwrapper --with-shared-modules=idmap_rid,idmap_ad,vfs_catia
Comment 16 Sebastian Held 2006-10-18 04:44:31 UTC
It happened again: samba3-3.0.23b-30 (SerNet)

nmbd.log:
[2006/10/18 11:29:04, 0] nmbd/nmbd_incomingdgrams.c:process_local_master_announce(309)
  process_local_master_announce: Server ZIYAL at IP 192.168.90.53 is announcing itself as a local master browser for workgroup HFT and we think we are master. Forcing election.
[2006/10/18 11:29:04, 0] nmbd/nmbd_become_lmb.c:unbecome_local_master_success(149)
  *****

  Samba name server ROM has stopped being a local master browser for workgroup HFT on subnet 192.168.90.52

  *****
[2006/10/18 11:29:15, 0] nmbd/nmbd_nameregister.c:register_name_response(130)
  register_name_response: server at IP 192.168.90.53 rejected our name registration of HFT<1d> IP 192.168.90.52 with error code 6.
[2006/10/18 11:29:15, 0] nmbd/nmbd_become_lmb.c:become_local_master_fail2(417)
  become_local_master_fail2: failed to register name HFT<1d> on subnet 192.168.90.52. Failed to become a local master browser.
[2006/10/18 11:29:15, 0] nmbd/nmbd_become_lmb.c:become_local_master_fail2(417)
  become_local_master_fail2: failed to register name HFT<1d> on subnet 192.168.90.52. Failed to become a local master browser.
[2006/10/18 11:29:15, 0] nmbd/nmbd_namelistdb.c:standard_success_release(419)
  standard_success_release: Name release for name __MSBROWSE__<01> IP 192.168.90.52 on subnet 192.168.90.52. Name was not found on subnet.
[2006/10/18 11:29:15, 0] nmbd/nmbd_namelistdb.c:standard_fail_register(305)
  standard_fail_register: Failed to register/refresh name HFT<1d> on subnet 192.168.90.52
[2006/10/18 11:29:15, 0] lib/fault.c:fault_report(41)
  ===============================================================
[2006/10/18 11:29:15, 0] lib/fault.c:fault_report(42)
  INTERNAL ERROR: Signal 11 in pid 5915 (3.0.23b-SerNet-SuSE)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2006/10/18 11:29:15, 0] lib/fault.c:fault_report(44)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2006/10/18 11:29:15, 0] lib/fault.c:fault_report(45)
  ===============================================================
[2006/10/18 11:29:15, 0] lib/util.c:smb_panic(1592)
  PANIC (pid 5915): internal error
[2006/10/18 11:29:15, 0] lib/util.c:log_stack_trace(1699)
  BACKTRACE: 15 stack frames:
   #0 /usr/sbin/nmbd(log_stack_trace+0x22) [0x80dd1eb]
   #1 /usr/sbin/nmbd(smb_panic+0x6f) [0x80dd08a]
   #2 /usr/sbin/nmbd [0x80cb5bf]
   #3 /usr/sbin/nmbd [0x80cb5d0]
   #4 [0xffffe420]
   #5 /usr/sbin/nmbd(retransmit_or_expire_response_records+0x163) [0x807bf29]
   #6 /usr/sbin/nmbd(unbecome_local_master_browser+0x181) [0x806e6e0]
   #7 /usr/sbin/nmbd [0x806eab2]
   #8 /usr/sbin/nmbd [0x8077454]
   #9 /usr/sbin/nmbd [0x807bd44]
   #10 /usr/sbin/nmbd(run_packet_queue+0x4c) [0x807bd9a]
   #11 /usr/sbin/nmbd [0x806c767]
   #12 /usr/sbin/nmbd(main+0x779) [0x806d1a4]
   #13 /lib/tls/libc.so.6(__libc_start_main+0xe4) [0xb7d22b14]
   #14 /usr/sbin/nmbd [0x806b571]
[2006/10/18 11:29:15, 0] lib/util.c:smb_panic(1600)
  smb_panic(): calling panic action [/bin/sleep 9999999]

(gdb) bt
#0  0xb7d94bfe in __waitpid_nocancel () from /lib/tls/libc.so.6
#1  0xb7d4183a in do_system () from /lib/tls/libc.so.6
#2  0xb7e2eded in system () from /lib/tls/libpthread.so.0
#3  0x080dd102 in smb_panic (why=0x0) at util.c:1601
#4  0x080cb5bf in fault_report (sig=-512) at fault.c:47
#5  0x080cb5d0 in sig_fault (sig=-512) at fault.c:70
#6  <signal handler called>
#7  0x0807e24f in remove_response_record (subrec=0x81f5bd8, rrec=0x818bec0) at nmbd_responserecordsdb.c:82
#8  0x0807bf29 in retransmit_or_expire_response_records (t=1161163755) at nmbd_packets.c:1639
#9  0x0806e6e0 in unbecome_local_master_browser (subrec=0x81f5bd8, work=0x81f6360, force_new_election=0) at nmbd_become_lmb.c:315
#10 0x0806eab2 in become_local_master_fail2 (subrec=0x81f5bd8, rrec=0x81bc2e8, fail_name=0x81f677c) at nmbd_become_lmb.c:429
#11 0x08077454 in register_name_response (subrec=0x81f5bd8, rrec=0x81bc2e8, p=0x81fc280) at nmbd_nameregister.c:157
#12 0x0807bd44 in process_nmb_response (p=0x81fc280) at nmbd_packets.c:1555
#13 0x0807bd9a in run_packet_queue () at nmbd_packets.c:1575
#14 0x0806c767 in process () at nmbd.c:396
#15 0x0806d1a4 in main (argc=0, argv=0xbffffd84) at nmbd.c:806
Comment 17 Jeremy Allison 2006-10-18 17:35:51 UTC
Great debug backtrace thanks ! It looks like it's occurring on this line :

        /* Ensure we can delete. */
        rrec->packet->locked = False;

Is it possible for you to use gdb to print out the contents of rrec here ? I don't think rrec->packet should ever be zero, but I'd like to check.

Thanks,

Jeremy.
Comment 18 Sebastian Held 2006-10-19 04:07:06 UTC
Now using 3.0.23c (SerNet).

(gdb) bt
#0  0x40264bfe in __waitpid_nocancel () from /lib/tls/libc.so.6
#1  0x4021183a in do_system () from /lib/tls/libc.so.6
#2  0x401d7ded in system () from /lib/tls/libpthread.so.0
#3  0x080dd212 in smb_panic (why=0x0) at util.c:1601
#4  0x080cb6bf in fault_report (sig=-512) at fault.c:47
#5  0x080cb6d0 in sig_fault (sig=-512) at fault.c:70
#6  <signal handler called>
#7  0x0807e29f in remove_response_record (subrec=0x81d0998, rrec=0x819efc8) at nmbd_responserecordsdb.c:82
#8  0x0807bf79 in retransmit_or_expire_response_records (t=1161247756) at nmbd_packets.c:1639
#9  0x0806e6f0 in unbecome_local_master_browser (subrec=0x81d0998, work=0x81d1120, force_new_election=0) at nmbd_become_lmb.c:315
#10 0x0806eac2 in become_local_master_fail2 (subrec=0x81d0998, rrec=0x81bf5c8, fail_name=0x81d1864) at nmbd_become_lmb.c:429
#11 0x080774a4 in register_name_response (subrec=0x81d0998, rrec=0x81bf5c8, p=0x81d4988) at nmbd_nameregister.c:157
#12 0x0807bd94 in process_nmb_response (p=0x81d4988) at nmbd_packets.c:1555
#13 0x0807bdea in run_packet_queue () at nmbd_packets.c:1575
#14 0x0806c777 in process () at nmbd.c:396
#15 0x0806d1b4 in main (argc=0, argv=0xbffff654) at nmbd.c:806

(gdb) up
[...]
gdb) print rrec
$1 = (struct response_record *) 0x819efc8
(gdb) print rrec->packet
$2 = (struct packet_struct *) 0x0

(gdb) p rrec->next
$3 = (struct response_record *) 0x819f088
(gdb) p rrec->prev
$4 = (struct response_record *) 0x819f010
(gdb) p rrec->response_id
$5 = 0
(gdb) p rrec->resp_fn
$6 = 0
(gdb) p rrec->timeout_fn
$7 = 0
(gdb) p rrec->success_fn
$8 = 0
(gdb) p rrec->fail_fn
$9 = 0
(gdb) p rrec->userdata
$10 = (struct userdata_struct *) 0x0
(gdb) p rrec->num_msgs
$11 = 0
(gdb) p rrec->repeat_time
$12 = 0
(gdb) p rrec->repeat_interval
$13 = 0
(gdb) p rrec->repeat_count
$14 = 0
(gdb) p rrec->in_expiration_processing
$15 = 1
Comment 19 Jeremy Allison 2006-10-19 05:26:21 UTC
Ok, that helps a *lot*. Should have a fix shortly, thanks !
Comment 20 Sebastian Held 2006-11-20 04:27:26 UTC
Now using 3.0.23d (SerNet).

(gdb) bt
#0  0x40264bfe in __waitpid_nocancel () from /lib/tls/libc.so.6
#1  0x4021183a in do_system () from /lib/tls/libc.so.6
#2  0x401d7ded in system () from /lib/tls/libpthread.so.0
#3  0x080dd412 in smb_panic (why=0x0) at util.c:1608
#4  0x080cb80f in fault_report (sig=-512) at fault.c:47
#5  0x080cb820 in sig_fault (sig=-512) at fault.c:70
#6  <signal handler called>
#7  0x0807e30f in remove_response_record (subrec=0x81d1998, rrec=0x81a0090) at nmbd_responserecordsdb.c:82
#8  0x0807bfe9 in retransmit_or_expire_response_records (t=1164017649) at nmbd_packets.c:1639
#9  0x0806e760 in unbecome_local_master_browser (subrec=0x81d1998, work=0x81d2120, force_new_election=0) at nmbd_become_lmb.c:315
#10 0x0806eb32 in become_local_master_fail2 (subrec=0x81d1998, rrec=0x819fe88, fail_name=0x81d6994) at nmbd_become_lmb.c:429
#11 0x08077514 in register_name_response (subrec=0x81d1998, rrec=0x819fe88, p=0x81d72d8) at nmbd_nameregister.c:157
#12 0x0807be04 in process_nmb_response (p=0x81d72d8) at nmbd_packets.c:1555
#13 0x0807be5a in run_packet_queue () at nmbd_packets.c:1575
#14 0x0806c7e7 in process () at nmbd.c:396
#15 0x0806d224 in main (argc=0, argv=0xbffff654) at nmbd.c:806

=== log excerpt ===
[2006/11/20 11:14:09, 0] nmbd/nmbd_namelistdb.c:standard_success_release(419)
  standard_success_release: Name release for name __MSBROWSE__<01> IP 192.168.90.52 on subnet 192.168.90.52. Name was not found on subnet.
[2006/11/20 11:14:09, 0] nmbd/nmbd_namelistdb.c:standard_fail_register(305)
  standard_fail_register: Failed to register/refresh name HFT<1d> on subnet 192.168.90.52
[2006/11/20 11:14:09, 0] lib/fault.c:fault_report(41)
  ===============================================================
[2006/11/20 11:14:09, 0] lib/fault.c:fault_report(42)
  INTERNAL ERROR: Signal 11 in pid 29867 (3.0.23d-SerNet-SuSE)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2006/11/20 11:14:09, 0] lib/fault.c:fault_report(44)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2006/11/20 11:14:09, 0] lib/fault.c:fault_report(45)
  ===============================================================
[2006/11/20 11:14:09, 0] lib/util.c:smb_panic(1599)
  PANIC (pid 29867): internal error
Comment 21 Volker Lendecke 2007-01-12 05:20:10 UTC
Jeremy, I tried to reproduce this with current code but I could not. You said you would have a fix quickly, but I could not find any reference. If it is fixed, can you close this bug or tell me more about the way to reproduce it?

Thanks,

Volker
Comment 22 Sebastian Held 2007-09-04 04:06:27 UTC
It happened again :(
Now running openSUSE 10.2 (i586) and samba3-3.0.25c-33 from SerNet. I'm pretty sure that this damn workstation ziyal is somewhat misconfigured, but that must not take nmbd down, right?

I will attach the error log from nmbd as well as the (useless) backtrace.

kind regards,
Sebastian
Comment 23 Sebastian Held 2007-09-04 04:08:29 UTC
Created attachment 2915 [details]
useless backtrace
Comment 24 Sebastian Held 2007-09-04 04:10:29 UTC
Created attachment 2916 [details]
nmbd errorlog
Comment 25 Sebastian Held 2007-09-04 04:15:36 UTC
Created attachment 2917 [details]
useless backtrace (core file)
Comment 26 Sebastian Held 2007-12-04 04:05:18 UTC
samba3-3.0.27a-35 on openSUSE 10.2:

(gdb) bt
#0  0xb7f768b2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0xb7ccb060 in raise () from /lib/libc.so.6
#2  0xb7ccc801 in abort () from /lib/libc.so.6
#3  0x8008d9d9 in dump_core () at lib/fault.c:192
#4  0x800a1bdd in smb_panic (why=0x800e5eb8 "internal error") at lib/util.c:1649
#5  0x8008de59 in sig_fault (sig=11) at lib/fault.c:47
#6  <signal handler called>
#7  0x800340bb in remove_response_record (subrec=0x80161d08, rrec=0x801201a8) at nmbd/nmbd_responserecordsdb.c:82
#8  0x8002e41d in retransmit_or_expire_response_records (t=1196760635) at nmbd/nmbd_packets.c:1645
#9  0x80023313 in unbecome_local_master_browser (subrec=0x80161d08, work=0x80162108, force_new_election=0) at nmbd/nmbd_become_lmb.c:315
#10 0x8002342a in become_local_master_fail2 (subrec=0x80161d08, rrec=0x80120018, fail_name=0x80166c4c) at nmbd/nmbd_become_lmb.c:429
#11 0x8002ce1d in register_name_response (subrec=0x80161d08, rrec=0x80120018, p=0x801610c8) at nmbd/nmbd_nameregister.c:157
#12 0x8002f8c9 in run_packet_queue () at nmbd/nmbd_packets.c:1561
#13 0x80021935 in main (argc=0, argv=0xbfbb6534) at nmbd/nmbd.c:394


inspecting stackframe #7:
(gdb) frame 7
#7  0x800340bb in remove_response_record (subrec=0x80161d08, rrec=0x801201a8) at nmbd/nmbd_responserecordsdb.c:82
82      in nmbd/nmbd_responserecordsdb.c
(gdb) p *rrec
$3 = {next = 0x80128300, prev = 0x80158c68, response_id = 0, resp_fn = 0, timeout_fn = 0, success_fn = 0, fail_fn = 0, packet = 0x0, userdata = 0x0, num_msgs = 0,
  repeat_time = 0, repeat_interval = 0, repeat_count = 0, in_expiration_processing = 1}
p *subrec
$4 = {next = 0x0, prev = 0x0, subnet_name = 0x80161d48 "134.91.90.52", type = NORMAL_SUBNET, workgrouplist = 0x80162108, namelist = 0x80165da8, responselist = 0x0,
  namelist_changed = 1, work_changed = 0, bcast_ip = {s_addr = 4284111750}, mask_ip = {s_addr = 16777215}, myip = {s_addr = 878336902}, nmb_sock = 13,
  dgram_sock = 14}
Comment 27 Sebastian Held 2007-12-04 04:06:34 UTC
Created attachment 3021 [details]
log.nmbd
Comment 28 Sebastian Held 2007-12-04 04:13:09 UTC
Proposed patch:

Add
if (rrec->packet) {
before line 82 and add
}
after line 83.
Comment 29 Jeremy Allison 2007-12-04 12:06:23 UTC
Hmmm. How can rrec->packet == NULL ? How do we get to this state ?
I want to understand that before I add the patch. Thanks for all your work on this.
Jeremy.

Comment 30 Sebastian Held 2007-12-16 04:36:39 UTC
3.0.28-SerNet-SuSE:

(gdb) bt
#0  0xb7fd98b2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0xb7d905c3 in __waitpid_nocancel () from /lib/libc.so.6
#2  0xb7d3ad3b in do_system () from /lib/libc.so.6
#3  0x800a1c01 in smb_panic (why=0x800e5fb8 "internal error") at lib/util.c:1639
#4  0x8008df39 in sig_fault (sig=11) at lib/fault.c:47
#5  <signal handler called>
#6  0x8003412b in remove_response_record (subrec=0x80161ce0, rrec=0x80138560) at nmbd/nmbd_responserecordsdb.c:82
#7  0x8002e41d in retransmit_or_expire_response_records (t=1197717028) at nmbd/nmbd_packets.c:1645
#8  0x80023313 in unbecome_local_master_browser (subrec=0x80161ce0, work=0x80162108, force_new_election=0) at nmbd/nmbd_become_lmb.c:315
#9  0x8002342a in become_local_master_fail2 (subrec=0x80161ce0, rrec=0x80117c38, fail_name=0x80165fc4) at nmbd/nmbd_become_lmb.c:429
#10 0x8002ce1d in register_name_response (subrec=0x80161ce0, rrec=0x80117c38, p=0x80167f68) at nmbd/nmbd_nameregister.c:157
#11 0x8002f8c9 in run_packet_queue () at nmbd/nmbd_packets.c:1561
#12 0x80021935 in main (argc=0, argv=0xbfe7e7e4) at nmbd/nmbd.c:394

(gdb) frame 6
#6  0x8003412b in remove_response_record (subrec=0x80161ce0, rrec=0x80138560) at nmbd/nmbd_responserecordsdb.c:82

(gdb) p *subrec
$2 = {next = 0x0, prev = 0x0, subnet_name = 0x80161d20 "134.91.90.52", type = NORMAL_SUBNET, workgrouplist = 0x80162108, namelist = 0x80167e18,
  responselist = 0x0, namelist_changed = 1, work_changed = 0, bcast_ip = {s_addr = 4284111750}, mask_ip = {s_addr = 16777215}, myip = {s_addr = 878336902},
  nmb_sock = 11, dgram_sock = 12}

(gdb) p *rrec
$3 = {next = 0x80150b10, prev = 0x8011ffa8, response_id = 0, resp_fn = 0, timeout_fn = 0, success_fn = 0, fail_fn = 0, packet = 0x0, userdata = 0x0,
  num_msgs = 0, repeat_time = 0, repeat_interval = 0, repeat_count = 0, in_expiration_processing = 1}

(gdb) frame 8
#8  0x80023313 in unbecome_local_master_browser (subrec=0x80161ce0, work=0x80162108, force_new_election=0) at nmbd/nmbd_become_lmb.c:315

(gdb) p *subrec
$5 = {next = 0x0, prev = 0x0, subnet_name = 0x80161d20 "134.91.90.52", type = NORMAL_SUBNET, workgrouplist = 0x80162108, namelist = 0x80167e18,
  responselist = 0x0, namelist_changed = 1, work_changed = 0, bcast_ip = {s_addr = 4284111750}, mask_ip = {s_addr = 16777215}, myip = {s_addr = 878336902},
  nmb_sock = 11, dgram_sock = 12}

(gdb) p *work
$6 = {next = 0x0, prev = 0x0, subnet = 0x80161ce0, serverlist = 0x801645b0, mst_state = MST_UNBECOMING_MASTER, dom_state = DOMAIN_MST,
  log_state = LOGON_SRV, work_group = "HFT", '\0' <repeats 60 times>, token = 1, local_master_browser_name = "\000OM", '\0' <repeats 60 times>,
  lastannounce_time = 1197717015, announce_interval = 240, needannounce = 0, death_time = 0, RunningElection = 1, needelection = 0, ElectionCount = 3,
  ElectionCriterion = 1090588559, dmb_name = {name = '\0' <repeats 15 times>, scope = '\0' <repeats 63 times>, name_type = 0}, dmb_addr = {s_addr = 0}}

(gdb) frame 9
#9  0x8002342a in become_local_master_fail2 (subrec=0x80161ce0, rrec=0x80117c38, fail_name=0x80165fc4) at nmbd/nmbd_become_lmb.c:429

(gdb) p *subrec
$7 = {next = 0x0, prev = 0x0, subnet_name = 0x80161d20 "134.91.90.52", type = NORMAL_SUBNET, workgrouplist = 0x80162108, namelist = 0x80167e18,
  responselist = 0x0, namelist_changed = 1, work_changed = 0, bcast_ip = {s_addr = 4284111750}, mask_ip = {s_addr = 16777215}, myip = {s_addr = 878336902},
  nmb_sock = 11, dgram_sock = 12}

(gdb) p *rrec
$8 = {next = 0x8011ffa8, prev = 0x80164568, response_id = 0, resp_fn = 0, timeout_fn = 0, success_fn = 0, fail_fn = 0, packet = 0x0, userdata = 0x0,
  num_msgs = 0, repeat_time = 0, repeat_interval = 0, repeat_count = 0, in_expiration_processing = 0}

(gdb) p *fail_name
$9 = {name = "sr/sbin/nmbd(smb", scope = "_panic+0x78) [0x800a1b96]\000/usr/sbin/nmbd [0x8008df39]\000/lib/libc.", name_type = 909012851}

(gdb) frame 10
#10 0x8002ce1d in register_name_response (subrec=0x80161ce0, rrec=0x80117c38, p=0x80167f68) at nmbd/nmbd_nameregister.c:157

(gdb) p *subrec
$12 = {next = 0x0, prev = 0x0, subnet_name = 0x80161d20 "134.91.90.52", type = NORMAL_SUBNET, workgrouplist = 0x80162108, namelist = 0x80167e18,
  responselist = 0x0, namelist_changed = 1, work_changed = 0, bcast_ip = {s_addr = 4284111750}, mask_ip = {s_addr = 16777215}, myip = {s_addr = 878336902},
  nmb_sock = 11, dgram_sock = 12}

(gdb) p *rrec
$13 = {next = 0x8011ffa8, prev = 0x80164568, response_id = 0, resp_fn = 0, timeout_fn = 0, success_fn = 0, fail_fn = 0, packet = 0x0, userdata = 0x0,
  num_msgs = 0, repeat_time = 0, repeat_interval = 0, repeat_count = 0, in_expiration_processing = 0}

(gdb) p *p
$14 = {next = 0x0, prev = 0x0, locked = 0, ip = {s_addr = 895114118}, port = 137, fd = 11, timestamp = 1197717028, packet_type = NMB_PACKET, packet = {
    nmb = {header = {name_trn_id = 16197, opcode = 5, response = 1, nm_flags = {bcast = 0, recursion_available = 1, recursion_desired = 1, trunc = 0,
          authoritative = 1}, rcode = 6, qdcount = 0, ancount = 1, nscount = 0, arcount = 0}, question = {question_name = {name = '\0' <repeats 15 times>,
          scope = '\0' <repeats 63 times>, name_type = 0}, question_type = 0, question_class = 0}, answers = 0x80169490, nsrecs = 0x0, additional = 0x0},
    dgram = {header = {msg_type = 16197, flags = {node_type = 5, first = 1, more = 0}, dgm_id = 1, source_ip = {s_addr = 1}, source_port = 0,
        dgm_length = 1, packet_offset = 6}, source_name = {name = "\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000",
        scope = '\0' <repeats 63 times>, name_type = 0}, dest_name = {name = '\0' <repeats 15 times>,
        scope = "\000\000\000\000\000\000\000\000\220\224\026\200", '\0' <repeats 51 times>, name_type = 0}, datasize = 0,
      data = '\0' <repeats 575 times>}}}

Comment 31 Jeremy Allison 2007-12-17 16:51:03 UTC
This is corrupted memory. The problem won't be solved by adding the suggested patch, as that is only protecting against accesses in memory that's already been corrupted.

This gives it away for sure. From frame 10:

gdb) p *rrec
$13 = {next = 0x8011ffa8, prev = 0x80164568, response_id = 0, resp_fn = 0,
timeout_fn = 0, success_fn = 0, fail_fn = 0, packet = 0x0, userdata = 0x0,
  num_msgs = 0, repeat_time = 0, repeat_interval = 0, repeat_count = 0,
in_expiration_processing = 0}

This cannot be a valid response record pointer. The resp_fn, timeout_fn, success_fn, fail_fn cannot all be NULL in any codepath. This is at the start of the backtrace, so this nmbd already has a broken response record list.

Running under valgrind would catch the problem for sure, but might be too slow.
If you can't use valgrind, can you try setting :

export MALLOC_CHECK_=2
/usr/sbin/nmbd

and seeing if this will catch the bug ? It's definately triggered by the second LMB announce, but I need to see 

Jeremy.
Comment 32 Sebastian Held 2007-12-21 05:24:22 UTC
Created attachment 3065 [details]
valgrind logfile
Comment 33 Volker Lendecke 2007-12-21 06:34:36 UTC
Created attachment 3066 [details]
proposed patch

Thanks for that valgrind log. Can you see if it still happens if you apply the attached patch? Not 100% certain, but I suspect a bug in that linked list handling that we now have macros for.

Volker

P.S: This must be *ancient* code....
Comment 34 Jeremy Allison 2007-12-21 13:34:56 UTC
Created attachment 3070 [details]
Belt and braces

This *is* ancient code. I think this patch may also help. Please test.
Jeremy.
Comment 35 Sebastian Held 2008-01-02 07:50:11 UTC
I applied both patches. New valgrind report follows (truncated to last 1000 lines).
Comment 36 Sebastian Held 2008-01-02 07:51:21 UTC
Created attachment 3086 [details]
valgrind report with two patches applied
Comment 37 Jeremy Allison 2008-01-02 13:04:17 UTC
The valgrind traces aren't matching what I've got in my source tree, and the impossible is still occurring (a rrec is still on a linked list that DLIST_REMOVE should have explicitly removed it from). I'd like us to sync up on codebases so I know *exactly* what you are running.

Currently you're running 3.0.28-SerNet-SuSE, yes ? Volker, can we sync up so that 
we can be sure of what code is running on his box ? I'm going to push the patches 
we've already added as I'm sure they're good.

Jeremy.

Comment 38 Jeremy Allison 2008-01-02 13:31:48 UTC
If I send you a copy of my nmbd/ directory can you use that to build ?
Jeremy.
Comment 39 Sebastian Held 2008-01-03 04:49:34 UTC
I'm sorry to waste your time.
I thought I applied the patches correctly (modified the .spec-file from the Sernet sources). After stripping the first path element from the patches, they applied cleanly...
But I'm sure, I can build from a nmbd-source tree.
Comment 40 Sebastian Held 2008-02-09 06:05:23 UTC
I tried several times to run nmbd under supervision of valgrind, but cannot get a meaningful backtrace:

Failed to read a valid object file image from memory.
0x048258f1 in _vgrZU_libcZdsoZa_memcpy (dst=0xbed748b8, src=0x61cfdbc, len=84)
    at mc_replace_strmem.c:402
402     MEMCPY(m_libc_soname, memcpy)
(gdb) bt
#0  0x048258f1 in _vgrZU_libcZdsoZa_memcpy (dst=0xbed748b8, src=0x61cfdbc,
    len=84) at mc_replace_strmem.c:402
#1  0x0002a766 in upcase_name ()
#2  0x0002a896 in find_name_on_subnet ()
#3  0x0002aec1 in standard_fail_register ()
#4  0x0002ce39 in register_name_response ()
#5  0x0002f8ce in run_packet_queue ()
#6  0x00021935 in main ()

Some forums report: a bug in the linux kernel is responsible for the problems with backtraces... I already installed the newest kernel 2.6.18.8-0.8-default which should not suffer from the problem - no luck. Disabled appamor - no luck.

Sorry, I do not know how to proceed to get a valid backtrace.
Comment 41 Sebastian Held 2008-02-29 06:15:59 UTC
got it working...
please find attached the first valgrind report.
Comment 42 Sebastian Held 2008-02-29 06:16:59 UTC
Created attachment 3154 [details]
valgrind report and backtrace

Syscall param write(buf) points to uninitialised byte(s)
Comment 43 Jeremy Allison 2008-02-29 07:47:10 UTC
Created attachment 3155 [details]
Patch

Thanks very much for this. Here is a patch for the "Invalid read of size 1" errors.
Jeremy.
Comment 44 Jeremy Allison 2008-02-29 07:50:43 UTC
Created attachment 3156 [details]
Proper patch.

Proper patch to fix the "Invalid read of size 1" errors. Sorry for the earlier one.
Jermey.
Comment 45 Jeremy Allison 2008-02-29 08:52:37 UTC
Created attachment 3157 [details]
Patch for the invalid 4 byte reads.

Patch to fix the "Invalid read of size 4" errors.
Jeremy.
Comment 46 Sebastian Held 2008-06-10 08:39:01 UTC
Currently I'm running 3.0.30 and do not experience this bug anymore. I think it's solved.