WINS service stops working, because nmbd gets signal 11 segment violation. This seems to be related to another server announces itself as the local master browser. See attached log. How can help with this issue? The backtrace seems not very informative... (but I've installed samba3-debuginfo-3.0.21c-26.rpm)
Created attachment 1812 [details] logfile
same with 3.0.22
Created attachment 1844 [details] nmbd errorlog
Using 3.0.22 (13.1 & 14.1) with two masters running, samba stops responding. There are no error entry's in any logs.
Is there a chance you can try 3.0.23RC1 to see if this fixes the problem ? There have been some changes in this area. Jeremy.
The other option is to add the line : panic action = /bin/sleep 9999999 to the [global] section of the smb.conf. When nmbd crashes it will then stay around waiting for the child process sleep to finish, allowing you to attach to it with gdb and get a decent backtrace. Thanks, Jeremy.
(In reply to comment #6) > panic action = /bin/sleep 9999999 I did that, but since my last posting I'm not able to reproduce the failure (currently running 3.0.22)
backtrace using samba3-3.0.22-26 (SerNet): (gdb) bt #0 0xb7d8cbfe in __waitpid_nocancel () from /lib/tls/libc.so.6 #1 0xb7d3983a in do_system () from /lib/tls/libc.so.6 #2 0xb7e26ded in system () from /lib/tls/libpthread.so.0 #3 0x080db9f0 in smb_panic2 (why=0xfffffe00 <Address 0xfffffe00 out of bounds>, decrement_pid_count=1) at util.c:1545 #4 0x080db965 in smb_panic (why=0xfffffe00 <Address 0xfffffe00 out of bounds>) at util.c:1506 #5 0x080c8afb in fault_report (sig=-512) at fault.c:42 #6 0x080c8b70 in sig_fault (sig=-512) at fault.c:65 #7 <signal handler called> #8 0x0807e06f in remove_response_record (subrec=0x81f1b08, rrec=0x8187cd8) at nmbd_responserecordsdb.c:84 #9 0x0807bd59 in retransmit_or_expire_response_records (t=1150738393) at nmbd_packets.c:1637 #10 0x0806e1f0 in unbecome_local_master_browser (subrec=0x81f1b08, work=0x81f3e18, force_new_election=0) at nmbd_become_lmb.c:315 #11 0x0806e5c2 in become_local_master_fail2 (subrec=0x81f1b08, rrec=0x81b8218, fail_name=0x81f4f14) at nmbd_become_lmb.c:429 #12 0x08077294 in register_name_response (subrec=0x81f1b08, rrec=0x81b8218, p=0x81f4860) at nmbd_nameregister.c:157 #13 0x0807bb74 in process_nmb_response (p=0x81f4860) at nmbd_packets.c:1553 #14 0x0807bbca in run_packet_queue () at nmbd_packets.c:1573 #15 0x0806c343 in process () at nmbd.c:431 #16 0x0806cd09 in main (argc=0, argv=0xbffffd84) at nmbd.c:834 (gdb) The nmbd log will be attached. Now that I'm sure the error is not due to hardware changes, I will give 3.0.23RC a try.
Created attachment 1969 [details] nmbd errorlog
Why is nmbd linked with pthread ? What linux distro are you using. Please add the output from ldd /usr/sbin/nmbd so I can see if this is linked with pthread.so. Jeremy.
rom:/var/log/samba # ldd /usr/sbin/nmbd linux-gate.so.1 => (0xffffe000) libcrypt.so.1 => /lib/libcrypt.so.1 (0x40030000) libresolv.so.2 => /lib/libresolv.so.2 (0x40062000) libnsl.so.1 => /lib/libnsl.so.1 (0x40076000) libdl.so.2 => /lib/libdl.so.2 (0x4008c000) librt.so.1 => /lib/tls/librt.so.1 (0x40090000) libpopt.so.0 => /usr/lib/libpopt.so.0 (0x40098000) libcom_err.so.2 => /lib/libcom_err.so.2 (0x400a0000) libcrypto.so.0.9.7 => /usr/lib/libcrypto.so.0.9.7 (0x400a3000) libldap-2.2.so.7 => /usr/lib/libldap-2.2.so.7 (0x40195000) liblber-2.2.so.7 => /usr/lib/liblber-2.2.so.7 (0x401c8000) libpthread.so.0 => /lib/tls/libpthread.so.0 (0x401d5000) libc.so.6 => /lib/tls/libc.so.6 (0x401e6000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000) libsasl2.so.2 => /usr/lib/libsasl2.so.2 (0x402fc000) libssl.so.0.9.7 => /usr/lib/libssl.so.0.9.7 (0x40312000) rom:/var/log/samba # cat /etc/SuSE-release SuSE Linux 9.2 (i586) VERSION = 9.2 I'm using SerNet packages, but this one is recompiled on this computer. But recompilation was started only due to the above mentioned bug...
This : libpthread.so.0 => /lib/tls/libpthread.so.0 (0x401d5000) is a problem I think. It shouldn't be there. I wonder what is pulling it in ? Jeremy.
why is this a problem? other libs depend on pthread, too: rom:~ # ldd /lib/tls/librt.so.1 [...] libpthread.so.0 => /lib/tls/libpthread.so.0 (0x40146000) [...] rom:~ # Is there anything I can do, to shed some light on why it links to pthreads?
How did you do the ./configure ? What options ? Jeremy.
using samba3-3.0.22-26.src.rpm from SerNet: ./configure --enable-cups --libdir=/usr/lib/samba --localstatedir=/var/lib/samba --mandir=/usr/share/man --prefix=/usr --sbindir=/usr/sbin --sysconfdir=/etc/samba --with-acl- support --with-aio-support --with-automount --with-configdir=/etc/samba --with-lockdir=/var/lib/samba --with-logfilebase=/var/log/samba --with-libdir=/usr/lib/samba --with-msdfs --with-pam --with-pam_smbpass --with-piddir=/var/run/samba --with-privatedir=/etc/samba --with-profiling-data --with-quotas --with-smbmount --with-swatdir=/usr/share/samba/swat - -with-syslog --with-tdbsam --with-utmp --with-vfs --with-winbind --with-ads --with-krb5 --with-smbwrapper --with-shared-modules=idmap_rid,idmap_ad,vfs_catia
It happened again: samba3-3.0.23b-30 (SerNet) nmbd.log: [2006/10/18 11:29:04, 0] nmbd/nmbd_incomingdgrams.c:process_local_master_announce(309) process_local_master_announce: Server ZIYAL at IP 192.168.90.53 is announcing itself as a local master browser for workgroup HFT and we think we are master. Forcing election. [2006/10/18 11:29:04, 0] nmbd/nmbd_become_lmb.c:unbecome_local_master_success(149) ***** Samba name server ROM has stopped being a local master browser for workgroup HFT on subnet 192.168.90.52 ***** [2006/10/18 11:29:15, 0] nmbd/nmbd_nameregister.c:register_name_response(130) register_name_response: server at IP 192.168.90.53 rejected our name registration of HFT<1d> IP 192.168.90.52 with error code 6. [2006/10/18 11:29:15, 0] nmbd/nmbd_become_lmb.c:become_local_master_fail2(417) become_local_master_fail2: failed to register name HFT<1d> on subnet 192.168.90.52. Failed to become a local master browser. [2006/10/18 11:29:15, 0] nmbd/nmbd_become_lmb.c:become_local_master_fail2(417) become_local_master_fail2: failed to register name HFT<1d> on subnet 192.168.90.52. Failed to become a local master browser. [2006/10/18 11:29:15, 0] nmbd/nmbd_namelistdb.c:standard_success_release(419) standard_success_release: Name release for name __MSBROWSE__<01> IP 192.168.90.52 on subnet 192.168.90.52. Name was not found on subnet. [2006/10/18 11:29:15, 0] nmbd/nmbd_namelistdb.c:standard_fail_register(305) standard_fail_register: Failed to register/refresh name HFT<1d> on subnet 192.168.90.52 [2006/10/18 11:29:15, 0] lib/fault.c:fault_report(41) =============================================================== [2006/10/18 11:29:15, 0] lib/fault.c:fault_report(42) INTERNAL ERROR: Signal 11 in pid 5915 (3.0.23b-SerNet-SuSE) Please read the Trouble-Shooting section of the Samba3-HOWTO [2006/10/18 11:29:15, 0] lib/fault.c:fault_report(44) From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf [2006/10/18 11:29:15, 0] lib/fault.c:fault_report(45) =============================================================== [2006/10/18 11:29:15, 0] lib/util.c:smb_panic(1592) PANIC (pid 5915): internal error [2006/10/18 11:29:15, 0] lib/util.c:log_stack_trace(1699) BACKTRACE: 15 stack frames: #0 /usr/sbin/nmbd(log_stack_trace+0x22) [0x80dd1eb] #1 /usr/sbin/nmbd(smb_panic+0x6f) [0x80dd08a] #2 /usr/sbin/nmbd [0x80cb5bf] #3 /usr/sbin/nmbd [0x80cb5d0] #4 [0xffffe420] #5 /usr/sbin/nmbd(retransmit_or_expire_response_records+0x163) [0x807bf29] #6 /usr/sbin/nmbd(unbecome_local_master_browser+0x181) [0x806e6e0] #7 /usr/sbin/nmbd [0x806eab2] #8 /usr/sbin/nmbd [0x8077454] #9 /usr/sbin/nmbd [0x807bd44] #10 /usr/sbin/nmbd(run_packet_queue+0x4c) [0x807bd9a] #11 /usr/sbin/nmbd [0x806c767] #12 /usr/sbin/nmbd(main+0x779) [0x806d1a4] #13 /lib/tls/libc.so.6(__libc_start_main+0xe4) [0xb7d22b14] #14 /usr/sbin/nmbd [0x806b571] [2006/10/18 11:29:15, 0] lib/util.c:smb_panic(1600) smb_panic(): calling panic action [/bin/sleep 9999999] (gdb) bt #0 0xb7d94bfe in __waitpid_nocancel () from /lib/tls/libc.so.6 #1 0xb7d4183a in do_system () from /lib/tls/libc.so.6 #2 0xb7e2eded in system () from /lib/tls/libpthread.so.0 #3 0x080dd102 in smb_panic (why=0x0) at util.c:1601 #4 0x080cb5bf in fault_report (sig=-512) at fault.c:47 #5 0x080cb5d0 in sig_fault (sig=-512) at fault.c:70 #6 <signal handler called> #7 0x0807e24f in remove_response_record (subrec=0x81f5bd8, rrec=0x818bec0) at nmbd_responserecordsdb.c:82 #8 0x0807bf29 in retransmit_or_expire_response_records (t=1161163755) at nmbd_packets.c:1639 #9 0x0806e6e0 in unbecome_local_master_browser (subrec=0x81f5bd8, work=0x81f6360, force_new_election=0) at nmbd_become_lmb.c:315 #10 0x0806eab2 in become_local_master_fail2 (subrec=0x81f5bd8, rrec=0x81bc2e8, fail_name=0x81f677c) at nmbd_become_lmb.c:429 #11 0x08077454 in register_name_response (subrec=0x81f5bd8, rrec=0x81bc2e8, p=0x81fc280) at nmbd_nameregister.c:157 #12 0x0807bd44 in process_nmb_response (p=0x81fc280) at nmbd_packets.c:1555 #13 0x0807bd9a in run_packet_queue () at nmbd_packets.c:1575 #14 0x0806c767 in process () at nmbd.c:396 #15 0x0806d1a4 in main (argc=0, argv=0xbffffd84) at nmbd.c:806
Great debug backtrace thanks ! It looks like it's occurring on this line : /* Ensure we can delete. */ rrec->packet->locked = False; Is it possible for you to use gdb to print out the contents of rrec here ? I don't think rrec->packet should ever be zero, but I'd like to check. Thanks, Jeremy.
Now using 3.0.23c (SerNet). (gdb) bt #0 0x40264bfe in __waitpid_nocancel () from /lib/tls/libc.so.6 #1 0x4021183a in do_system () from /lib/tls/libc.so.6 #2 0x401d7ded in system () from /lib/tls/libpthread.so.0 #3 0x080dd212 in smb_panic (why=0x0) at util.c:1601 #4 0x080cb6bf in fault_report (sig=-512) at fault.c:47 #5 0x080cb6d0 in sig_fault (sig=-512) at fault.c:70 #6 <signal handler called> #7 0x0807e29f in remove_response_record (subrec=0x81d0998, rrec=0x819efc8) at nmbd_responserecordsdb.c:82 #8 0x0807bf79 in retransmit_or_expire_response_records (t=1161247756) at nmbd_packets.c:1639 #9 0x0806e6f0 in unbecome_local_master_browser (subrec=0x81d0998, work=0x81d1120, force_new_election=0) at nmbd_become_lmb.c:315 #10 0x0806eac2 in become_local_master_fail2 (subrec=0x81d0998, rrec=0x81bf5c8, fail_name=0x81d1864) at nmbd_become_lmb.c:429 #11 0x080774a4 in register_name_response (subrec=0x81d0998, rrec=0x81bf5c8, p=0x81d4988) at nmbd_nameregister.c:157 #12 0x0807bd94 in process_nmb_response (p=0x81d4988) at nmbd_packets.c:1555 #13 0x0807bdea in run_packet_queue () at nmbd_packets.c:1575 #14 0x0806c777 in process () at nmbd.c:396 #15 0x0806d1b4 in main (argc=0, argv=0xbffff654) at nmbd.c:806 (gdb) up [...] gdb) print rrec $1 = (struct response_record *) 0x819efc8 (gdb) print rrec->packet $2 = (struct packet_struct *) 0x0 (gdb) p rrec->next $3 = (struct response_record *) 0x819f088 (gdb) p rrec->prev $4 = (struct response_record *) 0x819f010 (gdb) p rrec->response_id $5 = 0 (gdb) p rrec->resp_fn $6 = 0 (gdb) p rrec->timeout_fn $7 = 0 (gdb) p rrec->success_fn $8 = 0 (gdb) p rrec->fail_fn $9 = 0 (gdb) p rrec->userdata $10 = (struct userdata_struct *) 0x0 (gdb) p rrec->num_msgs $11 = 0 (gdb) p rrec->repeat_time $12 = 0 (gdb) p rrec->repeat_interval $13 = 0 (gdb) p rrec->repeat_count $14 = 0 (gdb) p rrec->in_expiration_processing $15 = 1
Ok, that helps a *lot*. Should have a fix shortly, thanks !
Now using 3.0.23d (SerNet). (gdb) bt #0 0x40264bfe in __waitpid_nocancel () from /lib/tls/libc.so.6 #1 0x4021183a in do_system () from /lib/tls/libc.so.6 #2 0x401d7ded in system () from /lib/tls/libpthread.so.0 #3 0x080dd412 in smb_panic (why=0x0) at util.c:1608 #4 0x080cb80f in fault_report (sig=-512) at fault.c:47 #5 0x080cb820 in sig_fault (sig=-512) at fault.c:70 #6 <signal handler called> #7 0x0807e30f in remove_response_record (subrec=0x81d1998, rrec=0x81a0090) at nmbd_responserecordsdb.c:82 #8 0x0807bfe9 in retransmit_or_expire_response_records (t=1164017649) at nmbd_packets.c:1639 #9 0x0806e760 in unbecome_local_master_browser (subrec=0x81d1998, work=0x81d2120, force_new_election=0) at nmbd_become_lmb.c:315 #10 0x0806eb32 in become_local_master_fail2 (subrec=0x81d1998, rrec=0x819fe88, fail_name=0x81d6994) at nmbd_become_lmb.c:429 #11 0x08077514 in register_name_response (subrec=0x81d1998, rrec=0x819fe88, p=0x81d72d8) at nmbd_nameregister.c:157 #12 0x0807be04 in process_nmb_response (p=0x81d72d8) at nmbd_packets.c:1555 #13 0x0807be5a in run_packet_queue () at nmbd_packets.c:1575 #14 0x0806c7e7 in process () at nmbd.c:396 #15 0x0806d224 in main (argc=0, argv=0xbffff654) at nmbd.c:806 === log excerpt === [2006/11/20 11:14:09, 0] nmbd/nmbd_namelistdb.c:standard_success_release(419) standard_success_release: Name release for name __MSBROWSE__<01> IP 192.168.90.52 on subnet 192.168.90.52. Name was not found on subnet. [2006/11/20 11:14:09, 0] nmbd/nmbd_namelistdb.c:standard_fail_register(305) standard_fail_register: Failed to register/refresh name HFT<1d> on subnet 192.168.90.52 [2006/11/20 11:14:09, 0] lib/fault.c:fault_report(41) =============================================================== [2006/11/20 11:14:09, 0] lib/fault.c:fault_report(42) INTERNAL ERROR: Signal 11 in pid 29867 (3.0.23d-SerNet-SuSE) Please read the Trouble-Shooting section of the Samba3-HOWTO [2006/11/20 11:14:09, 0] lib/fault.c:fault_report(44) From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf [2006/11/20 11:14:09, 0] lib/fault.c:fault_report(45) =============================================================== [2006/11/20 11:14:09, 0] lib/util.c:smb_panic(1599) PANIC (pid 29867): internal error
Jeremy, I tried to reproduce this with current code but I could not. You said you would have a fix quickly, but I could not find any reference. If it is fixed, can you close this bug or tell me more about the way to reproduce it? Thanks, Volker
It happened again :( Now running openSUSE 10.2 (i586) and samba3-3.0.25c-33 from SerNet. I'm pretty sure that this damn workstation ziyal is somewhat misconfigured, but that must not take nmbd down, right? I will attach the error log from nmbd as well as the (useless) backtrace. kind regards, Sebastian
Created attachment 2915 [details] useless backtrace
Created attachment 2916 [details] nmbd errorlog
Created attachment 2917 [details] useless backtrace (core file)
samba3-3.0.27a-35 on openSUSE 10.2: (gdb) bt #0 0xb7f768b2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0xb7ccb060 in raise () from /lib/libc.so.6 #2 0xb7ccc801 in abort () from /lib/libc.so.6 #3 0x8008d9d9 in dump_core () at lib/fault.c:192 #4 0x800a1bdd in smb_panic (why=0x800e5eb8 "internal error") at lib/util.c:1649 #5 0x8008de59 in sig_fault (sig=11) at lib/fault.c:47 #6 <signal handler called> #7 0x800340bb in remove_response_record (subrec=0x80161d08, rrec=0x801201a8) at nmbd/nmbd_responserecordsdb.c:82 #8 0x8002e41d in retransmit_or_expire_response_records (t=1196760635) at nmbd/nmbd_packets.c:1645 #9 0x80023313 in unbecome_local_master_browser (subrec=0x80161d08, work=0x80162108, force_new_election=0) at nmbd/nmbd_become_lmb.c:315 #10 0x8002342a in become_local_master_fail2 (subrec=0x80161d08, rrec=0x80120018, fail_name=0x80166c4c) at nmbd/nmbd_become_lmb.c:429 #11 0x8002ce1d in register_name_response (subrec=0x80161d08, rrec=0x80120018, p=0x801610c8) at nmbd/nmbd_nameregister.c:157 #12 0x8002f8c9 in run_packet_queue () at nmbd/nmbd_packets.c:1561 #13 0x80021935 in main (argc=0, argv=0xbfbb6534) at nmbd/nmbd.c:394 inspecting stackframe #7: (gdb) frame 7 #7 0x800340bb in remove_response_record (subrec=0x80161d08, rrec=0x801201a8) at nmbd/nmbd_responserecordsdb.c:82 82 in nmbd/nmbd_responserecordsdb.c (gdb) p *rrec $3 = {next = 0x80128300, prev = 0x80158c68, response_id = 0, resp_fn = 0, timeout_fn = 0, success_fn = 0, fail_fn = 0, packet = 0x0, userdata = 0x0, num_msgs = 0, repeat_time = 0, repeat_interval = 0, repeat_count = 0, in_expiration_processing = 1} p *subrec $4 = {next = 0x0, prev = 0x0, subnet_name = 0x80161d48 "134.91.90.52", type = NORMAL_SUBNET, workgrouplist = 0x80162108, namelist = 0x80165da8, responselist = 0x0, namelist_changed = 1, work_changed = 0, bcast_ip = {s_addr = 4284111750}, mask_ip = {s_addr = 16777215}, myip = {s_addr = 878336902}, nmb_sock = 13, dgram_sock = 14}
Created attachment 3021 [details] log.nmbd
Proposed patch: Add if (rrec->packet) { before line 82 and add } after line 83.
Hmmm. How can rrec->packet == NULL ? How do we get to this state ? I want to understand that before I add the patch. Thanks for all your work on this. Jeremy.
3.0.28-SerNet-SuSE: (gdb) bt #0 0xb7fd98b2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0xb7d905c3 in __waitpid_nocancel () from /lib/libc.so.6 #2 0xb7d3ad3b in do_system () from /lib/libc.so.6 #3 0x800a1c01 in smb_panic (why=0x800e5fb8 "internal error") at lib/util.c:1639 #4 0x8008df39 in sig_fault (sig=11) at lib/fault.c:47 #5 <signal handler called> #6 0x8003412b in remove_response_record (subrec=0x80161ce0, rrec=0x80138560) at nmbd/nmbd_responserecordsdb.c:82 #7 0x8002e41d in retransmit_or_expire_response_records (t=1197717028) at nmbd/nmbd_packets.c:1645 #8 0x80023313 in unbecome_local_master_browser (subrec=0x80161ce0, work=0x80162108, force_new_election=0) at nmbd/nmbd_become_lmb.c:315 #9 0x8002342a in become_local_master_fail2 (subrec=0x80161ce0, rrec=0x80117c38, fail_name=0x80165fc4) at nmbd/nmbd_become_lmb.c:429 #10 0x8002ce1d in register_name_response (subrec=0x80161ce0, rrec=0x80117c38, p=0x80167f68) at nmbd/nmbd_nameregister.c:157 #11 0x8002f8c9 in run_packet_queue () at nmbd/nmbd_packets.c:1561 #12 0x80021935 in main (argc=0, argv=0xbfe7e7e4) at nmbd/nmbd.c:394 (gdb) frame 6 #6 0x8003412b in remove_response_record (subrec=0x80161ce0, rrec=0x80138560) at nmbd/nmbd_responserecordsdb.c:82 (gdb) p *subrec $2 = {next = 0x0, prev = 0x0, subnet_name = 0x80161d20 "134.91.90.52", type = NORMAL_SUBNET, workgrouplist = 0x80162108, namelist = 0x80167e18, responselist = 0x0, namelist_changed = 1, work_changed = 0, bcast_ip = {s_addr = 4284111750}, mask_ip = {s_addr = 16777215}, myip = {s_addr = 878336902}, nmb_sock = 11, dgram_sock = 12} (gdb) p *rrec $3 = {next = 0x80150b10, prev = 0x8011ffa8, response_id = 0, resp_fn = 0, timeout_fn = 0, success_fn = 0, fail_fn = 0, packet = 0x0, userdata = 0x0, num_msgs = 0, repeat_time = 0, repeat_interval = 0, repeat_count = 0, in_expiration_processing = 1} (gdb) frame 8 #8 0x80023313 in unbecome_local_master_browser (subrec=0x80161ce0, work=0x80162108, force_new_election=0) at nmbd/nmbd_become_lmb.c:315 (gdb) p *subrec $5 = {next = 0x0, prev = 0x0, subnet_name = 0x80161d20 "134.91.90.52", type = NORMAL_SUBNET, workgrouplist = 0x80162108, namelist = 0x80167e18, responselist = 0x0, namelist_changed = 1, work_changed = 0, bcast_ip = {s_addr = 4284111750}, mask_ip = {s_addr = 16777215}, myip = {s_addr = 878336902}, nmb_sock = 11, dgram_sock = 12} (gdb) p *work $6 = {next = 0x0, prev = 0x0, subnet = 0x80161ce0, serverlist = 0x801645b0, mst_state = MST_UNBECOMING_MASTER, dom_state = DOMAIN_MST, log_state = LOGON_SRV, work_group = "HFT", '\0' <repeats 60 times>, token = 1, local_master_browser_name = "\000OM", '\0' <repeats 60 times>, lastannounce_time = 1197717015, announce_interval = 240, needannounce = 0, death_time = 0, RunningElection = 1, needelection = 0, ElectionCount = 3, ElectionCriterion = 1090588559, dmb_name = {name = '\0' <repeats 15 times>, scope = '\0' <repeats 63 times>, name_type = 0}, dmb_addr = {s_addr = 0}} (gdb) frame 9 #9 0x8002342a in become_local_master_fail2 (subrec=0x80161ce0, rrec=0x80117c38, fail_name=0x80165fc4) at nmbd/nmbd_become_lmb.c:429 (gdb) p *subrec $7 = {next = 0x0, prev = 0x0, subnet_name = 0x80161d20 "134.91.90.52", type = NORMAL_SUBNET, workgrouplist = 0x80162108, namelist = 0x80167e18, responselist = 0x0, namelist_changed = 1, work_changed = 0, bcast_ip = {s_addr = 4284111750}, mask_ip = {s_addr = 16777215}, myip = {s_addr = 878336902}, nmb_sock = 11, dgram_sock = 12} (gdb) p *rrec $8 = {next = 0x8011ffa8, prev = 0x80164568, response_id = 0, resp_fn = 0, timeout_fn = 0, success_fn = 0, fail_fn = 0, packet = 0x0, userdata = 0x0, num_msgs = 0, repeat_time = 0, repeat_interval = 0, repeat_count = 0, in_expiration_processing = 0} (gdb) p *fail_name $9 = {name = "sr/sbin/nmbd(smb", scope = "_panic+0x78) [0x800a1b96]\000/usr/sbin/nmbd [0x8008df39]\000/lib/libc.", name_type = 909012851} (gdb) frame 10 #10 0x8002ce1d in register_name_response (subrec=0x80161ce0, rrec=0x80117c38, p=0x80167f68) at nmbd/nmbd_nameregister.c:157 (gdb) p *subrec $12 = {next = 0x0, prev = 0x0, subnet_name = 0x80161d20 "134.91.90.52", type = NORMAL_SUBNET, workgrouplist = 0x80162108, namelist = 0x80167e18, responselist = 0x0, namelist_changed = 1, work_changed = 0, bcast_ip = {s_addr = 4284111750}, mask_ip = {s_addr = 16777215}, myip = {s_addr = 878336902}, nmb_sock = 11, dgram_sock = 12} (gdb) p *rrec $13 = {next = 0x8011ffa8, prev = 0x80164568, response_id = 0, resp_fn = 0, timeout_fn = 0, success_fn = 0, fail_fn = 0, packet = 0x0, userdata = 0x0, num_msgs = 0, repeat_time = 0, repeat_interval = 0, repeat_count = 0, in_expiration_processing = 0} (gdb) p *p $14 = {next = 0x0, prev = 0x0, locked = 0, ip = {s_addr = 895114118}, port = 137, fd = 11, timestamp = 1197717028, packet_type = NMB_PACKET, packet = { nmb = {header = {name_trn_id = 16197, opcode = 5, response = 1, nm_flags = {bcast = 0, recursion_available = 1, recursion_desired = 1, trunc = 0, authoritative = 1}, rcode = 6, qdcount = 0, ancount = 1, nscount = 0, arcount = 0}, question = {question_name = {name = '\0' <repeats 15 times>, scope = '\0' <repeats 63 times>, name_type = 0}, question_type = 0, question_class = 0}, answers = 0x80169490, nsrecs = 0x0, additional = 0x0}, dgram = {header = {msg_type = 16197, flags = {node_type = 5, first = 1, more = 0}, dgm_id = 1, source_ip = {s_addr = 1}, source_port = 0, dgm_length = 1, packet_offset = 6}, source_name = {name = "\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000", scope = '\0' <repeats 63 times>, name_type = 0}, dest_name = {name = '\0' <repeats 15 times>, scope = "\000\000\000\000\000\000\000\000\220\224\026\200", '\0' <repeats 51 times>, name_type = 0}, datasize = 0, data = '\0' <repeats 575 times>}}}
This is corrupted memory. The problem won't be solved by adding the suggested patch, as that is only protecting against accesses in memory that's already been corrupted. This gives it away for sure. From frame 10: gdb) p *rrec $13 = {next = 0x8011ffa8, prev = 0x80164568, response_id = 0, resp_fn = 0, timeout_fn = 0, success_fn = 0, fail_fn = 0, packet = 0x0, userdata = 0x0, num_msgs = 0, repeat_time = 0, repeat_interval = 0, repeat_count = 0, in_expiration_processing = 0} This cannot be a valid response record pointer. The resp_fn, timeout_fn, success_fn, fail_fn cannot all be NULL in any codepath. This is at the start of the backtrace, so this nmbd already has a broken response record list. Running under valgrind would catch the problem for sure, but might be too slow. If you can't use valgrind, can you try setting : export MALLOC_CHECK_=2 /usr/sbin/nmbd and seeing if this will catch the bug ? It's definately triggered by the second LMB announce, but I need to see Jeremy.
Created attachment 3065 [details] valgrind logfile
Created attachment 3066 [details] proposed patch Thanks for that valgrind log. Can you see if it still happens if you apply the attached patch? Not 100% certain, but I suspect a bug in that linked list handling that we now have macros for. Volker P.S: This must be *ancient* code....
Created attachment 3070 [details] Belt and braces This *is* ancient code. I think this patch may also help. Please test. Jeremy.
I applied both patches. New valgrind report follows (truncated to last 1000 lines).
Created attachment 3086 [details] valgrind report with two patches applied
The valgrind traces aren't matching what I've got in my source tree, and the impossible is still occurring (a rrec is still on a linked list that DLIST_REMOVE should have explicitly removed it from). I'd like us to sync up on codebases so I know *exactly* what you are running. Currently you're running 3.0.28-SerNet-SuSE, yes ? Volker, can we sync up so that we can be sure of what code is running on his box ? I'm going to push the patches we've already added as I'm sure they're good. Jeremy.
If I send you a copy of my nmbd/ directory can you use that to build ? Jeremy.
I'm sorry to waste your time. I thought I applied the patches correctly (modified the .spec-file from the Sernet sources). After stripping the first path element from the patches, they applied cleanly... But I'm sure, I can build from a nmbd-source tree.
I tried several times to run nmbd under supervision of valgrind, but cannot get a meaningful backtrace: Failed to read a valid object file image from memory. 0x048258f1 in _vgrZU_libcZdsoZa_memcpy (dst=0xbed748b8, src=0x61cfdbc, len=84) at mc_replace_strmem.c:402 402 MEMCPY(m_libc_soname, memcpy) (gdb) bt #0 0x048258f1 in _vgrZU_libcZdsoZa_memcpy (dst=0xbed748b8, src=0x61cfdbc, len=84) at mc_replace_strmem.c:402 #1 0x0002a766 in upcase_name () #2 0x0002a896 in find_name_on_subnet () #3 0x0002aec1 in standard_fail_register () #4 0x0002ce39 in register_name_response () #5 0x0002f8ce in run_packet_queue () #6 0x00021935 in main () Some forums report: a bug in the linux kernel is responsible for the problems with backtraces... I already installed the newest kernel 2.6.18.8-0.8-default which should not suffer from the problem - no luck. Disabled appamor - no luck. Sorry, I do not know how to proceed to get a valid backtrace.
got it working... please find attached the first valgrind report.
Created attachment 3154 [details] valgrind report and backtrace Syscall param write(buf) points to uninitialised byte(s)
Created attachment 3155 [details] Patch Thanks very much for this. Here is a patch for the "Invalid read of size 1" errors. Jeremy.
Created attachment 3156 [details] Proper patch. Proper patch to fix the "Invalid read of size 1" errors. Sorry for the earlier one. Jermey.
Created attachment 3157 [details] Patch for the invalid 4 byte reads. Patch to fix the "Invalid read of size 4" errors. Jeremy.
Currently I'm running 3.0.30 and do not experience this bug anymore. I think it's solved.