Created attachment 9454 [details] Core dump file

Sometimes a smbd process segfaults on my Samba 4.1 member server:

smbd: ../source3/lib/msg_channel.c:298: msg_channel_trigger: Assertion `num_msgs > 0' failed.
[2013/11/20 04:09:58.618436, 0, pid=30356] ../lib/util/fault.c:72(fault_report)
  ===============================================================
[2013/11/20 04:09:58.618555, 0, pid=30356] ../lib/util/fault.c:73(fault_report)
  INTERNAL ERROR: Signal 6 in pid 30356 (4.1.0)
  Please read the Trouble-Shooting section of the Samba HOWTO
[2013/11/20 04:09:58.618643, 0, pid=30356] ../lib/util/fault.c:75(fault_report)
  ===============================================================
[2013/11/20 04:09:58.618704, 0, pid=30356] ../source3/lib/util.c:785(smb_panic_s3)
  PANIC (pid 30356): internal error
[2013/11/20 04:09:58.654962, 0, pid=30356] ../source3/lib/util.c:896(log_stack_trace)
  BACKTRACE: 24 stack frames:
   #0 /usr/local/samba/lib/libsmbconf.so.0(log_stack_trace+0x1f) [0x7fe4601efc06]
   #1 /usr/local/samba/lib/libsmbconf.so.0(smb_panic_s3+0x6d) [0x7fe4601efa75]
   #2 /usr/local/samba/lib/libsamba-util.so.0(smb_panic+0x28) [0x7fe46205ccfb]
   #3 /usr/local/samba/lib/libsamba-util.so.0(+0x1c9fb) [0x7fe46205c9fb]
   #4 /usr/local/samba/lib/libsamba-util.so.0(+0x1ca10) [0x7fe46205ca10]
   #5 /lib64/libpthread.so.0(+0x3dc9e0f500) [0x7fe46228c500]
   #6 /lib64/libc.so.6(gsignal+0x35) [0x7fe45ea988e5]
   #7 /lib64/libc.so.6(abort+0x175) [0x7fe45ea9a0c5]
   #8 /lib64/libc.so.6(+0x3dc962ba0e) [0x7fe45ea91a0e]
   #9 /lib64/libc.so.6(__assert_perror_fail+0) [0x7fe45ea91ad0]
   #10 /usr/local/samba/lib/libsmbconf.so.0(+0x317fe) [0x7fe4601fb7fe]
   #11 /usr/local/samba/lib/samba/libtevent.so.0(tevent_common_loop_immediate+0x1f9) [0x7fe461287ee4]
   #12 /usr/local/samba/lib/libsmbconf.so.0(run_events_poll+0x57) [0x7fe46020c197]
   #13 /usr/local/samba/lib/libsmbconf.so.0(+0x42844) [0x7fe46020c844]
   #14 /usr/local/samba/lib/samba/libtevent.so.0(_tevent_loop_once+0xfc) [0x7fe461286fa9]
   #15 /usr/local/samba/lib/samba/libsmbd_base.so(smbd_process+0x1321) [0x7fe461800a1e]
   #16 /usr/sbin/smbd(+0x9c38) [0x7fe4626c5c38]
   #17 /usr/local/samba/lib/libsmbconf.so.0(run_events_poll+0x544) [0x7fe46020c684]
   #18 /usr/local/samba/lib/libsmbconf.so.0(+0x4295a) [0x7fe46020c95a]
   #19 /usr/local/samba/lib/samba/libtevent.so.0(_tevent_loop_once+0xfc) [0x7fe461286fa9]
   #20 /usr/sbin/smbd(+0xa8d7) [0x7fe4626c68d7]
   #21 /usr/sbin/smbd(main+0x15d1) [0x7fe4626c7ff9]
   #22 /lib64/libc.so.6(__libc_start_main+0xfd) [0x7fe45ea84cdd]
   #23 /usr/sbin/smbd(+0x5809) [0x7fe4626c1809]
[2013/11/20 04:09:58.655441, 0, pid=30356] ../source3/lib/util.c:797(smb_panic_s3)
  smb_panic(): calling panic action [/usr/local/bin/panic-action 30356]
[2013/11/20 04:10:00.677661, 0, pid=30356] ../source3/lib/util.c:805(smb_panic_s3)
  smb_panic(): action returned status 0
[2013/11/20 04:10:00.677866, 0, pid=30356] ../source3/lib/dumpcore.c:317(dump_core)
  dumping core in /var/log/samba//cores/smbd
[2013/11/20 04:10:00.689181, 1, pid=98402] ../source3/smbd/process.c:480(receive_smb_talloc)
  receive_smb_raw_talloc failed for client ipv4:10.1.1.167:1244 read error = NT_STATUS_CONNECTION_RESET.
[2013/11/20 04:10:00.983995, 1, pid=98403] ../source3/lib/dbwrap/dbwrap_watch.c:345(dbwrap_watch_record_stored)
  messaging_send to 30356 failed: NT_STATUS_INVALID_HANDLE
[2013/11/20 04:10:01.002675, 1, pid=98403] ../source3/lib/dbwrap/dbwrap_watch.c:345(dbwrap_watch_record_stored)
  messaging_send to 30356 failed: NT_STATUS_INVALID_HANDLE
[2013/11/20 04:10:01.003536, 1, pid=98403] ../source3/lib/dbwrap/dbwrap_watch.c:345(dbwrap_watch_record_stored)
  messaging_send to 30356 failed: NT_STATUS_INVALID_HANDLE
[2013/11/20 04:10:01.004056, 1, pid=98403] ../source3/lib/dbwrap/dbwrap_watch.c:345(dbwrap_watch_record_stored)
  messaging_send to 30356 failed: NT_STATUS_INVALID_HANDLE
.....

I can't say what operation causes this, and it appears very rarely.
Created attachment 9455 [details] gdb backtrace
Created attachment 9456 [details] smb.conf
Got it reproduced. I would *really* like to see your workload....
*** Bug 10282 has been marked as a duplicate of this bug. ***
Just FYI: I've got a patch that is in private autobuild. If that survives it, I'll upload it here.
(In reply to comment #3)
> Got it reproduced. I would *really* like to see your workload....

What especially?

(In reply to comment #5)
> Just FYI: I've got a patch that is in private autobuild. If that survives it,
> I'll upload it here.

Great.
Do you have an idea when this bug occurs? I had this only twice in two weeks and I have no idea what causes this. So I can't force it to reproduce; I just have to wait.
(In reply to comment #3)
> Got it reproduced. I would *really* like to see your workload....

FWIW, my crash observation (anecdote?): this occurred when Windows 8.1's offline files was syncing the mapped folders of all users on a machine. (A number of those folders & files were owned by users who were logged in & using those folders on other machines.)
Created attachment 9466 [details] Patch for 4.1.1

This patch is on top of plain 4.1.1. If you have the patch for bug 10250 already applied, then this will conflict. 10250 is implicitly fixed by this as well, I guess.

Christian, you've shown interest in this defect on IRC, so I'll give it to you for review :-)
(In reply to comment #6)
> (In reply to comment #3)
> > Got it reproduced. I would *really* like to see your workload....
>
> What especially?

Well, you seem to hit pretty tight race conditions. Must be a busy server.

> Great.
> Do you have an idea, when this bug occurs? I had this only twice in two weeks
> and I have no idea what causes this. So I can't force to reproduce it. I just
> have to wait.

The reproducer I posted for master to samba-technical is a unit test which directly goes into the msg_channel API. I would love to know how to reproduce this with pure SMB client actions, but probably that's very difficult to trigger. It's good to have these asserts, though; this one at least pointed me at problems in the code that I could reproduce artificially.
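[Editorial illustration] To make the kind of race discussed above concrete, here is a minimal, self-contained C sketch. It is neither the Samba msg_channel code nor the attached patch; every name in it (demo_channel, demo_msg_arrived, demo_read_all, and so on) is hypothetical. It assumes a deferred trigger is scheduled when a message arrives, and that a reader may drain the queue before that trigger runs, so a check like `num_msgs > 0` would no longer hold when the stale trigger fires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a message channel; not Samba's structure. */
struct demo_channel {
    size_t num_msgs;        /* messages queued but not yet delivered */
    bool trigger_scheduled; /* a deferred "deliver" callback is pending */
    size_t delivered;       /* messages handed to the reader so far */
};

/* A message arrives: queue it and schedule the deferred trigger. */
static void demo_msg_arrived(struct demo_channel *c)
{
    c->num_msgs++;
    c->trigger_scheduled = true;
}

/* A reader drains the queue directly, before the trigger has run.
 * The scheduled trigger is deliberately left pending: it is now stale. */
static void demo_read_all(struct demo_channel *c)
{
    c->delivered += c->num_msgs;
    c->num_msgs = 0;
}

/* Reports whether an assert(num_msgs > 0) in the trigger would pass. */
static bool demo_trigger_invariant_holds(const struct demo_channel *c)
{
    return c->num_msgs > 0;
}

/* Tolerant trigger: an empty queue is a no-op instead of an abort.
 * Returns the number of messages delivered by this call. */
static size_t demo_trigger_tolerant(struct demo_channel *c)
{
    size_t n = c->num_msgs;

    c->trigger_scheduled = false;
    c->delivered += n;
    c->num_msgs = 0;
    return n;
}
```

In this toy model, calling demo_msg_arrived() followed by demo_read_all() leaves a trigger pending with an empty queue, which is the state in which a strict assertion would abort the process. Whether the real fix tolerates the empty queue or cancels the stale trigger is not stated in this report; the sketch only shows why such an assert can fire under a tight interleaving.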
Created attachment 9467 [details] Nagios graphs

(In reply to comment #9)
> > > Got it reproduced. I would *really* like to see your workload....
> >
> > What especially?
>
> Well, you seem to hit pretty tight race conditions. Must be a busy server.

We have 15 W2k/XP workstations with Acronis True Image Advanced Workstation 11.5. These machines store their images on that Samba server every night. During the week (like last Wednesday, when we last hit this) only incremental images are stored, so not much happens on the share.

Some Nagios graphs are attached, in case they are of interest.

> The reproducer I posted for master to samba-technical is a unit test which
> directly goes into the msg_channel API. I would love to know how to reproduce
> this with pure SMB client actions, but probably that's very difficult to
> trigger. It's good to have these asserts though, this one at least pointed me
> at problems in the code that I could reproduce artificially.

If there's anything else I could capture, try, etc. to help, just let me know.
(In reply to comment #8)
> Created attachment 9466 [details]
> Patch for 4.1.1
>
> This patch is on top of plain 4.1.1. If you have the patch for bug 10250
> already applied, then this will conflict. 10250 is implicitly fixed by this as
> well I guess.

I reverted the patch for bug 10250 and applied the one you attached to this bug report.

I'll provide initial feedback on Monday.
(In reply to comment #11)
> (In reply to comment #8)
> > Created attachment 9466 [details]
> > Patch for 4.1.1
> >
> > This patch is on top of plain 4.1.1. If you have the patch for bug 10250
> > already applied, then this will conflict. 10250 is implicitly fixed by this as
> > well I guess.
>
> I reverted the patch of bug 10250 and applied the one you appended to this bug
> report.
>
> I'll provide a first feedback on monday.

The patch here also fixes the segfaults from bug report #10250.

I also haven't seen the "internal error" segfaults that I reported in this bug since I applied the patch. That observation isn't worth much, though, because, as you said, this race condition is not easy to hit. But if it passes your tests, it should be fine. If it appears again, I'll update the bug report, of course.
After applying the patch (https://bugzilla.samba.org/attachment.cgi?id=9466), I confirm the problem is solved. Thanks!
Created attachment 9484 [details] Patch for 4.1
Comment on attachment 9484 [details] Patch for 4.1

Looks good. Do we also need this for 4.0?
Created attachment 9485 [details] Patch for 4.0
Comment on attachment 9485 [details] Patch for 4.0 Ah:-)
Karo, please add to 4.1 and 4.0. Thanks, Volker
Pushed to autobuild-v4-1-test and autobuild-v4-0-test.
Pushed to v4-1-test and v4-0-test. Closing out bug report. Thanks!
*** Bug 9903 has been marked as a duplicate of this bug. ***