Bug 10284 - Segfaults: internal error
Summary: Segfaults: internal error
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services
Version: 4.1.1
Hardware: x64 Linux
Importance: P5 normal
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Duplicates: 9903 10282
Depends on:
Blocks:
Reported: 2013-11-21 07:57 UTC by Marc Muehlfeld
Modified: 2019-06-11 21:24 UTC
CC List: 4 users

See Also:


Attachments
Core dump file (1.14 MB, application/x-gzip)
2013-11-21 07:57 UTC, Marc Muehlfeld
no flags
gdb backtrace (8.76 KB, text/plain)
2013-11-21 07:59 UTC, Marc Muehlfeld
no flags
smb.conf (2.50 KB, text/plain)
2013-11-21 07:59 UTC, Marc Muehlfeld
no flags
Patch for 4.1.1 (6.27 KB, text/plain)
2013-11-22 08:34 UTC, Volker Lendecke
no flags
Nagios graphs (133.42 KB, image/png)
2013-11-22 10:17 UTC, Marc Muehlfeld
no flags
Patch for 4.1 (6.38 KB, patch)
2013-11-27 10:01 UTC, Volker Lendecke
metze: review+
Patch for 4.0 (6.38 KB, patch)
2013-11-27 10:12 UTC, Volker Lendecke
metze: review+

Description Marc Muehlfeld 2013-11-21 07:57:53 UTC
Created attachment 9454
Core dump file

Sometimes an smbd process crashes on my Samba 4.1 member server:

smbd: ../source3/lib/msg_channel.c:298: msg_channel_trigger: Assertion `num_msgs > 0' failed.
[2013/11/20 04:09:58.618436,  0, pid=30356] ../lib/util/fault.c:72(fault_report)
  ===============================================================
[2013/11/20 04:09:58.618555,  0, pid=30356] ../lib/util/fault.c:73(fault_report)
  INTERNAL ERROR: Signal 6 in pid 30356 (4.1.0)
  Please read the Trouble-Shooting section of the Samba HOWTO
[2013/11/20 04:09:58.618643,  0, pid=30356] ../lib/util/fault.c:75(fault_report)
  ===============================================================
[2013/11/20 04:09:58.618704,  0, pid=30356] ../source3/lib/util.c:785(smb_panic_s3)
  PANIC (pid 30356): internal error
[2013/11/20 04:09:58.654962,  0, pid=30356] ../source3/lib/util.c:896(log_stack_trace)
  BACKTRACE: 24 stack frames:
   #0 /usr/local/samba/lib/libsmbconf.so.0(log_stack_trace+0x1f) [0x7fe4601efc06]
   #1 /usr/local/samba/lib/libsmbconf.so.0(smb_panic_s3+0x6d) [0x7fe4601efa75]
   #2 /usr/local/samba/lib/libsamba-util.so.0(smb_panic+0x28) [0x7fe46205ccfb]
   #3 /usr/local/samba/lib/libsamba-util.so.0(+0x1c9fb) [0x7fe46205c9fb]
   #4 /usr/local/samba/lib/libsamba-util.so.0(+0x1ca10) [0x7fe46205ca10]
   #5 /lib64/libpthread.so.0(+0x3dc9e0f500) [0x7fe46228c500]
   #6 /lib64/libc.so.6(gsignal+0x35) [0x7fe45ea988e5]
   #7 /lib64/libc.so.6(abort+0x175) [0x7fe45ea9a0c5]
   #8 /lib64/libc.so.6(+0x3dc962ba0e) [0x7fe45ea91a0e]
   #9 /lib64/libc.so.6(__assert_perror_fail+0) [0x7fe45ea91ad0]
   #10 /usr/local/samba/lib/libsmbconf.so.0(+0x317fe) [0x7fe4601fb7fe]
   #11 /usr/local/samba/lib/samba/libtevent.so.0(tevent_common_loop_immediate+0x1f9) [0x7fe461287ee4]
   #12 /usr/local/samba/lib/libsmbconf.so.0(run_events_poll+0x57) [0x7fe46020c197]
   #13 /usr/local/samba/lib/libsmbconf.so.0(+0x42844) [0x7fe46020c844]
   #14 /usr/local/samba/lib/samba/libtevent.so.0(_tevent_loop_once+0xfc) [0x7fe461286fa9]
   #15 /usr/local/samba/lib/samba/libsmbd_base.so(smbd_process+0x1321) [0x7fe461800a1e]
   #16 /usr/sbin/smbd(+0x9c38) [0x7fe4626c5c38]
   #17 /usr/local/samba/lib/libsmbconf.so.0(run_events_poll+0x544) [0x7fe46020c684]
   #18 /usr/local/samba/lib/libsmbconf.so.0(+0x4295a) [0x7fe46020c95a]
   #19 /usr/local/samba/lib/samba/libtevent.so.0(_tevent_loop_once+0xfc) [0x7fe461286fa9]
   #20 /usr/sbin/smbd(+0xa8d7) [0x7fe4626c68d7]
   #21 /usr/sbin/smbd(main+0x15d1) [0x7fe4626c7ff9]
   #22 /lib64/libc.so.6(__libc_start_main+0xfd) [0x7fe45ea84cdd]
   #23 /usr/sbin/smbd(+0x5809) [0x7fe4626c1809]
[2013/11/20 04:09:58.655441,  0, pid=30356] ../source3/lib/util.c:797(smb_panic_s3)
  smb_panic(): calling panic action [/usr/local/bin/panic-action 30356]
[2013/11/20 04:10:00.677661,  0, pid=30356] ../source3/lib/util.c:805(smb_panic_s3)
  smb_panic(): action returned status 0
[2013/11/20 04:10:00.677866,  0, pid=30356] ../source3/lib/dumpcore.c:317(dump_core)
  dumping core in /var/log/samba//cores/smbd
[2013/11/20 04:10:00.689181,  1, pid=98402] ../source3/smbd/process.c:480(receive_smb_talloc)
  receive_smb_raw_talloc failed for client ipv4:10.1.1.167:1244 read error = NT_STATUS_CONNECTION_RESET.
[2013/11/20 04:10:00.983995,  1, pid=98403] ../source3/lib/dbwrap/dbwrap_watch.c:345(dbwrap_watch_record_stored)
  messaging_send to 30356 failed: NT_STATUS_INVALID_HANDLE
[2013/11/20 04:10:01.002675,  1, pid=98403] ../source3/lib/dbwrap/dbwrap_watch.c:345(dbwrap_watch_record_stored)
  messaging_send to 30356 failed: NT_STATUS_INVALID_HANDLE
[2013/11/20 04:10:01.003536,  1, pid=98403] ../source3/lib/dbwrap/dbwrap_watch.c:345(dbwrap_watch_record_stored)
  messaging_send to 30356 failed: NT_STATUS_INVALID_HANDLE
[2013/11/20 04:10:01.004056,  1, pid=98403] ../source3/lib/dbwrap/dbwrap_watch.c:345(dbwrap_watch_record_stored)
  messaging_send to 30356 failed: NT_STATUS_INVALID_HANDLE
.....

I can't say which operation causes this, and it happens very rarely.
Comment 1 Marc Muehlfeld 2013-11-21 07:59:11 UTC
Created attachment 9455
gdb backtrace
Comment 2 Marc Muehlfeld 2013-11-21 07:59:41 UTC
Created attachment 9456
smb.conf
Comment 3 Volker Lendecke 2013-11-21 14:57:17 UTC
Got it reproduced. I would *really* like to see your workload....
Comment 4 Christian Ambach 2013-11-21 18:35:52 UTC
*** Bug 10282 has been marked as a duplicate of this bug. ***
Comment 5 Volker Lendecke 2013-11-21 21:23:01 UTC
Just FYI: I've got a patch that is in private autobuild. If that survives it, I'll upload it here.
Comment 6 Marc Muehlfeld 2013-11-21 22:20:52 UTC
(In reply to comment #3)
> Got it reproduced. I would *really* like to see your workload....

What specifically?

(In reply to comment #5)
> Just FYI: I've got a patch that is in private autobuild. If that survives it,
> I'll upload it here.

Great.
Do you have an idea when this bug occurs? I have hit it only twice in two weeks and I have no idea what causes it, so I can't force a reproduction. I just have to wait.
Comment 7 Nick Semenkovich 2013-11-21 22:40:38 UTC
(In reply to comment #3)
> Got it reproduced. I would *really* like to see your workload....

FWIW, my crash observation (anecdote?): this occurred while Windows 8.1's Offline Files feature was syncing the mapped folders of all users on a machine.

(A number of those folders & files were owned by users who were logged in & using those folders on other machines.)
Comment 8 Volker Lendecke 2013-11-22 08:34:47 UTC
Created attachment 9466
Patch for 4.1.1

This patch is on top of plain 4.1.1. If you have the patch for bug 10250 already applied, then this will conflict. 10250 is implicitly fixed by this as well I guess.

Christian, you've shown interest in this defect on IRC, so I'll give it to you for review :-)
Comment 9 Volker Lendecke 2013-11-22 09:43:24 UTC
(In reply to comment #6)
> (In reply to comment #3)
> > Got it reproduced. I would *really* like to see your workload....
> 
> What specifically?

Well, you seem to hit pretty tight race conditions. Must be a busy server.

> Great.
> Do you have an idea when this bug occurs? I have hit it only twice in two weeks
> and I have no idea what causes it, so I can't force a reproduction. I just
> have to wait.

The reproducer I posted to samba-technical for master is a unit test that goes directly into the msg_channel API. I would love to know how to reproduce this with pure SMB client actions, but that is probably very difficult to trigger. It's good to have these asserts, though: this one at least pointed me to problems in the code that I could reproduce artificially.
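A toy illustration, not Samba code. The log shows the assertion `num_msgs > 0` failing in `msg_channel_trigger` beneath `tevent_common_loop_immediate`, which means it fails inside a callback that was scheduled earlier and run later from the event loop. The C sketch below is a hypothetical, simplified model of how such an assertion can fire when the queue it checks is drained between scheduling the callback and running it. All names (`toy_channel`, `toy_channel_post`, `toy_channel_read_now`, `toy_channel_trigger`) are invented for illustration. The sketch does not reproduce the real msg_channel or tevent API, the unit test mentioned above, or the fix in the attached patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; NOT Samba's msg_channel or tevent API. */
struct toy_channel {
	int num_msgs;         /* messages queued for the reader */
	bool trigger_pending; /* a deferred delivery callback has been scheduled */
};

/* A message arrives: queue it and schedule a deferred delivery. */
static void toy_channel_post(struct toy_channel *c)
{
	c->num_msgs += 1;
	c->trigger_pending = true;
}

/* A reader consumes one message synchronously, bypassing the deferred
 * path. It leaves trigger_pending set, so a stale trigger remains. */
static void toy_channel_read_now(struct toy_channel *c)
{
	if (c->num_msgs > 0) {
		c->num_msgs -= 1;
	}
}

/* Deferred callback, run later from the (imagined) event loop.
 * Returns the number of messages delivered. */
static int toy_channel_trigger(struct toy_channel *c)
{
	c->trigger_pending = false;

	/* A naive version would do: assert(c->num_msgs > 0);
	 * That aborts (SIGABRT) if the queue was drained after this
	 * callback was scheduled. Here the stale trigger is a no-op. */
	if (c->num_msgs == 0) {
		return 0;
	}

	int delivered = c->num_msgs;
	c->num_msgs = 0;
	return delivered;
}
```

For example, if you post one message, read it with `toy_channel_read_now`, and then run the still-pending trigger, the trigger returns 0 instead of aborting. This report does not say whether the real patch takes a similar approach.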
Comment 10 Marc Muehlfeld 2013-11-22 10:17:01 UTC
Created attachment 9467
Nagios graphs

(In reply to comment #9)
> > > Got it reproduced. I would *really* like to see your workload....
> > 
> > What especially?
> 
> Well, you seem to hit pretty tight race conditions. Must be a busy server.

We have 15 W2k/XP workstations running Acronis True Image Advanced Workstation 11.5. Every night these machines store their images on that Samba server. During the week (as on last Wednesday, when we last hit this) only incremental images are stored, so there isn't much activity on the share. I've attached some Nagios graphs in case they are of interest.

> The reproducer I posted to samba-technical for master is a unit test that
> goes directly into the msg_channel API. I would love to know how to reproduce
> this with pure SMB client actions, but that is probably very difficult to
> trigger. It's good to have these asserts, though: this one at least pointed me
> to problems in the code that I could reproduce artificially.

If there's anything else I could capture, try, etc. to help, just let me know.
Comment 11 Marc Muehlfeld 2013-11-22 19:59:03 UTC
(In reply to comment #8)
> Created attachment 9466
> Patch for 4.1.1
> 
> This patch is on top of plain 4.1.1. If you have the patch for bug 10250
> already applied, then this will conflict. 10250 is implicitly fixed by this as
> well I guess.

I reverted the patch of bug 10250 and applied the one you appended to this bug report.

I'll provide initial feedback on Monday.
Comment 12 Marc Muehlfeld 2013-11-26 07:41:49 UTC
(In reply to comment #11)
> (In reply to comment #8)
> > Created attachment 9466
> > Patch for 4.1.1
> > 
> > This patch is on top of plain 4.1.1. If you have the patch for bug 10250
> > already applied, then this will conflict. 10250 is implicitly fixed by this as
> > well I guess.
> 
> I reverted the patch of bug 10250 and applied the one you appended to this bug
> report.
> 
> I'll provide initial feedback on Monday.

The patch here fixes the segfaults from bug report #10250, too.

I also haven't had the "internal error" crashes I reported here since applying the patch. But I don't think this tells us much, since, as you said, this race condition is not easy to hit. If it passes your tests, it should be fine. And if it appears again, I'll update the bug report, of course.
Comment 13 Kinglok, Fong 2013-11-26 10:00:08 UTC
After applying the patch (https://bugzilla.samba.org/attachment.cgi?id=9466), I can confirm that the problem is solved.

Thanks!
Comment 14 Volker Lendecke 2013-11-27 10:01:33 UTC
Created attachment 9484
Patch for 4.1
Comment 15 Stefan Metzmacher 2013-11-27 10:10:53 UTC
Comment on attachment 9484
Patch for 4.1

Looks good, do we also need this for 4.0?
Comment 16 Volker Lendecke 2013-11-27 10:12:28 UTC
Created attachment 9485
Patch for 4.0
Comment 17 Stefan Metzmacher 2013-11-27 10:13:02 UTC
Comment on attachment 9485
Patch for 4.0

Ah:-)
Comment 18 Volker Lendecke 2013-11-27 11:11:27 UTC
Karo, please add to 4.1 and 4.0.

Thanks,

Volker
Comment 19 Karolin Seeger 2013-11-28 10:17:30 UTC
Pushed to autobuild-v4-1-test and autobuild-v4-0-test.
Comment 20 Karolin Seeger 2013-11-29 08:05:10 UTC
Pushed to v4-1-test and v4-0-test.
Closing out bug report.

Thanks!
Comment 21 Andrew Bartlett 2019-06-11 21:02:03 UTC
*** Bug 9903 has been marked as a duplicate of this bug. ***