14636 – assert failed: fsp->notify != NULL in change_notify_remove_request (source3/smbd/notify.c:384)

Bug 14636 - assert failed: fsp->notify != NULL in change_notify_remove_request (source3/smbd/notify.c:384)

Summary: assert failed: fsp->notify != NULL in change_notify_remove_request (source3/s...

Status:	RESOLVED FIXED

Alias:	None

Product:	Samba 4.1 and newer
Classification:	Unclassified
Component:	File services (show other bugs)
Version:	4.14.0rc4
Hardware:	x64 FreeBSD

Importance:	P5 normal (vote)
Target Milestone:	---
Assignee:	Karolin Seeger
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-02-13 18:47 UTC by Peter Eriksson
Modified:	2021-03-11 11:48 UTC (History)
CC List:	4 users (show)

See Also:

Attachments
GDB stack backtrace (6.80 KB, text/plain) 2021-03-02 22:38 UTC, Peter Eriksson	no flags	Details
debug level 10 log of panic (207.63 KB, text/plain) 2021-03-03 01:50 UTC, Jeremy Allison	no flags	Details
debug level 10 log of panic (262.70 KB, text/plain) 2021-03-03 01:52 UTC, Jeremy Allison	no flags	Details
gdb stack backtrace (23.94 KB, text/plain) 2021-03-03 01:55 UTC, Jeremy Allison	no flags	Details
Possible patch (1.10 KB, patch) 2021-03-03 18:37 UTC, Volker Lendecke	no flags	Details
New patch (1.97 KB, patch) 2021-03-03 18:40 UTC, Volker Lendecke	slow: review+	Details
tar file with torture script and two backtraces (3.41 KB, application/gzip) 2021-03-03 21:19 UTC, Peter Eriksson	no flags	Details
torture script for testing restart loops (2.42 KB, application/x-shellscript) 2021-03-03 21:37 UTC, Peter Eriksson	no flags	Details
Patch for 4.12 (2.34 KB, patch) 2021-03-05 17:01 UTC, Volker Lendecke	jra: review+	Details
Patch for 4.13 (2.31 KB, patch) 2021-03-05 17:03 UTC, Volker Lendecke	jra: review+	Details
Patch for 4.14 (2.33 KB, patch) 2021-03-05 17:04 UTC, Volker Lendecke	jra: review+	Details
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Peter Eriksson 2021-02-13 18:47:02 UTC

Just got a number of core dumps due to a failed assertion when terminating smbd processes via signal SIGTERM. This doesn't happen all them time, just on our more busy servers:

(gdb) where
#0  0x0000000804738c2a in thr_kill () from /lib/libc.so.7
#1  0x0000000804737084 in raise () from /lib/libc.so.7
#2  0x00000008046ad279 in abort () from /lib/libc.so.7
#3  0x0000000802586430 in dump_core () at ../../source3/lib/dumpcore.c:338
#4  0x000000080259508d in smb_panic_s3 (why=<optimized out>) at ../../source3/lib/util.c:849
#5  0x000000080129abdf in smb_panic (why=why@entry=0x8018e4180 "assert failed: fsp->notify != NULL")
    at ../../lib/util/fault.c:184
#6  0x000000080180b14e in change_notify_remove_request (sconn=sconn@entry=0x80fe242e0, 
    remove_req=<optimized out>) at ../../source3/smbd/notify.c:384
#7  0x000000080180bbac in smbd_notify_cancel_by_map (map=0x80ff222e0) at ../../source3/smbd/notify.c:432
#8  0x000000080180c06e in smbd_notify_cancel_by_smbreq (smbreq=<optimized out>)
    at ../../source3/smbd/notify.c:473
#9  0x00000008017f15f7 in smbd_smb2_notify_state_destructor (state=<optimized out>)
    at ../../source3/smbd/smb2_notify.c:176
#10 0x000000080150cf7f in ?? () from /usr/local/lib/libtalloc.so.2
#11 0x0000000801520c4d in tevent_req_received () from /usr/local/lib/libtevent.so.0
#12 0x00000008015208a9 in ?? () from /usr/local/lib/libtevent.so.0
#13 0x000000080150cf7f in ?? () from /usr/local/lib/libtalloc.so.2
#14 0x000000080150ce3f in ?? () from /usr/local/lib/libtalloc.so.2
#15 0x000000080150ce3f in ?? () from /usr/local/lib/libtalloc.so.2
#16 0x00000008018010d7 in exit_server_common (how=how@entry=SERVER_EXIT_NORMAL, 
    reason=0x80182c5ea "termination signal") at ../../source3/smbd/server_exit.c:169
#17 0x000000080180149e in smbd_exit_server_cleanly (explanation=<optimized out>)
    at ../../source3/smbd/server_exit.c:237
#18 0x0000000803be08f6 in exit_server_cleanly (reason=reason@entry=0x80182c5ea "termination signal")
    at ../../source3/lib/smbd_shim.c:121
#19 0x00000008017c169c in smbd_sig_term_handler (ev=<optimized out>, se=<optimized out>, 
    signum=<optimized out>, count=<optimized out>, siginfo=<optimized out>, private_data=<optimized out>)
    at ../../source3/smbd/process.c:979
#20 0x0000000801523f97 in tevent_common_invoke_signal_handler () from /usr/local/lib/libtevent.so.0
#21 0x00000008015240d5 in tevent_common_check_signal () from /usr/local/lib/libtevent.so.0
#22 0x000000080152256c in ?? () from /usr/local/lib/libtevent.so.0
#23 0x000000080151ed41 in _tevent_loop_once () from /usr/local/lib/libtevent.so.0
#24 0x000000080151efcb in tevent_common_loop_wait () from /usr/local/lib/libtevent.so.0
#25 0x00000008017c8e52 in smbd_process (ev_ctx=ev_ctx@entry=0x80fe24060, msg_ctx=msg_ctx@entry=0x80fe1d300, 
    dce_ctx=dce_ctx@entry=0x80fe0e0c0, sock_fd=sock_fd@entry=49, interactive=interactive@entry=false)
    at ../../source3/smbd/process.c:4212
#26 0x000000000102e1af in smbd_accept_connection (ev=0x80fe24060, fde=<optimized out>, flags=<optimized out>, 
    private_data=<optimized out>) at ../../source3/smbd/server.c:1014
#27 0x000000080151fb1d in tevent_common_invoke_fd_handler () from /usr/local/lib/libtevent.so.0
#28 0x000000080152294e in ?? () from /usr/local/lib/libtevent.so.0
#29 0x000000080151ed41 in _tevent_loop_once () from /usr/local/lib/libtevent.so.0
#30 0x000000080151efcb in tevent_common_loop_wait () from /usr/local/lib/libtevent.so.0
#31 0x000000000102fd39 in smbd_parent_loop (parent=0x80fe1d760, ev_ctx=0x80fe24060)
    at ../../source3/smbd/server.c:1361
#32 main (argc=<optimized out>, argv=<optimized out>) at ../../source3/smbd/server.c:2214


(gdb) list
379		 * Paranoia checks, the fsp referenced must must have the request in
380		 * its list of pending requests
381		 */
382	
383		fsp = remove_req->fsp;
384		SMB_ASSERT(fsp->notify != NULL);
385	
386		for (req = fsp->notify->requests; req; req = req->next) {
387			if (req == remove_req) {
388				break;

Comment 1 Jeremy Allison 2021-03-02 20:45:42 UTC

OK, I'm looking into this and I can't quite see how it happens.

In 4.13.x on a SHUTDOWN_CLOSE we end up in:

smbd_exit_server_cleanly()
exit_server_common()
smbXsrv_session_logoff_all()
smbXsrv_session_logoff_all_callback()
smbXsrv_session_clear_and_logoff()
smbXsrv_session_logoff()
file_close_user()

-- and for all open file handles (including directories with notify requests)--

close_file(NULL, fsp, SHUTDOWN_CLOSE);
remove_pending_change_notify_requests_by_fid(fsp, notify_status);

-- which marshals the reply and sends it (which should then clear the state->has_request flag).

change_notify_remove_request()

So there really shouldn't be any smbd_smb2_notify_state_destructor() function to be triggered when the talloc hierarchy is freed.

Hmmm. Is there any way you can spot exactly what state the smbd is in when the SIGTERM causes the crash ? I'll do some more investigations locally.

Comment 2 Peter Eriksson 2021-03-02 21:20:42 UTC

This stack trace was captured with my standard build of Samba, which used optimization, FreeBSD-provided libtalloc.so, and it also happened with the ... other unmentionable ( :-) ) Bug ... still in the code, and with dbwrap_tdb_mutexes enabled.

I haven't seen any of these crashes (or any core dumps at all actually) since I recompiled it with my fix to the Bug, without optimization, with libtalloc internally and dbwrap_tdb_mutexes disabled - so perhaps it was due to one of the these problems?

I'll see if I can provoke it to happen on our test server(s) again somehow.

Comment 3 Jeremy Allison 2021-03-02 21:36:28 UTC

I just did a naive test on latest master by using smbclient to issue a wait-notify on an existing smbd and then terminating it via kill. No issues.

Let's see if you can reproduce and if not close this one out as "INVALID".

Comment 4 Peter Eriksson 2021-03-02 22:38:22 UTC

Created attachment 16489 [details]
GDB stack backtrace

Comment 5 Peter Eriksson 2021-03-02 22:38:59 UTC

I managed to prove it again! Sorry :-)

I created a script that basically did this in a loop:

while true; do
  service samba_server restart
  sleep 1
  wait
  sleep 1
  for N in `seq 1 100`; do 
    smbclient -k //server/share -c 'notify foo' &
  done
  sleep 10
done

"service samba_server restart" stops winbindd and then smbd, and then starts them up again - in that order.

After a couple of iterations in that loop I got a shitload of coredumps in my /var/cores directory :-)

This is with Samba 4.14.0rc4 (with patches) on FreeBSD 11.4.

A fresh stack backtrace uploaded.

Comment 6 Jeremy Allison 2021-03-02 22:46:56 UTC

Reproduced in master - thanks ! Here is the script I used.

mkdir /tmp/testdir
while true; do
  killall smbd
  /usr/local/samba/sbin/smbd
  sleep 1
  wait
  sleep 1
  for N in `seq 1 100`; do
    /usr/local/samba/bin/smbclient //127.0.0.1/tmp -UUSER%PASS -c 'notify testdir' &
  done
  sleep 10
done

I'll add Metze, Ralph and Volker to this one to track it down.

Comment 7 Stefan Metzmacher 2021-03-02 23:42:13 UTC

(In reply to Jeremy Allison from comment #6)

Jeremy, can you reproduce this with log level 10
and selftest/gdb_backtrace as panic action and upload
everthing?

Comment 8 Jeremy Allison 2021-03-03 01:50:19 UTC

Created attachment 16490 [details]
debug level 10 log of panic

debug level 10 log of panic

Comment 9 Jeremy Allison 2021-03-03 01:52:16 UTC

Created attachment 16491 [details]
debug level 10 log of panic

Hopefully got the right file this time :-).

Comment 10 Jeremy Allison 2021-03-03 01:55:14 UTC

Created attachment 16492 [details]
gdb stack backtrace

Matches debug log in attachment id "16491: debug level 10 log of panic".

Comment 11 Jeremy Allison 2021-03-03 01:58:21 UTC

This is master with git refspec to of tree as 1c9add54750cb7f2b49be69a548ce8bdb15e7ac2.

Comment 12 Jeremy Allison 2021-03-03 02:08:07 UTC

These logs and gdb bt are interesting in that they're different from the one found my Peter Eriksson.

It appears to be something to do with share modes:

  BACKTRACE:
   #0 log_stack_trace + 0x3b [ip=0x7f852358a03c] [sp=0x7ffcecc098e0]
   #1 smb_panic_log + 0x1b5 [ip=0x7f8523589fb0] [sp=0x7ffcecc0a1f0]
   #2 smb_panic + 0x1c [ip=0x7f8523589fcf] [sp=0x7ffcecc0a210]
   #3 fault_report + 0x91 [ip=0x7f8523589ad6] [sp=0x7ffcecc0a230]
   #4 sig_fault + 0x19 [ip=0x7f8523589aef] [sp=0x7ffcecc0a2e0]
   #5 funlockfile + 0x60 [ip=0x7f8522b90bb0] [sp=0x7ffcecc0a300]
   #6 __pthread_mutex_lock_full + 0x415 [ip=0x7f8522b874d5] [sp=0x7ffcecc0a8a0]
   #7 chain_mutex_lock + 0x27 [ip=0x7f85222b22c1] [sp=0x7ffcecc0a900]
   #8 tdb_mutex_lock + 0xa5 [ip=0x7f85222b240c] [sp=0x7ffcecc0a930]
   #9 fcntl_lock + 0x55 [ip=0x7f85222a622b] [sp=0x7ffcecc0a9b0]
   #10 tdb_brlock + 0xa2 [ip=0x7f85222a63ff] [sp=0x7ffcecc0aa20]
   #11 tdb_nest_lock + 0x1ba [ip=0x7f85222a69f0] [sp=0x7ffcecc0aa60]
   #12 tdb_lock_list + 0x7d [ip=0x7f85222a6c4e] [sp=0x7ffcecc0aaa0]
   #13 tdb_lock + 0x2e [ip=0x7f85222a6cef] [sp=0x7ffcecc0aae0]
   #14 tdb_chainlock + 0x5d [ip=0x7f85222a7742] [sp=0x7ffcecc0ab20]
   #15 db_tdb_do_locked + 0xa2 [ip=0x7f85222c4e64] [sp=0x7ffcecc0ab60]
   #16 dbwrap_do_locked + 0x8c [ip=0x7f85222c122e] [sp=0x7ffcecc0ac30]
   #17 dbwrap_watched_do_locked + 0xc3 [ip=0x7f852305f31f] [sp=0x7ffcecc0ac80]
   #18 dbwrap_do_locked + 0x8c [ip=0x7f85222c122e] [sp=0x7ffcecc0ad30]
   #19 g_lock_lock_retry + 0x1cd [ip=0x7f8523063d5c] [sp=0x7ffcecc0ad80]
   #20 _tevent_req_notify_callback + 0x6e [ip=0x7f8523103120] [sp=0x7ffcecc0ae10]
   #21 tevent_req_finish + 0x108 [ip=0x7f8523103289] [sp=0x7ffcecc0ae30]
   #22 _tevent_req_done + 0x29 [ip=0x7f85231032b9] [sp=0x7ffcecc0ae80]
   #23 dbwrap_watched_watch_done + 0x109 [ip=0x7f852306188e] [sp=0x7ffcecc0aea0]
   #24 _tevent_req_notify_callback + 0x6e [ip=0x7f8523103120] [sp=0x7ffcecc0aef0]
   #25 tevent_req_finish + 0x108 [ip=0x7f8523103289] [sp=0x7ffcecc0af10]
   #26 tevent_req_trigger + 0x53 [ip=0x7f85231033c9] [sp=0x7ffcecc0af60]
   #27 tevent_common_invoke_immediate_handler + 0x1a3 [ip=0x7f8523101ea9] [sp=0x7ffcecc0afa0]
   #28 tevent_common_loop_immediate + 0x3b [ip=0x7f8523101fc4] [sp=0x7ffcecc0b050]
   #29 epoll_event_loop_once + 0xa2 [ip=0x7f852310cc29] [sp=0x7ffcecc0b080]
   #30 std_event_loop_once + 0x60 [ip=0x7f8523109379] [sp=0x7ffcecc0b0d0]
   #31 _tevent_loop_once + 0x126 [ip=0x7f85231006f6] [sp=0x7ffcecc0b110]
   #32 tevent_req_poll + 0x29 [ip=0x7f8523103571] [sp=0x7ffcecc0b150]
   #33 tevent_req_poll_ntstatus + 0x2b [ip=0x7f8522ffd954] [sp=0x7ffcecc0b180]
   #34 g_lock_lock + 0x265 [ip=0x7f8523064430] [sp=0x7ffcecc0b1c0]
   #35 get_share_mode_lock + 0x21f [ip=0x7f8523380627] [sp=0x7ffcecc0b260]
   #36 get_existing_share_mode_lock + 0x38 [ip=0x7f852337409c] [sp=0x7ffcecc0b360]
   #37 close_directory + 0x131 [ip=0x7f85232b48c6] [sp=0x7ffcecc0b3a0]
   #38 close_file + 0x461 [ip=0x7f85232b52ad] [sp=0x7ffcecc0b440]
   #39 file_close_user + 0x51 [ip=0x7f8523218340] [sp=0x7ffcecc0b4a0]
   #40 smbXsrv_session_logoff + 0x9e [ip=0x7f8523331110] [sp=0x7ffcecc0b4d0]
   #41 smbXsrv_session_clear_and_logoff + 0xc6 [ip=0x7f852332fb2f] [sp=0x7ffcecc0b540]
   #42 smbXsrv_session_logoff_all_callback + 0xe4 [ip=0x7f8523331720] [sp=0x7ffcecc0b580]
   #43 db_rbt_traverse_internal + 0x136 [ip=0x7f85222c4276] [sp=0x7ffcecc0b5e0]
   #44 db_rbt_traverse + 0xab [ip=0x7f85222c446c] [sp=0x7ffcecc0b6a0]
   #45 dbwrap_traverse + 0x39 [ip=0x7f85222c0c16] [sp=0x7ffcecc0b6f0]
   #46 smbXsrv_session_logoff_all + 0xc9 [ip=0x7f852333153a] [sp=0x7ffcecc0b730]
   #47 exit_server_common + 0x2f9 [ip=0x7f8523339ed3] [sp=0x7ffcecc0b770]
   #48 smbd_exit_server_cleanly + 0x21 [ip=0x7f852333a304] [sp=0x7ffcecc0b7d0]
   #49 exit_server_cleanly + 0x2c [ip=0x7f8522d60607] [sp=0x7ffcecc0b7f0]
   #50 smbd_sig_term_handler + 0x2e [ip=0x7f85232d4f5f] [sp=0x7ffcecc0b810]


I have another panic in the set of logs that is also related to share mode
deletion called from close_directory() with the same call stack above that.

   #0 log_stack_trace + 0x3b [ip=0x7f852358a03c] [sp=0x7ffcecc0a930]
   #1 smb_panic_log + 0x1b5 [ip=0x7f8523589fb0] [sp=0x7ffcecc0b240]
   #2 smb_panic + 0x1c [ip=0x7f8523589fcf] [sp=0x7ffcecc0b260]
   #3 share_mode_lock_destructor + 0x2c1 [ip=0x7f8523380e5a] [sp=0x7ffcecc0b280]
   #4 _tc_free_internal + 0x151 [ip=0x7f85230dcfdb] [sp=0x7ffcecc0b2b0]
   #5 _talloc_free_internal + 0x9d [ip=0x7f85230dd482] [sp=0x7ffcecc0b360]
   #6 _talloc_free + 0x106 [ip=0x7f85230de826] [sp=0x7ffcecc0b390]
   #7 close_directory + 0x596 [ip=0x7f85232b4d2b] [sp=0x7ffcecc0b3c0]
   #8 close_file + 0x461 [ip=0x7f85232b52ad] [sp=0x7ffcecc0b440]
   #9 file_close_user + 0x51 [ip=0x7f8523218340] [sp=0x7ffcecc0b4a0]
   #10 smbXsrv_session_logoff + 0x9e [ip=0x7f8523331110] [sp=0x7ffcecc0b4d0]
   #11 smbXsrv_session_clear_and_logoff + 0xc6 [ip=0x7f852332fb2f] [sp=0x7ffcecc0b540]
   #12 smbXsrv_session_logoff_all_callback + 0xe4 [ip=0x7f8523331720] [sp=0x7ffcecc0b580]
   #13 db_rbt_traverse_internal + 0x136 [ip=0x7f85222c4276] [sp=0x7ffcecc0b5e0]
   #14 db_rbt_traverse + 0xab [ip=0x7f85222c446c] [sp=0x7ffcecc0b6a0]
   #15 dbwrap_traverse + 0x39 [ip=0x7f85222c0c16] [sp=0x7ffcecc0b6f0]
   #16 smbXsrv_session_logoff_all + 0xc9 [ip=0x7f852333153a] [sp=0x7ffcecc0b730]
   #17 exit_server_common + 0x2f9 [ip=0x7f8523339ed3] [sp=0x7ffcecc0b770]
   #18 smbd_exit_server_cleanly + 0x21 [ip=0x7f852333a304] [sp=0x7ffcecc0b7d0]
   #19 exit_server_cleanly + 0x2c [ip=0x7f8522d60607] [sp=0x7ffcecc0b7f0]
   #20 smbd_sig_term_handler + 0x2e [ip=0x7f85232d4f5f] [sp=0x7ffcecc0b810]

It's possible this is two different bugs, but they're both related to holding open a directory with a notify and then killing the process with SIGTERM.

Comment 13 Jeremy Allison 2021-03-03 03:43:45 UTC

The logs I got seem to hint at a problem with close_directory() correctly finding the share mode record when there are large numbers of open handles on the same directory.

This seems to be similar to bug:

BUG: https://bugzilla.samba.org/show_bug.cgi?id=14625

but that is already fixed in this code by commit 0bdbe50fac680be3fe21043246b8c75005611351.

I'll do some more testing tomorrow morning to try and isolate the problem further.

Comment 14 Peter Eriksson 2021-03-03 09:37:04 UTC

Just had a quick check in my core-dump-collection. Out of (ca) 160 core dumps, one was different and referenced the share_mode_lock_destructor:

#4  0x0000000802cb7ef3 in smb_panic_s3 (
    why=0x801b85d74 "Could not unlock share mode\n")
    at ../../source3/lib/util.c:850
#5  0x000000080146f5f8 in smb_panic (
    why=0x801b85d74 "Could not unlock share mode\n")
    at ../../lib/util/fault.c:197
#6  0x0000000801a31c86 in share_mode_lock_destructor (lck=0x815d358e0)
    at ../../source3/locking/share_mode_lock.c:973
#7  0x0000000802a39b04 in _tc_free_internal (tc=0x815d35880, 
    location=0x801b3e478 "../../source3/smbd/close.c:1243")
    at ../../lib/talloc/talloc.c:1158
#8  0x0000000802a39e80 in _talloc_free_internal (ptr=0x815d358e0, 
   location=0x801b3e478 "../../source3/smbd/close.c:1243")
    at ../../lib/talloc/talloc.c:1248
#9  0x0000000802a3b1ee in _talloc_free (ptr=0x815d358e0, 
    location=0x801b3e478 "../../source3/smbd/close.c:1243")
    at ../../lib/talloc/talloc.c:1792
#10 0x0000000801970006 in close_directory (req=0x0, fsp=0x815cfd060, 
    close_type=SHUTDOWN_CLOSE) at ../../source3/smbd/close.c:1243
#11 0x0000000801970585 in close_file (req=0x0, fsp=0x815cfd060, 
    close_type=SHUTDOWN_CLOSE) at ../../source3/smbd/close.c:1344
#12 0x00000008018ddbc2 in file_close_user (sconn=0x815c60560, vuid=3222883096)
    at ../../source3/smbd/files.c:714
#13 0x00000008019e92d4 in smbXsrv_session_logoff (session=0x815c5d580)
    at ../../source3/smbd/smbXsrv_session.c:1686
#14 0x00000008019e7d0c in smbXsrv_session_clear_and_logoff (
    session=0x815c5d580) at ../../source3/smbd/smbXsrv_session.c:1193
#15 0x00000008019e98e0 in smbXsrv_session_logoff_all_callback (
    local_rec=0x7fffffffe170, private_data=0x7fffffffe280)
    at ../../source3/smbd/smbXsrv_session.c:1835
#16 0x000000080710e520 in db_rbt_traverse_internal (db=0x815cff660, f=
    0x8019e97ff <smbXsrv_session_logoff_all_callback>, 
    private_data=0x7fffffffe280, count=0x7fffffffe1f8, rw=true)
    at ../../lib/dbwrap/dbwrap_rbt.c:464
#17 0x000000080710e712 in db_rbt_traverse (db=0x815cff660, f=
    0x8019e97ff <smbXsrv_session_logoff_all_callback>, 
    private_data=0x7fffffffe280) at ../../lib/dbwrap/dbwrap_rbt.c:522
#18 0x000000080710b07a in dbwrap_traverse (db=0x815cff660, f=
    0x8019e97ff <smbXsrv_session_logoff_all_callback>, 
    private_data=0x7fffffffe280, count=0x7fffffffe274)
    at ../../lib/dbwrap/dbwrap.c:394
#19 0x00000008019e96fc in smbXsrv_session_logoff_all (client=0x815c594c0)
    at ../../source3/smbd/smbXsrv_session.c:1789
#20 0x00000008019f1fdf in exit_server_common (how=SERVER_EXIT_NORMAL, 
    reason=0x801b47bba "termination signal")
    at ../../source3/smbd/server_exit.c:168
#21 0x00000008019f2408 in smbd_exit_server_cleanly (
    explanation=0x801b47bba "termination signal")
    at ../../source3/smbd/server_exit.c:256
#22 0x0000000804574d0e in exit_server_cleanly (
    reason=0x801b47bba "termination signal")
    at ../../source3/lib/smbd_shim.c:121
#23 0x000000080198f63d in smbd_sig_term_handler (ev=0x815c60060, 
    se=0x815c56c60, signum=15, count=1, siginfo=0x0, private_data=0x815c60560)
    at ../../source3/smbd/process.c:979
#24 0x0000000802422f61 in tevent_common_invoke_signal_handler (se=0x815c56c60, 
    signum=15, count=1, siginfo=0x0, removed=0x0)
    at ../../lib/tevent/tevent_signal.c:370
#25 0x000000080242320c in tevent_common_check_signal (ev=0x815c60060)
    at ../../lib/tevent/tevent_signal.c:468
#26 0x000000080242061f in poll_event_loop_poll (ev=0x815c60060, 
    tvalp=0x7fffffffe4c0) at ../../lib/tevent/tevent_poll.c:488
#27 0x0000000802420df4 in poll_event_loop_once (ev=0x815c60060, 
    location=0x801b4a988 "../../source3/smbd/process.c:4232")
    at ../../lib/tevent/tevent_poll.c:626
#28 0x000000080241b095 in _tevent_loop_once (ev=0x815c60060, 
    location=0x801b4a988 "../../source3/smbd/process.c:4232")
    at ../../lib/tevent/tevent.c:772

Comment 15 Volker Lendecke 2021-03-03 18:37:44 UTC

Created attachment 16495 [details]
Possible patch

Can you try the attached patch?

Even with this patch running Jeremy's script causes panics, but this I believe is because the parent smbd is killed as part of the killall. If children are still running, they find an empty locking.tdb because the new smbd will CLEAR_IF_FIRST a fresh locking.tdb, the previous CLEAR_IF_FIRST protection died with the previous parent smbd.

Comment 16 Volker Lendecke 2021-03-03 18:40:30 UTC

Created attachment 16496 [details]
New patch

This one also includes another uninitialized variable read fix. I had intended to upload both at the same time. Take this file, it contains both patches.

Comment 17 Jeremy Allison 2021-03-03 18:44:06 UTC

(In reply to Volker Lendecke from comment #16)

I will fix the script to only kill child processes. I have Google work I must do first, but then I'll give this a go. Thanks a *LOT* for looking at this !

Comment 18 Volker Lendecke 2021-03-03 18:56:15 UTC

> I will fix the script to only kill child processes. I have Google work I
> must do first, but then I'll give this a go. Thanks a *LOT* for looking at
> this !

"killall smbd" is okay -- it's the immediate restart of smbd that causes trouble. The script needs to check that all smbds are actually gone before starting a new one.

Comment 19 Jeremy Allison 2021-03-03 19:18:01 UTC

(In reply to Volker Lendecke from comment #18)

So using:

killall -w smbd

should do it...

Comment 20 Jeremy Allison 2021-03-03 19:37:32 UTC

Hmmm. Actually killall -w doesn't seem to work right. I've got a version that I think will do it.

Comment 21 Jeremy Allison 2021-03-03 20:25:40 UTC

OK, with your patch and the following script:

while true; do
  echo "starting smbd"
  /usr/local/samba/sbin/smbd
  sleep 5
  wait
  sleep 1
  for N in `seq 1 100`; do 
    /usr/local/samba/bin/smbclient //127.0.0.1/tmp -Uuser%pass -c 'notify testdir' &
  done
  sleep 5
  while [ `ps axwu | grep smbd | grep -v grep|wc -l` -ne 0 ]
  do
    echo "killing smbd"
    killall smbd
    sleep 5
  done
done

I get *NO* crashes, just an increasing number of empty log files :-) !!!! Wooot!

Comment 22 Peter Eriksson 2021-03-03 21:19:17 UTC

Created attachment 16497 [details]
tar file with torture script and two backtraces

I can confirm that with the patch applied and the torture script restart-loop modified to wait for _all_ smbd processes to terminate before starting up a new one I too can run without crashes.

See the attached tar.gz file for my version of the torture script (for freebsd). If I run it with:
  ./samba-torture service
or
  ./samba-torture pkill

then I will get two variants of core dumps (included in the tar.gz file).

If I run it with:
  ./samba-torture pidfiles

then it will run fine (for as long as I tested it).


The differences are:

./samba-torture service: 

Uses a "standard" (modified version from FreeBSD ports version of Samba) Freebsd restart script called as:

   service samba_server restart

that uses the smbd/winbindd pid files to kill the services. However, it only waits for the master process pointed to by the pid files to terminate before it restarts things again - so probably some smbd processes are a bit slow to terminate.


./samba-server pkill:

Just does a "pkill smbd" & "pkill winbindd" so basically kills them in random order


./samba-torture pidfiles:

kills the master processes pointed to via the pid files. _And_ then waits in a loop until all smbd/winbindd processes are gone before starting things up again.

I'll check and see if the /etc/rc.d/samba_server service script can be modified to really wait for all processes to die before starting up new ones. 


Example output:

# service samba_server restart
Performing sanity check on Samba configuration: OK
Stopping winbindd.
Waiting for PIDS: 52016.
Stopping smbd.
Waiting for PIDS: 52022.
Performing sanity check on Samba configuration: OK
Starting smbd.
Starting winbindd.

Comment 23 Peter Eriksson 2021-03-03 21:37:15 UTC

Created attachment 16498 [details]
torture script for testing restart loops

An updated version of my torture script.

I'm pondering how to really write a system restart script that handles this problem (needs to wait for all smbd's to die before starting up new ones).

Doing it the way I'm doing it with pgrep doesn't really work in the general case. Suppose I'm running multiple samba "servers" on separate IP addresses on the same system - then restarting one will still se the "others" smbd processes...

Hmm. It really would be nice if the master smbd process didn't terminate until the last of it's sub-processes has died...

Comment 24 Jeremy Allison 2021-03-03 22:12:29 UTC

Also tested this with the patch applied to 4.13.next, 4.14.next.

Everything works ! I think you nailed it Volker. Do you want to push to a gitlab MR and I'll RB+ and push ?

Comment 25 Samba QA Contact 2021-03-05 11:23:04 UTC

This bug was referenced in samba master:

84b634c613352fc1da8e1525d72597c526d534d2
654c18a244f060d81280493a324b98602a69dbbf

Comment 26 Volker Lendecke 2021-03-05 17:01:32 UTC

Created attachment 16502 [details]
Patch for 4.12

Comment 27 Volker Lendecke 2021-03-05 17:03:04 UTC

Created attachment 16503 [details]
Patch for 4.13

Comment 28 Volker Lendecke 2021-03-05 17:04:17 UTC

Created attachment 16504 [details]
Patch for 4.14

Comment 29 Volker Lendecke 2021-03-05 17:04:33 UTC

4.11 and older are not affected

Comment 30 Jeremy Allison 2021-03-05 19:22:20 UTC

Re-assigning to Karolin for inclusion in 4.14.next, 4.13.next, 4.12.next.

Comment 31 Peter Eriksson 2021-03-06 13:10:45 UTC

A question:
Regarding the "restarting smbd before all subprocesses has died" issue:
Should perhaps "smbcontrol" be modified to also wait?

A quick test with my stress-test script seems to indicate that "smbcontrol smbd shutdown" only terminates the master process.

And I guess various systemd/init scripts for various systems should be fixed too...

(I have submitted a patch for FreeBSD's one).

Comment 32 Karolin Seeger 2021-03-08 07:17:17 UTC

(In reply to Jeremy Allison from comment #30)
Pushed to autobuild-v4-{14,13,12}-test.

Comment 33 Samba QA Contact 2021-03-08 08:52:04 UTC

This bug was referenced in samba v4-12-test:

df832cb62c01bf6a2a801340a4434c0db51c34e0
5dd17586cd600518c3187b4af2d4cc6167d52eb7

Comment 34 Samba QA Contact 2021-03-08 09:48:03 UTC

This bug was referenced in samba v4-13-test:

efd3ee23123c2cc7685113f4253b800258b7532f
6c5e6046345914d8e0660d9d279d8abc3921535a

Comment 35 Samba QA Contact 2021-03-08 10:59:03 UTC

This bug was referenced in samba v4-14-test:

02264306200fc718c066ea2ecdadd1f03ffb9ea3
f912b8f600a2e85b594c0ae84d687a49f958ebfa

Comment 36 Jeremy Allison 2021-03-08 18:31:36 UTC

(In reply to Peter Eriksson from comment #31)

Maybe, but that's a bug for another day (if indeed it is one :-).

Comment 37 Samba QA Contact 2021-03-09 11:01:09 UTC

This bug was referenced in samba v4-13-stable (Release samba-4.13.5):

efd3ee23123c2cc7685113f4253b800258b7532f
6c5e6046345914d8e0660d9d279d8abc3921535a

Comment 38 Samba QA Contact 2021-03-09 12:38:59 UTC

This bug was referenced in samba v4-14-stable (Release samba-4.14.0):

02264306200fc718c066ea2ecdadd1f03ffb9ea3
f912b8f600a2e85b594c0ae84d687a49f958ebfa

Comment 39 Karolin Seeger 2021-03-10 08:14:05 UTC

(In reply to Peter Eriksson from comment #31)
Hi Peter,

please open another bug to track this issue, please.

Thanks!

Comment 40 Karolin Seeger 2021-03-10 08:14:52 UTC

Pushed to all branches.
Closing out bug report.

Thanks!

Comment 41 Samba QA Contact 2021-03-11 11:48:53 UTC

This bug was referenced in samba v4-12-stable (Release samba-4.12.12):

df832cb62c01bf6a2a801340a4434c0db51c34e0
5dd17586cd600518c3187b4af2d4cc6167d52eb7