Just got a number of core dumps due to a failed assertion when terminating smbd processes via SIGTERM. This doesn't happen all the time, just on our busier servers:

(gdb) where
#0  0x0000000804738c2a in thr_kill () from /lib/libc.so.7
#1  0x0000000804737084 in raise () from /lib/libc.so.7
#2  0x00000008046ad279 in abort () from /lib/libc.so.7
#3  0x0000000802586430 in dump_core () at ../../source3/lib/dumpcore.c:338
#4  0x000000080259508d in smb_panic_s3 (why=<optimized out>) at ../../source3/lib/util.c:849
#5  0x000000080129abdf in smb_panic (why=why@entry=0x8018e4180 "assert failed: fsp->notify != NULL") at ../../lib/util/fault.c:184
#6  0x000000080180b14e in change_notify_remove_request (sconn=sconn@entry=0x80fe242e0, remove_req=<optimized out>) at ../../source3/smbd/notify.c:384
#7  0x000000080180bbac in smbd_notify_cancel_by_map (map=0x80ff222e0) at ../../source3/smbd/notify.c:432
#8  0x000000080180c06e in smbd_notify_cancel_by_smbreq (smbreq=<optimized out>) at ../../source3/smbd/notify.c:473
#9  0x00000008017f15f7 in smbd_smb2_notify_state_destructor (state=<optimized out>) at ../../source3/smbd/smb2_notify.c:176
#10 0x000000080150cf7f in ?? () from /usr/local/lib/libtalloc.so.2
#11 0x0000000801520c4d in tevent_req_received () from /usr/local/lib/libtevent.so.0
#12 0x00000008015208a9 in ?? () from /usr/local/lib/libtevent.so.0
#13 0x000000080150cf7f in ?? () from /usr/local/lib/libtalloc.so.2
#14 0x000000080150ce3f in ?? () from /usr/local/lib/libtalloc.so.2
#15 0x000000080150ce3f in ?? () from /usr/local/lib/libtalloc.so.2
#16 0x00000008018010d7 in exit_server_common (how=how@entry=SERVER_EXIT_NORMAL, reason=0x80182c5ea "termination signal") at ../../source3/smbd/server_exit.c:169
#17 0x000000080180149e in smbd_exit_server_cleanly (explanation=<optimized out>) at ../../source3/smbd/server_exit.c:237
#18 0x0000000803be08f6 in exit_server_cleanly (reason=reason@entry=0x80182c5ea "termination signal") at ../../source3/lib/smbd_shim.c:121
#19 0x00000008017c169c in smbd_sig_term_handler (ev=<optimized out>, se=<optimized out>, signum=<optimized out>, count=<optimized out>, siginfo=<optimized out>, private_data=<optimized out>) at ../../source3/smbd/process.c:979
#20 0x0000000801523f97 in tevent_common_invoke_signal_handler () from /usr/local/lib/libtevent.so.0
#21 0x00000008015240d5 in tevent_common_check_signal () from /usr/local/lib/libtevent.so.0
#22 0x000000080152256c in ?? () from /usr/local/lib/libtevent.so.0
#23 0x000000080151ed41 in _tevent_loop_once () from /usr/local/lib/libtevent.so.0
#24 0x000000080151efcb in tevent_common_loop_wait () from /usr/local/lib/libtevent.so.0
#25 0x00000008017c8e52 in smbd_process (ev_ctx=ev_ctx@entry=0x80fe24060, msg_ctx=msg_ctx@entry=0x80fe1d300, dce_ctx=dce_ctx@entry=0x80fe0e0c0, sock_fd=sock_fd@entry=49, interactive=interactive@entry=false) at ../../source3/smbd/process.c:4212
#26 0x000000000102e1af in smbd_accept_connection (ev=0x80fe24060, fde=<optimized out>, flags=<optimized out>, private_data=<optimized out>) at ../../source3/smbd/server.c:1014
#27 0x000000080151fb1d in tevent_common_invoke_fd_handler () from /usr/local/lib/libtevent.so.0
#28 0x000000080152294e in ?? () from /usr/local/lib/libtevent.so.0
#29 0x000000080151ed41 in _tevent_loop_once () from /usr/local/lib/libtevent.so.0
#30 0x000000080151efcb in tevent_common_loop_wait () from /usr/local/lib/libtevent.so.0
#31 0x000000000102fd39 in smbd_parent_loop (parent=0x80fe1d760, ev_ctx=0x80fe24060) at ../../source3/smbd/server.c:1361
#32 main (argc=<optimized out>, argv=<optimized out>) at ../../source3/smbd/server.c:2214

(gdb) list
379	 * Paranoia checks, the fsp referenced must must have the request in
380	 * its list of pending requests
381	 */
382
383	fsp = remove_req->fsp;
384	SMB_ASSERT(fsp->notify != NULL);
385
386	for (req = fsp->notify->requests; req; req = req->next) {
387		if (req == remove_req) {
388			break;
OK, I'm looking into this and I can't quite see how it happens. In 4.13.x on a SHUTDOWN_CLOSE we end up in:

smbd_exit_server_cleanly()
exit_server_common()
smbXsrv_session_logoff_all()
smbXsrv_session_logoff_all_callback()
smbXsrv_session_clear_and_logoff()
smbXsrv_session_logoff()
file_close_user()
-- and for all open file handles (including directories with notify requests) --
close_file(NULL, fsp, SHUTDOWN_CLOSE);
remove_pending_change_notify_requests_by_fid(fsp, notify_status);
-- which marshals the reply and sends it (which should then clear the state->has_request flag).
change_notify_remove_request()

So there really shouldn't be any smbd_smb2_notify_state_destructor() function to be triggered when the talloc hierarchy is freed. Hmmm. Is there any way you can spot exactly what state the smbd is in when the SIGTERM causes the crash? I'll do some more investigations locally.
This stack trace was captured with my standard build of Samba, which used optimization, FreeBSD-provided libtalloc.so, and it also happened with the ... other unmentionable ( :-) ) Bug ... still in the code, and with dbwrap_tdb_mutexes enabled. I haven't seen any of these crashes (or any core dumps at all actually) since I recompiled it with my fix to the Bug, without optimization, with libtalloc internally and dbwrap_tdb_mutexes disabled - so perhaps it was due to one of these problems? I'll see if I can provoke it to happen on our test server(s) again somehow.
I just did a naive test on latest master by using smbclient to issue a wait-notify on an existing smbd and then terminating it via kill. No issues. Let's see if you can reproduce and if not close this one out as "INVALID".
Created attachment 16489 [details] GDB stack backtrace
I managed to provoke it again! Sorry :-) I created a script that basically did this in a loop:

while true; do
    service samba_server restart
    sleep 1
    wait
    sleep 1
    for N in `seq 1 100`; do
        smbclient -k //server/share -c 'notify foo' &
    done
    sleep 10
done

"service samba_server restart" stops winbindd and then smbd, and then starts them up again - in that order. After a couple of iterations in that loop I got a shitload of coredumps in my /var/cores directory :-) This is with Samba 4.14.0rc4 (with patches) on FreeBSD 11.4. A fresh stack backtrace uploaded.
Reproduced in master - thanks ! Here is the script I used.

mkdir /tmp/testdir
while true; do
    killall smbd
    /usr/local/samba/sbin/smbd
    sleep 1
    wait
    sleep 1
    for N in `seq 1 100`; do
        /usr/local/samba/bin/smbclient //127.0.0.1/tmp -UUSER%PASS -c 'notify testdir' &
    done
    sleep 10
done

I'll add Metze, Ralph and Volker to this one to track it down.
(In reply to Jeremy Allison from comment #6) Jeremy, can you reproduce this with log level 10 and selftest/gdb_backtrace as panic action and upload everything?
Created attachment 16490 [details] debug level 10 log of panic
Created attachment 16491 [details] debug level 10 log of panic Hopefully got the right file this time :-).
Created attachment 16492 [details] gdb stack backtrace Matches debug log in attachment id "16491: debug level 10 log of panic".
This is master with the git top of tree at 1c9add54750cb7f2b49be69a548ce8bdb15e7ac2.
These logs and gdb bt are interesting in that they're different from the one found by Peter Eriksson. It appears to be something to do with share modes:

BACKTRACE:
#0 log_stack_trace + 0x3b [ip=0x7f852358a03c] [sp=0x7ffcecc098e0]
#1 smb_panic_log + 0x1b5 [ip=0x7f8523589fb0] [sp=0x7ffcecc0a1f0]
#2 smb_panic + 0x1c [ip=0x7f8523589fcf] [sp=0x7ffcecc0a210]
#3 fault_report + 0x91 [ip=0x7f8523589ad6] [sp=0x7ffcecc0a230]
#4 sig_fault + 0x19 [ip=0x7f8523589aef] [sp=0x7ffcecc0a2e0]
#5 funlockfile + 0x60 [ip=0x7f8522b90bb0] [sp=0x7ffcecc0a300]
#6 __pthread_mutex_lock_full + 0x415 [ip=0x7f8522b874d5] [sp=0x7ffcecc0a8a0]
#7 chain_mutex_lock + 0x27 [ip=0x7f85222b22c1] [sp=0x7ffcecc0a900]
#8 tdb_mutex_lock + 0xa5 [ip=0x7f85222b240c] [sp=0x7ffcecc0a930]
#9 fcntl_lock + 0x55 [ip=0x7f85222a622b] [sp=0x7ffcecc0a9b0]
#10 tdb_brlock + 0xa2 [ip=0x7f85222a63ff] [sp=0x7ffcecc0aa20]
#11 tdb_nest_lock + 0x1ba [ip=0x7f85222a69f0] [sp=0x7ffcecc0aa60]
#12 tdb_lock_list + 0x7d [ip=0x7f85222a6c4e] [sp=0x7ffcecc0aaa0]
#13 tdb_lock + 0x2e [ip=0x7f85222a6cef] [sp=0x7ffcecc0aae0]
#14 tdb_chainlock + 0x5d [ip=0x7f85222a7742] [sp=0x7ffcecc0ab20]
#15 db_tdb_do_locked + 0xa2 [ip=0x7f85222c4e64] [sp=0x7ffcecc0ab60]
#16 dbwrap_do_locked + 0x8c [ip=0x7f85222c122e] [sp=0x7ffcecc0ac30]
#17 dbwrap_watched_do_locked + 0xc3 [ip=0x7f852305f31f] [sp=0x7ffcecc0ac80]
#18 dbwrap_do_locked + 0x8c [ip=0x7f85222c122e] [sp=0x7ffcecc0ad30]
#19 g_lock_lock_retry + 0x1cd [ip=0x7f8523063d5c] [sp=0x7ffcecc0ad80]
#20 _tevent_req_notify_callback + 0x6e [ip=0x7f8523103120] [sp=0x7ffcecc0ae10]
#21 tevent_req_finish + 0x108 [ip=0x7f8523103289] [sp=0x7ffcecc0ae30]
#22 _tevent_req_done + 0x29 [ip=0x7f85231032b9] [sp=0x7ffcecc0ae80]
#23 dbwrap_watched_watch_done + 0x109 [ip=0x7f852306188e] [sp=0x7ffcecc0aea0]
#24 _tevent_req_notify_callback + 0x6e [ip=0x7f8523103120] [sp=0x7ffcecc0aef0]
#25 tevent_req_finish + 0x108 [ip=0x7f8523103289] [sp=0x7ffcecc0af10]
#26 tevent_req_trigger + 0x53 [ip=0x7f85231033c9] [sp=0x7ffcecc0af60]
#27 tevent_common_invoke_immediate_handler + 0x1a3 [ip=0x7f8523101ea9] [sp=0x7ffcecc0afa0]
#28 tevent_common_loop_immediate + 0x3b [ip=0x7f8523101fc4] [sp=0x7ffcecc0b050]
#29 epoll_event_loop_once + 0xa2 [ip=0x7f852310cc29] [sp=0x7ffcecc0b080]
#30 std_event_loop_once + 0x60 [ip=0x7f8523109379] [sp=0x7ffcecc0b0d0]
#31 _tevent_loop_once + 0x126 [ip=0x7f85231006f6] [sp=0x7ffcecc0b110]
#32 tevent_req_poll + 0x29 [ip=0x7f8523103571] [sp=0x7ffcecc0b150]
#33 tevent_req_poll_ntstatus + 0x2b [ip=0x7f8522ffd954] [sp=0x7ffcecc0b180]
#34 g_lock_lock + 0x265 [ip=0x7f8523064430] [sp=0x7ffcecc0b1c0]
#35 get_share_mode_lock + 0x21f [ip=0x7f8523380627] [sp=0x7ffcecc0b260]
#36 get_existing_share_mode_lock + 0x38 [ip=0x7f852337409c] [sp=0x7ffcecc0b360]
#37 close_directory + 0x131 [ip=0x7f85232b48c6] [sp=0x7ffcecc0b3a0]
#38 close_file + 0x461 [ip=0x7f85232b52ad] [sp=0x7ffcecc0b440]
#39 file_close_user + 0x51 [ip=0x7f8523218340] [sp=0x7ffcecc0b4a0]
#40 smbXsrv_session_logoff + 0x9e [ip=0x7f8523331110] [sp=0x7ffcecc0b4d0]
#41 smbXsrv_session_clear_and_logoff + 0xc6 [ip=0x7f852332fb2f] [sp=0x7ffcecc0b540]
#42 smbXsrv_session_logoff_all_callback + 0xe4 [ip=0x7f8523331720] [sp=0x7ffcecc0b580]
#43 db_rbt_traverse_internal + 0x136 [ip=0x7f85222c4276] [sp=0x7ffcecc0b5e0]
#44 db_rbt_traverse + 0xab [ip=0x7f85222c446c] [sp=0x7ffcecc0b6a0]
#45 dbwrap_traverse + 0x39 [ip=0x7f85222c0c16] [sp=0x7ffcecc0b6f0]
#46 smbXsrv_session_logoff_all + 0xc9 [ip=0x7f852333153a] [sp=0x7ffcecc0b730]
#47 exit_server_common + 0x2f9 [ip=0x7f8523339ed3] [sp=0x7ffcecc0b770]
#48 smbd_exit_server_cleanly + 0x21 [ip=0x7f852333a304] [sp=0x7ffcecc0b7d0]
#49 exit_server_cleanly + 0x2c [ip=0x7f8522d60607] [sp=0x7ffcecc0b7f0]
#50 smbd_sig_term_handler + 0x2e [ip=0x7f85232d4f5f] [sp=0x7ffcecc0b810]

I have another panic in the set of logs that is also related to share mode deletion called from close_directory() with the same call stack above that.
#0 log_stack_trace + 0x3b [ip=0x7f852358a03c] [sp=0x7ffcecc0a930]
#1 smb_panic_log + 0x1b5 [ip=0x7f8523589fb0] [sp=0x7ffcecc0b240]
#2 smb_panic + 0x1c [ip=0x7f8523589fcf] [sp=0x7ffcecc0b260]
#3 share_mode_lock_destructor + 0x2c1 [ip=0x7f8523380e5a] [sp=0x7ffcecc0b280]
#4 _tc_free_internal + 0x151 [ip=0x7f85230dcfdb] [sp=0x7ffcecc0b2b0]
#5 _talloc_free_internal + 0x9d [ip=0x7f85230dd482] [sp=0x7ffcecc0b360]
#6 _talloc_free + 0x106 [ip=0x7f85230de826] [sp=0x7ffcecc0b390]
#7 close_directory + 0x596 [ip=0x7f85232b4d2b] [sp=0x7ffcecc0b3c0]
#8 close_file + 0x461 [ip=0x7f85232b52ad] [sp=0x7ffcecc0b440]
#9 file_close_user + 0x51 [ip=0x7f8523218340] [sp=0x7ffcecc0b4a0]
#10 smbXsrv_session_logoff + 0x9e [ip=0x7f8523331110] [sp=0x7ffcecc0b4d0]
#11 smbXsrv_session_clear_and_logoff + 0xc6 [ip=0x7f852332fb2f] [sp=0x7ffcecc0b540]
#12 smbXsrv_session_logoff_all_callback + 0xe4 [ip=0x7f8523331720] [sp=0x7ffcecc0b580]
#13 db_rbt_traverse_internal + 0x136 [ip=0x7f85222c4276] [sp=0x7ffcecc0b5e0]
#14 db_rbt_traverse + 0xab [ip=0x7f85222c446c] [sp=0x7ffcecc0b6a0]
#15 dbwrap_traverse + 0x39 [ip=0x7f85222c0c16] [sp=0x7ffcecc0b6f0]
#16 smbXsrv_session_logoff_all + 0xc9 [ip=0x7f852333153a] [sp=0x7ffcecc0b730]
#17 exit_server_common + 0x2f9 [ip=0x7f8523339ed3] [sp=0x7ffcecc0b770]
#18 smbd_exit_server_cleanly + 0x21 [ip=0x7f852333a304] [sp=0x7ffcecc0b7d0]
#19 exit_server_cleanly + 0x2c [ip=0x7f8522d60607] [sp=0x7ffcecc0b7f0]
#20 smbd_sig_term_handler + 0x2e [ip=0x7f85232d4f5f] [sp=0x7ffcecc0b810]

It's possible this is two different bugs, but they're both related to holding open a directory with a notify and then killing the process with SIGTERM.
The logs I got seem to hint at a problem with close_directory() correctly finding the share mode record when there are large numbers of open handles on the same directory. This seems to be similar to bug: BUG: https://bugzilla.samba.org/show_bug.cgi?id=14625 but that is already fixed in this code by commit 0bdbe50fac680be3fe21043246b8c75005611351. I'll do some more testing tomorrow morning to try and isolate the problem further.
Just had a quick check in my core-dump-collection. Out of (ca) 160 core dumps, one was different and referenced the share_mode_lock_destructor:

#4  0x0000000802cb7ef3 in smb_panic_s3 (why=0x801b85d74 "Could not unlock share mode\n") at ../../source3/lib/util.c:850
#5  0x000000080146f5f8 in smb_panic (why=0x801b85d74 "Could not unlock share mode\n") at ../../lib/util/fault.c:197
#6  0x0000000801a31c86 in share_mode_lock_destructor (lck=0x815d358e0) at ../../source3/locking/share_mode_lock.c:973
#7  0x0000000802a39b04 in _tc_free_internal (tc=0x815d35880, location=0x801b3e478 "../../source3/smbd/close.c:1243") at ../../lib/talloc/talloc.c:1158
#8  0x0000000802a39e80 in _talloc_free_internal (ptr=0x815d358e0, location=0x801b3e478 "../../source3/smbd/close.c:1243") at ../../lib/talloc/talloc.c:1248
#9  0x0000000802a3b1ee in _talloc_free (ptr=0x815d358e0, location=0x801b3e478 "../../source3/smbd/close.c:1243") at ../../lib/talloc/talloc.c:1792
#10 0x0000000801970006 in close_directory (req=0x0, fsp=0x815cfd060, close_type=SHUTDOWN_CLOSE) at ../../source3/smbd/close.c:1243
#11 0x0000000801970585 in close_file (req=0x0, fsp=0x815cfd060, close_type=SHUTDOWN_CLOSE) at ../../source3/smbd/close.c:1344
#12 0x00000008018ddbc2 in file_close_user (sconn=0x815c60560, vuid=3222883096) at ../../source3/smbd/files.c:714
#13 0x00000008019e92d4 in smbXsrv_session_logoff (session=0x815c5d580) at ../../source3/smbd/smbXsrv_session.c:1686
#14 0x00000008019e7d0c in smbXsrv_session_clear_and_logoff (session=0x815c5d580) at ../../source3/smbd/smbXsrv_session.c:1193
#15 0x00000008019e98e0 in smbXsrv_session_logoff_all_callback (local_rec=0x7fffffffe170, private_data=0x7fffffffe280) at ../../source3/smbd/smbXsrv_session.c:1835
#16 0x000000080710e520 in db_rbt_traverse_internal (db=0x815cff660, f=0x8019e97ff <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe280, count=0x7fffffffe1f8, rw=true) at ../../lib/dbwrap/dbwrap_rbt.c:464
#17 0x000000080710e712 in db_rbt_traverse (db=0x815cff660, f=0x8019e97ff <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe280) at ../../lib/dbwrap/dbwrap_rbt.c:522
#18 0x000000080710b07a in dbwrap_traverse (db=0x815cff660, f=0x8019e97ff <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe280, count=0x7fffffffe274) at ../../lib/dbwrap/dbwrap.c:394
#19 0x00000008019e96fc in smbXsrv_session_logoff_all (client=0x815c594c0) at ../../source3/smbd/smbXsrv_session.c:1789
#20 0x00000008019f1fdf in exit_server_common (how=SERVER_EXIT_NORMAL, reason=0x801b47bba "termination signal") at ../../source3/smbd/server_exit.c:168
#21 0x00000008019f2408 in smbd_exit_server_cleanly (explanation=0x801b47bba "termination signal") at ../../source3/smbd/server_exit.c:256
#22 0x0000000804574d0e in exit_server_cleanly (reason=0x801b47bba "termination signal") at ../../source3/lib/smbd_shim.c:121
#23 0x000000080198f63d in smbd_sig_term_handler (ev=0x815c60060, se=0x815c56c60, signum=15, count=1, siginfo=0x0, private_data=0x815c60560) at ../../source3/smbd/process.c:979
#24 0x0000000802422f61 in tevent_common_invoke_signal_handler (se=0x815c56c60, signum=15, count=1, siginfo=0x0, removed=0x0) at ../../lib/tevent/tevent_signal.c:370
#25 0x000000080242320c in tevent_common_check_signal (ev=0x815c60060) at ../../lib/tevent/tevent_signal.c:468
#26 0x000000080242061f in poll_event_loop_poll (ev=0x815c60060, tvalp=0x7fffffffe4c0) at ../../lib/tevent/tevent_poll.c:488
#27 0x0000000802420df4 in poll_event_loop_once (ev=0x815c60060, location=0x801b4a988 "../../source3/smbd/process.c:4232") at ../../lib/tevent/tevent_poll.c:626
#28 0x000000080241b095 in _tevent_loop_once (ev=0x815c60060, location=0x801b4a988 "../../source3/smbd/process.c:4232") at ../../lib/tevent/tevent.c:772
Created attachment 16495 [details] Possible patch Can you try the attached patch? Even with this patch, running Jeremy's script causes panics, but I believe this is because the parent smbd is killed as part of the killall. If children are still running, they find an empty locking.tdb, because the new smbd will CLEAR_IF_FIRST a fresh locking.tdb: the previous CLEAR_IF_FIRST protection died with the previous parent smbd.
Created attachment 16496 [details] New patch This one also includes another uninitialized variable read fix. I had intended to upload both at the same time. Take this file, it contains both patches.
(In reply to Volker Lendecke from comment #16) I will fix the script to only kill child processes. I have Google work I must do first, but then I'll give this a go. Thanks a *LOT* for looking at this !
> I will fix the script to only kill child processes. I have Google work I > must do first, but then I'll give this a go. Thanks a *LOT* for looking at > this ! "killall smbd" is okay -- it's the immediate restart of smbd that causes trouble. The script needs to check that all smbds are actually gone before starting a new one.
(In reply to Volker Lendecke from comment #18) So using: killall -w smbd should do it...
Hmmm. Actually killall -w doesn't seem to work right. I've got a version that I think will do it.
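For illustration only, the "wait until every smbd is gone" check discussed above could look roughly like the following. This is a minimal sketch, not the version referred to in this comment: the function name wait_until_gone and the timeout handling are made up here, and it assumes a pgrep that supports -x (exact process-name match), as on Linux procps and FreeBSD.

```shell
#!/bin/sh
# Hypothetical helper (not part of Samba): block until no process with the
# exact name "$1" exists, or give up after "$2" seconds (default 30).
wait_until_gone() {
    name=$1
    timeout=${2:-30}
    waited=0
    # pgrep exits 0 while at least one matching process is still around.
    while pgrep -x "$name" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $name to exit" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A restart loop would then call something like "killall smbd; wait_until_gone smbd 60" and only start the new smbd if that returns success. Note that matching by process name covers every smbd on the machine, not just those of one instance.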
OK, with your patch and the following script:

while true; do
    echo "starting smbd"
    /usr/local/samba/sbin/smbd
    sleep 5
    wait
    sleep 1
    for N in `seq 1 100`; do
        /usr/local/samba/bin/smbclient //127.0.0.1/tmp -Uuser%pass -c 'notify testdir' &
    done
    sleep 5
    while [ `ps axwu | grep smbd | grep -v grep|wc -l` -ne 0 ]
    do
        echo "killing smbd"
        killall smbd
        sleep 5
    done
done

I get *NO* crashes, just an increasing number of empty log files :-) !!!! Wooot!
Created attachment 16497 [details] tar file with torture script and two backtraces

I can confirm that with the patch applied, and the torture script's restart loop modified to wait for _all_ smbd processes to terminate before starting up a new one, I too can run without crashes. See the attached tar.gz file for my version of the torture script (for FreeBSD).

If I run it with:

./samba-torture service
or
./samba-torture pkill

then I will get two variants of core dumps (included in the tar.gz file). If I run it with:

./samba-torture pidfiles

then it will run fine (for as long as I tested it). The differences are:

./samba-torture service: Uses a "standard" FreeBSD restart script (a modified version of the one from the FreeBSD ports version of Samba), called as:

service samba_server restart

that uses the smbd/winbindd pid files to kill the services. However, it only waits for the master process pointed to by the pid files to terminate before it restarts things again - so probably some smbd processes are a bit slow to terminate.

./samba-torture pkill: Just does a "pkill smbd" & "pkill winbindd", so basically kills them in random order.

./samba-torture pidfiles: Kills the master processes pointed to via the pid files. _And_ then waits in a loop until all smbd/winbindd processes are gone before starting things up again.

I'll check and see if the /etc/rc.d/samba_server service script can be modified to really wait for all processes to die before starting up new ones. Example output:

# service samba_server restart
Performing sanity check on Samba configuration: OK
Stopping winbindd.
Waiting for PIDS: 52016.
Stopping smbd.
Waiting for PIDS: 52022.
Performing sanity check on Samba configuration: OK
Starting smbd.
Starting winbindd.
Created attachment 16498 [details] torture script for testing restart loops An updated version of my torture script. I'm pondering how to really write a system restart script that handles this problem (needs to wait for all smbd's to die before starting up new ones). Doing it the way I'm doing it with pgrep doesn't really work in the general case. Suppose I'm running multiple samba "servers" on separate IP addresses on the same system - then restarting one will still see the "others" smbd processes... Hmm. It really would be nice if the master smbd process didn't terminate until the last of its sub-processes has died...
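One possible way around the multiple-instance problem, sketched here purely as an idea and not taken from the attached script: snapshot the PIDs of one master process and its direct children before signalling it, then wait for exactly those PIDs. The helper names are hypothetical, and the sketch assumes that all workers are direct children of the master, that pgrep -P is available, and that the caller may probe the PIDs with kill -0 (e.g. runs as root). It does not cover children forked after the snapshot or PID reuse.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Samba or of any rc script.

# Print the PID given in $1, followed by the PIDs of its direct children.
master_and_children() {
    echo "$1"
    # pgrep exits non-zero when there are no children; that is not an error here.
    pgrep -P "$1" || true
}

# Wait until none of the PIDs given after the timeout ($1, in seconds)
# exists any more. kill -0 only checks for existence; no signal is sent.
wait_for_pids() {
    timeout=$1
    shift
    waited=0
    for pid in "$@"; do
        while kill -0 "$pid" 2>/dev/null; do
            [ "$waited" -ge "$timeout" ] && return 1
            sleep 1
            waited=$((waited + 1))
        done
    done
    return 0
}
```

A restart script could read the master PID from that instance's pid file, store the output of master_and_children in a variable, send SIGTERM to the master, and call wait_for_pids with the stored list before starting the new smbd. Other Samba instances on the same machine are then ignored.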
Also tested this with the patch applied to 4.13.next, 4.14.next. Everything works ! I think you nailed it Volker. Do you want to push to a gitlab MR and I'll RB+ and push ?
This bug was referenced in samba master: 84b634c613352fc1da8e1525d72597c526d534d2 654c18a244f060d81280493a324b98602a69dbbf
Created attachment 16502 [details] Patch for 4.12
Created attachment 16503 [details] Patch for 4.13
Created attachment 16504 [details] Patch for 4.14
4.11 and older are not affected
Re-assigning to Karolin for inclusion in 4.14.next, 4.13.next, 4.12.next.
A question: Regarding the "restarting smbd before all subprocesses have died" issue: Should perhaps "smbcontrol" be modified to also wait? A quick test with my stress-test script seems to indicate that "smbcontrol smbd shutdown" only terminates the master process. And I guess various systemd/init scripts for various systems should be fixed too... (I have submitted a patch for FreeBSD's one).
(In reply to Jeremy Allison from comment #30) Pushed to autobuild-v4-{14,13,12}-test.
This bug was referenced in samba v4-12-test: df832cb62c01bf6a2a801340a4434c0db51c34e0 5dd17586cd600518c3187b4af2d4cc6167d52eb7
This bug was referenced in samba v4-13-test: efd3ee23123c2cc7685113f4253b800258b7532f 6c5e6046345914d8e0660d9d279d8abc3921535a
This bug was referenced in samba v4-14-test: 02264306200fc718c066ea2ecdadd1f03ffb9ea3 f912b8f600a2e85b594c0ae84d687a49f958ebfa
(In reply to Peter Eriksson from comment #31) Maybe, but that's a bug for another day (if indeed it is one :-).
This bug was referenced in samba v4-13-stable (Release samba-4.13.5): efd3ee23123c2cc7685113f4253b800258b7532f 6c5e6046345914d8e0660d9d279d8abc3921535a
This bug was referenced in samba v4-14-stable (Release samba-4.14.0): 02264306200fc718c066ea2ecdadd1f03ffb9ea3 f912b8f600a2e85b594c0ae84d687a49f958ebfa
(In reply to Peter Eriksson from comment #31) Hi Peter, please open another bug to track this issue. Thanks!
Pushed to all branches. Closing out bug report. Thanks!
This bug was referenced in samba v4-12-stable (Release samba-4.12.12): df832cb62c01bf6a2a801340a4434c0db51c34e0 5dd17586cd600518c3187b4af2d4cc6167d52eb7