Bug 14660 - Daemons restarting too quickly cause core dumps
Summary: Daemons restarting too quickly cause core dumps
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services
Version: 4.14.0
Hardware: All All
Importance: P5 minor
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-03-10 09:43 UTC by Peter Eriksson
Modified: 2021-03-10 09:43 UTC (History)
CC List: 0 users

See Also:


Attachments

Description Peter Eriksson 2021-03-10 09:43:55 UTC
When restarting the daemons ("service samba_server restart" on FreeBSD, or similar) while many SMB connections are active (== many smbd processes running), there is a race between the terminating daemons and the newly started one: locks are being torn down and initialized anew at the same time, which can cause core dumps in the terminating processes. You typically won't notice this, since it happens during the termination phase and core dumps usually aren't enabled.

I.e., many system rc scripts typically wait only for the master process (the one recorded in the pid file) to terminate before assuming it's OK to start new ones, but you really need to wait for all (smbd) processes to terminate before restarting.
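
A rough way to check whether a given system is exposed to this is to count the processes that are still around right after the master has exited. This is only a sketch: count_procs is a made-up helper name, and it assumes a pgrep that supports -x (exact name match), as on FreeBSD and Linux.

```shell
# Hypothetical helper (not part of Samba): print the number of running
# processes whose name is exactly "$1".
count_procs() {
  pgrep -x "$1" 2>/dev/null | wc -l | tr -d ' '
}

# Possible use, right after the rc script's "stop" step has returned:
#   count_procs smbd
# A non-zero count means child smbd processes are still shutting down.
```

If the count is non-zero at the moment the new master is started, the old and new processes overlap and the race described above becomes possible.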

Suggested solutions:
1. Make the master smbd wait for all subprocesses to terminate before exiting.
(assumes the startup scripts wait for the master pid to terminate)

2. Modify the startup scripts for the various operating systems (including smbcontrol) to wait for _all_ processes to terminate.


This is how a quick-and-dirty scripted "stop" might be done that waits for 
all processes to terminate instead of just blindly continuing:

PIDFILES=/var/run
DAEMONS="smbd winbindd"
for D in $DAEMONS; do
  # PID of the master process, as recorded in its pid file
  MASTERPID="$(cat "$PIDFILES/$D.pid")"
  kill "$MASTERPID"
  # Wait for all processes in the session to terminate (assumes SID == MASTERPID)
  while pgrep -s "$MASTERPID" >/dev/null 2>&1; do
    sleep 1
  done
done
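
A variant of the wait loop with an upper bound might look like the following, so that a stuck child cannot hang the "stop" step forever. This is a sketch under the same assumption as above (SID == master PID); wait_for_session and the 30-second limit in the usage comment are made up for illustration.

```shell
# Hypothetical helper: wait until no process remains in session $1,
# giving up after $2 seconds.
# Returns 0 if the session emptied, 1 if the timeout was reached.
wait_for_session() {
  sid=$1
  timeout=$2
  waited=0
  while pgrep -s "$sid" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Possible use inside the loop above, replacing the bare while loop:
#   kill "$MASTERPID"
#   wait_for_session "$MASTERPID" 30 || echo "$D: processes still running" >&2
```

What to do on timeout (refuse to start, log and continue, or escalate) is left open here; the point is only that the script learns whether all processes are really gone.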


(Solaris-based systems using SMF don't have this problem, since SMF by default waits for all processes in the "contract" to terminate before continuing :-)

If nothing else, perhaps a note should be added to the documentation saying that this needs to be done?