Hi, When receiving a lot of multiple signals, smbd process is not responding. This occurs in the following environment: # uname -mrsv Linux 3.3.4 #1 Fri Mar 28 19:49:13 JST 2014 armv7l I attach to smbd process with gdb and got output like this: # gdb -p 21127 --- snip --- (gdb) bt #0 0xb6129750 in tevent_common_check_signal (ev=0xb6f44290) at ../lib/tevent/tevent_signal.c:356 #1 0xb685b00c in run_events_poll (ev=0xb6f44290, pollrtn=0, pfds=0x0, num_pfds=0) at lib/events.c:193 #2 0xb685ba9c in s3_event_loop_once (ev=0xb6f44290, location=0xb6f1ad44 "smbd/server.c:866") at lib/events.c:331 #3 0xb6126998 in _tevent_loop_once (ev=0xb6f44290, location=0xb6f1ad44 "smbd/server.c:866") at ../lib/tevent/tevent.c:494 #4 0xb6cfcb9c in smbd_parent_loop (parent=0xb6f6e938) at smbd/server.c:866 #5 0xb6cfe228 in main (argc=2, argv=0xbeceab34) at smbd/server.c:1357 (gdb) p sig_state->got_signal $1 = {count = 83510, seen = 83511} (gdb) p sig_state->signal_count $2 = {{count = 0, seen = 0}, {count = 1, seen = 1}, {count = 0, seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, { count = 0, seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, { count = 0, seen = 0}, {count = 41132, seen = 41132}, {count = 0, seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, { count = 42378, seen = 42378}, {count = 0, seen = 0} <repeats 47 times>} While count and seen is different in sig_state->got_signal (total number of all signals), both variables has same value in each signal. Therefore, smbd does not poll() in s3_event_loop_once(). This is because signal handler (tevent_common_signal_handler()) increments non-atomic variables like this: #define TEVENT_SIG_INCREMENT(s) (s).count++ --- snip --- static void tevent_common_signal_handler(int signum) { --- snip --- TEVENT_SIG_INCREMENT(sig_state->signal_count[signum]); TEVENT_SIG_INCREMENT(sig_state->got_signal); You can reproduce on ARM computer in the following commands: # PID=<pid of smbd parent> # while true; do kill -CHLD $PID; kill -USR1 $PID; done The attached patch fixes the issue by calling __sync_fetch_and_add(). You can apply this fix only when using GCC.
Created attachment 9994 [details] FIX TEVENT_SIG_INCREMENT atomic
Can this also be solved by the following (completely untested) patch? diff --git a/lib/tevent/tevent_signal.c b/lib/tevent/tevent_signal.c index cc7fb0a..0f86c84 100644 --- a/lib/tevent/tevent_signal.c +++ b/lib/tevent/tevent_signal.c @@ -40,8 +40,8 @@ #define TEVENT_SA_INFO_QUEUE_COUNT 64 struct tevent_sigcounter { - uint32_t count; - uint32_t seen; + sig_atomic_t count; + sig_atomic_t seen; }; #define TEVENT_SIG_INCREMENT(s) (s).count++ If this solved it, it might do without a gcc dependency. Thanks, Volker
(In reply to comment #2) > Can this also be solved by the following (completely untested) patch? > > diff --git a/lib/tevent/tevent_signal.c b/lib/tevent/tevent_signal.c > index cc7fb0a..0f86c84 100644 > --- a/lib/tevent/tevent_signal.c > +++ b/lib/tevent/tevent_signal.c > @@ -40,8 +40,8 @@ > #define TEVENT_SA_INFO_QUEUE_COUNT 64 > > struct tevent_sigcounter { > - uint32_t count; > - uint32_t seen; > + sig_atomic_t count; > + sig_atomic_t seen; > }; > > #define TEVENT_SIG_INCREMENT(s) (s).count++ > > If this solved it, it might do without a gcc dependency. No, this does not solve. I tried this sig_atomic_t fix, seen is larger than count in sig_state->got_signal. Increment(++) is not atomic instruction (Load -> Add -> Store) in disassembled code (ARM architecture). 90 in ../lib/tevent/tevent_signal.c 0xb6645938 <+24>: ldr r4, [pc, #160] ; 0xb66459e0 <tevent_common_signal_handler+192> 0xb664593c <+28>: ldr r3, [pc, #160] ; 0xb66459e4 <tevent_common_signal_handler+196> 0xb6645940 <+32>: add r4, pc, r4 0xb6645944 <+36>: ldr r3, [r4, r3] 0xb6645948 <+40>: add r2, r5, #65 ; 0x41 0xb664594c <+44>: ldr r1, [r3, r2, lsl #3] 0xb6645954 <+52>: add r1, r1, #1 0xb6645958 <+56>: str r1, [r3, r2, lsl #3] 91 in ../lib/tevent/tevent_signal.c 0xb664595c <+60>: ldr r2, [r3, #1040] ; 0x410 0xb6645964 <+68>: add r2, r2, #1 0xb664596c <+76>: str r2, [r3, #1040] ; 0x410 Multiple signals are asigned to tevent_common_signal_handler(). When smbd receives a signal in the middle of handling one signal, interruption occurs. I think following flow causes inconsistency between sig_state->got_signal member. Initial state: sig_state->signal_count[10] = {count=7, seen=7} sig_state->signal_count[17] = {count=3, seen=3} sig_state->got_signal = {count=10, seen=10} Flow: 1-1. Receives SIGCHLD 1-2. Load sig_state->signal_count[17].count (=3) 1-3. Add sig_state->signal_count[17].count (=4) 1-4. Store sig_state->signal_count[17].count 1-5. Load sig_state->got_signal.count (=10) 2-1. Receives SIGUSR1 <-- Interruption 2-2. Load sig_state->signal_count[10].count (=7) 2-3. Add sig_state->signal_count[10].count =(8) 2-4. Store sig_state->signal_count[10].count 2-5. Load sig_state->got_signal.count (=10) 2-6. Add sig_state->got_signal.count (=11) 2-7. Store sig_state->got_signal.count 1-6. Add sig_state->got_signal.count (=11, add old value) <-- Resume 1-7. Store sig_state->got_signal.count 1-8. Check SIGUSR1 (in tevent_common_check_signal()) --> Add sig_state->signal_count[10].seen (=8) --> Add sig_state->got_signal.seen (=11) 1-9. Check SIGCHLD (in tevent_common_check_signal()) --> Add sig_state->signal_count[17].seen (=4) --> Add sig_state->got_signal.seen (=12) <-- This is valid End state: sig_state->signal_count[10] = {count=8, seen=8} sig_state->signal_count[17] = {count=4, seen=4} sig_state->got_signal = {count=11, seen=12}
Hmm. Ok. Is there any way to increment a variable in a signal handler on the ARM architecture without using special compiler instructions? Hmm. SUSv4 has: If the signal occurs other than as the result of calling abort(), raise(), ^[CX] [Option Start] kill(), pthread_kill(), or sigqueue (), [Option End] the behavior is undefined if the signal handler refers to any object with static storage duration other than by assigning a value to an object declared as volatile sig_atomic_t, or if the signal handler calls any function in the standard library other than one of the functions listed in Signal Concepts . Furthermore, if such a call fails, the value of errno is unspecified. So strictly you can't increment a variable in a signal handler. Is this Linux? If so, we could avoid most of that nastiness by using signalfd(), but that's non-portable as well unfortunately. We might also depend on the signal pipe we are using to not overflow and write the necessary information there. write() is definitely a signal-safe function to call according to SUSv4. Thanks a lot for your analysis! Your patch certainly looks sane for your environment and points out an important bug in our signal handling. For upstream I'd like to give it some more thought to see if we can avoid a gcc'ism. If we have to go down to compiler extensions, we do need some configure checks for this. Volker
http://locklessinc.com/articles/signalsafe_locks/ looks interesting, but that's assembly also :-(. How about blocking *all* signals on first entry to a handler. A bit overkill, but would that do the trick ?
(In reply to comment #4) > Hmm. Ok. Is there any way to increment a variable in a signal handler on the > ARM architecture without using special compiler instructions? Use atomic.h (if available, but no standard API) or stdatomic.h (but C11 only), or remove sig_state->got_signal.
(In reply to comment #4) > So strictly you can't increment a variable in a signal handler. No. You can increment a sig_atomic_t (or equivalent size integer) variable if it is not modified by the main functions and a signal handler kicked by another signal. TEVENT_SIG_INCREMENT(sig_state->signal_count[signum]); The above code is safe because sig_state->signal_count[signum] is modified by the signum signal only. TEVENT_SIG_INCREMENT(sig_state->got_signal); This is not safe.
Interestingly enough, __sync_fetch_and_add() and friends seem to be the nearest thing to a standard in all the compilers we care about. They're implemented in gcc, llvm, IBM xlC on AIX, and Intel icc (10.1 and above). Oracle solaris is different (of course :-), but has: #include <atomic.h> void atomic_add_32(volatile uint32_t *target, int32_t delta); I'm planning to add waf configure tests for this and propose a patch for master for this. Cheers, Jeremy.
Created attachment 10003 [details] Proposed patch for master How about this as a patch for master ? Includes the required waf configure checks and should work on Solaris also. Can you let me know if this works for you and I'll submit it to samba-technical for inclusion ? Thanks, Jeremy.
Ping ! This has gone into master - it'd be good to know if this works for you so we can get it into 4.1.x. Jeremy.
(In reply to comment #10) > Ping ! This has gone into master - it'd be good to know if this works for > you so we can get it into 4.1.x. > > Jeremy. I tried master source (Latest commit: 7c2c6748e323fb0e54fbc2d1b773608904458e94). It looks solved. I compiled and ran this environment: # uname -mrsv Linux 3.3.4 #1 Tue Oct 15 17:00:00 JST 2013 armv7l # gcc --version gcc (Debian 4.6.3-14) 4.6.3 # /usr/local/samba/sbin/smbd -b | grep -i __sync_fetch_and_add HAVE___SYNC_FETCH_AND_ADD # grep "server role" /usr/local/samba/etc/smb.conf server role = active directory domain controller # ps -e -o pid,ppid,cmd | egrep "(samba|smb)" 17215 1 /usr/local/samba/sbin/samba 17229 17215 /usr/local/samba/sbin/samba 17230 17215 /usr/local/samba/sbin/samba 17231 17229 /usr/local/samba/sbin/smbd -D --option=server role check:inhibit=yes --foreground --- snip --- put a load on smbd process like this: # PID=17231 # while true; do kill -USR2 $PID; kill -CHLD $PID; done check with gdb: (gdb) p sig_state->got_signal $1 = {count = 427427, seen = 427427} count and seen are the same value. Thanks,
Created attachment 10162 [details] git-am fix for 4.1.x. Back-ported for 4.1.next from master.
Created attachment 10163 [details] git-am fix for 4.0.next Back-ported to 4.0.next from master patch.
Karo, please pull the patches into the release branches! Thanks, Volker
(In reply to comment #14) > Karo, please pull the patches into the release branches! > > Thanks, > > Volker Pushed to autobuild-v4-[0|1]-test.
(In reply to comment #15) > (In reply to comment #14) > > Karo, please pull the patches into the release branches! > > > > Thanks, > > > > Volker > > Pushed to autobuild-v4-[0|1]-test. Pushed to both branches. Closing out bug report. Thanks!