Bug 10640 - smbd is not responding - tevent_common_signal_handler() increments non-atomic variables
Summary: smbd is not responding - tevent_common_signal_handler() increments non-atomic...
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.6
Classification: Unclassified
Component: File services (show other bugs)
Version: 3.6.23
Hardware: Other Linux
: P5 normal
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-29 08:55 UTC by KAMEI Yutaka (mail address dead)
Modified: 2014-09-01 19:03 UTC (History)
0 users

See Also:


Attachments
FIX TEVENT_SIG_INCREMENT atomic (453 bytes, patch)
2014-05-29 08:57 UTC, KAMEI Yutaka (mail address dead)
no flags Details
Proposed patch for master (3.70 KB, patch)
2014-06-03 17:48 UTC, Jeremy Allison
no flags Details
git-am fix for 4.1.x. (3.82 KB, patch)
2014-07-30 17:06 UTC, Jeremy Allison
vl: review+
Details
git-am fix for 4.0.next (3.82 KB, patch)
2014-07-30 17:07 UTC, Jeremy Allison
vl: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description KAMEI Yutaka (mail address dead) 2014-05-29 08:55:18 UTC
Hi,

When receiving a lot of multiple signals, smbd process is not responding.
This occurs in the following environment:

    # uname -mrsv
    Linux 3.3.4 #1 Fri Mar 28 19:49:13 JST 2014 armv7l

I attach to smbd process with gdb and got output like this:

    # gdb -p 21127
    --- snip ---
    (gdb) bt
    #0  0xb6129750 in tevent_common_check_signal (ev=0xb6f44290)
        at ../lib/tevent/tevent_signal.c:356
    #1  0xb685b00c in run_events_poll (ev=0xb6f44290, pollrtn=0, pfds=0x0,
        num_pfds=0) at lib/events.c:193
    #2  0xb685ba9c in s3_event_loop_once (ev=0xb6f44290,
        location=0xb6f1ad44 "smbd/server.c:866") at lib/events.c:331
    #3  0xb6126998 in _tevent_loop_once (ev=0xb6f44290,
        location=0xb6f1ad44 "smbd/server.c:866") at ../lib/tevent/tevent.c:494
    #4  0xb6cfcb9c in smbd_parent_loop (parent=0xb6f6e938) at smbd/server.c:866
    #5  0xb6cfe228 in main (argc=2, argv=0xbeceab34) at smbd/server.c:1357
    (gdb) p sig_state->got_signal
    $1 = {count = 83510, seen = 83511}
    (gdb) p sig_state->signal_count
    $2 = {{count = 0, seen = 0}, {count = 1, seen = 1}, {count = 0, seen = 0},
      {count = 0, seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, {
        count = 0, seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, {
        count = 0, seen = 0}, {count = 41132, seen = 41132}, {count = 0,
        seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, {count = 0,
        seen = 0}, {count = 0, seen = 0}, {count = 0, seen = 0}, {
        count = 42378, seen = 42378}, {count = 0, seen = 0} <repeats 47 times>}


While count and seen is different in sig_state->got_signal (total number of
all signals), both variables has same value in each signal.
Therefore, smbd does not poll() in s3_event_loop_once().

This is because signal handler (tevent_common_signal_handler()) increments
non-atomic variables like this:

    #define TEVENT_SIG_INCREMENT(s) (s).count++
    --- snip ---
    static void tevent_common_signal_handler(int signum)
    {
    --- snip ---
    TEVENT_SIG_INCREMENT(sig_state->signal_count[signum]);
    TEVENT_SIG_INCREMENT(sig_state->got_signal);


You can reproduce on ARM computer in the following commands:

    # PID=<pid of smbd parent>
    # while true; do kill -CHLD $PID; kill -USR1 $PID; done

The attached patch fixes the issue by calling __sync_fetch_and_add().
You can apply this fix only when using GCC.
Comment 1 KAMEI Yutaka (mail address dead) 2014-05-29 08:57:20 UTC
Created attachment 9994 [details]
FIX TEVENT_SIG_INCREMENT atomic
Comment 2 Volker Lendecke 2014-05-29 09:36:04 UTC
Can this also be solved by the following (completely untested) patch?

diff --git a/lib/tevent/tevent_signal.c b/lib/tevent/tevent_signal.c
index cc7fb0a..0f86c84 100644
--- a/lib/tevent/tevent_signal.c
+++ b/lib/tevent/tevent_signal.c
@@ -40,8 +40,8 @@
 #define TEVENT_SA_INFO_QUEUE_COUNT 64
 
 struct tevent_sigcounter {
-       uint32_t count;
-       uint32_t seen;
+       sig_atomic_t count;
+       sig_atomic_t seen;
 };
 
 #define TEVENT_SIG_INCREMENT(s) (s).count++

If this solved it, it might do without a gcc dependency.

Thanks,

Volker
Comment 3 KAMEI Yutaka (mail address dead) 2014-05-29 11:59:29 UTC
(In reply to comment #2)
> Can this also be solved by the following (completely untested) patch?
> 
> diff --git a/lib/tevent/tevent_signal.c b/lib/tevent/tevent_signal.c
> index cc7fb0a..0f86c84 100644
> --- a/lib/tevent/tevent_signal.c
> +++ b/lib/tevent/tevent_signal.c
> @@ -40,8 +40,8 @@
>  #define TEVENT_SA_INFO_QUEUE_COUNT 64
> 
>  struct tevent_sigcounter {
> -       uint32_t count;
> -       uint32_t seen;
> +       sig_atomic_t count;
> +       sig_atomic_t seen;
>  };
> 
>  #define TEVENT_SIG_INCREMENT(s) (s).count++
> 
> If this solved it, it might do without a gcc dependency.

No, this does not solve.
I tried this sig_atomic_t fix, seen is larger than count
in sig_state->got_signal.

Increment(++) is not atomic instruction (Load -> Add -> Store)
in disassembled code (ARM architecture).

    90      in ../lib/tevent/tevent_signal.c
       0xb6645938 <+24>:    ldr     r4, [pc, #160]  ; 0xb66459e0 <tevent_common_signal_handler+192>
       0xb664593c <+28>:    ldr     r3, [pc, #160]  ; 0xb66459e4 <tevent_common_signal_handler+196>
       0xb6645940 <+32>:    add     r4, pc, r4
       0xb6645944 <+36>:    ldr     r3, [r4, r3]
       0xb6645948 <+40>:    add     r2, r5, #65     ; 0x41
       0xb664594c <+44>:    ldr     r1, [r3, r2, lsl #3]
       0xb6645954 <+52>:    add     r1, r1, #1
       0xb6645958 <+56>:    str     r1, [r3, r2, lsl #3]
    91      in ../lib/tevent/tevent_signal.c
       0xb664595c <+60>:    ldr     r2, [r3, #1040] ; 0x410
       0xb6645964 <+68>:    add     r2, r2, #1
       0xb664596c <+76>:    str     r2, [r3, #1040] ; 0x410

Multiple signals are asigned to tevent_common_signal_handler().
When smbd receives a signal in the middle of handling one signal, interruption occurs.
I think following flow causes inconsistency between sig_state->got_signal member.

    Initial state:
        sig_state->signal_count[10] = {count=7, seen=7}
        sig_state->signal_count[17] = {count=3, seen=3}
        sig_state->got_signal = {count=10, seen=10}

    Flow:
        1-1. Receives SIGCHLD
        1-2. Load sig_state->signal_count[17].count (=3)
        1-3. Add sig_state->signal_count[17].count (=4)
        1-4. Store sig_state->signal_count[17].count
        1-5. Load sig_state->got_signal.count (=10)
        2-1. Receives SIGUSR1                       <-- Interruption
        2-2. Load sig_state->signal_count[10].count (=7)
        2-3. Add sig_state->signal_count[10].count =(8)
        2-4. Store sig_state->signal_count[10].count
        2-5. Load sig_state->got_signal.count (=10)
        2-6. Add sig_state->got_signal.count (=11)
        2-7. Store sig_state->got_signal.count
        1-6. Add sig_state->got_signal.count (=11, add old value) <-- Resume
        1-7. Store sig_state->got_signal.count
        1-8. Check SIGUSR1 (in tevent_common_check_signal())
            --> Add sig_state->signal_count[10].seen (=8)
            --> Add sig_state->got_signal.seen (=11)
        1-9. Check SIGCHLD (in tevent_common_check_signal())
            --> Add sig_state->signal_count[17].seen (=4)
            --> Add sig_state->got_signal.seen (=12) <-- This is valid

    End state:
        sig_state->signal_count[10] = {count=8, seen=8}
        sig_state->signal_count[17] = {count=4, seen=4}
        sig_state->got_signal = {count=11, seen=12}
Comment 4 Volker Lendecke 2014-05-29 12:34:19 UTC
Hmm. Ok. Is there any way to increment a variable in a signal handler on the ARM architecture without using special compiler instructions?

Hmm. SUSv4 has:

    If the signal occurs other than as the result of calling abort(),
    raise(), ^[CX] [Option Start] kill(), pthread_kill(), or sigqueue
    (), [Option End]  the behavior is undefined if the signal handler
    refers to any object with static storage duration other than by
    assigning a value to an object declared as volatile sig_atomic_t,
    or if the signal handler calls any function in the standard
    library other than one of the functions listed in Signal Concepts
    . Furthermore, if such a call fails, the value of errno is
    unspecified.

So strictly you can't increment a variable in a signal handler.

Is this Linux? If so, we could avoid most of that nastiness by using signalfd(), but that's non-portable as well unfortunately.

We might also depend on the signal pipe we are using to not overflow and write the necessary information there. write() is definitely a signal-safe function to call according to SUSv4.

Thanks a lot for your analysis! Your patch certainly looks sane for your environment and points out an important bug in our signal handling. For upstream I'd like to give it some more thought to see if we can avoid a gcc'ism. If we have to go down to compiler extensions, we do need some configure checks for this.

Volker
Comment 5 Jeremy Allison 2014-05-29 18:50:39 UTC
http://locklessinc.com/articles/signalsafe_locks/

looks interesting, but that's assembly also :-(.

How about blocking *all* signals on first entry to a handler. A bit overkill, but would that do the trick ?
Comment 6 SATOH Fumiyasu 2014-05-30 05:06:50 UTC
(In reply to comment #4)
> Hmm. Ok. Is there any way to increment a variable in a signal handler on the
> ARM architecture without using special compiler instructions?

Use atomic.h (if available, but no standard API) or stdatomic.h (but C11 only),
or remove sig_state->got_signal.
Comment 7 SATOH Fumiyasu 2014-05-30 13:58:25 UTC
(In reply to comment #4)
> So strictly you can't increment a variable in a signal handler.

No. You can increment a sig_atomic_t (or equivalent size integer)
variable if it is not modified by the main functions and a signal
handler kicked by another signal.

    TEVENT_SIG_INCREMENT(sig_state->signal_count[signum]);

The above code is safe because sig_state->signal_count[signum] is
modified by the signum signal only.

    TEVENT_SIG_INCREMENT(sig_state->got_signal);

This is not safe.
Comment 8 Jeremy Allison 2014-05-30 18:42:32 UTC
Interestingly enough, __sync_fetch_and_add() and friends seem to be the nearest thing to a standard in all the compilers we care about.

They're implemented in gcc, llvm, IBM xlC on AIX, and Intel icc (10.1 and above).

Oracle solaris is different (of course :-), but has:

#include <atomic.h>
void atomic_add_32(volatile uint32_t *target, int32_t delta);

I'm planning to add waf configure tests for this and propose a patch for master for this.

Cheers,

Jeremy.
Comment 9 Jeremy Allison 2014-06-03 17:48:53 UTC
Created attachment 10003 [details]
Proposed patch for master

How about this as a patch for master ? Includes the required waf configure checks and should work on Solaris also.

Can you let me know if this works for you and I'll submit it to samba-technical for inclusion ?

Thanks,

Jeremy.
Comment 10 Jeremy Allison 2014-07-12 00:01:33 UTC
Ping ! This has gone into master - it'd be good to know if this works for
you so we can get it into 4.1.x.

Jeremy.
Comment 11 KAMEI Yutaka (mail address dead) 2014-07-30 14:42:36 UTC
(In reply to comment #10)
> Ping ! This has gone into master - it'd be good to know if this works for
> you so we can get it into 4.1.x.
> 
> Jeremy.

I tried master source (Latest commit: 7c2c6748e323fb0e54fbc2d1b773608904458e94).
It looks solved.

I compiled and ran this environment:

    # uname -mrsv
    Linux 3.3.4 #1 Tue Oct 15 17:00:00 JST 2013 armv7l

    # gcc --version
    gcc (Debian 4.6.3-14) 4.6.3

    # /usr/local/samba/sbin/smbd -b | grep -i __sync_fetch_and_add
    HAVE___SYNC_FETCH_AND_ADD

    # grep "server role" /usr/local/samba/etc/smb.conf
            server role = active directory domain controller

    # ps -e -o pid,ppid,cmd | egrep "(samba|smb)"
    17215     1 /usr/local/samba/sbin/samba
    17229 17215 /usr/local/samba/sbin/samba
    17230 17215 /usr/local/samba/sbin/samba
    17231 17229 /usr/local/samba/sbin/smbd -D --option=server role check:inhibit=yes --foreground
    --- snip ---

put a load on smbd process like this:

    # PID=17231
    # while true; do kill -USR2 $PID; kill -CHLD $PID; done

check with gdb:

    (gdb) p sig_state->got_signal
    $1 = {count = 427427, seen = 427427}

count and seen are the same value.

Thanks,
Comment 12 Jeremy Allison 2014-07-30 17:06:30 UTC
Created attachment 10162 [details]
git-am fix for 4.1.x.

Back-ported for 4.1.next from master.
Comment 13 Jeremy Allison 2014-07-30 17:07:14 UTC
Created attachment 10163 [details]
git-am fix for 4.0.next

Back-ported to 4.0.next from master patch.
Comment 14 Volker Lendecke 2014-07-31 06:42:21 UTC
Karo, please pull the patches into the release branches!

Thanks,

Volker
Comment 15 Karolin Seeger 2014-08-03 18:33:42 UTC
(In reply to comment #14)
> Karo, please pull the patches into the release branches!
> 
> Thanks,
> 
> Volker

Pushed to autobuild-v4-[0|1]-test.
Comment 16 Karolin Seeger 2014-09-01 19:03:55 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > Karo, please pull the patches into the release branches!
> > 
> > Thanks,
> > 
> > Volker
> 
> Pushed to autobuild-v4-[0|1]-test.

Pushed to both branches.
Closing out bug report.

Thanks!