Bug 7317 - problems with SIGCHLD handling in winbindd
Summary: problems with SIGCHLD handling in winbindd
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.5
Classification: Unclassified
Component: Winbind (show other bugs)
Version: 3.5.1
Hardware: Other Linux
: P3 major
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-04-01 09:38 UTC by Stefan Metzmacher
Modified: 2010-04-13 13:26 UTC (History)
6 users (show)

See Also:


Attachments
Patch for v3-5 (3.46 KB, patch)
2010-04-01 10:41 UTC, Stefan Metzmacher
jra: review+
Details
Patches for v3-5 (which also remove 2 now unused variables) (4.28 KB, patch)
2010-04-06 09:08 UTC, Stefan Metzmacher
no flags Details
Patch for v3-4 (4.28 KB, patch)
2010-04-06 09:11 UTC, Stefan Metzmacher
no flags Details
new Patch for v3-5 (including a fix for smbcontrol winbindd validate-cache) (5.39 KB, patch)
2010-04-08 06:02 UTC, Stefan Metzmacher
jra: review+
Details
New patch for v3-4 (including the fix for smbcontrol winbindd validate-cache) (5.53 KB, patch)
2010-04-08 06:31 UTC, Stefan Metzmacher
jra: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Metzmacher 2010-04-01 09:38:13 UTC
The main problem is that we call CatchChild() within the
parent winbindd, which overwrites the signal handler
that was registered by winbindd_setup_sig_chld_handler().

That means winbindd_sig_chld_handler() and winbind_child_died()
are never triggered when a winbindd domain child dies.
As a result will get "broken pipe" for all requests to that domain.

From looking at the code the problem seems to be there at least since
3.2.x.
Comment 1 Stefan Metzmacher 2010-04-01 09:38:50 UTC
The problem was only partly fixed by the following commit in v3-2-test

commit 6da910cc1c6baccbb143f0b2d347e31e9f84c35b
Author: Jeremy Allison <jra@samba.org>
Date:   Wed May 14 14:11:26 2008 -0700

    Fix bug #5464. Pointed out by Herb @ Connectathon. In fork_domain_child() we call :
    
    CatchChild();
    
    *before* we fork the domain child. This call establishes a signal handler that
    eats SIGCLD signals and doesn't call sys_select_signal() as the main daemon
    SIGCLD handler should do. This causes the parent to ignore dead children and
    time out, instead of calling winbind_child_died() on receipt of the signal. The
    correct fix is to move the CatchChild call into the child code after the fork.
    
    Jeremy.
Comment 2 Stefan Metzmacher 2010-04-01 09:42:36 UTC
This bug was introduced by this commits:

commit d0add5f946cf63ea43067e8e935876b5346d11de
Author: Jeremy Allison <jra@samba.org>
Date:   Tue Dec 12 22:41:42 2006 +0000

    r20140: Make online/offline detection completely asynchronous.
    Now I've done this I might be able to reduce the probe
    timeout and reduce the backoff algorithm, going back
    to checking every cache time seconds (5 mins by default),
    as the parent or forked domain child will never block.
    Jeremy.

and 

commit f379a5c47d5004a5a66b6c12ec119c739b9e146d
Author: Michael Adam <obnox@samba.org>
Date:   Sun Sep 2 00:32:57 2007 +0000

    r24879: Activate the winbindd cache-validation message handler.
    Now the winbindd cache can be checked at runtime by
    calling "smbcontrol winbindd validate-cache".
    
    For the execution of the validation code, I fork a child
    and in the child restore the default SIGCHLD handler in
    order for the fork/waitpid mechanism of tdb_validate to work.
    
    Michael
Comment 3 Stefan Metzmacher 2010-04-01 09:43:20 UTC
As this bug is so old I think it's ok to fix it for 3.5.3
Comment 4 Stefan Metzmacher 2010-04-01 09:52:37 UTC
I have a patch ready and will upload it when make test passed...
Comment 5 Stefan Metzmacher 2010-04-01 10:41:02 UTC
Created attachment 5586 [details]
Patch for v3-5
Comment 6 Jeremy Allison 2010-04-01 12:12:02 UTC
Ok, reviewed this and it's a must-have fix for 3.5.2 and the next 3.4.x I think. Being unable to restart a dead winbindd child for a domain is a major problem (IMHO).

Thanks for the detective work - I see how it happened (all with the best of intentions of course).

Now we have tevent_add_signal() we must work towards retiring CatchChild() in all our code I think.

I'm going to re-assign to Karolin for consideration in 3.5.2, and create a back-port for 3.4.x.

Jeremy.
Comment 7 Jeremy Allison 2010-04-01 13:35:38 UTC
Patch applies relatively cleanly to 3.4.x. I'd suggest adding it there too. I'm guessing the reason that it hasn't bitten us obviously and badly up until now is that winbindd rarely crashes, and when it does users generally restart the whole winbindd system, not individual pieces.

However, this fix is a significant improvement in recoverability in unexpected situations, and so my personal suggestion is that it goes into 3.5.2 and 3.4.x. Happy to be convinced otherwise of course :-).

Jeremy.
Comment 8 Stefan Metzmacher 2010-04-06 02:17:16 UTC
I'm fine with both including it in 3.5.2 or waiting for 3.5.3
Comment 9 Karolin Seeger 2010-04-06 02:21:46 UTC
Picking this one for 3.5.2 would mean to delay the release for 5 (!) weeks in this case (cannot pick the patch today and release tomorrow due to our freezing policy).
I would strongly argue against delaying, because 3 smbd crash fixes have been submitted since 3.5.1. So I would vote for 3.5.3.
Comment 10 Stefan Metzmacher 2010-04-06 02:44:01 UTC
I think we should get 3.5.2 out now (without this), and 3.5.3 in about 5 weeks or so.
Comment 11 Stefan Metzmacher 2010-04-06 09:08:09 UTC
Created attachment 5600 [details]
Patches for v3-5 (which also remove 2 now unused variables)

Jeremy, please recheck the v3-5 patches
Comment 12 Stefan Metzmacher 2010-04-06 09:11:05 UTC
Created attachment 5601 [details]
Patch for v3-4

Basicly the same patches for v3-4
Comment 13 Stefan Metzmacher 2010-04-08 06:02:33 UTC
Created attachment 5610 [details]
new Patch for v3-5 (including a fix for smbcontrol winbindd validate-cache)
Comment 14 Stefan Metzmacher 2010-04-08 06:31:55 UTC
Created attachment 5611 [details]
New patch for v3-4 (including the fix for smbcontrol winbindd validate-cache)
Comment 15 Jeremy Allison 2010-04-12 13:25:50 UTC
Comment on attachment 5610 [details]
new Patch for v3-5 (including a fix for smbcontrol winbindd validate-cache)

Looks good to me!
Jeremy.
Comment 16 Jeremy Allison 2010-04-12 13:26:19 UTC
Comment on attachment 5611 [details]
New patch for v3-4 (including the fix for smbcontrol winbindd validate-cache)

Same fix for 3.4.x. - looks good to me.

Jeremy.
Comment 17 Jeremy Allison 2010-04-12 13:26:44 UTC
Re-assigning to Karolin for inclusion in next releases.

Jeremy.
Comment 18 Karolin Seeger 2010-04-13 13:26:44 UTC
Pushed to v3-5-test and v3-4-test.
Closing out bug report.

Thanks!