The main problem is that we call CatchChild() within the parent winbindd, which overwrites the signal handler that was registered by winbindd_setup_sig_chld_handler(). That means winbindd_sig_chld_handler() and winbind_child_died() are never triggered when a winbindd domain child dies. As a result will get "broken pipe" for all requests to that domain. From looking at the code the problem seems to be there at least since 3.2.x.
The problem was only partly fixed by the following commit in v3-2-test commit 6da910cc1c6baccbb143f0b2d347e31e9f84c35b Author: Jeremy Allison <jra@samba.org> Date: Wed May 14 14:11:26 2008 -0700 Fix bug #5464. Pointed out by Herb @ Connectathon. In fork_domain_child() we call : CatchChild(); *before* we fork the domain child. This call establishes a signal handler that eats SIGCLD signals and doesn't call sys_select_signal() as the main daemon SIGCLD handler should do. This causes the parent to ignore dead children and time out, instead of calling winbind_child_died() on receipt of the signal. The correct fix is to move the CatchChild call into the child code after the fork. Jeremy.
This bug was introduced by this commits: commit d0add5f946cf63ea43067e8e935876b5346d11de Author: Jeremy Allison <jra@samba.org> Date: Tue Dec 12 22:41:42 2006 +0000 r20140: Make online/offline detection completely asynchronous. Now I've done this I might be able to reduce the probe timeout and reduce the backoff algorithm, going back to checking every cache time seconds (5 mins by default), as the parent or forked domain child will never block. Jeremy. and commit f379a5c47d5004a5a66b6c12ec119c739b9e146d Author: Michael Adam <obnox@samba.org> Date: Sun Sep 2 00:32:57 2007 +0000 r24879: Activate the winbindd cache-validation message handler. Now the winbindd cache can be checked at runtime by calling "smbcontrol winbindd validate-cache". For the execution of the validation code, I fork a child and in the child restore the default SIGCHLD handler in order for the fork/waitpid mechanism of tdb_validate to work. Michael
As this bug is so old I think it's ok to fix it for 3.5.3
I have a patch ready and will upload it when make test passed...
Created attachment 5586 [details] Patch for v3-5
Ok, reviewed this and it's a must-have fix for 3.5.2 and the next 3.4.x I think. Being unable to restart a dead winbindd child for a domain is a major problem (IMHO). Thanks for the detective work - I see how it happened (all with the best of intentions of course). Now we have tevent_add_signal() we must work towards retiring CatchChild() in all our code I think. I'm going to re-assign to Karolin for consideration in 3.5.2, and create a back-port for 3.4.x. Jeremy.
Patch applies relatively cleanly to 3.4.x. I'd suggest adding it there too. I'm guessing the reason that it hasn't bitten us obviously and badly up until now is that winbindd rarely crashes, and when it does users generally restart the whole winbindd system, not individual pieces. However, this fix is a significant improvement in recoverability in unexpected situations, and so my personal suggestion is that it goes into 3.5.2 and 3.4.x. Happy to be convinced otherwise of course :-). Jeremy.
I'm fine with both including it in 3.5.2 or waiting for 3.5.3
Picking this one for 3.5.2 would mean to delay the release for 5 (!) weeks in this case (cannot pick the patch today and release tomorrow due to our freezing policy). I would strongly argue against delaying, because 3 smbd crash fixes have been submitted since 3.5.1. So I would vote for 3.5.3.
I think we should get 3.5.2 out now (without this), and 3.5.3 in about 5 weeks or so.
Created attachment 5600 [details] Patches for v3-5 (which also remove 2 now unused variables) Jeremy, please recheck the v3-5 patches
Created attachment 5601 [details] Patch for v3-4 Basicly the same patches for v3-4
Created attachment 5610 [details] new Patch for v3-5 (including a fix for smbcontrol winbindd validate-cache)
Created attachment 5611 [details] New patch for v3-4 (including the fix for smbcontrol winbindd validate-cache)
Comment on attachment 5610 [details] new Patch for v3-5 (including a fix for smbcontrol winbindd validate-cache) Looks good to me! Jeremy.
Comment on attachment 5611 [details] New patch for v3-4 (including the fix for smbcontrol winbindd validate-cache) Same fix for 3.4.x. - looks good to me. Jeremy.
Re-assigning to Karolin for inclusion in next releases. Jeremy.
Pushed to v3-5-test and v3-4-test. Closing out bug report. Thanks!