We observed many winbindd zombie processes after Samba 4.2 had been running for a while. After a few days of chasing this issue, I found the root cause: the function tdb_runtime_check_for_robust_mutexes() in lib/tdb/common/mutex.c uses signal() to set the SIGCHLD handler. On Solaris, signal(SIGCHLD, ...) sets the SA_RESETHAND flag internally, so the SIGCHLD handler is reset to the default the first time the signal is delivered. Once the SIGCHLD handler has been reset to the default, any child process forked by the main winbindd daemon is left as a zombie when it exits.

The following C code shows that signal() does set the SA_RESETHAND flag.

#include <stdio.h>
#include <string.h>
#include <signal.h>

static void (*old_handler)(int) = SIG_ERR;

static void sig_chld_handler(int sig)
{
	printf("Caught signal %d\n", sig);
}

/* Install a handler with sigaction(), as Samba's CatchSignal() does. */
void (*CatchSignal(int signum, void (*handler)(int)))(int)
{
	struct sigaction act;
	struct sigaction oldact;

	memset((char *)&act, 0, sizeof(act));
	act.sa_handler = handler;
#ifdef SA_RESTART
	if (signum != SIGALRM)
		act.sa_flags = SA_RESTART;
#endif
	sigemptyset(&act.sa_mask);
	sigaddset(&act.sa_mask, signum);
	sigaction(signum, &act, &oldact);
	return oldact.sa_handler;
}

/* Print the names of the flags set in sa_flags. */
void print_sa_flags(int sa_flags)
{
	if (sa_flags & SA_NOCLDSTOP) printf("NOCLDSTOP ");
	if (sa_flags & SA_ONSTACK)   printf("ONSTACK ");
	if (sa_flags & SA_RESETHAND) printf("RESETHAND ");
	if (sa_flags & SA_RESTART)   printf("RESTART ");
	if (sa_flags & SA_SIGINFO)   printf("SIGINFO ");
	if (sa_flags & SA_NODEFER)   printf("NODEFER ");
	if (sa_flags & SA_NOCLDWAIT) printf("NOCLDWAIT ");
	printf("\n");
}

/* Print the current handler and flags for signum without changing them. */
void print_signal(int signum)
{
	struct sigaction oldact;

	memset((char *)&oldact, 0, sizeof(oldact));
	sigaction(signum, NULL, &oldact);
	printf("Signal # %d handler %p flags 0x%X\n",
	       signum, (void *)oldact.sa_handler, (unsigned int)oldact.sa_flags);
	print_sa_flags(oldact.sa_flags);
}

int main(int argc, char **argv)
{
	print_signal(SIGCHLD);

	printf("Set SIGCLD handler using sigaction().\n");
	old_handler = CatchSignal(SIGCHLD, sig_chld_handler);
	printf("Old handler = %p\n", (void *)old_handler);
	print_signal(SIGCHLD);

	printf("Now set SIGCLD handler using signal() function.\n");
	old_handler = signal(SIGCHLD, sig_chld_handler);
	printf("Old handler = %p\n", (void *)old_handler);
	print_signal(SIGCHLD);

	return 0;
}
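The one-shot effect of SA_RESETHAND can also be checked in isolation, without relying on what signal() does on a particular platform. The sketch below is only an illustration: the helper name handler_reset_after_delivery() and its return convention are made up for this comment and are not part of tdb or Samba. It installs a SIGCHLD handler with sigaction() using the given flags, delivers one SIGCHLD with raise(), and then reads the disposition back.

```c
#define _POSIX_C_SOURCE 200809L /* assumption: needed for sigaction() under strict -std= modes */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t caught = 0;

static void on_sigchld(int sig)
{
	(void)sig;
	caught = 1;
}

/*
 * Hypothetical helper (illustration only, not tdb code).
 * Installs on_sigchld for SIGCHLD with the given sa_flags, delivers one
 * SIGCHLD to this process, then queries the current disposition.
 * Returns 1 if the disposition is back at SIG_DFL, 0 if the handler is
 * still installed, and -1 on error or if the handler never ran.
 */
int handler_reset_after_delivery(int sa_flags)
{
	struct sigaction act;
	struct sigaction cur;

	memset(&act, 0, sizeof(act));
	act.sa_handler = on_sigchld;
	act.sa_flags = sa_flags;
	sigemptyset(&act.sa_mask);

	caught = 0;
	if (sigaction(SIGCHLD, &act, NULL) != 0)
		return -1;

	/* Deliver SIGCHLD once; the handler runs before raise() returns. */
	if (raise(SIGCHLD) != 0 || !caught)
		return -1;

	/* Query only: a NULL second argument leaves the disposition unchanged. */
	if (sigaction(SIGCHLD, NULL, &cur) != 0)
		return -1;

	return cur.sa_handler == SIG_DFL ? 1 : 0;
}
```

Called with SA_RESETHAND, the helper should report that the handler is gone after a single delivery; called with 0, the handler should still be in place. The first case is the state winbindd ends up in once a handler installed through signal() on Solaris has fired once.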
stack traces:

  3  21297  setsigact:entry
              libc.so.1`__sigaction+0xa
              libc.so.1`signal+0x71
              libtdb.so.1.3.4`tdb_runtime_check_for_robust_mutexes+0x197
              libtdb-wrap-samba4.so`tdb_wrap_open+0x120
              libsmbconf.so.0`gencache_init+0x317
              libsmbconf.so.0`gencache_parse+0x64
              libsmbconf.so.0`idmap_cache_find_uid2sid+0x91
              winbindd`wb_uid2sid_send+0x71
              winbindd`winbindd_uid_to_sid_send+0xd9
              winbindd`process_request+0x1d8
              winbindd`winbind_client_request_read+0x192
              libtevent.so.0.9.22`_tevent_req_notify_callback+0x6a
              libtevent.so.0.9.22`tevent_req_finish+0x78
              libtevent.so.0.9.22`_tevent_req_done+0x25
              winbindd`wb_req_read_done+0x122
              libtevent.so.0.9.22`_tevent_req_notify_callback+0x6a
              libtevent.so.0.9.22`tevent_req_finish+0x78
              libtevent.so.0.9.22`_tevent_req_done+0x25
              libsmb-transport-samba4.so`read_packet_handler+0x21d
              libtevent.so.0.9.22`epoll_event_loop+0x3a5

  3  21297  setsigact:entry
              libc.so.1`__sigaction+0xa
              libc.so.1`signal+0x71
              libtdb.so.1.3.4`tdb_runtime_check_for_robust_mutexes+0x39d
              libtdb-wrap-samba4.so`tdb_wrap_open+0x120
              libsmbconf.so.0`gencache_init+0x317
              libsmbconf.so.0`gencache_parse+0x64
              libsmbconf.so.0`idmap_cache_find_uid2sid+0x91
              winbindd`wb_uid2sid_send+0x71
              winbindd`winbindd_uid_to_sid_send+0xd9
              winbindd`process_request+0x1d8
              winbindd`winbind_client_request_read+0x192
              libtevent.so.0.9.22`_tevent_req_notify_callback+0x6a
              libtevent.so.0.9.22`tevent_req_finish+0x78
              libtevent.so.0.9.22`_tevent_req_done+0x25
              winbindd`wb_req_read_done+0x122
              libtevent.so.0.9.22`_tevent_req_notify_callback+0x6a
              libtevent.so.0.9.22`tevent_req_finish+0x78
              libtevent.so.0.9.22`_tevent_req_done+0x25
              libsmb-transport-samba4.so`read_packet_handler+0x21d
              libtevent.so.0.9.22`epoll_event_loop+0x3a5
Created attachment 10899
git-am proposed patch for master.

Can you test this fix and see if it solves the problem?

Thanks,
Jeremy.
(In reply to Jeremy Allison from comment #2) Yes, the patch works. Thanks!
Created attachment 10910
*Working* git-am patch for master.

Sorry for the problem. Here is a working patch.
Comment on attachment 10910
*Working* git-am patch for master.

The problem was that if no handler had been installed yet, oldact.sa_handler == NULL (#define SIG_DFL ((__sighandler_t)0)). That value was returned to the caller, where it was indistinguishable from the #else clause of #ifdef HAVE_SIGACTION, which returns NULL to mean guaranteed failure.

So we thought we had working mutexes because tdb_mutex_locking_supported() would return true, but tdb_runtime_check_for_robust_mutexes() would always return false :-(.

The new code returns a bool and is given a pointer that it fills with the previous handler.
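To make the ambiguity concrete, here is a minimal sketch of the bool-plus-out-pointer shape described above. It is not the code from the attachment: the name install_handler() and the sig_handler_fn typedef are hypothetical, and the flag choice is an assumption. The point is only that success or failure travels in the return value, so a previous handler equal to SIG_DFL can no longer be mistaken for an error.

```c
#define _POSIX_C_SOURCE 200809L /* assumption: needed for sigaction() under strict -std= modes */
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void (*sig_handler_fn)(int);

/*
 * Hypothetical helper (illustration only, not the patched tdb function).
 * Installs handler for signum via sigaction(), so SA_RESETHAND is never
 * set implicitly. On success returns true and, if old_handler is not
 * NULL, stores the previous handler there. On failure returns false and
 * leaves *old_handler untouched.
 */
bool install_handler(int signum, sig_handler_fn handler,
		     sig_handler_fn *old_handler)
{
	struct sigaction act;
	struct sigaction oldact;

	memset(&act, 0, sizeof(act));
	act.sa_handler = handler;
	sigemptyset(&act.sa_mask);
#ifdef SA_RESTART
	act.sa_flags = SA_RESTART; /* assumption: restart interrupted syscalls */
#endif

	if (sigaction(signum, &act, &oldact) != 0)
		return false;

	if (old_handler != NULL)
		*old_handler = oldact.sa_handler; /* may legitimately be SIG_DFL */

	return true;
}
```

With this shape a caller can save the old SIGCHLD handler, run its check, and put the old handler back through the same function, treating only a false return as failure.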
Created attachment 10914
git-am fix for 4.2.next

Cherry-pick of patch that went into master.
Comment on attachment 10914
git-am fix for 4.2.next

LGTM
Karolin, please add the patch to the next 4.2 release. Thanks!
(In reply to Andreas Schneider from comment #8) Pushed to autobuild-v4-2-test.
(In reply to Karolin Seeger from comment #9) Pushed to v4-2-test. Closing out bug report. Thanks!