Winbindd locked up again, as mentioned in the second entry of bug 13439. To me it seems that in both cases there are issues with tdb (either there is some bug in libtdb or the server has corrupted tdb files causing issues?). Unfortunately I had not rebuilt libtdb with debug info before it locked up.

ldb was downgraded to 1.1.29-r1 (Gentoo) for samba 4.6.15
tdb is version 1.3.15
tevent is version 0.9.36
glibc is version 2.25-r11 (Gentoo)

Interestingly, there are about 5 winbindd processes, with 4 sitting in epoll_wait() and one being stuck:

#0 0x00007f8eb6cf52a2 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#1 0x00007f8eaf962b3e in ?? () from /usr/lib64/libtdb.so.1
#2 0x00007f8eaf962e3b in ?? () from /usr/lib64/libtdb.so.1
#3 0x00007f8eaf95b261 in ?? () from /usr/lib64/libtdb.so.1
#4 0x00007f8eaf95b889 in ?? () from /usr/lib64/libtdb.so.1
#5 0x00007f8eaf95bb92 in ?? () from /usr/lib64/libtdb.so.1
#6 0x00007f8eaf95bc2b in ?? () from /usr/lib64/libtdb.so.1
#7 0x00007f8eaf95be3e in tdb_chainlock () from /usr/lib64/libtdb.so.1
#8 0x00007f8eb20e9921 in gencache_parse (keystr=keystr@entry=0x7ffcd393a2d0 "IDMAP/GID2SID/6786", parser=parser@entry=0x7f8eb20ecf <idmap_cache_xid2sid_parser>, private_data=private_data@entry=0x7ffcd393a2b0) at ../source3/lib/gencache.c:497
#9 0x00007f8eb20ed54a in idmap_cache_find_gid2sid (gid=<optimized out>, sid=sid@entry=0x7ffcd393a420, expired=expired@entry=0x7ffcd393a418) at ../source3/lib/idmap_cache.c:270
#10 0x000055b8ca331ef4 in wb_xids2sids_send (mem_ctx=<optimized out>, ev=0x55b8cb3b9470, cli=<optimized out>, request=0x55b8cb5c99e0, num_xids=num_xids@entry=1) at ../source3/winbindd/wb_xids2sids.c:480
#11 0x000055b8ca33a7de in winbindd_getgrgid_send (mem_ctx=<optimized out>, ev=0x55b8cb3b9470, cli=<optimized out>, request=0x55b8cb5c99e0) at ../source3/winbindd/winbindd_getgrgid.c:57
#12 0x000055b8ca2fd0f6 in process_request (state=0x55b8cb5b1ff0) at ../source3/winbindd/winbindd.c:698
#13 winbind_client_request_read (req=<optimized out>) at ../source3/winbindd/winbindd.c:953
#14 0x000055b8ca34ae38 in wb_req_read_done (subreq=<optimized out>) at ../nsswitch/wb_reqtrans.c:126
#15 0x00007f8eaeecb673 in ?? () from /usr/lib64/libtevent.so.0
#16 0x00007f8eaeec9a47 in ?? () from /usr/lib64/libtevent.so.0
#17 0x00007f8eaeec5e3d in _tevent_loop_once () from /usr/lib64/libtevent.so.0
#18 0x000055b8ca2f87e8 in main (argc=<optimized out>, argv=<optimized out>) at ../source3/winbindd/winbindd.c:1791
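Since the stuck frame sits below tdb_chainlock(), one thing worth checking on the affected host is whether libtdb itself believes robust mutexes work there. A minimal sketch using libtdb's public self-test (tdb_runtime_check_for_robust_mutexes() is real libtdb API; the wrapper program around it is just illustrative):

#include <stdio.h>
#include <tdb.h>

/* build: gcc tdb_mutex_probe.c -o tdb_mutex_probe -ltdb */
int main(void)
{
	/* tdb_runtime_check_for_robust_mutexes() forks a child that
	 * takes a robust mutex in shared memory and dies; the parent
	 * then verifies it observes EOWNERDEAD and can recover the
	 * lock. If that fails, mutex-based tdb locking is unsafe. */
	if (tdb_runtime_check_for_robust_mutexes()) {
		printf("robust mutexes look usable on this system\n");
		return 0;
	}
	printf("robust mutexes are NOT usable on this system\n");
	return 1;
}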
Can you try "dbwrap_tdb_mutexes:*=no" in the [global] section? Maybe your kernel/glibc has a broken implementation of robust mutexes. See https://bugzilla.redhat.com/show_bug.cgi?id=1401665
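In smb.conf that would look like this (a minimal sketch; only the parametric option itself is the suggestion above, and with it set to no tdb falls back to fcntl() byte-range locking instead of robust mutexes):

[global]
	# Disable robust-mutex locking for all dbwrap tdb files
	# ("*" matches every tdb name); tdb then uses fcntl()
	# locks, which do not depend on robust mutex support.
	dbwrap_tdb_mutexes:* = no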
(In reply to Stefan Metzmacher from comment #1) Dear Mr. Metzmacher, thank you for this very useful suggestion. Using the Test Code v2 from the RH bug you posted and running the normal-priority and the realtime-priority programs in parallel, the realtime-priority process seems to lock up. So I have added the suggested option to smb.conf and hope that this resolves the issues (installing a fixed/patched kernel/glibc would be the better way and will happen at some point, once patches/unmasked fixed versions are available). Btw., is the dbwrap_tdb_mutexes option undocumented (and thus might vanish sometime without further notice)? At least "man smb.conf" for Samba 4.6.15 doesn't say anything about it.
(In reply to Stefan Metzmacher from comment #1) Maybe I didn't understand how to run the Test Code v2. I tried it on Ubuntu 16.04.4 LTS (HWE kernel 4.13 with the latest updates as of 18.05.2018) on real hardware (AMD Athlon(tm) 64 X2 Dual Core Processor 3800+) - as far as I remember it uses glibc 2.13 --> failed (no process owned the mutex). I ran it today (22.05.2018) on Ubuntu 18.04 LTS (kernel 4.15.0-22-generic) with the latest updates (glibc 2.27) on a Windows 10 1803 Hyper-V UEFI VM, and one of the two test_mutex_raw processes stopped. So I don't know whether
a) I have done something wrong (should the two instances of test_mutex_raw not be run in parallel, one with normal priority and one with realtime priority?),
b) the test code has some issues,
c) I'm just unlucky and only found broken combinations (please don't tell me that another OS (BSD) works), or
d) there aren't many working robust mutex implementations for glibc and Linux available, and thus they shouldn't be used.
Btw. it seems that the glibc version used might have been 2.25-r10 (which is missing the patch for BZ 21778). But unfortunately the Test Code v2 also hangs with 2.25-r11 (if a) is not the case).
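For context on what the test is supposed to demonstrate, the core robust-mutex contract looks roughly like this (a minimal standalone sketch of POSIX robust mutex recovery, not the actual Test Code v2 from the RH bug):

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* build: gcc -pthread robust_demo.c -o robust_demo */
int main(void)
{
	/* Put the mutex in shared memory so parent and child share it. */
	pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
	                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	pthread_mutexattr_t a;

	pthread_mutexattr_init(&a);
	pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
	pthread_mutex_init(m, &a);

	if (fork() == 0) {
		/* Child: take the lock and die without unlocking. */
		pthread_mutex_lock(m);
		_exit(0);
	}
	wait(NULL);

	/* Parent: a working implementation returns EOWNERDEAD here
	 * instead of blocking forever on the dead owner's lock. */
	int ret = pthread_mutex_lock(m);
	if (ret == EOWNERDEAD) {
		pthread_mutex_consistent(m); /* mark state recovered */
		printf("recovered lock from dead owner\n");
	} else {
		printf("pthread_mutex_lock returned %d\n", ret);
	}
	pthread_mutex_unlock(m);
	return 0;
}

If the final pthread_mutex_lock() in a sketch like this hangs, that is the same failure mode as the winbindd process stuck in __pthread_mutex_lock_full() above.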