Bug 12593 - hang in tdb_runtime_check_for_robust_mutexes
Status: REOPENED
Product: TDB
Classification: Unclassified
Component: libtdb
Version: unspecified
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assigned To: Ralph Böhme
QA Contact: Samba QA Contact
Depends on:
Blocks:
Reported: 2017-02-18 05:43 UTC by Hussam Al-Tayeb
Modified: 2017-04-27 12:18 UTC

Description Hussam Al-Tayeb 2017-02-18 05:43:34 UTC
I am seeing a hang when /usr/lib/gvfs/gvfsd-smb-browse is started.
I reported it in GNOME's bug tracker, and the GNOME gvfs developer suggested it is a TDB bug, so I am reporting it here.
The following is the backtrace.

Thread 5 (Thread 0x7f2ea5037700 (LWP 1054)):
#0  0x00007f2eb6325506 in sigsuspend () at /usr/lib/libc.so.6
#1  0x00007f2ead463802 in tdb_runtime_check_for_robust_mutexes ()
    at /usr/lib/libtdb.so.1
#2  0x00007f2eade9b05d in tdb_wrap_open ()
    at /usr/lib/samba/libtdb-wrap-samba4.so
#3  0x00007f2eb4e49b21 in  () at /usr/lib/libsmbconf.so.0
#4  0x00007f2eb4e4a305 in gencache_parse () at /usr/lib/libsmbconf.so.0
#5  0x00007f2eb4e4a862 in gencache_get_data_blob () at /usr/lib/libsmbconf.so.0
#6  0x00007f2eb4e4a90b in gencache_get () at /usr/lib/libsmbconf.so.0
#7  0x00007f2eb422f74a in sitename_fetch () at /usr/lib/samba/libgse-samba4.so
#8  0x00007f2eb422d7da in resolve_name () at /usr/lib/samba/libgse-samba4.so
#9  0x00007f2eb762a5e2 in  () at /usr/lib/libsmbclient.so.0
#10 0x0000000000406ebd in do_mount (backend=<optimized out>, job=0x12b72b0 [GVfsJobMount], mount_spec=<optimized out>, mount_source=<optimized out>, is_automount=<optimized out>) at gvfsbackendsmbbrowse.c:913
#11 0x00007f2eb7407f4a in g_vfs_job_run (job=0x12b72b0 [GVfsJobMount])
    at gvfsjob.c:197
#12 0x00007f2eb6920c9e in g_thread_pool_thread_proxy (data=<optimized out>)
    at gthreadpool.c:307
#13 0x00007f2eb69202a5 in g_thread_proxy (data=0x7f2e98004720) at gthread.c:784
#14 0x00007f2eb6697444 in start_thread () at /usr/lib/libpthread.so.0
#15 0x00007f2eb63d9cff in clone () at /usr/lib/libc.so.6

Thread 4 (Thread 0x7f2ea5838700 (LWP 1053)):
#0  0x00007f2eb63d066d in poll () at /usr/lib/libc.so.6
#1  0x00007f2eb68f8736 in g_main_context_poll (priority=<optimized out>, n_fds=1, fds=0x7f2e900010c0, timeout=<optimized out>, context=0x12d19c0)
    at gmain.c:4228
#2  0x00007f2eb68f8736 in g_main_context_iterate (context=context@entry=0x12d19c0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at gmain.c:3924
#3  0x00007f2eb68f884c in g_main_context_iteration (context=0x12d19c0, may_block=1) at gmain.c:3990
#4  0x00007f2ea584055d in  () at /usr/lib/gio/modules/libdconfsettings.so
#5  0x00007f2eb69202a5 in g_thread_proxy (data=0x12bc230) at gthread.c:784
#6  0x00007f2eb6697444 in start_thread () at /usr/lib/libpthread.so.0
#7  0x00007f2eb63d9cff in clone () at /usr/lib/libc.so.6

Thread 3 (Thread 0x7f2ea6a47700 (LWP 1051)):
#0  0x00007f2eb63d066d in poll () at /usr/lib/libc.so.6
#1  0x00007f2eb68f8736 in g_main_context_poll (priority=<optimized out>, n_fds=2, fds=0x7f2e980010c0, timeout=<optimized out>, context=0x12ba240)
    at gmain.c:4228
#2  0x00007f2eb68f8736 in g_main_context_iterate (context=0x12ba240, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3924
#3  0x00007f2eb68f8ac2 in g_main_loop_run (loop=0x1291a00) at gmain.c:4125
#4  0x00007f2eb6ee8606 in gdbus_shared_thread_func (user_data=0x12ba210)
    at gdbusprivate.c:247
#5  0x00007f2eb69202a5 in g_thread_proxy (data=0x12bbca0) at gthread.c:784
#6  0x00007f2eb6697444 in start_thread () at /usr/lib/libpthread.so.0
#7  0x00007f2eb63d9cff in clone () at /usr/lib/libc.so.6

Thread 2 (Thread 0x7f2ea7248700 (LWP 1050)):
#0  0x00007f2eb63d066d in poll () at /usr/lib/libc.so.6
#1  0x00007f2eb68f8736 in g_main_context_poll (priority=<optimized out>, n_fds=1, fds=0x7f2ea00008e0, timeout=<optimized out>, context=0x12b9b70)
    at gmain.c:4228
#2  0x00007f2eb68f8736 in g_main_context_iterate (context=context@entry=0x12b9b70, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at gmain.c:3924
#3  0x00007f2eb68f884c in g_main_context_iteration (context=0x12b9b70, may_block=may_block@entry=1) at gmain.c:3990
#4  0x00007f2eb68f8891 in glib_worker_main (data=<optimized out>)
    at gmain.c:5783
#5  0x00007f2eb69202a5 in g_thread_proxy (data=0x12bbc50) at gthread.c:784
#6  0x00007f2eb6697444 in start_thread () at /usr/lib/libpthread.so.0
#7  0x00007f2eb63d9cff in clone () at /usr/lib/libc.so.6

Thread 1 (Thread 0x7f2eb79817c0 (LWP 1049)):
#0  0x00007f2eb63d066d in poll () at /usr/lib/libc.so.6
#1  0x00007f2eb68f8736 in g_main_context_poll (priority=<optimized out>, n_fds=1, fds=0x12935e0, timeout=<optimized out>, context=0x12aed00) at gmain.c:4228
#2  0x00007f2eb68f8736 in g_main_context_iterate (context=0x12aed00, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3924
#3  0x00007f2eb68f8ac2 in g_main_loop_run (loop=0x12932a0) at gmain.c:4125
#4  0x000000000040b7ff in daemon_main (argc=argc@entry=4, argv=argv@entry=0x7ffe54d4ba08, max_job_threads=max_job_threads@entry=1, default_type=default_type@entry=0x40bbab "smb-network", mountable_name=mountable_name@entry=0x40c858 "org.gtk.vfs.mountpoint_smb_browse", first_type_name=first_type_name@entry=0x40bbab "smb-network") at daemon-main.c:398
#5  0x0000000000404ca7 in main (argc=4, argv=0x7ffe54d4ba08)
    at daemon-main-generic.c:45

Downstream bug report:
https://bugzilla.gnome.org/show_bug.cgi?id=778752

Thank you.
Comment 1 Stefan Metzmacher 2017-03-09 10:13:51 UTC

*** This bug has been marked as a duplicate of bug 11808 ***
Comment 2 Hussam Al-Tayeb 2017-03-09 10:27:57 UTC
I am still seeing this in tdb 1.3.12

Bug 11808 says it was fixed in 1.3.9
Comment 3 Stefan Metzmacher 2017-03-09 10:31:00 UTC
(In reply to Hussam Al-Tayeb from comment #2)

Which samba version are you using?
Comment 4 Hussam Al-Tayeb 2017-03-09 10:33:27 UTC
I have Samba 4.5.4 installed.
Comment 5 Stefan Metzmacher 2017-03-09 10:40:09 UTC
(In reply to Hussam Al-Tayeb from comment #4)

Ok, it might be a regression in the fixes for #11808.

Ralph, Uri, can you have a look?
Comment 6 Ralph Böhme 2017-03-09 10:40:50 UTC
(In reply to Hussam Al-Tayeb from comment #4)
Possibly linked against an older installed version of tdb?
Comment 7 Stefan Metzmacher 2017-03-09 10:44:55 UTC
(In reply to Ralph Böhme from comment #6)

The backtrace shows it's blocking in sigsuspend() so I don't think so.

Which glibc and kernel versions are you using?
Comment 8 Ralph Böhme 2017-03-09 10:48:06 UTC
`rpm -qv libtdb` please.
Comment 9 Ralph Böhme 2017-03-09 10:48:44 UTC
(In reply to Ralph Böhme from comment #8)
typo, sorry: `rpm -qi libtdb`
Comment 10 Hussam Al-Tayeb 2017-03-09 10:49:48 UTC
tdb build date is Mon 19 Dec 2016 while samba build date is Sun 22 Jan 2017.
Linked against glibc 2.24 (now running 2.25)
Running linux kernel 4.9.13.
I am going to update to samba 4.6.0 and I will report if this continues to happen.
Thank you.
Comment 11 Hussam Al-Tayeb 2017-03-09 10:52:11 UTC
(In reply to Ralph Böhme from comment #9)
I am not running an RPM distribution, but:

cat /usr/lib/pkgconfig/tdb.pc
prefix=/usr
exec_prefix=${prefix}
libdir=${prefix}/lib
includedir=${prefix}/include

Name: tdb
Description: A trivial database
Version: 1.3.12
Libs: -Wl,-rpath,/usr/lib -L${libdir} -ltdb
Cflags: -I${includedir} 
URL: http://tdb.samba.org/

pacman -Ql tdb
tdb /usr/
tdb /usr/bin/
tdb /usr/bin/tdbbackup
tdb /usr/bin/tdbdump
tdb /usr/bin/tdbrestore
tdb /usr/bin/tdbtool
tdb /usr/include/
tdb /usr/include/tdb.h
tdb /usr/lib/
tdb /usr/lib/libtdb.so
tdb /usr/lib/libtdb.so.1
tdb /usr/lib/libtdb.so.1.3.12
tdb /usr/lib/pkgconfig/
tdb /usr/lib/pkgconfig/tdb.pc
tdb /usr/lib/python2.7/
tdb /usr/lib/python2.7/site-packages/
tdb /usr/lib/python2.7/site-packages/_tdb_text.py
tdb /usr/lib/python2.7/site-packages/tdb.so
tdb /usr/share/
tdb /usr/share/man/
tdb /usr/share/man/man8/
tdb /usr/share/man/man8/tdbbackup.8.gz
tdb /usr/share/man/man8/tdbdump.8.gz
tdb /usr/share/man/man8/tdbrestore.8.gz
tdb /usr/share/man/man8/tdbtool.8.gz
Comment 12 Ralph Böhme 2017-03-09 10:54:10 UTC
(In reply to Hussam Al-Tayeb from comment #11)
Thanks!
Well, metze is probably right that the sigsuspend() in the stack backtrace tells us that you *are* using the version that has the "fix", because the code didn't call sigsuspend() before.
But seeing the version info is a little bit more explicit. :)
Comment 13 Uri Simchoni 2017-03-09 11:08:09 UTC
I wonder whether that check is multi-thread-safe. Ending up hung on sigsuspend probably means that another thread got the signal.

According to POSIX (http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_sigmask.html) pthread_sigmask() only affects the calling thread's mask. If we're already thread #5, other threads may have received the signal.

According to the Linux manpage, pthread_sigmask() is just like sigprocmask() (i.e. affecting the whole process).
Comment 14 Ralph Böhme 2017-03-09 11:21:32 UTC
(In reply to Uri Simchoni from comment #13)
Yeah, I was suspecting this as well. Which is kind of a problem, because POSIX leaves the behaviour of sigprocmask() in a multithreaded process unspecified, so it is unclear whether you can call it to change the signal mask of all threads.
Does anyone know? Going to write a test if not.
Comment 15 Uri Simchoni 2017-03-09 11:46:19 UTC
Threads have distinct signal masks, so pthread_sigmask() must change only the calling thread - it makes no sense otherwise. Basically the application writer needs to set signal masks of threads to control which thread gets what signal, and setting it from a library, with no control over other threads, is problematic.

Perhaps gvfsd-smb-browse can initialize tdb before spawning threads? That's what we do in Samba daemons and I really don't see a way around it, except for dropping this test. We also can't have some applications which access gencache assume there are robust mutexes and some assume there aren't. If we discuss changes to tdb, let's move the discussion to the list.
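
To illustrate the "initialize before spawning threads" ordering suggested here, a rough sketch follows. It does not use the real tdb API: hypothetical_library_init() is a made-up stand-in for whatever first touches tdb (for example the first tdb open), and start_workers_after_init() and MAX_WORKERS are likewise invented for this example. The only point is that the one-time setup runs in the main thread while it is still the only thread, so no other thread can receive a signal the setup is waiting for.

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_WORKERS 16

/* Written by the main thread before any worker thread exists. */
static bool init_done;

/* Hypothetical stand-in for one-time library setup (e.g. the first tdb open). */
static void hypothetical_library_init(void)
{
    init_done = true;
}

/* Each worker records whether the setup had already completed. */
static void *worker(void *arg)
{
    bool *saw_init = arg;

    *saw_init = init_done;
    return NULL;
}

/*
 * Runs the setup first, then starts n workers.
 * Returns the number of workers that observed the completed setup,
 * or -1 if n is out of range or a thread could not be created.
 */
int start_workers_after_init(int n)
{
    pthread_t tids[MAX_WORKERS];
    bool saw[MAX_WORKERS] = { false };
    int started;
    int count = 0;

    if (n < 0 || n > MAX_WORKERS) {
        return -1;
    }

    /* Still single-threaded here: safe place for signal-based checks. */
    hypothetical_library_init();

    for (started = 0; started < n; started++) {
        if (pthread_create(&tids[started], NULL, worker, &saw[started]) != 0) {
            break;
        }
    }

    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        if (saw[i]) {
            count++;
        }
    }

    return (started == n) ? count : -1;
}
```

A caller would invoke start_workers_after_init() from main() before creating any other thread, which is the part a library cannot enforce on its own.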
Comment 16 Ralph Böhme 2017-03-09 21:01:32 UTC
(In reply to Uri Simchoni from comment #15)
Fwiw, I tested whether sigprocmask() might change the signal mask of all threads in a multithreaded program and it doesn't. It behaves just like pthread_sigmask().
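
For reference, a test along these lines could look roughly like the sketch below. This is not the test that was actually run, and the names block_in_worker() and worker_mask_leaks_to_caller() are made up for the example. A worker thread blocks SIGUSR1 with sigprocmask(), and the calling thread then inspects its own mask with pthread_sigmask().

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Worker: block SIGUSR1 via sigprocmask() and report whether the call succeeded. */
static void *block_in_worker(void *arg)
{
    int *ok = arg;
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    *ok = (sigprocmask(SIG_BLOCK, &set, NULL) == 0);
    return NULL;
}

/*
 * Returns 1 if SIGUSR1 ends up blocked in the calling thread after a
 * worker thread blocked it with sigprocmask(), 0 if the caller's mask
 * is unaffected, and -1 on error.
 */
int worker_mask_leaks_to_caller(void)
{
    pthread_t tid;
    sigset_t set, cur;
    int worker_ok = 0;

    /* Start from a known state: SIGUSR1 unblocked in the caller. */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    if (pthread_sigmask(SIG_UNBLOCK, &set, NULL) != 0) {
        return -1;
    }

    if (pthread_create(&tid, NULL, block_in_worker, &worker_ok) != 0) {
        return -1;
    }
    if (pthread_join(tid, NULL) != 0 || !worker_ok) {
        return -1;
    }

    /* Query the caller's current mask without changing it. */
    if (pthread_sigmask(SIG_SETMASK, NULL, &cur) != 0) {
        return -1;
    }
    return sigismember(&cur, SIGUSR1);
}
```

Given the result reported above, the function would be expected to return 0 on Linux; since POSIX leaves sigprocmask() unspecified in multithreaded processes, other platforms may differ.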
Comment 17 Ralph Böhme 2017-03-09 21:02:51 UTC
We might be able to keep the runtime test by using a library constructor. We already use this feature in talloc. I'll try to find time tomorrow to code something up.
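
A minimal sketch of the library-constructor idea, assuming GCC or Clang (the constructor attribute is a compiler extension). This is not the patch that was later proposed, and all names here are invented for illustration. In particular, probe_robust_mutex_attr() is only a cheap stand-in that checks whether the robust and process-shared mutex attributes can be set; it is not tdb's actual runtime test.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>

static bool check_ran;         /* true once the constructor has run */
static bool robust_supported;  /* cached result of the probe */

/* Simplified stand-in probe: can a robust, process-shared mutexattr be set up? */
static bool probe_robust_mutex_attr(void)
{
    pthread_mutexattr_t attr;
    bool ok;

    if (pthread_mutexattr_init(&attr) != 0) {
        return false;
    }
    ok = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) == 0 &&
         pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) == 0;
    pthread_mutexattr_destroy(&attr);
    return ok;
}

/* Runs when the object is loaded, i.e. before main() for a normally linked library. */
__attribute__((constructor))
static void lib_run_check_at_load(void)
{
    robust_supported = probe_robust_mutex_attr();
    check_ran = true;
}

/* Later callers only read the cached result; no signal handling at call time. */
bool lib_robust_mutexes_supported(void)
{
    return robust_supported;
}

bool lib_check_ran_at_load(void)
{
    return check_ran;
}
```

Note that this only avoids the threading problem when the library is loaded before the application starts threads; a library pulled in later via dlopen() would still run its constructor in an already multithreaded process.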
Comment 18 Jeremy Allison 2017-03-09 21:21:27 UTC
FYI - was that (the sigprocmask test) on Linux? There are glibc-isms in the way syscalls behave with Linux pthreads mapped onto system threads (different processes under the covers) that are different in *BSD and Solaris (not that Solaris matters anymore, but *BSD does :-).

Can you test sigprocmask on *BSD as well?
Comment 19 Ralph Böhme 2017-03-10 05:20:16 UTC
(In reply to Jeremy Allison from comment #18)
Yes, that was on Linux. I could test it on FreeBSD, but how would that help if we already know it's unusable as it is on Linux?
Comment 20 Jeremy Allison 2017-03-10 16:36:02 UTC
Never mind, I was just curious as I suspect *BSD behaves differently here :-).
Comment 21 Ralph Böhme 2017-03-12 13:31:18 UTC
I proposed a patch on the ML, cf <https://lists.samba.org/archive/samba-technical/2017-March/119314.html>.
Comment 22 Stefan Metzmacher 2017-04-27 12:18:45 UTC
(In reply to Ralph Böhme from comment #21)

The fix proposed for the next tdb release:
https://git.samba.org/?p=metze/samba/wip.git;a=commitdiff;h=be43b65b32e23d467c5aea7f7da9eb8
from this branch:
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master4-ldb