Created attachment 18350
reproduce-lmdb-backup-deadlock-part-1-modification-activity.sh

When using lmdb/mdb as backendStore for sam.ldb, samba-tool domain backup offline uses the dedicated mdb_copy tool for each of the `sam.ldb.d/*.ldb` backend files. We observed deadlocks where one mdb_copy process hangs in a `pthread_mutex_lock` call, attempting to lock a mutex on the lmdb metadata.

I was able to reproduce this in the lab by running a simple ldbmodify loop (see attachment) in parallel with samba-tool domain backup. It is a timing/race issue: I could reproduce it nearly always when running the modification loop in one screen terminal and then starting the backup in another. When I combined both into one shell script (with the modification loop as subprocesses), the deadlock occurred significantly less often.

As far as I could tell with gdb, when the deadlock occurs ldbmodify has an lmdb write transaction open on each of the backend ldb subfiles (I compiled lmdb with `#define MDB_DEBUG 1` to confirm). So it probably owns the lmdb metadata mutex on the particular backend subfile that mdb_copy is waiting to lock. The ldbmodify process in turn hangs waiting to obtain a POSIX write lock on private/sam.ldb.d/metadata.tdb, and that appears to be blocked by samba-tool (I haven't dug into the details of that yet).

When I remove the acquisition of samdb.search_iterator() in `netcmd/domain_backup.py` (or `netcmd/domain/backup.py` in the latest Samba versions), the deadlock no longer occurs. However, that call was introduced by the commit that fixed Bug 14676: https://gitlab.com/samba-team/samba/-/commit/958931ad379#e8fca6c1e346a8bfacf89601ca248b6c9b470bd7_1020_1037

In my experiments it made no difference whether Samba/AD (or the DNS server) was also running.
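To make the suspected cycle explicit, here is a minimal Python sketch of the wait-for relationships described above. It is only a model and does not touch lmdb, tdb or Samba. The `find_cycle` helper and the dictionaries are hypothetical names for illustration. The edge from samba-tool to mdb_copy assumes that samba-tool waits for the mdb_copy child process to exit, and the edge from ldbmodify to samba-tool reflects the unconfirmed observation that samba-tool is what blocks the metadata.tdb write lock.

```python
def find_cycle(waits_for, start):
    """Follow 'X waits for Y' edges from start.

    Returns the cycle as a list of process names, or None if the
    chain ends at a process that is not waiting for anything.
    """
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None
    return path[path.index(node):] + [node]


# Wait-for edges as suspected from the gdb session (partly assumptions):
observed = {
    # mdb_copy hangs in pthread_mutex_lock on the lmdb metadata mutex,
    # which ldbmodify probably owns via its open write transaction.
    "mdb_copy": "ldbmodify",
    # ldbmodify waits for a POSIX write lock on metadata.tdb, which
    # appears to be blocked by samba-tool (not yet confirmed).
    "ldbmodify": "samba-tool",
    # Assumption: samba-tool waits for the mdb_copy child to finish.
    "samba-tool": "mdb_copy",
}

# Without the samdb.search_iterator() acquisition, ldbmodify is assumed
# to get its metadata.tdb lock, so its outgoing edge disappears.
without_iterator = {k: v for k, v in observed.items() if k != "ldbmodify"}

print(find_cycle(observed, "mdb_copy"))
print(find_cycle(without_iterator, "mdb_copy"))
```

With all three edges present the walk returns to mdb_copy, which matches the hang. Dropping the ldbmodify edge breaks the chain, which is consistent with the deadlock disappearing when the search_iterator() acquisition is removed.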
Created attachment 18351
reproduce-lmdb-backup-deadlock-part-2-run-samba-backup.sh

Part 2 of the reproducer: run samba-tool domain backup offline.
I tested with lmdb versions 0.9.22 and 0.9.33 (the latter being the latest upstream release at the time of writing).
Thanks for the analysis, Arvid. Do you have a sense of how often it happens in real world situations? I guess it depends on the domain size and busyness?