Bug 15668 - lmdb backendStore: samba-tool domain backup offline deadlocks with parallel sam.ldb modifications
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: AD: LDB/DSDB/SAMDB
Version: 4.18.3
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-24 07:02 UTC by Arvid Requate
Modified: 2024-06-25 08:17 UTC

See Also:


Attachments
reproduce-lmdb-backup-deadlock-part-1-modification-activity.sh (1.05 KB, application/x-shellscript)
2024-06-24 07:02 UTC, Arvid Requate
reproduce-lmdb-backup-deadlock-part-2-run-samba-backup.sh (677 bytes, text/plain)
2024-06-24 07:04 UTC, Arvid Requate

Description Arvid Requate 2024-06-24 07:02:44 UTC
Created attachment 18350
reproduce-lmdb-backup-deadlock-part-1-modification-activity.sh

When lmdb/mdb is used as the backendStore for sam.ldb, `samba-tool domain backup offline` runs the dedicated mdb_copy tool on each of the `sam.ldb.d/*.ldb` backend files. We observed deadlocks in which one mdb_copy process hangs in a `pthread_mutex_lock` call, attempting to lock a mutex on the lmdb metadata. I was able to reproduce this in the lab by running a simple ldbmodify loop (see attachment) in parallel with samba-tool domain backup.

It's a bit of a timing/race issue. I could reproduce it nearly always when running the modification loop in one screen terminal and then starting the backup in another. When I combined both into one shell script (with the modification loop as subprocesses), the deadlock occurred significantly less often.
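To put a number on "nearly always" versus "significantly less often", one option is a small harness that runs a command repeatedly and counts how often it fails to finish within a time limit. The following is only a sketch and not one of the attached reproducer scripts: the helper names `hangs` and `hang_rate`, the timeout values, and the stand-in commands in the demo are all assumptions made for illustration.

```python
import subprocess
import sys


def hangs(cmd, timeout_s):
    """Run cmd; return True if it has not exited after timeout_s seconds."""
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        # subprocess.run kills only the direct child on timeout; a stuck
        # grandchild (for example mdb_copy) may need separate cleanup.
        return True
    return False


def hang_rate(cmd, runs, timeout_s):
    """Fraction of runs in which cmd did not finish within timeout_s."""
    return sum(hangs(cmd, timeout_s) for _ in range(runs)) / runs


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere (assumption): one that
    # sleeps past the timeout and one that exits immediately.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    fast = [sys.executable, "-c", "pass"]
    print("slow hang rate:", hang_rate(slow, runs=1, timeout_s=0.5))
    print("fast hang rate:", hang_rate(fast, runs=1, timeout_s=10))
```

For the real case, `cmd` would be something like `["samba-tool", "domain", "backup", "offline", "--targetdir=..."]` with a timeout well above the normal backup duration, run while the modification loop is active in another terminal.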

As far as I could tell with gdb, when the deadlock occurs ldbmodify has an lmdb write transaction open (I compiled lmdb with `#define MDB_DEBUG 1` to confirm) on each of the backend ldb subfiles, so it probably owns the lmdb metadata mutex on the particular backend subfile that mdb_copy is waiting to lock. The ldbmodify process in turn hangs waiting to obtain a POSIX write lock on private/sam.ldb.d/metadata.tdb, and that appears to be blocked by samba-tool (but I haven't dug into the details of that yet).
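The situation described above can be written down as a wait-for graph, which makes the suspected cycle explicit. This is only a model of the observations, not Samba or lmdb code. The first two edges come from the gdb findings; the third edge (samba-tool waiting for its mdb_copy subprocess to exit) is an assumption, and the name `find_cycle` is made up for this sketch.

```python
# Hypothetical model of who waits for whom in the hung state.
WAITS_FOR = {
    # mdb_copy is blocked in pthread_mutex_lock on the lmdb metadata mutex,
    # which ldbmodify probably owns via its open write transaction.
    "mdb_copy": "ldbmodify",
    # ldbmodify is blocked on the POSIX write lock on metadata.tdb, which
    # appears to be held up by samba-tool.
    "ldbmodify": "samba-tool",
    # Assumption: samba-tool waits for the mdb_copy subprocess to finish.
    "samba-tool": "mdb_copy",
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    seen = set()
    node = start
    while node is not None and node not in seen:
        seen.add(node)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended at a process that is not waiting
    # node was visited before: the cycle runs from its first occurrence.
    return path[path.index(node):] + [node]


cycle = find_cycle(WAITS_FOR, "mdb_copy")
print(" -> ".join(cycle) if cycle else "no cycle")
```

If the third edge holds, no process in the cycle can make progress, which would match the observed hang. Dropping any one edge breaks the cycle.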

When I remove the acquisition of samdb.search_iterator() in `netcmd/domain_backup.py` (or `netcmd/domain/backup.py` in the latest Samba versions), the deadlock no longer occurs. However, the commit that added it was introduced to fix Bug 14676:

https://gitlab.com/samba-team/samba/-/commit/958931ad379#e8fca6c1e346a8bfacf89601ca248b6c9b470bd7_1020_1037


In my experiments it made no difference whether Samba/AD (or the DNS server) was also running.
Comment 1 Arvid Requate 2024-06-24 07:04:19 UTC
Created attachment 18351
reproduce-lmdb-backup-deadlock-part-2-run-samba-backup.sh

Part 2 of the reproducer: run samba-tool domain backup offline
Comment 2 Arvid Requate 2024-06-24 07:05:59 UTC
I tested with lmdb versions 0.9.22 and 0.9.33 (the latter being the current latest upstream release).
Comment 3 Douglas Bagnall 2024-06-25 08:17:14 UTC
Thanks for the analysis, Arvid.

Do you have a sense of how often it happens in real world situations? I guess it depends on the domain size and busyness?