samba-3.4.3-4 (with SCHEDULE_FOR_DELETION vacuuming enhancements)
We are experiencing high ctdb/smbd lock-wait latencies while the cluster is under heavy load. These occurrences are generally non-disruptive, but they still leave smbd unresponsive for long periods, leading to client-side timeouts.
With extra debugging information added to the fetch_locked code path in Samba's dbwrap_ctdb.c (70f9338bf2e6081916ffe5bb7cddf50b4e958b24), we can observe where the lock contention occurs:
[2011/08/18 20:13:03, 0, pid=31234]
Held tdb 0x42fe72c5 lock for 10.292433 seconds
dbid:0x42fe72c5 name:locking.tdb path:/var/lib/ctdb/locking.tdb.0
smbd is holding a lock on the locking.tdb database for over 10 seconds (this has been seen to climb as high as 40 seconds). Unwinding the stack from this point we can see the locking.tdb.0 lock is held by smbd over the file unlink code path in close_remove_share_mode().
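For illustration, the kind of hold-time instrumentation that yields a message like the one above could look roughly like the following. This is a minimal sketch, not the actual debugging patch: elapsed_seconds() and format_hold_warning() are hypothetical helper names, and the threshold parameter is an assumption.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not Samba code): seconds elapsed between two
 * timestamps, e.g. taken when the record lock is acquired and released. */
static double elapsed_seconds(const struct timespec *start,
                              const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* Hypothetical helper: write a warning into buf when the lock was held
 * for at least `threshold` seconds. Returns 1 if a warning was written,
 * 0 otherwise. */
static int format_hold_warning(char *buf, size_t len, unsigned int dbid,
                               double held, double threshold)
{
    if (held < threshold) {
        return 0;
    }
    snprintf(buf, len, "Held tdb 0x%08x lock for %f seconds", dbid, held);
    return 1;
}
```

In such a scheme the two timestamps would be taken with clock_gettime(CLOCK_MONOTONIC, ...) immediately after the lock is obtained and immediately before it is dropped, so that wall-clock adjustments do not distort the measurement.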
These delays can be directly attributed to poor OCFS2 delete performance with large data sets (http://oss.oracle.com/pipermail/ocfs2-devel/2011-July/008258.html).
As a potential solution, I'd like to avoid holding the share mode lock over the SMB_VFS_UNLINK() path.
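The idea can be sketched in isolation as follows. This is a simplified model under stated assumptions, not Samba code: struct share_lock, struct file_state, and all of the functions are hypothetical stand-ins, with slow_unlink() playing the role of SMB_VFS_UNLINK(). The sketch also ignores the race in which another client opens the file between releasing the lock and unlinking, which a real change would have to handle.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the share mode lock on one file's record. */
struct share_lock {
    bool held;
};

/* Hypothetical stand-in for the per-file state guarded by that lock. */
struct file_state {
    struct share_lock lock;
    int open_count;               /* number of open handles */
    bool delete_on_close;         /* file should be removed on last close */
    bool unlinked;                /* set once the unlink has run */
    bool lock_held_during_unlink; /* records whether the lock was held */
};

static void lock_acquire(struct share_lock *l) { l->held = true; }
static void lock_release(struct share_lock *l) { l->held = false; }

/* Stand-in for the slow filesystem unlink. */
static void slow_unlink(struct file_state *f)
{
    f->lock_held_during_unlink = f->lock.held;
    f->unlinked = true;
}

/* Shape described in the report: the unlink runs while the lock is held,
 * so every other waiter on the record is stalled for its duration. */
static void close_unlink_under_lock(struct file_state *f)
{
    lock_acquire(&f->lock);
    f->open_count--;
    if (f->open_count == 0 && f->delete_on_close) {
        slow_unlink(f);
    }
    lock_release(&f->lock);
}

/* Proposed shape: decide under the lock, release it, then unlink. */
static void close_unlink_after_unlock(struct file_state *f)
{
    bool do_unlink;

    lock_acquire(&f->lock);
    f->open_count--;
    do_unlink = (f->open_count == 0 && f->delete_on_close);
    lock_release(&f->lock);

    if (do_unlink) {
        slow_unlink(f);
    }
}
```

The only state carried across the unlock is the boolean decision, so the time spent in the filesystem no longer counts toward the lock hold time.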
Closing this ticket, as it's ancient.