When running the base.bench-torture torture test with multiple connections, smbds lock up. there is a deadlock between brlock.tdb and locking.tdb. more details to follow
Once call chain is open_file_ntcreate (gets the share mode) -> grant_fsp_oplock_type -> file_has_brlocks -> brl_get_locks_readonly (this wants to get the brlock entry) The other call chain is do_lock (gets the brlock entry) -> brl_lock -> brl_lock_windows_default -> contend_level2_oplocks_begin_default -> get_share_mode_lock ==> DEADLOCK Not sure how to solve this. Maybe over the weekend I find something. Volker
Created attachment 6495 [details] git-am fix for 3.6.0 Force the open operation (which is the expensive one anyway) to acquire and release locks in a way compatible with the more common do_lock check. This should fix the issue in the mundane, non-clusered case (it passes make test, but I haven't checked 100% it fixes your deadlock :-). Volker and Christian, you'll have to decide if this works well enough for you in the clustered case. Jeremy.
Comment on attachment 6495 [details] git-am fix for 3.6.0 As I touched this piece of code in the past too, let me comment on this, please. The patch looks good at first glance, only one question: why not to use brl_get_locks_readonly rather than brl_get_locks in acquire_ordered_lock?
initial testing of the patch looks good will give it some more cycles
Pavel Shilovsky wrote: "why not to use brl_get_locks_readonly rather than brl_get_locks in acquire_ordered_lock?" Good question - I remember something about looking at the difference in that function in the clustered vs non-clustered case. I'll examine closer and update the attachment if that's appropriate. Thanks ! Jeremy.
This looks okay in general. Jeremy, can you explain these hunks? Why did you move the fd_close, and why did you remove the SMB_ASSERT? Thanks, Volker @@ -2216,8 +2250,6 @@ static NTSTATUS open_file_ntcreate(connection_struct *conn, if (!NT_STATUS_IS_OK(status)) { struct deferred_open_record state; - fd_close(fsp); - state.delayed_for_oplocks = False; state.id = id; @@ -2232,11 +2264,13 @@ static NTSTATUS open_file_ntcreate(connection_struct *conn, defer_open(lck, request_time, timeval_zero(), req, &state); } - TALLOC_FREE(lck); + free_ordered_locks(lck, br_lck); + fd_close(fsp); return status; } grant_fsp_oplock_type(fsp, + br_lck, oplock_request, got_level2_oplock, got_a_none_oplock); @@ -2247,14 +2281,12 @@ static NTSTATUS open_file_ntcreate(connection_struct *conn, } - SMB_ASSERT(lck != NULL); -
1). I moved the fd_close() in order to make the exit code order : + free_ordered_locks(lck, br_lck); + fd_close(fsp); identical in all places. In every other exit path we do the operations in this order. Having one place where this was different doesn't by itself cause problems, but might lead to unexpected intermittent bugs later on if changes are made to fd_close() that may in some way affect the interactions between the two functions - we'd only hit the bug when exiting on that codepath. 2). The SMB_ASSERT(lck) is no longer valid, as the function acquire_ordered_locks() can leave lck == NULL, if lp_locking() == false. No other code depends on lck being non-null. Hope this helps. I'll look at the use of brl_get_locks_readonly() vs brl_get_locks() next. In the meantime I'm going to check into master now Volker and Christian have confirmed this fix looks ok. Jeremy.
Created attachment 6511 [details] git-am fix for 3.6.0 Ok, here is a *much* more minimal patch that achieves the same goal - but uses brl_get_locks_readonly() in the same way the original code did. Leaves the SMB_ASSERTs and fd_close changes out (as Volker pointed out they're not strictly needed changes). This is what has gone into master. Volker & Christian, please review for 3.6.0. Jeremy.
Added Christian Ambach as an additional reviewer.
+ if (lp_locking(fsp->conn->params)) { + *p_br_lck = brl_get_locks_readonly(fsp); + if (*p_br_lck == NULL) { + DEBUG(0, ("Could not get br_lock\n")); + return false; + } + /* Note - we don't need to free the returned + br_lck explicitly as it was allocated on talloc_tos() + and so will be autofreed (and release the lock) + once the frame context disappears. + + If it was set to fsp->brlock_rec then it was + talloc_move'd to hang off the fsp pointer and + in this case is guarenteed to not be holding the + lock on the brlock database. */ + } + + *p_lck = get_share_mode_lock(mem_ctx, + id, + connectpath, + smb_fname, + p_old_write_time); + + if (*p_lck == NULL) { + DEBUG(0, ("Could not get share mode lock\n")); + TALLOC_FREE(*p_br_lck); Why do we need to free it here ( according to the comment above we don't need it )? + return false; + }
Pavel Shilovsky wrote : > Why do we need to free it here ( according to the comment above we don't need > it )? This is the error code path where we're bailing out anyway. I'd rather return both locks or none at all here. Jeremy.
Comment on attachment 6511 [details] git-am fix for 3.6.0 Haven't been able to reproduce the bug anymore with that patch. Volker
Pushed to v3-6-test. Closing out bug report. Thanks!
Comment on attachment 6511 [details] git-am fix for 3.6.0 sorry, for the late reply.. I wasn't able to reproduce the deadlock with the patch, so giving it my blessing