Bug 8175 - smbd deadlock
smbd deadlock
Status: RESOLVED FIXED
Product: Samba 3.6
Classification: Unclassified
Component: File services
3.6.0rc1
All All
: P1 regression
: ---
Assigned To: Karolin Seeger
Samba QA Contact
:
Depends on:
Blocks: 7803
  Show dependency treegraph
 
Reported: 2011-05-27 15:05 UTC by Christian Ambach
Modified: 2011-06-08 11:26 UTC (History)
1 user (show)

See Also:
jra: review? (ambi)


Attachments
git-am fix for 3.6.0 (9.89 KB, patch)
2011-05-27 19:34 UTC, Jeremy Allison
no flags Details
git-am fix for 3.6.0 (5.74 KB, patch)
2011-06-01 19:16 UTC, Jeremy Allison
vl: review+
ambi: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Christian Ambach 2011-05-27 15:05:00 UTC
When running the base.bench-torture torture test with multiple connections, smbds lock up. there is a deadlock between brlock.tdb and locking.tdb.

more details to follow
Comment 1 Volker Lendecke 2011-05-27 15:17:59 UTC
Once call chain is open_file_ntcreate (gets the share mode) -> grant_fsp_oplock_type -> file_has_brlocks -> brl_get_locks_readonly (this wants to get the brlock entry)

The other call chain is do_lock (gets the brlock entry) -> brl_lock -> brl_lock_windows_default -> contend_level2_oplocks_begin_default -> get_share_mode_lock

==> DEADLOCK

Not sure how to solve this. Maybe over the weekend I find something.

Volker
Comment 2 Jeremy Allison 2011-05-27 19:34:44 UTC
Created attachment 6495 [details]
git-am fix for 3.6.0

Force the open operation (which is the expensive one anyway) to acquire and release locks in a way compatible with the more common do_lock check.
    
This should fix the issue in the mundane, non-clusered case (it passes make test, but I haven't checked 100% it fixes your deadlock :-). Volker and Christian, you'll have to decide if this works well enough for you in the clustered case.
    
Jeremy.
Comment 3 Pavel Shilovsky 2011-05-30 07:13:36 UTC
Comment on attachment 6495 [details]
git-am fix for 3.6.0

As I touched this piece of code in the past too, let me comment on this, please. The patch looks good at first glance, only one question: why not to use brl_get_locks_readonly rather than brl_get_locks in acquire_ordered_lock?
Comment 4 Christian Ambach 2011-05-30 15:23:39 UTC
initial testing of the patch looks good
will give it some more cycles
Comment 5 Jeremy Allison 2011-05-31 23:43:24 UTC
Pavel Shilovsky wrote: "why not to use brl_get_locks_readonly rather than brl_get_locks in acquire_ordered_lock?"

Good question - I remember something about looking at the difference in that function in the clustered vs non-clustered case. I'll examine closer and update the attachment if that's appropriate.

Thanks !

Jeremy.
Comment 6 Volker Lendecke 2011-06-01 14:38:38 UTC
This looks okay in general.

Jeremy, can you explain these hunks? Why did you move the fd_close, and why did you remove the SMB_ASSERT?

Thanks,

Volker

@@ -2216,8 +2250,6 @@ static NTSTATUS open_file_ntcreate(connection_struct *conn,
 		if (!NT_STATUS_IS_OK(status)) {
 			struct deferred_open_record state;
 
-			fd_close(fsp);
-
 			state.delayed_for_oplocks = False;
 			state.id = id;
 
@@ -2232,11 +2264,13 @@ static NTSTATUS open_file_ntcreate(connection_struct *conn,
 				defer_open(lck, request_time, timeval_zero(),
 					   req, &state);
 			}
-			TALLOC_FREE(lck);
+			free_ordered_locks(lck, br_lck);
+			fd_close(fsp);
 			return status;
 		}
 
 		grant_fsp_oplock_type(fsp,
+				br_lck,
                                 oplock_request,
                                 got_level2_oplock,
                                 got_a_none_oplock);
@@ -2247,14 +2281,12 @@ static NTSTATUS open_file_ntcreate(connection_struct *conn,
 
 	}
 
-	SMB_ASSERT(lck != NULL);
-
Comment 7 Jeremy Allison 2011-06-01 16:39:39 UTC
1). I moved the fd_close() in order to make the exit code order :

+            free_ordered_locks(lck, br_lck);
+            fd_close(fsp);

identical in all places. In every other exit path we do the operations in this order. Having one place where this was different doesn't by itself cause problems, but might lead to unexpected intermittent bugs later on if changes are made to fd_close() that may in some way affect the interactions between the two functions - we'd only hit the bug when exiting on that codepath.

2). The SMB_ASSERT(lck) is no longer valid, as the function acquire_ordered_locks() can leave lck == NULL, if lp_locking() == false. No other code depends on lck being non-null.

Hope this helps. I'll look at the use of brl_get_locks_readonly() vs brl_get_locks() next.

In the meantime I'm going to check into master now Volker and Christian have confirmed this fix looks ok.

Jeremy.
Comment 8 Jeremy Allison 2011-06-01 19:16:11 UTC
Created attachment 6511 [details]
git-am fix for 3.6.0

Ok, here is a *much* more minimal patch that achieves the same goal - but uses brl_get_locks_readonly() in the same way the original code did. Leaves the SMB_ASSERTs and fd_close changes out (as Volker pointed out they're not strictly needed changes). This is what has gone into master.

Volker & Christian, please review for 3.6.0.

Jeremy.
Comment 9 Jeremy Allison 2011-06-01 19:16:44 UTC
Added Christian Ambach as an additional reviewer.
Comment 10 Pavel Shilovsky 2011-06-01 19:28:35 UTC
+	if (lp_locking(fsp->conn->params)) {
+		*p_br_lck = brl_get_locks_readonly(fsp);
+		if (*p_br_lck == NULL) {
+			DEBUG(0, ("Could not get br_lock\n"));
+			return false;
+		}
+		/* Note - we don't need to free the returned
+		   br_lck explicitly as it was allocated on talloc_tos()
+		   and so will be autofreed (and release the lock)
+		   once the frame context disappears.
+
+		   If it was set to fsp->brlock_rec then it was
+		   talloc_move'd to hang off the fsp pointer and
+		   in this case is guarenteed to not be holding the
+		   lock on the brlock database. */
+	}
+
+	*p_lck = get_share_mode_lock(mem_ctx,
+				id,
+				connectpath,
+				smb_fname,
+				p_old_write_time);
+
+	if (*p_lck == NULL) {
+		DEBUG(0, ("Could not get share mode lock\n"));
+		TALLOC_FREE(*p_br_lck);

Why do we need to free it here ( according to the comment above we don't need it )?

+		return false;
+	}
Comment 11 Jeremy Allison 2011-06-01 19:37:34 UTC
Pavel Shilovsky wrote :

> Why do we need to free it here ( according to the comment above we don't need
> it )?

This is the error code path where we're bailing out anyway. I'd rather return both locks or none at all here.

Jeremy.
Comment 12 Volker Lendecke 2011-06-07 07:50:21 UTC
Comment on attachment 6511 [details]
git-am fix for 3.6.0

Haven't been able to reproduce the bug anymore with that patch.

Volker
Comment 13 Karolin Seeger 2011-06-07 11:09:50 UTC
Pushed to v3-6-test.
Closing out bug report.

Thanks!
Comment 14 Christian Ambach 2011-06-08 11:26:38 UTC
Comment on attachment 6511 [details]
git-am fix for 3.6.0

sorry, for the late reply.. I wasn't able to reproduce the deadlock with the patch, so giving it my blessing