8175 – smbd deadlock

Bug 8175 - smbd deadlock

Summary: smbd deadlock

Status:	RESOLVED FIXED

Alias:	None

Product:	Samba 3.6
Classification:	Unclassified
Component:	File services (show other bugs)
Version:	3.6.0rc1
Hardware:	All All

Importance:	P1 regression
Target Milestone:	---
Assignee:	Karolin Seeger
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:	7803
	Show dependency tree / graph

Reported:	2011-05-27 15:05 UTC by Christian Ambach
Modified:	2020-12-11 11:43 UTC (History)
CC List:	1 user (show)

See Also:

Flags:	ambi: review+

Attachments
git-am fix for 3.6.0 (9.89 KB, patch) 2011-05-27 19:34 UTC, Jeremy Allison	no flags	Details
git-am fix for 3.6.0 (5.74 KB, patch) 2011-06-01 19:16 UTC, Jeremy Allison	vl: review+ ambi: review+	Details
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Christian Ambach 2011-05-27 15:05:00 UTC

When running the base.bench-torture torture test with multiple connections, smbds lock up. there is a deadlock between brlock.tdb and locking.tdb.

more details to follow

Comment 1 Volker Lendecke 2011-05-27 15:17:59 UTC

Once call chain is open_file_ntcreate (gets the share mode) -> grant_fsp_oplock_type -> file_has_brlocks -> brl_get_locks_readonly (this wants to get the brlock entry)

The other call chain is do_lock (gets the brlock entry) -> brl_lock -> brl_lock_windows_default -> contend_level2_oplocks_begin_default -> get_share_mode_lock

==> DEADLOCK

Not sure how to solve this. Maybe over the weekend I find something.

Volker

Comment 2 Jeremy Allison 2011-05-27 19:34:44 UTC

Created attachment 6495 [details]
git-am fix for 3.6.0

Force the open operation (which is the expensive one anyway) to acquire and release locks in a way compatible with the more common do_lock check.
    
This should fix the issue in the mundane, non-clusered case (it passes make test, but I haven't checked 100% it fixes your deadlock :-). Volker and Christian, you'll have to decide if this works well enough for you in the clustered case.
    
Jeremy.

Comment 3 Pavel Shilovsky 2011-05-30 07:13:36 UTC

Comment on attachment 6495 [details]
git-am fix for 3.6.0

As I touched this piece of code in the past too, let me comment on this, please. The patch looks good at first glance, only one question: why not to use brl_get_locks_readonly rather than brl_get_locks in acquire_ordered_lock?

Comment 4 Christian Ambach 2011-05-30 15:23:39 UTC

initial testing of the patch looks good
will give it some more cycles

Comment 5 Jeremy Allison 2011-05-31 23:43:24 UTC

Pavel Shilovsky wrote: "why not to use brl_get_locks_readonly rather than brl_get_locks in acquire_ordered_lock?"

Good question - I remember something about looking at the difference in that function in the clustered vs non-clustered case. I'll examine closer and update the attachment if that's appropriate.

Thanks !

Jeremy.

Comment 6 Volker Lendecke 2011-06-01 14:38:38 UTC

This looks okay in general.

Jeremy, can you explain these hunks? Why did you move the fd_close, and why did you remove the SMB_ASSERT?

Thanks,

Volker

@@ -2216,8 +2250,6 @@ static NTSTATUS open_file_ntcreate(connection_struct *conn,
 		if (!NT_STATUS_IS_OK(status)) {
 			struct deferred_open_record state;
 
-			fd_close(fsp);
-
 			state.delayed_for_oplocks = False;
 			state.id = id;
 
@@ -2232,11 +2264,13 @@ static NTSTATUS open_file_ntcreate(connection_struct *conn,
 				defer_open(lck, request_time, timeval_zero(),
 					   req, &state);
 			}
-			TALLOC_FREE(lck);
+			free_ordered_locks(lck, br_lck);
+			fd_close(fsp);
 			return status;
 		}
 
 		grant_fsp_oplock_type(fsp,
+				br_lck,
                                 oplock_request,
                                 got_level2_oplock,
                                 got_a_none_oplock);
@@ -2247,14 +2281,12 @@ static NTSTATUS open_file_ntcreate(connection_struct *conn,
 
 	}
 
-	SMB_ASSERT(lck != NULL);
-

Comment 7 Jeremy Allison 2011-06-01 16:39:39 UTC

1). I moved the fd_close() in order to make the exit code order :

+            free_ordered_locks(lck, br_lck);
+            fd_close(fsp);

identical in all places. In every other exit path we do the operations in this order. Having one place where this was different doesn't by itself cause problems, but might lead to unexpected intermittent bugs later on if changes are made to fd_close() that may in some way affect the interactions between the two functions - we'd only hit the bug when exiting on that codepath.

2). The SMB_ASSERT(lck) is no longer valid, as the function acquire_ordered_locks() can leave lck == NULL, if lp_locking() == false. No other code depends on lck being non-null.

Hope this helps. I'll look at the use of brl_get_locks_readonly() vs brl_get_locks() next.

In the meantime I'm going to check into master now Volker and Christian have confirmed this fix looks ok.

Jeremy.

Comment 8 Jeremy Allison 2011-06-01 19:16:11 UTC

Created attachment 6511 [details]
git-am fix for 3.6.0

Ok, here is a *much* more minimal patch that achieves the same goal - but uses brl_get_locks_readonly() in the same way the original code did. Leaves the SMB_ASSERTs and fd_close changes out (as Volker pointed out they're not strictly needed changes). This is what has gone into master.

Volker & Christian, please review for 3.6.0.

Jeremy.

Comment 9 Jeremy Allison 2011-06-01 19:16:44 UTC

Added Christian Ambach as an additional reviewer.

Comment 10 Pavel Shilovsky 2011-06-01 19:28:35 UTC

+	if (lp_locking(fsp->conn->params)) {
+		*p_br_lck = brl_get_locks_readonly(fsp);
+		if (*p_br_lck == NULL) {
+			DEBUG(0, ("Could not get br_lock\n"));
+			return false;
+		}
+		/* Note - we don't need to free the returned
+		   br_lck explicitly as it was allocated on talloc_tos()
+		   and so will be autofreed (and release the lock)
+		   once the frame context disappears.
+
+		   If it was set to fsp->brlock_rec then it was
+		   talloc_move'd to hang off the fsp pointer and
+		   in this case is guarenteed to not be holding the
+		   lock on the brlock database. */
+	}
+
+	*p_lck = get_share_mode_lock(mem_ctx,
+				id,
+				connectpath,
+				smb_fname,
+				p_old_write_time);
+
+	if (*p_lck == NULL) {
+		DEBUG(0, ("Could not get share mode lock\n"));
+		TALLOC_FREE(*p_br_lck);

Why do we need to free it here ( according to the comment above we don't need it )?

+		return false;
+	}

Comment 11 Jeremy Allison 2011-06-01 19:37:34 UTC

Pavel Shilovsky wrote :

> Why do we need to free it here ( according to the comment above we don't need
> it )?

This is the error code path where we're bailing out anyway. I'd rather return both locks or none at all here.

Jeremy.

Comment 12 Volker Lendecke 2011-06-07 07:50:21 UTC

Comment on attachment 6511 [details]
git-am fix for 3.6.0

Haven't been able to reproduce the bug anymore with that patch.

Volker

Comment 13 Karolin Seeger 2011-06-07 11:09:50 UTC

Pushed to v3-6-test.
Closing out bug report.

Thanks!

Comment 14 Christian Ambach 2011-06-08 11:26:38 UTC

Comment on attachment 6511 [details]
git-am fix for 3.6.0

sorry, for the late reply.. I wasn't able to reproduce the deadlock with the patch, so giving it my blessing