Bug 12139 - Race between break oplock and check for share_mode
Summary: Race between break oplock and check for share_mode
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: 4.2.12
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-08-10 14:43 UTC by Lev
Modified: 2021-01-11 11:22 UTC (History)
5 users (show)

See Also:


Attachments
Output of testparm -sv (10.27 KB, text/plain)
2016-08-10 14:44 UTC, Lev
no flags Details
Client-1 log (101.75 KB, text/plain)
2016-08-10 14:44 UTC, Lev
no flags Details
Client-2 log (99.40 KB, text/plain)
2016-08-10 14:44 UTC, Lev
no flags Details
bad hack (1.18 KB, patch)
2016-08-10 15:21 UTC, Volker Lendecke
no flags Details
Proposed patch for master. (6.46 KB, patch)
2016-08-10 21:46 UTC, Jeremy Allison
no flags Details
Test patch for 4.2.x. (4.97 KB, patch)
2016-08-10 22:49 UTC, Jeremy Allison
no flags Details
git-am fix for 4.5.0, 4.4.x, 4.3.x (6.79 KB, patch)
2016-08-11 21:56 UTC, Jeremy Allison
vl: review+
slow: review+
jra: review? (obnox)
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Lev 2016-08-10 14:43:01 UTC
We have a samba server and 2 windows clients, connected to the same share. Client-1 opens a file and immediately closes it, but Windows delays the close request. Then client-2 also tries to open a file. The oplock on client-1 is broken, it now issues real close request. The problem is that process handling connection with client-1 sends a message about oplock break BEFORE it removes share_mode:

static NTSTATUS close_normal_file(struct smb_request *req, files_struct *fsp, enum file_close_type close_type)
{
	...
	if(fsp->oplock_type) {
		remove_oplock(fsp);
	}
	
	if (fsp->fh->ref_count == 1) {
		/* Should we return on error here... ? */
		tmp = close_remove_share_mode(fsp, close_type);
		status = ntstatus_keeperror(status, tmp);
	}
	...
}

So the client-2 process may see old share_mode, and fail the open request with error NT_STATUS_SHARING_VIOLATION, although client-2 tried to open a file AFTER client-1 closed it.

To reproduce the issue I inserted artificial 5sec sleep into close_normal_file() after call to remove_oplock().

Scenario:

Client-1: vm80, pid=17685.
Client-2: vm86, pid=17743

1. Client-1 creates a file b4 with dwDesiredAccess=GENERIC_READ|GENERIC_WRITE and dwShareMode=0 and immediately closes it. Windows delays the close request, so the file remains with access_mask=0x12019f and oplock=0x3.

[2016/08/10 15:25:12.055645, 10, pid=17685, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:4983(create_file_default)
  create_file: access_mask = 0x12019f file_attributes = 0x80, share_access = 0x0, create_disposition = 0x3 create_options = 0x40 oplock_request = 0x2 private_flags = 0x0 root_dir_fid = 0x0, ea_list = 0x(nil), sd = 0x(nil), fname = b4
[2016/08/10 15:25:12.069567,  5, pid=17685, effective(65534, 65534), real(65534, 0), class=locking] ../source3/smbd/oplock.c:87(set_file_oplock)
  set_file_oplock: granted oplock 0x3 on file /export/v1:b4, fd1b:90:0/2512722705, tv_sec = 57ab1d28, tv_usec = d978

2. Client-2 tries to open a file with dwDesiredAccess=GENERIC_READ and dwShareMode=FILE_SHARE_READ. 

	[2016/08/10 15:25:17.854027, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:4983(create_file_default)
	  create_file: access_mask = 0x120089 file_attributes = 0x80, share_access = 0x1, create_disposition = 0x3 create_options = 0x40 oplock_request = 0x2 private_flags = 0x0 root_dir_fid = 0x0, ea_list = 0x(nil), sd = 0x(nil), fname = b4

3. share_conflict() returns conflict, so break request is sent to client-1:

	[2016/08/10 15:25:17.864630, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1025(share_conflict)
	  share_conflict: entry->access_mask = 0x12019f, entry->share_access = 0x0, entry->private_options = 0x0
	[2016/08/10 15:25:17.864651, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1036(share_conflict)
	  share_conflict: access_mask = 0x120089, share_access = 0x1
	[2016/08/10 15:25:17.864669, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1085(share_conflict)
	  share_conflict: [1] am (0x12019f) & right (0x6) = 0x6
	[2016/08/10 15:25:17.864705, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1085(share_conflict)
	  share_conflict: [1] sa (0x1) & share (0x2) = 0x0
	[2016/08/10 15:25:17.864725, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1085(share_conflict)
	  share_conflict: check 1 conflict am = 0x12019f, right = 0x6, sa = 0x1, share = 0x2
	[2016/08/10 15:25:17.864829, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1484(delay_for_oplock)
	  breaking from 7 to 1
	[2016/08/10 15:25:17.864847, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1256(send_break_message)
	  Sending break request to PID 17685

[2016/08/10 15:25:17.866427, 10, pid=17685, effective(65534, 65534), real(65534, 0), class=locking] ../source3/smbd/oplock.c:806(process_oplock_break_message)
  Got oplock break to 1 message from pid 17743: fd1b:90:0/2512722705
[2016/08/10 15:25:17.867347, 10, pid=17685, effective(65534, 65534), real(65534, 0), class=locking] ../source3/smbd/oplock.c:921(process_oplock_break_message)
  break_from=7, break_to=1
[2016/08/10 15:25:17.867441, 10, pid=17685, effective(65534, 65534), real(65534, 0)] ../source3/smbd/smb2_break.c:474(send_break_message_smb2)
  send_break_message_smb2: sending oplock break for file b4, fnum 2050546187, smb2 level 1

4. Client-1 closes the file:

[2016/08/10 15:25:17.868508, 10, pid=17685, effective(65534, 65534), real(65534, 0)] ../source3/smbd/smb2_server.c:1967(smbd_smb2_request_dispatch)
  smbd_smb2_request_dispatch: opcode[SMB2_OP_CLOSE] mid = 5
  
4.1. Process 17685 removes oplock and sends a message to process 17743:
  
[2016/08/10 15:25:17.868722, 10, pid=17685, effective(65534, 65534), real(65534, 0), class=locking] ../source3/smbd/oplock.c:244(remove_oplock)
  remove_oplock called for b4
[2016/08/10 15:25:17.871852, 10, pid=17685, effective(0, 0), real(0, 0)] ../source3/lib/messages_dgm.c:325(messaging_dgm_send)
  messaging_dgm_send: Sending message to 17743
  
4.3. Process 17685 goes to artificial 5sec sleep:

[2016/08/10 15:25:17.871991,  0, pid=17685, effective(65534, 65534), real(65534, 0)] ../source3/smbd/close.c:746(close_normal_file)
  close_normal_file: sleep 0
[2016/08/10 15:25:18.873361,  0, pid=17685, effective(65534, 65534), real(65534, 0)] ../source3/smbd/close.c:746(close_normal_file)
  close_normal_file: sleep 1
[2016/08/10 15:25:19.874045,  0, pid=17685, effective(65534, 65534), real(65534, 0)] ../source3/smbd/close.c:746(close_normal_file)
  close_normal_file: sleep 2
[2016/08/10 15:25:20.874759,  0, pid=17685, effective(65534, 65534), real(65534, 0)] ../source3/smbd/close.c:746(close_normal_file)
  close_normal_file: sleep 3
[2016/08/10 15:25:21.874967,  0, pid=17685, effective(65534, 65534), real(65534, 0)] ../source3/smbd/close.c:746(close_normal_file)
  close_normal_file: sleep 4 
  
4.2. Process 17743 wakes up, retries file open and fails again in share_conflict, because it still sees share_modes with access_mask=0x120089:
  
	[2016/08/10 15:25:17.875106, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/lib/messages.c:252(messaging_recv_cb)
	  messaging_recv_cb: Received message 0xfa3 len 44 (num_fds:0) from 17685
	[2016/08/10 15:25:17.883919, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/smb2_create.c:510(smbd_smb2_create_send)
	  smbd_smb2_create_send: reentrant for file b4
	[2016/08/10 15:25:17.885296, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:2478(open_file_ntcreate)
	  open_file_ntcreate: fname=b4, dos_attrs=0x80 access_mask=0x120089 share_access=0x1 create_disposition = 0x3 create_options=0x40 unix mode=0744 oplock_request=2 private_flags = 0x0

	[2016/08/10 15:25:17.892258, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1025(share_conflict)
	  share_conflict: entry->access_mask = 0x12019f, entry->share_access = 0x0, entry->private_options = 0x0
	[2016/08/10 15:25:17.892280, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1036(share_conflict)
	  share_conflict: access_mask = 0x120089, share_access = 0x1
	[2016/08/10 15:25:17.892298, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1085(share_conflict)
	  share_conflict: [1] am (0x12019f) & right (0x6) = 0x6
	[2016/08/10 15:25:17.892316, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1085(share_conflict)
	  share_conflict: [1] sa (0x1) & share (0x2) = 0x0
	[2016/08/10 15:25:17.892334, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:1085(share_conflict)
	  share_conflict: check 1 conflict am = 0x12019f, right = 0x6, sa = 0x1, share = 0x2
	[2016/08/10 15:25:17.892376, 10, pid=17743, effective(65534, 65534), real(65534, 0), class=locking] ../source3/locking/locking.c:749(share_mode_stale_pid)
	  PID 17685 (index 0 out of 1) still exists
	  
	[2016/08/10 15:25:17.898002, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/open.c:4804(create_file_unixpath)
	  create_file_unixpath: NT_STATUS_SHARING_VIOLATION
	[2016/08/10 15:25:17.898563, 10, pid=17743, effective(65534, 65534), real(65534, 0)] ../source3/smbd/smb2_server.c:2617(smbd_smb2_request_done_ex)
	  smbd_smb2_request_done_ex: idx[1] status[NT_STATUS_SHARING_VIOLATION] body[8] dyn[yes:1] at ../source3/smbd/smb2_server.c:2788
	
4.4. Process 17685 removes share_modes, but it's too late, process 17743 already returned error to the client:

[2016/08/10 15:25:22.875228,  0, pid=17685, effective(65534, 65534), real(65534, 0)] ../source3/smbd/close.c:751(close_normal_file)
  close_normal_file: ref_count=1
[2016/08/10 15:25:22.875511, 10, pid=17685, effective(65534, 65534), real(65534, 0), class=locking] ../source3/locking/share_mode_lock.c:167(parse_share_modes)
  parse_share_modes:
          share_modes: ARRAY(1)
[2016/08/10 15:25:22.876001, 10, pid=17685, effective(65534, 65534), real(65534, 0), class=locking] ../source3/locking/share_mode_lock.c:187(unparse_share_modes)
  unparse_share_modes:
          share_modes: ARRAY(0)

Attached are testparm output and full logs from both smbd processes.
Comment 1 Lev 2016-08-10 14:44:17 UTC
Created attachment 12352 [details]
Output of testparm -sv
Comment 2 Lev 2016-08-10 14:44:41 UTC
Created attachment 12353 [details]
Client-1 log
Comment 3 Lev 2016-08-10 14:44:57 UTC
Created attachment 12354 [details]
Client-2 log
Comment 4 Volker Lendecke 2016-08-10 15:05:02 UTC
What exact Samba version are you running? This should not happen since we changed locking.tdb to dbwrap_record_watch for this type of operation.
Comment 5 Volker Lendecke 2016-08-10 15:07:05 UTC
"remove_oplock called for b4" -- in 4.2.12 this message is in line 235, not 244... :-)
Comment 6 Lev 2016-08-10 15:08:02 UTC
I'm using 4.2.12. I added some my prints/sleeps, so line numbers may differ.
Comment 7 Volker Lendecke 2016-08-10 15:16:05 UTC
Ok, I see how this can happen. Can you try a quick patch?
Comment 8 Lev 2016-08-10 15:19:31 UTC
Sure, I have an environment ready. What's the patch?
Comment 9 Volker Lendecke 2016-08-10 15:21:41 UTC
Created attachment 12355 [details]
bad hack

Here it is... Don't tell anybody it's mine, this is REALLY only for testing!
Comment 10 Lev 2016-08-10 15:44:34 UTC
Thanks, Volker. Yes, the patch solves the issue.
Comment 11 Jeremy Allison 2016-08-10 16:01:42 UTC
Great analysis - thanks ! Assigning to Volker as he's already looked at it (but this one interests too :-).
Comment 12 Jeremy Allison 2016-08-10 21:46:49 UTC
Created attachment 12359 [details]
Proposed patch for master.

Volker, what about this as a possible patch for master ?

It splits out the internals of remove_oplock() into a new function remove_oplock_under_lock(), allowing it to be called from both remove_oplock() and also from close_remove_share_mode() whilst the lock is already held.
Comment 13 Jeremy Allison 2016-08-10 22:49:34 UTC
Created attachment 12360 [details]
Test patch for 4.2.x.

lev@zadarastorage.com - can you test this back-ported version of the proposed patch for master in your 4.2.x code ?

Thanks !
Comment 14 Lev 2016-08-11 06:07:15 UTC
Thanks, Jeremy. The patch fixes the issue in 4.2.12.
Comment 15 Volker Lendecke 2016-08-11 14:17:44 UTC
(In reply to Jeremy Allison from comment #12)
> Volker, what about this as a possible patch for master ?

Pushed, thanks!

Volker
Comment 16 Jeremy Allison 2016-08-11 21:56:33 UTC
Created attachment 12362 [details]
git-am fix for 4.5.0, 4.4.x, 4.3.x

Contains cherry-pick info from master. Applies cleanly to 4.5.0, 4.4.x, 4.3.x.
Comment 17 Volker Lendecke 2016-08-12 16:56:09 UTC
Jeremy, if there is so much urgency on the review, you should alert me separately. I have it in my inbox, but other customer priorities prevented me from looking at this today.
Comment 18 Jeremy Allison 2016-08-12 17:06:58 UTC
Volker, there's no point in asking you specifically, and I'm not trying to bug you. I assumed you're busy, that's why I asked others to look at it. It's a pretty easy fix so doing the +1 is a formality for any of the requested engineers. One of my vendors wanted it in the next 4.x releases so I asked others to look. Don't take things too personally.

Cheers,

Jeremy.
Comment 19 Ralph Böhme 2016-08-12 17:12:21 UTC
Reassigning to Karolin for inclusion in 4.3, 4.4 and 4.5.
Comment 20 Stefan Metzmacher 2016-08-24 09:45:26 UTC
Pushed to autobuild-v4-5-test.
Comment 21 Stefan Metzmacher 2016-08-28 15:58:10 UTC
Pushed to v4-5-test.
Comment 22 Karolin Seeger 2016-09-13 09:50:04 UTC
Pushed to autobuild-v4-{3,4}-test.
Comment 23 Karolin Seeger 2016-09-20 07:03:17 UTC
Pushed to all branches.
Closing out bug report.

Thanks!