Bug 13058 - Kernel oplock can remain not broken if oplock holder closes the file and another process catches the lease.
Kernel oplock can remain not broken if oplock holder closes the file and anot...
Status: RESOLVED FIXED
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services
4.5.10
All All
: P5 normal
: ---
Assigned To: Karolin Seeger
Samba QA Contact
:
Depends on:
Blocks: 13121
  Show dependency treegraph
 
Reported: 2017-09-27 16:54 UTC by Lev
Modified: 2017-11-14 09:50 UTC (History)
2 users (show)

See Also:


Attachments
smbd logs (7.56 KB, application/x-zip-compressed)
2017-09-27 16:54 UTC, Lev
no flags Details
git-am *TEST* patch for 4.7.next, 4.6.next. (10.56 KB, patch)
2017-10-30 22:34 UTC, Jeremy Allison
no flags Details
git-am fix for master (13.14 KB, patch)
2017-11-09 21:31 UTC, Jeremy Allison
slow: review+
Details
git-am fix for 4.7.next. (2.06 KB, patch)
2017-11-11 04:02 UTC, Jeremy Allison
slow: review-
Details
git-am fix for 4.6.next (2.06 KB, patch)
2017-11-11 04:03 UTC, Jeremy Allison
slow: review-
Details
git-am fix for 4.7.next. (13.50 KB, patch)
2017-11-11 17:09 UTC, Jeremy Allison
slow: review+
Details
git-am fix for 4.6.next (13.48 KB, patch)
2017-11-11 17:09 UTC, Jeremy Allison
slow: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Lev 2017-09-27 16:54:17 UTC
Created attachment 13639 [details]
smbd logs

There are 3 smbd processes.

1. smbd-1 opens the file and sets the lease.
2. smbd-2 and smbd-3 try to open the file. open() fails(EAGAIN) and open is deferred.
3. smbd-1 sends oplock break request to the windows host.
4. smbd-1 closes the file.
5. smbd-2 calls defer_open_done(), sees that the file has no oplock and reschedules open.
6. smbd-2 opens the file and sets the lease.
7. smbd-3 calls defer_open_done(), sees that the file lease was not changed and does not reschedule open.

Now smbd-3 wait for the lease to be broken, smbd-2 holds the lease and doesn't know smbd-3 wait for it.

In 60 sec smbd-3 panics with "Kernel oplock holder didn't respond to break message"

Logs summary:

1. smbd-1 opens the file and sets the lease

[2017/09/27 18:47:56.056660,  5, pid=6646, class=vfs] ../source3/modules/vfs_default.c:564(vfswrap_open)
  open(file1, flags=0x208c2, mode=0x1e4)
[2017/09/27 18:47:56.057524,  2, pid=6646] ../source3/smbd/open.c:1322(open_file)
  nobody opened file file1 read=No write=Yes (numopen=1)
[2017/09/27 18:47:56.058272,  3, pid=6646, class=locking] ../source3/smbd/oplock_linux.c:155(linux_set_kernel_oplock)
  linux_set_kernel_oplock: got kernel oplock on file file1, file_id = fd1c:c3:0 gen_id = 1144407203
[2017/09/27 18:47:56.058638,  5, pid=6646, class=locking] ../source3/smbd/oplock.c:89(set_file_oplock)
  set_file_oplock: granted oplock 0x3 on file /export/v1:file1, fd1c:c3:0/1144407203, tv_sec = 59cbc82c, tv_usec = cf48

2. smbd-2 and smbd-3 try to open the file.

	[2017/09/27 18:48:03.989624,  0, pid=6435, class=vfs] ../source3/modules/vfs_default.c:566(vfswrap_open)
	  open(file1, flags=0x20802, mode=0x1e4): errno=11
	[2017/09/27 18:48:03.989666,  3, pid=6435] ../source3/smbd/open.c:1179(open_file)
	  Error opening file file1 (NT_STATUS_NETWORK_BUSY) (local_flags=2626) (flags=2626)
	[2017/09/27 18:48:03.989702,  1, pid=6435] ../source3/smbd/open.c:3264(open_file_ntcreate)
	  open_file /export/v1:file1 with flags=0x2 flags2=0xA40 mode=0744, access_mask = 0x120196, open_access_mask = 0x120196 returned NT_STATUS_NETWORK_BUSY
	[2017/09/27 18:48:03.989949,  1, pid=6435] ../source3/smbd/open.c:1802(delay_for_oplock)
	  breaking from 7 to 0 (delay_mask=4,first_open_attempt=1)
	[2017/09/27 18:48:03.990174,  1, pid=6435] ../source3/smbd/open.c:1574(send_break_message)
	  Sending break request to PID 6646
	[2017/09/27 18:48:03.990642,  1, pid=6435] ../source3/smbd/open.c:2287(defer_open)
	  defer_open: request time [2017/09/27 18:48:03] timeout [2017/09/27 18:49:03] mid [9] delayed_for_oplocks [yes] kernel_oplock [yes] file_id [fd1c:c3:0]
	[2017/09/27 18:48:03.990704,  1, pid=6435] ../source3/smbd/open.c:2309(defer_open)
	  defer_open: defering mid 9
	[2017/09/27 18:48:03.990977,  1, pid=6435] ../source3/smbd/open.c:3307(open_file_ntcreate)
	  Sent oplock break request to kernel oplock holder

		[2017/09/27 18:48:08.072527,  0, pid=6534, class=vfs] ../source3/modules/vfs_default.c:566(vfswrap_open)
		  open(file1, flags=0x20802, mode=0x1e4): errno=11
		[2017/09/27 18:48:08.072613,  3, pid=6534] ../source3/smbd/open.c:1179(open_file)
		  Error opening file file1 (NT_STATUS_NETWORK_BUSY) (local_flags=2626) (flags=2626)
		[2017/09/27 18:48:08.072833,  1, pid=6534] ../source3/smbd/open.c:3264(open_file_ntcreate)
		  open_file /export/v1:file1 with flags=0x2 flags2=0xA40 mode=0744, access_mask = 0x120196, open_access_mask = 0x120196 returned NT_STATUS_NETWORK_BUSY
		[2017/09/27 18:48:08.073151,  1, pid=6534] ../source3/smbd/open.c:1802(delay_for_oplock)
		  breaking from 7 to 0 (delay_mask=4,first_open_attempt=1)
		[2017/09/27 18:48:08.073277,  1, pid=6534] ../source3/smbd/open.c:1574(send_break_message)
		  Sending break request to PID 6646
		[2017/09/27 18:48:08.074171,  1, pid=6534] ../source3/smbd/open.c:2287(defer_open)
		  defer_open: request time [2017/09/27 18:48:08] timeout [2017/09/27 18:49:08] mid [9] delayed_for_oplocks [yes] kernel_oplock [yes] file_id [fd1c:c3:0]
		[2017/09/27 18:48:08.074266,  1, pid=6534] ../source3/smbd/open.c:2309(defer_open)
		  defer_open: defering mid 9
		[2017/09/27 18:48:08.074453,  1, pid=6534] ../source3/smbd/open.c:3307(open_file_ntcreate)
		  Sent oplock break request to kernel oplock holder

3. smbd-1 sends oplock break request to the windows host.

[2017/09/27 18:48:03.989699,  1, pid=6646, class=locking] ../source3/smbd/oplock.c:43(break_kernel_oplock)
  break_kernel_oplock: break oplock 0x3 on file /export/v1:file1
[2017/09/27 18:48:03.991711,  1, pid=6646, class=locking] ../source3/smbd/oplock.c:806(process_oplock_break_message)
  Got oplock break to 0 message from pid 6435: fd1c:c3:0/1144407203
[2017/09/27 18:48:03.991771,  1, pid=6646, class=locking] ../source3/smbd/oplock.c:656(initial_break_processing)
  initial_break_processing: called for fd1c:c3:0/1144407203
  Current oplocks_open (exclusive = 1, levelII = 0)
[2017/09/27 18:48:03.991908,  1, pid=6646, class=locking] ../source3/smbd/oplock.c:921(process_oplock_break_message)
  break_from=7, break_to=0
[2017/09/27 18:48:03.991962,  1, pid=6646] ../source3/smbd/smb2_break.c:474(send_break_message_smb2)
  send_break_message_smb2: sending oplock break for file file1, fnum 2868223777, smb2 level 0

4. smbd-1 closes the file.

[2017/09/27 18:48:24.000888,  2, pid=6646] ../source3/smbd/close.c:788(close_normal_file)
  nobody closed file file1 (numopen=0) NT_STATUS_OK
[2017/09/27 18:48:24.001687,  5, pid=6646] ../lib/dbwrap/dbwrap.c:159(dbwrap_check_lock_order)
  check lock order 1 for /var/lock/samba/smbXsrv_open_global.tdb
[2017/09/27 18:48:24.002202,  5, pid=6646] ../lib/dbwrap/dbwrap.c:127(dbwrap_lock_order_state_destructor)
  release lock order 1 for /var/lock/samba/smbXsrv_open_global.tdb
[2017/09/27 18:48:24.002496,  5, pid=6646] ../source3/smbd/files.c:569(file_free)
  freed files structure 2868223777 (0 used)  
  
5. smbd-2 calls defer_open_done(), sees that the file has no oplock and reschedules open.

	[2017/09/27 18:48:28.002100,  5, pid=6435] ../lib/dbwrap/dbwrap.c:159(dbwrap_check_lock_order)
	  check lock order 1 for /var/lock/samba/locking.tdb
	[2017/09/27 18:48:28.002241,  5, pid=6435, class=locking] ../source3/locking/share_mode_lock.c:547(get_share_mode_lock_internal)
	  get_share_mode_lock_internal: Could not get share mode lock
	[2017/09/27 18:48:28.002297,  5, pid=6435] ../lib/dbwrap/dbwrap.c:127(dbwrap_lock_order_state_destructor)
	  release lock order 1 for /var/lock/samba/locking.tdb
	[2017/09/27 18:48:28.002389,  1, pid=6435] ../source3/smbd/open.c:2406(defer_open_done)
	  defer_open_done: scheduling mid 9
  
6. smbd-2 opens the file and sets the lease.  

	[2017/09/27 18:48:28.005909,  5, pid=6435, class=vfs] ../source3/modules/vfs_default.c:564(vfswrap_open)
	  open(file1, flags=0x20802, mode=0x1e4)
	[2017/09/27 18:48:28.005957,  2, pid=6435] ../source3/smbd/open.c:1322(open_file)
	  nobody opened file file1 read=No write=Yes (numopen=1)
	[2017/09/27 18:48:28.011377,  3, pid=6435, class=locking] ../source3/smbd/oplock_linux.c:155(linux_set_kernel_oplock)
	  linux_set_kernel_oplock: got kernel oplock on file file1, file_id = fd1c:c3:0 gen_id = 3200188779
	[2017/09/27 18:48:28.011634,  5, pid=6435, class=locking] ../source3/smbd/oplock.c:89(set_file_oplock)
	  set_file_oplock: granted oplock 0x3 on file /export/v1:file1, fd1c:c3:0/3200188779, tv_sec = 59cbc84c, tv_usec = 13bb

7. smbd-3 calls defer_open_done(), sees that the file lease was not changed and does not reschedule open.

		[2017/09/27 18:48:34.004600,  5, pid=6534] ../lib/dbwrap/dbwrap.c:159(dbwrap_check_lock_order)
		  check lock order 1 for /var/lock/samba/locking.tdb
		[2017/09/27 18:48:34.004885,  1, pid=6534] ../source3/smbd/open.c:2396(defer_open_done)
		  defer_open_done: Unchanged lease: 7
		[2017/09/27 18:48:34.004916,  1, pid=6534] ../source3/smbd/open.c:2423(defer_open_done)
		  defer_open_done: Keep waiting for oplock release for [/export/v1/file1] mid: 9
		[2017/09/27 18:48:34.004979,  5, pid=6534] ../lib/dbwrap/dbwrap.c:127(dbwrap_lock_order_state_destructor)
		  release lock order 1 for /var/lock/samba/locking.tdb
				  
8. smbd-3 calls defer_open_done() because of 60-sec timeout, and panics.

		[2017/09/27 18:49:08.085984,  1, pid=6534] ../source3/smbd/open.c:2349(defer_open_done)
		  dbwrap_watched_watch_recv returned NT_STATUS_IO_TIMEOUT
		[2017/09/27 18:49:08.086021,  0, pid=6534] ../source3/lib/util.c:791(smb_panic_s3)
		  PANIC (pid 6534): Kernel oplock holder didn't respond to break message
Comment 1 Lev 2017-09-27 18:34:43 UTC
Even simpler scenario with only 2 smbd processes:

1. smbd-1 opens the file and sets the lease.
2. smbd-2 tries to open the file. open() fails(EAGAIN) and open is deferred.
3. smbd-1 sends oplock break request to the windows host.
4. smbd-1 closes the file.
5. smbd-1 opens the file and sets the lease.
6. smbd-2 calls defer_open_done(), sees that the file lease was not changed and does not reschedule open.

Now smbd-2 wait for the lease to be broken, smbd-1 holds the lease but doesn't know smbd-2 waits for it. In 60 sec smbd-2 panics.

This scenario is similar to one described in https://bugzilla.samba.org/show_bug.cgi?id=12853#c10. The difference is that in 12853 smbd-2 rescheduled the open request, that later did blocking open(), and here request is not rescheduled.

1. smbd-1 opens the file and sets the lease.

[2017/09/27 21:07:13.442844,  5, pid=6435, class=vfs] ../source3/modules/vfs_default.c:564(vfswrap_open)
  open(file1, flags=0x208c2, mode=0x1e4)
[2017/09/27 21:07:13.445673,  3, pid=6435, class=locking] ../source3/smbd/oplock_linux.c:155(linux_set_kernel_oplock)
  linux_set_kernel_oplock: got kernel oplock on file file1, file_id = fd1c:c3:0 gen_id = 2689086092
[2017/09/27 21:07:13.445770,  5, pid=6435, class=locking] ../source3/smbd/oplock.c:89(set_file_oplock)
  set_file_oplock: granted oplock 0x3 on file /export/v1:file1, fd1c:c3:0/2689086092, tv_sec = 59cbe8d1, tv_usec = 6b493

2. smbd-2 tries to open the file.

	[2017/09/27 21:07:31.212303,  0, pid=8017, class=vfs] ../source3/modules/vfs_default.c:566(vfswrap_open)
	  open(file1, flags=0x20802, mode=0x1e4): errno=11
	[2017/09/27 21:07:31.212404,  3, pid=8017] ../source3/smbd/open.c:1179(open_file)
	  Error opening file file1 (NT_STATUS_NETWORK_BUSY) (local_flags=2626) (flags=2626)
	[2017/09/27 21:07:31.212512,  1, pid=8017] ../source3/smbd/open.c:3264(open_file_ntcreate)
	  open_file /export/v1:file1 with flags=0x2 flags2=0xA40 mode=0744, access_mask = 0x120196, open_access_mask = 0x120196 returned NT_STATUS_NETWORK_BUSY
	[2017/09/27 21:07:31.213188,  1, pid=8017] ../source3/smbd/open.c:1802(delay_for_oplock)
	  breaking from 7 to 0 (delay_mask=4,first_open_attempt=1)
	[2017/09/27 21:07:31.213276,  1, pid=8017] ../source3/smbd/open.c:1574(send_break_message)
	  Sending break request to PID 6435
	[2017/09/27 21:07:31.215086,  1, pid=8017] ../source3/smbd/open.c:2287(defer_open)
	  defer_open: request time [2017/09/27 21:07:31] timeout [2017/09/27 21:08:31] mid [19] delayed_for_oplocks [yes] kernel_oplock [yes] file_id [fd1c:c3:0]
	[2017/09/27 21:07:31.215625,  1, pid=8017] ../source3/smbd/open.c:2309(defer_open)
	  defer_open: defering mid 19
	[2017/09/27 21:07:31.215826,  1, pid=8017] ../source3/smbd/open.c:3307(open_file_ntcreate)
	  Sent oplock break request to kernel oplock holder
  
3. smbd-1 sends oplock break request to the windows host.

[2017/09/27 21:07:31.212417,  1, pid=6435, class=locking] ../source3/smbd/oplock.c:43(break_kernel_oplock)
  break_kernel_oplock: break oplock 0x3 on file /export/v1:file1
[2017/09/27 21:07:31.214064,  1, pid=6435, class=locking] ../source3/smbd/oplock.c:1000(process_kernel_oplock_break)
  Got kernel oplock break message from pid 6435: fd1c:c3:0/2689086092
[2017/09/27 21:07:31.214144,  1, pid=6435, class=locking] ../source3/smbd/oplock.c:656(initial_break_processing)
  initial_break_processing: called for fd1c:c3:0/2689086092
  Current oplocks_open (exclusive = 1, levelII = 0)
[2017/09/27 21:07:31.214230,  1, pid=6435] ../source3/smbd/smb2_break.c:474(send_break_message_smb2)
  send_break_message_smb2: sending oplock break for file file1, fnum 1083006431, smb2 level 0
  
4. smbd-1 closes the file.

[2017/09/27 21:07:51.220854,  2, pid=6435] ../source3/smbd/close.c:788(close_normal_file)
  nobody closed file file1 (numopen=1) NT_STATUS_OK

5. smbd-1 opens the file and sets the lease.

[2017/09/27 21:07:51.226068,  5, pid=6435, class=vfs] ../source3/modules/vfs_default.c:564(vfswrap_open)
  open(file1, flags=0x20802, mode=0x1e4)
[2017/09/27 21:07:51.232719,  3, pid=6435, class=locking] ../source3/smbd/oplock_linux.c:155(linux_set_kernel_oplock)
  linux_set_kernel_oplock: got kernel oplock on file file1, file_id = fd1c:c3:0 gen_id = 3903088344
[2017/09/27 21:07:51.233253,  5, pid=6435, class=locking] ../source3/smbd/oplock.c:89(set_file_oplock)
  set_file_oplock: granted oplock 0x3 on file /export/v1:file1, fd1c:c3:0/3903088344, tv_sec = 59cbe8f7, tv_usec = 36b60
  
6. smbd-2 calls defer_open_done(), sees that the file lease was not changed and does not reschedule open.

	[2017/09/27 21:08:01.220715,  5, pid=8017] ../lib/dbwrap/dbwrap.c:159(dbwrap_check_lock_order)
	  check lock order 1 for /var/lock/samba/locking.tdb
	[2017/09/27 21:08:01.220992,  1, pid=8017] ../source3/smbd/open.c:2396(defer_open_done)
	  defer_open_done: Unchanged lease: 7
	[2017/09/27 21:08:01.221044,  1, pid=8017] ../source3/smbd/open.c:2423(defer_open_done)
	  defer_open_done: Keep waiting for oplock release for [/export/v1/file1] mid: 19
	  
7. smbd-2 calls defer_open_done() because of 60-sec timeout, and panics.	  

	[2017/09/27 21:08:31.214877,  1, pid=8017] ../source3/smbd/open.c:2349(defer_open_done)
	  dbwrap_watched_watch_recv returned NT_STATUS_IO_TIMEOUT
	[2017/09/27 21:08:31.214914,  0, pid=8017] ../source3/lib/util.c:791(smb_panic_s3)
	  PANIC (pid 8017): Kernel oplock holder didn't respond to break message
Comment 2 Jeremy Allison 2017-09-29 23:43:43 UTC
So looks like defer_open_done() needs to be checking more than:

                        lease_type = get_lease_type_from_share_mode(lck->data);

                        if ((lease_type != 0) &&
                            (lease_type == state->lease_type))
                        {
                                DBG_DEBUG("Unchanged lease: %" PRIu32 "\n",
                                          lease_type);
                                schedule_req = false;
                        }

to determine if the lease has changed, yes ? Would that be the fix needed here ?
Comment 3 Jeremy Allison 2017-09-29 23:45:35 UTC
Looks like the sequence_number field might do the trick here..
Comment 4 Jeremy Allison 2017-10-19 20:37:41 UTC
FYI. This is the commit that caused the problem. b35a296a27a0807c780f2a9e7af2f2e93feefaa8

I'll check into the details here first, as this is really subtle.
Comment 5 Ralph Böhme 2017-10-19 21:11:02 UTC
Ups, b35a296a27a0807c780f2a9e7af2f2e93feefaa8 was me. Nasty stuff, at some point we really have to make create_file fully async.

Preperational stuff is here:
https://git.samba.org/?p=slow/samba.git;a=shortlog;h=refs/heads/smb2_create_send

Thanks for looking into this!
Comment 6 Jeremy Allison 2017-10-19 21:17:00 UTC
I don't think b35a296a27a0807c780f2a9e7af2f2e93feefaa8 was wrong as such, just that it didn't take into account this specific case. Should be fixable with some minor tweaks I think - but I'll need you to look at them really carefully :-).
Comment 7 Jeremy Allison 2017-10-27 23:30:38 UTC
OK, I think I've figured out a fix for this. Should have a patch for you to test early next week.
Comment 8 Jeremy Allison 2017-10-30 22:34:05 UTC
Created attachment 13742 [details]
git-am *TEST* patch for 4.7.next, 4.6.next.

Hi Lev,

Can you test this patch ? I think b35a296a27a0807c780f2a9e7af2f2e93feefaa8 tried to fix the kernel oplock issue in the wrong way, so this patch tries another approach I think might be more successful.

Cheers,

Jeremy.
Comment 9 Jeremy Allison 2017-11-02 18:49:47 UTC
OK, after a conversation with Ralph, we have a plan for a fix for this :-).

I'll add a new patch once I've coded up a regression test.

Jeremy.
Comment 10 Jeremy Allison 2017-11-09 21:31:09 UTC
Created attachment 13764 [details]
git-am fix for master

Correct patch for master. Includes regression test case.
Comment 11 Jeremy Allison 2017-11-11 04:02:41 UTC
Created attachment 13770 [details]
git-am fix for 4.7.next.

Cherry-picked from the patch that went into master.
Comment 12 Jeremy Allison 2017-11-11 04:03:12 UTC
Created attachment 13771 [details]
git-am fix for 4.6.next

Back-port of the fix that went into master.
Comment 13 Ralph Böhme 2017-11-11 14:13:38 UTC
Comment on attachment 13770 [details]
git-am fix for 4.7.next.

Wrong bug...
Comment 14 Ralph Böhme 2017-11-11 14:13:58 UTC
Comment on attachment 13771 [details]
git-am fix for 4.6.next

Wrong bug...
Comment 15 Ralph Böhme 2017-11-11 14:15:17 UTC
Comment on attachment 13771 [details]
git-am fix for 4.6.next

Wrong bug...
Comment 16 Jeremy Allison 2017-11-11 17:09:14 UTC
Created attachment 13772 [details]
git-am fix for 4.7.next.

Sigh. I did the work, just added the wrong attachment. Correct one attached.
Comment 17 Jeremy Allison 2017-11-11 17:09:48 UTC
Created attachment 13773 [details]
git-am fix for 4.6.next

Sigh. I did the work, just added the wrong attachment. Correct one attached.
Comment 18 Ralph Böhme 2017-11-11 17:19:58 UTC
Comment on attachment 13772 [details]
git-am fix for 4.7.next.

I was just about two attach the backports myself...
Comment 19 Ralph Böhme 2017-11-11 17:20:56 UTC
Reassigning to Karolin for inclusion in 4.6 and 4.7.
Comment 20 Karolin Seeger 2017-11-13 08:22:51 UTC
(In reply to Ralph Böhme from comment #19)
Pushed to autobuild-v4-{6,7}-test.
Comment 21 Karolin Seeger 2017-11-14 09:50:03 UTC
(In reply to Karolin Seeger from comment #20)
Pushed to both branches.
Closing out bug report.

Thanks!