Created attachment 13639 [details] smbd logs There are 3 smbd processes. 1. smbd-1 opens the file and sets the lease. 2. smbd-2 and smbd-3 try to open the file. open() fails(EAGAIN) and open is deferred. 3. smbd-1 sends oplock break request to the windows host. 4. smbd-1 closes the file. 5. smbd-2 calls defer_open_done(), sees that the file has no oplock and reschedules open. 6. smbd-2 opens the file and sets the lease. 7. smbd-3 calls defer_open_done(), sees that the file lease was not changed and does not reschedule open. Now smbd-3 wait for the lease to be broken, smbd-2 holds the lease and doesn't know smbd-3 wait for it. In 60 sec smbd-3 panics with "Kernel oplock holder didn't respond to break message" Logs summary: 1. smbd-1 opens the file and sets the lease [2017/09/27 18:47:56.056660, 5, pid=6646, class=vfs] ../source3/modules/vfs_default.c:564(vfswrap_open) open(file1, flags=0x208c2, mode=0x1e4) [2017/09/27 18:47:56.057524, 2, pid=6646] ../source3/smbd/open.c:1322(open_file) nobody opened file file1 read=No write=Yes (numopen=1) [2017/09/27 18:47:56.058272, 3, pid=6646, class=locking] ../source3/smbd/oplock_linux.c:155(linux_set_kernel_oplock) linux_set_kernel_oplock: got kernel oplock on file file1, file_id = fd1c:c3:0 gen_id = 1144407203 [2017/09/27 18:47:56.058638, 5, pid=6646, class=locking] ../source3/smbd/oplock.c:89(set_file_oplock) set_file_oplock: granted oplock 0x3 on file /export/v1:file1, fd1c:c3:0/1144407203, tv_sec = 59cbc82c, tv_usec = cf48 2. smbd-2 and smbd-3 try to open the file. [2017/09/27 18:48:03.989624, 0, pid=6435, class=vfs] ../source3/modules/vfs_default.c:566(vfswrap_open) open(file1, flags=0x20802, mode=0x1e4): errno=11 [2017/09/27 18:48:03.989666, 3, pid=6435] ../source3/smbd/open.c:1179(open_file) Error opening file file1 (NT_STATUS_NETWORK_BUSY) (local_flags=2626) (flags=2626) [2017/09/27 18:48:03.989702, 1, pid=6435] ../source3/smbd/open.c:3264(open_file_ntcreate) open_file /export/v1:file1 with flags=0x2 flags2=0xA40 mode=0744, access_mask = 0x120196, open_access_mask = 0x120196 returned NT_STATUS_NETWORK_BUSY [2017/09/27 18:48:03.989949, 1, pid=6435] ../source3/smbd/open.c:1802(delay_for_oplock) breaking from 7 to 0 (delay_mask=4,first_open_attempt=1) [2017/09/27 18:48:03.990174, 1, pid=6435] ../source3/smbd/open.c:1574(send_break_message) Sending break request to PID 6646 [2017/09/27 18:48:03.990642, 1, pid=6435] ../source3/smbd/open.c:2287(defer_open) defer_open: request time [2017/09/27 18:48:03] timeout [2017/09/27 18:49:03] mid [9] delayed_for_oplocks [yes] kernel_oplock [yes] file_id [fd1c:c3:0] [2017/09/27 18:48:03.990704, 1, pid=6435] ../source3/smbd/open.c:2309(defer_open) defer_open: defering mid 9 [2017/09/27 18:48:03.990977, 1, pid=6435] ../source3/smbd/open.c:3307(open_file_ntcreate) Sent oplock break request to kernel oplock holder [2017/09/27 18:48:08.072527, 0, pid=6534, class=vfs] ../source3/modules/vfs_default.c:566(vfswrap_open) open(file1, flags=0x20802, mode=0x1e4): errno=11 [2017/09/27 18:48:08.072613, 3, pid=6534] ../source3/smbd/open.c:1179(open_file) Error opening file file1 (NT_STATUS_NETWORK_BUSY) (local_flags=2626) (flags=2626) [2017/09/27 18:48:08.072833, 1, pid=6534] ../source3/smbd/open.c:3264(open_file_ntcreate) open_file /export/v1:file1 with flags=0x2 flags2=0xA40 mode=0744, access_mask = 0x120196, open_access_mask = 0x120196 returned NT_STATUS_NETWORK_BUSY [2017/09/27 18:48:08.073151, 1, pid=6534] ../source3/smbd/open.c:1802(delay_for_oplock) breaking from 7 to 0 (delay_mask=4,first_open_attempt=1) [2017/09/27 18:48:08.073277, 1, pid=6534] ../source3/smbd/open.c:1574(send_break_message) Sending break request to PID 6646 [2017/09/27 18:48:08.074171, 1, pid=6534] ../source3/smbd/open.c:2287(defer_open) defer_open: request time [2017/09/27 18:48:08] timeout [2017/09/27 18:49:08] mid [9] delayed_for_oplocks [yes] kernel_oplock [yes] file_id [fd1c:c3:0] [2017/09/27 18:48:08.074266, 1, pid=6534] ../source3/smbd/open.c:2309(defer_open) defer_open: defering mid 9 [2017/09/27 18:48:08.074453, 1, pid=6534] ../source3/smbd/open.c:3307(open_file_ntcreate) Sent oplock break request to kernel oplock holder 3. smbd-1 sends oplock break request to the windows host. [2017/09/27 18:48:03.989699, 1, pid=6646, class=locking] ../source3/smbd/oplock.c:43(break_kernel_oplock) break_kernel_oplock: break oplock 0x3 on file /export/v1:file1 [2017/09/27 18:48:03.991711, 1, pid=6646, class=locking] ../source3/smbd/oplock.c:806(process_oplock_break_message) Got oplock break to 0 message from pid 6435: fd1c:c3:0/1144407203 [2017/09/27 18:48:03.991771, 1, pid=6646, class=locking] ../source3/smbd/oplock.c:656(initial_break_processing) initial_break_processing: called for fd1c:c3:0/1144407203 Current oplocks_open (exclusive = 1, levelII = 0) [2017/09/27 18:48:03.991908, 1, pid=6646, class=locking] ../source3/smbd/oplock.c:921(process_oplock_break_message) break_from=7, break_to=0 [2017/09/27 18:48:03.991962, 1, pid=6646] ../source3/smbd/smb2_break.c:474(send_break_message_smb2) send_break_message_smb2: sending oplock break for file file1, fnum 2868223777, smb2 level 0 4. smbd-1 closes the file. [2017/09/27 18:48:24.000888, 2, pid=6646] ../source3/smbd/close.c:788(close_normal_file) nobody closed file file1 (numopen=0) NT_STATUS_OK [2017/09/27 18:48:24.001687, 5, pid=6646] ../lib/dbwrap/dbwrap.c:159(dbwrap_check_lock_order) check lock order 1 for /var/lock/samba/smbXsrv_open_global.tdb [2017/09/27 18:48:24.002202, 5, pid=6646] ../lib/dbwrap/dbwrap.c:127(dbwrap_lock_order_state_destructor) release lock order 1 for /var/lock/samba/smbXsrv_open_global.tdb [2017/09/27 18:48:24.002496, 5, pid=6646] ../source3/smbd/files.c:569(file_free) freed files structure 2868223777 (0 used) 5. smbd-2 calls defer_open_done(), sees that the file has no oplock and reschedules open. [2017/09/27 18:48:28.002100, 5, pid=6435] ../lib/dbwrap/dbwrap.c:159(dbwrap_check_lock_order) check lock order 1 for /var/lock/samba/locking.tdb [2017/09/27 18:48:28.002241, 5, pid=6435, class=locking] ../source3/locking/share_mode_lock.c:547(get_share_mode_lock_internal) get_share_mode_lock_internal: Could not get share mode lock [2017/09/27 18:48:28.002297, 5, pid=6435] ../lib/dbwrap/dbwrap.c:127(dbwrap_lock_order_state_destructor) release lock order 1 for /var/lock/samba/locking.tdb [2017/09/27 18:48:28.002389, 1, pid=6435] ../source3/smbd/open.c:2406(defer_open_done) defer_open_done: scheduling mid 9 6. smbd-2 opens the file and sets the lease. [2017/09/27 18:48:28.005909, 5, pid=6435, class=vfs] ../source3/modules/vfs_default.c:564(vfswrap_open) open(file1, flags=0x20802, mode=0x1e4) [2017/09/27 18:48:28.005957, 2, pid=6435] ../source3/smbd/open.c:1322(open_file) nobody opened file file1 read=No write=Yes (numopen=1) [2017/09/27 18:48:28.011377, 3, pid=6435, class=locking] ../source3/smbd/oplock_linux.c:155(linux_set_kernel_oplock) linux_set_kernel_oplock: got kernel oplock on file file1, file_id = fd1c:c3:0 gen_id = 3200188779 [2017/09/27 18:48:28.011634, 5, pid=6435, class=locking] ../source3/smbd/oplock.c:89(set_file_oplock) set_file_oplock: granted oplock 0x3 on file /export/v1:file1, fd1c:c3:0/3200188779, tv_sec = 59cbc84c, tv_usec = 13bb 7. smbd-3 calls defer_open_done(), sees that the file lease was not changed and does not reschedule open. [2017/09/27 18:48:34.004600, 5, pid=6534] ../lib/dbwrap/dbwrap.c:159(dbwrap_check_lock_order) check lock order 1 for /var/lock/samba/locking.tdb [2017/09/27 18:48:34.004885, 1, pid=6534] ../source3/smbd/open.c:2396(defer_open_done) defer_open_done: Unchanged lease: 7 [2017/09/27 18:48:34.004916, 1, pid=6534] ../source3/smbd/open.c:2423(defer_open_done) defer_open_done: Keep waiting for oplock release for [/export/v1/file1] mid: 9 [2017/09/27 18:48:34.004979, 5, pid=6534] ../lib/dbwrap/dbwrap.c:127(dbwrap_lock_order_state_destructor) release lock order 1 for /var/lock/samba/locking.tdb 8. smbd-3 calls defer_open_done() because of 60-sec timeout, and panics. [2017/09/27 18:49:08.085984, 1, pid=6534] ../source3/smbd/open.c:2349(defer_open_done) dbwrap_watched_watch_recv returned NT_STATUS_IO_TIMEOUT [2017/09/27 18:49:08.086021, 0, pid=6534] ../source3/lib/util.c:791(smb_panic_s3) PANIC (pid 6534): Kernel oplock holder didn't respond to break message
Even simpler scenario with only 2 smbd processes: 1. smbd-1 opens the file and sets the lease. 2. smbd-2 tries to open the file. open() fails(EAGAIN) and open is deferred. 3. smbd-1 sends oplock break request to the windows host. 4. smbd-1 closes the file. 5. smbd-1 opens the file and sets the lease. 6. smbd-2 calls defer_open_done(), sees that the file lease was not changed and does not reschedule open. Now smbd-2 wait for the lease to be broken, smbd-1 holds the lease but doesn't know smbd-2 waits for it. In 60 sec smbd-2 panics. This scenario is similar to one described in https://bugzilla.samba.org/show_bug.cgi?id=12853#c10. The difference is that in 12853 smbd-2 rescheduled the open request, that later did blocking open(), and here request is not rescheduled. 1. smbd-1 opens the file and sets the lease. [2017/09/27 21:07:13.442844, 5, pid=6435, class=vfs] ../source3/modules/vfs_default.c:564(vfswrap_open) open(file1, flags=0x208c2, mode=0x1e4) [2017/09/27 21:07:13.445673, 3, pid=6435, class=locking] ../source3/smbd/oplock_linux.c:155(linux_set_kernel_oplock) linux_set_kernel_oplock: got kernel oplock on file file1, file_id = fd1c:c3:0 gen_id = 2689086092 [2017/09/27 21:07:13.445770, 5, pid=6435, class=locking] ../source3/smbd/oplock.c:89(set_file_oplock) set_file_oplock: granted oplock 0x3 on file /export/v1:file1, fd1c:c3:0/2689086092, tv_sec = 59cbe8d1, tv_usec = 6b493 2. smbd-2 tries to open the file. [2017/09/27 21:07:31.212303, 0, pid=8017, class=vfs] ../source3/modules/vfs_default.c:566(vfswrap_open) open(file1, flags=0x20802, mode=0x1e4): errno=11 [2017/09/27 21:07:31.212404, 3, pid=8017] ../source3/smbd/open.c:1179(open_file) Error opening file file1 (NT_STATUS_NETWORK_BUSY) (local_flags=2626) (flags=2626) [2017/09/27 21:07:31.212512, 1, pid=8017] ../source3/smbd/open.c:3264(open_file_ntcreate) open_file /export/v1:file1 with flags=0x2 flags2=0xA40 mode=0744, access_mask = 0x120196, open_access_mask = 0x120196 returned NT_STATUS_NETWORK_BUSY [2017/09/27 21:07:31.213188, 1, pid=8017] ../source3/smbd/open.c:1802(delay_for_oplock) breaking from 7 to 0 (delay_mask=4,first_open_attempt=1) [2017/09/27 21:07:31.213276, 1, pid=8017] ../source3/smbd/open.c:1574(send_break_message) Sending break request to PID 6435 [2017/09/27 21:07:31.215086, 1, pid=8017] ../source3/smbd/open.c:2287(defer_open) defer_open: request time [2017/09/27 21:07:31] timeout [2017/09/27 21:08:31] mid [19] delayed_for_oplocks [yes] kernel_oplock [yes] file_id [fd1c:c3:0] [2017/09/27 21:07:31.215625, 1, pid=8017] ../source3/smbd/open.c:2309(defer_open) defer_open: defering mid 19 [2017/09/27 21:07:31.215826, 1, pid=8017] ../source3/smbd/open.c:3307(open_file_ntcreate) Sent oplock break request to kernel oplock holder 3. smbd-1 sends oplock break request to the windows host. [2017/09/27 21:07:31.212417, 1, pid=6435, class=locking] ../source3/smbd/oplock.c:43(break_kernel_oplock) break_kernel_oplock: break oplock 0x3 on file /export/v1:file1 [2017/09/27 21:07:31.214064, 1, pid=6435, class=locking] ../source3/smbd/oplock.c:1000(process_kernel_oplock_break) Got kernel oplock break message from pid 6435: fd1c:c3:0/2689086092 [2017/09/27 21:07:31.214144, 1, pid=6435, class=locking] ../source3/smbd/oplock.c:656(initial_break_processing) initial_break_processing: called for fd1c:c3:0/2689086092 Current oplocks_open (exclusive = 1, levelII = 0) [2017/09/27 21:07:31.214230, 1, pid=6435] ../source3/smbd/smb2_break.c:474(send_break_message_smb2) send_break_message_smb2: sending oplock break for file file1, fnum 1083006431, smb2 level 0 4. smbd-1 closes the file. [2017/09/27 21:07:51.220854, 2, pid=6435] ../source3/smbd/close.c:788(close_normal_file) nobody closed file file1 (numopen=1) NT_STATUS_OK 5. smbd-1 opens the file and sets the lease. [2017/09/27 21:07:51.226068, 5, pid=6435, class=vfs] ../source3/modules/vfs_default.c:564(vfswrap_open) open(file1, flags=0x20802, mode=0x1e4) [2017/09/27 21:07:51.232719, 3, pid=6435, class=locking] ../source3/smbd/oplock_linux.c:155(linux_set_kernel_oplock) linux_set_kernel_oplock: got kernel oplock on file file1, file_id = fd1c:c3:0 gen_id = 3903088344 [2017/09/27 21:07:51.233253, 5, pid=6435, class=locking] ../source3/smbd/oplock.c:89(set_file_oplock) set_file_oplock: granted oplock 0x3 on file /export/v1:file1, fd1c:c3:0/3903088344, tv_sec = 59cbe8f7, tv_usec = 36b60 6. smbd-2 calls defer_open_done(), sees that the file lease was not changed and does not reschedule open. [2017/09/27 21:08:01.220715, 5, pid=8017] ../lib/dbwrap/dbwrap.c:159(dbwrap_check_lock_order) check lock order 1 for /var/lock/samba/locking.tdb [2017/09/27 21:08:01.220992, 1, pid=8017] ../source3/smbd/open.c:2396(defer_open_done) defer_open_done: Unchanged lease: 7 [2017/09/27 21:08:01.221044, 1, pid=8017] ../source3/smbd/open.c:2423(defer_open_done) defer_open_done: Keep waiting for oplock release for [/export/v1/file1] mid: 19 7. smbd-2 calls defer_open_done() because of 60-sec timeout, and panics. [2017/09/27 21:08:31.214877, 1, pid=8017] ../source3/smbd/open.c:2349(defer_open_done) dbwrap_watched_watch_recv returned NT_STATUS_IO_TIMEOUT [2017/09/27 21:08:31.214914, 0, pid=8017] ../source3/lib/util.c:791(smb_panic_s3) PANIC (pid 8017): Kernel oplock holder didn't respond to break message
So looks like defer_open_done() needs to be checking more than: lease_type = get_lease_type_from_share_mode(lck->data); if ((lease_type != 0) && (lease_type == state->lease_type)) { DBG_DEBUG("Unchanged lease: %" PRIu32 "\n", lease_type); schedule_req = false; } to determine if the lease has changed, yes ? Would that be the fix needed here ?
Looks like the sequence_number field might do the trick here..
FYI. This is the commit that caused the problem. b35a296a27a0807c780f2a9e7af2f2e93feefaa8 I'll check into the details here first, as this is really subtle.
Ups, b35a296a27a0807c780f2a9e7af2f2e93feefaa8 was me. Nasty stuff, at some point we really have to make create_file fully async. Preperational stuff is here: https://git.samba.org/?p=slow/samba.git;a=shortlog;h=refs/heads/smb2_create_send Thanks for looking into this!
I don't think b35a296a27a0807c780f2a9e7af2f2e93feefaa8 was wrong as such, just that it didn't take into account this specific case. Should be fixable with some minor tweaks I think - but I'll need you to look at them really carefully :-).
OK, I think I've figured out a fix for this. Should have a patch for you to test early next week.
Created attachment 13742 [details] git-am *TEST* patch for 4.7.next, 4.6.next. Hi Lev, Can you test this patch ? I think b35a296a27a0807c780f2a9e7af2f2e93feefaa8 tried to fix the kernel oplock issue in the wrong way, so this patch tries another approach I think might be more successful. Cheers, Jeremy.
OK, after a conversation with Ralph, we have a plan for a fix for this :-). I'll add a new patch once I've coded up a regression test. Jeremy.
Created attachment 13764 [details] git-am fix for master Correct patch for master. Includes regression test case.
Created attachment 13770 [details] git-am fix for 4.7.next. Cherry-picked from the patch that went into master.
Created attachment 13771 [details] git-am fix for 4.6.next Back-port of the fix that went into master.
Comment on attachment 13770 [details] git-am fix for 4.7.next. Wrong bug...
Comment on attachment 13771 [details] git-am fix for 4.6.next Wrong bug...
Created attachment 13772 [details] git-am fix for 4.7.next. Sigh. I did the work, just added the wrong attachment. Correct one attached.
Created attachment 13773 [details] git-am fix for 4.6.next Sigh. I did the work, just added the wrong attachment. Correct one attached.
Comment on attachment 13772 [details] git-am fix for 4.7.next. I was just about two attach the backports myself...
Reassigning to Karolin for inclusion in 4.6 and 4.7.
(In reply to Ralph Böhme from comment #19) Pushed to autobuild-v4-{6,7}-test.
(In reply to Karolin Seeger from comment #20) Pushed to both branches. Closing out bug report. Thanks!