With a Samba debug build with additional debugging I was able to track down a failing Durabel Handle reconnect to cookie.allow_reconnect not being set in the cookie of the handle open record. [ 0 2018/07/23 19:24:00.720815 ../source3/smbd/durable.c:593 vfs_default_durable_reconnect] vfs_default_durable_reconnect: !cookie.allow_reconnect This happens because of a race condition, the reconnect is faster then the disconnect. In the disconnect we do a bunch of stuff including setting timestamps with utimes() before updating the open record with a cookie that has cookie.allow_reconnect set to true. What happens is that the utimes() in the disconnect is so slow, allowing the reconnect to overtake the disconnect. The reconnect then finds the old open record with the seqnum node0/win10.log: [ 1 2018/07/23 19:24:00.716142 ../librpc/ndr/ndr.c:419 ndr_print_debug] ... seqnum : 0x000000a4 (164) This is the orignal open record from the other node before it got updated in the disconnect: [10 2018/07/23 19:23:40.363881 ../source3/smbd/smb2_server.c:2327 smbd_smb2_request_dispatch] smbd_smb2_request_dispatch: opcode[SMB2_OP_CREATE] mid = 4 ... [ 1 2018/07/23 19:23:40.378799 ../librpc/ndr/ndr.c:419 ndr_print_debug] ... seqnum : 0x000000a4 (164) ... backend_cookie : DATA_BLOB length=428 [0000] 56 46 53 5F 44 45 46 41 55 4C 54 5F 44 55 52 41 VFS_DEFA ULT_DURA [0010] 42 4C 45 5F 43 4F 4F 4B 49 45 5F 4D 41 47 49 43 BLE_COOK IE_MAGIC [0020] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 [0030] 00 00 00 00 00 00 00 00 15 60 B1 C7 FB 6A F5 53 ........ .`...j.S [ 1 2018/07/23 19:24:00.822124 ../librpc/ndr/ndr.c:419 ndr_print_debug] &global_blob: struct smbXsrv_open_globalB version : SMBXSRV_VERSION_0 (0) seqnum : 0x000000a6 (166) ... backend_cookie : DATA_BLOB length=428 [0000] 56 46 53 5F 44 45 46 41 55 4C 54 5F 44 55 52 41 VFS_DEFA ULT_DURA [0010] 42 4C 45 5F 43 4F 4F 4B 49 45 5F 4D 41 47 49 43 BLE_COOK IE_MAGIC [0020] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 [0030] 00 00 00 00 01 00 00 00 15 60 B1 C7 FB 6A F5 53 ........ .`...j.S Offset 0x34 in the backend_cookie is the cookie.allow_reconnect variable.
Created attachment 14453 [details] Possible patch
Comment on attachment 14453 [details] Possible patch Looks good, but we may want to move session->client = NULL; down a bit, after invalidate_vuid()...
(In reply to Stefan Metzmacher from comment #6) Should we do the same in smbXsrv_tcon_disconnect()?
(In reply to Ralph Böhme from comment #5) Tested the patch after applying an additional hack that simulated slow disconnect: --- a/source3/smbd/durable.c +++ b/source3/smbd/durable.c @@ -223,6 +223,7 @@ NTSTATUS vfs_default_durable_disconnect(struct files_struct *fsp, if (!null_timespec(ft.mtime)) { round_timespec(conn->ts_res, &ft.mtime); file_ntimes(conn, fsp->fsp_name, &ft); + smb_msleep(5000); } ok = mark_share_mode_disconnected(lck, fsp); This causes the disconnect to be stuck for 5 seconds, so the new session has to correctly rendezvous with the old session. With this I see the expected NT_STATUS_PENDING in the session setup, where the new session is blocked and waits until the old session is torn down.
(In reply to Stefan Metzmacher from comment #6) I don't see a point in doing so, session->client isn't used in invalidate_vuid(). I guess it's also an change to really related to this bug, is it?
(In reply to Stefan Metzmacher from comment #7) Makes sense. Looks like closing files via smbXsrv_session_logoff() -> file_close_user() is redundant anyway as files are already closed via smbXsrv_tcon_disconnect() -> close_cnum() -> file_close_conn() I'll leave as is, but add a comment...
Created attachment 14472 [details] Patch for 4.9, cherry-picked from master
Created attachment 14473 [details] Patch for 4.7 and 4.8 backported from master
Pushed to autobuild-v4-{9,8,7}-test.
(In reply to Karolin Seeger from comment #13) Pushed to all branches. Closing out bug report. Thanks!