Bug 13549 - Durable Reconnect fails because cookie.allow_reconnect is not set
Summary: Durable Reconnect fails because cookie.allow_reconnect is not set
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: unspecified
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-27 17:57 UTC by Ralph Böhme
Modified: 2018-09-06 07:05 UTC (History)
2 users (show)

See Also:


Attachments
Samba Debug Logs from two cluster nodes (2.06 MB, application/octet-stream)
2018-07-27 17:59 UTC, Ralph Böhme
no flags Details
Possible patch (2.70 KB, patch)
2018-08-29 15:25 UTC, Ralph Böhme
no flags Details
Patch for 4.9, cherry-picked from master (19.89 KB, patch)
2018-09-04 09:41 UTC, Ralph Böhme
slow: review? (jra)
metze: review+
Details
Patch for 4.7 and 4.8 backported from master (19.90 KB, patch)
2018-09-04 09:42 UTC, Ralph Böhme
slow: review? (jra)
metze: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Ralph Böhme 2018-07-27 17:57:50 UTC
With a Samba debug build with additional debugging I was able to track down a failing Durabel Handle reconnect to cookie.allow_reconnect not being set in the cookie of the handle open record.

[ 0 2018/07/23 19:24:00.720815 ../source3/smbd/durable.c:593 vfs_default_durable_reconnect]
  vfs_default_durable_reconnect: !cookie.allow_reconnect

This happens because of a race condition, the reconnect is faster then the disconnect. In the disconnect we do a bunch of stuff including setting timestamps with utimes() before updating the open record with a cookie that has cookie.allow_reconnect set to true.

What happens is that the utimes() in the disconnect is so slow, allowing the reconnect to overtake the disconnect.

The reconnect then finds the old open record with the seqnum

node0/win10.log:

[ 1 2018/07/23 19:24:00.716142 ../librpc/ndr/ndr.c:419 ndr_print_debug]
...
          seqnum                   : 0x000000a4 (164)

This is the orignal open record from the other node before it got updated in the disconnect:

[10 2018/07/23 19:23:40.363881 ../source3/smbd/smb2_server.c:2327 smbd_smb2_request_dispatch]
  smbd_smb2_request_dispatch: opcode[SMB2_OP_CREATE] mid = 4
...

[ 1 2018/07/23 19:23:40.378799 ../librpc/ndr/ndr.c:419 ndr_print_debug]
...
          seqnum                   : 0x000000a4 (164)
...
                  backend_cookie           : DATA_BLOB length=428
  [0000] 56 46 53 5F 44 45 46 41   55 4C 54 5F 44 55 52 41   VFS_DEFA ULT_DURA
  [0010] 42 4C 45 5F 43 4F 4F 4B   49 45 5F 4D 41 47 49 43   BLE_COOK IE_MAGIC
  [0020] 20 20 20 20 20 20 20 20   20 20 20 20 20 20 20 20
  [0030] 00 00 00 00 00 00 00 00   15 60 B1 C7 FB 6A F5 53   ........ .`...j.S

[ 1 2018/07/23 19:24:00.822124 ../librpc/ndr/ndr.c:419 ndr_print_debug]
       &global_blob: struct smbXsrv_open_globalB
          version                  : SMBXSRV_VERSION_0 (0)
          seqnum                   : 0x000000a6 (166)
...
                  backend_cookie           : DATA_BLOB length=428
  [0000] 56 46 53 5F 44 45 46 41   55 4C 54 5F 44 55 52 41   VFS_DEFA ULT_DURA
  [0010] 42 4C 45 5F 43 4F 4F 4B   49 45 5F 4D 41 47 49 43   BLE_COOK IE_MAGIC
  [0020] 20 20 20 20 20 20 20 20   20 20 20 20 20 20 20 20
  [0030] 00 00 00 00 01 00 00 00   15 60 B1 C7 FB 6A F5 53   ........ .`...j.S

Offset 0x34 in the backend_cookie is the cookie.allow_reconnect variable.
Comment 5 Ralph Böhme 2018-08-29 15:25:45 UTC
Created attachment 14453 [details]
Possible patch
Comment 6 Stefan Metzmacher 2018-08-29 20:26:14 UTC
Comment on attachment 14453 [details]
Possible patch

Looks good, but we may want to move session->client = NULL; down a bit,
after invalidate_vuid()...
Comment 7 Stefan Metzmacher 2018-08-29 20:28:36 UTC
(In reply to Stefan Metzmacher from comment #6)

Should we do the same in smbXsrv_tcon_disconnect()?
Comment 8 Ralph Böhme 2018-08-30 13:22:56 UTC
(In reply to Ralph Böhme from comment #5)
Tested the patch after applying an additional hack that simulated slow disconnect:

--- a/source3/smbd/durable.c
+++ b/source3/smbd/durable.c
@@ -223,6 +223,7 @@ NTSTATUS vfs_default_durable_disconnect(struct files_struct *fsp,
                if (!null_timespec(ft.mtime)) {
                        round_timespec(conn->ts_res, &ft.mtime);
                        file_ntimes(conn, fsp->fsp_name, &ft);
+                       smb_msleep(5000);
                }
 
                ok = mark_share_mode_disconnected(lck, fsp);

This causes the disconnect to be stuck for 5 seconds, so the new session has to correctly rendezvous with the old session.

With this I see the expected NT_STATUS_PENDING in the session setup, where the new session is blocked and waits until the old session is torn down.
Comment 9 Ralph Böhme 2018-08-30 13:26:09 UTC
(In reply to Stefan Metzmacher from comment #6)
I don't see a point in doing so, session->client isn't used in invalidate_vuid().

I guess it's also an change to really related to this bug, is it?
Comment 10 Ralph Böhme 2018-08-30 13:54:25 UTC
(In reply to Stefan Metzmacher from comment #7)
Makes sense.

Looks like closing files via

  smbXsrv_session_logoff() -> file_close_user()

is redundant anyway as files are already closed via

  smbXsrv_tcon_disconnect() -> close_cnum() -> file_close_conn()

I'll leave as is, but add a comment...
Comment 11 Ralph Böhme 2018-09-04 09:41:40 UTC
Created attachment 14472 [details]
Patch for 4.9, cherry-picked from master
Comment 12 Ralph Böhme 2018-09-04 09:42:16 UTC
Created attachment 14473 [details]
Patch for 4.7 and 4.8 backported from master
Comment 13 Karolin Seeger 2018-09-04 10:01:15 UTC
Pushed to autobuild-v4-{9,8,7}-test.
Comment 14 Karolin Seeger 2018-09-06 07:05:33 UTC
(In reply to Karolin Seeger from comment #13)
Pushed to all branches.
Closing out bug report.

Thanks!