Bug 15186 - File creation error with extended attributes with left-over file on Lustre with disabled user_xattr
Summary: File creation error with extended attributes with left-over file on Lustre wi...
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: 4.17.0
Hardware: All Linux
: P5 normal (vote)
Target Milestone: ---
Assignee: Jule Anger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-09-26 07:48 UTC by Michael Weiser
Modified: 2022-12-15 16:34 UTC (History)
1 user (show)

See Also:


Attachments
EA-filtering VFS module PoC (5.29 KB, patch)
2022-09-26 07:48 UTC, Michael Weiser
no flags Details
Minimalist program to interrogate EAs on Windows (1.96 KB, text/x-csrc)
2022-09-26 07:49 UTC, Michael Weiser
no flags Details
Patch to fuse sshfs to simulate Lustre behaviour (1.08 KB, patch)
2022-09-26 07:51 UTC, Michael Weiser
no flags Details
git-am fix for 4.17.next. (5.26 KB, patch)
2022-10-28 19:36 UTC, Jeremy Allison
vl: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Weiser 2022-09-26 07:48:23 UTC
Created attachment 17531 [details]
EA-filtering VFS module PoC

As discussed on the samba list[1,2] we're seeing a "File already exists" error
message in Windows Explorer when copying files on a Lustre filesystem exported
via samba. It was first described by Christian Kuntz in reports to the samba and
Lustre lists[3,4]. We were able to gather some additional details:

- copying files from a local NTFS filesystem on Windows 10 to a Lustre share
  works fine
- duplicating files *on* the Lustre share shows the problematic behaviour: for
  each file Windows complains that the file already exists and indeed, a
  zero-sized file of that name has been created where before there was none
- copying a file off of the Lustre share onto the local NTFS works fine
  but copying it back onto the Lustre share again shows the problem - so it is
  travelling with the files and suggests that it's not related to
  sever-side-copying
- switching option ea support to no works around the problem - so it seems to
  be related to extended attributes

The relevant part of a level 10 debug log reads:

[2022/09/21 15:50:57.127478, 10, pid=1967443, effective(200737, 130),
real(200737, 0), class=locking]
../../source3/locking/share_mode_lock.c:171(share_mode_memcache_store)
  share_mode_memcache_store: stored entry for file blorf/blorf/blorf/blim.txt
  epoch d7ce6bcd04632a key 3188158982:144115340681419950:0
[2022/09/21 15:50:57.127601, 10, pid=1967443, effective(200737, 130),
real(200737, 0)] ../../source3/smbd/trans2.c:335(get_ea_names_from_fsp)
  get_ea_names_from_fsp: ea_namelist size = 11
[2022/09/21 15:50:57.127612, 10, pid=1967443, effective(200737, 130),
real(200737, 0)] ../../source3/smbd/trans2.c:262(get_ea_value_fsp)
  get_ea_value: EA lustre.lov is of length 56
[2022/09/21 15:50:57.127616, 10, pid=1967443, effective(200737, 130),
real(200737, 0)] ../../lib/util/util.c:570(dump_data)
  [0000] D0 0B D1 0B 01 00 00 00   AE 1C 00 00 00 00 00 00   ........ ........
  [0010] 88 23 00 00 02 00 00 00   00 00 10 00 01 00 00 00   .#...... ........
  [0020] 26 CB 2C 00 00 00 00 00   00 00 00 00 00 00 00 00   &.,..... ........
  [0030] 00 00 00 00 08 00 00 00                             ........
[2022/09/21 15:50:57.127641, 10, pid=1967443, effective(200737, 130),
real(200737, 0)] ../../source3/smbd/trans2.c:508(get_ea_list_from_fsp)
  get_ea_list_from_file: total_len = 71, lustre.lov, val len = 56
[2022/09/21 15:50:57.127646, 10, pid=1967443, effective(200737, 130),
real(200737, 0)] ../../source3/smbd/trans2.c:520(get_ea_list_from_fsp)
  get_ea_list_from_file: total_len = 75
[2022/09/21 15:50:57.127651, 10, pid=1967443, effective(200737, 130),
real(200737, 0)] ../../source3/smbd/trans2.c:726(canonicalize_ea_name)
  canonicalize_ea_name: LUSTRE.LOV -> lustre.lov
[2022/09/21 15:50:57.127655, 10, pid=1967443, effective(200737, 130),
real(200737, 0)] ../../source3/smbd/trans2.c:786(set_ea)
  set_ea: ea_name user.lustre.lov ealen = 56
[2022/09/21 15:50:57.127660, 10, pid=1967443, effective(200737, 130),
real(200737, 0)] ../../source3/smbd/trans2.c:810(set_ea)
  set_ea: setting ea name user.lustre.lov on file blorf/blorf/blorf/blim.txt
  by file descriptor.
[2022/09/21 15:50:57.127668, 10, pid=1967443, effective(200737, 130),
real(200737, 0)] ../../source3/smbd/open.c:6066(create_file_unixpath)
  create_file_unixpath: NT_STATUS_EAS_NOT_SUPPORTED

So create_file_unixpath errors out with NT_STATUS_EAS_NOT_SUPPORTED because it
cannot set extended attribute user.lustre.lov in Lustre. It's this codepath:

https://github.com/samba-team/samba/blob/d9dda4b7af284ecbee4d04a89bd16fc0098e2931/source3/smbd/open.c#L6395

It is assumed that Windows Explorer does expect a CreateFile call with extended
attributes to fail atomically and gets confused by the fact that samba leaves
the file it just created but couldn't set the attributes on lying around after
erroring out of the open. And indeed, patching the code path like so avoids the
problem:

--- samba-4.17.0.orig/source3/smbd/open.c       2022-09-06 16:22:03.302113000 +0200
+++ samba-4.17.0/source3/smbd/open.c    2022-09-21 12:38:22.724000000 +0200
@@ -6098,6 +6098,13 @@
            ((info == FILE_WAS_CREATED) || (info == FILE_WAS_OVERWRITTEN))) {
                status = set_ea(conn, fsp, ea_list);
                if (!NT_STATUS_IS_OK(status)) {
+                       DEBUG(10, ("set_ea failed: %s\n", nt_errstr(status)));
+                       if (info == FILE_WAS_CREATED) {
+                               DEBUG(10, ("file was created - unlinking\n"));
+                               SMB_VFS_UNLINKAT(conn, dirfsp,
+                                               smb_fname_atname, 0);
+                       }
+
                        goto fail;
                }
        }

The extended attribute lustre.lov is part of a set of attributes Lustre
exposes for every file reflecting internal housekeeping data:

$ getfattr -m - blarf
# file: blarf
lustre.lov
trusted.link
trusted.lma
trusted.lov

$ getfattr -n lustre.lov blarf
# file: blarf
lustre.lov=0s0AvRCwEAAABgJAEAAAAAAIQjAAACAAAAAAAQAAEAAAAAqAAAAAAAAAAAAAAAAAAAAAAAAAYAAAA=

Unprivileged users only get to see the lustre.lov attribute:

> getfattr -m - blarf
# file: blarf
lustre.lov

> getfattr -n lustre.lov blarf
# file: blarf
lustre.lov=0s0AvRCwEAAABgJAEAAAAAAIQjAAACAAAAAAAQAAEAAAAAqAAAAAAAAAAAAAAAAAAAAAAAAAYAAAA=

That attribute is exposed to Windows by samba and even travels with the file to
a local NTFS file system which explains the third observation above. This was
verified with a small program leveraging BackupRead (attached):

C:\Users\user>get_ea "blorf\blarf"
LUSTRE.LOV=<binary garbage because I couldn't be bothered to base64 encode it>

Lustre exposes the lustre.lov attribute to unprivileged users and allows to set
it within certain limits. Lustre has user xattrs disabled by default and returns
ENOTSUP when trying to set them.

samba on the other hand maps all EAs with non-standard namespaces into the user
namespace, turning lustre.lov into user.lustre.lov upon file creation which is
rejected by Lustre and causes above error and behaviour.

Finally, I was also able to avoid the problem by writing a small VFS module
which hooks getxattr and listxattr and filters out the lustre.lov attribute so
it's never exposed to the client (attached).

Additional info: For debugging purposes I was able to replicate the exact same
behaviour with fuse and sshfs: sshfs does not normally implement extended
attributes. Adding getxattr and listxattr functions which synthesize a dummy
lustre.lov attribute one can observe exactly the same behaviour without
spinning up a whole Lustre. (attached as well)

Apart from this specific case the same problem is likely to occur with any
filesystem which returns an error when attempting to set a particular extended
attribute upon file creation.

Possible solutions suggested so far:

- VFS module filtering out the lustre.lov attribute so it is not seen by clients
- making user attribute mapping symmetric so all non-standard-namespace EAs
  like lustre.lov are left alone upon getting and setting (Daniel Kobras, [5])
- removing the created file upon which the set_ea operation failed in the error
  path before returning

[1] https://lists.samba.org/archive/samba/2022-September/241918.html
[2] https://lists.samba.org/archive/samba/2022-September/241932.html
[3] https://lists.samba.org/archive/samba/2022-April/240312.html
[4] http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-April/018030.html
[5] http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-September/018286.html
Comment 1 Michael Weiser 2022-09-26 07:49:44 UTC
Created attachment 17532 [details]
Minimalist program to interrogate EAs on Windows
Comment 2 Michael Weiser 2022-09-26 07:51:19 UTC
Created attachment 17533 [details]
Patch to fuse sshfs to simulate Lustre behaviour

Meant as a reproducer and development aid for environments where no Lustre filesystem is easily avaialable.
Comment 3 Samba QA Contact 2022-10-28 07:25:16 UTC
This bug was referenced in samba master:

34c6db64c2ff62673f8df218487cda4139c10843
Comment 4 Jeremy Allison 2022-10-28 19:36:36 UTC
Created attachment 17606 [details]
git-am fix for 4.17.next.

Cherry-pick from master (with added BUG: line in documentation patch, got missed in the master commit).
Comment 5 Jule Anger 2022-10-31 14:29:46 UTC
Pushed to autobuild-v4-17-test.
Comment 6 Samba QA Contact 2022-10-31 22:04:03 UTC
This bug was referenced in samba v4-17-test:

f4507b399cfd19ab37e6eada57ee15504ad9979a
5c32c822edd622d608b20a6c813a19c5d8bdced4
Comment 7 Jule Anger 2022-11-03 09:56:49 UTC
Closing out bug report.

Thanks!
Comment 8 Samba QA Contact 2022-12-15 16:34:24 UTC
This bug was referenced in samba v4-17-stable (Release samba-4.17.4):

f4507b399cfd19ab37e6eada57ee15504ad9979a
5c32c822edd622d608b20a6c813a19c5d8bdced4