Created attachment 17531
EA-filtering VFS module PoC

As discussed on the samba list[1,2], we're seeing a "File already exists" error message in Windows Explorer when copying files on a Lustre filesystem exported via samba. It was first described by Christian Kuntz in reports to the samba and Lustre lists[3,4].

We were able to gather some additional details:

- copying files from a local NTFS filesystem on Windows 10 to a Lustre share works fine
- duplicating files *on* the Lustre share shows the problematic behaviour: for each file Windows complains that the file already exists and indeed, a zero-sized file of that name has been created where before there was none
- copying a file off of the Lustre share onto the local NTFS works fine, but copying it back onto the Lustre share again shows the problem - so it is travelling with the files, which suggests that it's not related to server-side copying
- switching the "ea support" option to "no" works around the problem - so it seems to be related to extended attributes

The relevant part of a level 10 debug log reads:

[2022/09/21 15:50:57.127478, 10, pid=1967443, effective(200737, 130), real(200737, 0), class=locking] ../../source3/locking/share_mode_lock.c:171(share_mode_memcache_store)
  share_mode_memcache_store: stored entry for file blorf/blorf/blorf/blim.txt epoch d7ce6bcd04632a key 3188158982:144115340681419950:0
[2022/09/21 15:50:57.127601, 10, pid=1967443, effective(200737, 130), real(200737, 0)] ../../source3/smbd/trans2.c:335(get_ea_names_from_fsp)
  get_ea_names_from_fsp: ea_namelist size = 11
[2022/09/21 15:50:57.127612, 10, pid=1967443, effective(200737, 130), real(200737, 0)] ../../source3/smbd/trans2.c:262(get_ea_value_fsp)
  get_ea_value: EA lustre.lov is of length 56
[2022/09/21 15:50:57.127616, 10, pid=1967443, effective(200737, 130), real(200737, 0)] ../../lib/util/util.c:570(dump_data)
  [0000] D0 0B D1 0B 01 00 00 00   AE 1C 00 00 00 00 00 00   ........ ........
  [0010] 88 23 00 00 02 00 00 00   00 00 10 00 01 00 00 00   .#...... ........
  [0020] 26 CB 2C 00 00 00 00 00   00 00 00 00 00 00 00 00   &.,..... ........
  [0030] 00 00 00 00 08 00 00 00                             ........
[2022/09/21 15:50:57.127641, 10, pid=1967443, effective(200737, 130), real(200737, 0)] ../../source3/smbd/trans2.c:508(get_ea_list_from_fsp)
  get_ea_list_from_file: total_len = 71, lustre.lov, val len = 56
[2022/09/21 15:50:57.127646, 10, pid=1967443, effective(200737, 130), real(200737, 0)] ../../source3/smbd/trans2.c:520(get_ea_list_from_fsp)
  get_ea_list_from_file: total_len = 75
[2022/09/21 15:50:57.127651, 10, pid=1967443, effective(200737, 130), real(200737, 0)] ../../source3/smbd/trans2.c:726(canonicalize_ea_name)
  canonicalize_ea_name: LUSTRE.LOV -> lustre.lov
[2022/09/21 15:50:57.127655, 10, pid=1967443, effective(200737, 130), real(200737, 0)] ../../source3/smbd/trans2.c:786(set_ea)
  set_ea: ea_name user.lustre.lov ealen = 56
[2022/09/21 15:50:57.127660, 10, pid=1967443, effective(200737, 130), real(200737, 0)] ../../source3/smbd/trans2.c:810(set_ea)
  set_ea: setting ea name user.lustre.lov on file blorf/blorf/blorf/blim.txt by file descriptor.
[2022/09/21 15:50:57.127668, 10, pid=1967443, effective(200737, 130), real(200737, 0)] ../../source3/smbd/open.c:6066(create_file_unixpath)
  create_file_unixpath: NT_STATUS_EAS_NOT_SUPPORTED

So create_file_unixpath errors out with NT_STATUS_EAS_NOT_SUPPORTED because it cannot set the extended attribute user.lustre.lov in Lustre. It's this codepath:

https://github.com/samba-team/samba/blob/d9dda4b7af284ecbee4d04a89bd16fc0098e2931/source3/smbd/open.c#L6395

Our assumption is that Windows Explorer expects a CreateFile call with extended attributes to fail atomically, and gets confused by the fact that samba leaves the file it just created (but couldn't set the attributes on) lying around after erroring out of the open.
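The assumed expectation (a create that carries EAs either fully succeeds or leaves nothing behind) can be illustrated with a small stand-alone model. This is only a sketch and not samba code: `create_with_eas`, `failing_set_ea`, `EaNotSupported` and `demo` are hypothetical names, and the EA backend is a stub that always fails, standing in for a filesystem that rejects the attribute.

```python
import os
import tempfile


class EaNotSupported(Exception):
    """Stand-in for NT_STATUS_EAS_NOT_SUPPORTED in this model."""


def failing_set_ea(path, ea_list):
    # Hypothetical backend that rejects every EA, like the case described above.
    raise EaNotSupported(f"cannot set {sorted(ea_list)} on {path}")


def create_with_eas(path, ea_list, set_ea, cleanup_on_failure):
    """Create `path`, then apply `ea_list`; optionally undo the create on error."""
    try:
        # O_EXCL tells us reliably whether this call created the file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        created = True
    except FileExistsError:
        fd = os.open(path, os.O_WRONLY)
        created = False
    os.close(fd)
    try:
        set_ea(path, ea_list)
    except EaNotSupported:
        # Only remove files this call created, never pre-existing ones.
        if cleanup_on_failure and created:
            os.unlink(path)
        raise


def demo():
    """Return (leftover_without_cleanup, leftover_with_cleanup)."""
    eas = {"user.lustre.lov": bytes(56)}
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        for cleanup in (False, True):
            target = os.path.join(tmp, f"blim-{cleanup}.txt")
            try:
                create_with_eas(target, eas, failing_set_ea, cleanup)
            except EaNotSupported:
                pass
            results.append(os.path.exists(target))
    return tuple(results)


if __name__ == "__main__":
    print(demo())
```

Without cleanup the zero-sized file survives the failed create, which matches the leftover file seen on the share; with cleanup the directory is left as it was before the call.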
And indeed, patching the code path like so avoids the problem:

--- samba-4.17.0.orig/source3/smbd/open.c       2022-09-06 16:22:03.302113000 +0200
+++ samba-4.17.0/source3/smbd/open.c    2022-09-21 12:38:22.724000000 +0200
@@ -6098,6 +6098,13 @@
            ((info == FILE_WAS_CREATED) || (info == FILE_WAS_OVERWRITTEN))) {
                status = set_ea(conn, fsp, ea_list);
                if (!NT_STATUS_IS_OK(status)) {
+                       DEBUG(10, ("set_ea failed: %s\n", nt_errstr(status)));
+                       if (info == FILE_WAS_CREATED) {
+                               DEBUG(10, ("file was created - unlinking\n"));
+                               SMB_VFS_UNLINKAT(conn, dirfsp,
+                                       smb_fname_atname, 0);
+                       }
+
                        goto fail;
                }
        }

The extended attribute lustre.lov is part of a set of attributes Lustre exposes for every file, reflecting internal housekeeping data:

$ getfattr -m - blarf
# file: blarf
lustre.lov
trusted.link
trusted.lma
trusted.lov

$ getfattr -n lustre.lov blarf
# file: blarf
lustre.lov=0s0AvRCwEAAABgJAEAAAAAAIQjAAACAAAAAAAQAAEAAAAAqAAAAAAAAAAAAAAAAAAAAAAAAAYAAAA=

Unprivileged users only get to see the lustre.lov attribute:

> getfattr -m - blarf
# file: blarf
lustre.lov

> getfattr -n lustre.lov blarf
# file: blarf
lustre.lov=0s0AvRCwEAAABgJAEAAAAAAIQjAAACAAAAAAAQAAEAAAAAqAAAAAAAAAAAAAAAAAAAAAAAAAYAAAA=

That attribute is exposed to Windows by samba and even travels with the file to a local NTFS file system, which explains the third observation above. This was verified with a small program leveraging BackupRead (attached):

C:\Users\user>get_ea "blorf\blarf"
LUSTRE.LOV=<binary garbage because I couldn't be bothered to base64 encode it>

Lustre exposes the lustre.lov attribute to unprivileged users and allows setting it within certain limits. Lustre has user xattrs disabled by default and returns ENOTSUP when trying to set them. samba, on the other hand, maps all EAs with non-standard namespaces into the user namespace, turning lustre.lov into user.lustre.lov upon file creation, which is rejected by Lustre and causes the above error and behaviour.
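The name mapping described above can be modelled in a few lines. This is a simplified sketch based only on the behaviour reported here, not on samba's actual implementation: all function names are hypothetical, EA names are assumed to be already canonicalized to lower case, and the `passthrough` parameter of the symmetric variant is an invented way of telling foreign namespaces apart.

```python
USER_PREFIX = "user."


def name_for_client(xattr_name):
    """Server-side xattr name -> EA name shown to the SMB client (model)."""
    if xattr_name.startswith(USER_PREFIX):
        return xattr_name[len(USER_PREFIX):]
    # Anything outside user.* is passed through unchanged, e.g. lustre.lov.
    return xattr_name


def name_for_server_asymmetric(ea_name):
    """Client EA name -> xattr name, always forced into the user namespace."""
    return USER_PREFIX + ea_name


def name_for_server_symmetric(ea_name, passthrough=("lustre.",)):
    """Client EA name -> xattr name, leaving listed foreign namespaces alone.

    `passthrough` is a hypothetical knob for this sketch only.
    """
    if ea_name.startswith(tuple(passthrough)):
        return ea_name
    return USER_PREFIX + ea_name


def round_trip(xattr_name, to_server):
    """Name the server would try to set after the client copies the EA back."""
    return to_server(name_for_client(xattr_name))


if __name__ == "__main__":
    for mapper in (name_for_server_asymmetric, name_for_server_symmetric):
        print(mapper.__name__, round_trip("lustre.lov", mapper))
```

In this model an ordinary user attribute survives the round trip either way, while lustre.lov comes back as user.lustre.lov under the asymmetric mapping, which is the name seen in the set_ea log lines.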
Finally, I was also able to avoid the problem by writing a small VFS module which hooks getxattr and listxattr and filters out the lustre.lov attribute so it's never exposed to the client (attached).

Additional info: For debugging purposes I was able to replicate the exact same behaviour with fuse and sshfs. sshfs does not normally implement extended attributes. After adding getxattr and listxattr functions which synthesize a dummy lustre.lov attribute, one can observe exactly the same behaviour without spinning up a whole Lustre (attached as well).

Apart from this specific case, the same problem is likely to occur with any filesystem which returns an error when attempting to set a particular extended attribute upon file creation.

Possible solutions suggested so far:

- a VFS module filtering out the lustre.lov attribute so it is not seen by clients
- making user attribute mapping symmetric so all non-standard-namespace EAs like lustre.lov are left alone upon getting and setting (Daniel Kobras, [5])
- removing the created file upon which the set_ea operation failed in the error path before returning

[1] https://lists.samba.org/archive/samba/2022-September/241918.html
[2] https://lists.samba.org/archive/samba/2022-September/241932.html
[3] https://lists.samba.org/archive/samba/2022-April/240312.html
[4] http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-April/018030.html
[5] http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-September/018286.html
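For readers who do not want to open the attachment, the filtering idea behind the first workaround can be sketched as below. This is not the attached VFS module (which is C against samba's VFS interface); it is a minimal model assuming a listxattr(2)-style buffer of NUL-terminated names. `HIDDEN`, `filter_xattr_list` and `filtered_getxattr` are hypothetical names.

```python
import errno
import os

# Attribute names that should never be shown to the client (assumption: one name).
HIDDEN = frozenset({b"lustre.lov"})


def filter_xattr_list(buf, hidden=HIDDEN):
    """Drop hidden names from a buffer of NUL-terminated xattr names."""
    names = [name for name in buf.split(b"\0") if name]
    kept = [name for name in names if name not in hidden]
    return b"".join(name + b"\0" for name in kept)


def filtered_getxattr(name, lookup, hidden=HIDDEN):
    """Call `lookup(name)` unless the name is hidden; hidden names look absent."""
    if name in hidden:
        # ENODATA is what getxattr(2) reports on Linux for a missing attribute.
        raise OSError(errno.ENODATA, os.strerror(errno.ENODATA))
    return lookup(name)


if __name__ == "__main__":
    raw = b"lustre.lov\0user.comment\0"
    print(filter_xattr_list(raw))
```

Both hooks need to agree: if the name were hidden from the list but still readable by name, a client that already knows the name could still fetch and re-apply it.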
Created attachment 17532
Minimalist program to interrogate EAs on Windows
Created attachment 17533
Patch to fuse sshfs to simulate Lustre behaviour

Meant as a reproducer and development aid for environments where no Lustre filesystem is easily available.
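The behaviour the reproducer simulates can also be modelled without any filesystem at all. The sketch below is not the sshfs patch; it is an in-memory stand-in under the assumption that the mount lists and returns a dummy lustre.lov but refuses every attempt to store an xattr. `FakeLustreFile`, `DUMMY_LOV` and `copy_eas_like_client` are hypothetical names.

```python
import errno
import os

DUMMY_LOV = bytes(56)  # placeholder payload; the real contents do not matter here


class FakeLustreFile:
    """In-memory stand-in for one file on the simulated mount (hypothetical)."""

    def listxattr(self):
        return ["lustre.lov"]

    def getxattr(self, name):
        if name == "lustre.lov":
            return DUMMY_LOV
        raise OSError(errno.ENODATA, os.strerror(errno.ENODATA))

    def setxattr(self, name, value):
        # Nothing can be stored, mirroring a mount without user xattr support.
        raise OSError(errno.ENOTSUP, os.strerror(errno.ENOTSUP))


def copy_eas_like_client(src, dst):
    """Read every EA from src and write it to dst under user.*.

    Returns 0 on success or the errno of the first failing setxattr.
    """
    for name in src.listxattr():
        try:
            dst.setxattr("user." + name, src.getxattr(name))
        except OSError as exc:
            return exc.errno
    return 0


if __name__ == "__main__":
    result = copy_eas_like_client(FakeLustreFile(), FakeLustreFile())
    print(errno.errorcode[result])
```

The point of the model is that the source side happily advertises an attribute that the destination side, on the very same filesystem, cannot accept under the name it is given.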
This bug was referenced in samba master: 34c6db64c2ff62673f8df218487cda4139c10843
Created attachment 17606
git-am fix for 4.17.next.

Cherry-pick from master (with an added BUG: line in the documentation patch, which got missed in the master commit).
Pushed to autobuild-v4-17-test.
This bug was referenced in samba v4-17-test:

f4507b399cfd19ab37e6eada57ee15504ad9979a
5c32c822edd622d608b20a6c813a19c5d8bdced4
Closing out bug report. Thanks!
This bug was referenced in samba v4-17-stable (Release samba-4.17.4):

f4507b399cfd19ab37e6eada57ee15504ad9979a
5c32c822edd622d608b20a6c813a19c5d8bdced4