Bug 14420 - zero_file_id = no causes read/write failures for Mac clients
Summary: zero_file_id = no causes read/write failures for Mac clients
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services
Version: 4.12.3
Hardware: x64 Mac OS X
Importance: P5 major
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-28 19:40 UTC by Gordon
Modified: 2020-08-26 14:28 UTC

See Also:


Attachments
Dataset that reproduces issue, plus smb.conf used and logs (51.10 MB, application/x-7z-compressed)
2020-06-28 19:40 UTC, Gordon
Samba log, rsync log, network trace, smaller dataset (13.94 MB, application/x-7z-compressed)
2020-06-30 21:40 UTC, Gordon

Description Gordon 2020-06-28 19:40:43 UTC
Created attachment 16094
Dataset that reproduces issue, plus smb.conf used and logs

The default value for vfs_fruit's `zero_file_id` was recently changed to `no`, although the latest version of the Samba [documentation](https://www.samba.org/samba/docs/current/man-html/vfs_fruit.8.html) still mentions the old default.

I'm not sure why this change was made, as it seems like all the problems mentioned in the documentation ("usability issues or even data loss") still exist.

Specifically:
- Time Machine backups will consistently fail with this option set to `no`.
- Basic file I/O operations, especially those involving lots of small files, will consistently fail if this is set to `no`.

In this example I have a dataset consisting of 16,000 directories, each containing two empty directories and a single text file (it's a slightly sanitized version of a file-based database I use). Using rsync, I copy this dataset from my local disk to an empty Samba share (in this example, "Container" is the parent directory of the directory "the_records" that is to be copied):

```
rsync -a --progress ~/Container/ "/Volumes/Ubuntu Ryzen/"
```

With `zero_file_id = no`, the operation always fails (although not always on the same file) with the following error:

```
rsync: [receiver] write failed on "/Volumes/Ubuntu Ryzen/the_records/001f721f-ed2b-4e09-9245-05d5fd275f4a.event/data.txt": Is a directory (21)
```

If I set `zero_file_id = yes`, make no other changes and repeat the copy, it finishes without error every time.

Configuration:
Client: macOS Catalina 10.15.5 (19F101)
Server: Ubuntu Server 20.04 LTS x64
Server Filesystem: Verified issue with ext4 and ZFS.
Rsync: 3.2.0 (installed via Homebrew on macOS)

Verified issue with the following Samba versions:
- 4.11.6-ubuntu (default version for Ubuntu 20.04)
- 4.12.3, self-compiled.
- Same smb.conf used in both cases.

Attachment contains the following:
- Rsync log for failed copy.
- Samba log for same copy operation.
- smb.conf file used.
- The dataset that consistently reproduces the issue.

Quick repro:
1. Use a macOS client (only tested with Catalina).
2. Install rsync on the client. The rsync built into macOS is not recommended, as it's woefully outdated; the easiest way to get the latest version is via Homebrew: `brew install rsync`.
3. Create a Samba share using the attached smb.conf (make sure you change the share path to something that exists).
4. Extract test dataset "dataset" from attached archive file.
5. Mount samba share on client.
6. Copy dataset to share from macOS client: `rsync -a --progress ~/dataset /Volumes/test_share/`
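
If the attached archive is unavailable, a dataset with the same shape as the one described above (record directories that each hold two empty directories and one text file) can be approximated with a short shell sketch. The function name `make_dataset`, the record naming scheme and the subdirectory names `empty_a` and `empty_b` are hypothetical. It is also an assumption that synthetic data triggers the failure; the attached dataset is the only confirmed reproducer.

```shell
#!/bin/sh
# Sketch: build a synthetic dataset shaped like the one in this report.
# Usage: make_dataset ROOT COUNT
# Creates COUNT record directories under ROOT/the_records, each holding
# two empty subdirectories and a single small text file.
make_dataset() {
    root=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        # Record names are illustrative; the real dataset uses UUID-style names.
        rec="$root/the_records/record-$i.event"
        mkdir -p "$rec/empty_a" "$rec/empty_b"
        printf 'record %s\n' "$i" > "$rec/data.txt"
        i=$((i + 1))
    done
}
```

For example, `make_dataset ~/dataset 16000` would produce a tree comparable in size to the original one, which can then be copied with the rsync command from step 6.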
Comment 1 Ralph Böhme 2020-06-29 05:23:02 UTC
Can you please upload a parallel network trace plus Samba log level 10?
Comment 2 Gordon 2020-06-30 00:08:44 UTC
(In reply to Ralph Böhme from comment #1)

I originally attempted to use log level 10 (the uploaded log is level 5) but it filled up my entire hard drive with a log file over 200 GB in size before the copy even finished - in fact it was pegging my drive I/O just writing the log file, so the copy was moving at an incredibly glacial pace. Is a log file this large actually useful? If it's absolutely necessary I can try logging to a larger external drive…
Comment 3 Jeremy Allison 2020-06-30 01:24:25 UTC
This message:

rsync: [receiver] write failed on "/Volumes/Ubuntu Ryzen/the_records/001f721f-ed2b-4e09-9245-05d5fd275f4a.event/data.txt": Is a directory (21)

says that the server is somehow getting confused and thinks the client is issuing an SMB2_WRITE on a directory handle (which is disallowed).

A wireshark trace containing the NT_STATUS_ error returned that matches this error on the client would help greatly.
Comment 4 Ralph Böhme 2020-06-30 04:41:03 UTC
(In reply to Jeremy Allison from comment #3)
This is the symptom, not the cause.

If the amount of trace data gets that big, first try to minimize the test dataset.

In any case, we need log level 10 debug logs and a parallel network trace.
Comment 5 Jeremy Allison 2020-06-30 17:23:55 UTC
(In reply to Ralph Böhme from comment #4)

Ah, sorry. Was just trying to get some signal from the noise :-). We certainly can't do anything without more data.
Comment 6 Gordon 2020-06-30 21:40:09 UTC
Created attachment 16102
Samba log, rsync log, network trace, smaller dataset
Comment 7 Gordon 2020-06-30 21:44:05 UTC
I was able to get 100% reproducibility with a much smaller dataset, which shows the issue after only a few seconds. I used level 10 logging and also got a network trace (following the instructions at https://wiki.samba.org/index.php/Bug_Reporting).

See attachment "Samba log, rsync log, network trace, smaller dataset".

Note that since I wasn't comfortable uploading network traffic for my entire home network, I used two systems connected directly with link-local addresses for this second test.
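
For anyone repeating this, the logging and capture setup can be sketched roughly as below. This is not the exact procedure used for the attachment. The interface name and output path are placeholders, and the helper names `debug_cmd` and `capture_cmd` are hypothetical. The helpers only print the commands so they can be reviewed before being run as root on the server.

```shell
#!/bin/sh
# Sketch: print the commands for raising the smbd log level and capturing
# SMB traffic in parallel. Nothing is executed here; run the printed
# commands manually (as root) once they look right for your system.

# debug_cmd LEVEL: smbcontrol invocation that sets the smbd debug level.
debug_cmd() {
    printf 'smbcontrol smbd debug %s\n' "$1"
}

# capture_cmd IFACE OUTFILE: tcpdump invocation that writes full packets
# for SMB over TCP (port 445) on the given interface to a pcap file.
capture_cmd() {
    printf 'tcpdump -p -i %s -s 0 -w %s port 445\n' "$1" "$2"
}

# Placeholder values (assumptions): adjust the interface and path.
debug_cmd 10
capture_cmd eth0 /tmp/samba-trace.pcap
```

Restricting the capture to the directly connected interface and to port 445 keeps unrelated traffic out of the trace.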