Created attachment 17162
tar of relevant config and log files
This is an upstream report of https://bugs.debian.org/cgi-bin/bugreport.cgi?archive=no&bug=1005721, and seems to also be https://bugs.launchpad.net/ubuntu/+source/samba/+bug/1914420 .
> What led up to the situation?
Copying around 50GB, consisting of 3-4MB JPG files and 400MB to 4.9GB MP4 files, from an SD card mounted by a Windows 10 (Version 10.0.19044.1526) client, over home mesh WiFi, to a usershare on a ZFS filesystem on an AMD64 NAS running Debian 11 and smbd Version 4.13.13-Debian.
> What exactly did you do (or not do) that was effective (or ineffective)?
I first tried copying a large directory from the SD card (D:\DCIM\). The Windows GUI client reported an error and asked if I wanted to try again. Trying again fixed the current file, but usually another file failed later and had to be retried.
I then used Beyond Compare 4's directory comparison function and selected the failed files to be re-copied. More of them copied, and this time I saved the error text: "An unexpected network error occurred"
> What was the outcome of this action?
When the Windows GUI client failed to copy a file, it left a zero-byte file on the NAS. When I repeated the copy with Beyond Compare, most of the failed files were copied correctly and only one remained.
It seems that each time the "network error occurred", smbd actually crashed and left a coredump. I tried installing libc6-dbg, but after reproducing the crash I was unable to get the backtrace to resolve calls inside libc.so.6. I reviewed the source code of volume_label() in loadparm.c - it seems the only libc function call there is a call to strcpy(), so perhaps the label pointer was NULL or otherwise invalid?
I copied a total of 47.2GB split over 291 files from this SD card. I would guess I had to repeat a file at most 10 times, so perhaps a failure rate of around 3%.
> Draw us a picture of your environment
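┌──────┐    ┌───────────────────────────────┐
│LAPTOP├────┤WiFi Mesh Point and DHCP Server│
└──────┘wifi└──────────────────────┬────────┘
                                   │5ghz wifi
┌──────────┐               ┌───────┴───────┐
│Debian 11 ├───────────────┤WiFi Mesh Point│
│AMD64 NAS │ 1gig eth      └───────────────┘
│Samba 4.13│
└──────────┘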
I've saved a copy of the coredump. I would prefer not to post it publicly, but will share it with anyone from samba.org if you like. I've hopefully attached all other relevant config and log files. Recent crashes are in var/log/samba/log.rsaxvc-laptop
I believe this has been happening occasionally since I set up Debian 11 on the NAS in late 2021, but only yesterday did it happen often enough in a day to make me wonder why.
I understand 4.13 is out of maintenance with upstream - when I have time I'll try to reproduce against samba.org master or otherwise report back. I'll try to catch it in GDB and increase the log level to 10 when I do so.
I've tried to reproduce this from a few other machines, and it seems that only Windows clients (I tried only Windows 10) trigger it.
I'll also note I have two malformed usershare configs.
It took a few minutes of running windirstat while copying and deleting files from a ramdisk mount, but I was able to catch this with tcpdump running on the NAS.
I'm not familiar with SMB2, but the traffic appears to consist of requests and responses. Most of the time the client sends a request and the NAS responds. Right before the connection dies with a TCP reset, however, the client sends "GetInfo Request FILE_INFO/SMB2_FILE_STANDARD_INFO File: srvsvc" and immediately afterwards "GetInfo Request FS_INFO/FileFsFullSizeInformation File:". The NAS responds with a "GetInfo Response" to the first request, the client sends two more packets (a Bind and a CreateRequestFile), and the server responds only with a TCP reset.
I can share the pcap privately if needed.
So, it seems to be related to the "GetInfo Request FS_INFO/FileFsFullSizeInformation"
I've been seeing what appears to be the same issue for probably a year on Ubuntu Server 22.04. I recently updated to 22.10 and it's still there in Samba 4.16.4, so I wanted to see if there is anything I can provide to help.
# smbd -V
Server runs an Ivy Bridge-EP Xeon, but I was seeing the same error previously on a desktop Ivy Bridge CPU.
Am running ZFS (2.1.5-1ubuntu6) and smbd (4.16.4-Ubuntu) as distributed with Ubuntu Server 22.10. Network is 10GbE, and client is running Ubuntu 22.10 desktop. Am fairly certain I have seen the same issue on the M1 Mac Mini as well but cannot swear to it.
This tends to happen during heavy usage, and even then it's fairly unusual, so I'm guessing it can happen at any time but the odds are very much against it during light use.
I've uploaded a pair of backtraces and my smb.conf. Happy to provide additional detail or the core files themselves if it helps.
Created attachment 17608
Two backtraces plus smb.conf
smb.conf contains some unused bits I was playing with long ago. Rather than take them out, I'm leaving them in just in case they have some unexpected (by me) relevance.
This is very similar to https://bugs.launchpad.net/ubuntu/+source/samba/+bug/1817027 where the same thing happens, but with push_ucs2_talloc() rather than strlen.
Thanks for all the hard work with network diagrams, but the problem is quite simple and can be deduced entirely from the code.
At one time in the past lp_volume() could never return NULL, as it returned a static 'fstring'. This was changed to return allocated memory, first on talloc_tos() implicitly, and then on talloc_tos() explicitly.
Sadly the value is not constant because it can contain macros; the simple fix would be to make the volume parameter unable to use smb.conf macros.
However, given it uses macros, it must of course allocate memory, and any talloc() can fail to allocate at any given time.
This impacts both getting the volume name and getting the volume ID, which is just the MD4 of the unicode version of the volume name.
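To illustrate the failure mode described above, here is a minimal sketch, not the Samba code. It assumes a hypothetical getter, expand_volume_label(), that stands in for a macro-expanding function such as lp_volume() and can return NULL when allocation fails. Plain malloc() replaces talloc so the example stays self-contained, and the simulate_oom flag exists only to make the failure reproducible.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a macro-expanding getter such as lp_volume():
 * returns a freshly allocated copy of the label, or NULL if the
 * allocation fails. malloc() is used here in place of talloc. */
char *expand_volume_label(const char *configured, int simulate_oom)
{
    size_t len;
    char *out;

    if (simulate_oom || configured == NULL) {
        return NULL;
    }
    len = strlen(configured);
    out = malloc(len + 1);
    if (out == NULL) {
        return NULL;
    }
    memcpy(out, configured, len + 1);
    return out;
}

/* Copies the label into dst (of size dst_len). Returns 0 on success and
 * -1 if the label could not be obtained, so the caller can report an
 * error instead of handing NULL to strlen() or strcpy(). */
int copy_volume_label(char *dst, size_t dst_len,
                      const char *configured, int simulate_oom)
{
    char *label;

    if (dst == NULL || dst_len == 0) {
        return -1;
    }
    label = expand_volume_label(configured, simulate_oom);
    if (label == NULL) {
        return -1;
    }
    strncpy(dst, label, dst_len - 1);
    dst[dst_len - 1] = '\0';
    free(label);
    return 0;
}
```

Without the NULL check in copy_volume_label(), a failed allocation would reach the string functions directly, which matches the kind of crash seen in the backtraces.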
(In reply to Allen Belletti from comment #3)
Thanks so much for the backtraces, these make the problem very clear.
I forgot to mention I think the fix is to compute and store the volume name and ID at connection time, giving an error then if that fails (unlikely, as not under memory pressure then).
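A rough sketch of that idea, under stated assumptions: struct conn_sketch and its helper functions are hypothetical names, not Samba's connection structures, and a 32-bit FNV-1a hash stands in for the MD4 of the unicode volume name purely to keep the example self-contained.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-connection cache; not Samba's connection struct. */
struct conn_sketch {
    char *volume_label;   /* expanded once, at connection time */
    uint32_t volume_id;   /* derived once from the label */
};

/* FNV-1a, used here only as a stand-in for the real MD4-based ID. */
uint32_t fnv1a32(const char *s)
{
    uint32_t h = 2166136261u;

    while (*s != '\0') {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Called once when the connection is set up. Returns -1 if the label
 * could not be obtained or copied, so the connection attempt can fail
 * cleanly instead of a later request crashing. */
int conn_sketch_init(struct conn_sketch *c, const char *expanded_label)
{
    size_t len;

    c->volume_label = NULL;
    c->volume_id = 0;
    if (expanded_label == NULL) {
        return -1;
    }
    len = strlen(expanded_label);
    c->volume_label = malloc(len + 1);
    if (c->volume_label == NULL) {
        return -1;
    }
    memcpy(c->volume_label, expanded_label, len + 1);
    c->volume_id = fnv1a32(c->volume_label);
    return 0;
}

/* Later requests read the cached value; no allocation can fail here. */
const char *conn_sketch_label(const struct conn_sketch *c)
{
    return c->volume_label;
}

void conn_sketch_free(struct conn_sketch *c)
{
    free(c->volume_label);
    c->volume_label = NULL;
}
```

The point of the design is that the only allocation happens in conn_sketch_init(), where a failure can be turned into a connection error; the per-request path only reads stored values.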
Created attachment 17881
/proc/meminfo a few minutes after smbd was crashing
> However, given it uses macros, it must of course allocate memory, and any talloc() can fail to allocate at any given time.
I'm not familiar with talloc(), but if it helps I've attached /proc/meminfo from after smbd was crashing. I have 3.8GB of memory available to the OS, with less than 10% reserved by kernel and processes. Is it possible to tell if this is caused by a system out of memory condition, or a talloc pool hitting its limit? ZFS can use a bit of memory, but I think I have all the memory intensive features disabled.