Bug 15711 - Incoherent directory listings
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services
Version: 4.19.7
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Volker Lendecke
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-09-09 18:50 UTC by Andrea Venturoli
Modified: 2024-09-23 11:51 UTC

See Also:


Attachments
Directory listing test (1.50 KB, text/x-c++src)
2024-09-09 18:50 UTC, Andrea Venturoli
Sample smb.conf (508 bytes, text/plain)
2024-09-10 10:05 UTC, Andrea Venturoli
"smbd -i -d 3" output (14.59 KB, text/x-log)
2024-09-10 14:42 UTC, Andrea Venturoli

Description Andrea Venturoli 2024-09-09 18:50:35 UTC
Created attachment 18437
Directory listing test

Hello.

For a long time now (I chose 4.19.7 just because that's what I'm using now), I've been plagued by intermittent "file not found" errors: Samba suddenly claims that some random file or directory is not there, although it is. Subsequent accesses may work or fail intermittently.
As a test, I can "rsync" from a shared directory to a local one: although the original does not change, I often see additions and deletions (which obviously should not happen).
This tends to happen rarely if the fileserver is lightly loaded; however, if the machine where Samba runs is under load (not necessarily Samba load), it gets bad to the point of being unusable.



At first I thought this might be a problem of my specific setup, but then someone on the mailing list reported the same problem:
https://lists.samba.org/archive/samba/2024-June/249118.html.

The only thing our two cases seem to have in common is ZFS (on FreeBSD on my side, on Ubuntu for the other user), but that's just speculation.



I've written a program that can be used to show this behaviour:
_ it must be launched with a Samba shared directory as an argument;
_ it opens that directory and takes a list of the content;
_ then it repeatedly does the same listing and compares it with the original one;
_ sooner or later (after 100-1000 iterations) it fails in one of three ways:
  a) some files appear in the list which were not there in the beginning (so the first listing was wrong);
  b) some files disappear from the directory;
  c) the directory itself vanishes and the listing itself fails.
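
The steps above can be sketched roughly as follows. This is not the attached program, just a minimal C++17 illustration of the same idea; the names list_dir, diff_listings and find_first_incoherence are made up for this example.

```cpp
#include <algorithm>
#include <filesystem>
#include <iterator>
#include <set>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Difference between the first listing and a later one.
struct ListingDiff {
    std::vector<std::string> appeared;     // case a): not in the first listing
    std::vector<std::string> disappeared;  // case b): missing from a later listing
    bool empty() const { return appeared.empty() && disappeared.empty(); }
};

// Collects the entry names directly inside dir; ec is set if listing fails.
std::set<std::string> list_dir(const fs::path& dir, std::error_code& ec) {
    std::set<std::string> names;
    fs::directory_iterator it(dir, ec);
    const fs::directory_iterator end;
    while (!ec && it != end) {
        names.insert(it->path().filename().string());
        it.increment(ec);
    }
    return names;
}

// Compares a later listing against the baseline taken at startup.
ListingDiff diff_listings(const std::set<std::string>& baseline,
                          const std::set<std::string>& current) {
    ListingDiff d;
    std::set_difference(current.begin(), current.end(),
                        baseline.begin(), baseline.end(),
                        std::back_inserter(d.appeared));
    std::set_difference(baseline.begin(), baseline.end(),
                        current.begin(), current.end(),
                        std::back_inserter(d.disappeared));
    return d;
}

// Returns -1 if the first listing fails, 0 if all iterations match the
// baseline, otherwise the 1-based iteration of the first mismatch.
int find_first_incoherence(const fs::path& dir, int max_iterations) {
    std::error_code ec;
    const std::set<std::string> baseline = list_dir(dir, ec);
    if (ec) return -1;
    for (int i = 1; i <= max_iterations; ++i) {
        const std::set<std::string> current = list_dir(dir, ec);
        if (ec) return i;  // case c): the listing itself failed
        if (!diff_listings(baseline, current).empty()) return i;  // a) or b)
    }
    return 0;
}
```

Calling find_first_incoherence on the mounted share with a few thousand iterations, while nothing else touches the directory, should return 0 on a healthy server.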

Again, just to be 100% clear: nothing changed on that directory in the meantime; it's just Samba that doesn't get it right.



I'm still trying to get more info, but I thought it might be useful to open a bug report already.
I've got Samba compiled with debug info and I'm trying to get something useful out of it. Since I'm not familiar with the code, if anyone has ideas I could try in order to speed this up, I'd very much welcome suggestions.
Comment 1 Volker Lendecke 2024-09-10 05:35:45 UTC
What is the client system you're running your test on?
Comment 2 Andrea Venturoli 2024-09-10 06:00:28 UTC
(In reply to Volker Lendecke from comment #1)

Currently Windows 10 Professional (latest patch level).
I'll try and see if I can reproduce on different systems.

Please keep in mind that this cannot be reproduced systematically: one day it will plague my work continuously and the next everything will work fine all day long :(
Comment 3 Andrea Venturoli 2024-09-10 10:05:36 UTC
Created attachment 18439
Sample smb.conf
Comment 4 Andrea Venturoli 2024-09-10 10:09:02 UTC
(In reply to Andrea Venturoli from comment #2)

I was able to reproduce this with fusefs-smbnetfs too.

Also, I pinpointed this to the coexistence of the full_audit and shadow_copy2 VFS modules.
See attached sample smb.conf.
If I remove either of the VFS modules, the test program can run for thousands of iterations.
If both are active, however, it fails within a few seconds.
Comment 5 Volker Lendecke 2024-09-10 10:55:22 UTC
What happens if you reverse the order of the two modules?
Comment 6 Andrea Venturoli 2024-09-10 12:55:01 UTC
(In reply to Volker Lendecke from comment #5)

I'd say that solves it :-O
At least on the test fileserver I put up, this makes a huge difference; on the "real" fileserver it's a bit harder to trigger, but I think at least it improved.

Is this documented somewhere and I have overlooked it?
Or is it to be considered a bug?

Also, I wonder if this could have been the cause of other sporadic, but very severe, troubles I reported on the mailing list in the past; however only time can (possibly) tell.

Thanks a lot!
Comment 7 Andrea Venturoli 2024-09-10 13:26:04 UTC
(In reply to Andrea Venturoli from comment #6)

Of course, soon after doing some testing (all positive) and replying, my "real" fileserver started doing this again :(
So no: reversing the order of the two VFS modules seems to solve the problem in some cases, but not always.
Strange...
Comment 8 Volker Lendecke 2024-09-10 13:33:05 UTC
Do you have a pure test server that exposes this? If so, it would be great if you could run smbd under valgrind, which makes smbd an order of magnitude slower. Just to rule out pure memory corruption or something similar.
Comment 9 Andrea Venturoli 2024-09-10 14:42:06 UTC
(In reply to Volker Lendecke from comment #8)

Not sure what you mean by "pure test server".
I've got a dedicated FreeBSD jail, i.e. a Samba instance just for this test.



I tried "valgrind smbd -i", but it fails with
--13213:0: aspacem Valgrind: FATAL: VG_N_SEGMENTS is too low.
--13213:0: aspacem   Increase it and rebuild.  Exiting now.

I'll try to get around this.


Meanwhile I also tried "smbd -i -d 3", then I "ls"ed a directory; maybe the output is useful?
Notice the "smbd_dirptr_get_entry: Could not open ...: NT_STATUS_INSUFFICIENT_RESOURCES".
Maybe a hint on where to attach gdb?
Comment 10 Andrea Venturoli 2024-09-10 14:42:35 UTC
Created attachment 18440
"smbd -i -d 3" output
Comment 11 Andrea Venturoli 2024-09-11 12:35:33 UTC
I did some debugging with "vfs objects=full_audit shadow_copy2" (in this order).
The frequent failure is as follows:

_ openat_pathref_fsp_nosymlink takes a two step approach:
  first it calls SMB_VFS_OPENAT with O_PATH flags;
  this is supposed to fail, since it's not implemented in FreeBSD;
  so, if it gets fd==-1 and errno=ENOSYS, it tries again without O_PATH;

_ SMB_VFS_OPENAT calls smb_full_audit_openat, which in turn calls shadow_copy2_openat:
  with O_PATH, the latter simply fails with result=-1 and errno=ENOSYS without even trying;
  the former should log the operation and return what it got from the latter;

_ however smb_full_audit_openat, while logging, performs some other operation which changes errno to ENOBUFS;

_ so openat_pathref_fsp_nosymlink errs out without trying again without O_PATH.



Conclusion:
_ this just explains why it's so easy to reproduce the problem with shadow_copy2 after full_audit;
_ however it still doesn't explain why and where ENOBUFS (or possibly something different in other setups) comes from;
_ if the caller relies on errno, IMVVVVVHO vfs_full_audit should save it before logging and restore it before returning.
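
To illustrate the failure and the save/restore idea, here is a small self-contained sketch. It is NOT Samba code: lower_open, log_operation, audited_open_clobbering, audited_open_preserving and open_with_fallback are hypothetical stand-ins for shadow_copy2_openat, the logging step, smb_full_audit_openat and openat_pathref_fsp_nosymlink respectively, and the "logging" just sets errno to ENOBUFS to mimic what I observed.

```cpp
#include <cerrno>

// Hypothetical lower module: refuses the path-only attempt with ENOSYS,
// succeeds (returning a pretend fd) otherwise.
int lower_open(bool path_only) {
    if (path_only) {
        errno = ENOSYS;
        return -1;
    }
    return 3;
}

// Hypothetical logging step whose internal failure overwrites errno.
void log_operation() {
    errno = ENOBUFS;
}

// Wrapper that logs after the call and lets the logging clobber errno.
int audited_open_clobbering(bool path_only) {
    int fd = lower_open(path_only);
    log_operation();
    return fd;
}

// Wrapper that saves errno before logging and restores it afterwards.
int audited_open_preserving(bool path_only) {
    int fd = lower_open(path_only);
    int saved_errno = errno;
    log_operation();
    errno = saved_errno;
    return fd;
}

// Hypothetical caller: tries path-only first and retries without it
// only if the failure was ENOSYS.
int open_with_fallback(int (*open_fn)(bool)) {
    int fd = open_fn(true);
    if (fd == -1 && errno == ENOSYS) {
        fd = open_fn(false);
    }
    return fd;
}
```

With the clobbering wrapper the caller sees ENOBUFS instead of ENOSYS and never retries, so open_with_fallback returns -1; with the preserving wrapper the retry happens and the open succeeds.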

Still hunting for the root of the problem.
Comment 12 Andrea Venturoli 2024-09-16 15:05:25 UTC
The problem with full_audit seems to be the following:
_ it opens a socket to /var/run/log;
_ then it sends data to syslog via calls to sendto;
_ these calls can return ERR#55 'No buffer space available'.

So, for some reason vfs_full_audit overwhelms syslogd, and the resulting error is translated into a Samba error.
IMVHO this should be fixed.

A possible workaround is setting "full_audit:syslog=false", but, then again, this might not be as useful as the default.
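
For reference, the workaround would look something like this in smb.conf. The share name and path are placeholders, not taken from my attached sample; only the last line is the actual change.

```
# Hypothetical share definition; name and path are placeholders.
[test]
    path = /path/to/share
    vfs objects = full_audit shadow_copy2
    # Workaround: stop full_audit from sending its messages to syslog.
    full_audit:syslog = false
```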