The Samba-Bugzilla – Bug 2910
smbd 3.x crashes under high load with many MoveFile and directory listings
Last modified: 2005-08-24 10:17:51 UTC
Samba 3.0.14 server on a Fedora Core 3 Linux machine.
The Samba server provides one share with two directories, "In" and "Out".
Win2K and WinXP clients (~10) access the In directory quite heavily with
CreateFile calls and directory listings (~every 100ms per client).
The same Win clients pick a file randomly from the In directory and move it to
the Out directory ~every 500ms.
Each smbd starts at ~3% CPU load per client connection.
After some minutes, one of the smbd processes jumps to 99% CPU load, and all Win
clients are blocked, i.e. it seems all the Win clients are waiting for the same
file to be released (just a guess).
After some minutes (while everything is blocked and smbd sits at 99%), Samba
seems to recover and the Win clients continue working. However, the defective
smbd process stays at ~70% CPU load.
This all repeats every couple of minutes.
THIS DID NOT HAPPEN WITH Samba 2.2.8a.
Samba 2.2.8a is completely stable under exactly the same test.
FYI: using a Windows share as the server is also stable; only Samba 3.0.x has this problem.
Tested in the past:
Red Hat Linux 8.0 and 9.0, kernel 2.4, kernel 2.6, Fedora Core 2 and Core 3.
Samba 3.0.0 - 3.0.14 (more or less every production release).
Many different settings in the smb.conf file (e.g. oplocks on/off, different
socket options, ...).
Independent of the Linux system used.
Independent of the Windows clients (W2K, XP, SP1, SP2).
It really seems that Samba 3.0.x has a bug here.
I can provide my smb.conf file on request. Write to Martin.Toeltsch@symena.com
Created attachment 1333 [details]
This is the smb.conf file used for the test environment described in my problem report.
Could you do us a favor and retry this with the current released version,
3.0.14a and possibly also with latest SVN? In particular the open code has
undergone heavy changes between 3.0.14a and the current version.
What kind of test program do you use? I would like to possibly automate this
test so that if we have a bug in the current code we can be more certain that we
do not regress in the future. To write this I would either need your test
program (I assume it's a Windows program) or a sniff of the traffic. Could you
provide one of those?
Created attachment 1334 [details]
MSVC 6.0 Samba stress test project
I zipped the relevant files of the MSVC project of my stress test program. In
the Release directory you can find the compiled exe, as well as the required
stlport DLL. It should run as-is; if not, I hope you can compile it.
It's a Win32 console application with 2 threads: "optimizer" and "client".
You need 3 directories on a share (e.g. "s:"), e.g. "In", "Spool", "Out".
Start the program with "sambatest s:\in s:\spool s:\out"
When asked for the number of files, choose e.g. 10.
Start the optimizer by pressing "o" (this thread writes files into the In dir).
Start the client by pressing "c" (this thread lists the In dir and moves files
to the Spool dir and the Out dir).
On other Win machines, start sambatest again with the same directories (In,
Spool, Out).
Use 1 as the number of files (it just needs to be >0).
Only start the client by pressing "c".
Start other clients on other machines.
Let the test run and observe. On our setup, after a couple of minutes, all the
test programs simply stop producing output on the terminal and everything blocks.
Stop the programs by pressing "q", that's it.
This is a little hard to reproduce if we need 10 simultaneous Windows
clients. VMware sessions only go so far and I don't have a lab at home :-). What
is the lowest number of Windows clients you've been able to reproduce this with?
Have you been able to try the 3.0.20 code ? There was a bug in 3.0.14a with
deferred opens (which are on by default) which may have caused this problem. If
you still have your test environment set up, please try setting:
defer sharing violations = no
in the [global] section of your smb.conf and see if this fixes it.
Ok, I'm running your stress test code with 3 vmware clients (2 running the
client and 1 running client and optimizer) against a 3.0.20 pre-release smbd
(current SAMBA_3_0 svn tree) and it seems to be running ok. I'm betting you were
running into the deferred open bug. Can you confirm that I'm running enough
clients to test this properly, please?
Jeremy, I'm in the process of writing a Samba4 test that has exactly the same
behaviour, just to test my oplock code. It may take a little while, but this
indeed looks like a nice test.
BTW, I've run 3 simultaneous tests on a single workstation using the virtual IP
address trick. Windows will open several connections.
Martin, please don't reply directly to the email@example.com
address. It is mostly a placeholder these days. It's better to
keep all correspondence in the bug report itself.
------- Mails from Martin------------------
Thanks for the quick reply (I'm impressed).
I uploaded the MSVC 6.0 project. Perhaps it helps you.
In the Release directory you will find the application and
one required DLL.
I also included a quick description of the usage.
Please note, it was never intended to be used outside,
so please do not expect comfort ;-)
I'm going to install Samba 3.0.20 on the Linux server.
If I can do anything else, let me know.
(1) Thanks for correcting the version to 3.0.14a; the Bugzilla page did not
offer 3.0.14a in its version list box.
(2) The latest version I tried was indeed 3.0.14a.
(3) I can test every version today that comes as an RPM archive. Sorry, I don't
have time to compile a source code snapshot or the like.
What is SVN? Where can I get it?
(4) I'm going to test the deferred open bug right now with 3.0.14a.
Keep you posted ...
3 clients and 1 optimizer should usually be enough.
The more clients are running, the higher the chance
of observing the bug.
However, I applied defer sharing violations = no to
smb.conf and did not see any problems (using Samba 3.0.14a)
for the last hour.
This is a good sign. It really seems that it's the deferred open bug.
I'll keep the stress test running for another couple of hours ...
Closing this out - pretty sure this was the known bug which is now fixed.
Sorry for the spam; cleaning up the database to prevent unnecessary reopens of bugs.