My environment: Samba 3.0.14 server on a Fedora Core 3 Linux server. The Samba server provides one share with 2 directories, "In" and "Out". Win2K and WinXP clients (~10) access the In directory quite heavily with FileCreate calls and directory listings (~every 100ms per client). The same Win clients pick a file randomly from the In directory and move it to the Out directory ~every 500ms.

Problem description: smbd starts with ~3% system load per client connection. After some minutes, one of the smbd processes uses 99% system load, and all Win clients are blocked. I.e., it seems the Win clients are all waiting for the same file to be released (just a guess). After some more minutes (while everything is blocked and smbd is at 99%), Samba seems to recover and the Win clients continue working. However, the affected smbd process stays at ~70% CPU load. This all repeats every couple of minutes.

THIS DID NOT HAPPEN WITH Samba 2.2.8a. Samba 2.2.8a is completely stable under exactly the same test. FYI: Using a Windows share is also stable; only Samba 3.0.x has this problem.

Tested in the past: Linux RedHat 8.0, 9.0, kernel 2.4, kernel 2.6, Fedora Core 2, Core 3. Samba 3.0.0 - 3.0.14 (more or less every production release). Many different settings in the smb.conf file (e.g. oplocks on/off, different socket options, ...).

Conclusion: The problem is independent of the Linux system used and independent of the Windows clients (W2K, XP, SP1, SP2). It really seems that Samba 3.0.x has a bug here.

I can provide my smb.conf file on request. Write to Martin.Toeltsch@symena.com

Best regards
Martin
Created attachment 1333 [details] smb.conf This is the smb.conf file used for the test environment described in my problem description.
Could you do us a favor and retry this with the current released version, 3.0.14a and possibly also with latest SVN? In particular the open code has undergone heavy changes between 3.0.14a and the current version. What kind of test program do you use? I would like to possibly automate this test so that if we have a bug in the current code we can be more certain that we do not regress in the future. To write this I would either need your test program (I assume it's a Windows program) or a sniff of the traffic. Could you provide either? Thanks, Volker
Created attachment 1334 [details] MSVC 6.0 Samba stress test project

I zipped the relevant files of the MSVC project of my stress test program. In the Release directory you can find the compiled exe and the required stlport DLL. It should run as is. If not, I hope you can compile it.

Usage: It's a Win32 console application with 2 threads, "optimizer" and "client". You need 3 directories on a share (e.g. "s:"), e.g. "In", "Spool", "Out". Start the program with "sambatest s:\in s:\spool s:\out". When asked for the number of files, choose e.g. 10. Start the optimizer by pressing "o" (this thread writes files into the In dir). Start the client by pressing "c" (this thread lists the In dir and moves files to the Spool dir and the Out dir).

On other Win machines, start sambatest again with the same directories (In, Spool, Out). Use 1 as the number of files (it only matters that it is >0). Only start the client, by pressing "c". Start further clients on other machines.

Let the test run and observe. On our setup, after a couple of minutes, all the test programs just stop producing output on the terminal and everything blocks. Stop the programs by pressing "q", that's it.
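For readers who cannot build the MSVC 6.0 project, the access pattern described above can be approximated with a short Python sketch. This is not the attached sambatest program and has not been shown to trigger the bug; the function names (optimizer, client, run_test), the file naming scheme, and the default intervals are assumptions made for illustration, loosely following the ~100ms and ~500ms figures from the original report.

```python
import os
import random
import threading
import time


def optimizer(in_dir, n_files, stop, interval=0.1):
    """Writer thread: repeatedly creates n_files small files in the In directory."""
    seq = 0
    while not stop.is_set():
        for i in range(n_files):
            # Hypothetical naming scheme; it only has to produce unique names.
            path = os.path.join(in_dir, f"job_{os.getpid()}_{seq}_{i}.dat")
            with open(path, "w") as fh:
                fh.write("payload\n")
        seq += 1
        stop.wait(interval)


def client(in_dir, spool_dir, out_dir, stop, moved, interval=0.5):
    """Client thread: lists In, picks a random file, moves it In -> Spool -> Out."""
    while not stop.is_set():
        names = os.listdir(in_dir)
        if names:
            name = random.choice(names)
            spooled = os.path.join(spool_dir, name)
            try:
                os.replace(os.path.join(in_dir, name), spooled)
                os.replace(spooled, os.path.join(out_dir, name))
                moved.append(name)
            except OSError:
                # Another client grabbed the file first, or the file is still
                # open elsewhere (a sharing violation on Windows); try again later.
                pass
        stop.wait(interval)


def run_test(in_dir, spool_dir, out_dir, n_files=10, duration=60.0,
             n_clients=3, client_interval=0.5):
    """Runs one optimizer and n_clients clients for `duration` seconds.

    Returns the list of file names that reached the Out directory.
    """
    stop = threading.Event()
    moved = []
    threads = [threading.Thread(target=optimizer, args=(in_dir, n_files, stop))]
    for _ in range(n_clients):
        threads.append(threading.Thread(
            target=client,
            args=(in_dir, spool_dir, out_dir, stop, moved, client_interval)))
    for t in threads:
        t.start()
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()
    return moved
```

To mimic the setup above, one would call run_test(r"s:\in", r"s:\spool", r"s:\out") on each Windows machine with the share mapped to s:. Note that threads in a single process share one SMB connection, so this is at best a rough stand-in for several separate client machines.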
This is a little hard to reproduce if we need to have 10 simultaneous Windows clients. vmware sessions only go so far and I don't have a lab at home :-). What is the lowest number of Windows clients you've been able to reproduce this with? Have you been able to try the 3.0.20 code? There was a bug in 3.0.14a with deferred opens (which are on by default) which may have caused this problem. If you still have your test environment set up, please try setting:

    defer sharing violations = no

in the [global] section of your smb.conf and see if this fixes it.

Thanks,
Jeremy.
Ok, I'm running your stress test code with 3 vmware clients (2 running the client and 1 running client and optimizer) against a 3.0.20 pre-release smbd (current SAMBA_3_0 svn tree) and it seems to be running ok. I'm betting you were running into the deferred open bug. Can you confirm I'm running enough clients to test this properly please ? Jeremy.
Jeremy, I'm in the process of writing a Samba4 test that has exactly the same behaviour, just to test my oplock code. May take a little while, but this indeed looks like a nice test. BTW, I've run 3 simultaneous tests on a single workstation using the virtual IP address trick. Windows will open several connections. Volker
Martin, please don't reply directly to the samba-bugs@samba.org address. It is mostly a placeholder these days. It's better to keep all correspondence in the bug report itself.

------- Mails from Martin------------------

Servus Volker,

Thanks for the quick reply (I'm impressed). I uploaded the MSVC 6.0 project. Perhaps it helps you. In the Release directory you will find the application and one required DLL. I also included a quick description of the usage. Please note, it was never intended to be used outside, so please do not expect comfort ;-) I'm going to install Samba 3.0.20 on the Linux server. If I can do anything else, let me know.

------------------------------------------------

Hi all,

(1) Thanks for correcting the version to 3.0.14a; the Bugzilla page did not offer 3.0.14a in the list box.
(2) The latest version I tried was indeed 3.0.14a.
(3) I can test today every version that comes as an rpm archive. Sorry, I don't have time for compiling a source code snapshot or the like. What is SVN? Where can I get it?
(4) I'm going to test the deferred open bug right now with 3.0.14a. I'll keep you posted ...

Cheers
Martin

------------------------------------------------

Jeremy,

3 clients and 1 optimizer should usually be enough. The more clients are running, the higher the chance that you can observe the bug. However, I applied defer sharing violations = no to smb.conf and did not see any problems (using Samba 3.0.14a) for the last hour. This is a good sign. It really seems that it's the deferred open bug. I'll keep the stress test running for another couple of hours ...

Regards
Martin
Closing this out - pretty sure this was the known bug which is now fixed. Jeremy.
Sorry for the same, cleaning up the database to prevent unnecessary reopens of bugs.