The Samba-Bugzilla – Bug 6837
"Too many open files" when trying to access large number of files
Last modified: 2009-12-30 13:33:34 UTC
I recently upgraded from Samba 3.3.7 to Samba 3.4.2, and now when I try to compile the installer for GIMP (which I cross-compile from Linux and keep on a Samba share), the Inno Setup compiler dies with a "File does not exist" error during the initial part of the compile. This happens while it is reading version information from all the files that are to be included in the installer (the file it tries to read is definitely there and is otherwise accessible). Afterwards my file manager (I'm using a console-mode one, from which I also invoke the compiler) complains with "Too many open files. Cannot read folder contents." and doesn't display the directory contents.
Looking at compile logs, the error happens after the compiler has read 1007 files from 48 directories (the file it's trying to access is in a new directory).
Samba 3.4.2 is running as domain server, on Gentoo Linux amd64 (kernel 2.6.31-gentoo-r1, gcc (Gentoo 4.3.2-r4 p1.7, pie-10.1.5) 4.3.2), the client is Windows 7 RTM.
I looked a bit more into this: it seems that simply copying files will sometimes trigger this problem when reading reaches the 49th directory.
What does ulimit -n say for the maximum number of open files allowed?
Can you increase the 1024 and retry?
Increasing to 2048 just causes the error to appear later (after 2038 files).
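For reference, the limit discussed here is the per-process RLIMIT_NOFILE, which is what ulimit -n reports. A minimal Python sketch (not part of Samba; the helper name is made up for illustration) that prints the soft and hard values for the current process:

```python
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


soft, hard = nofile_limits()
# resource.RLIM_INFINITY means "no limit" for either value.
print(f"soft={soft} hard={hard}")
```

The values that matter are the ones inherited by smbd, so this would need to run in the same environment that starts the daemon.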
I did some other tests, and it appears that the problem doesn't happen when using Windows XP as a client. If I compare smbstatus output when reading files from XP and 7, there's a much larger number of files shown locked when reading from Windows 7 (although only a single file should be open at any given time).
I also tried this at work, where we have a Samba 3.0.33 as a fileserver, and couldn't reproduce the issue there. smbstatus also never showed more than a single file locked (I tried reading the files from XP and Win7 RC7100 clients).
Please upload network traces of Win7 against Samba 3.0 (the working version) and Win7 against 3.4 (the broken version) together with the respective debug level 10 logs of smbd. We need to know why Windows 7 does not close the files against the new version. Information on how to create useful network traces can be found under
Here's Windows 7 against the failing server:
While doing the capture, I noticed that the problem is somewhat timing sensitive - if I had the GUI Wireshark capturing, it apparently slowed everything down just enough that the problem didn't happen. The capture was made with ulimit -n at 1024.
I haven't been able to make a capture at work yet (where I couldn't reproduce this problem in my initial tests), however I now suspect that the reason I couldn't reproduce the problem there is that the network at work is only 100Mbps, while I have gigabit at home.
This issue is still there in 3.4.3, and easily reproducible by just doing a drag-n-drop of the kernel source tree onto a samba share with oplocks enabled. It works fine with oplocks disabled.
It seems Win7 and Win2K8-R2 behave differently than older versions of Windows. With both Samba 3.0 and 3.4, smbstatus will show open files up to the ulimit -n limit. But the error never shows up on 3.0; once the limit is reached, new files are closed immediately after usage. With 3.4, this doesn't happen, and Windows reports an error. With "oplocks = 0" set on the share, smbstatus reports just one file open at a time.
I'm in the process of getting debug level 10 logs and wireshark captures now, but they're going to be huge, since this problem occurs when copying thousands of files in one shot. Is there a better way of posting these than attaching to the bug report? Using gzip -9 gives me 2 ~60MB log.smbd files.
I'm guessing it might be due to a change in the mapping of the error message from errno == EMFILE between older versions of Samba and 3.4.x. The easiest way to reproduce should be to create a directory with 200 files in it, then set ulimit to 150 file descriptors before running Samba. That should make it much easier to reproduce with a manageable log size. Justin, can you try this? I'd be really interested in what NTSTATUS we return to the client at the point we run out of fd's. I'm on the shuttle today but might be able to drive and visit your office to investigate this tomorrow (don't have a good Win7 test environment here yet, due to MSDN difficulties :-( ).
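The same 200-files / 150-descriptors idea can be tried in a single local process to see errno == EMFILE appear. This is only a sketch of the failure condition, not of smbd itself, and it says nothing about which NTSTATUS reaches the client; the function names are hypothetical.

```python
import errno
import os
import resource
import tempfile


def open_until_failure(paths):
    """Open each path without closing; return (handles, errno or None)."""
    handles = []
    try:
        for path in paths:
            handles.append(open(path, "rb"))
    except OSError as exc:
        return handles, exc.errno
    return handles, None


def demo_emfile(n_files=200, soft_limit=150):
    """Create n_files, lower the soft fd limit, and open files until it fails."""
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if old_hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_hard)
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for i in range(n_files):
            path = os.path.join(root, f"f{i:03d}.txt")
            with open(path, "w") as fh:
                fh.write("x")
            paths.append(path)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
        try:
            handles, err = open_until_failure(paths)
            opened = len(handles)
            for handle in handles:
                handle.close()
        finally:
            # Restore the original limit before the temp directory is removed.
            resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
    return opened, err


opened, err = demo_emfile()
print(f"opened {opened} files before errno {err} ({errno.errorcode.get(err)})")
```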
Hmmm. No, don't see any difference in the error mapping between 3.0.x and 3.4 (it's a fairly obvious one). I need to see the log trace then...
Created attachment 5035 [details]
Working 3.0.36 debug level 10 log
Created attachment 5036 [details]
Non-working 3.4.3 debug level 10 log
Ok, we figured this out. The reason it's not affecting 3.0.x is that 3.0.x completely ignores any set ulimit and requests 10,000 open files (or infinity), whatever it can get as root. 3.4.x is a good citizen and obeys the system ulimit. As Win7 doesn't seem to cope with the NT_STATUS_TOO_MANY_OPENED_FILES error (more assumptions that all the world is a Windows server...) we need to figure out what the "normal" max file limit is for a Windows server, as we know that Win7 will keep under that. Then we'll change 3.3.x and 3.4.x to ignore ulimit if it's set lower than that and force at least that many fd's (plus the tdb fudge factor) to be available - logging an error message on the way.
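A rough sketch of the policy described above, in Python rather than Samba's C. The function and parameter names are hypothetical, and the actual minimum and fudge factor are left as inputs because they had not been settled at this point in the thread:

```python
import logging

log = logging.getLogger("max_open_files_sketch")


def effective_fd_target(ulimit_nofile, min_windows, fudge):
    """Return the fd count to request: never below the Windows minimum.

    ulimit_nofile: limit inherited from the system
    min_windows:   lowest limit a Windows client is assumed to tolerate
    fudge:         extra descriptors reserved for tdb files and the like
    """
    if ulimit_nofile < min_windows:
        # Ignore the too-low ulimit, but say so in the log.
        log.error(
            "open file limit %d is below the Windows minimum %d; using %d",
            ulimit_nofile, min_windows, min_windows,
        )
        wanted = min_windows
    else:
        wanted = ulimit_nofile
    return wanted + fudge
```

A limit already above the minimum is left alone apart from the fudge factor; only a lower one is overridden.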
Marking as blocker as we really shouldn't ship something that's broken for Win7. Should be able to get this fixed this week.
After some more experimentation, log.smbd shows that numopen never goes above 1025 with a Win7 client. I was able to copy a linux kernel source tree (~38,000 files) with no problems with ulimit set as low as 1036. So AFAICT, 1025 + TDB fudge factor should do it.
Are you sure about that? In my tests, I raised ulimit -n to 2048, and still had samba fail.
Here's a level 10 debug log with ulimit -n 2048:
http://eternallybored.org/misc/samba/samba.log.lzma (too big to attach)
smbstatus output at the time of failure shows 2040 files open:
Yes, I'm positive that in my test environment, it never goes above 1025. I'm running Windows 7 final, 32-bit.
I tried another test with Windows 2008 R2, copying one kernel source tree from a Samba 3.4.3 share to the Win2K8 box, and simultaneously copying another kernel source tree from the 2K8 box to the Samba share, and numopen got up to 1027, but never over that number. The copy processes succeeded with my ulimit still set to 1036.
Created attachment 5037 [details]
Proposed fix for 3.4.4.
Justin if you can test this and make sure it works I'll apply to all current branches.
I'm running Windows 7 x64 here, and I used a directory with 100 subdirs, each with 30 small files for testing, as with this setup, I can reproduce the problem every time with ulimit -n 2048. I can also reproduce the problem with GIMP's make install target directory in about 1 out of 3 tries (with ulimit -n 2048; smbstatus will always show over 1500 open files when I do this).
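A test tree like the one described (100 subdirectories with 30 small files each) could be generated with something along these lines. This is a sketch with made-up names; the reading helper simply opens, reads and closes every file, and would have to be run from the Windows client against the share path to exercise the server.

```python
import os
import tempfile


def build_test_tree(root, n_dirs=100, files_per_dir=30, payload=b"x" * 64):
    """Create n_dirs subdirectories under root, each with small files."""
    count = 0
    for d in range(n_dirs):
        sub = os.path.join(root, f"dir{d:03d}")
        os.makedirs(sub, exist_ok=True)
        for f in range(files_per_dir):
            with open(os.path.join(sub, f"file{f:02d}.dat"), "wb") as fh:
                fh.write(payload)
            count += 1
    return count


def read_all(root):
    """Read every file under root and discard the data; return file count."""
    n = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as fh:
                fh.read()
            n += 1
    return n


with tempfile.TemporaryDirectory() as demo_root:
    created = build_test_tree(demo_root, n_dirs=4, files_per_dir=3)
    print(created, read_all(demo_root))
```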
Well, Justin could reliably reproduce it, but this was with Win32. Can you test with Win32 and the patch and see if the problem is fixed? If so then we'll look again at the Win64 issue.
Created attachment 5039 [details]
Modified patch for 3.4.4
Interesting. The patch as-is isn't working for me. Looks like 1030 is too low. It looks to me like Samba 3.4 is actually setting rlim_max to the system's current rlim_max + fudge factor at startup. The current patch just changes rlim_max to 1030 if it's below that. But I was only able to get a successful copy completed with rlimit set to at least 1036. Right now, the smbd log is showing numopen == 1021 when the open starts failing, even though rlim_max is set at 1050.
This slightly modified patch works for me with both Win7 and Win2K8/R2.
Ok, I'm confused :-).
The file_init() function calls lp_max_open_files(), which returns the value from max_open_files(), which should now be MIN_OPEN_FILES_WINDOWS after the patch.
Then it calls set_maxfiles() after adding in MAX_OPEN_FUDGEFACTOR, so this should be setting rlp.rlim_max to be (MIN_OPEN_FILES_WINDOWS + MAX_OPEN_FUDGEFACTOR), which should set the hard limit to be 1050.
Your change should set the hard limit to be 1070 (the new MIN_OPEN_FILES_WINDOWS + MAX_OPEN_FUDGEFACTOR), which seems more than we need.
Maybe I can visit tomorrow and investigate this first hand.
Right. But using stock 3.4.3 without the patch, and rlp.rlim_max set to the default 1024, Samba will add the fudge factor to that and set the new limit to 1044. My earlier testing showed that we need rlp.rlim_max >= 1036 to work properly; but that only works properly because Samba then adds the fudge factor, setting rlp.rlim_max = 1056.
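The arithmetic in the last few comments, spelled out. The fudge factor of 20 is inferred from the figures quoted in this thread (1024 becoming 1044); the labels are descriptive only.

```python
# Assumption: MAX_OPEN_FUDGEFACTOR is 20, inferred from 1024 -> 1044 above.
MAX_OPEN_FUDGEFACTOR = 20


def hard_limit_after_startup(base):
    """Limit smbd ends up requesting for a given base value."""
    return base + MAX_OPEN_FUDGEFACTOR


cases = {
    "stock 3.4.3, default ulimit": 1024,  # -> 1044, copy fails
    "lowest ulimit that worked": 1036,    # -> 1056, copy succeeds
    "first patch minimum": 1030,          # -> 1050, still below 1056
    "modified patch minimum": 1050,       # -> 1070, above 1056
}

for label, base in cases.items():
    print(f"{label}: {base} + {MAX_OPEN_FUDGEFACTOR} = {hard_limit_after_startup(base)}")
```

So the first patch, at 1050 in total, falls short of the 1056 that the earlier test actually ran with, while the modified one clears it.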
Doh ! Ok - yeah, that makes sense, sorry :-).
Created attachment 5047 [details]
git-am fix for 3.4.4.
Once reviewed I'll reassign to Karolin to push.
3.3.x and below are not affected.
I installed 32-bit Windows 7 in VMware and the result is identical to the 64-bit version - "Too many open files" after 2038 files have been read (ulimit -n 2048).
Ok, so we have a disconnect here. Justin reports the problem is fixed - so I need to know *exactly* how you're reproducing this. Justin is using Windows explorer and drag and drop. Is that what you're doing?
I'm using FAR Manager <http://www.farmanager.com/> and I'm simply copying to nul. (I found the problem when using Inno Setup to create the installer for GIMP though - as one of the first steps of compilation, it tries to read version information from all files that are to be included in the installer).
BTW, I just did the same thing from Win7 32bit in VMWare to the host (Win7 64bit), and I could see around 3000 files open in Computer Management -> Shared Folders -> Open files.
Ok, can you try on your systems using Windows explorer, and check whether this is fixed in the case Justin tested? If it is, we have an application-specific bug here.
Explorer seems much slower - I can't get it to open more than ~950 files at once on Win64, and ~1020 on Win32 (in VMware). Just for comparison, with the initial phase of the GIMP compile with Inno Setup, I get around 1700 files on the first run and between 1800 and 2000 on subsequent runs according to smbstatus.
Ok, so if this is the case that http://www.farmanager.com/ forces many files open and the Windows client redirector doesn't restrict the number based on error returns then we're going to need to fall back to setting our fd limit as high as Windows, as we'll never know when we'll run into an app that needs this. Shame...
I just used FAR because it was easy to reproduce the problem with it. Here's a small perl script that also triggers the problem:
Ok, in a test against W2K3 I get "maximum fnum is 16384", so maybe we should just try for that.
Created attachment 5048 [details]
New patch to try and fix farmanager
Ups the default to :
MIN_OPEN_FILES_WINDOWS = 16384
and sets :
MAX_OPEN_FILES = (MIN_OPEN_FILES_WINDOWS + MAX_OPEN_FUDGEFACTOR)
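As an illustration of what these constants imply at the process level, here is an unprivileged Python sketch that tries to lift the soft RLIMIT_NOFILE towards that target. It is not the patch itself: smbd runs as root and can raise the hard limit too, which this sketch does not attempt. The fudge factor value of 20 is inferred from earlier comments, and the function name is hypothetical.

```python
import resource

MIN_OPEN_FILES_WINDOWS = 16384
MAX_OPEN_FUDGEFACTOR = 20  # assumption, inferred from 1024 -> 1044 earlier
MAX_OPEN_FILES = MIN_OPEN_FILES_WINDOWS + MAX_OPEN_FUDGEFACTOR


def raise_nofile_soft_limit(wanted=MAX_OPEN_FILES):
    """Raise the soft fd limit towards wanted, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft  # already high enough
    new_soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        return soft  # the platform refused; keep what we had
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print(f"target {MAX_OPEN_FILES}, soft limit now {raise_nofile_soft_limit()}")
```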
Created attachment 5098 [details]
git-am format patch for 3.4.4.
I've noticed something very suspicious here. Against a Windows server, even when NT status codes are negotiated and used, NT_STATUS_TOO_MANY_OPENED_FILES is never returned - the server returns a DOS error code of ERRDOS, ERRnofids.
I wonder if the clients require this....
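To make the two error forms concrete, here is a small Python lookup sketch. The numeric values are the ones I recall from the published SMB and NTSTATUS documentation (ERRDOS class 0x01, ERRnofids code 4, NT status 0xC000011F) and should be treated as assumptions; the table and function are illustrative and are not Samba's actual mapping code.

```python
import errno

# Assumed values, as recalled from the protocol documentation.
NT_STATUS_TOO_MANY_OPENED_FILES = 0xC000011F
ERRDOS = 0x01
ERRnofids = 4

# Hypothetical table: how running out of fds could be reported either way.
ERRNO_TO_WIRE = {
    errno.EMFILE: {
        "ntstatus": NT_STATUS_TOO_MANY_OPENED_FILES,
        "dos": (ERRDOS, ERRnofids),
    },
}


def wire_error(err, use_nt_status):
    """Return the NT status, or the (class, code) DOS pair, for an errno."""
    entry = ERRNO_TO_WIRE[err]
    return entry["ntstatus"] if use_nt_status else entry["dos"]


print(hex(wire_error(errno.EMFILE, True)), wire_error(errno.EMFILE, False))
```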
I will test this tomorrow with the new code I added with a Win7 client.
Does this mean that we have a valid error to return for "too many files", and that we will be able to back out the patch that increases open files beyond the limit set by ulimit?
That's what I'm planning to test this morning :-).
No - forcing DOS errors doesn't make a difference to client behavior. We need to keep the 16k file limit minimum.
Comment on attachment 5098 [details]
git-am format patch for 3.4.4.
Pushed to v3-4-test. Will be included in 3.4.4.
Can we close the bug report, Jeremy?
Date: Wed, 30 Dec 2009 12:54:16 +0100
From: Jean-Jacques Moulis <email@example.com>
Subject: [Samba] Matlab not working when run of a samba 3.4.x share
Matlab when installed on a samba 3.4.x share is not working.
(it works with 3.3.x and earlier versions)
The application stops with a java related message.
("matlab -nojvm" works)
the detailed log messages and Google led me to bug 6837.
Even if this bug is said to be Windows 7 related,
the patch for it fixes the matlab problem in 3.4.x
(which is present in XP and probably affects other java applications)
This message is for the record as the problem is to be
fixed in 3.4.4
Yes I think we can close this one out.