Bug 6837 - "Too many open files" when trying to access large number of files
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.4
Classification: Unclassified
Component: File services
Version: 3.4.2
Hardware: x64
OS: Windows 7
Importance: P3 regression
Target Milestone: ---
Assignee: Jeremy Allison
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-10-22 13:46 UTC by Jernej Simončič
Modified: 2009-12-30 13:33 UTC
CC List: 3 users

See Also:


Attachments
Working 3.0.36 debug level 10 log (247.30 KB, application/x-lzma)
2009-12-01 12:35 UTC, Justin Maggard
no flags
Non-working 3.4.3 debug level 10 log (195.18 KB, application/x-lzma)
2009-12-01 12:36 UTC, Justin Maggard
no flags
Proposed fix for 3.4.4. (1.80 KB, patch)
2009-12-01 15:54 UTC, Jeremy Allison
no flags
Modified patch for 3.4.4 (1.80 KB, patch)
2009-12-01 20:15 UTC, Justin Maggard
no flags
git-am fix for 3.4.4. (2.36 KB, patch)
2009-12-02 12:33 UTC, Jeremy Allison
no flags
New patch to try and fix farmanager (2.13 KB, patch)
2009-12-02 17:00 UTC, Jeremy Allison
no flags
git-am format patch for 3.4.4. (2.66 KB, patch)
2009-12-17 19:02 UTC, Jeremy Allison
vl: review+

Description Jernej Simončič 2009-10-22 13:46:10 UTC
I recently upgraded to Samba 3.4.2 from Samba 3.3.7, and now when I try to compile setup for GIMP (which I cross-compile from Linux, and keep on a Samba share), the Inno Setup compiler dies with "File does not exist" error during the initial part of compile. It does this while it's reading version information from all files that are to be included in the installer (the file it tries to read is definitely there and otherwise is accessible). Afterwards my file manager (I'm using a console mode one, from which I also invoke the compiler) complains with "Too many open files. Cannot read folder contents." and doesn't display the directory contents.

Looking at compile logs, the error happens after the compiler has read 1007 files from 48 directories (the file it's trying to access is in a new directory).

Samba 3.4.2 is running as domain server, on Gentoo Linux amd64 (kernel 2.6.31-gentoo-r1, gcc (Gentoo 4.3.2-r4 p1.7, pie-10.1.5) 4.3.2), the client is Windows 7 RTM.
Comment 1 Jernej Simončič 2009-10-23 16:01:23 UTC
I looked a bit more into this: it seems that simply copying files will sometimes trigger this problem when reading reaches the 49th directory.
Comment 2 Jeremy Allison 2009-10-23 18:52:23 UTC
What does ulimit -n say for the max number of open files allowable ?
Jeremy.
Comment 3 Jernej Simončič 2009-10-24 04:57:57 UTC
It's 1024.
Comment 4 Volker Lendecke 2009-10-31 04:35:01 UTC
Can you increase the 1024 and retry?

Volker
Comment 5 Jernej Simončič 2009-10-31 06:03:49 UTC
Increasing to 2048 just causes the error to appear later (after 2038 files).

I did some other tests, and it appears that the problem doesn't happen when using Windows XP as a client. If I compare smbstatus output when reading files from XP and 7, there's a much larger number of files shown locked when reading from Windows 7 (although only a single file should be open at any given time).

I also tried this at work, where we have a Samba 3.0.33 as a fileserver, and couldn't reproduce the issue there. smbstatus also never showed more than a single file locked (I tried reading the files from XP and Win7 RC7100 clients).
Comment 6 Volker Lendecke 2009-10-31 06:42:55 UTC
Please upload network traces of Win7 against Samba 3.0 (the working version) and Win7 against 3.4 (the broken version) together with the respective debug level 10 logs of smbd. We need to know why Windows 7 does not close the files against the new version. Information on how to create useful network traces can be found under

http://wiki.samba.org/index.php/Capture_Packets

Volker
Comment 7 Jernej Simončič 2009-11-11 13:13:59 UTC
Here's Windows 7 against the failing server:
<http://eternallybored.org/misc/capture.tar.bz2>

While doing the capture, I noticed that the problem is somewhat timing sensitive - if I had the GUI Wireshark capturing, it apparently slowed everything down just enough that the problem didn't happen. The capture was made with ulimit -n at 1024.

I haven't been able to make a capture at work yet (where I couldn't reproduce this problem in my initial tests), however I now suspect that the reason I couldn't reproduce the problem there is that the network at work is only 100Mbps, while I have gigabit at home.
Comment 8 Justin Maggard 2009-11-30 19:59:58 UTC
This issue is still there in 3.4.3, and easily reproducible by just doing a drag-n-drop of the kernel source tree onto a samba share with oplocks enabled.  It works fine with oplocks disabled.

It seems Win7 and Win2K8-R2 behave differently than older versions of Windows.  With both Samba 3.0 and 3.4, smbstatus will show open files up to the ulimit -n limit.  But the error never shows up on 3.0; once the limit is reached, new files are closed immediately after usage.  With 3.4, this doesn't happen, and Windows reports an error.  With "oplocks = 0" set on the share, smbstatus reports just one file open at a time.

I'm in the process of getting debug level 10 logs and wireshark captures now, but they're going to be huge, since this problem occurs when copying thousands of files in one shot.  Is there a better way of posting these than attaching to the bug report?  Using gzip -9 gives me two ~60MB log.smbd files.
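The oplock setting Justin toggles is a per-share smb.conf option. A hypothetical share stanza (share name and path are placeholders) with the workaround applied:

```ini
# Illustrative share definition: with oplocks disabled, the Win7 client
# no longer accumulates open files on the server side.
[testshare]
    path = /srv/testshare
    read only = no
    oplocks = no
```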
Comment 9 Jeremy Allison 2009-12-01 10:32:07 UTC
I'm guessing it might be due to a change in the mapping of the error message from errno == EMFILE between older versions of Samba and 3.4.x. Easiest way to reproduce should be to create a directory with 200 files in it, then set ulimit to 150 file descriptors before running Samba. That should make it much easier to reproduce with a manageable log size. Justin, can you try this? I'd be really interested in what NTSTATUS we return to the client at the point we run out of fd's. I'm on the shuttle today but might be able to drive and visit your office to investigate this tomorrow (don't have a good Win7 test environment here yet, due to MSDN difficulties :-( ).

Jeremy.
Comment 10 Jeremy Allison 2009-12-01 10:50:14 UTC
Hmmm. No, don't see any difference in the error mapping between 3.0.x and 3.4 (it's a fairly obvious one). I need to see the log trace then...
Jeremy.
Comment 11 Justin Maggard 2009-12-01 12:35:45 UTC
Created attachment 5035
Working 3.0.36 debug level 10 log
Comment 12 Justin Maggard 2009-12-01 12:36:25 UTC
Created attachment 5036
Non-working 3.4.3 debug level 10 log
Comment 13 Jeremy Allison 2009-12-01 13:22:31 UTC
Ok, we figured this out. The reason it's not affecting 3.0.x is 3.0.x completely ignores any set ulimit and requests 10,000 open files (or infinity), whatever it can get as root. 3.4.x is a good citizen and obeys the system ulimit. As Win7 doesn't seem to cope with the NT_STATUS_TOO_MANY_OPEN_FILES error (more assumptions that all the world is a Windows server...) we need to figure out what the "normal" max file limit is for a Windows server, as we know that Win7 will keep under that. Then we'll change 3.3.x and 3.4.x to ignore ulimit if it's set lower than that and force at least that many fd's (plus the tdb fudge factor) to be available - logging an error message on the way.
Jeremy.
Comment 14 Jeremy Allison 2009-12-01 13:23:27 UTC
Marking as blocker as we really shouldn't ship something that's broken for Win7. Should be able to get this fixed this week.
Jeremy.
Comment 15 Justin Maggard 2009-12-01 14:06:49 UTC
After some more experimentation, log.smbd shows that numopen never goes above 1025 with a Win7 client.  I was able to copy a linux kernel source tree (~38,000 files) with no problems with ulimit set as low as 1036.  So AFAICT, 1025 + TDB fudge factor should do it.
Comment 16 Jernej Simončič 2009-12-01 14:31:20 UTC
Are you sure about that? In my tests, I raised ulimit -n to 2048, and still had samba fail.
Comment 17 Jernej Simončič 2009-12-01 15:10:16 UTC
Here's a level 10 debug log with ulimit -n 2048:
http://eternallybored.org/misc/samba/samba.log.lzma (too big to attach)

smbstatus output at the time of failure shows 2040 files open:
http://eternallybored.org/misc/samba/smbstatus
Comment 18 Justin Maggard 2009-12-01 15:46:44 UTC
Yes, I'm positive that in my test environment, it never goes above 1025.  I'm running Windows 7 final, 32-bit.

I tried another test with Windows 2008 R2, copying one kernel source tree from a Samba 3.4.3 share to the Win2K8 box, and simultaneously copying another kernel source tree from the 2K8 box to the Samba share, and numopen got up to 1027, but never over that number.  The copy processes succeeded with my ulimit still set to 1036.
Comment 19 Jeremy Allison 2009-12-01 15:54:24 UTC
Created attachment 5037
Proposed fix for 3.4.4.

Justin if you can test this and make sure it works I'll apply to all current branches.
Thanks !
Jeremy.
Comment 20 Jernej Simončič 2009-12-01 16:34:15 UTC
I'm running Windows 7 x64 here, and I used a directory with 100 subdirs, each with 30 small files for testing, as with this setup, I can reproduce the problem every time with ulimit -n 2048. I can also reproduce the problem with GIMP's make install target directory in about 1 out of 3 tries (with ulimit -n 2048; smbstatus will always show over 1500 open files when I do this).
Comment 21 Jeremy Allison 2009-12-01 19:22:33 UTC
Well Justin could reliably reproduce it, but this was with Win32. Can you test with Win32 and the patch and see if the problem is fixed? If so then we'll look again at the Win64 issue.
Jeremy.
Comment 22 Justin Maggard 2009-12-01 20:15:01 UTC
Created attachment 5039
Modified patch for 3.4.4

Interesting.  The patch as-is isn't working for me.  Looks like 1030 is too low.  It looks to me like Samba 3.4 is actually setting rlim_max to the system's current rlim_max + fudge factor at startup.  The current patch just changes rlim_max to 1030 if it's below that.  But I was only able to get a successful copy completed with rlimit set to at least 1036.  Right now, the smbd log is showing numopen == 1021 when the open starts failing, even though rlim_max is set at 1050.

This slightly modified patch works for me with both Win7 and Win2K8/R2.
Comment 23 Jeremy Allison 2009-12-01 21:05:26 UTC
Ok, I'm confused :-). 

The file_init() function calls lp_max_open_files(), which returns the value from max_open_files(), which should now be MIN_OPEN_FILES_WINDOWS after the patch.

Then it calls set_maxfiles() after adding in MAX_OPEN_FUDGEFACTOR, so this should be setting rlp.rlim_max to be (MIN_OPEN_FILES_WINDOWS + MAX_OPEN_FUDGEFACTOR), which should set the hard limit to be 1050.

Your change should set the hard limit to be 1070 (the new MIN_OPEN_FILES_WINDOWS + MAX_OPEN_FUDGEFACTOR), which seems more than we need.

Maybe I can visit tomorrow and investigate this first hand.

Jeremy.
Comment 24 Justin Maggard 2009-12-01 21:23:50 UTC
Right.  But using stock 3.4.3 without the patch, and rlp.rlim_max set to the default 1024, Samba will add the fudge factor to that and set the new limit to 1044.  My earlier testing showed that we need rlp.rlim_max >= 1036 to work properly; but that only works properly because Samba then adds the fudge factor, setting rlp.rlim_max = 1056.
Comment 25 Jeremy Allison 2009-12-01 22:25:29 UTC
Doh ! Ok - yeah, that makes sense, sorry :-).
Jeremy.
Comment 26 Jeremy Allison 2009-12-02 12:33:31 UTC
Created attachment 5047
git-am fix for 3.4.4.

Once reviewed I'll reassign to Karolin to push.
Comment 27 Jeremy Allison 2009-12-02 13:12:26 UTC
3.3.x and below are not affected.
Jeremy.
Comment 28 Jernej Simončič 2009-12-02 13:45:17 UTC
I installed 32bit Windows 7 in VMWare and the result is identical as with 64bit version - "Too many open files" after 2038 files have been read (ulimit -n 2048).
Comment 29 Jeremy Allison 2009-12-02 13:49:51 UTC
Ok, so we have a disconnect here. Justin reports the problem is fixed - so I need to know *exactly* how you're reproducing this. Justin is using Windows explorer and drag and drop. Is that what you're doing?
Jeremy.
Comment 30 Jernej Simončič 2009-12-02 14:24:39 UTC
I'm using FAR Manager <http://www.farmanager.com/> and I'm simply copying to nul. (I found the problem when using Inno Setup to create the installer for GIMP though - as one of the first steps of compilation, it tries to read version information from all files that are to be included in the installer).

BTW, I just did the same thing from Win7 32bit in VMWare to the host (Win7 64bit), and I could see around 3000 files open in Computer Management -> Shared Folders -> Open files.
Comment 31 Jeremy Allison 2009-12-02 14:31:28 UTC
Ok, can you try on your systems using Windows explorer, and check if this is fixed in the case Justin tested - in which case we've an application-specific bug here.
Jeremy.
Comment 32 Jernej Simončič 2009-12-02 15:22:03 UTC
Explorer seems much slower - I can't get it to open more than ~950 files at once on Win64, and ~1020 on Win32 (in VMWare). Just for comparison, with the initial phase of GIMP compile with Inno Setup, I get around 1700 files on first run and between 1800 and 2000 on subsequent runs according to smbstatus.
Comment 33 Jeremy Allison 2009-12-02 16:08:51 UTC
Ok, so if this is the case that http://www.farmanager.com/ forces many files open and the Windows client redirector doesn't restrict the number based on error returns then we're going to need to fall back to setting our fd limit as high as Windows, as we'll never know when we'll run into an app that needs this. Shame...
Jeremy.
Comment 34 Jernej Simončič 2009-12-02 16:26:53 UTC
I just used FAR because it was easy to reproduce the problem with it. Here's a small perl script that also triggers the problem:
http://eternallybored.org/misc/samba/readdir.pl
Comment 35 Jeremy Allison 2009-12-02 16:55:42 UTC
Ok, in a test against W2K3 I get:

maximum fnum is 16384

so maybe we should just try for that.

Jeremy.
Comment 36 Jeremy Allison 2009-12-02 17:00:51 UTC
Created attachment 5048
New patch to try and fix farmanager

Ups the default to:

MIN_OPEN_FILES_WINDOWS = 16384

and sets:

MAX_OPEN_FILES = (MIN_OPEN_FILES_WINDOWS + MAX_OPEN_FUDGEFACTOR)
Comment 37 Jeremy Allison 2009-12-17 19:02:35 UTC
Created attachment 5098
git-am format patch for 3.4.4.
Comment 38 Jeremy Allison 2009-12-17 20:18:20 UTC
I've noticed something very suspicious here. Against a Windows server, even when NT status codes are negotiated and used, NT_STATUS_TOO_MANY_OPENED_FILES is never returned - the server returns a DOS error code of ERRDOS, ERRnofids.

I wonder if the clients require this....

Jeremy.
Comment 39 Jeremy Allison 2009-12-17 20:35:50 UTC
I will test this tomorrow with the new code I added with a Win7 client.
Jeremy.
Comment 40 Simo Sorce 2009-12-18 08:02:02 UTC
Jeremy,
does this mean that we have a valid error to return for "too many files" and we will be able to back out the patch to increase open files beyond the ulimit set limits ?
Comment 41 Jeremy Allison 2009-12-18 10:15:48 UTC
That's what I'm planning to test this morning :-).

Jeremy.
Comment 42 Jeremy Allison 2009-12-18 14:07:53 UTC
No - forcing DOS errors doesn't make a difference to client behavior. We need to keep the 16k file limit minimum.
Jeremy.
Comment 43 Volker Lendecke 2009-12-29 03:08:49 UTC
Comment on attachment 5098
git-am format patch for 3.4.4.

Looks good
Comment 44 Karolin Seeger 2009-12-30 02:36:49 UTC
Pushed to v3-4-test. Will be included in 3.4.4.
Can we close the bug report, Jeremy?
Comment 45 Jeremy Allison 2009-12-30 13:33:15 UTC
Date: Wed, 30 Dec 2009 12:54:16 +0100
From: Jean-Jacques Moulis <jj@isy.liu.se>
To: samba@lists.samba.org
Subject: [Samba] Matlab not working when run of a samba 3.4.x  share
X-Mailer: Mahogany 0.67.1 'Constance', running under Windows XP (build 2600,
        Service Pack 3)

Matlab when installed on a samba 3.4.x share is not working.
(it works with 3.3.x and earlier versions)
The application stops with a java related message.
("matlab -nojvm" works)

the detailed log messages and Google led me to bug 6837.

Even if this bug is said to be Windows 7 related,
the patch for it fixes the matlab problem in 3.4.x
(which is present in XP and probably affects other java applications)

This message is for the record as the problem is to be
fixed in 3.4.4
Comment 46 Jeremy Allison 2009-12-30 13:33:34 UTC
Yes I think we can close this one out.

Jeremy.