Under very heavy load (several hundred users logging off and several hundred logging on simultaneously), smbd performance bottoms out, and the only remedy is to stop smbd, wait for all processes to terminate, and start it again. The system is then fine for about 24 hours until the next blast by our users.
3.0.20 does not suffer from this problem. 3.0.21a and 3.0.21b do (3.0.21c not tested yet).
There were invasive changes to locking/locking.c that I feel may be the cause, but as yet cannot prove.
We have a very high number of cookies per user (500-2000) in their roaming profiles. This may be exacerbating the problem and would explain the load relationship. After testing 20b, we plan to move cookies out of the roaming profiles using folder redirection and retry 21b (or 21c).
I've looked very carefully at the changes in locking/locking.c and can't see anything that would cause this issue. I do need more data on the machine and what all the smbd's are doing (the AIX equivalent of /proc/locks) when the machine is in this state to make any real progress on this.
(In reply to comment #1)
> I've looked very carefully at the changes in locking/locking.c and can't see
> anything that would cause this issue. I do need more data on the machine and
> what all the smbd's are doing (the AIX equivalent of /proc/locks) when the
> machine is in this state to make any real progress on this.
I truly believe you have. :-)
The system is 6-way Power4 LPAR, 12GB with a few dozen filesystems in our EMC SAN. AIX 5.2 TL-08-1 (latest code). Workstations are XP-SP2.
When I trussed a random smbd, it was still running, but very slowly. Since there are several hundred running (upwards of 1000), it's very hard for me to get specifics, but the trussed procs are still moving files back and forth within users' profiles. That seems to be the catalyst. Can you use any information from the perfpmr?
(In reply to comment #2)
OK, we installed 3.0.20b today and it happened. So it looks like locking.c is off the hook. But I'm now wondering about the remaining changes between 20 and 20b. I'm currently setting up 3.0.20a for install tomorrow.
Jeremy, I see several race fixes, namely the tdb clear-if-first and the NTcreate&X. Perhaps there is another place to look? I'm not sure what the changes are.
Sorry for the goose chase.
No problem, it's starting to look like it might be an AIX kernel locking bug,
but these can be the devil to track down. It took Sun a long time to catch theirs.
(In reply to comment #4)
> No problem, it's starting to look like it might be an AIX kernel locking bug,
> but these can be the devil to track down. It took Sun a long time to catch
I wasn't suggesting that. :-) 3.0.20 is still stable for us. I was just pointing out that 20b was as unstable as 21[ab], so locking.c was not the cause. We will move to 3.0.20a tomorrow morning, but will not have enough load until Monday.
I still think there is a tdb issue of sorts as we beat the crap out of 3.0.20 (go 20! go 20!) and it didn't falter for 48 hours, but put up 20b today and it didn't last 5 hours. 21a and 21b needed stop/start once a day (10 or 11 am our time depending on load).
I'm going to diff the crap out of 20 and 20a/b and try my best to find this. As always, I appreciate any insight you (and others) have to offer.
Created attachment 1847
re-introduce clearing of TDB_CLEAR_IF_FIRST flag in tdb_reopen_all
Between 3.0.20 and 3.0.20a tdb.c had 3 major modifications: a race fix, search optimization and removal of flag clearing in tdb_reopen_all().
This reintroduces the clearing of the TDB_CLEAR_IF_FIRST flag in tdb_reopen_all(). Without that clearing, many code paths are entered unnecessarily under heavy load. tdb_reopen_all() is called after a child is spawned from the main smbd.
The symptom of this problem is locking contention for certain tdb's, primarily locking.tdb.
With this simple patch, our load issue is gone.
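For anyone following along without the attachment open, here is a minimal sketch of the idea, not the patch itself. The struct, the flag value and every sketch_* name are hypothetical stand-ins; the real tdb structures and tdb_reopen_all() differ. It only illustrates that clearing the flag before each reopen keeps the child off the flag-guarded path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real flag; the actual value differs. */
#define SKETCH_CLEAR_IF_FIRST 0x1u

/* Hypothetical, much simplified tdb context kept in a linked list. */
struct sketch_tdb {
    const char *name;
    unsigned flags;
    struct sketch_tdb *next;
};

/* Returns 1 if reopening this context takes the flag-guarded path
 * (in the real library that path involves extra fcntl locking). */
static int sketch_reopen(struct sketch_tdb *t)
{
    assert(t != NULL);
    return (t->flags & SKETCH_CLEAR_IF_FIRST) ? 1 : 0;
}

/* Reopen every context, as a child would after being spawned.
 * With clear_flag set, the flag is dropped first so the guarded
 * path is skipped. Returns how many contexts took that path. */
static int sketch_reopen_all(struct sketch_tdb *list, int clear_flag)
{
    int guarded = 0;
    for (struct sketch_tdb *t = list; t != NULL; t = t->next) {
        if (clear_flag)
            t->flags &= ~SKETCH_CLEAR_IF_FIRST;
        guarded += sketch_reopen(t);
    }
    return guarded;
}
```

With several hundred children each reopening every tdb, skipping that path per context is where the saving would come from, assuming the guarded path is the expensive one on this platform.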
I'm glad you kept working on this.... THANKS !
I'll look at this one asap.
OK, that change was an actual bugfix for something that could allow races in the initialization code. I can't just revert this, I'm afraid. Bill, do you have a phone number we can discuss this on? Or email me and I'll give you mine.
Bill, I really need to know why this fixes the load problems on AIX. If it does, it looks like a scalability problem with fcntl locks in the AIX kernel code. The only difference in code paths I can see with reintroducing this change is that the tdb_reopen() call after the fork won't get the active read lock that marks the tdb as open. Was that particular call causing the problems in your case? Why was that so? It should just be stacking read locks within the kernel.
Can you contact me asap about this? I can see a way to fix this if we know we're in daemon mode (not in inetd mode, but people rarely run that way anymore). I'd like to chat about this some.
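To make the "stacking read locks" point concrete, here is a small standalone sketch using plain POSIX fcntl() byte-range locks. It is not tdb code: the byte offset, the temp file and the sketch_* names are assumptions made up for illustration. The parent takes a shared lock on one byte, and a forked child then takes its own shared lock on the same byte, which should succeed without blocking.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical byte offset used as the "this file is open" marker. */
#define SKETCH_ACTIVE_OFFSET 4

/* Try to take a shared (read) lock on one byte without blocking.
 * Returns 0 on success, -1 on failure. */
static int sketch_read_lock(int fd, off_t offset)
{
    struct flock fl = {0};
    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = offset;
    fl.l_len = 1;
    return fcntl(fd, F_SETLK, &fl);
}

/* Parent holds a read lock; a forked child takes its own read lock
 * on the same byte. Returns 1 if both succeed, 0 otherwise. */
static int sketch_read_locks_stack(void)
{
    char path[] = "/tmp/sketch_lock_XXXXXX";
    int fd = mkstemp(path);
    assert(fd >= 0);
    unlink(path); /* the file lives until the last descriptor closes */

    if (sketch_read_lock(fd, SKETCH_ACTIVE_OFFSET) != 0) {
        close(fd);
        return 0;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return 0;
    }
    if (pid == 0) {
        /* fcntl locks belong to the process and are not inherited
         * across fork(), so the child must acquire its own. */
        _exit(sketch_read_lock(fd, SKETCH_ACTIVE_OFFSET) == 0 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        close(fd);
        return 0;
    }
    close(fd);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Functionally the locks coexist, so if this call is what hurts under load, the cost would have to be in how the kernel tracks many holders of the same byte, which is the scalability question raised above.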
Created attachment 1848
Patch (based on your code).
Bill, this hopefully should be the same as your fix.
(In reply to comment #10)
> Created an attachment (id=1848) 
> Patch (based on your code).
> Bill, this hopefully should be the same as your fix.
Rebuilding my 3.0.21c tree with your patch. Will implement for tomorrow morning and will post results here by Tuesday as load varies on Fridays. The intent is the same and I'm confident it will work. :-)
(In reply to comment #11)
> (In reply to comment #10)
> > Created an attachment (id=1848) 
> > Patch (based on your code).
> > Bill, this hopefully should be the same as your fix.
> > Jeremy.
> Rebuilding my 3.0.21c tree with your patch. Will implement for tomorrow morning
> and will post results here by Tuesday as load varies on Fridays. The intent is
> the same and I'm confident it will work. :-)
> Much thanks!
We're all good here! I'd say it's fixed :-)
Ding dong the witch is dead.... :-).