3569 – smbd hangs and spins on locking.tdb - requires smbd restart

Status:	RESOLVED FIXED

Alias:	None

Product:	Samba 3.0
Classification:	Unclassified
Component:	File Services
Version:	3.0.21a
Hardware:	Other AIX

Importance:	P3 critical
Target Milestone:	none
Assignee:	Samba Bugzilla Account
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-03-01 10:29 UTC by William Jojo
Modified:	2006-04-14 10:20 UTC
CC List:	0 users

See Also:

Attachments
re-introduce clearing of TDB_CLEAR_IF_FIRST flag in tdb_reopen_all (406 bytes, patch) 2006-04-06 13:32 UTC, William Jojo
Patch (based on your code). (2.58 KB, patch) 2006-04-06 17:21 UTC, Jeremy Allison

Description William Jojo 2006-03-01 10:29:30 UTC

Under very heavy load (several hundred users logging off and several hundred logging on simultaneously), smbd performance bottoms out. smbd then has to be stopped, all of its processes terminated, and the service started again. The system is then fine for about 24 hours until the next blast from our users.

3.0.20 does not suffer from this problem. 3.0.21a and 3.0.21b do (3.0.21c has not been tested yet).

There were invasive changes to locking/locking.c that I feel may be the cause, but as yet cannot prove.

We have a very high number of cookies per user (500-2000) in their roaming profiles. This may be exacerbating the problem and would explain the load relationship. After testing 20b, we plan to move cookies out of the roaming profiles using folder redirection and retry 21b (or 21c).

Comment 1 Jeremy Allison 2006-03-01 10:31:51 UTC

I've looked very carefully at the changes in locking/locking.c and can't see anything that would cause this issue. I do need more data on the machine and what all the smbd's are doing (the AIX equivalent of /proc/locks) when the machine is in this state to make any real progress on this.
Jeremy.

Comment 2 William Jojo 2006-03-01 11:17:07 UTC

(In reply to comment #1)
> I've looked very carefully at the changes in locking/locking.c and can't see
> anything that would cause this issue. I do need more data on the machine and
> what all the smbd's are doing (the AIX equivalent of /proc/locks) when the
> machine is in this state to make any real progress on this.
> Jeremy.

I truly believe you have. :-)

The system is 6-way Power4 LPAR, 12GB with a few dozen filesystems in our EMC SAN. AIX 5.2 TL-08-1 (latest code). Workstations are XP-SP2.

When I trussed a random smbd, it was still running, but very slowly. Since there are several hundred running (upwards of 1000), it's very hard for me to get specifics, but the trussed procs are still moving files back and forth within users' profiles. That seems to be the catalyst. Can you use any information from the perfpmr?

Comment 3 William Jojo 2006-03-02 09:17:07 UTC

(In reply to comment #2)

OK, we installed 3.0.20b today and the problem happened again. So it looks like locking.c is off the hook. But I'm now wondering about the remaining changes between 20 and 20b. I'm currently setting up 3.0.20a for install tomorrow.

Jeremy, I see several race fixes, namely the tdb clear-if-first and the NTcreate&X. Perhaps there is another place to look? I'm not sure what the changes are.

Sorry for the goose chase.

Bill

Comment 4 Jeremy Allison 2006-03-02 12:28:12 UTC

No problem, it's starting to look like it might be an AIX kernel locking bug,
but these can be the devil to track down. It took Sun a long time to catch theirs.
Jeremy.

Comment 5 William Jojo 2006-03-02 13:02:21 UTC

(In reply to comment #4)
> No problem, it's starting to look like it might be an AIX kernel locking bug,
> but these can be the devil to track down. It took Sun a long time to catch
> theirs.
> Jeremy.

I wasn't suggesting that. :-) 3.0.20 is still stable for us. I was just pointing out that 20b was as unstable as 21[ab] so that locking.c was not the cause. We will move to 3.0.20a tomorrow morning, but will not have load enough until Monday.

I still think there is a tdb issue of sorts as we beat the crap out of 3.0.20 (go 20! go 20!) and it didn't falter for 48 hours, but put up 20b today and it didn't last 5 hours. 21a and 21b needed stop/start once a day (10 or 11 am our time depending on load).

I'm going to diff the crap out of 20 and 20a/b and try my best to find this. As always, I appreciate any insight you (and others) have to offer.

Cheers,

Bill

Comment 6 William Jojo 2006-04-06 13:32:12 UTC

Created attachment 1847
re-introduce clearing of TDB_CLEAR_IF_FIRST flag in tdb_reopen_all

Between 3.0.20 and 3.0.20a tdb.c had 3 major modifications: a race fix, search optimization and removal of flag clearing in tdb_reopen_all().

This patch reintroduces the clearing of the TDB_CLEAR_IF_FIRST flag in tdb_reopen_all(). Without that clearing, many code paths are entered unnecessarily under heavy load. tdb_reopen_all() is called after a child is spawned from the main smbd.

The symptom of this problem is locking contention for certain tdb's, primarily locking.tdb.

With this simple patch, our load issue is gone.
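
As a rough illustration of the change described above (not the attached patch itself), the following C sketch models a list of open databases whose "clear if first" flag is dropped after a fork. Every name here (`demo_tdb`, `demo_reopen_all`, `DEMO_CLEAR_IF_FIRST`, `DEMO_NOLOCK`) is hypothetical; the real tdb structures, flag values, and reopen logic differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not the real tdb flag values. */
#define DEMO_CLEAR_IF_FIRST 0x1u
#define DEMO_NOLOCK         0x4u

/* Hypothetical model of one open database context in a linked list. */
struct demo_tdb {
    const char *name;
    unsigned int flags;
    struct demo_tdb *next;
};

/* Returns nonzero if a reopen of this context would still take the
 * "first opener" code path in this simplified model. */
int demo_takes_first_open_path(const struct demo_tdb *t)
{
    assert(t != NULL);
    return (t->flags & DEMO_CLEAR_IF_FIRST) != 0;
}

/* Walk every context and clear the "clear if first" flag, as a forked
 * child might do before reopening. Returns how many contexts changed. */
int demo_reopen_all(struct demo_tdb *head)
{
    int changed = 0;
    for (struct demo_tdb *t = head; t != NULL; t = t->next) {
        if (demo_takes_first_open_path(t)) {
            t->flags &= ~DEMO_CLEAR_IF_FIRST;
            changed++;
        }
    }
    return changed;
}
```

In this model, a child that calls `demo_reopen_all()` once no longer takes the "first opener" path for any database, while other flags are left untouched.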

Comment 7 Jeremy Allison 2006-04-06 13:48:36 UTC

I'm glad you kept working on this.... THANKS !
I'll look at this one asap.
Jeremy.

Comment 8 Jeremy Allison 2006-04-06 14:01:22 UTC

OK, that change was an actual bugfix for a problem that could allow races in the initialization code, so I'm afraid I can't just revert it. Bill, do you have a phone number where we can discuss this? Or email me and I'll give you mine.
Jeremy.

Comment 9 Jeremy Allison 2006-04-06 14:20:49 UTC

Bill, I really need to know why this fixes the load problems on AIX. If it does, it looks like a scalability problem with fcntl locks in the AIX kernel code. The only difference in code paths I can see with reintroducing this change is that the tdb_reopen() call after the fork won't get the active read lock to mark the tdb as open. Was that particular call causing the problems in your case? If so, why? It should just be stacking read locks within the kernel.

Can you contact me asap about this. I can see a way to fix this if we know we're in daemon mode (not in inetd mode, but people rarely run that way anymore). I'd like to chat about this some.

Jeremy.
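
For reference, the kind of shared lock discussed in this comment can be sketched with plain POSIX fcntl(). This is only an illustration under stated assumptions: the helper names and the byte offset passed by the caller are hypothetical and are not taken from the tdb source.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) a file to lock against. */
int demo_open_lock_file(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Apply one fcntl lock operation of the given type to a single byte. */
static int demo_lock_op(int fd, off_t offset, short type)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;        /* F_RDLCK or F_UNLCK in this sketch */
    fl.l_whence = SEEK_SET;  /* offset is measured from start of file */
    fl.l_start = offset;
    fl.l_len = 1;            /* lock exactly one byte */
    /* F_SETLKW blocks until the lock can be granted. */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Take a shared (read) lock on one byte; many processes may hold it at once.
 * Returns 0 on success, -1 on error with errno set. */
int demo_read_lock_byte(int fd, off_t offset)
{
    return demo_lock_op(fd, offset, F_RDLCK);
}

/* Release the lock on that byte. Returns 0 on success, -1 on error. */
int demo_unlock_byte(int fd, off_t offset)
{
    return demo_lock_op(fd, offset, F_UNLCK);
}
```

Because read locks are shared, every smbd child could hold the same one-byte lock simultaneously; whether the kernel handles several hundred such holders efficiently is the scalability question raised above.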

Comment 10 Jeremy Allison 2006-04-06 17:21:41 UTC

Created attachment 1848
Patch (based on your code).

Bill, this hopefully should be the same as your fix.
Jeremy.

Comment 11 William Jojo 2006-04-06 19:11:35 UTC

(In reply to comment #10)
> Created an attachment (id=1848)
> Patch (based on your code).
> Bill, this hopefully should be the same as your fix.
> Jeremy.

Rebuilding my 3.0.21c tree with your patch. Will implement for tomorrow morning and will post results here by Tuesday as load varies on Fridays. The intent is the same and I'm confident it will work. :-)

Much thanks!

Bill

Comment 12 William Jojo 2006-04-14 06:12:15 UTC

(In reply to comment #11)
> (In reply to comment #10)
> > Created an attachment (id=1848)
> > Patch (based on your code).
> > Bill, this hopefully should be the same as your fix.
> > Jeremy.
> Rebuilding my 3.0.21c tree with your patch. Will implement for tomorrow morning
> and will post results here by Tuesday as load varies on Fridays. The intent is
> the same and I'm confident it will work. :-)
> Much thanks!
> Bill

Jeremy,

We're all good here! I'd say it's fixed :-)

Bill

Comment 13 Jeremy Allison 2006-04-14 10:20:29 UTC

Ding dong the witch is dead.... :-).
Thanks Bill.
Jeremy.