Bug 5955 - Windbind becomes unresponsive when load on winbind is high
Summary: Windbind becomes unresponsive when load on winbind is high
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.2
Classification: Unclassified
Component: Winbind (show other bugs)
Version: 3.2.4
Hardware: x86 Linux
: P3 normal
Target Milestone: ---
Assignee: Michael Adam
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-12-09 06:45 UTC by Erik Sørnes
Modified: 2014-07-23 18:38 UTC (History)
3 users (show)

See Also:


Attachments
log.wb-AHUS (454.48 KB, text/plain)
2008-12-09 09:32 UTC, Erik Sørnes
no flags Details
log.wb-BUILTIN (94.69 KB, text/plain)
2008-12-09 09:33 UTC, Erik Sørnes
no flags Details
log.wbinfo (16.29 KB, text/plain)
2008-12-09 09:33 UTC, Erik Sørnes
no flags Details
log.winbindd (958.94 KB, text/plain)
2008-12-09 09:43 UTC, Erik Sørnes
no flags Details
log.winbindd-dc-connect (28.59 KB, text/plain)
2008-12-09 09:44 UTC, Erik Sørnes
no flags Details
log.winbindd-idmap (109.40 KB, text/plain)
2008-12-09 09:44 UTC, Erik Sørnes
no flags Details
patch to speed up fill_grent_mem() (1.75 KB, patch)
2008-12-09 10:15 UTC, Michael Adam
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Erik Sørnes 2008-12-09 06:45:59 UTC
After upgrading from samba 3.0.24 to 3.2.4 on our fileserver, We notice that the filserveren sometimes become unresponsive.
We can trigger the issue by running the command "getent passwd" or "getfacl *" on a folder with files that has lot of acls. A lot of acl's meaning about 2-10.

When the server hangs, "wbinfo -p" returns:

Ping to winbindd failed
could not ping winbindd!

Commands like smbstatus, getfacl, ls -lat also hangs.

strace wbinfo -p returns

fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
fcntl64(4, F_GETFD)                     = 0
fcntl64(4, F_SETFD, FD_CLOEXEC)         = 0
connect(4, {sa_family=AF_FILE, path="/tmp/.winbindd/pipe"}, 110) = 0
select(5, [4], NULL, NULL, {0, 0})      = 0 (Timeout)
write(4, "0\10\0\0\0\0\0\0\0\0\0\0lp\0\0\0\10\0\0\0\0\0\0\0\0\0\0"..., 2096) = 2096
select(5, [4], NULL, NULL, {5, 0})      = 0 (Timeout)
select(5, [4], NULL, NULL, {5, 0})      = 0 (Timeout)
select(5, [4], NULL, NULL, {5, 0})      = 0 (Timeout)
select(5, [4], NULL, NULL, {5, 0})      = 0 (Timeout)
select(5, [4], NULL, NULL, {5, 0})      = 0 (Timeout)

we use an openldap 2.2.24 backend, the server has been joined to a windows2003 domain with "net ads join"

log level set to 10 doesn't give any useful output. The log shows winbind just stops responding at some point without an errormessage.

Restart of the winbindprosesses solves the problem. The really big problem here is that this causes the nightly backup to fail. So have been without backup on our system for 2 month.

We did not have this problem at all with samba 3.0.24

-regards
Erik
Comment 1 Erik Sørnes 2008-12-09 06:46:50 UTC
We have reproduced the problem on a testserver. 
Upgrading the testserver to 3.3.0-pre2 does not help
Comment 2 Erik Sørnes 2008-12-09 08:31:49 UTC
It seems to me winbind starts becoming unresponsive when it treats the group <Domain name>+Domain Users. What is special about this group is that they have extremely many members >4000, maybe more.
Comment 3 Michael Adam 2008-12-09 08:41:09 UTC
Hi Erik,

could you please post your (relevant, global) samba configuration?

And winbind log files of a reproduced problem (all files log.w* usually).

What precisely do you mean, saying "we use an openldap 2.2.24 backend":
Do you use an ldap "passdb backend"?
If so - why? A member server does not have to maintain a user database of its own.

Thanks, Michael
Comment 4 Erik Sørnes 2008-12-09 08:58:33 UTC
Hi Erik,

>could you please post your (relevant, global) samba configuration?

[global]
        workgroup = ahus
        netbios name = kjenn
        security = ADS
        realm = AHUS.NO
        encrypt passwords = yes
        lock directory = /var/lock/samba
        log level = 1
        name resolve order = host
        wins server = 10.225.3.6
        ldap admin dn= cn=Manager,dc=ahus,dc=no
        ldap idmap suffix = ou=Idmap
        ldap suffix = dc=ahus,dc=no
        idmap backend = ldap:ldap://127.0.0.1
        idmap uid = 10000-200000
        idmap gid = 10000-200000
        winbind enum users = yes
        winbind enum groups = yes
        winbind separator = +
        winbind use default domain = yes
        winbind nested groups = yes
        template shell = /bin/bash
        template homedir = /home/%U
        default devmode = yes
        socket options = TCP_NODELAY


>What precisely do you mean, saying "we use an openldap 2.2.24 backend":

We use it to store uid/gid-sid-mappings in the openldap server. This is because we have an offline node, the data being replicated with to this offline node with drbd. The other node must be activated manually and we need  the uid/gid-sid-mappings to be the same on both nodes.

>Do you use an ldap "passdb backend"? 
No only for uid/gid-sid-mapping

I will suply logfiles shortly

-regards
Erik
Comment 5 Erik Sørnes 2008-12-09 09:21:42 UTC
I will upload all of the log.w-logs with log level=10 while running the command

chgrp SIAADM+Domain\ Users testfil (2 times after each other)
While running the command, (wbinfo -p returns "unable to ping winbind")
After a while (maybe one minute, the command fails, and we are again able to ping winbind).


PS. This was run on the testserver LORDVADER. same smb.conf, same sambaversion (3.2.4) etc..., etc...

-regards
Erik
Comment 6 Erik Sørnes 2008-12-09 09:32:46 UTC
Created attachment 3797 [details]
log.wb-AHUS
Comment 7 Erik Sørnes 2008-12-09 09:33:20 UTC
Created attachment 3798 [details]
log.wb-BUILTIN
Comment 8 Erik Sørnes 2008-12-09 09:33:43 UTC
Created attachment 3799 [details]
log.wbinfo
Comment 9 Erik Sørnes 2008-12-09 09:43:25 UTC
Created attachment 3800 [details]
log.winbindd
Comment 10 Erik Sørnes 2008-12-09 09:44:03 UTC
Created attachment 3801 [details]
log.winbindd-dc-connect
Comment 11 Erik Sørnes 2008-12-09 09:44:27 UTC
Created attachment 3802 [details]
log.winbindd-idmap
Comment 12 Michael Adam 2008-12-09 10:12:04 UTC
Ok, I see in the logs that the group has some 7300 members.
I recently wrote a patch that hast not been released yet
but will be in 3.2.6 and 3.3.0 to speed up an internal routine
(fill_grent_mem) that collects the member information for a group.

When it bit me, it was with a group of 250,000 members, and I am
not completely sure that it will solve your problem, but well,
you could give it a try!
I will attach a patch next.

Cheers - Michael

PS: for debugging, it helps if you put "debug hires timestamp = yes"
and "debug pid = yes" in your smb.conf.
Comment 13 Michael Adam 2008-12-09 10:15:10 UTC
Created attachment 3804 [details]
patch to speed up fill_grent_mem()

this patch should apply to 3.2.4
though I have not tested it now
it has gone into the v3-2-test tree after the 3.2.4 release
Comment 14 Erik Sørnes 2008-12-09 12:10:26 UTC
I have applied the patch to our testsystem, LORDVADER, and it helps alot to the problem:

After patch
chgrp  SIAADM+Domain\ Users takes 1 min and 15 sek, and succeeds. We can ping winbind (but slow response...) all the time during setting of the group. This is with debug level 10. With debug level 1, setting of the group ownership takes 53 seconds. To read the acl of the testfile with getfacl takes now 5 seconds on the testsystem. Yes, I think this patch will solve our problem with the backup atleast, although 53 seconds is a long time.

To summarize, without the patch, setting the group results in an error message, and we are not able to ping winbind while it is being applied, so it is certainly an approvement! Maybe our backup will work with this... it looks like it.

Our testsystem is a Pentium III 500Mhz system. Maybe this is the best response time we can expect, 1min and 15 sek to apply group ownership of Domain Users to a file? This server does not have any user load and no other applications running on it.

-regards
Erik
Comment 15 Michael Adam 2008-12-09 15:45:45 UTC
Thanks for your feedback, Erik!

(In reply to comment #14)
> I have applied the patch to our testsystem, LORDVADER, and it helps 
> alot to the problem:
> 
> After patch
> chgrp  SIAADM+Domain\ Users takes 1 min and 15 sek, and succeeds. We can ping
> winbind (but slow response...) all the time during setting of the group. This
> is with debug level 10. With debug level 1, setting of the group ownership
> takes 53 seconds. To read the acl of the testfile with getfacl takes now 5
> seconds on the testsystem. Yes, I think this patch will solve our problem with
> the backup atleast, although 53 seconds is a long time.
> 
> To summarize, without the patch, setting the group results in an 
> error message, and we are not able to ping winbind while it is being
> applied, so it is certainly an approvement!
> Maybe our backup will work with this... it looks like it.

Ok that is good news at least, while not perfect of course...

> Our testsystem is a Pentium III 500Mhz system. Maybe this is the best response
> time we can expect, 1min and 15 sek to apply group ownership of Domain 
> Users to a file? This server does not have any user load and no other 
> applications running on it.

Ok, the system that showed timeouts with 250K group members was as
machine with 8 cores, 2.5GHz each... 
How much ram does your system have?

The problem is that some of the jobs related to the group membership
functions are done by the main winbindd process, which is why it
winbindd unresponsive (or less responsive) for some time if there
are large groups to handle.

We definitely need to track this down further.
We need to find out what introduced this regression from 3.0.24.
3.2 should not perform worse than 3.0.24.b

I did not test with the chgrp call yet in a comparable system.
What is the time for a getent group call?
And what is the time for an id call to a user that is member of that group?
Can you provide the corresponding timings with 3.0.24 on the same test system?

One more thing: I would like to check whether the ldap idmap backend
introduces additional latency. Could you try on a test-system if
"idmap backend = tdb" has the same timings or if they are better?
(I expect them to be better.)

Cheers - Michael
Comment 16 Erik Sørnes 2008-12-10 02:48:08 UTC
>How much ram does your system have?

640MB

>What is the time for a getent group call?

"getent group" takes 10 minutes and 21 seconds.
There are 4 trusted domains, 3 of them containing more than 10000 object.

>And what is the time for an id call to a user that is member of that group?
erso is a member of SIAADM+erso:
wbinfo -n SIAADM+erso  : immediately ( < 1 sec)
wbinfo -S <sid output from above>  : immediately ( < 1 sec)
wbinfo -r SIAADM+erso  : about 1 sec.
chown SIAADM+erso testfil : immediately ( < 1 sec)

>Can you provide the corresponding timings with 3.0.24 on the same test system?

With 3.0.24 installed on same system:

chown SIAADM+Domain\ Users testfil : 9 sec ( compared to 53 sec with 3.2.4+patch)
getfacl testfil (same file as above): immediately ( < 1 sec.) 
getent group: 1 min. 44 sec. 
wbinfo -n SIAADM+erso  : immediately ( < 1 sec)
wbinfo -S <sid output from above>  : immediately ( < 1 sec)
wbinfo -r SIAADM+erso  : 2 sec.
chown SIAADM+erso testfil : immediately ( < 1 sec)

-kind regards
Erik
Comment 17 Michael Adam 2008-12-10 03:24:18 UTC
(In reply to comment #16)
> >How much ram does your system have?
> 
> 640MB
> 
> >What is the time for a getent group call?
> 
> "getent group" takes 10 minutes and 21 seconds.
> There are 4 trusted domains, 3 of them containing more than 10000 object.

Oh, alright. That is interesting.
But I actually meant "getent group SIAADM+Domain\ Users"

> >And what is the time for an id call to a user that is member of that group?
> erso is a member of SIAADM+erso:
> wbinfo -n SIAADM+erso  : immediately ( < 1 sec)
> wbinfo -S <sid output from above>  : immediately ( < 1 sec)
> wbinfo -r SIAADM+erso  : about 1 sec.
> chown SIAADM+erso testfil : immediately ( < 1 sec)

Ok, I think this is for a user _not_ member of that group.
I was asking for "id SIAADM+username" for a user that IS a member.

> >Can you provide the corresponding timings with 3.0.24 on the same test system?
> 
> With 3.0.24 installed on same system:
> 
> chown SIAADM+Domain\ Users testfil : 9 sec ( compared to 53 sec with
> 3.2.4+patch)

That is chgrp, I suppose?

> getfacl testfil (same file as above): immediately ( < 1 sec.) 
> getent group: 1 min. 44 sec. 
> wbinfo -n SIAADM+erso  : immediately ( < 1 sec)
> wbinfo -S <sid output from above>  : immediately ( < 1 sec)
> wbinfo -r SIAADM+erso  : 2 sec.
> chown SIAADM+erso testfil : immediately ( < 1 sec)

Hmmm, the timings of the getent group and the chgrp calls
seem to say that we have a drop in performance by a factor 
of 6 here.

We need to find out why this is...

I'd really be interested in timing values with "idmap backend = tdb"
for comparison.

Michael
Comment 18 Michael Adam 2008-12-10 06:41:49 UTC
Erik, 

BTW: you should be able to turn off the behaviour by 
setting "winbind enum users = no" in your smb.conf.
This (among other things) enables a workaround for the 
"Domain Users" group that prevents its members from
being expanded. usually, this is not necessary anyways.

Of course this will only really help you if the Domain Users
group is the only group of that size in your environment.

Is there a particular reason you explicitly set
"winbind enum users = yes" and "winbind enum groups = yes" ?

Cheers - Michael
Comment 19 Erik Sørnes 2008-12-10 08:23:27 UTC
Some unfortunate mispellings in my last message.

>> erso is a member of SIAADM+erso:
>> wbinfo -n SIAADM+erso  : immediately ( < 1 sec)
>> wbinfo -S <sid output from above>  : immediately ( < 1 sec)
>> wbinfo -r SIAADM+erso  : about 1 sec.
>> chown SIAADM+erso testfil : immediately ( < 1 sec)

>Ok, I think this is for a user _not_ member of that group.
>I was asking for "id SIAADM+username" for a user that IS a member.

The correct is erso is a member of SIAADM+Domain\ Users. 
Thats right, SIAADM+erso is a member of the group with over 7000 members and the response times are correct.

>But I actually meant "getent group SIAADM+Domain\ Users"
I misundrestood you there. With debuglevel 1:

with samba 3.0.24:                         about 1 second.
with samba 3.2.4 - with patch from you:    about 4 seconds

>> >Can you provide the corresponding timings with 3.0.24 on the same test system?
>> 
>> With 3.0.24 installed on same system:
>> 
>> chown SIAADM+Domain\ Users testfil : 9 sec ( compared to 53 sec with
>> 3.2.4+patch)

>That is chgrp, I suppose?

Correct, I writing-error by me: chgrop SIAADM+Domain\ Users testfil : 9 sec

>I'd really be interested in timing values with "idmap backend = tdb"
>for comparison.

I will provide that as soon as possible

>Is there a particular reason you explicitly set
>"winbind enum users = yes" and "winbind enum groups = yes" ?

I cannot recall the exact reason we did this. This is the first "large" sambaserver in our enterprise, which was first installed som years ago. Maybe the tutorial we followed recomended us to do it, maybe it was the "Turning off group enumeration may cause some programs to behave oddly." -warning in the man-pages for smb.conf. I am not at all sure we need this entries set to "yes"

Setting this entries to no, seems to make the problem go away on the server. I am very unsertain about the implications setting "winbind enum groups = no" for the windows users of the samba file server. Maybe you can shed som light on this? 

Comment 20 Björn Jacke 2008-12-10 08:37:22 UTC
user/group enumeration is a bad thing, you should never enable it except for testing purposes. Programs that *do* user/group enumerations are evil (mc for example) in environments with a lot of users. I know no program that refuses to work with enumeration turned off. Samba will work just fine without. The only excepltion is the listing of all users in KDM for example, but well ... :)
Comment 21 Erik Sørnes 2008-12-10 12:53:29 UTC
>I'd really be interested in timing values with "idmap backend = tdb"
>for comparison.

Yes that is _much_ faster.
Samba 3.0.24 + your patch, idmap-backend = tdb, log level 1, winbind enum user/group = yes, testserver LORDVADER 500Mhz/600Mb memory:

getent group SIAADM+Domain\ Users takes 1-2 secs.
chgrp SIAADM+Domain\ Users testfil takes 2-3 secs.
getfacl <file above> takes 5 sec first time, 1-2 sec after that

-regards
Erik

Comment 22 Michael Adam 2008-12-10 14:40:52 UTC
(In reply to comment #21)
> >I'd really be interested in timing values with "idmap backend = tdb"
> >for comparison.
> 
> Yes that is _much_ faster.
> Samba 3.0.24 + your patch, idmap-backend = tdb, log level 1, winbind enum
> user/group = yes, testserver LORDVADER 500Mhz/600Mb memory:

I assume you mean 3.2.4 instead of 3.0.24 :-)
 
> getent group SIAADM+Domain\ Users takes 1-2 secs.
> chgrp SIAADM+Domain\ Users testfil takes 2-3 secs.
> getfacl <file above> takes 5 sec first time, 1-2 sec after that

Ok, that seems to show that the additional delay is caused
by the rewrite of the idmap code that has been done after 3.0.24.
(It has been added in 3.0.25, iirc.)

We should track down what makes this so much more time-consuming.

In the meanwhile, you have a valid workaround, I think:
use "winbind enum users = no". As Björn has detailed in comment #20,
there should be no harm in that.

Als an alternative, you can use idmap backend = tdb and 
regularly do a tdbbackup of winbindd_idmap.tdb and copy
the backup over to the other host, but of course this is
not as immediate as a replicating ldap server...

(A third way would be to use samba in conjunction with ctdb
and have ctdb manage the replication of the underlying 
tdb databases. While ctdb is designed for the use of samba
in active-active clusters, it could in principle also be
deployed in a failover setup like yours...)

As for tracking down the performance regression:

There has been another (slight) rewrite of the idmap code lately,
so if you are daring enough, you might want to try 3.3.0rc1.
(3.3.0 is scheduled to be released next week if all goes well!)
So we can see if that rewrite brings an improvement already.

Erik, thanks for all your testing and feedback!
Cheers - Michael
Comment 23 Erik Sørnes 2008-12-12 06:17:20 UTC
Thank you  very much for your help!
We have implemented "winbind enum users/groups = no". And the response times are great on this old hardware now! The backup is running again!

-regards 
Erik
Comment 24 Björn Jacke 2014-07-23 18:38:49 UTC
looks like this can be closed