After upgrading from samba 3.0.24 to 3.2.4 on our fileserver, We notice that the filserveren sometimes become unresponsive. We can trigger the issue by running the command "getent passwd" or "getfacl *" on a folder with files that has lot of acls. A lot of acl's meaning about 2-10. When the server hangs, "wbinfo -p" returns: Ping to winbindd failed could not ping winbindd! Commands like smbstatus, getfacl, ls -lat also hangs. strace wbinfo -p returns fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 fcntl64(4, F_GETFD) = 0 fcntl64(4, F_SETFD, FD_CLOEXEC) = 0 connect(4, {sa_family=AF_FILE, path="/tmp/.winbindd/pipe"}, 110) = 0 select(5, [4], NULL, NULL, {0, 0}) = 0 (Timeout) write(4, "0\10\0\0\0\0\0\0\0\0\0\0lp\0\0\0\10\0\0\0\0\0\0\0\0\0\0"..., 2096) = 2096 select(5, [4], NULL, NULL, {5, 0}) = 0 (Timeout) select(5, [4], NULL, NULL, {5, 0}) = 0 (Timeout) select(5, [4], NULL, NULL, {5, 0}) = 0 (Timeout) select(5, [4], NULL, NULL, {5, 0}) = 0 (Timeout) select(5, [4], NULL, NULL, {5, 0}) = 0 (Timeout) we use an openldap 2.2.24 backend, the server has been joined to a windows2003 domain with "net ads join" log level set to 10 doesn't give any useful output. The log shows winbind just stops responding at some point without an errormessage. Restart of the winbindprosesses solves the problem. The really big problem here is that this causes the nightly backup to fail. So have been without backup on our system for 2 month. We did not have this problem at all with samba 3.0.24 -regards Erik
We have reproduced the problem on a testserver. Upgrading the testserver to 3.3.0-pre2 does not help
It seems to me winbind starts becoming unresponsive when it treats the group <Domain name>+Domain Users. What is special about this group is that they have extremely many members >4000, maybe more.
Hi Erik, could you please post your (relevant, global) samba configuration? And winbind log files of a reproduced problem (all files log.w* usually). What precisely do you mean, saying "we use an openldap 2.2.24 backend": Do you use an ldap "passdb backend"? If so - why? A member server does not have to maintain a user database of its own. Thanks, Michael
Hi Erik, >could you please post your (relevant, global) samba configuration? [global] workgroup = ahus netbios name = kjenn security = ADS realm = AHUS.NO encrypt passwords = yes lock directory = /var/lock/samba log level = 1 name resolve order = host wins server = 10.225.3.6 ldap admin dn= cn=Manager,dc=ahus,dc=no ldap idmap suffix = ou=Idmap ldap suffix = dc=ahus,dc=no idmap backend = ldap:ldap://127.0.0.1 idmap uid = 10000-200000 idmap gid = 10000-200000 winbind enum users = yes winbind enum groups = yes winbind separator = + winbind use default domain = yes winbind nested groups = yes template shell = /bin/bash template homedir = /home/%U default devmode = yes socket options = TCP_NODELAY >What precisely do you mean, saying "we use an openldap 2.2.24 backend": We use it to store uid/gid-sid-mappings in the openldap server. This is because we have an offline node, the data being replicated with to this offline node with drbd. The other node must be activated manually and we need the uid/gid-sid-mappings to be the same on both nodes. >Do you use an ldap "passdb backend"? No only for uid/gid-sid-mapping I will suply logfiles shortly -regards Erik
I will upload all of the log.w-logs with log level=10 while running the command chgrp SIAADM+Domain\ Users testfil (2 times after each other) While running the command, (wbinfo -p returns "unable to ping winbind") After a while (maybe one minute, the command fails, and we are again able to ping winbind). PS. This was run on the testserver LORDVADER. same smb.conf, same sambaversion (3.2.4) etc..., etc... -regards Erik
Created attachment 3797 [details] log.wb-AHUS
Created attachment 3798 [details] log.wb-BUILTIN
Created attachment 3799 [details] log.wbinfo
Created attachment 3800 [details] log.winbindd
Created attachment 3801 [details] log.winbindd-dc-connect
Created attachment 3802 [details] log.winbindd-idmap
Ok, I see in the logs that the group has some 7300 members. I recently wrote a patch that hast not been released yet but will be in 3.2.6 and 3.3.0 to speed up an internal routine (fill_grent_mem) that collects the member information for a group. When it bit me, it was with a group of 250,000 members, and I am not completely sure that it will solve your problem, but well, you could give it a try! I will attach a patch next. Cheers - Michael PS: for debugging, it helps if you put "debug hires timestamp = yes" and "debug pid = yes" in your smb.conf.
Created attachment 3804 [details] patch to speed up fill_grent_mem() this patch should apply to 3.2.4 though I have not tested it now it has gone into the v3-2-test tree after the 3.2.4 release
I have applied the patch to our testsystem, LORDVADER, and it helps alot to the problem: After patch chgrp SIAADM+Domain\ Users takes 1 min and 15 sek, and succeeds. We can ping winbind (but slow response...) all the time during setting of the group. This is with debug level 10. With debug level 1, setting of the group ownership takes 53 seconds. To read the acl of the testfile with getfacl takes now 5 seconds on the testsystem. Yes, I think this patch will solve our problem with the backup atleast, although 53 seconds is a long time. To summarize, without the patch, setting the group results in an error message, and we are not able to ping winbind while it is being applied, so it is certainly an approvement! Maybe our backup will work with this... it looks like it. Our testsystem is a Pentium III 500Mhz system. Maybe this is the best response time we can expect, 1min and 15 sek to apply group ownership of Domain Users to a file? This server does not have any user load and no other applications running on it. -regards Erik
Thanks for your feedback, Erik! (In reply to comment #14) > I have applied the patch to our testsystem, LORDVADER, and it helps > alot to the problem: > > After patch > chgrp SIAADM+Domain\ Users takes 1 min and 15 sek, and succeeds. We can ping > winbind (but slow response...) all the time during setting of the group. This > is with debug level 10. With debug level 1, setting of the group ownership > takes 53 seconds. To read the acl of the testfile with getfacl takes now 5 > seconds on the testsystem. Yes, I think this patch will solve our problem with > the backup atleast, although 53 seconds is a long time. > > To summarize, without the patch, setting the group results in an > error message, and we are not able to ping winbind while it is being > applied, so it is certainly an approvement! > Maybe our backup will work with this... it looks like it. Ok that is good news at least, while not perfect of course... > Our testsystem is a Pentium III 500Mhz system. Maybe this is the best response > time we can expect, 1min and 15 sek to apply group ownership of Domain > Users to a file? This server does not have any user load and no other > applications running on it. Ok, the system that showed timeouts with 250K group members was as machine with 8 cores, 2.5GHz each... How much ram does your system have? The problem is that some of the jobs related to the group membership functions are done by the main winbindd process, which is why it winbindd unresponsive (or less responsive) for some time if there are large groups to handle. We definitely need to track this down further. We need to find out what introduced this regression from 3.0.24. 3.2 should not perform worse than 3.0.24.b I did not test with the chgrp call yet in a comparable system. What is the time for a getent group call? And what is the time for an id call to a user that is member of that group? Can you provide the corresponding timings with 3.0.24 on the same test system? One more thing: I would like to check whether the ldap idmap backend introduces additional latency. Could you try on a test-system if "idmap backend = tdb" has the same timings or if they are better? (I expect them to be better.) Cheers - Michael
>How much ram does your system have? 640MB >What is the time for a getent group call? "getent group" takes 10 minutes and 21 seconds. There are 4 trusted domains, 3 of them containing more than 10000 object. >And what is the time for an id call to a user that is member of that group? erso is a member of SIAADM+erso: wbinfo -n SIAADM+erso : immediately ( < 1 sec) wbinfo -S <sid output from above> : immediately ( < 1 sec) wbinfo -r SIAADM+erso : about 1 sec. chown SIAADM+erso testfil : immediately ( < 1 sec) >Can you provide the corresponding timings with 3.0.24 on the same test system? With 3.0.24 installed on same system: chown SIAADM+Domain\ Users testfil : 9 sec ( compared to 53 sec with 3.2.4+patch) getfacl testfil (same file as above): immediately ( < 1 sec.) getent group: 1 min. 44 sec. wbinfo -n SIAADM+erso : immediately ( < 1 sec) wbinfo -S <sid output from above> : immediately ( < 1 sec) wbinfo -r SIAADM+erso : 2 sec. chown SIAADM+erso testfil : immediately ( < 1 sec) -kind regards Erik
(In reply to comment #16) > >How much ram does your system have? > > 640MB > > >What is the time for a getent group call? > > "getent group" takes 10 minutes and 21 seconds. > There are 4 trusted domains, 3 of them containing more than 10000 object. Oh, alright. That is interesting. But I actually meant "getent group SIAADM+Domain\ Users" > >And what is the time for an id call to a user that is member of that group? > erso is a member of SIAADM+erso: > wbinfo -n SIAADM+erso : immediately ( < 1 sec) > wbinfo -S <sid output from above> : immediately ( < 1 sec) > wbinfo -r SIAADM+erso : about 1 sec. > chown SIAADM+erso testfil : immediately ( < 1 sec) Ok, I think this is for a user _not_ member of that group. I was asking for "id SIAADM+username" for a user that IS a member. > >Can you provide the corresponding timings with 3.0.24 on the same test system? > > With 3.0.24 installed on same system: > > chown SIAADM+Domain\ Users testfil : 9 sec ( compared to 53 sec with > 3.2.4+patch) That is chgrp, I suppose? > getfacl testfil (same file as above): immediately ( < 1 sec.) > getent group: 1 min. 44 sec. > wbinfo -n SIAADM+erso : immediately ( < 1 sec) > wbinfo -S <sid output from above> : immediately ( < 1 sec) > wbinfo -r SIAADM+erso : 2 sec. > chown SIAADM+erso testfil : immediately ( < 1 sec) Hmmm, the timings of the getent group and the chgrp calls seem to say that we have a drop in performance by a factor of 6 here. We need to find out why this is... I'd really be interested in timing values with "idmap backend = tdb" for comparison. Michael
Erik, BTW: you should be able to turn off the behaviour by setting "winbind enum users = no" in your smb.conf. This (among other things) enables a workaround for the "Domain Users" group that prevents its members from being expanded. usually, this is not necessary anyways. Of course this will only really help you if the Domain Users group is the only group of that size in your environment. Is there a particular reason you explicitly set "winbind enum users = yes" and "winbind enum groups = yes" ? Cheers - Michael
Some unfortunate mispellings in my last message. >> erso is a member of SIAADM+erso: >> wbinfo -n SIAADM+erso : immediately ( < 1 sec) >> wbinfo -S <sid output from above> : immediately ( < 1 sec) >> wbinfo -r SIAADM+erso : about 1 sec. >> chown SIAADM+erso testfil : immediately ( < 1 sec) >Ok, I think this is for a user _not_ member of that group. >I was asking for "id SIAADM+username" for a user that IS a member. The correct is erso is a member of SIAADM+Domain\ Users. Thats right, SIAADM+erso is a member of the group with over 7000 members and the response times are correct. >But I actually meant "getent group SIAADM+Domain\ Users" I misundrestood you there. With debuglevel 1: with samba 3.0.24: about 1 second. with samba 3.2.4 - with patch from you: about 4 seconds >> >Can you provide the corresponding timings with 3.0.24 on the same test system? >> >> With 3.0.24 installed on same system: >> >> chown SIAADM+Domain\ Users testfil : 9 sec ( compared to 53 sec with >> 3.2.4+patch) >That is chgrp, I suppose? Correct, I writing-error by me: chgrop SIAADM+Domain\ Users testfil : 9 sec >I'd really be interested in timing values with "idmap backend = tdb" >for comparison. I will provide that as soon as possible >Is there a particular reason you explicitly set >"winbind enum users = yes" and "winbind enum groups = yes" ? I cannot recall the exact reason we did this. This is the first "large" sambaserver in our enterprise, which was first installed som years ago. Maybe the tutorial we followed recomended us to do it, maybe it was the "Turning off group enumeration may cause some programs to behave oddly." -warning in the man-pages for smb.conf. I am not at all sure we need this entries set to "yes" Setting this entries to no, seems to make the problem go away on the server. I am very unsertain about the implications setting "winbind enum groups = no" for the windows users of the samba file server. Maybe you can shed som light on this?
user/group enumeration is a bad thing, you should never enable it except for testing purposes. Programs that *do* user/group enumerations are evil (mc for example) in environments with a lot of users. I know no program that refuses to work with enumeration turned off. Samba will work just fine without. The only excepltion is the listing of all users in KDM for example, but well ... :)
>I'd really be interested in timing values with "idmap backend = tdb" >for comparison. Yes that is _much_ faster. Samba 3.0.24 + your patch, idmap-backend = tdb, log level 1, winbind enum user/group = yes, testserver LORDVADER 500Mhz/600Mb memory: getent group SIAADM+Domain\ Users takes 1-2 secs. chgrp SIAADM+Domain\ Users testfil takes 2-3 secs. getfacl <file above> takes 5 sec first time, 1-2 sec after that -regards Erik
(In reply to comment #21) > >I'd really be interested in timing values with "idmap backend = tdb" > >for comparison. > > Yes that is _much_ faster. > Samba 3.0.24 + your patch, idmap-backend = tdb, log level 1, winbind enum > user/group = yes, testserver LORDVADER 500Mhz/600Mb memory: I assume you mean 3.2.4 instead of 3.0.24 :-) > getent group SIAADM+Domain\ Users takes 1-2 secs. > chgrp SIAADM+Domain\ Users testfil takes 2-3 secs. > getfacl <file above> takes 5 sec first time, 1-2 sec after that Ok, that seems to show that the additional delay is caused by the rewrite of the idmap code that has been done after 3.0.24. (It has been added in 3.0.25, iirc.) We should track down what makes this so much more time-consuming. In the meanwhile, you have a valid workaround, I think: use "winbind enum users = no". As Björn has detailed in comment #20, there should be no harm in that. Als an alternative, you can use idmap backend = tdb and regularly do a tdbbackup of winbindd_idmap.tdb and copy the backup over to the other host, but of course this is not as immediate as a replicating ldap server... (A third way would be to use samba in conjunction with ctdb and have ctdb manage the replication of the underlying tdb databases. While ctdb is designed for the use of samba in active-active clusters, it could in principle also be deployed in a failover setup like yours...) As for tracking down the performance regression: There has been another (slight) rewrite of the idmap code lately, so if you are daring enough, you might want to try 3.3.0rc1. (3.3.0 is scheduled to be released next week if all goes well!) So we can see if that rewrite brings an improvement already. Erik, thanks for all your testing and feedback! Cheers - Michael
Thank you very much for your help! We have implemented "winbind enum users/groups = no". And the response times are great on this old hardware now! The backup is running again! -regards Erik
looks like this can be closed