Bug 5773 - windows client timeout while connecting samba joined ADS with 32K users
Status: NEW
Product: Samba 3.0
Classification: Unclassified
Component: winbind
Version: 3.0.28a
Hardware: Other Windows XP
Importance: P3 regression
Target Milestone: none
Assigned To: Samba Bugzilla Account
QA Contact: Samba QA Contact
Depends on:
Blocks:
Reported: 2008-09-18 08:47 UTC by Shekhar
Modified: 2008-09-21 23:50 UTC (History)
Attachments
ethereal capture with 3.0.28 when winbindd is busy (7.18 KB, text/plain)
2008-09-18 08:48 UTC, Shekhar
ethereal capture with 3.0.28 when winbindd is free (73.86 KB, text/plain)
2008-09-18 08:48 UTC, Shekhar
Pictorial output of ethereal when winbindd is busy (182.05 KB, image/jpeg)
2008-09-18 08:49 UTC, Shekhar
Pictorial output of ethereal when winbindd is free (397.24 KB, image/jpeg)
2008-09-18 08:50 UTC, Shekhar

Description Shekhar 2008-09-18 08:47:20 UTC
I am facing an issue with winbindd in Samba 3.0.28. I have a requirement to join an ADS domain with 32K users, and the join itself succeeds. However, I cannot access the Samba server from Windows XP while winbindd is busy. For example, wbinfo -u or wbinfo -g takes a very long time, as expected, but if I try to access the Samba server from XP during that period, the connection fails with a "check spelling..." kind of message. The issue is that while winbindd is updating the idmap cache entries, or while secrets.tdb is being modified, clients get no response from Samba and time out. I have captured ethereal output for both cases and attached it to this bug.
     This suggests that winbindd makes Samba block until it gets a response from ADS, so Samba cannot respond within the client's keep-alive period and the client times out. When I discussed this on IRC, I was told that similar issues were fixed in 3.0.32. I upgraded for testing purposes and the problem was indeed fixed.
My concern now is that the box I am working on is in the field and I can't upgrade the package just like that; a lot of regression and validation testing would need to be performed on the new Samba. So it would be great if someone could provide the patches that fix this issue.
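One way to put numbers on "winbindd is busy" is to time a cheap request against it repeatedly while the long enumeration runs. The following is only a rough sketch, assuming Python is available on the box; wbinfo -p (ping winbindd) is the real command being timed, while the helper names time_command and sample_responsiveness are made up for illustration.

```python
import subprocess
import time


def time_command(argv, timeout=10.0):
    """Run argv; return elapsed seconds, or None if it fails or times out."""
    start = time.monotonic()
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # Covers a timeout, a non-zero exit status, and a missing binary.
        return None
    return time.monotonic() - start


def sample_responsiveness(argv, samples=5, interval=1.0, timeout=10.0):
    """Time argv several times, sleeping `interval` seconds between runs."""
    results = []
    for _ in range(samples):
        results.append(time_command(argv, timeout))
        time.sleep(interval)
    return results
```

For example, calling sample_responsiveness(["wbinfo", "-p"]) in one shell while wbinfo -u runs in another would show whether the ping replies slow down or stop. A None entry means the ping failed or did not answer within the timeout.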
Comment 1 Shekhar 2008-09-18 08:48:22 UTC
Created attachment 3598 [details]
ethereal capture with 3.0.28 when winbindd is busy
Comment 2 Shekhar 2008-09-18 08:48:59 UTC
Created attachment 3599 [details]
ethereal capture with 3.0.28 when winbindd is free
Comment 3 Shekhar 2008-09-18 08:49:44 UTC
Created attachment 3600 [details]
Pictorial output of ethereal when winbindd is busy
Comment 4 Shekhar 2008-09-18 08:50:24 UTC
Created attachment 3601 [details]
Pictorial output of ethereal when winbindd is free
Comment 5 Gerald (Jerry) Carter 2008-09-18 09:25:22 UTC
(In reply to comment #0)

>      So this means that winbindd makes samba block till it gets response from
> ADS so samba is not able to respond during keep alive period for client and it
> times out. When I discussed on irc , I got message that similar kind of issues
> are fixed on 3.0.32. So I upgraded for testing purpose and it indeed was fixed. 
> Now my concern is the box which I am working on is on field and I can't upgrade
> package just like that. There is a lot of regression and validation need to be
> performed on new samba. So I would request if someone can provide the patches
> that would fix this issues, it would be great.

Not sure exactly what bug you found to be fixed.  The only relevant commit I can
remember is the following that went into Samba 3.2.0:

http://gitweb.samba.org/?p=samba.git;a=commitdiff;h=d05451c2c256e04870ebe6467f38585dad72f3a9
Comment 6 Shekhar 2008-09-18 10:21:09 UTC
I don't know either. But with 3.0.32, I am able to connect to the Samba server from Windows XP. Does this mean that winbindd requests are marked as non-blocking for the Samba server? I consistently see failures with 3.0.28 and success with 3.0.32.
So you mean to say that this patch is not present in 3.0.32?

Thanks
Shekhar
Comment 7 Gerald (Jerry) Carter 2008-09-18 10:31:12 UTC
Yeah.  That winbind user and group enumeration optimization is
not present in 3.0.  I'll have to look at your traces to debug further.
Comment 8 Shekhar 2008-09-18 19:38:22 UTC
Jerry,

Another interesting thing to note is that we are spawning Samba through a TCP server, aka xinetd. I just ran an experiment: when we run Samba standalone, we don't see this issue at all, but when we spawn it through the TCP server, we do. So it seems to be related to the socket being duped and handed to the child process. We went through smbd/process.c and figured out that the socket is duped and inherited by the spawned process, but since Samba is spawned through the TCP server, does that have any impact on the same code path?
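For reference, the inetd-style handoff mentioned above can be illustrated outside Samba. This is only a sketch of the general convention (the super-server accepts the connection and the service sees it as file descriptors 0 and 1); it is not smbd's code, and the function name inetd_style_handoff is hypothetical. It uses a socketpair in place of a real client connection and needs a Unix-like system for os.fork().

```python
import os
import socket


def inetd_style_handoff(payload: bytes) -> bytes:
    """Fork a child that only sees the 'client connection' as fd 0 and fd 1."""
    parent_end, child_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: mimic what a super-server does before exec'ing the service.
        try:
            parent_end.close()
            os.dup2(child_end.fileno(), 0)  # connection becomes stdin
            os.dup2(child_end.fileno(), 1)  # connection becomes stdout
            child_end.close()
            data = os.read(0, 4096)         # the "service" reads the request
            os.write(1, data.upper())       # and answers on the same socket
        finally:
            os._exit(0)
    # Parent: plays the role of the client on the other end of the socket.
    child_end.close()
    parent_end.sendall(payload)
    reply = parent_end.recv(4096)
    parent_end.close()
    os.waitpid(pid, 0)
    return reply
```

The point of the sketch is that the child never calls accept() itself, so any difference between standalone and xinetd mode would have to come from how the service sets up and uses the inherited descriptor, not from the handoff as such.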
Comment 9 Shekhar 2008-09-18 22:45:28 UTC
Also, it works fine in WORKGROUP mode; the problem occurs only when joined to ADS, which is again strange. I would like to understand the effective behaviour of winbindd when joined to an ADS domain with 40K+ users. Why does winbindd block Samba? Can't we change Samba to address this issue? We can't change anything on the client PCs, as they are standard. Please let me know how winbindd, Samba and Windows XP interact when joined to an ADS domain with a huge number of user objects. My client is pressing me hard on this issue.
Jerry, could you help me out here?
Comment 10 Shekhar 2008-09-19 00:06:18 UTC
Jerry,

Has anybody backported the patch that Jeremy submitted to maintenance releases aka 3.0.32 etc?

Thanks
Sagar
Comment 11 Shekhar 2008-09-19 07:01:17 UTC
Jerry,

I observed the system further. I found that until the idmap_cache.tdb file has finished updating, I cannot access the Samba system from the Windows client at all. Unfortunately, the time required to populate the idmap file completely varies: sometimes it takes 10 minutes and sometimes 13 minutes. How can we reduce this turnaround time?
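To measure the population time consistently rather than by watching the file by hand, something like the sketch below could be used. It is an assumption-laden helper, not a Samba tool: the name time_until_stable is made up, the location of idmap_cache.tdb depends on how Samba was built, and a file size that stops growing is only a rough proxy for "fully populated".

```python
import os
import time


def time_until_stable(path, interval=5.0, stable_polls=3, timeout=3600.0):
    """Poll the size of `path` until it stops changing.

    Returns (elapsed_seconds, final_size) once the size has been identical
    for `stable_polls` consecutive polls, or None if `timeout` expires first.
    """
    start = time.monotonic()
    last_size = None
    unchanged = 0
    while time.monotonic() - start < timeout:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = None  # file not created yet (or removed)
        if size is not None and size == last_size:
            unchanged += 1
            if unchanged >= stable_polls:
                return time.monotonic() - start, size
        else:
            unchanged = 0
        last_size = size
        time.sleep(interval)
    return None
```

Running this right after a reboot or a rejoin, with the path pointed at the idmap_cache.tdb file on the system, would give comparable timings across the 10-minute and 13-minute cases.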

Comment 12 Gerald (Jerry) Carter 2008-09-19 07:20:48 UTC
(In reply to comment #10)
> Jerry,
> 
> Has anybody backported the patch that Jeremy submitted to maintenance releases
> aka 3.0.32 etc?

The patch is too intrusive IMO to be backported.  It simply allows enumeration (which
is always a bad idea anyways) to be done across multiple domains in parallel, thus reducing
the time but not removing the nature of the problem, where the parent winbindd process
is too busy to respond to other requests.
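Since enumeration is the concern here, it may be worth checking how smb.conf sets "winbind enum users" and "winbind enum groups", the parameters that control enumeration through the getpwent()/getgrent() family of calls. Whether changing them helps in this particular case is not established in this bug. The sketch below is a deliberately simplified reader for the [global] section: the function name global_params is hypothetical, it does not handle backslash line continuations or include directives, and the default for an unset parameter depends on the Samba version, so check the smb.conf man page for the release in use.

```python
def global_params(conf_text, names):
    """Return {parameter: value} for the given names found in [global].

    Parameter names are compared case-insensitively with runs of
    whitespace collapsed. Parameters that are not set are simply absent.
    """
    wanted = {" ".join(n.lower().split()) for n in names}
    found = {}
    section = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section != "global" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = " ".join(key.lower().split())
        if key in wanted:
            found[key] = value.strip()
    return found


sample = """
[global]
    security = ads
    winbind enum users = yes
    ; winbind enum groups = no
    Winbind  Enum Groups = Yes
[share]
    winbind enum users = no
"""
print(global_params(sample, ["winbind enum users", "winbind enum groups"]))
```

On a real system the text would come from reading the smb.conf file itself; the sample string above is invented purely to exercise the parser.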
Comment 13 Shekhar 2008-09-21 23:09:42 UTC
When I did further experiments, I observed quite a few interesting points and some anomalies as listed below:

1. Until idmap_cache.tdb is populated, we cannot connect to the system or a share; after it is completely populated, we can. Interestingly, if I then delete idmap_cache.tdb, I can still connect to the system even though the idmap file is no longer present. I confirmed this quite a few times. Then I rebooted the system and observed the idmap file being regenerated. This time it grew to only 7.3 MB, not 14.5 MB. I could not access the system until the file reached 7 MB, but after that I could. I then rebooted the system again, and once more I could not access it until the file reached 7 MB.

In both cases winbindd was eating 94% of CPU.

2. Which files does winbindd update at run time in large ADS mode?

3. Which files can cause Samba's handling of requests to be delayed? I know that gencache.tdb and registry.tdb are synced every time, but the sync time depends on the network, so servicing client connections inherently depends on it too.

4. When I unjoin (go back to workgroup mode) and join again, idmap_cache.tdb grows to 14.5 MB, and again I cannot access either the system or a share until it is completely populated. I don't understand this behaviour of the idmap_cache.tdb file.

Any idea on this?
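For question 2, one empirical approach is to snapshot the .tdb files before and after a period of winbindd activity and compare. This is just a sketch with made-up helper names (snapshot, changed_files); the directory holding the .tdb files varies by build, and because these files may be memory-mapped, size and modification time can lag behind the actual writes, so an unchanged entry is not proof that the file was untouched.

```python
import glob
import os


def snapshot(directory, pattern="*.tdb"):
    """Map each matching file name to its (size, mtime in ns)."""
    snap = {}
    for path in glob.glob(os.path.join(directory, pattern)):
        try:
            st = os.stat(path)
        except OSError:
            continue  # file vanished between glob and stat
        snap[os.path.basename(path)] = (st.st_size, st.st_mtime_ns)
    return snap


def changed_files(before, after):
    """Names that are new in `after` or whose size/mtime differ."""
    return sorted(
        name for name, state in after.items() if before.get(name) != state
    )
```

Taking one snapshot right after startup and another once the clients can connect again would list the files that changed in between, including newly created ones.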
Comment 14 Shekhar 2008-09-21 23:50:55 UTC
Just a thought: if winbindd's caching were made incremental, I think authentication against the system could be allowed while the cache is still being populated. Has anybody faced a similar issue before? Or is there a configuration option that makes the caching incremental so that requests are not blocked?