Bug 8943 - Slow but responsive DC can lock up winbindd for > 10 minutes at a time (3.5 to master)
Slow but responsive DC can lock up winbindd for > 10 minutes at a time (3.5 t...
Status: ASSIGNED
Product: Samba 3.6
Classification: Unclassified
Component: Winbind
unspecified
All All
: P5 normal
: ---
Assigned To: Andrew Bartlett
Samba QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-05-15 04:38 UTC by Andrew Bartlett
Modified: 2012-05-31 21:23 UTC (History)
2 users (show)

See Also:


Attachments
Patch to slow down the S4 ldap server to simulate a slow VPN link. (884 bytes, patch)
2012-05-15 04:38 UTC, Andrew Bartlett
no flags Details
Only use ldap lib timeout, not alarm() as well (884 bytes, patch)
2012-05-15 07:14 UTC, Andrew Bartlett
no flags Details
Only use ldap lib timeout, not alarm() as well. Assumes a ldap lib with proper timeout handling (1.59 KB, patch)
2012-05-15 12:06 UTC, Andrew Bartlett
no flags Details
Allow parallel LDAP operation (912 bytes, patch)
2012-05-15 23:37 UTC, Andrew Bartlett
no flags Details
Change to not-quite-so-slow ldap server (895 bytes, patch)
2012-05-18 12:06 UTC, Andrew Bartlett
no flags Details
Convert LDAP_TIMELIMIT_EXCEEEDED to NT_STATUS_IO_TIMEOUT (906 bytes, patch)
2012-05-18 12:08 UTC, Andrew Bartlett
no flags Details
Use the new error code mapping to ensure we always pass IO_TIMEOUT to the callers (2.57 KB, patch)
2012-05-18 12:09 UTC, Andrew Bartlett
no flags Details
Back off on the ldap page size in an attempt to get a page in under 15 seconds (2.86 KB, patch)
2012-05-18 12:10 UTC, Andrew Bartlett
no flags Details
This testing patch helps with this work, by not requring that winbind be in use (956 bytes, patch)
2012-05-18 12:12 UTC, Andrew Bartlett
no flags Details
Patches for master, excluding the alarm patch (7.19 KB, patch)
2012-05-21 06:34 UTC, Andrew Bartlett
jra: review+
Details
Patches for 3.5, tested on a real-world NAS, against synthetic slow DC (7.26 KB, patch)
2012-05-21 06:35 UTC, Andrew Bartlett
jra: review+
Details
Patches for 3.6, tested on a real-world NAS, against synthetic slow DC (7.19 KB, patch)
2012-05-21 06:37 UTC, Andrew Bartlett
jra: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andrew Bartlett 2012-05-15 04:38:25 UTC
Created attachment 7561 [details]
Patch to slow down the S4 ldap server to simulate a slow VPN link.

Using the attached patch for Samba 4.0 to emulate a responsive DC on the end of a lossy 1s RTT VPN link, with 150 users created by:
 smbtorture ncacn_np:localhost rpc.samr.large-dc --target=samba3

Then on a winbindd joined to this DC, run:
 wbinfo -u &
 wbinfo --user-info=domain\\administrator

Because the default ldap timeout is 15, and the ldap server only sends a single ldap reply each second, we quickly start blocking the winbind child.

Then, the alarm in:

ldap_search_with_timeout() starts to fire at 15 seconds, causing an error of:
if (gotalarm != 0)
	return LDAP_TIMELIMIT_EXCEEDED;

causing the search to restart, after (presumably) it has eventually completed.

In any case, the caller code is expected to only try 3 times, but instead the winbindd process for that domain is unavailable for > 10 mins, not just 3 * 15 = 45 or > 3 * 150 = 450 seconds.


[abartlet@obed samba-3]$ bin/wbinfo -u
^Z
[14]+  Stopped                 bin/wbinfo -u
[abartlet@obed samba-3]$ bg
[14]+ bin/wbinfo -u &
[abartlet@obed samba-3]$ date
Tue May 15 14:24:38 EST 2012
[abartlet@obed samba-3]$ time bin/wbinfo -i s4obed\\administrator && date
S3OBED3\root
failed to call wbcGetpwnam: WBC_ERR_WINBIND_NOT_AVAILABLE
Could not get info for user s4obed\administrator
[14]+  Done                    bin/wbinfo -u

real	5m50.327s
user	0m0.012s
sys	0m0.006s
[abartlet@obed samba-3]$ time bin/wbinfo -i s4obed\\administrator && date
failed to call wbcGetpwnam: WBC_ERR_WINBIND_NOT_AVAILABLE
Could not get info for user s4obed\administrator

real	5m50.301s
user	0m0.004s
sys	0m0.007s
Comment 1 Andrew Bartlett 2012-05-15 04:55:48 UTC
eventually the system started to be responsive again, but with incorrect results at: Tue May 15 14:52:43 EST 2012 (> 20 mins later)

failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND
Could not get info for user s4obed\administrator
Comment 2 Volker Lendecke 2012-05-15 06:23:24 UTC
Please try to increase the parameter

winbind max domain connections

and re-test.
Comment 3 Andrew Bartlett 2012-05-15 07:14:31 UTC
Created attachment 7562 [details]
Only use ldap lib timeout, not alarm() as well

Starting case:

[abartlet@obed samba-3]$ bin/wbinfo -u &
[15] 5123
[abartlet@obed samba-3]$ date && time bin/wbinfo -i s4obed\\administrator; date
Tue May 15 17:09:07 EST 2012
S3OBED3\root
failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND
Could not get info for user s4obed\administrator
[15]+  Done                    bin/wbinfo -u

real	4m5.223s
user	0m0.009s
sys	0m0.008s
Tue May 15 17:13:12 EST 2012



Changing it to:

	winbind max domain connections = 3

does seem to help, with time-outs dropping to around 1 or 2 min.


[abartlet@obed samba-3]$ date && time bin/wbinfo -i s4obed\\administrator; date
Tue May 15 16:51:29 EST 2012
failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND
Could not get info for user s4obed\administrator

real	1m10.071s
user	0m0.001s
sys	0m0.004s
Tue May 15 16:52:39 EST 2012

[abartlet@obed samba-3]$ bin/wbinfo -u
^Z
[15]+  Stopped                 bin/wbinfo -u
[abartlet@obed samba-3]$ bg
[15]+ bin/wbinfo -u &
[abartlet@obed samba-3]$ date && time bin/wbinfo -i s4obed\\administrator; date
Tue May 15 16:56:29 EST 2012
failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND
Could not get info for user s4obed\administrator

real	2m20.140s
user	0m0.004s
sys	0m0.003s
Tue May 15 16:58:49 EST 2012
[abartlet@obed samba-3]$ S3OBED3\root



The attached patch to only use the internal ldap libs timeout (and not alarm()) has reduced the timeout back to 1 min (without the smb.conf change)

[abartlet@obed samba-3]$ bin/wbinfo -u
^Z
[15]+  Stopped                 bin/wbinfo -u
[abartlet@obed samba-3]$ bg
[15]+ bin/wbinfo -u &
[abartlet@obed samba-3]$ date && time bin/wbinfo -i s4obed\\administrator; date
Tue May 15 17:00:56 EST 2012
S3OBED3\root
failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND
Could not get info for user s4obed\administrator
[15]+  Done                    bin/wbinfo -u

real	1m0.667s
user	0m0.006s
sys	0m0.007s
Tue May 15 17:01:57 EST 2012
Comment 4 Volker Lendecke 2012-05-15 10:04:15 UTC
Comment on attachment 7562 [details]
Only use ldap lib timeout, not alarm() as well

To me this looks very similar to the patch to slow down Samba, not a patch to modify winbind's use of the ldap libraries.
Comment 5 Andrew Bartlett 2012-05-15 12:06:10 UTC
Created attachment 7563 [details]
Only use ldap lib timeout, not alarm() as well.

Assumes a ldap lib with proper timeout handling
Comment 6 Volker Lendecke 2012-05-15 12:13:55 UTC
Comment on attachment 7563 [details]
Only use ldap lib timeout, not alarm() as well.

Assumes a ldap lib with proper timeout handling

There's a problem with this patch: Last time I looked at the openldap libs, they were synchronous in the send piece of ldapsearch, and in that part they ignored the timeout value. Has this changed in the last, say, 12 months?
Comment 7 Andrew Bartlett 2012-05-15 12:20:47 UTC
On the flip side, how does keeping this code in help?

The only way it can help is if we rely on poor read() handling in libldap, without the EINTR handling like Samba uses:

	do {
		ret = recvfrom(s, buf, len, flags, from, fromlen);
#if defined(EWOULDBLOCK)
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK));
#else
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN));
#endif


That is, how does setting alarm(), and having that set a static integer actually terminate the library call any earlier?

I'm sure this isn't the total solution - we need to handle the events in an async way, and do other interactions during the ldap query - but in my limited testing, it certainly seems to help.  

I only provide it as a clue to a better solution, and at the least a much reduced lockup.
Comment 8 Volker Lendecke 2012-05-15 12:46:03 UTC
The problem is -- there is no such thing as an asynchronous "standard" ldap library. That's the whole reason why I started tldap. I could not reliably make it asynchronous.
Comment 9 Andrew Bartlett 2012-05-15 22:51:14 UTC
BTW, the current OpenLDAP CVS tree does indeed seem to be based on both EINTR handling and non-blocking reads driven by poll() with a timeout.
Comment 10 Andrew Bartlett 2012-05-15 23:37:15 UTC
Created attachment 7565 [details]
Allow parallel LDAP operation

This shows (I think) that volker's suggestion of 'winbind max domain connections = 3' is of assistance, when the LDAP server is given more than one thread.

(Traditionally Samba4's ldap server has been single-threaded, but the location of the hacked-in sleep() would mean that other connections are denied service).

We still have a serious issue that we loop for too long trying to get the data, multiple times.
Comment 11 Volker Lendecke 2012-05-16 08:46:04 UTC
(In reply to comment #9)
> BTW, the current OpenLDAP CVS tree does indeed seem to be based on both EINTR
> handling and non-blocking reads driven by poll() with a timeout.

What about the connect() case? This was one call that did not properly time out in the past.
Comment 12 Andrew Bartlett 2012-05-17 01:48:14 UTC
The connect case appears to be connected to this LDAP option:

	async = LDAP_BOOL_GET( &ld->ld_options, LDAP_BOOL_CONNECT_ASYNC );

In any case, the alarm() on the search is separate, and with modern ldap libs clearly pointless.  It may not however have caused harm - in retesting with multiple LDAP server threads in the DC, the benefit seems to be reduced.

For connect:  while probably unwise outside our test environment where we have no other choice, the hack to use ldap_open_fd() is and then creating the socket ourselves is quite tempting. 

We do however still have a problem - once the search times out (by whichever means), we go back and re-search almost endlessly, and this is where I think the major issue is.  

The timeouts we see are not of the order of 15sec.  Even with a parallel client thread, we should either talk to the DC, and get the user list once, or we should die and not loop, chewing bandwith on what is 'clearly' a bandwith poor link.
Comment 13 Andrew Bartlett 2012-05-18 12:06:04 UTC
Created attachment 7566 [details]
Change to not-quite-so-slow ldap server

This helps create a more realistic situation that can actually be resolved.
Comment 14 Andrew Bartlett 2012-05-18 12:08:12 UTC
Created attachment 7567 [details]
Convert LDAP_TIMELIMIT_EXCEEEDED to NT_STATUS_IO_TIMEOUT

This means we do not loop 3x5x15 seconds because of:
 - 3 loops in the libads layer (unconditional)
 - 5 loops in the winbind_cache layer (conditational on NT_STATUS_IO_TIMEOUT)

This also means we will use IO_TIMEOUT to knock a DC offline.
Comment 15 Andrew Bartlett 2012-05-18 12:09:12 UTC
Created attachment 7568 [details]
Use the new error code mapping to ensure we always pass IO_TIMEOUT to the callers

With this patch, the winbindd_cache layer can actually see the IO_TIMEOUT error
Comment 16 Andrew Bartlett 2012-05-18 12:10:24 UTC
Created attachment 7569 [details]
Back off on the ldap page size in an attempt to get a page in under 15 seconds

By backing down in the page size, we can adjust our requests to fit within our 15 second request/response time budget.
Comment 17 Andrew Bartlett 2012-05-18 12:12:32 UTC
Created attachment 7570 [details]
This testing patch helps with this work, by not requring that winbind be in use

This may be useful in any case, with the page size backdown. 

With these patches, I now see without errors the first 250 users (two pages).  Either Samba4 or libads isn't doing the right thing for the remaining pages however.

That said, I think we are on the right path for a proper fix to this issue. 

Further insights from others most welcome.
Comment 18 Andrew Bartlett 2012-05-21 06:34:26 UTC
Created attachment 7572 [details]
Patches for master, excluding the alarm patch

These patches for master have been tested in my synthetic environment with wbinfo -u and net ads search objectclass=user.
Comment 19 Andrew Bartlett 2012-05-21 06:35:48 UTC
Created attachment 7573 [details]
Patches for 3.5, tested on a real-world NAS, against synthetic slow DC

I had to speed the DC up to only a 25ms delay, but with that this patch seems to behave correctly on a real NAS (rather than workstation loopback).
Comment 20 Andrew Bartlett 2012-05-21 06:37:36 UTC
Created attachment 7574 [details]
Patches for 3.6, tested on a real-world NAS, against synthetic slow DC

The same patches, but patched up for 3.6
Comment 21 Andrew Bartlett 2012-05-22 03:06:29 UTC
Comment on attachment 7563 [details]
Only use ldap lib timeout, not alarm() as well.

Assumes a ldap lib with proper timeout handling

BTW, I agree that not all ldap libs do correct time out handling, and so the alarm() patch could not be used unconditionally.
Comment 22 Andrew Bartlett 2012-05-22 03:07:45 UTC
Some thoughts for reviewers:

The primary issue with this approach is in a ldap
timeout situation, we would stall out for 1min while re-training down to
the smaller page size, and we will stay at the smaller page size until
winbindd restarts.  The smaller pages are less efficient (125 at a time
vs 1000 at a time), but not impossibly so. 

Most timeout situations (DC failover etc) would only reduce the page
size by half once, but continuing failover could eventually reduce it to
125. 

In a full timeout (unable to get even a small page out in 'ldap
timeout') it should mark the domain offline (as seen with wbinfo
--domain-status), but it doesn't seem to.  It does however return with
error quickly from wbinfo.

Compared with the current alternative outcomes without these patches, I
think they are a distinct improvement.
Comment 23 Jeremy Allison 2012-05-25 23:47:53 UTC
Comment on attachment 7574 [details]
Patches for 3.6, tested on a real-world NAS, against synthetic slow DC

Really nice work !
Comment 24 Jeremy Allison 2012-05-25 23:55:08 UTC
Comment on attachment 7572 [details]
Patches for master, excluding the alarm patch

Pushed to autobuild.
Comment 25 Jeremy Allison 2012-05-25 23:56:32 UTC
abartlett wrote:

"In a full timeout (unable to get even a small page out in 'ldap
timeout') it should mark the domain offline (as seen with wbinfo
--domain-status), but it doesn't seem to.  It does however return with
error quickly from wbinfo."

We should probably investigate that - but maybe not stop this bug for that.

Jeremy.
Comment 26 Jeremy Allison 2012-05-26 00:04:42 UTC
Comment on attachment 7573 [details]
Patches for 3.5, tested on a real-world NAS, against synthetic slow DC

LGTM.
Comment 27 Jeremy Allison 2012-05-26 00:05:52 UTC
Re-assigning to Karolin for inclusion in 3.5.next and 3.6.next.
Comment 28 Karolin Seeger 2012-05-28 19:31:54 UTC
(In reply to comment #25)
> abartlett wrote:
> 
> "In a full timeout (unable to get even a small page out in 'ldap
> timeout') it should mark the domain offline (as seen with wbinfo
> --domain-status), but it doesn't seem to.  It does however return with
> error quickly from wbinfo."
> 
> We should probably investigate that - but maybe not stop this bug for that.
> 
> Jeremy.

Pushed patches to v3-5-test and v3-6-test.
Re-assigning bug report to Andrew as a reminder for the issue mentioned above.
Feel free to close the bug report.

Thanks!