Created attachment 7561 [details] Patch to slow down the S4 ldap server to simulate a slow VPN link. Using the attached patch for Samba 4.0 to emulate a responsive DC on the end of a lossy 1s RTT VPN link, with 150 users created by: smbtorture ncacn_np:localhost rpc.samr.large-dc --target=samba3 Then on a winbindd joined to this DC, run: wbinfo -u & wbinfo --user-info=domain\\administrator Because the default ldap timeout is 15, and the ldap server only sends a single ldap reply each second, we quickly start blocking the winbind child. Then, the alarm in: ldap_search_with_timeout() starts to fire at 15 seconds, causing an error of: if (gotalarm != 0) return LDAP_TIMELIMIT_EXCEEDED; causing the search to restart, after (presumably) it has eventually completed. In any case, the caller code is expected to only try 3 times, but instead the winbindd process for that domain is unavailable for > 10 mins, not just 3 * 15 = 45 or > 3 * 150 = 450 seconds. [abartlet@obed samba-3]$ bin/wbinfo -u ^Z [14]+ Stopped bin/wbinfo -u [abartlet@obed samba-3]$ bg [14]+ bin/wbinfo -u & [abartlet@obed samba-3]$ date Tue May 15 14:24:38 EST 2012 [abartlet@obed samba-3]$ time bin/wbinfo -i s4obed\\administrator && date S3OBED3\root failed to call wbcGetpwnam: WBC_ERR_WINBIND_NOT_AVAILABLE Could not get info for user s4obed\administrator [14]+ Done bin/wbinfo -u real 5m50.327s user 0m0.012s sys 0m0.006s [abartlet@obed samba-3]$ time bin/wbinfo -i s4obed\\administrator && date failed to call wbcGetpwnam: WBC_ERR_WINBIND_NOT_AVAILABLE Could not get info for user s4obed\administrator real 5m50.301s user 0m0.004s sys 0m0.007s
eventually the system started to be responsive again, but with incorrect results at: Tue May 15 14:52:43 EST 2012 (> 20 mins later) failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND Could not get info for user s4obed\administrator
Please try to increase the parameter winbind max domain connections and re-test.
Created attachment 7562 [details] Only use ldap lib timeout, not alarm() as well Starting case: [abartlet@obed samba-3]$ bin/wbinfo -u & [15] 5123 [abartlet@obed samba-3]$ date && time bin/wbinfo -i s4obed\\administrator; date Tue May 15 17:09:07 EST 2012 S3OBED3\root failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND Could not get info for user s4obed\administrator [15]+ Done bin/wbinfo -u real 4m5.223s user 0m0.009s sys 0m0.008s Tue May 15 17:13:12 EST 2012 Changing it to: winbind max domain connections = 3 does seem to help, with time-outs dropping to around 1 or 2 min. [abartlet@obed samba-3]$ date && time bin/wbinfo -i s4obed\\administrator; date Tue May 15 16:51:29 EST 2012 failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND Could not get info for user s4obed\administrator real 1m10.071s user 0m0.001s sys 0m0.004s Tue May 15 16:52:39 EST 2012 [abartlet@obed samba-3]$ bin/wbinfo -u ^Z [15]+ Stopped bin/wbinfo -u [abartlet@obed samba-3]$ bg [15]+ bin/wbinfo -u & [abartlet@obed samba-3]$ date && time bin/wbinfo -i s4obed\\administrator; date Tue May 15 16:56:29 EST 2012 failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND Could not get info for user s4obed\administrator real 2m20.140s user 0m0.004s sys 0m0.003s Tue May 15 16:58:49 EST 2012 [abartlet@obed samba-3]$ S3OBED3\root The attached patch to only use the internal ldap libs timeout (and not alarm()) has reduced the timeout back to 1 min (without the smb.conf change) [abartlet@obed samba-3]$ bin/wbinfo -u ^Z [15]+ Stopped bin/wbinfo -u [abartlet@obed samba-3]$ bg [15]+ bin/wbinfo -u & [abartlet@obed samba-3]$ date && time bin/wbinfo -i s4obed\\administrator; date Tue May 15 17:00:56 EST 2012 S3OBED3\root failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND Could not get info for user s4obed\administrator [15]+ Done bin/wbinfo -u real 1m0.667s user 0m0.006s sys 0m0.007s Tue May 15 17:01:57 EST 2012
Comment on attachment 7562 [details] Only use ldap lib timeout, not alarm() as well To me this looks very similar to the patch to slow down Samba, not a patch to modify winbind's use of the ldap libraries.
Created attachment 7563 [details] Only use ldap lib timeout, not alarm() as well. Assumes a ldap lib with proper timeout handling
Comment on attachment 7563 [details] Only use ldap lib timeout, not alarm() as well. Assumes a ldap lib with proper timeout handling There's a problem with this patch: Last time I looked at the openldap libs, they were synchronous in the send piece of ldapsearch, and in that part they ignored the timeout value. Has this changed in the last, say, 12 months?
On the flip side, how does keeping this code in help? The only way it can help is if we rely on poor read() handling in libldap, without the EINTR handling like Samba uses: do { ret = recvfrom(s, buf, len, flags, from, fromlen); #if defined(EWOULDBLOCK) } while (ret == -1 && (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)); #else } while (ret == -1 && (errno == EINTR || errno == EAGAIN)); #endif That is, how does setting alarm(), and having that set a static integer actually terminate the library call any earlier? I'm sure this isn't the total solution - we need to handle the events in an async way, and do other interactions during the ldap query - but in my limited testing, it certainly seems to help. I only provide it as a clue to a better solution, and at the least a much reduced lockup.
The problem is -- there is no such thing as an asynchronous "standard" ldap library. That's the whole reason why I started tldap. I could not reliably make it asynchronous.
BTW, the current OpenLDAP CVS tree does indeed seem to be based on both EINTR handling and non-blocking reads driven by poll() with a timeout.
Created attachment 7565 [details] Allow parallel LDAP operation This shows (I think) that volker's suggestion of 'winbind max domain connections = 3' is of assistance, when the LDAP server is given more than one thread. (Traditionally Samba4's ldap server has been single-threaded, but the location of the hacked-in sleep() would mean that other connections are denied service). We still have a serious issue that we loop for too long trying to get the data, multiple times.
(In reply to comment #9) > BTW, the current OpenLDAP CVS tree does indeed seem to be based on both EINTR > handling and non-blocking reads driven by poll() with a timeout. What about the connect() case? This was one call that did not properly time out in the past.
The connect case appears to be connected to this LDAP option: async = LDAP_BOOL_GET( &ld->ld_options, LDAP_BOOL_CONNECT_ASYNC ); In any case, the alarm() on the search is separate, and with modern ldap libs clearly pointless. It may not however have caused harm - in retesting with multiple LDAP server threads in the DC, the benefit seems to be reduced. For connect: while probably unwise outside our test environment where we have no other choice, the hack to use ldap_open_fd() is and then creating the socket ourselves is quite tempting. We do however still have a problem - once the search times out (by whichever means), we go back and re-search almost endlessly, and this is where I think the major issue is. The timeouts we see are not of the order of 15sec. Even with a parallel client thread, we should either talk to the DC, and get the user list once, or we should die and not loop, chewing bandwith on what is 'clearly' a bandwith poor link.
Created attachment 7566 [details] Change to not-quite-so-slow ldap server This helps create a more realistic situation that can actually be resolved.
Created attachment 7567 [details] Convert LDAP_TIMELIMIT_EXCEEEDED to NT_STATUS_IO_TIMEOUT This means we do not loop 3x5x15 seconds because of: - 3 loops in the libads layer (unconditional) - 5 loops in the winbind_cache layer (conditational on NT_STATUS_IO_TIMEOUT) This also means we will use IO_TIMEOUT to knock a DC offline.
Created attachment 7568 [details] Use the new error code mapping to ensure we always pass IO_TIMEOUT to the callers With this patch, the winbindd_cache layer can actually see the IO_TIMEOUT error
Created attachment 7569 [details] Back off on the ldap page size in an attempt to get a page in under 15 seconds By backing down in the page size, we can adjust our requests to fit within our 15 second request/response time budget.
Created attachment 7570 [details] This testing patch helps with this work, by not requring that winbind be in use This may be useful in any case, with the page size backdown. With these patches, I now see without errors the first 250 users (two pages). Either Samba4 or libads isn't doing the right thing for the remaining pages however. That said, I think we are on the right path for a proper fix to this issue. Further insights from others most welcome.
Created attachment 7572 [details] Patches for master, excluding the alarm patch These patches for master have been tested in my synthetic environment with wbinfo -u and net ads search objectclass=user.
Created attachment 7573 [details] Patches for 3.5, tested on a real-world NAS, against synthetic slow DC I had to speed the DC up to only a 25ms delay, but with that this patch seems to behave correctly on a real NAS (rather than workstation loopback).
Created attachment 7574 [details] Patches for 3.6, tested on a real-world NAS, against synthetic slow DC The same patches, but patched up for 3.6
Comment on attachment 7563 [details] Only use ldap lib timeout, not alarm() as well. Assumes a ldap lib with proper timeout handling BTW, I agree that not all ldap libs do correct time out handling, and so the alarm() patch could not be used unconditionally.
Some thoughts for reviewers: The primary issue with this approach is in a ldap timeout situation, we would stall out for 1min while re-training down to the smaller page size, and we will stay at the smaller page size until winbindd restarts. The smaller pages are less efficient (125 at a time vs 1000 at a time), but not impossibly so. Most timeout situations (DC failover etc) would only reduce the page size by half once, but continuing failover could eventually reduce it to 125. In a full timeout (unable to get even a small page out in 'ldap timeout') it should mark the domain offline (as seen with wbinfo --domain-status), but it doesn't seem to. It does however return with error quickly from wbinfo. Compared with the current alternative outcomes without these patches, I think they are a distinct improvement.
Comment on attachment 7574 [details] Patches for 3.6, tested on a real-world NAS, against synthetic slow DC Really nice work !
Comment on attachment 7572 [details] Patches for master, excluding the alarm patch Pushed to autobuild.
abartlett wrote: "In a full timeout (unable to get even a small page out in 'ldap timeout') it should mark the domain offline (as seen with wbinfo --domain-status), but it doesn't seem to. It does however return with error quickly from wbinfo." We should probably investigate that - but maybe not stop this bug for that. Jeremy.
Comment on attachment 7573 [details] Patches for 3.5, tested on a real-world NAS, against synthetic slow DC LGTM.
Re-assigning to Karolin for inclusion in 3.5.next and 3.6.next.
(In reply to comment #25) > abartlett wrote: > > "In a full timeout (unable to get even a small page out in 'ldap > timeout') it should mark the domain offline (as seen with wbinfo > --domain-status), but it doesn't seem to. It does however return with > error quickly from wbinfo." > > We should probably investigate that - but maybe not stop this bug for that. > > Jeremy. Pushed patches to v3-5-test and v3-6-test. Re-assigning bug report to Andrew as a reminder for the issue mentioned above. Feel free to close the bug report. Thanks!
(In reply to Karolin Seeger from comment #28) There is likely much more to do here across multiple protocols, but one bug per bug so closing.