The documentation describes the "ldap timeout" parameter as a connection timeout, when in fact it limits the length of time for an entire LDAP operation. The issue shows itself when you have 2 AD controllers with an inter-domain trust which talk to each other over a slow VPN link. By default "ldap timeout" is set to 15s, and what happens is that after the initial connection, winbind downloads 15s worth of AD data from the trusted domain before timing out. It then sits in an endless loop, constantly trying to download the first section of the AD user list.
It seems the problem lies within libads/ldap.c:ldap_search_with_timeout(); since winbind uses the synchronous ldap_search_ext_s() call, it will only return once the entire dataset has been retrieved, and this can take much longer than 15s over a VPN link.
The solution should be to use the asynchronous ldap_search_ext() call to initiate the download, and use polling via ldap_result() to ensure that the timeout is only implemented when there is a break in the results being retrieved from the remote AD server.
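To make the proposal above concrete, here is a rough sketch of the polling logic in C. It is not Samba code and it does not link against libldap: `struct fake_stream`, `poll_next()` and `search_with_idle_timeout()` are hypothetical names invented for this illustration. `poll_next()` stands in for a single ldap_result() call made with LDAP_MSG_ONE and a per-call timeout, which returns 0 when nothing arrives in time. The point is only that the timer restarts after every message, so a long but steady transfer is not cut off.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of waiting for one message (hypothetical, for illustration). */
enum poll_status { POLL_ENTRY, POLL_DONE, POLL_TIMEOUT };

/* Simulated server: gaps[i] is the silence, in seconds, before message i.
 * The last message plays the role of the final search result. */
struct fake_stream {
    const int *gaps;
    int count;
    int next;
};

/* Stand-in for one ldap_result() call with LDAP_MSG_ONE and a timeout of
 * idle_limit seconds. Stores in *waited how long the call blocked. */
static enum poll_status poll_next(struct fake_stream *s, int idle_limit,
                                  int *waited)
{
    assert(s != NULL && s->next < s->count);
    int gap = s->gaps[s->next];
    if (gap > idle_limit) {
        *waited = idle_limit;      /* gave up after idle_limit seconds */
        return POLL_TIMEOUT;
    }
    *waited = gap;
    s->next++;
    return (s->next == s->count) ? POLL_DONE : POLL_ENTRY;
}

/* Collect entries until the final result arrives. The timeout applies to
 * each wait separately, not to the operation as a whole.
 * Returns the number of entries received, or -1 if the server went quiet
 * for longer than idle_limit. *elapsed receives the total time spent. */
static int search_with_idle_timeout(struct fake_stream *s, int idle_limit,
                                    int *elapsed)
{
    int entries = 0;
    *elapsed = 0;
    for (;;) {
        int waited = 0;
        enum poll_status st = poll_next(s, idle_limit, &waited);
        *elapsed += waited;
        if (st == POLL_TIMEOUT)
            return -1;             /* real code would abandon the request */
        if (st == POLL_DONE)
            return entries;
        entries++;                 /* real code would keep the entry here */
    }
}
```

With a 15s limit, a stream whose messages arrive 5s apart completes even if the whole transfer takes far longer than 15s, while a single 20s gap still aborts the search. Whether partial results should be kept or discarded on timeout is a separate design decision that this sketch does not address.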
In 3.2 there is now a separate LDAP connection timeout.
Can you attach the used smb.conf and level 10 log file where you see the mentioned timeouts?
Thanks for taking the time to revisit an old bug report - I'd just about given up on progressing this issue further :)
The system I was working on at the time is now actually live, and so of course this restricts my ability to recreate the broken scenario and give you the logs required. However, I did spend a couple of weeks recompiling directly from source adding extra debug where required and so I'm reasonably confident that the report above is accurate.
I think the key part to understand is to look at the smb.conf manual here: http://us1.samba.org/samba/docs/man/manpages-3/smb.conf.5.html#LDAPTIMEOUT. This explains that "ldap timeout" is used to determine how long Samba should wait before failing the connect, but this isn't quite true.
According to this post on the openldap-software list: http://www.openldap.org/lists/openldap-software/200801/msg00265.html the timeout value is enforced by the server *as well as the client*, which would explain the issue I was seeing.
The Samba server in question was based in the UK but was connected over VPN to a Windows server in Hong Kong with several hundred users. When I started winbindd, it would connect to the PDC in Hong Kong correctly, and I could see the LDAP transfer using tcpdump. But then it would just abort after 15s, and so winbindd would get stuck in a loop: downloading the first group of records, aborting part way through and then trying again. And it wasn't particularly clear without looking at the source code *why* the timeout was occurring, i.e. the time for the Hong Kong server to perform the search and send the data back to winbindd was exceeding "ldap timeout".
In the end, my solution for the 3.0.x installation was to raise "ldap timeout" to about 4 mins which was enough for the complete transfer to occur. Unfortunately this had the side effect that if one of the servers was down then winbindd had to wait 4 mins to notice that the connection was unavailable which was completely agonising when several of the remote servers were taken down for maintenance.
The UK server in question is still running the 3.0.x series, so I am unlikely to be able to install the newer version unless something breaks again. However, if "ldap connection timeout" works as advertised then this is a good starting point - though I would strongly recommend that the "ldap timeout" description in the smb.conf man page is revised to note that the timeout covers the time to complete an entire LDAP operation and return the results to the client, and not just the connection.
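For anyone hitting the same problem on 3.2 or later, the two parameters could be combined roughly as follows. This is an untested fragment: 240 is simply the 4 minute workaround value mentioned above, and the connection timeout value is an illustrative assumption rather than a recommendation.

```
[global]
    # Upper bound on a whole LDAP operation, including the time to
    # transfer the results (the 4 minute workaround value from above)
    ldap timeout = 240

    # 3.2 and later only: how long to wait for the initial connect.
    # The value here is an illustrative assumption.
    ldap connection timeout = 5
```

If the connection timeout behaves as documented, an unreachable server should be noticed after a few seconds instead of after the full 4 minutes.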
mark, that link does not reflect the 3.2 man page status. As mentioned, there is now a connection timeout parameter and the ldap timeout parameter description is corrected.
Samba usually does paged result searches which AD servers should also support. To see which exact ldap searches might cause your problem, we need a level 10 log.
> mark, that link does not reflect the 3.2 man page status. As mentioned, there
> is now a connection timeout parameter and the ldap timeout parameter
> description is corrected.
Really? It seems to match the 3.2.3 documentation man pages I have on my local workstation (Debian Lenny). Note that both ldap timeout parameters are present at the link above too which suggests the web page references the 3.2 series.
The main problem I had is that the description of the "ldap timeout" still reads like this:
"When Samba connects to an ldap server that server may be down or unreachable. To prevent Samba from hanging whilst waiting for the connection this parameter specifies in seconds how long Samba should wait before failing the connect. The default is to only wait fifteen seconds for the ldap server to respond to the connect request."
So while the "ldap connection timeout" documentation has been updated, the "ldap timeout" description has not been changed to reflect this. I would like to suggest the following wording based upon my experience:
"When Samba sends a request to an ldap server after the initial connection, that server may be down or unreachable. In order to prevent Samba hanging whilst waiting for the response, this parameter specifies in seconds the maximum time Samba will wait between issuing the request and receiving the completed response from the server. If this timeout occurs during the middle of an ldap response, all data will be discarded and the request aborted. The default is to only wait fifteen seconds for the ldap server to respond to the request."
I think this better explains exactly how the parameter affects the connection and probably would have saved me a lot of head scratching :) Assuming that I am using gitweb correctly, then the current codebase is still the same: see http://gitweb.samba.org/?p=samba.git;a=blob;f=source3/libads/ldap.c;h=cf8a7ebb1b3750f844df59faa6294d7c2ed38a47;hb=HEAD and the ldap_search_with_timeout() function. This is the location where my connections were timing out. You can see there exactly how the timeout is set in conjunction with the openldap link I sent above.
> Samba usually does paged result searches which AD servers should also support.
> To see which exact ldap searches might cause your problem, we need a level 10
Hmmm, let me try and remember. It was the initial connection to the remote DC in Hong Kong that caused this failure. What would happen is that first I would start Samba and then Winbind; Winbind would connect to the local DC and obtain a list of all the trusted DCs, including the one in Hong Kong. When it came to the turn of the Hong Kong server over the slow VPN link, Winbind would connect but then always drop after 15s. From the tcpdump I could see that Winbind was simply downloading the complete list of users + attributes from the Hong Kong LDAP (i.e. I could see changing strings of usernames/SIDs flying past the console) - alas I probably can't give you any more information about the actual query without finding time to break it again and see what happens :(
I had no idea that AD servers supported paging, but a quick search of the openldap client API shows that only the asynchronous API (which Samba does not use in ldap_search_with_timeout) supports the retrieval of partially-received data from a request.