If no DCs in the local site are working, but other DCs are available, winbind gives up because the name cache is used inappropriately. The root cause is that a site-specific namecache entry is returned as a response to a query for all sites.
When winbind first tries to connect, it calls get_sorted_dc_list(), which ends up calling internal_resolve_name() twice: once for the local site, and again for all sites if the first lookup fails. internal_resolve_name() takes a site parameter, but it stores and retrieves results in the namecache without the site being part of the key. So the second call returns the first result again, even though all the site-local DCs are already known to have failed at this point.
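The sequence above can be reproduced with a toy model. This is only a sketch, not Samba code: toy_cache, make_key() and toy_resolve() are hypothetical names, the cache holds a single entry, and the "DNS lookup" is faked by returning "dc-local" for a site-specific query and "dc-remote" for a site-less one. The name type 0x1c is just an example value.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical one-entry cache keyed only by <name, type>.
 * This is NOT Samba's real namecache, just a model of the behaviour. */
struct toy_cache {
	char key[128];
	char value[64];
	int valid;
};

/* Build the cache key. Note that the site is not part of it. */
static void make_key(char *buf, size_t len, const char *name, int type)
{
	snprintf(buf, len, "%s#%02X", name, (unsigned)type);
}

/* Stand-in for internal_resolve_name(): `site` influences the (faked)
 * lookup, but the cache is consulted and filled without it. */
static const char *toy_resolve(struct toy_cache *c, const char *name,
			       int type, const char *site)
{
	char key[128];

	make_key(key, sizeof(key), name, type);
	if (c->valid && strcmp(c->key, key) == 0) {
		/* Cache hit, no matter which site was asked for. */
		return c->value;
	}

	/* Cache miss: fake a lookup and remember the answer. */
	snprintf(c->key, sizeof(c->key), "%s", key);
	snprintf(c->value, sizeof(c->value), "%s",
		 site != NULL ? "dc-local" : "dc-remote");
	c->valid = 1;
	return c->value;
}
```

Calling toy_resolve() first with a site such as "SiteA" and then with a NULL site on the same cache returns "dc-local" both times, which is the stale answer winbind ends up with.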
In ads_find_dc() the same problem is worked around by clearing the cache before widening the search scope. Adding the site to the namecache key instead seems like a cleaner fix to me. Would this change cause a problem anywhere?
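As a rough illustration of what a site-aware key could look like, here is a minimal sketch. The function name site_aware_key() and the "site/name#type" layout are assumptions made for this example and are not the format the real namecache uses. A NULL site (the all-sites query) is encoded as an empty prefix, which cannot collide with a real site name because site names are never empty.

```c
#include <stdio.h>

/* Hypothetical key builder: <site, name, type> instead of <name, type>.
 * `site` may be NULL for a query across all sites.
 * Returns the snprintf() result so callers can detect truncation. */
static int site_aware_key(char *buf, size_t len, const char *site,
			  const char *name, int type)
{
	return snprintf(buf, len, "%s/%s#%02X",
			site != NULL ? site : "", name, (unsigned)type);
}
```

With this layout, a lookup for "SiteA" and a site-less lookup of the same name and type produce different keys, so the second query can no longer be answered from the first one's cache entry and no explicit cache clearing is needed between them.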
This was discussed by Uri Simchoni on samba-technical in the thread "[PATCH] libads: fixes to generation of custom krb5.conf":
> The internal_resolve_name() call may return a cached result. However,
> the cache key is <domain,name type> tuple, not <site,domain,name
> type>, so calling internal_resolve_name with/without a site may yield
> the same result. "Luckily", in the case of Kerberos, name resolving
> results are NOT cached, so calling internal_resolve_name with a
> kerberos name type always yields a correct result.
> Incidentally - this caching scheme may seem like a bug. It certainly
> allows bugs to creep in (as any caching scheme would - once you
> duplicate state you call for bugs), but notice that in the case of
> LDAP, a bug is avoided by clearing the cache when moving from a
> site-specific search to site-less search (well, maybe not entirely
> avoided - there could be race conditions).