When winbindd is connecting to netlogon to resolve the DC for a trusted domain, Kerberos settings are sometimes ignored, leading to high latency. My analysis is that cm_open_connection() should be calling create_local_private_krb5_conf_for_domain() but isn't.
In this deployment there are hundreds of DCs and about 30 trusted domains. Most of the DCs in DNS are firewalled from the client, and many others have >200ms latency. Some of the domains have no reachable DC.
The symptom is contention on the netlogon mutex, because the lock is held for several seconds per request. Periodically a child gives up with "cm_prepare_connection: mutex grab failed", which blacklists that DC in the connection cache.
Winbind presumably refreshes the status of trusted domains periodically, forking a child for each one in fork_child_dc_connect(). The child calls get_dcs() to see whether any DC is available. Where the domain is not our primary domain, get_dcs() uses get_dc_name_via_netlogon(), which calls cm_open_connection() via cm_connect_netlogon().
cm_open_connection() checks the server affinity cache for a DC. If there is an entry and is_ipaddress(saf_servername) is false, dcip_to_name() is not needed, so create_local_private_krb5_conf_for_domain() is never called either. This means KRB5_CONFIG is never set in this child, and Kerberos defaults to /etc/krb5.conf. In my case this file is blank, so dns_lookup_kdc = true is implied.
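The branch described above can be modelled in a few lines of C. This is only a sketch of the control flow as I read it, not Samba source: struct child_state, looks_like_ipaddress() and both open_connection_*() functions are hypothetical names, and it assumes the other paths reach create_local_private_krb5_conf_for_domain() through dcip_to_name().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-child state: has a private krb5.conf been set up? */
struct child_state {
    bool krb5_conf_created;
};

/* Crude stand-in for is_ipaddress(); not Samba's implementation. */
static bool looks_like_ipaddress(const char *s)
{
    assert(s != NULL);
    if (strchr(s, ':') != NULL) {
        return true; /* treat anything with a colon as IPv6 */
    }
    /* IPv4 if non-empty and made only of digits and dots */
    return s[0] != '\0' && strspn(s, "0123456789.") == strlen(s);
}

/* Model of the reported behaviour. */
static void open_connection_reported(struct child_state *st,
                                     const char *saf_servername)
{
    if (saf_servername != NULL && !looks_like_ipaddress(saf_servername)) {
        /* Cached hostname: the name lookup is skipped, and the
         * private krb5.conf creation is skipped along with it. */
        return;
    }
    /* No entry, or an IP address: assumed to go through the name
     * lookup, which creates the private krb5.conf as a side effect. */
    st->krb5_conf_created = true;
}

/* Model of the suggested behaviour: create the conf on every path. */
static void open_connection_proposed(struct child_state *st,
                                     const char *saf_servername)
{
    if (saf_servername != NULL && !looks_like_ipaddress(saf_servername)) {
        st->krb5_conf_created = true; /* explicit call on the hostname path */
        return;
    }
    st->krb5_conf_created = true;
}
```

With a cached hostname such as "dc1.example.com", the first function leaves krb5_conf_created false, which corresponds to the child falling back to /etc/krb5.conf; the second sets it on every path.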
Once connected to the DC, cm_prepare_connection() grabs the mutex and, with it held, calls cli_session_setup_spnego(). In this deployment, where many of the KDCs in DNS time out, this call is slow, so the mutex becomes contended and other winbind processes time out waiting for it and switch DCs.
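The effect of holding a lock across a slow call can be reproduced in isolation. The sketch below uses a plain POSIX mutex and threads as a stand-in; winbind's netlogon mutex is a cross-process lock and works differently, and the names demo_mutex, waiter_main() and waiter_gets_lock() are made up for this illustration. The holder sleeps where the real code would be doing session setup.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;

struct waiter {
    long timeout_ms; /* how long the waiter is willing to wait */
    bool got_lock;   /* result of the timed grab */
};

static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Waiter: try to grab the mutex, give up after timeout_ms. */
static void *waiter_main(void *arg)
{
    struct waiter *w = arg;
    struct timespec deadline;
    int rc;

    assert(w != NULL);
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += w->timeout_ms / 1000;
    deadline.tv_nsec += (w->timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    rc = pthread_mutex_timedlock(&demo_mutex, &deadline);
    w->got_lock = (rc == 0);
    if (rc == 0) {
        pthread_mutex_unlock(&demo_mutex);
    }
    return NULL;
}

/* Holder: take the mutex, start a waiter, hold for hold_ms, release.
 * Returns whether the waiter obtained the mutex before its timeout. */
static bool waiter_gets_lock(long hold_ms, long timeout_ms)
{
    struct waiter w = { timeout_ms, false };
    pthread_t t;

    pthread_mutex_lock(&demo_mutex);
    if (pthread_create(&t, NULL, waiter_main, &w) != 0) {
        pthread_mutex_unlock(&demo_mutex);
        return false;
    }
    sleep_ms(hold_ms); /* stands in for the slow session setup */
    pthread_mutex_unlock(&demo_mutex);
    pthread_join(t, NULL);
    return w.got_lock;
}
```

When the hold time exceeds the waiter's timeout, the waiter's grab fails, which is the analogue of the "mutex grab failed" message; a short hold lets the waiter through. Doing the slow work before taking the lock, or making that work fast, removes the contention in this model.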
MIT Kerberos 1.12.1