I have a self-compiled Samba 4.11.16 running on CentOS 7. I'm running 2 domain controllers, dc1 and dc2. I believe there is a bug in the way that DC failover is working in Winbindd under Linux with the current configuration.

My Samba configuration on dc1:

    # Global parameters
    [global]
        netbios name = DC1
        realm = AD.EECS.YORKU.CA
        workgroup = EECSYORKUCA
        dns forwarder = 130.63.94.4
        server role = active directory domain controller
        idmap_ldb:use rfc2307 = yes
        interfaces = 127.0.0.1 130.63.94.66    <- dc1 ip
        bind interfaces only = yes

    [netlogon]
        path = /local/samba/sysvol/ad.eecs.yorku.ca/scripts
        read only = no
        guest ok = no

    [sysvol]
        path = /local/samba/sysvol
        read only = no
        guest ok = no

(Similar on dc2 with change in netbios name + IP)

dc1 krb5.conf:

    [libdefaults]
        default_realm = AD.EECS.YORKU.CA
        dns_lookup_realm = false
        dns_lookup_kdc = false
        rdns = false
        forwardable = true
        renew_lifetime = 7d

    [realms]
        AD.EECS.YORKU.CA = {
            kdc = 127.0.0.1
            kdc = 130.63.94.67    <- dc2 ip
        }

    [domain_realm]
        ad.eecs.yorku.ca = AD.EECS.YORKU.CA
        .ad.eecs.yorku.ca = AD.EECS.YORKU.CA
        eecs.yorku.ca = AD.EECS.YORKU.CA
        .eecs.yorku.ca = AD.EECS.YORKU.CA

[dc2 configuration is similar except second kdc is dc1]

Client krb5.conf:

    [libdefaults]
        default_realm = AD.EECS.YORKU.CA
        dns_lookup_realm = false
        dns_lookup_kdc = true
        rdns = false
        forwardable = true
        renew_lifetime = 7d

    [realms]
        AD.EECS.YORKU.CA = {
            kdc = 130.63.94.66
            kdc = 130.63.94.67
        }

    [domain_realm]
        ad.eecs.yorku.ca = AD.EECS.YORKU.CA
        .ad.eecs.yorku.ca = AD.EECS.YORKU.CA
        eecs.yorku.ca = AD.EECS.YORKU.CA
        .eecs.yorku.ca = AD.EECS.YORKU.CA

Replication is working perfectly with no errors. If I stop the DC processes on either DC, Windows clients appear to fail over properly. The problem seems to affect my Linux clients (CentOS 7) running winbind. Sometimes the failover works fine, but most times this is what happens:

1) connect host to DC2 - everything is working fine
2) shutdown DC services on DC2
3) host appears to connect to DC1 as expected.
   wbinfo -u and wbinfo -g report proper output.
4) Execute command such as "whoami" on host or as root try to "sudo jas" and I get back "no such user"
5) At some point later - at least 20 minutes, but I've also seen it happen over an hour later - it just magically starts working.

I ran winbind in debug mode, interactive, and I will show the output below, but briefly this is what happens:

1) get_dc_list returns preferred server list: "dc1.ad.eecs.yorku.ca, *" exactly as it should, but then when I run "whoami" or "su" commands, the system tries to connect to ldap port 389 on dc2. When that fails, it doesn't connect to dc1. Even a reboot on the client doesn't fix it! However, at some point later which seems to range from say 20 minutes up to an hour, it starts contacting the proper dc1, and whoami/sudo begins to work.

This is exactly what I saw:

    winbindd version 4.11.16 started.
    Copyright Andrew Tridgell and the Samba Team 1992-2019
    lp_load_ex: refreshing parameters
    Initialising global parameters
    rlimit_max: increasing rlimit_max (1024) to minimum Windows limit (16384)
    Processing section "[global]"
    Registered MSG_REQ_POOL_USAGE
    Registered MSG_REQ_DMALLOC_MARK and LOG_CHANGED
    lp_load_ex: refreshing parameters
    Initialising global parameters
    rlimit_max: increasing rlimit_max (1024) to minimum Windows limit (16384)
    Processing section "[global]"
    added interface enp0s3 ip=130.63.97.152 bcast=130.63.97.255 netmask=255.255.255.0
    added interface enp0s3 ip=130.63.97.152 bcast=130.63.97.255 netmask=255.255.255.0
    tdb '/local/samba/locks/winbindd_cache.tdb' is valid
    Created backup '/local/samba/locks/winbindd_cache.tdb.bak' of tdb '/local/samba/locks/winbindd_cache.tdb'
    add_trusted_domain: Added domain [BUILTIN] [(null)] [S-1-5-32]
    add_trusted_domain: Added domain [J2] [(null)] [S-1-5-21-4255622434-1312408701-3568591385]
    add_trusted_domain: Added domain [EECSYORKUCA] [AD.EECS.YORKU.CA] [S-1-5-21-1981678738-1545235886-4256466701]
    connection_ok: Connection to (null) for domain EECSYORKUCA is not connected
    Successfully contacted LDAP server 130.63.94.66
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
    Connecting to 130.63.94.66 at port 445
    ldb_wrap open of secrets.ldb
    GENSEC backend 'gssapi_spnego' registered
    GENSEC backend 'gssapi_krb5' registered
    GENSEC backend 'gssapi_krb5_sasl' registered
    GENSEC backend 'spnego' registered
    GENSEC backend 'schannel' registered
    GENSEC backend 'naclrpc_as_system' registered
    GENSEC backend 'sasl-EXTERNAL' registered
    GENSEC backend 'ntlmssp' registered
    GENSEC backend 'ntlmssp_resume_ccache' registered
    GENSEC backend 'http_basic' registered
    GENSEC backend 'http_ntlm' registered
    GENSEC backend 'http_negotiate' registered
    GENSEC backend 'krb5' registered
    GENSEC backend 'fake_gssapi_krb5' registered
    winbindd_dual_list_trusted_domains: [ 12149]: list trusted domains
    ads: trusted_domains
    ldb_wrap open of secrets.ldb
    Connecting to 130.63.94.66 at port 135
    Connecting to 130.63.94.66 at port 49152
    Connecting to 130.63.94.66 at port 135
    Connecting to 130.63.94.66 at port 49152
    winbindd_dual_list_trusted_domains: [ 12149]: list trusted domains
    ads: trusted_domains
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
    Successfully contacted LDAP server 130.63.94.66
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"

    [jas@j2 jas]# su - jas
    winbindd_interface_version: [nss_winbind (12180)]: request interface version (version = 31)
    winbindd_getpwnam_send: [nss_winbind (12180)] getpwnam jas
    Connecting to 130.63.94.66 at port 135    <- perfect - it's dc1.
    Connecting to 130.63.94.66 at port 49153    <- great...
    idmap backend ad not found
    load_module_absolute_path: Module '/xsys/pkg/samba-4.11.16/lib/idmap/ad.so' loaded
    Connecting to 130.63.94.67 at port 389    <- but this shouldn't be happening - that's dc2.
    su: user jas does not exist
    [jas@j2 jas]#
    winbindd_interface_version: [nss_winbind (12183)]: request interface version (version = 31)
    winbindd_interface_version: [nss_winbind (12182)]: request interface version (version = 31)
    winbindd_getgroups_send: [nss_winbind (12182)] getgroups root
    winbindd_getgroups_send: [nss_winbind (12183)] getgroups root
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
    Successfully contacted LDAP server 130.63.94.66
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
    get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"

And every time I try to "su - jas", it again tries to connect to dc2!! It even creates a krb5.conf._JOIN_ file containing:

    [realms]
        AD.EECS.YORKU.CA = {
            kdc = 130.63.94.66
        }

But eventually....... after a long period of time... It works:

    # su - jas
    winbindd_interface_version: [nss_winbind (13046)]: request interface version (version = 31)
    winbindd_getpwnam_send: [nss_winbind (13046)] getpwnam jas
    resolve_hosts: Attempting host lookup for name dc1.ad.eecs.yorku.ca<0x20>
    Connecting to 130.63.94.66 at port 389
    ....

-----

Over the course of the last approximately 8 months, I've been researching moving our environment (hundreds of mixed Linux and Windows systems) under Samba DC (including a lot of NFSv4 Krb). From time to time, I mail the Samba list with a variety of questions - sometimes about bugs such as the above, or functionality. Most of my messages to the list receive no response. I recognize that everyone is busy. However, I hope that none of my questions have in any way offended the Samba developers, as this is certainly not my intention.
I love the Samba product, and am very excited to soon start a large rollout in my Department. I hope that any issues I report can help to improve stability for everyone.
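
For anyone trying to reproduce this, a quick way to confirm from the client which DC is actually reachable at the moment of the failure is a plain TCP probe of the ports that show up in the winbindd log above (389, 445, 135). This is only a rough sketch: the helper names port_open and probe are made up for illustration, the addresses are the dc1/dc2 IPs from the configuration above, and an open port only shows that the service is listening, not that winbind will pick that DC.

```python
import socket

# DC addresses taken from the configuration quoted in this post.
DCS = {"dc1": "130.63.94.66", "dc2": "130.63.94.67"}
# Ports seen in the winbindd debug log: LDAP, SMB, RPC endpoint mapper.
PORTS = (389, 445, 135)


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not open".
        return False


def probe(dcs=DCS, ports=PORTS, timeout=2.0):
    """Return {dc_name: {port: bool}} for every DC/port combination."""
    return {
        name: {port: port_open(ip, port, timeout) for port in ports}
        for name, ip in dcs.items()
    }
```

Calling probe() on the client right after "su - jas" fails would show whether port 389 on dc2 is really closed while dc1 is answering, which is the state the log suggests.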
I can confirm this on our (3 DC) site. We reboot our DCs in the night, and occasionally, while a DC is rebooting, "getent group" on our domain member servers starts reporting only the local groups. While this happens, "wbinfo --ping-dc" still succeeds, and "wbinfo -g" / "wbinfo -u" still report all AD groups and users. The situation corrects itself after a few minutes. I have not taken the winbind debug steps from the original poster, but I think the problem is the same. It happens only every few nights; most of the time the domain member servers don't even notice the DC reboot, as it should be.
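
The symptom above (wbinfo -g still lists the AD groups while getent group shows only local ones) could be caught automatically with something like the following. It is a sketch under a few assumptions: the function names are invented for illustration, group names are compared exactly as the two tools print them, and "getent group" only enumerates domain groups at all when "winbind enum groups = yes" is set on the member.

```python
import subprocess


def parse_wbinfo_groups(text):
    """wbinfo -g prints one group name per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def parse_getent_groups(text):
    """getent group prints name:passwd:gid:members; keep only the name."""
    return {line.split(":", 1)[0] for line in text.splitlines() if line.strip()}


def groups_missing_from_nss(wbinfo_text, getent_text):
    """Groups winbind knows about that NSS enumeration does not return."""
    missing = parse_wbinfo_groups(wbinfo_text) - parse_getent_groups(getent_text)
    return sorted(missing)


def run(cmd):
    """Run a command and return its stdout as text (empty on failure)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL, universal_newlines=True)
    return proc.stdout


def check_now():
    """Compare the live output of wbinfo -g and getent group on this host."""
    return groups_missing_from_nss(run(["wbinfo", "-g"]),
                                   run(["getent", "group"]))
```

Run from cron during the nightly reboot window, a non-empty result from check_now() would mark the minutes in which NSS lookups are broken even though winbind itself still answers.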
I'm experiencing a similar problem - no failover whatsoever from my 2nd DC. Initially, I could sometimes query the network with the 1st DC down, though with a long lag. However, after trying some of the suggested fixes regarding resolv.conf and krb5.conf, logins and queries no longer work at all if DC1 is down. I am seeing SRV queries in the DNS query logs on DC2, so the clients do reach it for DNS, but the failover is not happening. Adding "options rotate" to resolv.conf on the clients has caused the number of queries to go up by a factor of 10 or more, with no effect on failover.
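
Since resolv.conf has been edited several times here, it may be worth checking mechanically that every client still lists both DCs as nameservers and whether "rotate" is set. The sketch below assumes the clients use the DCs themselves for DNS; the function names are hypothetical and the 192.0.2.x addresses in the usage note are placeholders, not real DC addresses.

```python
def parse_resolv_conf(text):
    """Return (nameservers, options) from the contents of a resolv.conf."""
    nameservers, options = [], []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            options.extend(fields[1:])
    return nameservers, options


def check_dns_failover(text, dc_ips):
    """Report DCs missing from the nameserver list and whether rotate is on."""
    nameservers, options = parse_resolv_conf(text)
    return {
        "missing_dcs": [ip for ip in dc_ips if ip not in nameservers],
        "rotate": "rotate" in options,
    }
```

For example, check_dns_failover(open("/etc/resolv.conf").read(), ["192.0.2.10", "192.0.2.11"]) returns a small dict that can be compared across clients. It only validates the resolver configuration and says nothing about why winbind keeps using the dead DC.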