14597 – DC failover

Status:	NEW

Alias:	None

Product:	Samba 4.1 and newer
Classification:	Unclassified
Component:	Winbind
Version:	4.11.16
Hardware:	x64 Linux

Importance:	P5 major
Target Milestone:	---
Assignee:	Samba QA Contact
QA Contact:	Samba QA Contact

Reported:	2020-12-11 20:18 UTC by Jason Keltz
Modified:	2021-03-05 15:03 UTC
CC List:	4 users

Description Jason Keltz 2020-12-11 20:18:44 UTC

I have a self-compiled Samba 4.11.16 running on CentOS 7.
I'm running two domain controllers, dc1 and dc2.
I believe there is a bug in the way DC failover works in winbindd under Linux with the current configuration.

My Samba configuration on dc1:

# Global parameters
[global]
        netbios name = DC1
        realm = AD.EECS.YORKU.CA
        workgroup = EECSYORKUCA
        dns forwarder = 130.63.94.4
        server role = active directory domain controller
        idmap_ldb:use rfc2307 = yes
        # 130.63.94.66 is the dc1 IP
        interfaces = 127.0.0.1 130.63.94.66
        bind interfaces only = yes

[netlogon]
        path = /local/samba/sysvol/ad.eecs.yorku.ca/scripts
        read only = no
        guest ok = no

[sysvol]
        path = /local/samba/sysvol
        read only = no
        guest ok = no

(Similar on dc2 with change in netbios name + IP)

dc1 krb5.conf:

[libdefaults]
 default_realm = AD.EECS.YORKU.CA
 dns_lookup_realm = false
 dns_lookup_kdc = false
 rdns = false
 forwardable = true
 renew_lifetime = 7d

[realms]
 AD.EECS.YORKU.CA = {
  kdc = 127.0.0.1
  # 130.63.94.67 is the dc2 IP
  kdc = 130.63.94.67
 }

[domain_realm]
 ad.eecs.yorku.ca = AD.EECS.YORKU.CA
 .ad.eecs.yorku.ca = AD.EECS.YORKU.CA
 eecs.yorku.ca = AD.EECS.YORKU.CA
 .eecs.yorku.ca = AD.EECS.YORKU.CA

[dc2 configuration is similar, except that the second kdc is dc1]

Client krb5.conf:

[libdefaults]
 default_realm = AD.EECS.YORKU.CA
 dns_lookup_realm = false
 dns_lookup_kdc = true
 rdns = false
 forwardable = true
 renew_lifetime = 7d

[realms]
 AD.EECS.YORKU.CA = {
  kdc = 130.63.94.66
  kdc = 130.63.94.67
 }

[domain_realm]
 ad.eecs.yorku.ca = AD.EECS.YORKU.CA
 .ad.eecs.yorku.ca = AD.EECS.YORKU.CA
 eecs.yorku.ca = AD.EECS.YORKU.CA
 .eecs.yorku.ca = AD.EECS.YORKU.CA

Replication is working perfectly with no errors.
If I stop the DC processes on either DC, Windows clients appear to fail over properly.

The problem seems to affect my Linux clients (CentOS 7) running winbind.

Sometimes failover works fine, but most of the time this is what happens:

1) Connect the host to DC2. Everything works fine.
2) Shut down the DC services on DC2.
3) The host appears to connect to DC1 as expected. wbinfo -u and wbinfo -g report proper output.
4) Run a command such as "whoami" on the host, or as root try "sudo jas", and I get back "no such user".
5) At some point later (at least 20 minutes, but I've also seen it take over an hour) it just magically starts working.
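
To put a number on step 5, the recovery delay could be measured rather than estimated. The following is only a minimal sketch, not something that was run for this report: the helper name wait_for_user and its parameters are made up for illustration, and it assumes Python's pwd.getpwnam goes through nsswitch (and therefore nss_winbind) the same way "whoami" and "su" do.

```python
import pwd
import time


def wait_for_user(name, lookup=pwd.getpwnam, interval=30.0, timeout=7200.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll the NSS passwd database until `name` resolves.

    Returns the number of seconds waited, or None if `timeout` elapsed first.
    `lookup`, `sleep` and `clock` are injectable so the loop can be tested
    without a real domain member (hypothetical design, not a Samba API).
    """
    start = clock()
    while True:
        try:
            lookup(name)      # consults nsswitch, so winbind is asked
            return clock() - start
        except KeyError:      # pwd.getpwnam raises KeyError for unknown users
            pass
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Calling wait_for_user("jas") right after stopping the DC services on DC2 would return roughly how many seconds passed before the lookup started succeeding, or None if the user still could not be resolved after two hours.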

I ran winbind interactively in debug mode. The full output is below, but briefly, this is what happens:

get_dc_list returns the preferred server list "dc1.ad.eecs.yorku.ca, *", exactly as it should. But when I run "whoami" or "su", the system tries to connect to LDAP port 389 on dc2. When that fails, it doesn't fall back to dc1. Even a reboot of the client doesn't fix it! However, at some point later, anywhere from about 20 minutes up to an hour, it starts contacting the proper DC (dc1), and whoami/sudo begin to work.

This is exactly what I saw:

winbindd version 4.11.16 started.
Copyright Andrew Tridgell and the Samba Team 1992-2019
lp_load_ex: refreshing parameters
Initialising global parameters
rlimit_max: increasing rlimit_max (1024) to minimum Windows limit (16384)
Processing section "[global]"
Registered MSG_REQ_POOL_USAGE
Registered MSG_REQ_DMALLOC_MARK and LOG_CHANGED
lp_load_ex: refreshing parameters
Initialising global parameters
rlimit_max: increasing rlimit_max (1024) to minimum Windows limit (16384)
Processing section "[global]"
added interface enp0s3 ip=130.63.97.152 bcast=130.63.97.255 netmask=255.255.255.0
added interface enp0s3 ip=130.63.97.152 bcast=130.63.97.255 netmask=255.255.255.0
tdb '/local/samba/locks/winbindd_cache.tdb' is valid
Created backup '/local/samba/locks/winbindd_cache.tdb.bak' of tdb '/local/samba/locks/winbindd_cache.tdb'
add_trusted_domain: Added domain [BUILTIN] [(null)] [S-1-5-32]
add_trusted_domain: Added domain [J2] [(null)] [S-1-5-21-4255622434-1312408701-3568591385]
add_trusted_domain: Added domain [EECSYORKUCA] [AD.EECS.YORKU.CA] [S-1-5-21-1981678738-1545235886-4256466701]
connection_ok: Connection to (null) for domain EECSYORKUCA is not connected
Successfully contacted LDAP server 130.63.94.66
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
Connecting to 130.63.94.66 at port 445
ldb_wrap open of secrets.ldb
GENSEC backend 'gssapi_spnego' registered
GENSEC backend 'gssapi_krb5' registered
GENSEC backend 'gssapi_krb5_sasl' registered
GENSEC backend 'spnego' registered
GENSEC backend 'schannel' registered
GENSEC backend 'naclrpc_as_system' registered
GENSEC backend 'sasl-EXTERNAL' registered
GENSEC backend 'ntlmssp' registered
GENSEC backend 'ntlmssp_resume_ccache' registered
GENSEC backend 'http_basic' registered
GENSEC backend 'http_ntlm' registered
GENSEC backend 'http_negotiate' registered
GENSEC backend 'krb5' registered
GENSEC backend 'fake_gssapi_krb5' registered
winbindd_dual_list_trusted_domains: [ 12149]: list trusted domains
ads: trusted_domains
ldb_wrap open of secrets.ldb
Connecting to 130.63.94.66 at port 135
Connecting to 130.63.94.66 at port 49152
Connecting to 130.63.94.66 at port 135
Connecting to 130.63.94.66 at port 49152
winbindd_dual_list_trusted_domains: [ 12149]: list trusted domains
ads: trusted_domains
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
Successfully contacted LDAP server 130.63.94.66
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"

[jas@j2 jas]# su - jas
winbindd_interface_version: [nss_winbind (12180)]: request interface version (version = 31)
winbindd_getpwnam_send: [nss_winbind (12180)] getpwnam jas
Connecting to 130.63.94.66 at port 135 <- perfect - it's dc1.
Connecting to 130.63.94.66 at port 49153 <- great...
idmap backend ad not found

load_module_absolute_path: Module '/xsys/pkg/samba-4.11.16/lib/idmap/ad.so' loaded

Connecting to 130.63.94.67 at port 389 <- but this shouldn't be happening - that's dc2.
su: user jas does not exist
[jas@j2 jas]# winbindd_interface_version: [nss_winbind (12183)]: request interface version (version = 31)
winbindd_interface_version: [nss_winbind (12182)]: request interface version (version = 31)
winbindd_getgroups_send: [nss_winbind (12182)] getgroups root
winbindd_getgroups_send: [nss_winbind (12183)] getgroups root
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
Successfully contacted LDAP server 130.63.94.66
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"
get_dc_list: preferred server list: "dc1.ad.eecs.yorku.ca, *"

And every time I try to "su - jas", it again tries to connect to dc2!!
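
For anyone trying to confirm the same pattern in their own debug output, the connection attempts can be pulled out of the log mechanically. This is a rough sketch that assumes the log uses the "Connecting to <ip> at port <n>" wording shown above; the function names are hypothetical and the sample lines are copied from the output in this report.

```python
import re
from collections import Counter

# Matches lines such as "Connecting to 130.63.94.67 at port 389"
CONNECT_RE = re.compile(r"Connecting to (\d{1,3}(?:\.\d{1,3}){3}) at port (\d+)")


def connection_attempts(log_text):
    """Count (ip, port) pairs found in winbindd debug output."""
    return Counter((ip, int(port)) for ip, port in CONNECT_RE.findall(log_text))


def attempts_to(log_text, down_ips):
    """Keep only the attempts aimed at addresses known to be offline."""
    down = set(down_ips)
    return {k: v for k, v in connection_attempts(log_text).items() if k[0] in down}


sample = """\
Connecting to 130.63.94.66 at port 135
Connecting to 130.63.94.66 at port 49153
Connecting to 130.63.94.67 at port 389
"""
print(attempts_to(sample, ["130.63.94.67"]))  # prints {('130.63.94.67', 389): 1}
```

With dc2 (130.63.94.67) listed as down, the only attempt reported is the LDAP connection on port 389, which matches what the idmap ad lookup appears to be doing in the log above.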

It even creates a krb5.conf._JOIN_ file containing:

[realms]
        AD.EECS.YORKU.CA = {
                kdc = 130.63.94.66
        }

But eventually, after a long period of time, it works:

# su - jas
winbindd_interface_version: [nss_winbind (13046)]: request interface version (version = 31)
winbindd_getpwnam_send: [nss_winbind (13046)] getpwnam jas
resolve_hosts: Attempting host lookup for name dc1.ad.eecs.yorku.ca<0x20>
Connecting to 130.63.94.66 at port 389
....

-----

Over approximately the last 8 months, I've been researching moving our environment (hundreds of mixed Linux and Windows systems) under Samba DC (including a lot of NFSv4 Krb). From time to time, I mail the Samba list with a variety of questions, sometimes about bugs such as the one above, sometimes about functionality. Most of my messages to the list receive no response. I recognize that everyone is busy. However, I hope that none of my questions have in any way offended the Samba developers, as this is certainly not my intention. I love the Samba product, and am very excited to soon start a large rollout in my Department. I hope that any issues I report can help to improve stability for everyone.

Comment 1 heupink 2021-01-06 13:38:56 UTC

I can confirm this on our (3 DC) site.

We reboot our DCs at night, and occasionally, while a DC is rebooting, "getent group" on our domain member servers starts reporting only the local groups.

And while this happens, "wbinfo --ping-dc" still succeeds, and also "wbinfo -g" / "wbinfo -u" still report all AD groups.

The situation auto-corrects itself after a few minutes.

I have not taken the winbind debug steps from the original poster, but I think the problem is the same.

The problem happens only every few nights. Most of the time, the domain member servers don't even notice the DCs rebooting, as it should be.

Comment 2 samba 2021-03-04 15:58:01 UTC

I'm experiencing a similar problem: no failover whatsoever from my 2nd DC. Initially, I could sometimes query the network with the 1st DC down, though with a long lag. However, after trying some of the suggested fixes regarding resolv.conf and krb5.conf, logins and queries no longer work at all if DC1 is down.

I am seeing SRV queries in the DNS query logs on DC2, but the failover is not happening.  Adding "options rotate" to resolv.conf on the clients has caused the number of queries to go up by a factor of 10 or more, but failover is not happening.
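
Since DNS is evidently being queried, one thing worth ruling out is plain network reachability of the surviving DC from the client. The sketch below is only an assumption about what might be useful here: it checks whether a TCP connection can be opened to the usual Kerberos (88), LDAP (389) and SMB (445) ports. The function names are hypothetical, and an open port says nothing about whether the service behind it is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name resolution failed
        return False


def check_dcs(hosts, ports=(88, 389, 445), timeout=3.0):
    """Map each (host, port) pair to its TCP reachability."""
    return {(h, p): port_open(h, p, timeout) for h in hosts for p in ports}
```

Running check_dcs with the addresses of both DCs while DC1 is down would show whether the client can reach DC2 on those ports at all. If it can, the problem is more likely in how winbind picks its server than in the network or DNS.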