Samba version 4.20.6. We are building a domain with a large number of DCs. We now have more than 40 DCs and suffer from broken replication, which started recently. The only change we made was joining new DCs; before that, replication worked fine. We investigated the problem and believe we have found the root cause, which we would be glad to validate with you.

The problem is that after some time replication between DCs broke. The error message relates to obtaining a Kerberos ticket, because the KDC could not be found (NO_LOGON_SERVER error). We worked around the problem by explicitly specifying the kdc in krb5.conf (previously we relied on resolving the KDC through DNS), but we wanted to understand why resolution through DNS does not work.

We collected DNS logs with log level 11 and noticed that the KDC cannot be obtained after querying the _kerberos._udp.REALM and _kerberos._tcp.REALM DNS nodes, although the DNS logs show that the response is sent back correctly with all the needed SRV records.

Finally we dug into https://github.com/samba-team/samba/blob/master/third_party/heimdal/lib/roken/resolve.c, function dns_lookup_int, which tries to fetch DNS records with an initial buffer size of 1500 bytes. It calls a system function such as res_search, and if the returned size is < 0 it retries the request with a larger buffer. But this is not our case, because we do not see repeated requests in the DNS logs. One possibility is that the system library returns a size larger than the allocated buffer and the code proceeds to parse that response anyway. This seems incorrect: how can a large response fit into a smaller reply buffer? We do not know exactly which version of the system library is used, but according to (for example) the Oracle documentation (https://docs.oracle.com/cd/E36784_01/html/E36875/res-nsearch-3resolv.html) such functions may return a size bigger than the reply buffer size, which means that the DNS request should be repeated with a larger buffer.

Please advise whether we are on the right track, and what additional information is needed to investigate this problem. Thanks in advance.
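To make the expected behaviour concrete, here is a minimal sketch of the retry logic we have in mind. It is not the Heimdal code: query_fn, fake_query, struct fake_reply and lookup_with_retry are hypothetical names introduced for illustration, and the callback only imitates the contract described in the Oracle man page (the return value is the full reply length, which may exceed the buffer size).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical query callback: writes at most `cap` bytes into `buf` and
 * returns the FULL length of the reply (possibly larger than `cap`),
 * or -1 on error. Stands in for a res_search()-like call. */
typedef int (*query_fn)(unsigned char *buf, int cap, void *ctx);

/* Fake responder used only for illustration: pretends the server
 * always sends a reply of `reply_len` bytes and counts the calls. */
struct fake_reply {
    int reply_len;
    int calls;
};

static int fake_query(unsigned char *buf, int cap, void *ctx)
{
    struct fake_reply *r = ctx;
    int n = r->reply_len < cap ? r->reply_len : cap;

    r->calls++;
    memset(buf, 0, (size_t)n);   /* only `n` bytes actually fit */
    return r->reply_len;         /* but the full length is reported */
}

/* Repeat the query with a larger buffer whenever the reported length
 * exceeds the buffer, instead of parsing a truncated reply.
 * On success returns the reply length and stores a malloc'ed buffer
 * in *out (caller frees). Returns -1 on error or if the reply would
 * exceed max_cap. */
static int lookup_with_retry(query_fn query, void *ctx, int initial_cap,
                             int max_cap, unsigned char **out)
{
    int cap = initial_cap;

    assert(query != NULL && out != NULL && initial_cap > 0);
    *out = NULL;

    while (cap <= max_cap) {
        unsigned char *buf = malloc((size_t)cap);
        int len;

        if (buf == NULL)
            return -1;

        len = query(buf, cap, ctx);
        if (len < 0) {
            free(buf);
            return -1;
        }
        if (len <= cap) {        /* whole reply fits: safe to parse */
            *out = buf;
            return len;
        }
        free(buf);
        cap = len;               /* reply was truncated: grow and retry */
    }
    return -1;
}
```

With a fake reply of 1961 bytes and an initial buffer of 1500 bytes, lookup_with_retry(fake_query, &r, 1500, 65535, &buf) issues two queries: the first one reports a length larger than the buffer, the second one fits. In our DNS logs we see only one query, which is why we suspect that the equivalent of the `len <= cap` check is missing.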
Created attachment 18625: logs with DNS and Kerberos log level = 11
This problem can be easily reproduced. We did it on our test environment with 2 DCs, where replication worked fine. Using samba-tool I added approx. 60 SRV records to _kerberos._udp.TEST.LAB and _kerberos._tcp.TEST.LAB. After that, replication stopped working.

The difference in the DNS logs is the following. When replication is working, the DNS requests look like this:

    _kerberos._udp.TEST.LAB SRV
    dc1.test.lab A
    dc1.test.lab AAAA
    _kerberos._udp.TEST.LAB SRV
    dc1.test.lab A
    dc1.test.lab AAAA
    _kerberos._tcp.TEST.LAB SRV
    dc1.test.lab A
    dc1.test.lab AAAA

With broken replication, the DNS requests are:

    _kerberos._udp.TEST.LAB SRV
    _kerberos._tcp.TEST.LAB SRV
    _kerberos._http.TEST.LAB SRV - 7 times
    kerberos.TEST.LAB A
    kerberos.TEST.LAB AAAA

I have also attached logs taken while replication is broken.
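As a rough sanity check that about 60 SRV records can exceed the 1500-byte initial buffer, the reply size can be estimated as below. This is only a sketch under stated assumptions: wire_name_len and srv_reply_size are hypothetical helpers, every record is assumed to point at a target of the same length, owner names are assumed to be compressed to a 2-byte pointer, target names are assumed to be written uncompressed (as RFC 2782 specifies for the SRV target field), and additional-section records and EDNS options are ignored. A real reply may be larger or smaller.

```c
#include <stddef.h>
#include <string.h>

/* Uncompressed wire length of a dotted name without a trailing dot:
 * one length byte per label, the label bytes, and the root byte.
 * That works out to strlen(name) + 2. */
static size_t wire_name_len(const char *name)
{
    return strlen(name) + 2;
}

/* Estimated size of a reply carrying `n` SRV records for `qname`,
 * all pointing at a target as long as `target` (assumption). */
static size_t srv_reply_size(const char *qname, const char *target, size_t n)
{
    size_t header = 12;                           /* fixed DNS header */
    size_t question = wire_name_len(qname) + 4;   /* QTYPE + QCLASS */
    size_t rr = 2                                 /* owner: compression pointer */
              + 10                                /* TYPE, CLASS, TTL, RDLENGTH */
              + 6                                 /* priority, weight, port */
              + wire_name_len(target);            /* uncompressed target */

    return header + question + n * rr;
}
```

For example, srv_reply_size("_kerberos._udp.TEST.LAB", "dc1.test.lab", 60) gives 12 + 29 + 60 * 32 = 1961 bytes, which is above 1500, while a single record gives 73 bytes. Longer host names push the threshold lower, which would fit with the problem appearing in production at somewhat more than 40 DCs.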
This problem affects only Samba's internal DNS server. No replication problems are observed when using Bind9 with the same number of DCs.