Bug 15842 - resolving kdc through DNS for domain with large number of DCs
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Other
Version: 4.20.6
Hardware: All Linux
Importance: P5 normal
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
 
Reported: 2025-04-01 15:21 UTC by Andrew
Modified: 2025-04-07 03:42 UTC

Attachments
logs with dns and kerberos level = 11 (1.53 MB, text/plain)
2025-04-02 07:08 UTC, Andrew

Description Andrew 2025-04-01 15:21:37 UTC
Samba version 4.20.6.

We are building a domain with a large number of DCs. We now have more than 40 DCs and are suffering from broken replication, which started recently. The only change we made was joining new DCs; before that, replication worked fine.

We investigated the problem and believe we have found the root cause, which we would like to validate with you.

The problem is that after some time replication between DCs broke. The error message relates to obtaining a Kerberos ticket, because the KDC could not be found (NO_LOGON_SERVER error).
We worked around the problem by explicitly specifying the kdc in krb5.conf (previously we relied on resolving the KDC through DNS).
But we wanted to understand why resolution through DNS does not work. We collected DNS logs at log level 11 and noticed that the KDC cannot be obtained after querying the _kerberos._udp.REALM and _kerberos._tcp.REALM DNS nodes, although the DNS logs show that the response is correctly sent back with all the needed SRV records.
Finally we dug into the file https://github.com/samba-team/samba/blob/master/third_party/heimdal/lib/roken/resolve.c, function dns_lookup_int, which tries to fetch DNS records with an initial buffer size of 1500 bytes. It calls a system function such as res_search and, if the returned size is < 0, retries the request with a larger buffer. But this is not our case, because we do not see repeated requests in the DNS logs. One possibility is that the system library returns a size larger than the allocated buffer and the code proceeds to parse such a response.
This seems incorrect: how can a large response fit into a small reply buffer?
We do not know exactly which version of the system library is used, but according to (for example) the Oracle documentation (https://docs.oracle.com/cd/E36784_01/html/E36875/res-nsearch-3resolv.html), such functions may return a size bigger than the reply buffer size, which means that the DNS request should be repeated with a larger buffer.
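
To illustrate the retry behaviour we would expect, here is a minimal sketch in C. It is not the Heimdal code: lookup_with_retry, query_fn, fake_query and struct fake_server are hypothetical names, and the callback is only assumed to follow the res_search-style convention described above (return the full response size, which may exceed the buffer, or a negative value on error). The 65535-byte cap is likewise an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback with a res_search-like return convention:
 * returns the full size of the response (possibly larger than buflen),
 * or a negative value on error. */
typedef int (*query_fn)(const char *name, unsigned char *buf, int buflen, void *ctx);

#define INITIAL_BUF 1500   /* initial size mentioned in the report */
#define MAX_BUF     65535  /* assumed upper bound for a DNS message */

/* Returns a malloc'd buffer holding the complete reply and stores its
 * length in *out_len; returns NULL on error or if the reply never fits. */
static unsigned char *lookup_with_retry(query_fn query, void *ctx,
                                        const char *name, int *out_len)
{
    int size = INITIAL_BUF;
    unsigned char *buf = NULL;

    for (;;) {
        unsigned char *tmp = realloc(buf, (size_t)size);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;

        int len = query(name, buf, size, ctx);
        if (len < 0) {               /* lookup failed */
            free(buf);
            return NULL;
        }
        if (len <= size) {           /* whole reply is in the buffer */
            *out_len = len;
            return buf;
        }
        if (size >= MAX_BUF) {       /* already at the cap, give up */
            free(buf);
            return NULL;
        }
        /* Reply was larger than the buffer: retry with the reported size. */
        size = len > MAX_BUF ? MAX_BUF : len;
    }
}

/* Stand-in for a resolver, used only to exercise the loop without a network. */
struct fake_server {
    int reply_len;  /* size of the simulated reply */
    int calls;      /* number of queries received */
};

static int fake_query(const char *name, unsigned char *buf, int buflen, void *ctx)
{
    struct fake_server *srv = ctx;
    (void)name;
    srv->calls++;
    int n = srv->reply_len < buflen ? srv->reply_len : buflen;
    memset(buf, 0xAB, (size_t)n);   /* write only what fits */
    return srv->reply_len;          /* report the full size */
}
```

With a simulated 2600-byte reply, fake_query is called twice: once with the 1500-byte buffer and once with a 2600-byte buffer, after which the whole reply fits.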

Please advise whether we are on the right track, and what additional information is needed to investigate this problem.

Thanks in advance.
Comment 1 Andrew 2025-04-02 07:08:57 UTC
Created attachment 18625
logs with dns and kerberos level = 11
Comment 2 Andrew 2025-04-02 07:13:17 UTC
This problem can be easily reproduced. We did it in our test environment with 2 DCs.
Replication worked fine.
Using samba-tool I added approx. 60 SRV records to _kerberos._udp.TEST.LAB and _kerberos._tcp.TEST.LAB.

After that replication stopped working.
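
As a rough plausibility check against the 1500-byte initial buffer, the size of an SRV response can be estimated from the DNS wire format. The sketch below is a lower bound under stated assumptions, not a measurement: it assumes every owner name in the answer section is a 2-byte compression pointer, that SRV targets are not compressed (as RFC 2782 specifies), that all targets have the same length as the hypothetical host name passed in, and that the server adds no additional-section records. The function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Wire length of an uncompressed domain name: one length byte per label,
 * the label bytes themselves, and a terminating zero byte. */
static size_t wire_name_len(const char *name)
{
    size_t n = strlen(name);
    if (n > 0 && name[n - 1] == '.')
        n--;                 /* ignore a trailing dot */
    if (n == 0)
        return 1;            /* root name is a single zero byte */
    return n + 2;            /* dots become length bytes, plus first length and final zero */
}

/* One SRV record: 2-byte owner pointer + TYPE(2) + CLASS(2) + TTL(4) +
 * RDLENGTH(2), then priority(2) + weight(2) + port(2) + target name. */
static size_t srv_rr_len(const char *target)
{
    return 12 + 6 + wire_name_len(target);
}

/* Whole response: 12-byte header, question (name + QTYPE + QCLASS),
 * and `count` SRV records with equally long targets. */
static size_t srv_response_len(const char *qname, const char *target, size_t count)
{
    return 12 + wire_name_len(qname) + 4 + count * srv_rr_len(target);
}
```

Under these assumptions, 60 SRV records for _kerberos._udp.TEST.LAB with 13-character targets such as the hypothetical dc60.test.lab give 12 + 29 + 60 * 33 = 2021 bytes, which is above 1500, while 2 records give 107 bytes.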
The difference in the DNS logs is as follows.
When replication is working, the DNS requests look like:
  _kerberos._udp.TEST.LAB SRV
  dc1.test.lab A
  dc1.test.lab AAAA
  _kerberos._udp.TEST.LAB SRV
  dc1.test.lab A
  dc1.test.lab AAAA
  _kerberos._tcp.TEST.LAB SRV
  dc1.test.lab A
  dc1.test.lab AAAA

With broken replication, the DNS requests are:
  _kerberos._udp.TEST.LAB SRV
  _kerberos._tcp.TEST.LAB SRV
  _kerberos._http.TEST.LAB SRV (7 times)
  kerberos.TEST.LAB A
  kerberos.TEST.LAB AAAA

I also attached logs from when replication is broken.
Comment 3 Andrew 2025-04-02 16:34:50 UTC
This problem applies to the internal DNS only. No replication problems are observed when using BIND9 with the same number of DCs.