This bug is about a failure to obtain a service ticket for ldap in a specific AD setup.

Steps to reproduce:
1. Install a Server 2003R2 DC, set domain and forest FL to 2003.
2. Define two sites, HQ and BRANCH, in two subnets. The DC is in HQ.
3. adprep the domain for 2008R2 and RODCs.
4. Add a Server 2008R2 DC to the domain, in the HQ site.
5. Add a Server 2008R2 RODC to the BRANCH site.
6. Verify DNS is set up correctly (i.e. the BRANCH SRV records only point to the RODC).
7. Set up a Samba AD member in the subnet of the BRANCH site, and join it to the domain:
   a. "Vanilla" samba 4.3.9
   b. Build with built-in heimdal
   c. No need for smbd/winbindd, just have a proper smb.conf and do a "net ads join"
   d. Verify that "net ads testjoin" works
8. Delete gencache.tdb (that's just to delete the server affinity records; alternatively you can wait one hour).
9. Add the newly-created machine account of the Samba machine to the "Allowed RODC Password Replication Group" group.
10. Replicate everything from the HQ site 2008R2 server to the RODC (this replicates the new machine account and the modification to the group). Also replicate the password of the new machine account using "repadmin /rodcpwdrepl".
11. Now do a "net ads testjoin" and observe that it fails.

Some more info:
1. Packet captures show that the AS exchange is successful, but the TGS request is not answered at all. In a customer setup where the TGS exchange was over TCP (larger TGT), the server simply shut the connection without answering.
2. Kerberos auditing at the server shows failure with code 0x80000003.
3. The key differences from a regular setup seem to be:
   a. The use of a TGT from the special RODC TGT account, not the "regular" krbtgt account.
   b. The RC4-HMAC-MD5 encryption.
   On a different setup with AES/SHA1 encryption (only 2008R2 DCs) there's no problem even if the RODC krbtgt is used. Also, if the TGT is the "regular" one (as when the password is not cached on the RODC or when doing the Kerberos AS exchange against a RWDC), the auth succeeds even with RC4-HMAC-MD5 encryption.
4. Initially I reproduced it with a NAS appliance whose firmware is Samba 4.3.9 plus some customizations. With this NAS, it was possible to downgrade the firmware to one that's based on Samba 3.3.x and Heimdal 1.2.1, and using the same machine account it worked. The only difference I was able to spot was that the newer Samba (heimdal actually) used a subkey in the TGS request, whereas the old one didn't use a subkey with RC4-HMAC-MD5. However, when I hacked the newer Samba not to use a subkey the failure persisted, and the packet captures look identical.
5. I figured that if the packets look the same (except for nonces and such), maybe there's uninitialized stuff in unused fields that is causing trouble, but valgrind does not report any such thing.

I'll add some packet captures shortly.
Created attachment 12065 [details] keytab for decrypting capture files
Created attachment 12066 [details] failing testjoin on samba-4.3.9 based firmware
Created attachment 12067 [details] succeeding testjoin on samba-3.3.16 based firmware (+heimdal 1.2.1), same machine account
Created attachment 12068 [details] failing testjoin on samba-4.3.9 based firmware with hack to disable subkey
IP addresses in the packet captures:
192.168.42.2 - the Samba device
192.168.42.10 - the RODC
192.168.40.10 - the Server 2003R2 DC (PDC role)
192.168.40.11 - the Server 2008R2 RWDC
OK, I believe I've finally found the issue. It has to do with the KVNO being encoded in more than 4 bytes.

This link discusses the special RODC key:
https://blogs.msdn.microsoft.com/openspecification/2011/05/11/notes-on-kerberos-kvno-in-windows-rodc-environment/

A quote from this page:
"It should be noted that if the TGS-REQ is malformed, e.g. Kvno encoded with more than 4 bytes, it is possible that the KDC discards the request without an error indication, for the purpose of mitigating a security attack."

And indeed, we're receiving a TGT whose KVNO is encoded in 4 bytes but sending it back with a KVNO of 5 bytes, whereas older Samba/heimdal sends a KVNO of 4 bytes.

Now I have to figure out how the heck to fix this... Because this is a simple decode/encode issue, I believe it should be possible to write a torture test for this one too.
Doesn't seem like a security issue because this dropping of connection is documented in the above link, and there's no actual crash to be observed on the Windows machine (lsass remains with same PID), so I'm removing the restrictions I placed before.
DER/BER encoding of integers must be a "two's complement binary number equal to the integer value" [X.690 8.3.3]. So afaict 0x9d720001, taken as an unsigned value, must be encoded as the five octets 0x009d720001: the client is right, the server is wrong.

Now, do we have to break it to be compatible or can MS fix their implementation? :)

Not much discussion about this topic on the web [1], but the standard seems clear.

[1] http://stackoverflow.com/questions/12860226/oddity-when-encoding-large-integers-using-asn-1
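To make the 5-byte claim concrete, here is a minimal sketch of DER INTEGER encoding in Python. The helper name `der_integer` is made up for illustration and has nothing to do with heimdal's generated ASN.1 code; it only handles short-form lengths, which is enough for a KVNO.

```python
def der_integer(value: int) -> bytes:
    """Encode value as a DER INTEGER: tag 0x02, length, minimal two's complement."""
    # Find the smallest octet count whose signed range contains value.
    length = 1
    while not (-(1 << (8 * length - 1)) <= value < (1 << (8 * length - 1))):
        length += 1
    content = value.to_bytes(length, "big", signed=True)
    # Short-form length octet only (content shorter than 128 octets).
    return bytes([0x02, length]) + content


# The KVNO from the captures, treated as an unsigned 32-bit value.
encoded = der_integer(0x9D720001)
print(encoded.hex())  # 0205009d720001
```

The leading 0x00 content octet is what keeps the value positive: without it, 0x9d... would read as a negative number.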
(In reply to Ralph Böhme from comment #8)
Yes. According to the Kerberos RFC, the KVNO is an unsigned integer, and under DER a value with the high bit set requires 5 bytes.

In the meantime I've verified that moving to 4 bytes fixes the issue. I'm attaching a patch which demonstrates this. The patch is probably wrong as a fix; it just demonstrates the issue.

I'll contact dochelp and see what they have to say about it.

A possible workaround without sacrificing compliance is to treat the whole TGT as a BLOB we don't modify, which is essentially what it is. We can parse it to extract some useful info, but when putting it in a TGS request, just copy the bytes we received.
Created attachment 12074 [details] POC to demonstrate that a 4-byte KVNO fixes the issue.
I dug the following comment out of MIT Kerberos (src/lib/krb5/asn.1/asn1_k_encode.c):

/*
 * krb5_kvno is defined as unsigned int, but historically (MIT krb5 through 1.6
 * in the encoder, and through 1.10 in the decoder) we treat it as signed, in
 * violation of RFC 4120.  kvno values large enough to be problematic are only
 * likely to be seen with Windows read-only domain controllers, which overload
 * the high 16-bits of kvno values for krbtgt principals.  Since Windows
 * encodes kvnos as signed 32-bit values, for interoperability it's best if we
 * do the same.
 */

This probably means my "POC" patch is the right direction. Also inside heimdal, krb5_kvno type is int32_t.
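As a rough illustration of what "treat it as signed" means for the bytes on the wire, the following Python sketch reinterprets the 32-bit KVNO as signed before computing the minimal two's complement content octets. The function names are hypothetical and this is not the MIT or heimdal code, just the arithmetic.

```python
import struct


def as_signed32(kvno: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit value."""
    return struct.unpack(">i", struct.pack(">I", kvno))[0]


def integer_content(value: int) -> bytes:
    """Minimal two's complement content octets of a DER INTEGER."""
    magnitude = value if value >= 0 else ~value
    length = magnitude.bit_length() // 8 + 1
    return value.to_bytes(length, "big", signed=True)


kvno = 0x9D720001
print(integer_content(kvno).hex())               # 009d720001 (unsigned: 5 octets)
print(integer_content(as_signed32(kvno)).hex())  # 9d720001   (signed: 4 octets)
```

The signed form carries the same 32 bits, only without the leading zero octet, which is the 4-byte encoding Windows expects.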
Dochelp confirms the Windows deviation from RFC 4120:

"Windows KILE key version numbers are signed 32-bit integers. Windows KDC does not accept 5 bytes Kvno and does not return errors on “malformed” packets as that can be used to setup a DoS flood attack. The first 16 bits of the kvno, including the most significant bit, are an unsigned 16-bit number that SHOULD identify the RODC (if it’s RODC). The remaining 16 bits SHOULD be the version number of the key. KILE has a deviation from [RFC4120] which defines kvno as Uint32."

So it seems that, contrary to the original problem description, reproducing the issue does not depend on having the Win2003 functional level (and RC4-HMAC-MD5 encryption). Rather, it depends on getting an RODC ID of 0x8000 or higher, i.e. one with the most significant bit set. The ID is stored in the msDS-SecondaryKrbTgtNumber attribute, but I'm not sure it's easy to dictate it while joining an RODC to the domain.

I'll modify the bug heading accordingly.
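The layout dochelp describes can be sketched in a few lines of Python. The names `split_kvno` and `needs_five_octets` are invented for this example; the only facts used are the 16/16 split quoted above and the DER rule that a positive integer with its top bit set needs an extra leading octet.

```python
def split_kvno(kvno: int) -> tuple[int, int]:
    """Split a 32-bit KVNO into (RODC id, key version) per the quoted layout."""
    rodc_id = (kvno >> 16) & 0xFFFF   # high 16 bits identify the RODC
    key_version = kvno & 0xFFFF       # low 16 bits are the key version
    return rodc_id, key_version


def needs_five_octets(kvno: int) -> bool:
    """True if an unsigned DER encoding of this KVNO takes 5 content octets."""
    return kvno >= 0x80000000


rodc_id, key_version = split_kvno(0x9D720001)
print(hex(rodc_id), key_version)      # 0x9d72 1
print(needs_five_octets(0x9D720001))  # True
print(needs_five_octets(0x12340001))  # False
```

So whether a given RODC triggers the problem comes down to whether the top bit of its 16-bit ID happens to be set.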
Created attachment 12086 [details] git-am fix for 4.4.next and 4.3.next
Reassigning to Karolin for inclusion in 4.3 and 4.4.
(In reply to Ralph Böhme from comment #14) Pushed to autobuild-v4-[3|4]-test.
(In reply to Karolin Seeger from comment #15) Pushed to both branches. Closing our bug report. Thanks!