Bug 12127 - Winbindd ldap queries to AD server getting timedout in 4.3.11
Summary: Winbindd ldap queries to AD server getting timedout in 4.3.11
Status: RESOLVED INVALID
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Winbind (show other bugs)
Version: 4.3.11
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-08-08 06:26 UTC by Hemanth
Modified: 2016-10-13 07:33 UTC (History)
2 users (show)

See Also:


Attachments
child winbindd log. (72.58 KB, text/rtf)
2016-08-08 06:26 UTC, Hemanth
no flags Details
winbindd log in plain text. (72.24 KB, text/plain)
2016-08-08 21:19 UTC, Hemanth
no flags Details
Packet capture. (128.34 KB, application/vnd.tcpdump.pcap)
2016-08-08 22:00 UTC, Hemanth
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Hemanth 2016-08-08 06:26:45 UTC
Created attachment 12323 [details]
child winbindd log.

Hi,

We are currently with 4.3.5 stack. Recently, we have started unit testing the 4.3.11 changes as we wanted to include the BAD LOCK and other security fixes.

With 4.3.11 we are seeing winbindd issues. Logs indicating that LDAP queries to AD server getting timed out. Eventually all the winbindd requests which requires a lookup fails. 

Following the log messages are seen frequently for domain child.

[2016/08/07 22:50:11.390503,  3, pid=10391, effective(0, 0), real(0, 0)] ../lib/krb5_wrap/krb5_samba.c:2502(kerberos_get_principal_from_service_hostname)
  kerberos_get_principal_from_service_hostname: cannot get realm from, desthost drtnasdcprod01.testdom1.test.exp.domain.com or default ccache. Using default smb.conf realm testdom1.test.exp.domain.com

[2016/08/07 22:50:11.324867,  3, pid=10391, effective(0, 0), real(0, 0)] ../source3/libads/ldap.c:981(ads_do_paged_search_args)
  ads_do_paged_search_args: ldap_search_with_timeout((objectclass=*)) -> Time limit exceeded
[2016/08/07 22:50:11.324938,  1, pid=10391, effective(0, 0), real(0, 0)] ../source3/libads/ldap_utils.c:91(ads_do_search_retry_internal)
  Reducing LDAP page size from 1000 to 500 due to IO_TIMEOUT

These suggest that LDAP queries made to DC are actually getting timedout. 

Also as a side effect of this, we could see lot of(seen max of 32 total) child winbindd processes getting forked even though there are no trusted domains.

By attaching with gdb shows that this child domain process is in retry loop.

#0  0x00007f5df9ec1248 in poll () from /lib64/libc.so.6
#1  0x00007f5dfa186b1d in ?? () from /lib64/libldap-2.4.so.2
#2  0x00007f5dfa189e3f in ldap_pvt_search_s () from /lib64/libldap-2.4.so.2
#3  0x00007f5dfa189ed0 in ldap_search_ext_s () from /lib64/libldap-2.4.so.2
#4  0x00007f5e01a1428a in ldap_search_with_timeout (ld=0x563dde44e2b0, base=0x563dde47f0e0 "", scope=0, filter=0x563dde47fba0 "(objectclass=*)", attrs=0x563dde473e10, 
    attrsonly=0, sctrls=0x7fff254dd590, cctrls=0x0, sizelimit=0, res=0x7fff254dd908) at ../source3/libads/ldap.c:155
#5  0x00007f5e01a169c1 in ads_do_paged_search_args (ads=0x563dde480230, bind_path=0x563dde480490 "", scope=0, expr=0x7f5e01a343ce "(objectclass=*)", attrs=0x7fff254dd8d0, 
    args=0x0, res=0x7fff254dd908, count=0x7fff254dd75c, cookie=0x7fff254dd750) at ../source3/libads/ldap.c:971
#6  0x00007f5e01a16e3d in ads_do_search_all_args (ads=0x563dde480230, bind_path=0x563dde480490 "", scope=0, expr=0x7f5e01a343ce "(objectclass=*)", attrs=0x7fff254dd8d0, 
    args=0x0, res=0x7fff254dd908) at ../source3/libads/ldap.c:1075
#7  0x00007f5e01a2a96e in ads_do_search_retry_internal (ads=0x563dde480230, bind_path=0x7f5e01a33b2e "", scope=0, expr=0x7f5e01a343ce "(objectclass=*)", attrs=0x7fff254dd8d0, 
    args=0x0, res=0x7fff254dd908) at ../source3/libads/ldap_utils.c:77
#8  0x00007f5e01a2af5b in ads_do_search_retry (ads=0x563dde480230, bind_path=0x7f5e01a33b2e "", scope=0, expr=0x7f5e01a343ce "(objectclass=*)", attrs=0x7fff254dd8d0, 
    res=0x7fff254dd908) at ../source3/libads/ldap_utils.c:144
#9  0x00007f5e01a1ba86 in ads_USN (ads=0x563dde480230, usn=0x563dde463e90) at ../source3/libads/ldap.c:2884
#10 0x0000563ddc3b7d32 in sequence_number (domain=0x563dde463d60, seq=0x563dde463e90) at ../source3/winbindd/winbindd_ads.c:1506
#11 0x0000563ddc3b27db in sequence_number (domain=0x563dde463d60, seq=0x563dde463e90) at ../source3/winbindd/winbindd_reconnect_ads.c:248
#12 0x0000563ddc389396 in refresh_sequence_number (domain=0x563dde463d60, force=false) at ../source3/winbindd/winbindd_cache.c:560
#13 0x0000563ddc389a5b in wcache_fetch (cache=0x563dde449e90, domain=0x563dde463d60, format=0x563ddc43ac71 "NS/%s/%s") at ../source3/winbindd/winbindd_cache.c:713
#14 0x0000563ddc38d699 in wcache_name_to_sid (domain=0x563dde463d60, domain_name=0x563dde454fc0 "TESTDOM1", name=0x563dde470ea0 "ROOT", sid=0x563dde44afe0, 
    type=0x563dde44af70) at ../source3/winbindd/winbindd_cache.c:1822
#15 0x0000563ddc38d7fc in name_to_sid (domain=0x563dde463d60, mem_ctx=0x563dde462780, domain_name=0x563dde454fc0 "TESTDOM1", name=0x563dde470ea0 "ROOT", flags=8, 
    sid=0x563dde44afe0, type=0x563dde44af70) at ../source3/winbindd/winbindd_cache.c:1855
#16 0x0000563ddc3c0b5b in _wbint_LookupName (p=0x7fff254ddd20, r=0x563dde470e00) at ../source3/winbindd/winbindd_dual_srv.c:114
#17 0x0000563ddc429257 in api_wbint_LookupName (p=0x7fff254ddd20) at default/librpc/gen_ndr/srv_winbind.c:316
#18 0x0000563ddc3c05b8 in winbindd_dual_ndrcmd (domain=0x563dde463d60, state=0x7fff254de078) at ../source3/winbindd/winbindd_dual_ndr.c:322
#19 0x0000563ddc3bc0d6 in child_process_request (child=0x563dde463fb0, state=0x7fff254de078) at ../source3/winbindd/winbindd_dual.c:518


Looking at the history of changes, we suspect that the behavior changes are due to fixes made for CVE-2016-2112. Looks like now we have strong restrictions for sasl bind requests. May be I am guessing here. 

Not really sure whats going wrong here. But if we replace with 4.3.5 stack on same node, everything works fine.

Will attach complete logs for child winbindd process performing these lookups.
Comment 1 Hemanth 2016-08-08 06:28:55 UTC
Actually some one else posted similar issue at:
https://access.redhat.com/solutions/2290811
Comment 3 Uri Simchoni 2016-08-08 19:43:14 UTC
(In reply to Andreas Schneider from comment #2)
...But both those bugs are fixed now...
Also the timeout is with ads_USN() which is a small query.

It looks like the ldap bind is successful, yet the rather simple query times out. In general, Samba-4-3-11 successfully makes ldap queries to Windows DCs. So this must be specific to configuration, to the DC type or setup, or to the merging (since admittedly this is not "vanilla" Samba).

One thing I suddenly noticed is that the ldap communications are signed but not sealed (not encrypted). The default conf is "signed", but in pre-security-patch samba, "signed" still meant "sealed", and now it seems to follow the conf... You can try setting client ldap sasl wrapping = seal and see if it fixes the issue.

Another behavioral change I'm aware of is that the new code, when using GSSAPI to create a secure session, requests delegation (which effectively translates to an additional TGS request for another ticket and inclusion of that ticket inside the sasl bind request). See this thread https://lists.samba.org/archive/samba-technical/2016-July/114907.html for discussion and the place in the code that requests it (look for GSS_C_DELEG_FLAG).
Comment 4 Jeremy Allison 2016-08-08 20:41:04 UTC
Comment on attachment 12323 [details]
child winbindd log.

rtf format Hemanth - really ? Can get we it in plain text please :-).
Comment 5 Jeremy Allison 2016-08-08 20:58:58 UTC
Can you also get us a wireshark trace of all traffic between winbindd and the specific AD-LDAP server ? That should help a lot in tracking this down.
Comment 6 Jeremy Allison 2016-08-08 20:59:03 UTC
Can you also get us a wireshark trace of all traffic between winbindd and the specific AD-LDAP server ? That should help a lot in tracking this down.
Comment 7 Hemanth 2016-08-08 21:19:28 UTC
Created attachment 12326 [details]
winbindd log in plain text.

Attaching the log in plain text.
Comment 8 Hemanth 2016-08-08 21:20:03 UTC
(In reply to Jeremy Allison from comment #6)
Sure Jeremy. I will capture one and attach.
Comment 9 Hemanth 2016-08-08 22:00:41 UTC
Created attachment 12327 [details]
Packet capture.

Attaching the packet capture showing the communication with DC.
Comment 10 Hemanth 2016-08-08 22:06:26 UTC
(In reply to Uri Simchoni from comment #3)
Thanks a lot Uri! 

Infact we had "client ldap sasl wrapping" set to plain. This has overwritten the new default option(sign). I have removed this explicit config to make "sign" as current setting. So far things are working fine. I will monitor for some more time and update the status.
Comment 11 Hemanth 2016-08-08 23:28:30 UTC
OK. Setting "client ldap sasl wrapping" from "plain" to "sign" seems to be fixing the issue. No hangs so far.

Uri,
Where can I check if my ldap requests are actually signed. As mentioned earlier, we had the sasl wrapping is set to "plain". So it shouldn't have signed the packets. 

Capture shows the ldap query that is getting timed out shows as:

Lightweight Directory Access Protocol
    LDAPMessage searchRequest(5) "<ROOT>" baseObject
        messageID: 5
        protocolOp: searchRequest (3)
            searchRequest
                baseObject: 
                scope: baseObject (0)
                derefAliases: neverDerefAliases (0)
                sizeLimit: 0
                timeLimit: 15
                typesOnly: False
                Filter: (objectclass=*)
                    filter: present (7)
                        present: objectclass
                attributes: 1 item
                    AttributeDescription: highestCommittedUSN
        controls: 2 items
            Control
                controlType: 1.2.840.113556.1.4.1339 (LDAP_SERVER_DOMAIN_SCOPE_OID)
                controlValue: <MISSING>
            Control
                controlType: 1.2.840.113556.1.4.319 (pagedResultsControl)
                criticality: True
                SearchControlValue
                    size: 500
                    cookie: <MISSING>


I couldn't find any sign/seal option in this. Does this missing cookies and control value fields were the concern?

Would like to know the final recommendation for this. Should I use the default option("sign") or "seal"?
Comment 12 Uri Simchoni 2016-08-09 04:03:18 UTC
(In reply to Hemanth from comment #10)
AFAIK, the client default has been "sign" since 4.2.0, and the security release hardens the server requirements (look at the CVE and the new "ldap server require strong auth" parameter). Was this setup working against a Samba AD?
Comment 13 Uri Simchoni 2016-08-09 04:31:52 UTC
(In reply to Hemanth from comment #11)
When the ldap message is signed or sealed, it is transmitted on top of the GSSAPI wrapping protocol.

A signed message looks like this:
Lightweight Directory Access Protocol
    SASL Buffer Length: 114
    SASL Buffer
        GSS-API Generic Security Service Application Program Interface
            krb5_blob: 050405ff000c000c000000004edf574dd5d541bee89ec04c...
                krb5_tok_id: KRB_TOKEN_CFX_WRAP (0x0405)
                krb5_cfx_flags: 0x05
                    .... .1.. = AcceptorSubkey: Set
                    .... ..0. = Sealed: Not set
                    .... ...1 = SendByAcceptor: Set
                krb5_filler: ff
                krb5_cfx_ec: 12
                krb5_cfx_rrc: 12
                krb5_cfx_seq: 1323259725
                krb5_sgn_cksum: d5d541bee89ec04c99593f00
        GSS-API payload (86 bytes)
            LDAPMessage searchResEntry(5) "<ROOT>" [3 results]
                messageID: 5
                protocolOp: searchResEntry (4)
                    searchResEntry
                        objectName: 
                        attributes: 1 item
                            PartialAttributeList item highestCommittedUSN

Note the presence of the GSSAPI stuff and that the "seal" bit is unset, and the message can be read.

If the ldap is sealed, it will look like this:
Lightweight Directory Access Protocol
    SASL Buffer Length: 1798
    SASL Buffer
        GSS-API Generic Security Service Application Program Interface
            krb5_blob: 050407ff0000001c0000000079a39dda3658ebc5171bc78d...
                krb5_tok_id: KRB_TOKEN_CFX_WRAP (0x0405)
                krb5_cfx_flags: 0x07
                    .... .1.. = AcceptorSubkey: Set
                    .... ..1. = Sealed: Set
                    .... ...1 = SendByAcceptor: Set
                krb5_filler: ff
                krb5_cfx_ec: 0
                krb5_cfx_rrc: 28
                krb5_cfx_seq: 2040765914
                krb5_sgn_cksum: 3658ebc5171bc78d4b4930f24b678b3ce572a7cb016a2920...
        GSS-API payload (60 bytes)
            LDAPMessage
                BER Error: Sequence expected but class:UNIVERSAL(0) primitive tag:5 was unexpected
                    [Expert Info (Warn/Malformed): BER Error: Sequence expected]
                        [BER Error: Sequence expected]
                        [Severity level: Warn]
                        [Group: Malformed]

Note that the "sealed" bit is set and that the message is encrypted and cannot be parsed. If you use a keytab (see https://wiki.samba.org/index.php/Keytab_Extraction) you will be able to decrypt it.

I don't know of a recommendation of whether to use "sign" or "seal". It's a matter of the security policy of the AD server, I guess. Encrypted packets add a headache of having to decrypt them when you're analyzing them, but with wireshark that's not very difficult.
Comment 14 Hemanth 2016-08-09 18:01:17 UTC
(In reply to Uri Simchoni from comment #12)
For some historical reasons this setting has been changed to plain on our stack. We haven't tested against samba AD. Yeah. I did see that strong server side requirements for ldap but ignored them as we are not using samba AD server.

And thanks for shedding light of ldap sign and seal options.