Bug 8771 - Winbind takes up to 20 minutes to change from DC 1 to DC 2 and keeps in the meantime to respond NT_STATUS_IO_TIMEOUT
Winbind takes up to 20 minutes to change from DC 1 to DC 2 and keeps in the m...
Status: RESOLVED FIXED
Product: Samba 3.6
Classification: Unclassified
Component: Winbind
unspecified
All All
: P5 normal
: ---
Assigned To: Matthieu Patou
Samba QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-02-22 09:08 UTC by Matthieu Patou
Modified: 2012-05-25 19:44 UTC (History)
4 users (show)

See Also:


Attachments
logs of winbind (63.39 KB, application/x-compressed-tar)
2012-02-22 09:22 UTC, Matthieu Patou
no flags Details
trace between DC and samba (75.51 KB, application/cap)
2012-02-22 09:26 UTC, Matthieu Patou
no flags Details
Possible fix (849 bytes, patch)
2012-02-22 10:02 UTC, Matthieu Patou
no flags Details
Clean patch that applies to master and 3.6.3 branch (1.28 KB, patch)
2012-02-24 23:10 UTC, Matthieu Patou
no flags Details
Clean patch that applies to 3.5.x branch (1.36 KB, patch)
2012-02-24 23:15 UTC, Matthieu Patou
jra: review+
Details
git-am fix for master and 3.6.x. (1.61 KB, patch)
2012-02-25 00:16 UTC, Jeremy Allison
no flags Details
Updated version of the patch (2.29 KB, patch)
2012-02-26 20:18 UTC, Matthieu Patou
jra: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Matthieu Patou 2012-02-22 09:08:20 UTC
We have a forest with 2 DCs (ROOT & s2_w2k8r2) in the domain (NIN) where samba is a member. Samba is in the same site has s2_w2k8r2 so it will by default use this DC after a fresh start.
I checked that a ntlm_auth (both in "normal" and in "crap" mode) is working correctly, then I remove the network plug of s2_w2k8r2 and try to do a crap request.
After ~20 seconds I have a NT_STATUS_IO_TIMEOUT response (expected).
During the next 20 minutes ntlm_auth in both mode will fail, wbinfo --ping-dc will fail.

The problem occurs only if the first request after the network between s2_w2k8r2 is a PAM_CRAP.

I'm using this configuration:

[global]
        log level = 11
    realm = NIN.ASGLAB.JUNIPER.NET
    workgroup = NIN
    netbios name = ubuntu
    security = ads
    client ldap sasl wrapping = sign
    winbind use default domain = yes
    name resolve order = hosts bcast
    allow trusted domains = yes
    winbind expand groups = 1
    client ntlmv2 auth = yes
    kerberos method = dedicated keytab
    winbind cache time = 60
    idmap uid = 10000-20000
    idmap gid = 10000-20000
    machine password timeout = 360


SendQ for the winbind process is growing.

sudo netstat -anp | grep 445
tcp        0   1689 10.91.0.202:48443       10.64.108.92:445        ESTABLISHED 8056/winbindd   
tcp        0      0 10.91.0.202:48446       10.64.108.92:445        ESTABLISHED 8055/winbindd
Comment 1 Matthieu Patou 2012-02-22 09:21:58 UTC
The problem didn't occur if ntlm_auth in "normal" mode is used.
In this case the first call return also NT_STATUS_IO_TIMEOUT, then the winbind for this domain is set offline. At the next request a search for all DC is made a new connection to the other DC is established.
Comment 2 Matthieu Patou 2012-02-22 09:22:57 UTC
Created attachment 7335 [details]
logs of winbind
Comment 3 Matthieu Patou 2012-02-22 09:26:37 UTC
Created attachment 7336 [details]
trace between DC and samba
Comment 4 Matthieu Patou 2012-02-22 09:49:27 UTC
If I attach a debugger and hit Ctrl-C while ntlm_auth is waiting here is the stacktrace:

#0  0x00007f4135665f58 in poll () from /lib/libc.so.6
#1  0x00007f413b4f04d1 in sys_poll (fds=0x1153f90, num_fds=1, timeout=10000) at ../lib/util/select.c:104
#2  0x00007f413ac4fe36 in s3_event_loop_once (ev=0x1151ed0, location=0x7f413aea1c10 "../lib/tevent/tevent_req.c:210") at ../source3/lib/events.c:341
#3  0x00007f413ae9afea in _tevent_loop_once (ev=0x1151ed0, location=0x7f413aea1c10 "../lib/tevent/tevent_req.c:210") at ../lib/tevent/tevent.c:504
#4  0x00007f413ae9cbcb in tevent_req_poll (req=0x114fd00, ev=0x1151ed0) at ../lib/tevent/tevent_req.c:210
#5  0x00007f413b731a6b in dcerpc_binding_handle_call (h=0x113da60, object=0x0, table=0x7f4137e119a0, opnum=39, r_mem=0x111ee60, r_ptr=0x7fff797da0d0)
    at ../librpc/rpc/binding_handle.c:542
#6  0x00007f413a4d5ed6 in dcerpc_netr_LogonSamLogonEx_r (h=0x113da60, mem_ctx=0x111ee60, r=0x7fff797da0d0) at default/librpc/gen_ndr/ndr_netlogon_c.c:8010
#7  0x00007f413a4d639d in dcerpc_netr_LogonSamLogonEx (h=0x113da60, mem_ctx=0x111ee60, _server_name=0x1152110 "\\\\root.nin.asglab.juniper.net", 
    _computer_name=0x1115d80 "UBUNTU", _logon_level=NetlogonNetworkInformation, _logon=0x11262f0, _validation_level=6, _validation=0x7fff797da1e0, 
    _authoritative=0x7fff797da22f "\"P\354\023\001", _flags=0x7fff797da20c, result=0x7fff797da220) at default/librpc/gen_ndr/ndr_netlogon_c.c:8163
#8  0x00007f41391aebc2 in rpccli_netlogon_sam_network_logon_ex (cli=0x113ec50, mem_ctx=0x111ee60, logon_parameters=2080, 
    server=0x111ebe0 "root.nin.asglab.juniper.net", username=0x7fff797db4e4 "mat", domain=0x7fff797db5e4 "NIN", workstation=0x7fff797db8ec "UBUNTU", 
    chal=0x7fff797db4d8 "\207hƺ\033ڧ\252 \b", validation_level=6, lm_response=..., nt_response=..., info3=0x7fff797da438) at ../source3/rpc_client/cli_netlogon.c:557
#9  0x000000000044a90d in winbind_samlogon_retry_loop (domain=0x111e850, mem_ctx=0x111ee60, logon_parameters=2080, server=0x111ebe0 "root.nin.asglab.juniper.net", 
    username=0x7fff797db4e4 "mat", domainname=0x7fff797db5e4 "NIN", workstation=0x7fff797db8ec "UBUNTU", chal=0x7fff797db4d8 "\207hƺ\033ڧ\252 \b", lm_response=..., 
    nt_response=..., info3=0x7fff797da438) at ../source3/winbindd/winbindd_pam.c:1235
#10 0x000000000044c936 in winbindd_dual_pam_auth_crap (domain=0x111e850, state=0x7fff797da500) at ../source3/winbindd/winbindd_pam.c:1827
#11 0x0000000000464ba9 in child_process_request (child=0x111e3b0, state=0x7fff797da500) at ../source3/winbindd/winbindd_dual.c:440
#12 0x0000000000468479 in fork_domain_child (child=0x111e3b0) at ../source3/winbindd/winbindd_dual.c:1553
#13 0x0000000000463c58 in wb_child_request_trigger (req=0x113fd60, private_data=0x0) at ../source3/winbindd/winbindd_dual.c:145
#14 0x00007f413ae9c1c6 in tevent_queue_immediate_trigger (ev=0x1117350, im=0x111f280, private_data=0x111ef10) at ../lib/tevent/tevent_queue.c:144
#15 0x00007f413ae9be58 in tevent_common_loop_immediate (ev=0x1117350) at ../lib/tevent/tevent_immediate.c:135
#16 0x00007f413ac4f549 in run_events_poll (ev=0x1117350, pollrtn=0, pfds=0x0, num_pfds=0) at ../source3/lib/events.c:197
#17 0x00007f413ac4fdcc in s3_event_loop_once (ev=0x1117350, location=0x4d0b60 "../source3/winbindd/winbindd.c:1443") at ../source3/lib/events.c:331
#18 0x00007f413ae9afea in _tevent_loop_once (ev=0x1117350, location=0x4d0b60 "../source3/winbindd/winbindd.c:1443") at ../lib/tevent/tevent.c:504
#19 0x000000000043485c in main (argc=3, argv=0x7fff797dc158, envp=0x7fff797dc178) at ../source3/winbindd/winbindd.c:1443



My changeset is 7104ce3220f9c7b301d1c638d071be2a86755c7e.
Comment 5 Matthieu Patou 2012-02-22 10:02:34 UTC
Created attachment 7337 [details]
Possible fix

This patch remove connection if the return status is timeout.

Doing so is similar to offline the DC (that's what is done in winbindd_dual_pam_auth_samlogon) or in trusted_domains() of ./source3/winbindd/winbindd_cache.c
Comment 6 Michael Adam 2012-02-22 21:53:08 UTC
Volker
(Cc Jeremy, Metze)

assigned the but to you.
Could you comment on the patch?

Cheers - Michael
Comment 7 Volker Lendecke 2012-02-24 06:47:43 UTC
Comment on attachment 7337 [details]
Possible fix

While the patch looks as if it makes sense, it does not apply to either master, v3-5-test or v3-6-test. What version is this patch against? And, would it be possible to get this in "git format-patch" form?
Comment 8 Matthieu Patou 2012-02-24 23:10:31 UTC
Created attachment 7345 [details]
Clean patch that applies to master and 3.6.3 branch
Comment 9 Matthieu Patou 2012-02-24 23:15:32 UTC
Created attachment 7346 [details]
Clean patch that applies to 3.5.x branch

It's a slightly modified version of the attachment #7345 [details] so that it compiles in 3.5.x branch, the variable domain needs to be renamed to contact_domain.
Comment 10 Matthieu Patou 2012-02-24 23:16:26 UTC
Comment on attachment 7346 [details]
Clean patch that applies to 3.5.x branch

Added two reviews due to rules for backports to "old" branch
Comment 11 Jeremy Allison 2012-02-25 00:16:00 UTC
Created attachment 7347 [details]
git-am fix for master and 3.6.x.

I think this is a better place to put that code in master and 3.6.x. Matthieu can you review please ?
Comment 12 Matthieu Patou 2012-02-26 20:04:28 UTC
Comment on attachment 7347 [details]
git-am fix for master and 3.6.x.

Jeremy this patch is not sufficient, we can face a timeout in cm_connect_netlogon().

I'm updating this patch with a version that should address this problem.
Comment 13 Matthieu Patou 2012-02-26 20:18:47 UTC
Created attachment 7348 [details]
Updated version of the patch

This is an updated version of Jeremy's patch so that we do the cm_connect_netlogon() in the retry loop() but we not only invalidate the connection if the timeout occurs on the netr_sam_logonex() but also when trying to connect to a netlogon pipe.

This has been tested in my env and it solves the connection hanged problem.
Comment 14 Jeremy Allison 2012-02-27 20:37:16 UTC
Comment on attachment 7348 [details]
Updated version of the patch

LGTM. Karolin please apply to 3.5.next and 3.6.next.
Comment 15 Karolin Seeger 2012-02-28 19:07:56 UTC
(In reply to comment #14)
> Comment on attachment 7348 [details]
> Updated version of the patch
> 
> LGTM. Karolin please apply to 3.5.next and 3.6.next.

Pushed to v3-6-test.
Does not apply cleanly to v3-5-test.
Please confirm that https://attachments.samba.org/attachment.cgi?id=7346
should be picked or provide an updated version.

Thanks!

Re-assigning to Matthieu.
Comment 16 Jeremy Allison 2012-02-28 19:36:53 UTC
Comment on attachment 7346 [details]
Clean patch that applies to 3.5.x branch

LGTM. Sorry for the confusion. Karolin, this attachment (#7346) is the correct patch for 3.5.next. Please apply and then the bug can be closed.

Jeremy.
Comment 17 Karolin Seeger 2012-02-29 19:59:24 UTC
(In reply to comment #16)
> Comment on attachment 7346 [details]
> Clean patch that applies to 3.5.x branch
> 
> LGTM. Sorry for the confusion. Karolin, this attachment (#7346) is the correct
> patch for 3.5.next. Please apply and then the bug can be closed.
> 
> Jeremy.

Thanks for clarification, Jeremy!
Pushed to v3-5-test.
Closing out bug report.

Thanks!