We have a forest with 2 DCs (ROOT & s2_w2k8r2) in the domain (NIN) where samba is a member. Samba is in the same site has s2_w2k8r2 so it will by default use this DC after a fresh start. I checked that a ntlm_auth (both in "normal" and in "crap" mode) is working correctly, then I remove the network plug of s2_w2k8r2 and try to do a crap request. After ~20 seconds I have a NT_STATUS_IO_TIMEOUT response (expected). During the next 20 minutes ntlm_auth in both mode will fail, wbinfo --ping-dc will fail. The problem occurs only if the first request after the network between s2_w2k8r2 is a PAM_CRAP. I'm using this configuration: [global] log level = 11 realm = NIN.ASGLAB.JUNIPER.NET workgroup = NIN netbios name = ubuntu security = ads client ldap sasl wrapping = sign winbind use default domain = yes name resolve order = hosts bcast allow trusted domains = yes winbind expand groups = 1 client ntlmv2 auth = yes kerberos method = dedicated keytab winbind cache time = 60 idmap uid = 10000-20000 idmap gid = 10000-20000 machine password timeout = 360 SendQ for the winbind process is growing. sudo netstat -anp | grep 445 tcp 0 1689 10.91.0.202:48443 10.64.108.92:445 ESTABLISHED 8056/winbindd tcp 0 0 10.91.0.202:48446 10.64.108.92:445 ESTABLISHED 8055/winbindd
The problem didn't occur if ntlm_auth in "normal" mode is used. In this case the first call return also NT_STATUS_IO_TIMEOUT, then the winbind for this domain is set offline. At the next request a search for all DC is made a new connection to the other DC is established.
Created attachment 7335 [details] logs of winbind
Created attachment 7336 [details] trace between DC and samba
If I attach a debugger and hit Ctrl-C while ntlm_auth is waiting here is the stacktrace: #0 0x00007f4135665f58 in poll () from /lib/libc.so.6 #1 0x00007f413b4f04d1 in sys_poll (fds=0x1153f90, num_fds=1, timeout=10000) at ../lib/util/select.c:104 #2 0x00007f413ac4fe36 in s3_event_loop_once (ev=0x1151ed0, location=0x7f413aea1c10 "../lib/tevent/tevent_req.c:210") at ../source3/lib/events.c:341 #3 0x00007f413ae9afea in _tevent_loop_once (ev=0x1151ed0, location=0x7f413aea1c10 "../lib/tevent/tevent_req.c:210") at ../lib/tevent/tevent.c:504 #4 0x00007f413ae9cbcb in tevent_req_poll (req=0x114fd00, ev=0x1151ed0) at ../lib/tevent/tevent_req.c:210 #5 0x00007f413b731a6b in dcerpc_binding_handle_call (h=0x113da60, object=0x0, table=0x7f4137e119a0, opnum=39, r_mem=0x111ee60, r_ptr=0x7fff797da0d0) at ../librpc/rpc/binding_handle.c:542 #6 0x00007f413a4d5ed6 in dcerpc_netr_LogonSamLogonEx_r (h=0x113da60, mem_ctx=0x111ee60, r=0x7fff797da0d0) at default/librpc/gen_ndr/ndr_netlogon_c.c:8010 #7 0x00007f413a4d639d in dcerpc_netr_LogonSamLogonEx (h=0x113da60, mem_ctx=0x111ee60, _server_name=0x1152110 "\\\\root.nin.asglab.juniper.net", _computer_name=0x1115d80 "UBUNTU", _logon_level=NetlogonNetworkInformation, _logon=0x11262f0, _validation_level=6, _validation=0x7fff797da1e0, _authoritative=0x7fff797da22f "\"P\354\023\001", _flags=0x7fff797da20c, result=0x7fff797da220) at default/librpc/gen_ndr/ndr_netlogon_c.c:8163 #8 0x00007f41391aebc2 in rpccli_netlogon_sam_network_logon_ex (cli=0x113ec50, mem_ctx=0x111ee60, logon_parameters=2080, server=0x111ebe0 "root.nin.asglab.juniper.net", username=0x7fff797db4e4 "mat", domain=0x7fff797db5e4 "NIN", workstation=0x7fff797db8ec "UBUNTU", chal=0x7fff797db4d8 "\207hƺ\033ڧ\252 \b", validation_level=6, lm_response=..., nt_response=..., info3=0x7fff797da438) at ../source3/rpc_client/cli_netlogon.c:557 #9 0x000000000044a90d in winbind_samlogon_retry_loop (domain=0x111e850, mem_ctx=0x111ee60, logon_parameters=2080, server=0x111ebe0 "root.nin.asglab.juniper.net", username=0x7fff797db4e4 "mat", domainname=0x7fff797db5e4 "NIN", workstation=0x7fff797db8ec "UBUNTU", chal=0x7fff797db4d8 "\207hƺ\033ڧ\252 \b", lm_response=..., nt_response=..., info3=0x7fff797da438) at ../source3/winbindd/winbindd_pam.c:1235 #10 0x000000000044c936 in winbindd_dual_pam_auth_crap (domain=0x111e850, state=0x7fff797da500) at ../source3/winbindd/winbindd_pam.c:1827 #11 0x0000000000464ba9 in child_process_request (child=0x111e3b0, state=0x7fff797da500) at ../source3/winbindd/winbindd_dual.c:440 #12 0x0000000000468479 in fork_domain_child (child=0x111e3b0) at ../source3/winbindd/winbindd_dual.c:1553 #13 0x0000000000463c58 in wb_child_request_trigger (req=0x113fd60, private_data=0x0) at ../source3/winbindd/winbindd_dual.c:145 #14 0x00007f413ae9c1c6 in tevent_queue_immediate_trigger (ev=0x1117350, im=0x111f280, private_data=0x111ef10) at ../lib/tevent/tevent_queue.c:144 #15 0x00007f413ae9be58 in tevent_common_loop_immediate (ev=0x1117350) at ../lib/tevent/tevent_immediate.c:135 #16 0x00007f413ac4f549 in run_events_poll (ev=0x1117350, pollrtn=0, pfds=0x0, num_pfds=0) at ../source3/lib/events.c:197 #17 0x00007f413ac4fdcc in s3_event_loop_once (ev=0x1117350, location=0x4d0b60 "../source3/winbindd/winbindd.c:1443") at ../source3/lib/events.c:331 #18 0x00007f413ae9afea in _tevent_loop_once (ev=0x1117350, location=0x4d0b60 "../source3/winbindd/winbindd.c:1443") at ../lib/tevent/tevent.c:504 #19 0x000000000043485c in main (argc=3, argv=0x7fff797dc158, envp=0x7fff797dc178) at ../source3/winbindd/winbindd.c:1443 My changeset is 7104ce3220f9c7b301d1c638d071be2a86755c7e.
Created attachment 7337 [details] Possible fix This patch remove connection if the return status is timeout. Doing so is similar to offline the DC (that's what is done in winbindd_dual_pam_auth_samlogon) or in trusted_domains() of ./source3/winbindd/winbindd_cache.c
Volker (Cc Jeremy, Metze) assigned the but to you. Could you comment on the patch? Cheers - Michael
Comment on attachment 7337 [details] Possible fix While the patch looks as if it makes sense, it does not apply to either master, v3-5-test or v3-6-test. What version is this patch against? And, would it be possible to get this in "git format-patch" form?
Created attachment 7345 [details] Clean patch that applies to master and 3.6.3 branch
Created attachment 7346 [details] Clean patch that applies to 3.5.x branch It's a slightly modified version of the attachment #7345 [details] so that it compiles in 3.5.x branch, the variable domain needs to be renamed to contact_domain.
Comment on attachment 7346 [details] Clean patch that applies to 3.5.x branch Added two reviews due to rules for backports to "old" branch
Created attachment 7347 [details] git-am fix for master and 3.6.x. I think this is a better place to put that code in master and 3.6.x. Matthieu can you review please ?
Comment on attachment 7347 [details] git-am fix for master and 3.6.x. Jeremy this patch is not sufficient, we can face a timeout in cm_connect_netlogon(). I'm updating this patch with a version that should address this problem.
Created attachment 7348 [details] Updated version of the patch This is an updated version of Jeremy's patch so that we do the cm_connect_netlogon() in the retry loop() but we not only invalidate the connection if the timeout occurs on the netr_sam_logonex() but also when trying to connect to a netlogon pipe. This has been tested in my env and it solves the connection hanged problem.
Comment on attachment 7348 [details] Updated version of the patch LGTM. Karolin please apply to 3.5.next and 3.6.next.
(In reply to comment #14) > Comment on attachment 7348 [details] > Updated version of the patch > > LGTM. Karolin please apply to 3.5.next and 3.6.next. Pushed to v3-6-test. Does not apply cleanly to v3-5-test. Please confirm that https://attachments.samba.org/attachment.cgi?id=7346 should be picked or provide an updated version. Thanks! Re-assigning to Matthieu.
Comment on attachment 7346 [details] Clean patch that applies to 3.5.x branch LGTM. Sorry for the confusion. Karolin, this attachment (#7346) is the correct patch for 3.5.next. Please apply and then the bug can be closed. Jeremy.
(In reply to comment #16) > Comment on attachment 7346 [details] > Clean patch that applies to 3.5.x branch > > LGTM. Sorry for the confusion. Karolin, this attachment (#7346) is the correct > patch for 3.5.next. Please apply and then the bug can be closed. > > Jeremy. Thanks for clarification, Jeremy! Pushed to v3-5-test. Closing out bug report. Thanks!