Found by Herb (Herb Lewis <hlewis@panasas.com>).

We have a Windows 2008 domain here that we use for testing, and it has only a single DC. We were running some tests using swifttest, and while it was making connections and writing files we rebooted the DC. The winbind child for the domain hung in s3_event_loop_once, calling sys_poll with a timeout of MAX_INT (which on a 64-bit system turns out to be about 24 days!).

Basically, winbindd was getting a stream of AUTH_CRAP requests that it was sending to the child. The call to ads_connect from dcip_to_name returned an error, but the code continued on to call s3_event_loop_once with the MAX_INT timeout. Even though the DC rebooted, winbindd never reconnected.

I added a return False where ads_connect returned an error and things seem to work fine now. I'm not sure that is the best fix. How is it supposed to decide to reconnect instead of polling for the dead DC?

I've included a log snippet where I added some extra debug statements so I could follow what was happening. This was running 3.6.2. It is easy to reproduce, so I could give you full logs if you like.
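For reference, the "about 24 days" figure follows from poll() taking its timeout in milliseconds. The sketch below does that arithmetic and models the early-return workaround described in the report. It is only an illustration: `timeout_ms_to_days`, `max_int_timeout_days`, `connect_stub` and `lookup_dc_name` are hypothetical names, not Samba functions, and the real dcip_to_name does considerably more.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* poll() timeouts are in milliseconds; convert one to days. */
double timeout_ms_to_days(int timeout_ms)
{
    return (double)timeout_ms / (1000.0 * 60.0 * 60.0 * 24.0);
}

/* How long a poll() with timeout=INT_MAX would block, in days. */
double max_int_timeout_days(void)
{
    return timeout_ms_to_days(INT_MAX);
}

/* Hypothetical stand-in for a connect call: 0 on success, -1 on failure. */
int connect_stub(bool dc_reachable)
{
    return dc_reachable ? 0 : -1;
}

/*
 * Model of the workaround: if the connect step fails, return false
 * immediately instead of falling through to a wait on the same DC.
 * *waited records whether the wait step would have been reached.
 */
bool lookup_dc_name(bool dc_reachable, bool *waited)
{
    assert(waited != NULL);
    *waited = false;

    if (connect_stub(dc_reachable) != 0) {
        return false;   /* give up now; the caller can retry or pick another DC */
    }

    *waited = true;     /* real code would enter the event loop here */
    return true;
}
```

With a 32-bit int, `max_int_timeout_days()` comes out just under 25 days, which matches the timeout=2147483647 seen in the backtrace.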
Stack backtrace:
#0  0x0000000801d5f0fc in poll () at poll.S:2
#1  0x0000000000596747 in sys_poll (fds=0x802143750, num_fds=1, timeout=2147483647) at /src/samba/source3/../lib/util/select.c:104
#2  0x00000000005d73c6 in s3_event_loop_once (ev=0x80212f350, location=0xcde3b8 "/src/samba/source3/../lib/tevent/tevent_req.c:193") at /src/samba/source3/lib/events.c:342
#3  0x00000000005d862f in _tevent_loop_once (ev=0x80212f350, location=0xcde3b8 "/src/samba/source3/../lib/tevent/tevent_req.c:193") at /src/samba/source3/../lib/tevent/tevent.c:494
#4  0x00000000005db542 in tevent_req_poll (req=0x80212f410, ev=0x80212f350) at /src/samba/source3/../lib/tevent/tevent_req.c:193
#5  0x00000000005c59c1 in tevent_req_poll_ntstatus (req=0x80212f410, ev=0x80212f350, status=0x7fffffffc810) at /src/samba/source3/lib/util.c:2672
#6  0x0000000000637c4b in nbt_getdc (msg_ctx=0x8021320d0, dc_addr=0x80211ec90, domain_name=0x80211e800 "NATIVE2K8", sid=0x80211eb00, nt_version=1, mem_ctx=0x8021082e0, pnt_version=0x7fffffffc90c, dc_name=0x7fffffffc8f8, samlogon_response=0x0) at /src/samba/source3/libsmb/clidgram.c:462
#7  0x00000000004b4bee in dcip_to_name (mem_ctx=0x8021082e0, domain=0x80211e800, pss=0x80211ec90, name=0x80211eb90 "CAVM4-84") at /src/samba/source3/winbindd/winbindd_cm.c:1192
#8  0x00000000004b56ed in find_new_dc (mem_ctx=0x8021082e0, domain=0x80211e800, dcname=0x80211eb90 "CAVM4-84", pss=0x80211ec90, fd=0x7fffffffcc9c) at /src/samba/source3/winbindd/winbindd_cm.c:1414
#9  0x00000000004b5ec9 in cm_open_connection (domain=0x80211e800, new_conn=0x80211ed20) at /src/samba/source3/winbindd/winbindd_cm.c:1596
#10 0x00000000004b6645 in init_dc_connection_network (domain=0x80211e800) at /src/samba/
Quit
(gdb)
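Frame #1 shows a single descriptor being polled with timeout=2147483647. As a minimal sketch of the alternative, the code below waits on one descriptor with a bounded timeout, so the caller regains control when the peer never answers. It uses only POSIX poll(), pipe(), write() and close(); `wait_readable` and `demo_pipe_wait` are hypothetical helpers for illustration and are not how the Samba fix is implemented.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Wait until fd is readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error
 * (including hangup/error events with no data to read).
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int ret;

    do {
        /* Restarting after EINTR reuses the full timeout; fine for a sketch. */
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret == -1 && errno == EINTR);

    if (ret <= 0) {
        return ret;     /* 0 = timed out, -1 = poll() failed */
    }
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * Exercise wait_readable() on a pipe: first with nothing written
 * (a "silent peer"), then after one byte has been written.
 * Returns 0 on success, -1 if the pipe could not be set up.
 */
int demo_pipe_wait(int *silent, int *ready)
{
    int p[2];
    char byte = 'x';

    assert(silent != NULL && ready != NULL);

    if (pipe(p) != 0) {
        return -1;
    }

    *silent = wait_readable(p[0], 50);   /* nothing to read yet */

    if (write(p[1], &byte, 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    *ready = wait_readable(p[0], 50);    /* one byte is pending */

    close(p[0]);
    close(p[1]);
    return 0;
}
```

The point of the bounded wait is that a timeout result gives the caller a chance to decide whether to retry, reconnect, or mark the DC offline, rather than blocking for weeks.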
Created attachment 7582: Fix I'm testing for master.
Created attachment 7598: Fix for 3.6.next. Fix tested by Herb. Please +1 and I'll assign to Karolin for inclusion in 3.6.next. Jeremy.
Comment on attachment 7598: Fix for 3.6.next. Michael, if you get to this first I'm ok with that :-).
Comment on attachment 7598: Fix for 3.6.next. Trying to get this into 3.6.next :-).
Comment on attachment 7598: Fix for 3.6.next.
+ if (ev && req &&
+ tevent_req_is_error(req, &err_state, &error) &&
+ err_state == TEVENT_REQ_TIMED_OUT) {
+ status = NT_STATUS_IO_TIMEOUT;
+ }
This is very bad style, both in formatting and with respect to the tevent_req rules. It's also not needed, as nbt_getdc_recv() should already handle that; see tevent_req_is_nterror().
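To illustrate the review point, here is a self-contained model of the pattern being asked for: the request state is translated into a status code in one place, inside the recv function, so callers never inspect the request state themselves. Everything here (`model_req`, `model_req_is_error`, `model_getdc_recv`, and the `MODEL_*` constants) is hypothetical and only mirrors the shape of the idea; it is not the tevent or Samba API, and it is not the code from the attachment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified request states, loosely modelled on an async request. */
enum model_state {
    MODEL_DONE,
    MODEL_TIMED_OUT,
    MODEL_NO_MEMORY,
    MODEL_USER_ERROR
};

/* Simplified status codes returned to callers. */
enum model_status {
    MODEL_STATUS_OK,
    MODEL_STATUS_IO_TIMEOUT,
    MODEL_STATUS_NO_MEMORY,
    MODEL_STATUS_UNSUCCESSFUL
};

struct model_req {
    enum model_state state;
    enum model_status user_error;   /* only meaningful for MODEL_USER_ERROR */
};

/*
 * Single place where a request state becomes a status code.
 * Returns true and sets *status if the request failed.
 */
bool model_req_is_error(const struct model_req *req, enum model_status *status)
{
    assert(req != NULL && status != NULL);

    switch (req->state) {
    case MODEL_DONE:
        return false;
    case MODEL_TIMED_OUT:
        *status = MODEL_STATUS_IO_TIMEOUT;
        return true;
    case MODEL_NO_MEMORY:
        *status = MODEL_STATUS_NO_MEMORY;
        return true;
    case MODEL_USER_ERROR:
        *status = req->user_error;
        return true;
    }

    /* Unknown state: report a generic failure. */
    *status = MODEL_STATUS_UNSUCCESSFUL;
    return true;
}

/* The recv function does the mapping, so callers only see a status. */
enum model_status model_getdc_recv(const struct model_req *req)
{
    enum model_status status;

    if (model_req_is_error(req, &status)) {
        return status;
    }
    return MODEL_STATUS_OK;
}
```

With this shape, a caller that polls the request and then calls the recv function gets the timeout status without a second, hand-written check on the request state.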
Ok, I'll fix and re-submit later today (got to go on a school trip this morning). Thanks for the review! I'll fix in master also. Jeremy.
Created attachment 7607: Updated fix for 3.6.next. Added metze's suggestions.
Comment on attachment 7607: Updated fix for 3.6.next. Looks good for 3.6.
Karolin, please pick this for the release.
Pushed to v3-6-test. Closing out bug report. Thanks!