Bug 8953 - winbind can hang as nbt_getdc() has no timeout.
Summary: winbind can hang as nbt_getdc() has no timeout.
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.6
Classification: Unclassified
Component: Winbind (show other bugs)
Version: unspecified
Hardware: All All
: P5 normal
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks: 8595
  Show dependency treegraph
 
Reported: 2012-05-22 23:22 UTC by Jeremy Allison
Modified: 2012-05-31 19:06 UTC (History)
3 users (show)

See Also:


Attachments
Fix I'm testing for master. (4.31 KB, patch)
2012-05-22 23:51 UTC, Jeremy Allison
no flags Details
Fix for 3.6.next. (4.55 KB, patch)
2012-05-25 20:14 UTC, Jeremy Allison
herb: review+
metze: review-
Details
Updated fix for 3.6.next. (3.72 KB, patch)
2012-05-29 23:39 UTC, Jeremy Allison
metze: review+
herb: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jeremy Allison 2012-05-22 23:22:50 UTC
Found by Herb (Herb Lewis <hlewis@panasas.com>).

We have a win 2008 domain here we use for testing and it has only a
single DC.
We were running some tests using swifttest and while it was making
connections
and writting files we rebooted the DC. The winbind child for the
domain hung in
s3_event_loop_once calling sys_poll with a timeout of MAX_INT (which on a 64
bit systems turns out to be about 24 days!). Basically winbindd was
getting a
stream of AUTH_CRAP requests that it was sending to the child.  The call to
ads_connect from dcip_to_name returned an error but it continued on to call
s3_event_loop_once with the MAX_INT timeout. Even though the DC rebooted it
never reconnected. I added a return False where ads_connect returned
an error
and things seem to work fine now. I'm not sure that is the best fix.
How is it
supposed to decide to reconnect instead of polling for the dead DC?
I've included
a log snippet where I added some extra debug statements so I could follow
what was happening. This was running 3.6.2. It is easy to reproduce so
I could
give you full logs if you like.

Stack backtrace:

#0  0x0000000801d5f0fc in poll () at poll.S:2
#1  0x0000000000596747 in sys_poll (fds=0x802143750, num_fds=1, timeout=2147483647)
    at /src/samba/source3/../lib/util/select.c:104
#2  0x00000000005d73c6 in s3_event_loop_once (ev=0x80212f350,
    location=0xcde3b8 "/src/samba/source3/../lib/t
+event/tevent_req.c:193")
    at /src/samba/source3/lib/events.c:342
#3  0x00000000005d862f in _tevent_loop_once (ev=0x80212f350,
    location=0xcde3b8 "/src/samba/source3/../lib/t
+event/tevent_req.c:193")
    at /src/samba/source3/../lib/tevent/tevent.c:4
+94
#4  0x00000000005db542 in tevent_req_poll (req=0x80212f410, ev=0x80212f350)
    at /src/samba/source3/../lib/tevent/tevent_req
+.c:193
#5  0x00000000005c59c1 in tevent_req_poll_ntstatus (req=0x80212f410, ev=0x80212f350, status=0x7fffffffc810)
    at /src/samba/source3/lib/util.c:2672
#6  0x0000000000637c4b in nbt_getdc (msg_ctx=0x8021320d0, dc_addr=0x80211ec90, domain_name=0x80211e800 "NATIVE2K8", sid=0x8021
+1eb00, nt_version=1,
    mem_ctx=0x8021082e0, pnt_version=0x7fffffffc90c, dc_name=0x7fffffffc8f8, samlogon_response=0x0)
    at /src/samba/source3/libsmb/clidgram.c:462
#7  0x00000000004b4bee in dcip_to_name (mem_ctx=0x8021082e0, domain=0x80211e800, pss=0x80211ec90, name=0x80211eb90 "CAVM4-84")
    at /src/samba/source3/winbindd/winbindd_cm.c:1
+192
#8  0x00000000004b56ed in find_new_dc (mem_ctx=0x8021082e0, domain=0x80211e800, dcname=0x80211eb90 "CAVM4-84", pss=0x80211ec90
+, fd=0x7fffffffcc9c)
    at /src/samba/source3/winbindd/winbindd_cm.c:1
+414
#9  0x00000000004b5ec9 in cm_open_connection (domain=0x80211e800, new_conn=0x80211ed20)
    at /src/samba/source3/winbindd/winbindd_cm.c:1
+596
#10 0x00000000004b6645 in init_dc_connection_network (domain=0x80211e800)
 at /src/samba/Quit
(gdb)
Comment 1 Jeremy Allison 2012-05-22 23:51:52 UTC
Created attachment 7582 [details]
Fix I'm testing for master.
Comment 2 Jeremy Allison 2012-05-25 20:14:18 UTC
Created attachment 7598 [details]
Fix for 3.6.next.

Fix tested by Herb. Please +1 and I'll assign to Karolin for inclusion in 3.6.next.

Jeremy.
Comment 3 Jeremy Allison 2012-05-25 20:14:45 UTC
Comment on attachment 7598 [details]
Fix for 3.6.next.

Michael, if you get to this first I'm ok with that :-).
Comment 4 Jeremy Allison 2012-05-25 22:31:12 UTC
Comment on attachment 7598 [details]
Fix for 3.6.next.

Trying to get this into 3.6.next :-).
Comment 5 Stefan Metzmacher 2012-05-26 08:45:47 UTC
Comment on attachment 7598 [details]
Fix for 3.6.next.

+	if (ev && req &&
+			tevent_req_is_error(req, &err_state, &error) &&
+			err_state == TEVENT_REQ_TIMED_OUT) {
+		status = NT_STATUS_IO_TIMEOUT;
+	}

Is very bad style in formatting and in tevent_req rules.

It's also not needed as the nbt_getdc_recv() should already handle that, see tevent_req_is_nterror()
Comment 6 Jeremy Allison 2012-05-29 14:53:23 UTC
Ok, I'll fix and re-submit later today (got to go on a school trip this morning). Thanks for the review ! I'll fix in master also.

Jeremy.
Comment 7 Jeremy Allison 2012-05-29 23:39:58 UTC
Created attachment 7607 [details]
Updated fix for 3.6.next.

Added metze's suggestions.
Comment 8 Stefan Metzmacher 2012-05-30 06:17:43 UTC
Comment on attachment 7607 [details]
Updated fix for 3.6.next.

Looks good for 3.6
Comment 9 Stefan Metzmacher 2012-05-30 06:19:49 UTC
Karolin, please pick this for the release
Comment 10 Karolin Seeger 2012-05-31 19:06:13 UTC
Pushed to v3-6-test.
Closing out bug report.

Thanks!