Bug 10192 - Winbind dumping core in dcerpc_lsa_lookup_sids during lookup sids response processing once requests start timing out
Summary: Winbind dumping core in dcerpc_lsa_lookup_sids during lookup sids response processing once requests start timing out
Status: NEW
Alias: None
Product: Samba 3.6
Classification: Unclassified
Component: Winbind
Version: 3.6.3
Hardware: All All
Importance: P5 major
Target Milestone: ---
Assignee: Michael Adam
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-10 05:39 UTC by hargagan
Modified: 2015-02-06 11:37 UTC
CC List: 3 users

See Also:


Attachments
Probable fix for the winbind core due to timeout in lsarpc lookupsids. This proposes a configurable timeout value rather than a hard-coded 35 sec. (2.16 KB, patch)
2013-10-10 05:45 UTC, hargagan

Description hargagan 2013-10-10 05:39:55 UTC
I am using samba-3.6.3 and am facing an issue with winbind's RPC lookup of domain SIDs.

Winbind is continuously receiving SID-to-name lookup requests. These requests vary in the number of SIDs they carry, so some (a single SID) complete quickly while others take longer.

When a request takes too long, it times out and is retried, as per the current code in lookup_groupmem() in winbindd_ads.c. The retry can also time out because the response is still delayed; the delay can be attributed to the number of SIDs in the request or to server slowness.

After a few timeouts, I noticed that the response for any new SID-to-name request varies. Sometimes it picks up the response of a previously timed-out SID-to-name request for some other SID(s). Here is the output of the command I ran in a loop, 'wbinfo -s S-1-5-21-3842980903-586396-300088805-513':

SOMEDOMAIN\testgroup1 2

SOMEDOMAIN\Domain Users 2

SOMEDOMAIN\puser100 1

The correct name is 'Domain Users'; the other two are wrong names picked up due to the problem explained above.

In another case it also causes a crash, namely when the response contains fewer names than the number of requested SIDs. The backtrace in gdb looks like this:


#0  0x00007fe3d49ecb35 in raise () from /lib64/libc.so.6
#1  0x00007fe3d49ee111 in abort () from /lib64/libc.so.6
#2  0x00007fe3d7925ceb in dump_core () at lib/fault.c:391
#3  0x00007fe3d7935b85 in smb_panic (why=<optimized out>) at lib/util.c:1133
#4  0x00007fe3d7926280 in fault_report (sig=<optimized out>) at lib/fault.c:53
#5  sig_fault (sig=1) at lib/fault.c:76
#6  <signal handler called>
#7  dcerpc_lsa_lookup_sids_noalloc (presult=<optimized out>, use_lookupsids3=<optimized out>, 
    types=<optimized out>, names=<optimized out>, domains=<optimized out>, sids=<optimized out>, 
    num_sids=<optimized out>, pol=<optimized out>, names_ctx=<optimized out>, 
    domains_ctx=<optimized out>, mem_ctx=<optimized out>, h=<optimized out>)
    at rpc_client/cli_lsarpc.c:290
#8  dcerpc_lsa_lookup_sids_generic (h=0x7fe3d8445230, mem_ctx=0x7fe3d8158e20, pol=0x7fffc456e070, 
    num_sids=56, sids=<optimized out>, pdomains=0x7fffc456e1f0, pnames=0x7fffc456e200, 
    ptypes=0x7fffc456e1f8, use_lookupsids3=false, presult=0x7fffc456e090)
    at rpc_client/cli_lsarpc.c:399
#9  0x00007fe3d7c7ad7e in dcerpc_lsa_lookup_sids (h=0x7fe3d8449c28, mem_ctx=0x7fe3d4d2d2e8, 
    pol=0x7fe3d8449ae0, num_sids=-666594544, sids=0x7fe3d8448da0, pdomains=0x20656e6f64202d20, 
    pnames=0x7fffc456e200, ptypes=0x7fffc456e1f8, result=0x7fffc456e090)
    at rpc_client/cli_lsarpc.c:481
#10 0x00007fe3d7873965 in winbindd_lookup_sids (mem_ctx=0x7fe3d8158e20, domain=0x7fe3d80eee00, 
    num_sids=56, sids=0x7fe3d8437cc0, domains=0x7fffc456e1f0, names=0x7fffc456e200, 
    types=0x7fffc456e1f8) at winbindd/winbindd_msrpc.c:1110
#11 0x00007fe3d7876fd2 in lookup_groupmem (domain=0x7fe3d80eee00, mem_ctx=0x7fe3d82ab570, 
    group_sid=0x7fe3d8158d70, type=<optimized out>, num_names=0x7fffc456e40c, 
    sid_mem=0x7fffc456e400, names=0x7fffc456e3f8, name_types=0x7fffc456e3f0)
    at winbindd/winbindd_ads.c:1179
#12 0x00007fe3d78635b0 in lookup_groupmem (domain=0x7fe3d80eee00, mem_ctx=0x7fe3d82ab570, 
    group_sid=0x7fe3d8158d70, type=SID_NAME_DOM_GRP, num_names=0x7fffc456e40c, 
    sid_mem=0x7fffc456e400, names=0x7fffc456e3f8, name_types=0x7fffc456e3f0)
    at winbindd/winbindd_cache.c:2692
#13 0x00007fe3d7881593 in _wbint_LookupGroupMembers (p=0x7fffc456e470, r=0x7fe3d81e3b30)
    at winbindd/winbindd_dual_srv.c:347
#14 0x00007fe3d788ae06 in api_wbint_LookupGroupMembers (p=0x7fffc456e470)
    at librpc/gen_ndr/srv_wbint.c:1271
#15 0x00007fe3d78802d1 in winbindd_dual_ndrcmd (domain=0x7fe3d80eee00, state=0x7fffc456e600)
    at winbindd/winbindd_dual_ndr.c:322
#16 0x00007fe3d787f032 in child_process_request (state=<optimized out>, child=<optimized out>)
    at winbindd/winbindd_dual.c:440
#17 fork_domain_child (child=0x7fe3d80ee4d0) at winbindd/winbindd_dual.c:1554
#18 0x00007fe3d787fb55 in wb_child_request_trigger (req=0x7fe3d823a9a0, 
    private_data=<optimized out>) at winbindd/winbindd_dual.c:145
#19 0x00007fe3d794685a in tevent_common_loop_immediate (ev=0x7fe3d80e7480)
    at ../lib/tevent/tevent_immediate.c:135
#20 0x00007fe3d7944abb in run_events_poll (ev=0x7fe3d80e7480, pollrtn=0, pfds=0x0, num_pfds=0)
    at lib/events.c:197
#21 0x00007fe3d79451f2 in s3_event_loop_once (ev=0x7fe3d80e7480, location=<optimized out>)
    at lib/events.c:331
#22 0x00007fe3d79455f0 in _tevent_loop_once (ev=0x7fe3d80e7480, 
    location=0x7fe3d7ca68d8 "winbindd/winbindd.c:1456") at ../lib/tevent/tevent.c:494
#23 0x00007fe3d785744b in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at winbindd/winbindd.c:1456
(gdb) f 7
#7  dcerpc_lsa_lookup_sids_noalloc (presult=<optimized out>, use_lookupsids3=<optimized out>, 
    types=<optimized out>, names=<optimized out>, domains=<optimized out>, sids=<optimized out>, 
    num_sids=<optimized out>, pol=<optimized out>, names_ctx=<optimized out>, 
    domains_ctx=<optimized out>, mem_ctx=<optimized out>, h=<optimized out>)
    at rpc_client/cli_lsarpc.c:290
290	in rpc_client/cli_lsarpc.c
(gdb) p lsa_names
$8 = {count = 1, names = 0x7fe3d84492f0}
(gdb) p *lsa_names.names
$9 = {sid_type = SID_NAME_DOM_GRP, name = {length = 24, size = 26, 
    string = 0x7fe3d8449970 "Domain Users"}, sid_index = 0}
(gdb) p sid_array
$10 = {num_sids = 56, sids = 0x7fe3d8443780}
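
To make the failure mode concrete, here is a simplified, self-contained sketch of the access pattern the backtrace points at. This is not the actual rpc_client/cli_lsarpc.c code, and the types are illustrative stand-ins for the generated lsa_TransNameArray structures; the gdb output above shows num_sids == 56 while the consumed stale reply carries only lsa_names.count == 1 name:

#include <stdio.h>

struct translated_name {
	const char *name;      /* e.g. "Domain Users" */
	unsigned sid_index;    /* index into the returned domain list */
};

struct name_array {
	unsigned count;                  /* names actually in the reply */
	struct translated_name *names;
};

static void copy_names(unsigned num_sids,           /* SIDs we asked for: 56 */
		       const struct name_array *lsa_names)
{
	unsigned i;

	for (i = 0; i < num_sids; i++) {
		/* Without a bounds check like this one, names[i] is read
		 * past the end of the array as soon as i >= count, which
		 * is the out-of-bounds access behind the smb_panic(). */
		if (i >= lsa_names->count) {
			fprintf(stderr, "reply has %u names for %u sids -- "
				"would have crashed here\n",
				lsa_names->count, num_sids);
			return;
		}
		printf("sid %u -> %s\n", i, lsa_names->names[i].name);
	}
}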
Comment 1 hargagan 2013-10-10 05:45:22 UTC
Created attachment 9265 [details]
Probable fix for the winbind core due to timeout in lsarpc lookupsids. This proposes a configurable timeout value rather than a hard-coded 35 sec.

The issue happens because, once a request times out, unrelated responses are picked up by winbind's dcerpc handling module.

The RPC call made for the SID-to-name lookup has a timeout of 35 seconds, which is a hard-coded value. I have come up with a patch on the 3.6.3 code base that makes this timeout configurable.

With a higher timeout value I can avoid this core dump.

Please review the change.
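
To illustrate the shape of the change (the attached patch is the authoritative version; the option name "winbind:lookupsids timeout" below is only an illustration and may be spelled differently in the patch), Samba's existing lp_parm_int() helper can replace the hard-coded constant:

/* Sketch against the 3.6 tree; "includes.h" is the Samba master
 * header that provides lp_parm_int(). */
#include "includes.h"

static int lookupsids_timeout(void)
{
	/* Falls back to the previous hard-coded default of 35 seconds
	 * when no "winbind:lookupsids timeout" is set in smb.conf. */
	return lp_parm_int(-1, "winbind", "lookupsids timeout", 35);
}

With that in place, a slow or WAN-separated setup could raise the timeout in smb.conf:

[global]
	winbind:lookupsids timeout = 120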
Comment 2 Volker Lendecke 2013-10-12 11:17:15 UTC
Am I getting this right that this does not address the root cause of the crash, but gives the user an option to make it less likely?
Comment 3 Volker Lendecke 2013-10-12 14:02:08 UTC
We definitely need to solve the synchronisation issue. From your description I could not really find out where exactly we time out and get out of sync. A simple fix would be to drop the domain connection and reconnect once we see a timeout. Not the smartest move, but it should save us from getting out of sync.

Question to all working on rpc right now: I'm looking at 3.6.3 (this is what this bug is opened against). In rpc_api_pipe_req_done or somewhere in that area, shouldn't we compare the incoming call_id and drop the reply if it does not match what we expect?
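
A sketch of the check being described (illustrative names, not the actual cli_pipe.c code): every DCERPC request carries a call_id, and the reply must echo it back, so a mismatch identifies a stale reply.

#include <stdbool.h>
#include <stdint.h>

/* A reply whose call_id does not match the one we sent must not be
 * matched up with the pending request. */
static bool reply_matches_request(uint32_t reply_call_id,
				  uint32_t expected_call_id)
{
	if (reply_call_id != expected_call_id) {
		/* Stale reply from a timed-out request: treat it as a
		 * protocol error and drop/reconnect the connection
		 * instead of consuming the wrong data. */
		return false;
	}
	return true;
}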
Comment 4 hargagan 2013-10-14 06:00:46 UTC
(In reply to comment #3)
> We definitely need to solve the synchronisation issue. From your description I
> could not really find out where exactly we time out and get out of sync. 

Volker,

Thanks for looking into the issue.

The lsa_lookupsids requests are initiated by winbind. It starts the RPC call with the hard-coded timeout of 35 seconds and hits a domain controller to get the information. If this first call times out, winbind retries. If the second call also fails to respond within the specified time, winbind fails the request.

I will try to explain the timeout scenario. I will denote a request as R#, its retry as RT#, and its response as R`#.

It is something like this:

R1 -> DC
  timeout
RT1 -> DC
  timeout
R2 -> DC
   <- R`1

If R2 has fewer SIDs in its sid_array (N) than the stale response has names (n), it reads the names from the lsa_names array without crashing, though the information read is incorrect. The core dump happens when R2 has more SIDs than the response has names (as in the backtrace above: 56 SIDs against a single name).

You are right in your previous comment that the attached patch only tries to avoid the core dump: it gives the option to adjust the timeout based on the response time. It does not fix the actual root cause.
Comment 5 hargagan 2013-10-21 06:47:50 UTC
Dear Samba team,

Could you please update this bug with your thoughts on getting a fix for this issue into the next 3.6.x patch release?
Comment 6 Jeremy Allison 2013-10-21 15:56:42 UTC
I am working on a fix for this. We will certainly fix this for the next 3.6.x and above releases.

Jeremy.
Comment 7 hargagan 2013-10-22 10:06:28 UTC
Hi Jeremy, 

We are looking forward to this change. Thanks for the update.
Comment 8 hargagan 2013-11-26 04:52:10 UTC
Hi Jeremy,

Can you please consider including the patch mentioned in comment #1 as well?

Currently the winbind code has this as a hard-coded value of 35 seconds. This change makes the timeout configurable, which will help on slow and WAN-separated setups.

Thanks,
hargagan
Comment 9 hargagan 2013-12-05 08:41:03 UTC
Hi Samba-team,

Can somebody please provide an update on this?
Comment 10 hargagan 2015-02-06 08:29:24 UTC
Going through some of the later changes, there is a call_id approach taken that could differentiate between the replies.

One such change is:
From 1f14621cdbf960fff3e838afa5a345090cbb4cf9 Mon Sep 17 00:00:00 2001
From: Jeremy Allison <jra@samba.org>
Date: Tue, 22 Oct 2013 15:34:12 -0700
Subject: [PATCH 12/15] CVE-2013-4408:s3:Ensure we always check call_id when
 validating an RPC reply.

Bug: https://bugzilla.samba.org/show_bug.cgi?id=10185

Does the patch for bug 10185 fix this bug?
Comment 11 Stefan Metzmacher 2015-02-06 10:30:14 UTC
(In reply to hargagan from comment #10)

With this patch we'll drop the connection and hopefully retry
instead of mixing unrelated requests and responses.
Comment 12 hargagan 2015-02-06 11:37:59 UTC
(In reply to Stefan (metze) Metzmacher from comment #11)

Thanks for your response.

This bug was happening when winbind was consuming response of unrelated request and then winbind child process was crashing. Wth this patch of call_id approach this mixup will not happen. I will take this patch and validate the scenario.