I am using samba-3.6.3 and facing one issue wrt winbind rpc lookup domain sids. There are continuous requests hitting winbind for different lookup sids to name requests. These requests vary in the number of sids in each request. So some requests (single sid) are done fast and others takes time. Now when there is a request which is taking time, it times out and it is retried as per the current code in lookup_groupmem() in winbindd_ads.c. This retry also times out due to delay in the response. This delay can be attributed to the number of sids in the request or some server slowness. After a few time out instances, for any new request of sid-to-name type, I have noticed that the response is varying. Sometime it picks up the response of the previously timed out sid-to-name command of some other SID(s). Here is the output of the command I ran in loop, 'wbinfo -s S-1-5-21-3842980903-586396-300088805-513'. SOMEDOMAIN\testgroup1 2 SOMEDOMAIN\Domain Users 2 SOMEDOMAIN\puser100 1 The correct name is 'Domain Users' but other two are wrong names picked due to the problem I explained above. And in other case, it also causes crash. This is when the response has lesser number of names compared to the requested sids. The backtrace in gdb looks like this : #0 0x00007fe3d49ecb35 in raise () from /lib64/libc.so.6 #1 0x00007fe3d49ee111 in abort () from /lib64/libc.so.6 #2 0x00007fe3d7925ceb in dump_core () at lib/fault.c:391 #3 0x00007fe3d7935b85 in smb_panic (why=<optimized out>) at lib/util.c:1133 #4 0x00007fe3d7926280 in fault_report (sig=<optimized out>) at lib/fault.c:53 #5 sig_fault (sig=1) at lib/fault.c:76 #6 <signal handler called> #7 dcerpc_lsa_lookup_sids_noalloc (presult=<optimized out>, use_lookupsids3=<optimized out>, types=<optimized out>, names=<optimized out>, domains=<optimized out>, sids=<optimized out>, num_sids=<optimized out>, pol=<optimized out>, names_ctx=<optimized out>, domains_ctx=<optimized out>, mem_ctx=<optimized out>, h=<optimized out>) at rpc_client/cli_lsarpc.c:290 #8 dcerpc_lsa_lookup_sids_generic (h=0x7fe3d8445230, mem_ctx=0x7fe3d8158e20, pol=0x7fffc456e070, num_sids=56, sids=<optimized out>, pdomains=0x7fffc456e1f0, pnames=0x7fffc456e200, ptypes=0x7fffc456e1f8, use_lookupsids3=false, presult=0x7fffc456e090) at rpc_client/cli_lsarpc.c:399 #9 0x00007fe3d7c7ad7e in dcerpc_lsa_lookup_sids (h=0x7fe3d8449c28, mem_ctx=0x7fe3d4d2d2e8, pol=0x7fe3d8449ae0, num_sids=-666594544, sids=0x7fe3d8448da0, pdomains=0x20656e6f64202d20, pnames=0x7fffc456e200, ptypes=0x7fffc456e1f8, result=0x7fffc456e090) at rpc_client/cli_lsarpc.c:481 #10 0x00007fe3d7873965 in winbindd_lookup_sids (mem_ctx=0x7fe3d8158e20, domain=0x7fe3d80eee00, num_sids=56, sids=0x7fe3d8437cc0, domains=0x7fffc456e1f0, names=0x7fffc456e200, types=0x7fffc456e1f8) at winbindd/winbindd_msrpc.c:1110 #11 0x00007fe3d7876fd2 in lookup_groupmem (domain=0x7fe3d80eee00, mem_ctx=0x7fe3d82ab570, group_sid=0x7fe3d8158d70, type=<optimized out>, num_names=0x7fffc456e40c, sid_mem=0x7fffc456e400, names=0x7fffc456e3f8, name_types=0x7fffc456e3f0) at winbindd/winbindd_ads.c:1179 ---Type <return> to continue, or q <return> to quit--- #12 0x00007fe3d78635b0 in lookup_groupmem (domain=0x7fe3d80eee00, mem_ctx=0x7fe3d82ab570, group_sid=0x7fe3d8158d70, type=SID_NAME_DOM_GRP, num_names=0x7fffc456e40c, sid_mem=0x7fffc456e400, names=0x7fffc456e3f8, name_types=0x7fffc456e3f0) at winbindd/winbindd_cache.c:2692 #13 0x00007fe3d7881593 in _wbint_LookupGroupMembers (p=0x7fffc456e470, r=0x7fe3d81e3b30) at winbindd/winbindd_dual_srv.c:347 #14 0x00007fe3d788ae06 in api_wbint_LookupGroupMembers (p=0x7fffc456e470) at librpc/gen_ndr/srv_wbint.c:1271 #15 0x00007fe3d78802d1 in winbindd_dual_ndrcmd (domain=0x7fe3d80eee00, state=0x7fffc456e600) at winbindd/winbindd_dual_ndr.c:322 #16 0x00007fe3d787f032 in child_process_request (state=<optimized out>, child=<optimized out>) at winbindd/winbindd_dual.c:440 #17 fork_domain_child (child=0x7fe3d80ee4d0) at winbindd/winbindd_dual.c:1554 #18 0x00007fe3d787fb55 in wb_child_request_trigger (req=0x7fe3d823a9a0, private_data=<optimized out>) at winbindd/winbindd_dual.c:145 #19 0x00007fe3d794685a in tevent_common_loop_immediate (ev=0x7fe3d80e7480) at ../lib/tevent/tevent_immediate.c:135 #20 0x00007fe3d7944abb in run_events_poll (ev=0x7fe3d80e7480, pollrtn=0, pfds=0x0, num_pfds=0) at lib/events.c:197 #21 0x00007fe3d79451f2 in s3_event_loop_once (ev=0x7fe3d80e7480, location=<optimized out>) at lib/events.c:331 #22 0x00007fe3d79455f0 in _tevent_loop_once (ev=0x7fe3d80e7480, location=0x7fe3d7ca68d8 "winbindd/winbindd.c:1456") at ../lib/tevent/tevent.c:494 #23 0x00007fe3d785744b in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at winbindd/winbindd.c:1456 (gdb) f 7 #7 dcerpc_lsa_lookup_sids_noalloc (presult=<optimized out>, use_lookupsids3=<optimized out>, types=<optimized out>, names=<optimized out>, domains=<optimized out>, sids=<optimized out>, num_sids=<optimized out>, pol=<optimized out>, names_ctx=<optimized out>, domains_ctx=<optimized out>, mem_ctx=<optimized out>, h=<optimized out>) at rpc_client/cli_lsarpc.c:290 290 in rpc_client/cli_lsarpc.c (gdb) p lsa_names $8 = {count = 1, names = 0x7fe3d84492f0} (gdb) p *lsa_names.names $9 = {sid_type = SID_NAME_DOM_GRP, name = {length = 24, size = 26, string = 0x7fe3d8449970 "Domain Users"}, sid_index = 0} (gdb) p sid_array $10 = {num_sids = 56, sids = 0x7fe3d8443780}
Created attachment 9265 [details] Probable fix for winbind core due to timeout in lsarpc lookupsids. This proposes configurable timeout value rather than hardcode 35 sec. The current issue is happening due to the timeout happening and there on, the non-relevant responses are being taken up by the winbind dcerpc handling module. The timeout of the rpc call made to get the sid-to-name is having a timeout of 35 seconds. This is a hard code entry. I have come up with a patch on 3.6.3 code base which is making this timeout entry as configurable. If I make a higher timeout value, I can avoid this core dump. Please review the change for this change.
Am I getting this right that this does not address the root cause of the crash, but it gives the user the option to make it less likely?
We definitely need to solve the synchronisation issue. From your description I could not really find out where exactly we time out and get out of sync. A simple fix would be to drop the domain connection and reconnect once we see a timout. Not the smartest move, but it should save us from getting out of sync. Question to all working on rpc right now: I'm looking at 3.6.3 (this is what this bug is opened against). In rpc_api_pipe_req_done or somewhere in that area, shouldn't we compare the incoming call_id and drop the reply if it does not match what we expect?
(In reply to comment #3) > We definitely need to solve the synchronisation issue. From your description I > could not really find out where exactly we time out and get out of sync. Volker, Thanks for looking into the issue. The requests of the lsa_lookupsids are initiated by the winbind. It starts the rpc call with the hard coded timeout of 35secs. It will hit a domain controller to get the information. If this first call timeouts, winbind does a retry. If second one also fails to respond within specified time, winbind fail the request. I will try to explain the timeout scenario. I will denote requests as R# and its retry as RT# and the response as R`#. It is something like this : R1 -> DC timeout RT1 -> DC timeout R2 -> DC <- R`1 If R2 has more number of sids in sid_array (N), it will read the names (n) from lsa_names array comfortably. Though the information read is incorrect. The core dump will happen if R2 has lesser number of SIDs than the response containing names. You are right in your previous comment, that the patch attached is just trying to avoid the core dump. It is giving the option to adjust the timeout based on the response time. It is not fixing the actual root cause.
Dear Samba team, Please update this bug with your thoughts on getting the fix for this issue, in the next patch release of 3.6.x release ?
I am working on a fix for this. We will certainly fix this for the next 3.6.x and above releases. Jeremy.
Hi Jeremy, We are looking forward for this change. Thanks for the update.
Hi Jeremy, Can you please consider including the patch mention in comment #1 as well ? Currently winbind code has this as a hard coded value of 35 sec. This change will give control to configure this timeout outside and it will help for a slow and wan-separated setups. Thanks, hargagan
Hi Samba-team, Can somebody please provide an update on this ?
Going through some of the changes there is call_id approach taken, that could differentiate between the replies. One such id is : From 1f14621cdbf960fff3e838afa5a345090cbb4cf9 Mon Sep 17 00:00:00 2001 From: Jeremy Allison <jra@samba.org> Date: Tue, 22 Oct 2013 15:34:12 -0700 Subject: [PATCH 12/15] CVE-2013-4408:s3:Ensure we always check call_id when validating an RPC reply. Bug: https://bugzilla.samba.org/show_bug.cgi?id=10185 Is patch update for the bug 10185 fixing this bug ?
(In reply to hargagan from comment #10) With this patch we'll drop the connection and hopefully retry instead of mixing unrelated requests and responses.
(In reply to Stefan (metze) Metzmacher from comment #11) Thanks for your response. This bug was happening when winbind was consuming response of unrelated request and then winbind child process was crashing. Wth this patch of call_id approach this mixup will not happen. I will take this patch and validate the scenario.