Bug 10823 - winbind frequently dumps core.
winbind frequently dumps core.
Status: RESOLVED FIXED
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Winbind
unspecified
All Linux
: P5 normal
: 4.2
Assigned To: Karolin Seeger
Samba QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-09-19 11:01 UTC by John Haxby
Modified: 2015-09-01 07:59 UTC (History)
4 users (show)

See Also:


Attachments
git-am fix for 4.2.next, 4.1.next. (987 bytes, patch)
2015-07-23 18:04 UTC, Jeremy Allison
no flags Details
git-am fix for 4.2.next, 4.1.next. (1.02 KB, patch)
2015-07-23 18:05 UTC, Jeremy Allison
obnox: review+
asn: review+
metze: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description John Haxby 2014-09-19 11:01:29 UTC
This is a strange bug that I have been unable to reproduce outside of its production environment, but winbind (from the Red Hat/Oracle Linux 3.6.9-169 RPM on RHEL6) frequently dumps core, it usually lasts a few hours before crashing and being restarted.

The crash is always the same and looks like this:

--------------------------------------------------
[2014/05/25 05:58:01.587717,  0] ../lib/util/debug.c:413(talloc_log_fn)
  talloc: access after free error - first free may be at winbindd/winbindd_dual_srv.c:429
[2014/05/25 05:58:01.587898,  0] ../lib/util/debug.c:413(talloc_log_fn)
  Bad talloc magic value - access after free
[2014/05/25 05:58:01.588018,  0] lib/util.c:1117(smb_panic)
  PANIC (pid 29959): Bad talloc magic value - access after free
[2014/05/25 05:58:01.590374,  0] lib/util.c:1221(log_stack_trace)
  BACKTRACE: 19 stack frames:
   #0 winbindd(log_stack_trace+0x1a) [0x7eff074ee87a]
   #1 winbindd(smb_panic+0x2b) [0x7eff074ee94b]
   #2 /usr/lib64/libtalloc.so.2(_talloc_zero+0x17f) [0x7eff050d221f]
   #3 winbindd(ndr_push_init_ctx+0x12) [0x7eff07507162]
   #4 winbindd(+0x1035cb) [0x7eff074435cb]
   #5 winbindd(winbindd_dual_ndrcmd+0xb7) [0x7eff07438f07]
   #6 winbindd(+0xf6ecb) [0x7eff07436ecb]
   #7 /usr/lib64/libtevent.so.0(+0x3f56808ebe) [0x7eff04ec8ebe]
   #8 /usr/lib64/libtevent.so.0(+0x3f568072e6) [0x7eff04ec72e6]
   #9 /usr/lib64/libtevent.so.0(_tevent_loop_once+0x9d) [0x7eff04ec349d]
   #10 winbindd(+0xf8024) [0x7eff07438024]
   #11 winbindd(+0xf8795) [0x7eff07438795]
   #12 /usr/lib64/libtevent.so.0(tevent_common_loop_immediate+0xe8) [0x7eff04ec40c8]
   #13 /usr/lib64/libtevent.so.0(+0x3f56808caf) [0x7eff04ec8caf]
   #14 /usr/lib64/libtevent.so.0(+0x3f568072e6) [0x7eff04ec72e6]
   #15 /usr/lib64/libtevent.so.0(_tevent_loop_once+0x9d) [0x7eff04ec349d]
   #16 winbindd(main+0x7c3) [0x7eff0740fd53]
   #17 /lib64/libc.so.6(__libc_start_main+0xfd) [0x7eff04515d1d]
   #18 winbindd(+0xcd449) [0x7eff0740d449]
[2014/05/25 05:58:01.591431,  0] lib/fault.c:372(dump_core)
  dumping core in /var/log/samba/cores/winbindd
--------------------------------------------------

(I know this is quite old, I've been testing a fix in production for a few months now and there have been no problems which is why I'm opening this now.)

As I say, the crash is always the same: talloc records at attempt to use a
value free'd by TALLOC_FREE() here:

source3/winbindd/winbindd_dual_srv.c:
    NTSTATUS _wbint_QueryGroupList(struct pipes_struct *p,
                                   struct wbint_QueryGroupList *r)
    {
            [...]
            status = domain->methods->enum_dom_groups(domain, talloc_tos(),
                                                      &num_groups, &groups);
            [...]
            TALLOC_FREE(groups);
            return NT_STATUS_OK;
    }

The offending allocation is in librpc/ndr/ndr.c:

    _PUBLIC_ struct ndr_push *ndr_push_init_ctx(TALLOC_CTX *mem_ctx)
    {
            struct ndr_push *ndr;

            ndr = talloc_zero(mem_ctx, struct ndr_push);
            [...]
    }

Obviouslt the TALLOC_FREE() in _wbint_QueryGroupList() matches the allocation and the code should be fine.  However, both the allocation in ndr_push_init_ctx() and the one in _wbint_QueryGroupList() are using talloc_tos().

I'm not sure, how, but it looks as though the problematic TALLOC_FREE() in _wbint_QueryGroupList() is also freeing the storage allocated in ndr_push_init_ctx().   I'm inclined to blame some unexpected inlining in the compiler, but I can't be certain about that.

My eventual fix is very simple: delete the TALLOC_FREE() at the end of _wbint_QueryGroupList() and leave it to someone else to clear it up eventually.  I did consider modifying _wbint_QueryGroupList() to use a non-stack TALLOC_CTX() but I decided to try the simpler code change first.

It's been a while since I put this fix into a production machine and winbindd has been running flawlessly ever since.  It doesn't crash and it doesn't leak memory.
Comment 1 wei zhong 2015-07-23 00:53:12 UTC
it's not a compiler issue, it's winbind's bug. 

variable "groups" in _wbint_QueryGroupList is not initialized, it just use the value of "r" (dirty data) on the stack for memory release when satisfy following two conditions:
1. the target domain return zero group back to winbindd. 
2. another group query comes before the winbindd(child winbindd for the target domain) cache timeout. 

the first one depends on AD configuration, the second one can be triggered when suffer a network interruption(at least 35 seconds to allow winbind client to resend the group query)

Initialization of the variable "groups" can get this bug resolved. 
@@ -391,7 +391,7 @@
 {
 	struct winbindd_domain *domain = wb_child_domain();
 	uint32_t i, num_groups;
-	struct wb_acct_info *groups;
+	struct wb_acct_info *groups = NULL;
 	struct wbint_Principal *result;
 	NTSTATUS status;

Alternatively, it can be done by assigned a NULL value to groups pointer "info" in cache_methods.enum_dom_groups() when num_entries equals 0.

samba4.0.0 and upper has the same issue.
Comment 2 John Haxby 2015-07-23 09:58:31 UTC
Thank you so much for this.

How I missed the lack of initialization I don't know.

Can you reproduce the issue to test it or would you like me to try to persuade the reporter of the problem to try a different fix?
Comment 3 Jeremy Allison 2015-07-23 17:52:11 UTC
(In reply to wei zhong from comment #1)

Great catch ! I'll get this fixed for 4.2.next, 4.1.next !
Comment 4 Jeremy Allison 2015-07-23 18:04:23 UTC
Created attachment 11282 [details]
git-am fix for 4.2.next, 4.1.next.
Comment 5 Jeremy Allison 2015-07-23 18:05:40 UTC
Created attachment 11283 [details]
git-am fix for 4.2.next, 4.1.next.

Doh ! Added BUG: url to commit message...
Comment 6 Michael Adam 2015-07-23 18:34:43 UTC
Comment on attachment 11283 [details]
git-am fix for 4.2.next, 4.1.next.

ACK
Comment 7 Jeremy Allison 2015-07-23 18:42:50 UTC
Re-assigning to Karolin for inclusion in 4.2.next, 4.1.next.
Comment 8 Stefan Metzmacher 2015-08-03 10:47:50 UTC
Comment on attachment 11283 [details]
git-am fix for 4.2.next, 4.1.next.

Ok, 4.3 doesn't need this fix...
Comment 9 Karolin Seeger 2015-08-25 06:56:43 UTC
Pushed to autobuild-v4-[1|2]-test.
Comment 10 Stefan Metzmacher 2015-09-01 07:59:09 UTC
Pushed to v4-{1,2}-test