This is a strange bug that I have been unable to reproduce outside of its production environment, but winbind (from the Red Hat/Oracle Linux 3.6.9-169 RPM on RHEL6) frequently dumps core, it usually lasts a few hours before crashing and being restarted. The crash is always the same and looks like this: -------------------------------------------------- [2014/05/25 05:58:01.587717, 0] ../lib/util/debug.c:413(talloc_log_fn) talloc: access after free error - first free may be at winbindd/winbindd_dual_srv.c:429 [2014/05/25 05:58:01.587898, 0] ../lib/util/debug.c:413(talloc_log_fn) Bad talloc magic value - access after free [2014/05/25 05:58:01.588018, 0] lib/util.c:1117(smb_panic) PANIC (pid 29959): Bad talloc magic value - access after free [2014/05/25 05:58:01.590374, 0] lib/util.c:1221(log_stack_trace) BACKTRACE: 19 stack frames: #0 winbindd(log_stack_trace+0x1a) [0x7eff074ee87a] #1 winbindd(smb_panic+0x2b) [0x7eff074ee94b] #2 /usr/lib64/libtalloc.so.2(_talloc_zero+0x17f) [0x7eff050d221f] #3 winbindd(ndr_push_init_ctx+0x12) [0x7eff07507162] #4 winbindd(+0x1035cb) [0x7eff074435cb] #5 winbindd(winbindd_dual_ndrcmd+0xb7) [0x7eff07438f07] #6 winbindd(+0xf6ecb) [0x7eff07436ecb] #7 /usr/lib64/libtevent.so.0(+0x3f56808ebe) [0x7eff04ec8ebe] #8 /usr/lib64/libtevent.so.0(+0x3f568072e6) [0x7eff04ec72e6] #9 /usr/lib64/libtevent.so.0(_tevent_loop_once+0x9d) [0x7eff04ec349d] #10 winbindd(+0xf8024) [0x7eff07438024] #11 winbindd(+0xf8795) [0x7eff07438795] #12 /usr/lib64/libtevent.so.0(tevent_common_loop_immediate+0xe8) [0x7eff04ec40c8] #13 /usr/lib64/libtevent.so.0(+0x3f56808caf) [0x7eff04ec8caf] #14 /usr/lib64/libtevent.so.0(+0x3f568072e6) [0x7eff04ec72e6] #15 /usr/lib64/libtevent.so.0(_tevent_loop_once+0x9d) [0x7eff04ec349d] #16 winbindd(main+0x7c3) [0x7eff0740fd53] #17 /lib64/libc.so.6(__libc_start_main+0xfd) [0x7eff04515d1d] #18 winbindd(+0xcd449) [0x7eff0740d449] [2014/05/25 05:58:01.591431, 0] lib/fault.c:372(dump_core) dumping core in /var/log/samba/cores/winbindd -------------------------------------------------- (I know this is quite old, I've been testing a fix in production for a few months now and there have been no problems which is why I'm opening this now.) As I say, the crash is always the same: talloc records at attempt to use a value free'd by TALLOC_FREE() here: source3/winbindd/winbindd_dual_srv.c: NTSTATUS _wbint_QueryGroupList(struct pipes_struct *p, struct wbint_QueryGroupList *r) { [...] status = domain->methods->enum_dom_groups(domain, talloc_tos(), &num_groups, &groups); [...] TALLOC_FREE(groups); return NT_STATUS_OK; } The offending allocation is in librpc/ndr/ndr.c: _PUBLIC_ struct ndr_push *ndr_push_init_ctx(TALLOC_CTX *mem_ctx) { struct ndr_push *ndr; ndr = talloc_zero(mem_ctx, struct ndr_push); [...] } Obviouslt the TALLOC_FREE() in _wbint_QueryGroupList() matches the allocation and the code should be fine. However, both the allocation in ndr_push_init_ctx() and the one in _wbint_QueryGroupList() are using talloc_tos(). I'm not sure, how, but it looks as though the problematic TALLOC_FREE() in _wbint_QueryGroupList() is also freeing the storage allocated in ndr_push_init_ctx(). I'm inclined to blame some unexpected inlining in the compiler, but I can't be certain about that. My eventual fix is very simple: delete the TALLOC_FREE() at the end of _wbint_QueryGroupList() and leave it to someone else to clear it up eventually. I did consider modifying _wbint_QueryGroupList() to use a non-stack TALLOC_CTX() but I decided to try the simpler code change first. It's been a while since I put this fix into a production machine and winbindd has been running flawlessly ever since. It doesn't crash and it doesn't leak memory.
it's not a compiler issue, it's winbind's bug. variable "groups" in _wbint_QueryGroupList is not initialized, it just use the value of "r" (dirty data) on the stack for memory release when satisfy following two conditions: 1. the target domain return zero group back to winbindd. 2. another group query comes before the winbindd(child winbindd for the target domain) cache timeout. the first one depends on AD configuration, the second one can be triggered when suffer a network interruption(at least 35 seconds to allow winbind client to resend the group query) Initialization of the variable "groups" can get this bug resolved. @@ -391,7 +391,7 @@ { struct winbindd_domain *domain = wb_child_domain(); uint32_t i, num_groups; - struct wb_acct_info *groups; + struct wb_acct_info *groups = NULL; struct wbint_Principal *result; NTSTATUS status; Alternatively, it can be done by assigned a NULL value to groups pointer "info" in cache_methods.enum_dom_groups() when num_entries equals 0. samba4.0.0 and upper has the same issue.
Thank you so much for this. How I missed the lack of initialization I don't know. Can you reproduce the issue to test it or would you like me to try to persuade the reporter of the problem to try a different fix?
(In reply to wei zhong from comment #1) Great catch ! I'll get this fixed for 4.2.next, 4.1.next !
Created attachment 11282 [details] git-am fix for 4.2.next, 4.1.next.
Created attachment 11283 [details] git-am fix for 4.2.next, 4.1.next. Doh ! Added BUG: url to commit message...
Comment on attachment 11283 [details] git-am fix for 4.2.next, 4.1.next. ACK
Re-assigning to Karolin for inclusion in 4.2.next, 4.1.next.
Comment on attachment 11283 [details] git-am fix for 4.2.next, 4.1.next. Ok, 4.3 doesn't need this fix...
Pushed to autobuild-v4-[1|2]-test.
Pushed to v4-{1,2}-test