Am hitting this in ~stock 4.2.0 winbind:

Mar 06 17:57:39 fater winbindd[3226]: [2015/03/06 17:57:39.519806, 0, pid=3226, effective(0, 0), real(0, 0)] ../source3/lib/popt_common.c:68(popt_s3_talloc_log_fn)
Mar 06 17:57:39 fater winbindd[3226]: talloc: access after free error - first free may be at ../tevent_req.c:247
Mar 06 17:57:39 fater winbindd[3226]: [2015/03/06 17:57:39.519859, 0, pid=3226, effective(0, 0), real(0, 0)] ../source3/lib/popt_common.c:68(popt_s3_talloc_log_fn)
Mar 06 17:57:39 fater winbindd[3226]: Bad talloc magic value - access after free
Mar 06 17:57:39 fater winbindd[3226]: [2015/03/06 17:57:39.519884, 0, pid=3226, effective(0, 0), real(0, 0)] ../source3/lib/util.c:788(smb_panic_s3)
Mar 06 17:57:39 fater winbindd[3226]: PANIC (pid 3226): Bad talloc magic value - access after free
Mar 06 17:57:39 fater winbindd[3226]: [2015/03/06 17:57:39.520586, 0, pid=3226, effective(0, 0), real(0, 0)] ../source3/lib/util.c:899(log_stack_trace)
Mar 06 17:57:39 fater winbindd[3226]: BACKTRACE: 11 stack frames:
Mar 06 17:57:39 fater winbindd[3226]:  #0 /usr/lib64/libsmbconf.so.0(log_stack_trace+0x1a) [0x7fe07bf4f0aa]
Mar 06 17:57:39 fater winbindd[3226]:  #1 /usr/lib64/libsmbconf.so.0(smb_panic_s3+0x20) [0x7fe07bf4f180]
Mar 06 17:57:39 fater winbindd[3226]:  #2 /usr/lib64/libsamba-util.so.0(smb_panic+0x2f) [0x7fe07ebc5b8f]
Mar 06 17:57:39 fater winbindd[3226]:  #3 /usr/lib64/libtalloc.so.2(+0x236f) [0x7fe07a04636f]
Mar 06 17:57:39 fater winbindd[3226]:  #4 /usr/lib64/libtalloc.so.2(talloc_check_name+0x6c) [0x7fe07a046e3c]
Mar 06 17:57:39 fater winbindd[3226]:  #5 /usr/lib64/libtevent.so.0(+0x9763) [0x7fe079548763]
Mar 06 17:57:39 fater winbindd[3226]:  #6 /usr/lib64/libtevent.so.0(+0x7d47) [0x7fe079546d47]
Mar 06 17:57:39 fater winbindd[3226]:  #7 /usr/lib64/libtevent.so.0(_tevent_loop_once+0x8d) [0x7fe07954351d]
Mar 06 17:57:39 fater winbindd[3226]:  #8 /usr/sbin/winbindd(main+0xb14) [0x7fe080bf41e4]
Mar 06 17:57:39 fater winbindd[3226]:  #9 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fe078f6cb05]
Mar 06 17:57:39 fater winbindd[3226]:  #10 /usr/sbin/winbindd(+0x278fb) [0x7fe080bf48fb]
Mar 06 17:57:39 fater winbindd[3226]: [2015/03/06 17:57:39.522497, 0, pid=3226, effective(0, 0), real(0, 0)] ../source3/lib/dumpcore.c:318(dump_core)
Mar 06 17:57:39 fater winbindd[3226]: dumping core in /var/log/samba/cores/winbindd

(gdb) bt
#0  0x00007fe078f80187 in raise () from /lib64/libc.so.6
#1  0x00007fe078f81538 in abort () from /lib64/libc.so.6
#2  0x00007fe07bf5c26b in dump_core () at ../source3/lib/dumpcore.c:337
#3  0x00007fe07bf4f1e7 in smb_panic_s3 (why=<optimized out>) at ../source3/lib/util.c:811
#4  0x00007fe07ebc5b8f in smb_panic (why=0x7fe07a0500b8 "Bad talloc magic value - access after free") at ../lib/util/fault.c:166
#5  0x00007fe07a04636f in ?? () from /usr/lib64/libtalloc.so.2
#6  0x00007fe07a046e3c in talloc_check_name () from /usr/lib64/libtalloc.so.2
#7  0x00007fe079548763 in epoll_event_loop (tvalp=0x7fff77e92260, epoll_ev=0x7fe0811d2510) at ../tevent_epoll.c:664
#8  epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../tevent_epoll.c:926
#9  0x00007fe079546d47 in std_event_loop_once (ev=0x7fe0811d2360, location=0x7fe080c7c1e0 "../source3/winbindd/winbindd.c:1676") at ../tevent_standard.c:114
#10 0x00007fe07954351d in _tevent_loop_once (ev=0x7fe0811d2360, location=0x7fe080c7c1e0 "../source3/winbindd/winbindd.c:1676") at ../tevent.c:533
#11 0x00007fe080bf41e4 in main (argc=<optimized out>, argv=<optimized out>) at ../source3/winbindd/winbindd.c:1676

(gdb) up 7
#7  0x00007fe079548763 in epoll_event_loop (tvalp=0x7fff77e92260, epoll_ev=0x7fe0811d2510) at ../tevent_epoll.c:664
664		struct tevent_fd *fde = talloc_get_type(events[i].data.ptr,
(gdb) l
659			tevent_common_loop_timer_delay(epoll_ev->ev);
660			return 0;
661		}
662
663		for (i=0;i<ret;i++) {
664			struct tevent_fd *fde = talloc_get_type(events[i].data.ptr,
665						       struct tevent_fd);
666			uint16_t flags = 0;
667			struct tevent_fd *mpx_fde = NULL;
668
(gdb)
669			if (fde == NULL) {
670				epoll_panic(epoll_ev, "epoll_wait() gave bad data", true);
671				return -1;
672			}
673			if (fde->additional_flags & EPOLL_ADDITIONAL_FD_FLAG_HAS_MPX) {
674				/*
675				 * Save off the multiplexed event in case we need
676				 * to use it to call the handler function.
677				 */
678				mpx_fde = talloc_get_type_abort(fde->additional_data,
(gdb) p i
$1 = 0
(gdb) p events[i]
$2 = {events = 17, data = {ptr = 0x7fe0811f5b40, fd = -2128651456, u32 = 2166315840, u64 = 140602215717696}}
(gdb) p (struct tevent_fd *)0x7fe0811f5b40
$3 = (struct tevent_fd *) 0x7fe0811f5b40
(gdb) p *(struct tevent_fd *)0x7fe0811f5b40
$4 = {prev = 0x0, next = 0x0, event_ctx = 0x0, fd = 34, flags = 1, handler = 0x7fe07a457b90 <read_packet_handler>, close_fn = 0x0, private_data = 0x7fe0811f5830, handler_name = 0x7fe07a4594be "read_packet_handler", location = 0x7fe07a459308 "../lib/async_req/async_sock.c:602", additional_flags = 4, additional_data = 0x0}
(gdb) p *(struct tevent_req *)0x7fe0811f5830
$5 = {async = {fn = 0x7fe080c3fee0 <wb_req_read_done>, private_data = 0x7fe0811f79e0}, data = 0x0, private_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 0x0, state = TEVENT_REQ_INIT}, internal = {private_type = 0x7fe07a459456 "struct read_packet_state", create_location = 0x7fe07a4592b8 "../lib/async_req/async_sock.c:587", finish_location = 0x0, cancel_location = 0x0, state = TEVENT_REQ_RECEIVED, error = 0, trigger = 0x0, defer_callback_ev = 0x0, timer = 0x0}}
David, do you have any idea how to reproduce this?
(In reply to Stefan (metze) Metzmacher from comment #1) It's not easy to reproduce. I was able to hit it the most when using the diskshadow.exe FSRVP client (requires the extra mgmt RPC server changes). I also encountered it infrequently on ~stock 4.2 by restarting smbd with Windows clients connected.
What kernel are you using? I have access to a cluster with kernel-3.10.0-123.20.1.el7.x86_64 where this happens.
You're using tevent-0.9.24 and talloc-2.1.2?
SLE12
kernel: 3.12.39-47-default
tevent: 0.9.21-2.4
I finally found the problem :-)

What happens is the following:
- The related winbindd_cli_state has a wb_req_read_send() or wb_resp_write_send() pending. That means there's a tevent_fd registered with the kernel via epoll_ctl(...ADD).
- A new winbindd child is fork'ed, and winbindd_reinit_after_fork() -> close_conns_after_fork() starts to run, but hasn't reached the related winbindd_cli_state yet.
- The parent calls remove_client() on the related winbindd_cli_state, maybe via remove_timed_out_clients().
- remove_client() calls close() before TALLOC_FREE(state). The close() doesn't remove the fd from the parent's epoll set yet, as the socket fd is still open in the child. TALLOC_FREE(state) triggers tevent_req_received() on the pending wb_req_read or wb_resp_write, which then triggers epoll_event_fd_destructor() -> epoll_update_event() -> epoll_del_event(). This results in epoll_ctl(DEL) returning EBADF, because the fd is already closed in the parent.
- As the socket fd was still open in the child, epoll_wait() returns an event in the parent's epoll queue, but the pointer in events.data.ptr is no longer valid.

So we need to make sure the close() comes after epoll_ctl(DEL).

As a safeguard, epoll_del_event() should also call epoll_panic() if it gets EBADF; this would result in a fallback to the poll backend instead of crashing later.
(In reply to Stefan (metze) Metzmacher from comment #6) Great work tracking this one down Metze! I'll mark you down as assignee, as it looks like you've made much more progress on this than I did.
Created attachment 11075 [details] Patch for v4-2-test
Created attachment 11078 [details] Patch for v4-1-test
Reassigning to Karolin for inclusion in 4.2.next, 4.1.next.
(In reply to Jeremy Allison from comment #10) Pushed to autobuild-v4-[1|2]-test.
(In reply to Karolin Seeger from comment #11) Pushed to both branches. Closing out bug report. Thanks!