Bug 11141 - Use after free panic in winbindd epoll_event_loop
Summary: Use after free panic in winbindd epoll_event_loop
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Winbind (show other bugs)
Version: 4.2.0
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks: 11316
  Show dependency treegraph
 
Reported: 2015-03-06 18:54 UTC by David Disseldorp
Modified: 2015-06-08 08:02 UTC (History)
5 users (show)

See Also:


Attachments
Patch for v4-2-test (3.55 KB, patch)
2015-05-20 22:21 UTC, Stefan Metzmacher
jra: review+
Details
Patch for v4-1-test (3.59 KB, patch)
2015-05-20 22:27 UTC, Stefan Metzmacher
jra: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description David Disseldorp 2015-03-06 18:54:28 UTC
Am hitting this in ~stock 4.2.0 winbind:
Mar 06 17:57:39 fater winbindd[3226]: [2015/03/06 17:57:39.519806,  0, pid=3226, effective(0, 0), real(0, 0)] ../source3/lib/popt_common.c:68(popt_s3_talloc_log_fn)
Mar 06 17:57:39 fater winbindd[3226]: talloc: access after free error - first free may be at ../tevent_req.c:247
Mar 06 17:57:39 fater winbindd[3226]: [2015/03/06 17:57:39.519859,  0, pid=3226, effective(0, 0), real(0, 0)] ../source3/lib/popt_common.c:68(popt_s3_talloc_log_fn)
Mar 06 17:57:39 fater winbindd[3226]: Bad talloc magic value - access after free
Mar 06 17:57:39 fater winbindd[3226]: [2015/03/06 17:57:39.519884,  0, pid=3226, effective(0, 0), real(0, 0)] ../source3/lib/util.c:788(smb_panic_s3)
Mar 06 17:57:39 fater winbindd[3226]: PANIC (pid 3226): Bad talloc magic value - access after free
Mar 06 17:57:39 fater winbindd[3226]: [2015/03/06 17:57:39.520586,  0, pid=3226, effective(0, 0), real(0, 0)] ../source3/lib/util.c:899(log_stack_trace)
Mar 06 17:57:39 fater winbindd[3226]: BACKTRACE: 11 stack frames:
Mar 06 17:57:39 fater winbindd[3226]: #0 /usr/lib64/libsmbconf.so.0(log_stack_trace+0x1a) [0x7fe07bf4f0aa]
Mar 06 17:57:39 fater winbindd[3226]: #1 /usr/lib64/libsmbconf.so.0(smb_panic_s3+0x20) [0x7fe07bf4f180]
Mar 06 17:57:39 fater winbindd[3226]: #2 /usr/lib64/libsamba-util.so.0(smb_panic+0x2f) [0x7fe07ebc5b8f]
Mar 06 17:57:39 fater winbindd[3226]: #3 /usr/lib64/libtalloc.so.2(+0x236f) [0x7fe07a04636f]
Mar 06 17:57:39 fater winbindd[3226]: #4 /usr/lib64/libtalloc.so.2(talloc_check_name+0x6c) [0x7fe07a046e3c]
Mar 06 17:57:39 fater winbindd[3226]: #5 /usr/lib64/libtevent.so.0(+0x9763) [0x7fe079548763]
Mar 06 17:57:39 fater winbindd[3226]: #6 /usr/lib64/libtevent.so.0(+0x7d47) [0x7fe079546d47]
Mar 06 17:57:39 fater winbindd[3226]: #7 /usr/lib64/libtevent.so.0(_tevent_loop_once+0x8d) [0x7fe07954351d]
Mar 06 17:57:39 fater winbindd[3226]: #8 /usr/sbin/winbindd(main+0xb14) [0x7fe080bf41e4]
Mar 06 17:57:39 fater winbindd[3226]: #9 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fe078f6cb05]
Mar 06 17:57:39 fater winbindd[3226]: #10 /usr/sbin/winbindd(+0x278fb) [0x7fe080bf48fb]
Mar 06 17:57:39 fater winbindd[3226]: [2015/03/06 17:57:39.522497,  0, pid=3226, effective(0, 0), real(0, 0)] ../source3/lib/dumpcore.c:318(dump_core)
Mar 06 17:57:39 fater winbindd[3226]: dumping core in /var/log/samba/cores/winbindd





(gdb) bt
#0  0x00007fe078f80187 in raise () from /lib64/libc.so.6
#1  0x00007fe078f81538 in abort () from /lib64/libc.so.6
#2  0x00007fe07bf5c26b in dump_core () at ../source3/lib/dumpcore.c:337
#3  0x00007fe07bf4f1e7 in smb_panic_s3 (why=<optimized out>) at ../source3/lib/util.c:811
#4  0x00007fe07ebc5b8f in smb_panic (why=0x7fe07a0500b8 "Bad talloc magic value - access after free") at ../lib/util/fault.c:166
#5  0x00007fe07a04636f in ?? () from /usr/lib64/libtalloc.so.2
#6  0x00007fe07a046e3c in talloc_check_name () from /usr/lib64/libtalloc.so.2
#7  0x00007fe079548763 in epoll_event_loop (tvalp=0x7fff77e92260, epoll_ev=0x7fe0811d2510) at ../tevent_epoll.c:664
#8  epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../tevent_epoll.c:926
#9  0x00007fe079546d47 in std_event_loop_once (ev=0x7fe0811d2360, location=0x7fe080c7c1e0 "../source3/winbindd/winbindd.c:1676") at ../tevent_standard.c:114
#10 0x00007fe07954351d in _tevent_loop_once (ev=0x7fe0811d2360, location=0x7fe080c7c1e0 "../source3/winbindd/winbindd.c:1676") at ../tevent.c:533
#11 0x00007fe080bf41e4 in main (argc=<optimized out>, argv=<optimized out>) at ../source3/winbindd/winbindd.c:1676
(gdb) up 7
#7  0x00007fe079548763 in epoll_event_loop (tvalp=0x7fff77e92260, epoll_ev=0x7fe0811d2510) at ../tevent_epoll.c:664
664                     struct tevent_fd *fde = talloc_get_type(events[i].data.ptr,
(gdb) l
659                     tevent_common_loop_timer_delay(epoll_ev->ev);
660                     return 0;
661             }
662
663             for (i=0;i<ret;i++) {
664                     struct tevent_fd *fde = talloc_get_type(events[i].data.ptr,
665                                                            struct tevent_fd);
666                     uint16_t flags = 0;
667                     struct tevent_fd *mpx_fde = NULL;
668
(gdb)
669                     if (fde == NULL) {
670                             epoll_panic(epoll_ev, "epoll_wait() gave bad data", true);
671                             return -1;
672                     }
673                     if (fde->additional_flags & EPOLL_ADDITIONAL_FD_FLAG_HAS_MPX) {
674                             /*
675                              * Save off the multiplexed event in case we need
676                              * to use it to call the handler function.
677                              */
678                             mpx_fde = talloc_get_type_abort(fde->additional_data,
(gdb) p i
$1 = 0
(gdb) p events[i]
$2 = {events = 17, data = {ptr = 0x7fe0811f5b40, fd = -2128651456, u32 = 2166315840, u64 = 140602215717696}}
(gdb) p (struct tevent_fd *)0x7fe0811f5b40
$3 = (struct tevent_fd *) 0x7fe0811f5b40
(gdb) p *(struct tevent_fd *)0x7fe0811f5b40
$4 = {prev = 0x0, next = 0x0, event_ctx = 0x0, fd = 34, flags = 1, handler = 0x7fe07a457b90 <read_packet_handler>, close_fn = 0x0, private_data = 0x7fe0811f5830, handler_name = 0x7fe07a4594be "read_packet_handler",
  location = 0x7fe07a459308 "../lib/async_req/async_sock.c:602", additional_flags = 4, additional_data = 0x0}
(gdb) p *(struct tevent_req *)0x7fe0811f5830
$5 = {async = {fn = 0x7fe080c3fee0 <wb_req_read_done>, private_data = 0x7fe0811f79e0}, data = 0x0, private_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 0x0, state = TEVENT_REQ_INIT}, internal = {private_type = 0x7fe07a459456 "struct read_packet_state",
    create_location = 0x7fe07a4592b8 "../lib/async_req/async_sock.c:587", finish_location = 0x0, cancel_location = 0x0, state = TEVENT_REQ_RECEIVED, error = 0, trigger = 0x0, defer_callback_ev = 0x0, timer = 0x0}}
Comment 1 Stefan Metzmacher 2015-05-11 13:05:35 UTC
David, do you have any idea how to reproduce this?
Comment 2 David Disseldorp 2015-05-11 13:13:08 UTC
(In reply to Stefan (metze) Metzmacher from comment #1)

It's not easy to reproduce.

I was able to hit it the most when using the diskshadow.exe FSRVP client (requires the extra mgmt RPC server changes).
I also encountered it infrequently on ~stock 4.2 by restarting smbd with Windows clients connected.
Comment 3 Stefan Metzmacher 2015-05-12 12:29:58 UTC
What kernel are you using?

I have access to a cluster with kernel-3.10.0-123.20.1.el7.x86_64
where this happens
Comment 4 Stefan Metzmacher 2015-05-12 12:31:19 UTC
You're using tevent-0.9.24 and talloc-2.1.2 ?
Comment 5 David Disseldorp 2015-05-12 12:57:09 UTC
SLE12 kernel: 3.12.39-47-default
tevent: 0.9.21-2.4
Comment 6 Stefan Metzmacher 2015-05-15 10:59:17 UTC
I finally found the problem :-)

What happens is the following:

- The related winbindd_cli_state has a wb_req_read_send()
  or wb_resp_write_send() pending. That means there's
  tevent_fd registered with the kernel via epoll_ctl(...ADD)
- A new winbindd child is fork'ed and winbindd_reinit_after_fork()
  -> close_conns_after_fork() starts to run, but didn't
  reach the related winbindd_cli_state yet.
- The parent calls remove_client() on the related winbindd_cli_state
  maybe via remove_timed_out_clients().
- remove_client() calls close() before TALLOC_FREE(state)
  the close() doesn't remove the fd from the parents epoll
  set yet, as the socket fd is still open in the child.
  TALLOC_FREE(state) triggers tevent_req_received() on the pending
  wb_req_read or wb_req_write, which then triggers
  epoll_event_fd_destructor() -> epoll_update_event() -> epoll_del_event().
  The results in epoll_ctl(DEL) returning EBADF because the fd is already
  closed in the parent.
- As the socket fd was still open in the child, epoll_wait returns
  on event in the parent epoll queue, but the pointer in events.data.ptr
  is not valid anymore.

So we need to make sure the close() comes after epoll_ctl(DEL)

As a safe-guard epoll_del_event() should also call epoll_panic()
if it gets EBADF, this would result in a fallback to the poll backend
instead of crashing later.
Comment 7 David Disseldorp 2015-05-15 11:43:04 UTC
(In reply to Stefan (metze) Metzmacher from comment #6)
Great work tracking this one down Metze!
I'll mark you down as assignee, as it looks like you've made much more progress on this than I did.
Comment 8 Stefan Metzmacher 2015-05-20 22:21:30 UTC
Created attachment 11075 [details]
Patch for v4-2-test
Comment 9 Stefan Metzmacher 2015-05-20 22:27:04 UTC
Created attachment 11078 [details]
Patch for v4-1-test
Comment 10 Jeremy Allison 2015-05-21 07:52:32 UTC
Reassigning to Karolin for inclusion in 4.2.next, 4.1.next.
Comment 11 Karolin Seeger 2015-05-21 12:41:49 UTC
(In reply to Jeremy Allison from comment #10)
Pushed to autobuild-v4-[1|2]-test.
Comment 12 Karolin Seeger 2015-05-24 18:41:55 UTC
(In reply to Karolin Seeger from comment #11)
Pushed to both branches.
Closing out bug report.

Thanks!