Concurrent threads executing jobs created by pthreadpool_tevent_job_send() and processed by pthreadpool have race conditions. A customer core dump triggered investigation into this area: the core showed a corrupted glue_list (at the time of the abort) which had been overwritten with valid data by the time the core was generated.

At the time of the core, two pthreadpool_tevent jobs were being processed: vfs_pwrite_do and vfs_fsync_do.

Thread 1 (coring thread, after completing vfs_pwrite_do and calling pthreadpool_tevent_job_signal to indicate job event completion):

[...]
369 #ifdef HAVE_PTHREAD
370         for (g = state->pool->glue_list; g != NULL; g = g->next) {
371                 if (g->ev == state->ev) {
372                         tctx = g->tctx;
373                         break;
374                 }
375         }
376
377         if (tctx == NULL) {
*378                abort();
379         }
380 #endif

#8  0x00007f4ba8a53cdb in raise () from /lib64/libc.so.6
#9  0x00007f4ba8a55395 in abort () from /lib64/libc.so.6
#10 0x00007f4ba96dbc44 in pthreadpool_tevent_job_signal (jobid=<optimized out>, job_fn=<optimized out>, job_private_data=<optimized out>, private_data=<optimized out>) at ../../lib/pthreadpool/pthreadpool_tevent.c:378
#11 0x00007f4ba96dabaf in pthreadpool_server (arg=0x55b2c234ffe0) at ../../lib/pthreadpool/pthreadpool.c:657
#12 0x00007f4ba97bf6da in start_thread () from /lib64/libpthread.so.0
#13 0x00007f4ba8b2153f in clone () from /lib64/libc.so.6

Note: the stack tells us we aborted at line 378 above, meaning tctx is NULL; it shouldn't be (but obviously it was at the time of the crash).
If we examine the glue_list and state->ev (we have to go up a frame to work these out, because these variables are optimized out at this point in the backtrace):

(gdb) up
// Get the state->ev from the struct pthreadpool_tevent_job_state passed
// to pool->signal_fn
(gdb) print ((struct pthreadpool_tevent_job_state *)job.private_data)->ev
$3 = (struct tevent_context *) 0x55b2c233d2e0
(gdb) print ((struct pthreadpool_tevent_job_state *)job.private_data)->pool
$4 = (struct pthreadpool_tevent *) 0x55b2c234ffc0
// Get 'ev' from the first 'g' to iterate (from pool->glue_list)
(gdb) print ((struct pthreadpool_tevent_job_state *)job.private_data)->pool->glue_list->ev
$5 = (struct tevent_context *) 0x55b2c2558070
// It's not a match; iterate to the next
(gdb) print ((struct pthreadpool_tevent_job_state *)job.private_data)->pool->glue_list->next->ev
$6 = (struct tevent_context *) 0x55b2c233d2e0
// Bingo, this is a match (but it doesn't correlate with the abort we see!)

Thread 2 (awaiting job completion in the worker thread):

(gdb) where
#0  0x00007f4ba8b2192f in epoll_wait () from /lib64/libc.so.6
#1  0x00007f4ba8c0de82 in ?? () from /usr/lib64/libtevent.so.0
#2  0x00007f4ba8c0c0a7 in ??
() from /usr/lib64/libtevent.so.0
#3  0x00007f4ba8c0706d in _tevent_loop_once () from /usr/lib64/libtevent.so.0
#4  0x00007f4ba8c08af3 in tevent_req_poll () from /usr/lib64/libtevent.so.0
#5  0x00007f4ba9ce3d5b in smb_vfs_fsync_sync (fsp=fsp@entry=0x55b2c23bada0) at ../../source3/smbd/vfs.c:2062
#6  0x00007f4ba9c7e248 in sync_file (conn=conn@entry=0x55b2c237f8a0, fsp=fsp@entry=0x55b2c23bada0, write_through=write_through@entry=true) at ../../source3/smbd/fileio.c:315
#7  0x00007f4ba9ca784e in reply_flush (req=0x55b2c2557ee0) at ../../source3/smbd/reply.c:5398

Thread 3 (worker thread doing the fsync):

(gdb) where
#0  0x00007f4ba97ca847 in fsync () from /lib64/libpthread.so.0
#1  0x00007f4ba9dbe6d4 in vfs_fsync_do (private_data=<optimized out>) at ../../source3/modules/vfs_default.c:1128
#2  0x00007f4ba96dab97 in pthreadpool_server (arg=0x55b2c234ffe0) at ../../lib/pthreadpool/pthreadpool.c:655
#3  0x00007f4ba97bf6da in start_thread () from /lib64/libpthread.so.0
#4  0x00007f4ba8b2153f in clone () from /lib64/libc.so.6

Clearly the glue_list is being concurrently modified and read. The only functions that can be involved are pthreadpool_tevent_register_ev and pthreadpool_tevent_job_signal, plus pthreadpool_tevent_destructor, which is triggered by deleting the registered event context (and for smb_vfs_fsync_sync that happens after every call). These functions can run concurrently in different threads (both reading and writing), and the glue_list is unprotected against such concurrent access.