Reported by Ashok Ramakrishnan <aramakrishnan@nasuni.com>. Hi Jeremy: My investigation continues! I thought something funny was happening around compiler optimization, and tried making the nu +m_aio_requests a volatile unsigned in the files_struct. We were able to reproduce the problem even with that (previously we used to get a core because fsp->aio_requests was NULL + and we were dereferencing it, now infinite loop...)... anyways, this is samba 4.12.6 (we do not have any vfs module red +efining the pwrite/pread methods, but do have other custom VFS modules). Here is my stack trace (attached live to the smbd process since it is in an infinite loop in assert_no_pending_aio() Some variable values that may be of interest... The reason it is in an infinite loop is fsp->aio_requests[0] == NULL with + fsp->num_aio_requests == 4. I have absolutely no idea how this could happen. Still looking. Clearly no one else seems to be having this problem, so m +ust be something unique to our use-case, but thought I would share with you the progress. I will update the list if I fi +nd anything concrete. Thanks, (gdb) p num_requests $17 = 5 (gdb) p fsp->num_aio_requests $18 = 4 (gdb) p fsp->aio_requests[0] $19 = (struct tevent_req *) 0x0 (gdb) p fsp->aio_requests[1] $20 = (struct tevent_req *) 0x55fca4da5a80 (gdb) p fsp->aio_requests[2] $21 = (struct tevent_req *) 0x55fca4dd79a0 (gdb) p fsp->aio_requests[3] $22 = (struct tevent_req *) 0x55fca4dd9a30 (gdb) p fsp->aio_requests[4] $23 = (struct tevent_req *) 0x55fca4ddbac0 (gdb) p fsp->aio_requests[5] $24 = (struct tevent_req *) 0x55fca4dcc190 (gdb) p *$20 $25 = {async = {fn = 0x7f2276c945b0 <aio_pwrite_smb2_done>, private_data = 0x55fca4da5980}, data = 0x55fca4da5c30, privat +e_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 0x0, state = TEVENT_REQ_INIT}, internal = {private_type = 0x7f2276dc8400 "struct pwrite_fsync_state", create_location = 0x +7f2276dc8593 "../../source3/smbd/aio.c:341", finish_location = 0x0, cancel_location = 0x0, state = TEVENT_REQ_IN_PROGRESS, error = 0, trigger = 0x55fca4da5b70, de +fer_callback_ev = 0x0, timer = 0x0, profile = 0x0}} (gdb) p *$21 $26 = {async = {fn = 0x7f2276c945b0 <aio_pwrite_smb2_done>, private_data = 0x55fca4dd78a0}, data = 0x55fca4dd7b50, privat +e_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 0x0, state = TEVENT_REQ_INIT}, internal = {private_type = 0x7f2276dc8400 "struct pwrite_fsync_state", create_location = 0x +7f2276dc8593 "../../source3/smbd/aio.c:341", finish_location = 0x0, cancel_location = 0x0, state = TEVENT_REQ_IN_PROGRESS, error = 0, trigger = 0x55fca4dd7a90, de +fer_callback_ev = 0x0, timer = 0x0, profile = 0x0}}
Hi Jeremy: After looking at the assembly code generated from this code block in assert_no_pending_aio(), while (fsp->num_aio_requests != 0) { TALLOC_FREE(fsp->aio_requests[0]); } return; It appears that the problem we may be running into is the logic in TALLOC_FREE(). I see the definition as #define TALLOC_FREE(ctx) do { if (ctx != NULL) { talloc_free(ctx); ctx=NULL; } } while(0) So, it sets the ctx (fsp->aio_requests[0] in this case) to NULL after the talloc_free(). Wouldn't that cause fsp->aio_requests[0] to be set to NULL after the first call to talloc_free()? It seems like that is w +hat happened in my case below... Thanks, Ashok
The TALLOC_FREE(fsp->aio_requests[0]) inside assert_no_pending_aio() must be replaced with talloc_free(fsp->aio_requests[0]). The destructor for fsp->aio_requests[index] is: static int aio_del_req_from_fsp(struct aio_req_fsp_link *lnk) { unsigned i; files_struct *fsp = lnk->fsp; struct tevent_req *req = lnk->req; for (i=0; i<fsp->num_aio_requests; i++) { if (fsp->aio_requests[i] == req) { break; } } if (i == fsp->num_aio_requests) { DEBUG(1, ("req %p not found in fsp %p\n", req, fsp)); return 0; } fsp->num_aio_requests -= 1; fsp->aio_requests[i] = fsp->aio_requests[fsp->num_aio_requests]; if (fsp->num_aio_requests == 0) { TALLOC_FREE(fsp->aio_requests); } return 0; } Note that inside this destructor, if lnk->req == fsp->aio_requests[i], then fsp->aio_requests[i] is *replaced* with the last element of the fsp->aio_requests[] array. If this is being invoked from assert_no_pending_aio() via TALLOC_FREE(fsp->aio_requests[0]), then the TALLOC_FREE macro does: talloc_free(ctx); ctx=NULL; which will overwrite the value just written by the destructor with NULL, thus causing the crash.
I broke this in 4.12 and above with the async close teardown fixes :-(. Look at what we had previously in 4.11.x while (fsp->num_aio_requests != 0) { /* * The destructor of the req will remove itself from the * fsp. Don't use TALLOC_FREE here, this will overwrite * what the destructor just wrote into aio_requests[0]. */ talloc_free(fsp->aio_requests[0]); } :-(.
Ralph, that previous comment was your code (the correct one in 4.11.x). Really sorry about that :-(.
As Ralph just pointed out in the MR: Kudos to who deserves the kudos: that code was just copied by me from another part were it was added by Volker in commit 940395d3 and reviewed by you. :) https://gitlab.com/samba-team/samba/-/merge_requests/1584
This bug was referenced in samba master: fca8cb63762faff54cda243c1ed8217b36333131
Created attachment 16266 [details] git-am fix for 4.12.next, 4.13.next. Cherry-picked from master. Applies cleanly to 4.12.next, 4.13.next (only affected branches).
Reassigning to Karolin for inclusion in 4.12 and 4.13.
*** Bug 14565 has been marked as a duplicate of this bug. ***
(In reply to Ralph Böhme from comment #8) Pushed to autobuild-v4-{13,12}-test.
This bug was referenced in samba v4-13-test: 61c8a44895ea5e4bd42d1447384005d89f5327e6
This bug was referenced in samba v4-12-test: dcce5e5bf679e8d9afeb9bb9455da2c98b3ae7b2
Pushed to both branches. Closing out bug report. Thanks!
*** Bug 14584 has been marked as a duplicate of this bug. ***
*** Bug 14591 has been marked as a duplicate of this bug. ***
This bug was referenced in samba v4-13-stable (Release samba-4.13.3): 61c8a44895ea5e4bd42d1447384005d89f5327e6
This bug was referenced in samba v4-12-stable (Release samba-4.12.11): dcce5e5bf679e8d9afeb9bb9455da2c98b3ae7b2