Bug 14515 - assert_no_pending_aio() causes a core dump due to TALLOC_FREE() overwriting a valid value with NULL.
Summary: assert_no_pending_aio() causes a core dump due to TALLOC_FREE() overwriting a...
Status: ASSIGNED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: 4.12.2
Hardware: All All
: P5 regression (vote)
Target Milestone: ---
Assignee: Jeremy Allison
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-27 05:05 UTC by Jeremy Allison
Modified: 2020-09-30 17:50 UTC (History)
2 users (show)

See Also:


Attachments
git-am fix for 4.12.next, 4.13.next. (1.75 KB, patch)
2020-09-30 17:50 UTC, Jeremy Allison
jra: review? (slow)
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jeremy Allison 2020-09-27 05:05:11 UTC
Reported by Ashok Ramakrishnan <aramakrishnan@nasuni.com>.

Hi Jeremy:
My investigation continues! I thought something funny was happening around compiler optimization, and tried making the nu
+m_aio_requests a volatile unsigned in the files_struct.

We were able to reproduce the problem even with that (previously we used to get a core because fsp->aio_requests was NULL
+ and we were dereferencing it, now infinite loop...)... anyways, this is samba 4.12.6 (we do not have any vfs module red
+efining the pwrite/pread methods, but do have other custom VFS modules).

Here is my stack trace (attached live to the smbd process since it is in an infinite loop in assert_no_pending_aio()

Some variable values that may be of interest... The reason it is in an infinite loop is fsp->aio_requests[0] == NULL with
+ fsp->num_aio_requests == 4.

I have absolutely no idea how this could happen. Still looking. Clearly no one else seems to be having this problem, so m
+ust be something unique to our use-case, but thought I would share with you the progress. I will update the list if I fi
+nd anything concrete.

Thanks,

(gdb) p num_requests
$17 = 5
(gdb) p fsp->num_aio_requests
$18 = 4
(gdb) p fsp->aio_requests[0]
$19 = (struct tevent_req *) 0x0
(gdb) p fsp->aio_requests[1]
$20 = (struct tevent_req *) 0x55fca4da5a80
(gdb) p fsp->aio_requests[2]
$21 = (struct tevent_req *) 0x55fca4dd79a0
(gdb) p fsp->aio_requests[3]
$22 = (struct tevent_req *) 0x55fca4dd9a30
(gdb) p fsp->aio_requests[4]
$23 = (struct tevent_req *) 0x55fca4ddbac0
(gdb) p fsp->aio_requests[5]
$24 = (struct tevent_req *) 0x55fca4dcc190
(gdb) p *$20
$25 = {async = {fn = 0x7f2276c945b0 <aio_pwrite_smb2_done>, private_data = 0x55fca4da5980}, data = 0x55fca4da5c30, privat
+e_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 0x0,
    state = TEVENT_REQ_INIT}, internal = {private_type = 0x7f2276dc8400 "struct pwrite_fsync_state", create_location = 0x
+7f2276dc8593 "../../source3/smbd/aio.c:341",
    finish_location = 0x0, cancel_location = 0x0, state = TEVENT_REQ_IN_PROGRESS, error = 0, trigger = 0x55fca4da5b70, de
+fer_callback_ev = 0x0, timer = 0x0, profile = 0x0}}
(gdb) p *$21
$26 = {async = {fn = 0x7f2276c945b0 <aio_pwrite_smb2_done>, private_data = 0x55fca4dd78a0}, data = 0x55fca4dd7b50, privat
+e_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 0x0,
    state = TEVENT_REQ_INIT}, internal = {private_type = 0x7f2276dc8400 "struct pwrite_fsync_state", create_location = 0x
+7f2276dc8593 "../../source3/smbd/aio.c:341",
    finish_location = 0x0, cancel_location = 0x0, state = TEVENT_REQ_IN_PROGRESS, error = 0, trigger = 0x55fca4dd7a90, de
+fer_callback_ev = 0x0, timer = 0x0, profile = 0x0}}
Comment 1 Jeremy Allison 2020-09-27 05:06:23 UTC
Hi Jeremy:
After looking at the assembly code generated from this code block in assert_no_pending_aio(),

                while (fsp->num_aio_requests != 0) {
                        TALLOC_FREE(fsp->aio_requests[0]);
                }
                return;

It appears that the problem we may be running into is the logic in TALLOC_FREE().

I see the definition as

#define TALLOC_FREE(ctx) do { if (ctx != NULL) { talloc_free(ctx); ctx=NULL; } } while(0)

So, it sets the ctx (fsp->aio_requests[0] in this case) to NULL after the talloc_free().

Wouldn't that cause fsp->aio_requests[0] to be set to NULL after the first call to talloc_free()? It seems like that is w
+hat happened in my case below...

Thanks,
Ashok
Comment 2 Jeremy Allison 2020-09-27 05:10:49 UTC
The TALLOC_FREE(fsp->aio_requests[0]) inside assert_no_pending_aio() must be replaced with talloc_free(fsp->aio_requests[0]).

The destructor for fsp->aio_requests[index] is:

static int aio_del_req_from_fsp(struct aio_req_fsp_link *lnk)
{
        unsigned i;
        files_struct *fsp = lnk->fsp;
        struct tevent_req *req = lnk->req;

        for (i=0; i<fsp->num_aio_requests; i++) {
                if (fsp->aio_requests[i] == req) {
                        break;
                }
        }
        if (i == fsp->num_aio_requests) {
                DEBUG(1, ("req %p not found in fsp %p\n", req, fsp));
                return 0;
        }
        fsp->num_aio_requests -= 1;
        fsp->aio_requests[i] = fsp->aio_requests[fsp->num_aio_requests];

        if (fsp->num_aio_requests == 0) {
                TALLOC_FREE(fsp->aio_requests);
        }
        return 0;
}

Note that inside this destructor, if lnk->req == fsp->aio_requests[i], then

fsp->aio_requests[i] is *replaced* with the last element of the
fsp->aio_requests[] array.

If this is being invoked from assert_no_pending_aio() via TALLOC_FREE(fsp->aio_requests[0]), then the TALLOC_FREE macro does:

talloc_free(ctx); ctx=NULL;

which will overwrite the value just written by the destructor with NULL, thus causing the crash.
Comment 3 Jeremy Allison 2020-09-27 05:26:13 UTC
I broke this in 4.12 and above with the async close teardown fixes :-(. Look at what we had previously in 4.11.x

                while (fsp->num_aio_requests != 0) {
                        /*
                         * The destructor of the req will remove itself from the
                         * fsp.  Don't use TALLOC_FREE here, this will overwrite
                         * what the destructor just wrote into aio_requests[0].
                         */
                        talloc_free(fsp->aio_requests[0]);
                }

:-(.
Comment 4 Jeremy Allison 2020-09-27 05:27:57 UTC
Ralph, that previous comment was your code (the correct one in 4.11.x). Really sorry about that :-(.
Comment 5 Jeremy Allison 2020-09-27 06:42:52 UTC
As Ralph just pointed out in the MR:

Kudos to who deserves the kudos: that code was just copied by me from another part were it was added by Volker in commit 940395d3 and reviewed by you. :)

https://gitlab.com/samba-team/samba/-/merge_requests/1584
Comment 6 Samba QA Contact 2020-09-30 11:19:14 UTC
This bug was referenced in samba master:

fca8cb63762faff54cda243c1ed8217b36333131
Comment 7 Jeremy Allison 2020-09-30 17:50:05 UTC
Created attachment 16266 [details]
git-am fix for 4.12.next, 4.13.next.

Cherry-picked from master. Applies cleanly to 4.12.next, 4.13.next (only affected branches).