Bug 14515 - assert_no_pending_aio() causes a core dump due to TALLOC_FREE() overwriting a valid value with NULL.
Summary: assert_no_pending_aio() causes a core dump due to TALLOC_FREE() overwriting a...
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: 4.12.2
Hardware: All All
: P5 regression (vote)
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
: 14565 14584 14591 (view as bug list)
Depends on:
Blocks:
 
Reported: 2020-09-27 05:05 UTC by Jeremy Allison
Modified: 2021-01-14 08:33 UTC (History)
5 users (show)

See Also:


Attachments
git-am fix for 4.12.next, 4.13.next. (1.75 KB, patch)
2020-09-30 17:50 UTC, Jeremy Allison
slow: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jeremy Allison 2020-09-27 05:05:11 UTC
Reported by Ashok Ramakrishnan <aramakrishnan@nasuni.com>.

Hi Jeremy:
My investigation continues! I thought something funny was happening around compiler optimization, and tried making the nu
+m_aio_requests a volatile unsigned in the files_struct.

We were able to reproduce the problem even with that (previously we used to get a core because fsp->aio_requests was NULL
+ and we were dereferencing it, now infinite loop...)... anyways, this is samba 4.12.6 (we do not have any vfs module red
+efining the pwrite/pread methods, but do have other custom VFS modules).

Here is my stack trace (attached live to the smbd process since it is in an infinite loop in assert_no_pending_aio()

Some variable values that may be of interest... The reason it is in an infinite loop is fsp->aio_requests[0] == NULL with
+ fsp->num_aio_requests == 4.

I have absolutely no idea how this could happen. Still looking. Clearly no one else seems to be having this problem, so m
+ust be something unique to our use-case, but thought I would share with you the progress. I will update the list if I fi
+nd anything concrete.

Thanks,

(gdb) p num_requests
$17 = 5
(gdb) p fsp->num_aio_requests
$18 = 4
(gdb) p fsp->aio_requests[0]
$19 = (struct tevent_req *) 0x0
(gdb) p fsp->aio_requests[1]
$20 = (struct tevent_req *) 0x55fca4da5a80
(gdb) p fsp->aio_requests[2]
$21 = (struct tevent_req *) 0x55fca4dd79a0
(gdb) p fsp->aio_requests[3]
$22 = (struct tevent_req *) 0x55fca4dd9a30
(gdb) p fsp->aio_requests[4]
$23 = (struct tevent_req *) 0x55fca4ddbac0
(gdb) p fsp->aio_requests[5]
$24 = (struct tevent_req *) 0x55fca4dcc190
(gdb) p *$20
$25 = {async = {fn = 0x7f2276c945b0 <aio_pwrite_smb2_done>, private_data = 0x55fca4da5980}, data = 0x55fca4da5c30, privat
+e_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 0x0,
    state = TEVENT_REQ_INIT}, internal = {private_type = 0x7f2276dc8400 "struct pwrite_fsync_state", create_location = 0x
+7f2276dc8593 "../../source3/smbd/aio.c:341",
    finish_location = 0x0, cancel_location = 0x0, state = TEVENT_REQ_IN_PROGRESS, error = 0, trigger = 0x55fca4da5b70, de
+fer_callback_ev = 0x0, timer = 0x0, profile = 0x0}}
(gdb) p *$21
$26 = {async = {fn = 0x7f2276c945b0 <aio_pwrite_smb2_done>, private_data = 0x55fca4dd78a0}, data = 0x55fca4dd7b50, privat
+e_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 0x0,
    state = TEVENT_REQ_INIT}, internal = {private_type = 0x7f2276dc8400 "struct pwrite_fsync_state", create_location = 0x
+7f2276dc8593 "../../source3/smbd/aio.c:341",
    finish_location = 0x0, cancel_location = 0x0, state = TEVENT_REQ_IN_PROGRESS, error = 0, trigger = 0x55fca4dd7a90, de
+fer_callback_ev = 0x0, timer = 0x0, profile = 0x0}}
Comment 1 Jeremy Allison 2020-09-27 05:06:23 UTC
Hi Jeremy:
After looking at the assembly code generated from this code block in assert_no_pending_aio(),

                while (fsp->num_aio_requests != 0) {
                        TALLOC_FREE(fsp->aio_requests[0]);
                }
                return;

It appears that the problem we may be running into is the logic in TALLOC_FREE().

I see the definition as

#define TALLOC_FREE(ctx) do { if (ctx != NULL) { talloc_free(ctx); ctx=NULL; } } while(0)

So, it sets the ctx (fsp->aio_requests[0] in this case) to NULL after the talloc_free().

Wouldn't that cause fsp->aio_requests[0] to be set to NULL after the first call to talloc_free()? It seems like that is w
+hat happened in my case below...

Thanks,
Ashok
Comment 2 Jeremy Allison 2020-09-27 05:10:49 UTC
The TALLOC_FREE(fsp->aio_requests[0]) inside assert_no_pending_aio() must be replaced with talloc_free(fsp->aio_requests[0]).

The destructor for fsp->aio_requests[index] is:

static int aio_del_req_from_fsp(struct aio_req_fsp_link *lnk)
{
        unsigned i;
        files_struct *fsp = lnk->fsp;
        struct tevent_req *req = lnk->req;

        for (i=0; i<fsp->num_aio_requests; i++) {
                if (fsp->aio_requests[i] == req) {
                        break;
                }
        }
        if (i == fsp->num_aio_requests) {
                DEBUG(1, ("req %p not found in fsp %p\n", req, fsp));
                return 0;
        }
        fsp->num_aio_requests -= 1;
        fsp->aio_requests[i] = fsp->aio_requests[fsp->num_aio_requests];

        if (fsp->num_aio_requests == 0) {
                TALLOC_FREE(fsp->aio_requests);
        }
        return 0;
}

Note that inside this destructor, if lnk->req == fsp->aio_requests[i], then

fsp->aio_requests[i] is *replaced* with the last element of the
fsp->aio_requests[] array.

If this is being invoked from assert_no_pending_aio() via TALLOC_FREE(fsp->aio_requests[0]), then the TALLOC_FREE macro does:

talloc_free(ctx); ctx=NULL;

which will overwrite the value just written by the destructor with NULL, thus causing the crash.
Comment 3 Jeremy Allison 2020-09-27 05:26:13 UTC
I broke this in 4.12 and above with the async close teardown fixes :-(. Look at what we had previously in 4.11.x

                while (fsp->num_aio_requests != 0) {
                        /*
                         * The destructor of the req will remove itself from the
                         * fsp.  Don't use TALLOC_FREE here, this will overwrite
                         * what the destructor just wrote into aio_requests[0].
                         */
                        talloc_free(fsp->aio_requests[0]);
                }

:-(.
Comment 4 Jeremy Allison 2020-09-27 05:27:57 UTC
Ralph, that previous comment was your code (the correct one in 4.11.x). Really sorry about that :-(.
Comment 5 Jeremy Allison 2020-09-27 06:42:52 UTC
As Ralph just pointed out in the MR:

Kudos to who deserves the kudos: that code was just copied by me from another part were it was added by Volker in commit 940395d3 and reviewed by you. :)

https://gitlab.com/samba-team/samba/-/merge_requests/1584
Comment 6 Samba QA Contact 2020-09-30 11:19:14 UTC
This bug was referenced in samba master:

fca8cb63762faff54cda243c1ed8217b36333131
Comment 7 Jeremy Allison 2020-09-30 17:50:05 UTC
Created attachment 16266 [details]
git-am fix for 4.12.next, 4.13.next.

Cherry-picked from master. Applies cleanly to 4.12.next, 4.13.next (only affected branches).
Comment 8 Ralph Böhme 2020-11-09 11:18:09 UTC
Reassigning to Karolin for inclusion in 4.12 and 4.13.
Comment 9 Ralph Böhme 2020-11-09 11:20:08 UTC
*** Bug 14565 has been marked as a duplicate of this bug. ***
Comment 10 Karolin Seeger 2020-11-19 09:51:30 UTC
(In reply to Ralph Böhme from comment #8)
Pushed to autobuild-v4-{13,12}-test.
Comment 11 Samba QA Contact 2020-11-19 11:26:11 UTC
This bug was referenced in samba v4-13-test:

61c8a44895ea5e4bd42d1447384005d89f5327e6
Comment 12 Samba QA Contact 2020-11-19 14:11:04 UTC
This bug was referenced in samba v4-12-test:

dcce5e5bf679e8d9afeb9bb9455da2c98b3ae7b2
Comment 13 Karolin Seeger 2020-11-24 11:20:15 UTC
Pushed to both branches.
Closing out bug report.

Thanks!
Comment 14 Stefan Metzmacher 2020-11-27 14:04:55 UTC
*** Bug 14584 has been marked as a duplicate of this bug. ***
Comment 15 Karolin Seeger 2020-12-08 09:42:51 UTC
*** Bug 14591 has been marked as a duplicate of this bug. ***
Comment 16 Samba QA Contact 2020-12-16 12:21:36 UTC
This bug was referenced in samba v4-13-stable (Release samba-4.13.3):

61c8a44895ea5e4bd42d1447384005d89f5327e6
Comment 17 Samba QA Contact 2021-01-14 08:33:45 UTC
This bug was referenced in samba v4-12-stable (Release samba-4.12.11):

dcce5e5bf679e8d9afeb9bb9455da2c98b3ae7b2