Bug 15128 - possible use after free of connection_struct when iterating smbd_server_connection->connections
Summary: possible use after free of connection_struct when iterating smbd_server_conne...
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: unspecified
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Jule Anger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-07-23 11:16 UTC by Noel Power
Modified: 2022-09-28 15:42 UTC (History)
1 user (show)

See Also:


Attachments
valgrind output (3.76 KB, text/plain)
2022-07-23 12:08 UTC, Noel Power
no flags Details
valgrind output (4.05 KB, text/plain)
2022-07-23 12:08 UTC, Noel Power
no flags Details
debug & hacky repro tweak (15.57 KB, patch)
2022-07-24 14:00 UTC, Noel Power
no flags Details
full valgrind trace (that I got with difficulty) (72.65 KB, text/plain)
2022-07-24 14:02 UTC, Noel Power
no flags Details
associated debug from time of valgrind trace (34.04 KB, text/plain)
2022-07-24 14:02 UTC, Noel Power
no flags Details
debug & hacky repro tweak (18.75 KB, patch)
2022-08-03 16:14 UTC, Noel Power
no flags Details
full valgrind output (37.77 KB, text/plain)
2022-08-03 16:15 UTC, Noel Power
no flags Details
associated debug log.smbd from time of valgrind trace (104.22 KB, text/plain)
2022-08-03 16:16 UTC, Noel Power
no flags Details
patch for v4.17, v4.16, v.15 (7.22 KB, patch)
2022-08-17 10:16 UTC, Noel Power
no flags Details
patch for v4.17, v4.16, v.15 (7.35 KB, patch)
2022-08-17 10:21 UTC, Noel Power
jra: review-
Details
patch for v4.17 (14.97 KB, patch)
2022-08-18 16:06 UTC, Noel Power
jra: review+
Details
patch for v4.16 (15.07 KB, patch)
2022-08-18 16:06 UTC, Noel Power
jra: review+
Details
patch for v4.15 (13.67 KB, patch)
2022-08-18 16:07 UTC, Noel Power
jra: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Noel Power 2022-07-23 11:16:04 UTC

    
Comment 1 Noel Power 2022-07-23 12:05:23 UTC
We got reports from a customer of random core dumps. Unfortunately it has been really hard to figure out what is going on because of pretty strict security restrictions on the information they can share with us. So we have some core dumps (but only text, not the actual binaries) Also some samba logs but heavily redacted. By heavily redacted I mean nothing but the header (e.g. date/time file and line number :/)

e.g.

there are a number of variation of the core back trace, all essentially indicating some memory error associated with the connection_structs hanging off scon->connections e.g.

#5  0x00007f332f4f3854 in fault_report (sig=11) at ../../lib/util/fault.c:81
#6  sig_fault (sig=11) at ../../lib/util/fault.c:92
#7  <signal handler called>
#8  conn_lastused_update (sconn=0x55ce2bcbd8f0, t=1654159988) at ../../source3/smbd/conn_idle.c:38
#9  conn_idle_all (sconn=sconn@entry=0x55ce2bcbd8f0, t=1654159988) at ../../source3/smbd/conn_idle.c:54
#10 0x00007f332fab3734 in deadtime_fn (now=0x7ffcb86af710, private_data=0x55ce2bcbd8f0)
    at ../../source3/smbd/process.c:2926
#11 0x00007f332e6490d5 in smbd_idle_event_handler (ctx=0x55ce2bc6b240, te=<optimized out>, now=..., 
    private_data=<optimized out>) at ../../source3/lib/util_event.c:45
#12 0x00007f332c36dae6 in tevent_common_invoke_timer_handler (te=te@entry=0x55ce2bca5570, current_time=..., 
    removed=removed@entry=0x0) at ../../tevent_timed.c:376


Anyway, after some head scratching I managed to (only on a handful of rare occasions) reproduce very similar back traces by introducing a psuedo random failure in smbXsrv_tcon_disconnect (by injecting some custom code into  source3/smbd/smbXsrv_tcon.c) Note: although for a brief time I managed to reproduce the bug a couple of times easily after that I managed only to reproduce it 2 in about 3 days of trying, it's a real pita

For reference I will attach 2 valgrind reports I managed to get (with my hacked smbd binary to introduce some failing smbXsrv_tcon_disconnect badness)
Comment 2 Noel Power 2022-07-23 12:08:32 UTC
Created attachment 17440 [details]
valgrind output

note the line numbers here are a bit skewed due to other debug statements I added
Comment 3 Noel Power 2022-07-23 12:08:51 UTC
Created attachment 17441 [details]
valgrind output
Comment 4 Noel Power 2022-07-23 12:16:08 UTC
So while the cause of these cores seems to be the failing SMB2_OP_TDIS (there is indication this is happening with the debug header info in from customer logs) and I haven't yet got the chance to find out why this is happening, this failure shouldn't allow the use after free scenarios to happen.

Attempt to mitigate against either a failing or missing SMB2_OP_TDIS before getting a SMB2_OP_LOGOFF by introducing a talloc_destructor for the connection_struct

https://gitlab.com/samba-team/samba/-/merge_requests/2635
Comment 5 Jeremy Allison 2022-07-23 23:09:21 UTC
Hi Noel, can you also upload the hacks you added to the smbd binary so I can match the line numbers with the valgrind traces.

Having the line numbers not match makes this almost impossible to understand.

Also, full valgrind backtraces all the way up to main() helps a lot too.
Comment 6 Jeremy Allison 2022-07-23 23:10:21 UTC
For example, *where* inside source3/smbd/smbXsrv_tcon.c did you add the random failure ? There's not enough data here, sorry.
Comment 7 Noel Power 2022-07-24 14:00:52 UTC
Created attachment 17442 [details]
debug & hacky repro tweak
Comment 8 Noel Power 2022-07-24 14:02:17 UTC
Created attachment 17443 [details]
full valgrind trace (that I got with difficulty)
Comment 9 Noel Power 2022-07-24 14:02:41 UTC
Created attachment 17444 [details]
associated debug from time of valgrind trace
Comment 10 Noel Power 2022-07-24 14:23:52 UTC
(In reply to Jeremy Allison from comment #5)

for the benefit of getting matching line numbers the repro patch here was applied to 4.15.4 sources (but... confirmed by customer it is happening on 4.15.7 also)

regarding the hacky fail of the tree disconnect, I don't think the heuristic I introduced matters in the slightest, I only did it that way (when first testing my theory) because I wanted a spread of disconnects over IPC$ and whatever shares I was using (my test smbd server had 3 simple shares) You probably would be better off failing *every* time (so every disconnect results in a connection_struct not removed from sconn->connection)

Unfortunately I can't even offer a set of steps that seem to help in reproducing the valgrind trace, the first day I tried it I was just randomly clicking around, and I managed to get a good few 'hits' e.g. valgrind traces. Subsequently I really struggled to reproduce it (at one stage I spend 2 days+ on and off randomly browsing the shares and didn't once get a valgrind error trace.

Customer reports that they see cores frequently (within 3 hours of bringing the system online) I believe from what information we have that the system is heavily loaded (connection wise) Also, multi-channel is negotiated but, this isn't necessary. For a while I though maybe it was somehow involved, but they have reproduced this with/without mutlichannel enabled. Customer also reports that multichannel enabled seems to make it happen faster.

Hope this helps
Comment 11 Noel Power 2022-08-03 16:14:57 UTC
Created attachment 17457 [details]
debug & hacky repro tweak

debug & hacky repro tweak against master (commit id 766151bf5b7)
Comment 12 Noel Power 2022-08-03 16:15:23 UTC
Created attachment 17458 [details]
full valgrind output
Comment 13 Noel Power 2022-08-03 16:16:13 UTC
Created attachment 17459 [details]
associated debug log.smbd from time of valgrind trace
Comment 14 Jeremy Allison 2022-08-16 19:28:33 UTC
OK, now I fully understand this.

In SMB2 smbd_smb2_tree_connect() we create the new conn struct using:

        compat_conn = make_connection_smb2(req,
                                        tcon, snum,
                                        "???",
                                        &status);

then move the ownership to tcon using:

        tcon->compat = talloc_move(tcon, &compat_conn);

so the lifetime of tcon->compat is tied directly to tcon.

You correctly identified the error:

 905         if (tcon->compat) {
 906                 bool ok;
 907                 static int i;
 908                 int res = i % 3;
 909                 i = i + 1;
 910                 ok = chdir_current_service(tcon->compat);
 911                 DBG_ERR("## %d\n",  res);
 912                 if (!ok || res == 0) {
 913                         status = NT_STATUS_INTERNAL_ERROR;
 914                         DEBUG(0, ("#### smbXsrv_tcon_disconnect(0x%08x, '%s'): "
 915                                   "chdir_current_service() failed: %s\n",
 916                                   tcon->global->tcon_global_id,
 917                                   tcon->global->share_name,
 918                                   nt_errstr(status)));
 919                         tcon->compat = NULL;
 920                         return status;
 921                 }
 922 
 923                 close_cnum(tcon->compat, vuid);
 924                 tcon->compat = NULL;
 925         }

If "chdir_current_service(tcon->compat);" fails for whatever reason, we exit this block without ever having called close_cnum(tcon->compat, vuid), leaving the conn pointer left attached to:

sconn->connections.

Then tcon get's freed and tcon->compat goes with it, never having called
DLIST_REMOVE(conn->sconn->connections, conn);

Next list walk on conn->sconn->connections and *BOOM* valgrind error :-).

Your fix is 100% correct. There also needs to be a call to close_cnum(tcon->compat, vuid) before:

 919                         tcon->compat = NULL;
 920                         return status;

but that on top of your fix will completely fix this. Great work !
Comment 15 Samba QA Contact 2022-08-17 09:55:04 UTC
This bug was referenced in samba master:

f92bacbe216d2d74ea3ccf3fe0df5c1cc9860996
Comment 16 Noel Power 2022-08-17 10:16:56 UTC
Created attachment 17478 [details]
patch for v4.17, v4.16, v.15

not sure if all versions are necessary but if required the attached patch will apply cleanly for v4.15, v4.16 & v4.17
Comment 17 Noel Power 2022-08-17 10:21:56 UTC
Created attachment 17479 [details]
patch for v4.17, v4.16, v.15

with cherry-pick detail this time
Comment 18 Jeremy Allison 2022-08-17 18:33:24 UTC
Comment on attachment 17479 [details]
patch for v4.17, v4.16, v.15

Sorry Noel, I realized there's a follow-up patch we need here to complete this bug. I'll add the MR once it's passed ci.
Comment 19 Jeremy Allison 2022-08-17 19:06:56 UTC
Extra fixes in ci here:

https://gitlab.com/samba-team/devel/samba/-/pipelines/616086697
Comment 20 Samba QA Contact 2022-08-18 14:11:03 UTC
This bug was referenced in samba master:

9203d17106c0e55a30813ff1ed76869c7581a343
7005a6354df5522d9f665fb30052c458dfc93124
cf5f7b1489930f6d64c3e3512f116ccf286d4605
Comment 21 Noel Power 2022-08-18 16:06:03 UTC
Created attachment 17481 [details]
patch for v4.17
Comment 22 Noel Power 2022-08-18 16:06:41 UTC
Created attachment 17482 [details]
patch for v4.16
Comment 23 Noel Power 2022-08-18 16:07:17 UTC
Created attachment 17483 [details]
patch for v4.15
Comment 24 Noel Power 2022-08-18 16:09:22 UTC
not sure if v4.15 is still needed (says on the schedule page it is still in maint but from the date it looks like not) but... this is the one I am most interested in so I'd be really grateful for a double check on that (files.c changes in last patch are different) to make sure I didn't forget or miss something
Comment 25 Jeremy Allison 2022-08-18 18:17:41 UTC
Re-assigning to Jule for inclusion in 4.17.rcNext, 4.16.next, 4.15.next.
Comment 26 Jule Anger 2022-08-23 05:59:10 UTC
Pushed to autobuild-v4-{17,16,15}-test.
Comment 27 Samba QA Contact 2022-08-23 07:35:25 UTC
This bug was referenced in samba v4-15-test:

6b54bb8abead164487d6c061f729ac3dc25580b9
659dfb93c2acbd038aa5465abae89cc12e394ad4
6cd04ec396cf101fb86c1cc2c17f5547d2e2e154
4c436dfe8cca162f086bd49f4032bdd3eb77553b
907e4ce03ab6809dd00d984c78ff1c006d925f97
Comment 28 Samba QA Contact 2022-08-23 08:54:19 UTC
This bug was referenced in samba v4-16-test:

56e1a9fc623ae184fefcf3214a6b1801b37e5fff
d1bc0d0b51bb8145c4d1597a39f72d85b28f8b35
706c64c6f0ee8cca24715cf4d591ed504432ce0f
a5cf33d4041d44f1f8a80563b81f3bc6893bc7ce
ce464a83c76ce612171f3df4933058695210915e
Comment 29 Samba QA Contact 2022-08-23 08:58:19 UTC
This bug was referenced in samba v4-17-test:

0b33961e71a563835241c8beeae379151742e6f1
c47b7479e74944d20a82bc109dcf35aeb099eed0
5fc9bf0f63cca19a25b26d64175746431e45a47f
91273a969abcfc717b87e0a019341e78ac77b975
76bff90824aabf95147470b2f5877a4853a45de8
Comment 30 Jule Anger 2022-08-23 09:30:34 UTC
Closing out bug report.

Thanks!
Comment 31 Samba QA Contact 2022-08-23 14:51:24 UTC
This bug was referenced in samba v4-17-stable (Release samba-4.17.0rc3):

0b33961e71a563835241c8beeae379151742e6f1
c47b7479e74944d20a82bc109dcf35aeb099eed0
5fc9bf0f63cca19a25b26d64175746431e45a47f
91273a969abcfc717b87e0a019341e78ac77b975
76bff90824aabf95147470b2f5877a4853a45de8
Comment 32 Samba QA Contact 2022-09-07 19:02:25 UTC
This bug was referenced in samba v4-16-stable (Release samba-4.16.5):

56e1a9fc623ae184fefcf3214a6b1801b37e5fff
d1bc0d0b51bb8145c4d1597a39f72d85b28f8b35
706c64c6f0ee8cca24715cf4d591ed504432ce0f
a5cf33d4041d44f1f8a80563b81f3bc6893bc7ce
ce464a83c76ce612171f3df4933058695210915e
Comment 33 Samba QA Contact 2022-09-28 15:42:12 UTC
This bug was referenced in samba v4-15-stable (Release samba-4.15.10):

6b54bb8abead164487d6c061f729ac3dc25580b9
659dfb93c2acbd038aa5465abae89cc12e394ad4
6cd04ec396cf101fb86c1cc2c17f5547d2e2e154
4c436dfe8cca162f086bd49f4032bdd3eb77553b
907e4ce03ab6809dd00d984c78ff1c006d925f97