Bug 15128 - possible use after free of connection_struct when iterating smbd_server_connection->connections
Summary: possible use after free of connection_struct when iterating smbd_server_connection->connections
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: unspecified
Hardware: All
OS: All
Importance: P5 normal
Target Milestone: ---
Assignee: Noel Power
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-07-23 11:16 UTC by Noel Power
Modified: 2022-08-03 16:16 UTC
CC List: 0 users

See Also:


Attachments
valgrind output (3.76 KB, text/plain)
2022-07-23 12:08 UTC, Noel Power
valgrind output (4.05 KB, text/plain)
2022-07-23 12:08 UTC, Noel Power
debug & hacky repro tweak (15.57 KB, patch)
2022-07-24 14:00 UTC, Noel Power
full valgrind trace (that I got with difficulty) (72.65 KB, text/plain)
2022-07-24 14:02 UTC, Noel Power
associated debug from time of valgrind trace (34.04 KB, text/plain)
2022-07-24 14:02 UTC, Noel Power
debug & hacky repro tweak (18.75 KB, patch)
2022-08-03 16:14 UTC, Noel Power
full valgrind output (37.77 KB, text/plain)
2022-08-03 16:15 UTC, Noel Power
associated debug log.smbd from time of valgrind trace (104.22 KB, text/plain)
2022-08-03 16:16 UTC, Noel Power

Description Noel Power 2022-07-23 11:16:04 UTC

    
Comment 1 Noel Power 2022-07-23 12:05:23 UTC
We got reports from a customer of random core dumps. Unfortunately it has been really hard to figure out what is going on because of pretty strict security restrictions on the information they can share with us. So we have some core dumps (but only the text, not the actual binaries) and some samba logs, but heavily redacted. By heavily redacted I mean nothing but the header (e.g. date/time, file and line number) :/

There are a number of variations of the core backtrace, all essentially indicating some memory error associated with the connection_structs hanging off sconn->connections, e.g.:

#5  0x00007f332f4f3854 in fault_report (sig=11) at ../../lib/util/fault.c:81
#6  sig_fault (sig=11) at ../../lib/util/fault.c:92
#7  <signal handler called>
#8  conn_lastused_update (sconn=0x55ce2bcbd8f0, t=1654159988) at ../../source3/smbd/conn_idle.c:38
#9  conn_idle_all (sconn=sconn@entry=0x55ce2bcbd8f0, t=1654159988) at ../../source3/smbd/conn_idle.c:54
#10 0x00007f332fab3734 in deadtime_fn (now=0x7ffcb86af710, private_data=0x55ce2bcbd8f0)
    at ../../source3/smbd/process.c:2926
#11 0x00007f332e6490d5 in smbd_idle_event_handler (ctx=0x55ce2bc6b240, te=<optimized out>, now=..., 
    private_data=<optimized out>) at ../../source3/lib/util_event.c:45
#12 0x00007f332c36dae6 in tevent_common_invoke_timer_handler (te=te@entry=0x55ce2bca5570, current_time=..., 
    removed=removed@entry=0x0) at ../../tevent_timed.c:376
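
For context, conn_lastused_update() is just walking the sconn->connections list and touching each entry. Roughly (paraphrasing source3/smbd/conn_idle.c from memory, not a verbatim copy), the loop looks like this, so the first dereference of a freed-but-still-linked entry is what blows up:

static void conn_lastused_update(struct smbd_server_connection *sconn,
                                 time_t t)
{
        struct connection_struct *conn;

        for (conn = sconn->connections; conn != NULL; conn = conn->next) {
                /* If a failed tree disconnect left a freed conn on this
                 * list, dereferencing it here is a use after free. */
                if (conn->lastused != conn->lastused_count) {
                        conn->lastused = t;
                        conn->lastused_count = t;
                }
        }
}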


Anyway, after some head scratching I managed (only on a handful of rare occasions) to reproduce very similar backtraces by introducing a pseudo-random failure in smbXsrv_tcon_disconnect (by injecting some custom code into source3/smbd/smbXsrv_tcon.c). Note: although for a brief time I managed to reproduce the bug a couple of times easily, after that I managed to reproduce it only twice in about 3 days of trying; it's a real pita.

For reference I will attach 2 valgrind reports I managed to get (with my hacked smbd binary introducing the failing smbXsrv_tcon_disconnect badness).
Comment 2 Noel Power 2022-07-23 12:08:32 UTC
Created attachment 17440 [details]
valgrind output

Note: the line numbers here are a bit skewed due to other debug statements I added.
Comment 3 Noel Power 2022-07-23 12:08:51 UTC
Created attachment 17441 [details]
valgrind output
Comment 4 Noel Power 2022-07-23 12:16:08 UTC
So while the cause of these cores seems to be the failing SMB2_OP_TDIS (the debug header info from the customer logs indicates this is what is happening), and although I haven't yet had the chance to find out why it fails, that failure shouldn't allow the use-after-free scenario to happen.

Attempt to mitigate against either a failing or a missing SMB2_OP_TDIS before getting a SMB2_OP_LOGOFF by introducing a talloc destructor for the connection_struct:

https://gitlab.com/samba-team/samba/-/merge_requests/2635
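
The general shape of the mitigation (a minimal sketch only; the destructor name and placement here are illustrative, the real change is in the MR above) is to register a talloc destructor on the connection_struct that unlinks it from sconn->connections, so an entry freed without a clean SMB2_OP_TDIS can never be walked by conn_idle_all():

/* Sketch: make sure a dying conn unlinks itself from the list. */
static int connection_struct_destructor(struct connection_struct *conn)
{
        if (conn->sconn != NULL) {
                DLIST_REMOVE(conn->sconn->connections, conn);
                conn->sconn = NULL;
        }
        return 0;
}

/* ...wherever the connection_struct is allocated: */
conn = talloc_zero(ctx, struct connection_struct);
if (conn == NULL) {
        return NULL;
}
talloc_set_destructor(conn, connection_struct_destructor);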
Comment 5 Jeremy Allison 2022-07-23 23:09:21 UTC
Hi Noel, can you also upload the hacks you added to the smbd binary so I can match the line numbers with the valgrind traces?

Having the line numbers not match makes this almost impossible to understand.

Also, full valgrind backtraces all the way up to main() helps a lot too.
Comment 6 Jeremy Allison 2022-07-23 23:10:21 UTC
For example, *where* inside source3/smbd/smbXsrv_tcon.c did you add the random failure? There's not enough data here, sorry.
Comment 7 Noel Power 2022-07-24 14:00:52 UTC
Created attachment 17442 [details]
debug & hacky repro tweak
Comment 8 Noel Power 2022-07-24 14:02:17 UTC
Created attachment 17443 [details]
full valgrind trace (that I got with difficulty)
Comment 9 Noel Power 2022-07-24 14:02:41 UTC
Created attachment 17444 [details]
associated debug from time of valgrind trace
Comment 10 Noel Power 2022-07-24 14:23:52 UTC
(In reply to Jeremy Allison from comment #5)

For the benefit of getting matching line numbers, the repro patch here was applied to 4.15.4 sources (but... confirmed by the customer that it is happening on 4.15.7 also).

Regarding the hacky fail of the tree disconnect, I don't think the heuristic I introduced matters in the slightest; I only did it that way (when first testing my theory) because I wanted a spread of disconnects over IPC$ and whatever shares I was using (my test smbd server had 3 simple shares). You would probably be better off failing *every* time (so every disconnect results in a connection_struct not being removed from sconn->connections), along the lines of the sketch below.
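
To make that concrete: the kind of hack I mean is an early bail-out at the top of smbXsrv_tcon_disconnect() in source3/smbd/smbXsrv_tcon.c, something like the following (a sketch of the idea, not the exact hunk from the attached patch):

NTSTATUS smbXsrv_tcon_disconnect(struct smbXsrv_tcon *tcon, uint64_t vuid)
{
        /* Repro hack: pretend every tree disconnect fails, so the
         * corresponding connection_struct is never removed from
         * sconn->connections and a later idle scan walks a stale
         * entry. */
        return NT_STATUS_INTERNAL_ERROR;

        /* ...original function body below is now unreachable... */
}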

Unfortunately I can't even offer a set of steps that seem to help in reproducing the valgrind trace. The first day I tried it I was just randomly clicking around and managed to get a good few 'hits', i.e. valgrind traces. Subsequently I really struggled to reproduce it (at one stage I spent 2+ days, on and off, randomly browsing the shares and didn't once get a valgrind error trace).

The customer reports that they see cores frequently (within 3 hours of bringing the system online). I believe from what information we have that the system is heavily loaded (connection-wise). Also, multi-channel is negotiated, but this isn't necessary: for a while I thought maybe it was somehow involved, but they have reproduced this both with and without multi-channel enabled. The customer also reports that enabling multi-channel seems to make it happen faster.

Hope this helps
Comment 11 Noel Power 2022-08-03 16:14:57 UTC
Created attachment 17457 [details]
debug & hacky repro tweak

debug & hacky repro tweak against master (commit id 766151bf5b7)
Comment 12 Noel Power 2022-08-03 16:15:23 UTC
Created attachment 17458 [details]
full valgrind output
Comment 13 Noel Power 2022-08-03 16:16:13 UTC
Created attachment 17459 [details]
associated debug log.smbd from time of valgrind trace