We got reports from a customer of random core dumps. Unfortunately it has been really hard to figure out what is going on because of pretty strict security restrictions on the information they can share with us. So we have some core dumps (but only text, not the actual binaries) Also some samba logs but heavily redacted. By heavily redacted I mean nothing but the header (e.g. date/time file and line number :/) e.g. there are a number of variation of the core back trace, all essentially indicating some memory error associated with the connection_structs hanging off scon->connections e.g. #5 0x00007f332f4f3854 in fault_report (sig=11) at ../../lib/util/fault.c:81 #6 sig_fault (sig=11) at ../../lib/util/fault.c:92 #7 <signal handler called> #8 conn_lastused_update (sconn=0x55ce2bcbd8f0, t=1654159988) at ../../source3/smbd/conn_idle.c:38 #9 conn_idle_all (sconn=sconn@entry=0x55ce2bcbd8f0, t=1654159988) at ../../source3/smbd/conn_idle.c:54 #10 0x00007f332fab3734 in deadtime_fn (now=0x7ffcb86af710, private_data=0x55ce2bcbd8f0) at ../../source3/smbd/process.c:2926 #11 0x00007f332e6490d5 in smbd_idle_event_handler (ctx=0x55ce2bc6b240, te=<optimized out>, now=..., private_data=<optimized out>) at ../../source3/lib/util_event.c:45 #12 0x00007f332c36dae6 in tevent_common_invoke_timer_handler (te=te@entry=0x55ce2bca5570, current_time=..., removed=removed@entry=0x0) at ../../tevent_timed.c:376 Anyway, after some head scratching I managed to (only on a handful of rare occasions) reproduce very similar back traces by introducing a psuedo random failure in smbXsrv_tcon_disconnect (by injecting some custom code into source3/smbd/smbXsrv_tcon.c) Note: although for a brief time I managed to reproduce the bug a couple of times easily after that I managed only to reproduce it 2 in about 3 days of trying, it's a real pita For reference I will attach 2 valgrind reports I managed to get (with my hacked smbd binary to introduce some failing smbXsrv_tcon_disconnect badness)
Created attachment 17440 [details] valgrind output note the line numbers here are a bit skewed due to other debug statements I added
Created attachment 17441 [details] valgrind output
So while the cause of these cores seems to be the failing SMB2_OP_TDIS (there is indication this is happening with the debug header info in from customer logs) and I haven't yet got the chance to find out why this is happening, this failure shouldn't allow the use after free scenarios to happen. Attempt to mitigate against either a failing or missing SMB2_OP_TDIS before getting a SMB2_OP_LOGOFF by introducing a talloc_destructor for the connection_struct https://gitlab.com/samba-team/samba/-/merge_requests/2635
Hi Noel, can you also upload the hacks you added to the smbd binary so I can match the line numbers with the valgrind traces. Having the line numbers not match makes this almost impossible to understand. Also, full valgrind backtraces all the way up to main() helps a lot too.
For example, *where* inside source3/smbd/smbXsrv_tcon.c did you add the random failure ? There's not enough data here, sorry.
Created attachment 17442 [details] debug & hacky repro tweak
Created attachment 17443 [details] full valgrind trace (that I got with difficulty)
Created attachment 17444 [details] associated debug from time of valgrind trace
(In reply to Jeremy Allison from comment #5) for the benefit of getting matching line numbers the repro patch here was applied to 4.15.4 sources (but... confirmed by customer it is happening on 4.15.7 also) regarding the hacky fail of the tree disconnect, I don't think the heuristic I introduced matters in the slightest, I only did it that way (when first testing my theory) because I wanted a spread of disconnects over IPC$ and whatever shares I was using (my test smbd server had 3 simple shares) You probably would be better off failing *every* time (so every disconnect results in a connection_struct not removed from sconn->connection) Unfortunately I can't even offer a set of steps that seem to help in reproducing the valgrind trace, the first day I tried it I was just randomly clicking around, and I managed to get a good few 'hits' e.g. valgrind traces. Subsequently I really struggled to reproduce it (at one stage I spend 2 days+ on and off randomly browsing the shares and didn't once get a valgrind error trace. Customer reports that they see cores frequently (within 3 hours of bringing the system online) I believe from what information we have that the system is heavily loaded (connection wise) Also, multi-channel is negotiated but, this isn't necessary. For a while I though maybe it was somehow involved, but they have reproduced this with/without mutlichannel enabled. Customer also reports that multichannel enabled seems to make it happen faster. Hope this helps
Created attachment 17457 [details] debug & hacky repro tweak debug & hacky repro tweak against master (commit id 766151bf5b7)
Created attachment 17458 [details] full valgrind output
Created attachment 17459 [details] associated debug log.smbd from time of valgrind trace
OK, now I fully understand this. In SMB2 smbd_smb2_tree_connect() we create the new conn struct using: compat_conn = make_connection_smb2(req, tcon, snum, "???", &status); then move the ownership to tcon using: tcon->compat = talloc_move(tcon, &compat_conn); so the lifetime of tcon->compat is tied directly to tcon. You correctly identified the error: 905 if (tcon->compat) { 906 bool ok; 907 static int i; 908 int res = i % 3; 909 i = i + 1; 910 ok = chdir_current_service(tcon->compat); 911 DBG_ERR("## %d\n", res); 912 if (!ok || res == 0) { 913 status = NT_STATUS_INTERNAL_ERROR; 914 DEBUG(0, ("#### smbXsrv_tcon_disconnect(0x%08x, '%s'): " 915 "chdir_current_service() failed: %s\n", 916 tcon->global->tcon_global_id, 917 tcon->global->share_name, 918 nt_errstr(status))); 919 tcon->compat = NULL; 920 return status; 921 } 922 923 close_cnum(tcon->compat, vuid); 924 tcon->compat = NULL; 925 } If "chdir_current_service(tcon->compat);" fails for whatever reason, we exit this block without ever having called close_cnum(tcon->compat, vuid), leaving the conn pointer left attached to: sconn->connections. Then tcon get's freed and tcon->compat goes with it, never having called DLIST_REMOVE(conn->sconn->connections, conn); Next list walk on conn->sconn->connections and *BOOM* valgrind error :-). Your fix is 100% correct. There also needs to be a call to close_cnum(tcon->compat, vuid) before: 919 tcon->compat = NULL; 920 return status; but that on top of your fix will completely fix this. Great work !
This bug was referenced in samba master: f92bacbe216d2d74ea3ccf3fe0df5c1cc9860996
Created attachment 17478 [details] patch for v4.17, v4.16, v.15 not sure if all versions are necessary but if required the attached patch will apply cleanly for v4.15, v4.16 & v4.17
Created attachment 17479 [details] patch for v4.17, v4.16, v.15 with cherry-pick detail this time
Comment on attachment 17479 [details] patch for v4.17, v4.16, v.15 Sorry Noel, I realized there's a follow-up patch we need here to complete this bug. I'll add the MR once it's passed ci.
Extra fixes in ci here: https://gitlab.com/samba-team/devel/samba/-/pipelines/616086697
This bug was referenced in samba master: 9203d17106c0e55a30813ff1ed76869c7581a343 7005a6354df5522d9f665fb30052c458dfc93124 cf5f7b1489930f6d64c3e3512f116ccf286d4605
Created attachment 17481 [details] patch for v4.17
Created attachment 17482 [details] patch for v4.16
Created attachment 17483 [details] patch for v4.15
not sure if v4.15 is still needed (says on the schedule page it is still in maint but from the date it looks like not) but... this is the one I am most interested in so I'd be really grateful for a double check on that (files.c changes in last patch are different) to make sure I didn't forget or miss something
Re-assigning to Jule for inclusion in 4.17.rcNext, 4.16.next, 4.15.next.
Pushed to autobuild-v4-{17,16,15}-test.
This bug was referenced in samba v4-15-test: 6b54bb8abead164487d6c061f729ac3dc25580b9 659dfb93c2acbd038aa5465abae89cc12e394ad4 6cd04ec396cf101fb86c1cc2c17f5547d2e2e154 4c436dfe8cca162f086bd49f4032bdd3eb77553b 907e4ce03ab6809dd00d984c78ff1c006d925f97
This bug was referenced in samba v4-16-test: 56e1a9fc623ae184fefcf3214a6b1801b37e5fff d1bc0d0b51bb8145c4d1597a39f72d85b28f8b35 706c64c6f0ee8cca24715cf4d591ed504432ce0f a5cf33d4041d44f1f8a80563b81f3bc6893bc7ce ce464a83c76ce612171f3df4933058695210915e
This bug was referenced in samba v4-17-test: 0b33961e71a563835241c8beeae379151742e6f1 c47b7479e74944d20a82bc109dcf35aeb099eed0 5fc9bf0f63cca19a25b26d64175746431e45a47f 91273a969abcfc717b87e0a019341e78ac77b975 76bff90824aabf95147470b2f5877a4853a45de8
Closing out bug report. Thanks!
This bug was referenced in samba v4-17-stable (Release samba-4.17.0rc3): 0b33961e71a563835241c8beeae379151742e6f1 c47b7479e74944d20a82bc109dcf35aeb099eed0 5fc9bf0f63cca19a25b26d64175746431e45a47f 91273a969abcfc717b87e0a019341e78ac77b975 76bff90824aabf95147470b2f5877a4853a45de8
This bug was referenced in samba v4-16-stable (Release samba-4.16.5): 56e1a9fc623ae184fefcf3214a6b1801b37e5fff d1bc0d0b51bb8145c4d1597a39f72d85b28f8b35 706c64c6f0ee8cca24715cf4d591ed504432ce0f a5cf33d4041d44f1f8a80563b81f3bc6893bc7ce ce464a83c76ce612171f3df4933058695210915e
This bug was referenced in samba v4-15-stable (Release samba-4.15.10): 6b54bb8abead164487d6c061f729ac3dc25580b9 659dfb93c2acbd038aa5465abae89cc12e394ad4 6cd04ec396cf101fb86c1cc2c17f5547d2e2e154 4c436dfe8cca162f086bd49f4032bdd3eb77553b 907e4ce03ab6809dd00d984c78ff1c006d925f97