Bug 15523 - ctdb RELEASE_IP causes a crash in release_ip if a connection to a non-public address disconnects first
Summary: ctdb RELEASE_IP causes a crash in release_ip if a connection to a non-public ...
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Clustering (show other bugs)
Version: 4.19.0
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Jule Anger
QA Contact: Samba QA Contact
URL: https://gitlab.com/samba-team/samba/-...
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-21 16:16 UTC by Stefan Metzmacher
Modified: 2024-01-31 20:42 UTC (History)
4 users (show)

See Also:


Attachments
Patches for v4-19-test (62.88 KB, patch)
2023-12-15 12:25 UTC, Stefan Metzmacher
martins: review+
Details
Patches for v4-18-test (62.92 KB, patch)
2023-12-15 12:25 UTC, Stefan Metzmacher
martins: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Metzmacher 2023-11-21 16:16:54 UTC
If a client connects to a non-public address first followed by a connect
to public address with the same client_guid and a connection to
the non-public address gets disconnected first, we hit by a use-after-free
talloc_get_type_abort() called from release_ip() as
"xconn" is already gone, taking smbd_release_ip_state with it.
Comment 1 Martin Schwenke 2023-11-21 21:54:02 UTC
Mostly to give Amitay some context on this, if he gets time to follow this...

SMB multichannel advertises non-public-IPs for "secondary" connections after a client connects to a "primary" public IP.  While this is surprising to CTDB developers, it is the only sane way to make multichannel work so that clients don't end up with connections to different nodes.

In this bug, I still don't understand why a client would connect to a non-public IP first.  Maybe they shouldn't... but we also shouldn't crash.

In the future, with more flexible failover in CTDB, we will have a better design for multichannel...
Comment 2 Stefan Metzmacher 2023-11-22 11:51:05 UTC
(In reply to Martin Schwenke from comment #1)

I don't know how the problem happens in real setups,
but customer logs show, the use-after-free happened there.

From reading the code I was able to trigger it like this:
using smbclient //cluster/share -Uuser

All smbclient commands got this in addition:
--option="libsmb:client_guid=6112f7d3-9528-4a2a-8861-0ca129aae6c4"
It means all connections will be merged to same smbd process
in order to allow multichannel to that process.

The cluster looks like this:

# ctdb status
Number of nodes:3
pnn:0 172.31.99.166    OK (THIS NODE)
pnn:1 172.31.99.167    OK
pnn:2 172.31.99.168    OK
Generation:1237225803
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Leader:0
# ctdb ip
Public IPs on node 0
172.31.91.20 2
172.31.91.21 1
172.31.91.22 0
172.31.92.20 2
172.31.92.21 1
172.31.92.22 0

I reproduced the crash like this:
1. (In an extra terminal): smbclient //172.31.99.166/shm -Uroot%test
2. (In an extra terminal): smbclient //172.31.91.22/shm -Uroot%test

Now we have two connections to the same smbd:

# netstat -tpnW | grep '172.31.99.166'
tcp 0 0 172.31.91.166:445 172.31.91.1:42470 ESTABLISHED 1634797/smbd: clien
# netstat -tpnW | grep '172.31.91.22'
tcp 0 0 172.31.91.22:445  172.31.91.1:55462 ESTABLISHED 1634797/smbd: clien

But ctdb don't see the tcp connection for the connection to the
public ip:

 # ctdb gettickles 172.31.91.22
 Connections for IP: 172.31.91.22
 Num connections: 0

This is a minor problem and the reason behind need for the extra
CTDB_CONTROL_TCP_CLIENT_PASSED

 The problem happens like this:

 - the parent smbd accepted the 2nd connection 
   172.31.91.22:445  172.31.91.1:55462
   and forked a child smbd (PID=1634798) (as always)
 - smbd (PID=1634798) uses CTDB_CONTROL_TCP_CLIENT in order
   to register the connection with ctdb.
 - ctdb gets the CTDB_CONTROL_TCP_CLIENT message and adds
   the connection to client->tcp_list and forwards it
   as CTDB_CONTROL_TCP_ADD to all nodes (including itself)
 - ctdb gets the CTDB_CONTROL_TCP_ADD message and
   adds the connection to vnn->tcp_array
 - smbd (PID=1634798) processed the SMB2 Negotiate request
   and noticed the client guid was already registered by
   smbd with PID=1634797.
 - smbd (PID=1634798) sends a MSG_SMBXSRV_CONNECTION_PASS
   message to PID=1634797 with the socket fd and the
   content of the SMB2 Negotiate request
 - smbd (PID=1634797) received the MSG_SMBXSRV_CONNECTION_PASS
   message and also uses CTDB_CONTROL_TCP_CLIENT in order
   to register the connection with ctdb
 - ctdb gets the CTDB_CONTROL_TCP_CLIENT message and adds
   the connection to client->tcp_list and broadcasts it
   as CTDB_CONTROL_TCP_ADD to all nodes (including itself)
 - ctdb gets the CTDB_CONTROL_TCP_ADD message and
   detects that the connection is already in vnn->tcp_array
   and the ignores the 2nd registration
 - smbd (PID=1634797) processes the SMB2 Negotiate request
   again and sends a MSG_SMBXSRV_CONNECTION_PASSED message
   to PID=1634798
 - smbd (PID=1634798) exits itself, which means the connection
   to ctdb is also closed
 - ctdb detects that PID=1634798 closed the connection
   and calls ctdb_takeover_client_destructor_hook() it
   calls ctdb_remove_connection() for the tcp connection
   and removes it from vnn->tcp_array too, which means
   "ctdb gettickles" no longer sees the connection,
   which is still valid for smbd (PID=1634797)

Now back to reproducing the crash problem:

3. The smbclient from step 1. is stopped with "exit"
   This means the release_ip callback registered for it
   by ctdbd_register_ips() is still registered, but
   its private_data "state" is gone.
4. "ctdb moveip 172.31.91.22 1" moves the ip to
   another node and ctdb sends a message to CTDB_SRVID_RELEASE_IP
   in order to let smbd (PID=1634797) disconnect/exit

The problem is that ctdbd_msg_call_back() calls all
registered callback in the order they called register_with_ctdbd(),
which means the release_ip function with the stale 'state' is
called first and hits the use-after-free.

Now we have an explicit ctdbd_unregister_ips() in order to
unregister the callback with the stale state and also
inform ctdb via CTDB_CONTROL_TCP_CLIENT_DISCONNECTED
that the tcp connection is gone now.

In the case of multichannel passing from the
temporary smbd, we need to use ctdbd_passed_ips()
and CTDB_CONTROL_TCP_CLIENT_PASSED in order to
remove the connection from client->tcp_list,
but keep it alive in vnn->tcp_array.

I hope that makes it clear now
Comment 3 Samba QA Contact 2023-12-15 12:10:05 UTC
This bug was referenced in samba master:

92badd3bdd82d1fa79727efcf81b6f479016811f
c6602b686b4e50d93272667ef86d3904181fb1ab
037e8e449deb136ad5ed5e4de05439411b545b6d
240139370aa19f53dd3de0ff468afd994d3bd973
77a559432ffde2d435e29bed126d20a09d33f48e
75aa6693940201a928b46f6880b43820c0e1c555
f3a03f3f774f0795fc1a163f12cccb9cedeebec1
2e784789d78d09dfbc599085e5eb1c70c5b866b8
38b74d4ca9a59e7f12850c20c410f9df26cbad0a
082c7df4d04c2a94c5413c1d6b7eae7be610f950
ddf47e7fe314e0f5bf71ff53e35350e0ba530d08
8fc3872557f715dc38f9898754a785fd073ace96
4b7329f15820f1b4d9a7b7f0947719c4217b312a
Comment 4 Stefan Metzmacher 2023-12-15 12:25:22 UTC
Created attachment 18200 [details]
Patches for v4-19-test
Comment 5 Stefan Metzmacher 2023-12-15 12:25:53 UTC
Created attachment 18201 [details]
Patches for v4-18-test
Comment 6 Martin Schwenke 2023-12-16 06:54:57 UTC
Hi Jule,

This is ready for 4.19 and 4.18.

Thanks...
Comment 7 Jule Anger 2023-12-16 14:29:25 UTC
Pushed to autobuild-v4-{19,18}-test.
Comment 8 Samba QA Contact 2023-12-16 15:27:04 UTC
This bug was referenced in samba v4-18-test:

30fddc01431ace1ea345db0fcd96415cf60495f9
f769415799332387ea7b2daa73d47f2f18fbe386
813e7186719d8f0277671918fe00d407ab8d35d3
24d960d02b84a498cfacc5ee2b283d7bd5b2e1f6
562e360ed7ce131e8e6f08d521aee2f21a1e0160
ecc0acbbff21ddae9b8b676993fa8ac20db4b55b
18d34cea2a19232dc30cbe45545885a6406e2be0
6ca3ce4db30cf2d4cd906cf2a695bb928ec9f2ba
d9ed96c908cf83fc0765372c4e54919551fe94f5
f8c02609f4807435cbdee1d1433429a549fc981e
279187965b8b2adbd2939bf2a9e587edce04431d
ff4ed4d760a61e59d2e23f1e3d956308585b846d
a149a96eaf250625b4be89aa1059330bd953a06c
Comment 9 Samba QA Contact 2023-12-16 16:09:04 UTC
This bug was referenced in samba v4-19-test:

38134f374ca5c76cf636dd9d10ab8e2cf5e4bf1e
b6906f37c667677dcac9bcfd6cc182869cb6f569
118d6c81ec9fd19a98d5f674ccff2a57aed81881
e09f92422ebff3454332ebbb28d578d8103e7203
d039fa07f7ee1f0eed6777f812a0ba4913e76ac7
acf080817b538968d6e7cbcd5a037399c9a66c28
e3a4feda112a174eebf9c866032aff899202eed7
8add947b2127e3a2dd2fd515d7701c396733f8e4
d96cb627b66d4c63d0407f4e434b429bb03a47af
43b7068676a00a5169ac6f34e97a1bacf8d0c29d
2640bae75e3abab1a9959b2de14dad29020852de
2e93e358ef9f3d55065cd5e9ad1b03126df06b63
853efb9dd7b7eccebc5bd8a4129f2dc487ad8120
Comment 10 Jule Anger 2023-12-18 08:37:02 UTC
Closing out bug report.

Thanks!
Comment 11 Samba QA Contact 2024-01-08 14:39:16 UTC
This bug was referenced in samba v4-19-stable (Release samba-4.19.4):

38134f374ca5c76cf636dd9d10ab8e2cf5e4bf1e
b6906f37c667677dcac9bcfd6cc182869cb6f569
118d6c81ec9fd19a98d5f674ccff2a57aed81881
e09f92422ebff3454332ebbb28d578d8103e7203
d039fa07f7ee1f0eed6777f812a0ba4913e76ac7
acf080817b538968d6e7cbcd5a037399c9a66c28
e3a4feda112a174eebf9c866032aff899202eed7
8add947b2127e3a2dd2fd515d7701c396733f8e4
d96cb627b66d4c63d0407f4e434b429bb03a47af
43b7068676a00a5169ac6f34e97a1bacf8d0c29d
2640bae75e3abab1a9959b2de14dad29020852de
2e93e358ef9f3d55065cd5e9ad1b03126df06b63
853efb9dd7b7eccebc5bd8a4129f2dc487ad8120
Comment 12 Samba QA Contact 2024-01-31 20:42:38 UTC
This bug was referenced in samba v4-18-stable (Release samba-4.18.10):

30fddc01431ace1ea345db0fcd96415cf60495f9
f769415799332387ea7b2daa73d47f2f18fbe386
813e7186719d8f0277671918fe00d407ab8d35d3
24d960d02b84a498cfacc5ee2b283d7bd5b2e1f6
562e360ed7ce131e8e6f08d521aee2f21a1e0160
ecc0acbbff21ddae9b8b676993fa8ac20db4b55b
18d34cea2a19232dc30cbe45545885a6406e2be0
6ca3ce4db30cf2d4cd906cf2a695bb928ec9f2ba
d9ed96c908cf83fc0765372c4e54919551fe94f5
f8c02609f4807435cbdee1d1433429a549fc981e
279187965b8b2adbd2939bf2a9e587edce04431d
ff4ed4d760a61e59d2e23f1e3d956308585b846d
a149a96eaf250625b4be89aa1059330bd953a06c