If a client connects to a non-public address first followed by a connect to public address with the same client_guid and a connection to the non-public address gets disconnected first, we hit by a use-after-free talloc_get_type_abort() called from release_ip() as "xconn" is already gone, taking smbd_release_ip_state with it.
Mostly to give Amitay some context on this, if he gets time to follow this... SMB multichannel advertises non-public-IPs for "secondary" connections after a client connects to a "primary" public IP. While this is surprising to CTDB developers, it is the only sane way to make multichannel work so that clients don't end up with connections to different nodes. In this bug, I still don't understand why a client would connect to a non-public IP first. Maybe they shouldn't... but we also shouldn't crash. In the future, with more flexible failover in CTDB, we will have a better design for multichannel...
(In reply to Martin Schwenke from comment #1) I don't know how the problem happens in real setups, but customer logs show, the use-after-free happened there. From reading the code I was able to trigger it like this: using smbclient //cluster/share -Uuser All smbclient commands got this in addition: --option="libsmb:client_guid=6112f7d3-9528-4a2a-8861-0ca129aae6c4" It means all connections will be merged to same smbd process in order to allow multichannel to that process. The cluster looks like this: # ctdb status Number of nodes:3 pnn:0 172.31.99.166 OK (THIS NODE) pnn:1 172.31.99.167 OK pnn:2 172.31.99.168 OK Generation:1237225803 Size:3 hash:0 lmaster:0 hash:1 lmaster:1 hash:2 lmaster:2 Recovery mode:NORMAL (0) Leader:0 # ctdb ip Public IPs on node 0 172.31.91.20 2 172.31.91.21 1 172.31.91.22 0 172.31.92.20 2 172.31.92.21 1 172.31.92.22 0 I reproduced the crash like this: 1. (In an extra terminal): smbclient //172.31.99.166/shm -Uroot%test 2. (In an extra terminal): smbclient //172.31.91.22/shm -Uroot%test Now we have two connections to the same smbd: # netstat -tpnW | grep '172.31.99.166' tcp 0 0 172.31.91.166:445 172.31.91.1:42470 ESTABLISHED 1634797/smbd: clien # netstat -tpnW | grep '172.31.91.22' tcp 0 0 172.31.91.22:445 172.31.91.1:55462 ESTABLISHED 1634797/smbd: clien But ctdb don't see the tcp connection for the connection to the public ip: # ctdb gettickles 172.31.91.22 Connections for IP: 172.31.91.22 Num connections: 0 This is a minor problem and the reason behind need for the extra CTDB_CONTROL_TCP_CLIENT_PASSED The problem happens like this: - the parent smbd accepted the 2nd connection 172.31.91.22:445 172.31.91.1:55462 and forked a child smbd (PID=1634798) (as always) - smbd (PID=1634798) uses CTDB_CONTROL_TCP_CLIENT in order to register the connection with ctdb. - ctdb gets the CTDB_CONTROL_TCP_CLIENT message and adds the connection to client->tcp_list and forwards it as CTDB_CONTROL_TCP_ADD to all nodes (including itself) - ctdb gets the CTDB_CONTROL_TCP_ADD message and adds the connection to vnn->tcp_array - smbd (PID=1634798) processed the SMB2 Negotiate request and noticed the client guid was already registered by smbd with PID=1634797. - smbd (PID=1634798) sends a MSG_SMBXSRV_CONNECTION_PASS message to PID=1634797 with the socket fd and the content of the SMB2 Negotiate request - smbd (PID=1634797) received the MSG_SMBXSRV_CONNECTION_PASS message and also uses CTDB_CONTROL_TCP_CLIENT in order to register the connection with ctdb - ctdb gets the CTDB_CONTROL_TCP_CLIENT message and adds the connection to client->tcp_list and broadcasts it as CTDB_CONTROL_TCP_ADD to all nodes (including itself) - ctdb gets the CTDB_CONTROL_TCP_ADD message and detects that the connection is already in vnn->tcp_array and the ignores the 2nd registration - smbd (PID=1634797) processes the SMB2 Negotiate request again and sends a MSG_SMBXSRV_CONNECTION_PASSED message to PID=1634798 - smbd (PID=1634798) exits itself, which means the connection to ctdb is also closed - ctdb detects that PID=1634798 closed the connection and calls ctdb_takeover_client_destructor_hook() it calls ctdb_remove_connection() for the tcp connection and removes it from vnn->tcp_array too, which means "ctdb gettickles" no longer sees the connection, which is still valid for smbd (PID=1634797) Now back to reproducing the crash problem: 3. The smbclient from step 1. is stopped with "exit" This means the release_ip callback registered for it by ctdbd_register_ips() is still registered, but its private_data "state" is gone. 4. "ctdb moveip 172.31.91.22 1" moves the ip to another node and ctdb sends a message to CTDB_SRVID_RELEASE_IP in order to let smbd (PID=1634797) disconnect/exit The problem is that ctdbd_msg_call_back() calls all registered callback in the order they called register_with_ctdbd(), which means the release_ip function with the stale 'state' is called first and hits the use-after-free. Now we have an explicit ctdbd_unregister_ips() in order to unregister the callback with the stale state and also inform ctdb via CTDB_CONTROL_TCP_CLIENT_DISCONNECTED that the tcp connection is gone now. In the case of multichannel passing from the temporary smbd, we need to use ctdbd_passed_ips() and CTDB_CONTROL_TCP_CLIENT_PASSED in order to remove the connection from client->tcp_list, but keep it alive in vnn->tcp_array. I hope that makes it clear now
This bug was referenced in samba master: 92badd3bdd82d1fa79727efcf81b6f479016811f c6602b686b4e50d93272667ef86d3904181fb1ab 037e8e449deb136ad5ed5e4de05439411b545b6d 240139370aa19f53dd3de0ff468afd994d3bd973 77a559432ffde2d435e29bed126d20a09d33f48e 75aa6693940201a928b46f6880b43820c0e1c555 f3a03f3f774f0795fc1a163f12cccb9cedeebec1 2e784789d78d09dfbc599085e5eb1c70c5b866b8 38b74d4ca9a59e7f12850c20c410f9df26cbad0a 082c7df4d04c2a94c5413c1d6b7eae7be610f950 ddf47e7fe314e0f5bf71ff53e35350e0ba530d08 8fc3872557f715dc38f9898754a785fd073ace96 4b7329f15820f1b4d9a7b7f0947719c4217b312a
Created attachment 18200 [details] Patches for v4-19-test
Created attachment 18201 [details] Patches for v4-18-test
Hi Jule, This is ready for 4.19 and 4.18. Thanks...
Pushed to autobuild-v4-{19,18}-test.
This bug was referenced in samba v4-18-test: 30fddc01431ace1ea345db0fcd96415cf60495f9 f769415799332387ea7b2daa73d47f2f18fbe386 813e7186719d8f0277671918fe00d407ab8d35d3 24d960d02b84a498cfacc5ee2b283d7bd5b2e1f6 562e360ed7ce131e8e6f08d521aee2f21a1e0160 ecc0acbbff21ddae9b8b676993fa8ac20db4b55b 18d34cea2a19232dc30cbe45545885a6406e2be0 6ca3ce4db30cf2d4cd906cf2a695bb928ec9f2ba d9ed96c908cf83fc0765372c4e54919551fe94f5 f8c02609f4807435cbdee1d1433429a549fc981e 279187965b8b2adbd2939bf2a9e587edce04431d ff4ed4d760a61e59d2e23f1e3d956308585b846d a149a96eaf250625b4be89aa1059330bd953a06c
This bug was referenced in samba v4-19-test: 38134f374ca5c76cf636dd9d10ab8e2cf5e4bf1e b6906f37c667677dcac9bcfd6cc182869cb6f569 118d6c81ec9fd19a98d5f674ccff2a57aed81881 e09f92422ebff3454332ebbb28d578d8103e7203 d039fa07f7ee1f0eed6777f812a0ba4913e76ac7 acf080817b538968d6e7cbcd5a037399c9a66c28 e3a4feda112a174eebf9c866032aff899202eed7 8add947b2127e3a2dd2fd515d7701c396733f8e4 d96cb627b66d4c63d0407f4e434b429bb03a47af 43b7068676a00a5169ac6f34e97a1bacf8d0c29d 2640bae75e3abab1a9959b2de14dad29020852de 2e93e358ef9f3d55065cd5e9ad1b03126df06b63 853efb9dd7b7eccebc5bd8a4129f2dc487ad8120
Closing out bug report. Thanks!
This bug was referenced in samba v4-19-stable (Release samba-4.19.4): 38134f374ca5c76cf636dd9d10ab8e2cf5e4bf1e b6906f37c667677dcac9bcfd6cc182869cb6f569 118d6c81ec9fd19a98d5f674ccff2a57aed81881 e09f92422ebff3454332ebbb28d578d8103e7203 d039fa07f7ee1f0eed6777f812a0ba4913e76ac7 acf080817b538968d6e7cbcd5a037399c9a66c28 e3a4feda112a174eebf9c866032aff899202eed7 8add947b2127e3a2dd2fd515d7701c396733f8e4 d96cb627b66d4c63d0407f4e434b429bb03a47af 43b7068676a00a5169ac6f34e97a1bacf8d0c29d 2640bae75e3abab1a9959b2de14dad29020852de 2e93e358ef9f3d55065cd5e9ad1b03126df06b63 853efb9dd7b7eccebc5bd8a4129f2dc487ad8120
This bug was referenced in samba v4-18-stable (Release samba-4.18.10): 30fddc01431ace1ea345db0fcd96415cf60495f9 f769415799332387ea7b2daa73d47f2f18fbe386 813e7186719d8f0277671918fe00d407ab8d35d3 24d960d02b84a498cfacc5ee2b283d7bd5b2e1f6 562e360ed7ce131e8e6f08d521aee2f21a1e0160 ecc0acbbff21ddae9b8b676993fa8ac20db4b55b 18d34cea2a19232dc30cbe45545885a6406e2be0 6ca3ce4db30cf2d4cd906cf2a695bb928ec9f2ba d9ed96c908cf83fc0765372c4e54919551fe94f5 f8c02609f4807435cbdee1d1433429a549fc981e 279187965b8b2adbd2939bf2a9e587edce04431d ff4ed4d760a61e59d2e23f1e3d956308585b846d a149a96eaf250625b4be89aa1059330bd953a06c