It seems commit d0baad257e511280ff3e5c7372c38c43df841070 for bug 14175 introduces a regression that results in a recovery loop. The scenario that triggers this seems to be:

- hard power off of a node A
- all other nodes in the cluster fail to free struct ctdb_tcp_node.in_queue
- restart node A and start ctdb
- node A connects to the other nodes, but the other nodes reject the incoming connection with:

Feb 21 13:47:13 somenode ctdbd[302424]: ctdb_listen_event: Incoming queue active, rejecting connection from SOMEIP

struct ctdb_tcp_node.in_queue is only ever freed in the fd readable handler ctdb_tcp_read_cb(), but this never gets called because the TCP stacks on the nodes don't notice that the connection is dead. ctdb sets SO_KEEPALIVE on the socket, but the default timeout for tcp_keepalive_time is 2 hours.

A possible fix might be to free the in_queue in ctdb_tcp_stop_connection(). I have an untested WIP patch, need a bug number.
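Purely as an illustration of the keepalive timing issue (not what the WIP patch does, and not claiming ctdb sets anything beyond plain SO_KEEPALIVE), here is a minimal sketch of how per-socket TCP keepalive timing could be tightened on Linux so a dead peer is noticed in seconds rather than hours:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on fd and tighten the kernel defaults so a dead
 * peer is detected after roughly idle + intvl * cnt = 20 seconds,
 * instead of the ~2 hour tcp_keepalive_time default. */
static int set_fast_keepalive(int fd)
{
	int on = 1;
	int idle = 10;   /* seconds of idle time before the first probe */
	int intvl = 2;   /* seconds between probes */
	int cnt = 5;     /* unanswered probes before the connection is dropped */

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
		return -1;
	}
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0) {
		return -1;
	}
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) != 0) {
		return -1;
	}
	return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}

Even with something like this, the approach discussed in this bug is to fix it at the ctdb level instead, by freeing in_queue when the connection is stopped, rather than waiting on kernel timers.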
Created attachment 15823 [details] WIP patch for master
With this patch, if I build and test from the ctdb/ subdirectory, I do something like this:

$ ./tests/local_daemons.sh foo setup -P /dev/null
$ ./tests/local_daemons.sh foo start all
[wait a while]
$ ./tests/local_daemons.sh foo print-log all | less

I see that this is behaving in some strange pathological way, with looping disconnects. I need to understand why this is...
I think I can see the problem... On node A, freeing tnode->in_queue in ctdb_tcp_stop_connection() means that an existing incoming connection from node B -> A is closed by ctdb_tcp_node_connect() when it is trying to establish the A -> B connection. That causes B's outgoing connection B -> A to be closed. So then B tries to re-establish the B -> A connection, tearing down the connection from A -> B. Rinse and repeat...

Noel's patch at https://gitlab.com/samba-team/samba/-/merge_requests/1177 only tears down the inbound connection once, when the connection goes down, so it doesn't exhibit the same looping behaviour.

Ralph, can you please test Noel's patch? Thanks...
Created attachment 15826 [details] WIP patch for master from Noel

This moves the freeing to the keepalive code, which sounds correct. I'll give it a whirl.
(In reply to Martin Schwenke from comment #3) Yep, my first attempt was exactly the same as Ralph's and it didn't work for me.
Sorry, but this is a regression compared to 4.11, so tagging as such. It doesn't result in complete breakage but does break an important feature. The patch looks good. I'll put it into my test branch and make sure that it survives several runs over the weekend...
Created attachment 15827 [details] More complete patch

Here's an updated patch that avoids spreading the TALLOC_FREE of the incoming queue willy-nilly over the code and instead introduces dedicated helper functions to do the task. I believe we should free the incoming queue in ctdb_tcp_tnode_cb() as well, so we'd already have two places where we'd call ctdb_tcp_stop_connection() plus an explicit TALLOC_FREE(tnode->in_queue). Thoughts?
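For illustration, a minimal sketch of what such dedicated helpers could look like. The struct layout below is a simplified stand-in (the real ctdb_tcp_node carries more state) and the bodies are an assumption based on this discussion, not the actual patch:

#include <talloc.h>

/* Simplified stand-ins for the real CTDB structures. */
struct ctdb_queue;

struct ctdb_tcp_node {
	struct ctdb_queue *in_queue;	/* queue for the incoming connection */
	struct ctdb_queue *out_queue;	/* queue for the outgoing connection */
};

/* Dedicated helper: intended to be the only place that frees the
 * incoming queue, instead of open-coded TALLOC_FREE()s. */
static void ctdb_tcp_stop_incoming(struct ctdb_tcp_node *tnode)
{
	TALLOC_FREE(tnode->in_queue);
}

/* Same idea for the outgoing side. */
static void ctdb_tcp_stop_outgoing(struct ctdb_tcp_node *tnode)
{
	TALLOC_FREE(tnode->out_queue);
}

The point is that TALLOC_FREE(tnode->in_queue) then has exactly one caller, which makes the teardown much easier to reason about.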
I'll put an idea here because a headache is stopping me from thinking about it in detail. We obviously can't bikeshed this one for too long, because it needs to be fixed for 4.12. The places where both connections need to be torn down should probably all be "node dead" cases. For the original one in ctdb_tcp_read_cb(), we could drop the explicit TALLOC_FREE(tnode->in_queue) and depend on node_dead() -> ctdb_tcp_restart() to do it for us. Same for the other case(s): need to understand when data != NULL in ctdb_tcp_tnode_cb(). The only problem is that ctdb_node_dead() doesn't unconditionally call restart()... and if it does then I'm a little concerned about pathological behaviour... Anyway, will try to figure this out by the end of Monday... more input is very welcome! :-)
(In reply to Martin Schwenke from comment #8) Nodes are already marked connected (NODE_FLAGS_DISCONNECTED removed) in the half-connected state, so relying on ctdb_node_dead(), with its dependency on NODE_FLAGS_DISCONNECTED, seems a bit fragile.
(In reply to Ralph Böhme from comment #9) Exactly! Although I'm wondering if it would be sane to re-order the code in ctdb_node_dead() so that it unconditionally tears down the connection(s) first and then changes the state. As long as nothing goes to the event loop then this should be a safe change. My concern is that always tearing down both sides of the connection will result in more of that pathological behaviour, but I'm too tired and headachey to get that deep into it today. My overall goal is to streamline the code so that the connections are only torn down in one place (i.e. no stray TALLOC_FREE()s) and it is easier to reason about. At the moment the changes will mean that ctdb_tcp_read_cb() frees the incoming queue, then calls node_dead(), which calls restart(), which redundantly frees the incoming queue... though I feel that the first free is the redundant one. However, this may not be the time for that... :-)
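To make the re-ordering idea concrete, here is a generic sketch using made-up stand-in types (the real ctdb_node_dead(), its flags and its callbacks look different); it only illustrates "tear down first, then change state":

#include <stdbool.h>

/* Stand-in type, only for illustrating the ordering. */
struct node {
	bool connected;
	void (*restart)(struct node *n);	/* tears down both connections */
};

static void node_dead(struct node *n)
{
	/* Unconditionally tear the transport connections down first... */
	n->restart(n);

	/* ...and only then change the state.  As long as restart() does
	 * not go back to the event loop, nothing can observe the window
	 * between the teardown and the state change. */
	n->connected = false;
}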
(In reply to Martin Schwenke from comment #10) Ah, I was wrong about the half-connected state; 8c98c10f242bc722beffc711e85c0e4f2e74cd57 fixed this. Guess this means we can safely replace the calls to ctdb_tcp_stop_connection() with an upcall to ->node_dead().
Created attachment 15828 [details] Patch for master v4

OK, so what about this one?

* additionally adds ctdb_tcp_stop_incoming_connection()
* lets ctdb_node_dead() restart the transport in the half-connected state
* relies on node_dead() at the transport layer and removes calls to ctdb_tcp_stop_*connection()

Pipeline running at https://gitlab.com/samba-team/samba/-/merge_requests/1180
(In reply to Ralph Böhme from comment #12) Hi, I agree with Martin that node_dead is the place that should be responsible for tearing down both connections, and I also agree with your idea that TALLOC_FREE(in_queue) should be managed in one place and not peppered over the code.

However, I think the way the patch set is ordered is quite confusing, because it mixes a rename of an old function, the addition of 2 new functions and a partial bug fix. Also, in the interim (e.g. by patch 6/9), IMO we have exchanged a situation where TALLOC_FREE(in_queue) is peppered around the code for a situation where ctdb_tcp_stop_connection() is. By the end of the patch set ctdb_tcp_stop_connection() is only called by ctdb_tcp_restart() and ctdb_tcp_stop_incoming_connection() is not called externally at all. To be honest, from Martin's description I guess it is desired to prevent calling these functions at all outside the node_dead/ctdb_tcp_restart handling.

How about something like the following (untested) patch? Further rework to rename and reorganize functions and whatnot can then be done with less churn in follow-on patches, if you think it is necessary.

PS: not a NAK, just my thoughts on an alternative (and less confusing for me) way.
Created attachment 15829 [details] alternative patch
(In reply to Noel Power from comment #14)

> However, I think the way the patch set is ordered is quite confusing, because it mixes a rename of an old function, the addition of 2 new functions and a partial bug fix.

Remember, these are micro commits which are meant to represent minimal, digestible logical changes per commit to get us from A to B.

> Also, in the interim (e.g. by patch 6/9), IMO we have exchanged a situation where TALLOC_FREE(in_queue) is peppered around the code for a situation where ctdb_tcp_stop_connection() is.

That's just the interim while we move from A to B. It can be done differently of course, but I think we need explicit ctdb_tcp_stop_*connection() functions, even if in the end ctdb_tcp_stop_incoming_connection() is only called from ctdb_tcp_stop_connection() and could be made static. What's missing from ctdb_tcp_stop_[outgoing|incoming]_connection() are comments saying that TALLOC_FREE([in|out]_queue) should only be called from these places.

With your minimal change we still have this code:

TALLOC_FREE(tnode->in_queue);
ctdb_tcp_stop_connection(node);

which is exactly what brought us here. But let's see what other version Martin comes up with. :)
Created attachment 15830 [details] WIP patch from Martin

I think I have a nice compromise that does nearly everything Ralph's patch series does, but without some of the churn. :-D

Compared to Noel's patches, I have moved the change to ctdb_node_dead() to the beginning. This makes the series bisectable. I've added a cosmetic patch that is intended to be squashed in with the ctdb_node_dead() one. Other than that, I think the first (real) 3 commits are the same as Noel's latest patch, though I have done some commit message tweaking. Then one redundant call to ctdb_tcp_stop_connection() is dropped via a cut-down version of Ralph's redundancy-removing patch.

Then I basically use Ralph's patches to add ctdb_tcp_stop_outgoing() (shorter name, says as much) and ctdb_tcp_stop_incoming(). The latter is added and used in a single patch because I think it is simple enough. Doing this at the end results in the least churn. I didn't add ctdb_tcp_stop_connection() because it isn't really necessary, since it would only be used once and the code seems nice and clear without it. The only place I added a Signed-off-by: is in the last patch, because I thought it was a good idea to add & use the function at the same time. Happy to drop that...

We *could* backport just the first 2 or 3 commits to get a minimal fix. However, I think that it is worth improving the code while we fix the bug, so I think we should push and backport it all. Obviously, if you think I have committed atrocities with your patches and/or commit messages then please say so and we'll come to another arrangement.

I have smoke tested this with a few scenarios using local_daemons.sh. I'm also running a pipeline at https://gitlab.com/samba-team/devel/samba/pipelines/122223866 and running it through our local regression testing. It would be good if you could both test in your original scenarios.
(In reply to Martin Schwenke from comment #16) LGTM, but I'd keep the squash commit as a separate commit. Pushed to <https://gitlab.com/samba-team/samba/-/merge_requests/1180>. Please push if happy.
(In reply to Ralph Böhme from comment #17) Oops, please wait before pushing. The result from testing is still outstanding. :)
I have tested this by doing:

iptables -I OUTPUT -p tcp -s 127.0.0.2 --dport 4379 -j DROP

Without the fix I see:

==> ./foo/node.0/log.ctdb <==
2020/03/02 12:39:02.258590 ctdbd[1210065]: ctdb_listen_event: Incoming queue active, rejecting connection from 127.0.0.2

==> ./foo/node.1/log.ctdb <==
2020/03/02 12:39:02.258588 ctdbd[1210086]: 127.0.0.2:4379: node 127.0.0.1:4379 is already marked connected: 1 connected
2020/03/02 12:39:02.258755 ctdbd[1210086]: 127.0.0.2:4379: node 127.0.0.1:4379 is dead: 0 connected
2020/03/02 12:39:02.258806 ctdbd[1210086]: Tearing down connection to dead node :0

==> ./foo/node.2/log.ctdb <==
2020/03/02 12:39:02.272307 ctdbd[1210106]: ctdb_listen_event: Incoming queue active, rejecting connection from 127.0.0.2

==> ./foo/node.1/log.ctdb <==
2020/03/02 12:39:02.272452 ctdbd[1210086]: 127.0.0.2:4379: connected to 127.0.0.3:4379 - 1 connected
2020/03/02 12:39:02.272535 ctdbd[1210086]: 127.0.0.2:4379: node 127.0.0.3:4379 is dead: 0 connected
2020/03/02 12:39:02.272567 ctdbd[1210086]: Tearing down connection to dead node :2

With the fix, the node is properly disconnected after ~25s and is banned when it tries to take the recovery lock. If I remove the firewall rule:

iptables -D OUTPUT -p tcp -s 127.0.0.2 --dport 4379 -j DROP

then the node is reconnected and is unbanned when the ban expires. Looks good!
Also waiting for Amitay to take a look...
(In reply to Martin Schwenke from comment #20) Thanks a lot, Martin/Ralph. I'll test the new changes here again with the previously failing HA setup this morning.
(In reply to Noel Power from comment #21) The previously failing HA scenario now works well with the new patches.
Created attachment 15855 [details] New WIP patch from Martin

New WIP patch with simplifications at the end to drop the redundant stop of the outgoing connection. This requires the commit in the middle, called "SQ: Remove redundant restart in ctdb_tcp_tnode_cb()", to be squashed into the previous commit. That took a long time to find, so I think the resulting commit becomes Signed-off-by: Ralph and me.

Initial unsquashed pipeline at: https://gitlab.com/samba-team/devel/samba/pipelines/125132222

Happy for this to be squashed into: https://gitlab.com/samba-team/samba/-/merge_requests/1180
I'm planning to try to add a couple of tests where ctdbd on one node is sent SIGSTOP to simulate a node timing out (thanks Volker!). After all other nodes notice that the node is DISCONNECTED, there are 2 scenarios to play with: SIGCONT (and see what happens) & SIGKILL + restart (and see what happens). In both cases "what happens" should be that all nodes become connected and come out of recovery. However, if people are happy for this to be pushed tomorrow before these tests are available then I'm fine with that. The patches have been looked over by quite a few sets of eyes and my confidence is high.
(In reply to Martin Schwenke from comment #24) Nice job, thanks Martin. I'll test this again with the HA env (I still have it). I don't expect this will change anything; I am pretty confident it will work.
(In reply to Martin Schwenke from comment #24) LGTM. Pushed to https://gitlab.com/samba-team/samba/-/merge_requests/1180. I've fixed the bug URL in Noel's commit and squashed your change into the previous one. Please push if happy.
(In reply to Martin Schwenke from comment #24) Unfortunately using SIGSTOP doesn't keep the node disconnected. Connections to the SIGSTOP-ed daemon time out and are reconnected immediately at the TCP level. This is due to commit 38dc6d11a26c2e9a2cae7927321f2216ceb1c5ec, which reverted to the behaviour of marking nodes connected when they are connected at the TCP level. Not doing this caused other problems, so the revert is necessary. In the current design, we can't have everything... especially these nice, artificial tests. ;-) Right now we don't have a good way of automatically testing this. In the new CTDB design the transport will only mark a node as connected when it has verified the connection by validating a "connect" packet... so this sort of test may be possible in the future.
I've now pushed this to autobuild with minor commit message changes: BUG tags added to the last 2 commits, and the commit message in the squashed patch fixed (with the BUG tag re-added there). I don't think the last round of changes invalidates Noel's previous testing because they're essentially a no-op - just code clean-ups really.
I've marked all but the last patch attached here as obsolete.
Created attachment 15859 [details] Patch for 4.12 cherry-picked from master
Created attachment 15860 [details] Patch for 4.11 backported from master Minimal conflicts due to node.transport_data still being node.private_data in 4.11.
Created attachment 15862 [details] Patch for 4.11 cherry-picked from master (after some preliminary cosmetic cherry-picks)

Hi Ralph,

For 4.11, I'd rather cherry-pick the 2 extra cosmetic patches from master (which are already in 4.12) and then cleanly cherry-pick the patches for this bug on top of those. This will make life easier and less error-prone if we end up having to patch something else in this code. If you have strong objections to this approach then please flip the "obsolete" bits on your 4.11 patch and this one to bring your patch back. :-) Thanks...
Should we also mark the patch for master as obsolete? Or give it a better name? For example, "Close relative of the patch that went in to master". :-)
Comment on attachment 15862 [details] Patch for 4.11 cherry-picked from master (after some preliminary cosmetic cherry-picks) Reassigning to Karolin for inclusion in 4.11 and 4.12.
(In reply to Ralph Böhme from comment #34) Pushed to autobuild-v4-{12,11}-test.
(In reply to Karolin Seeger from comment #35) Pushed to both branches. Closing out bug report. Thanks!
Created attachment 15869 [details] Patch for 4.10
Reopening for inclusion in 4.10.
(In reply to Martin Schwenke from comment #38) Thanks, Martin! Pushed to autobuild-v4-10-test.
(In reply to Karolin Seeger from comment #39) Pushed to v4-10-test. Closing out bug report. Thanks!