Bug 14295 - Starting ctdb node that was powered off hard before results in recovery loop
Summary: Starting ctdb node that was powered off hard before results in recovery loop
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: CTDB
Version: 4.12.0rc4
Hardware: All All
Importance: P3 normal
Target Milestone: 4.12
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-02-26 09:25 UTC by Ralph Böhme
Modified: 2020-03-24 11:18 UTC
CC List: 5 users

See Also:


Attachments

WIP patch for master (1.60 KB, patch) - 2020-02-26 09:27 UTC, Ralph Böhme - no flags
WIP patch for master from Noel (1.17 KB, patch) - 2020-02-28 08:36 UTC, Ralph Böhme - no flags
More complete patch (8.27 KB, patch) - 2020-02-28 11:15 UTC, Ralph Böhme - no flags
Patch for master v4 (15.53 KB, patch) - 2020-02-29 11:40 UTC, Ralph Böhme - no flags
alternative patch (4.44 KB, patch) - 2020-02-29 16:29 UTC, Noel Power - no flags
WIP patch from Martin (12.15 KB, patch) - 2020-03-01 06:10 UTC, Martin Schwenke - no flags
New WIP patch from Martin (17.85 KB, patch) - 2020-03-11 07:21 UTC, Martin Schwenke - no flags
Patch for 4.12 cherry-picked from master (17.90 KB, patch) - 2020-03-13 14:27 UTC, Ralph Böhme - martins: review+
Patch for 4.11 backported from master (17.88 KB, patch) - 2020-03-13 14:28 UTC, Ralph Böhme - no flags
Patch for 4.11 cherry-picked from master (after some preliminary cosmetic cherry-picks) (30.02 KB, patch) - 2020-03-14 02:54 UTC, Martin Schwenke - slow: review+
Patch for 4.10 (30.02 KB, patch) - 2020-03-20 23:32 UTC, Martin Schwenke - martins: review? (amitay), martins: review? (slow)

Description Ralph Böhme 2020-02-26 09:25:35 UTC
It seems commit d0baad257e511280ff3e5c7372c38c43df841070 for bug 14175 introduces a regression that results in a recovery loop.

The scenario that triggers this seems to be:

- hard power-off of node A

- all other nodes in the cluster fail to free
  struct ctdb_tcp_node.in_queue

- restart node A and start ctdb

- node A connects to the other nodes, but the other nodes
  reject the incoming connection with

  Feb 21 13:47:13 somenode ctdbd[302424]: ctdb_listen_event:
  Incoming queue active, rejecting connection from SOMEIP


struct ctdb_tcp_node.in_queue is only ever freed in the fd readable handler ctdb_tcp_read_cb(), but this never gets called because the TCP stacks on the other nodes don't notice that the connection is dead. ctdb sets SO_KEEPALIVE on the socket, but the default timeout for tcp_keepalive_time is 2 hours.
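
For illustration only (generic Linux socket code, not CTDB's, and not what the eventual patch changes): per-socket keepalive timing can be shortened with TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT. CTDB only enables SO_KEEPALIVE, so it inherits the 2-hour system default, which is why the dead peer goes unnoticed here:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustrative helper, not CTDB code: have the kernel declare the peer
 * dead after ~30s idle + 3 probes at 5s intervals instead of ~2 hours. */
static int enable_fast_keepalive(int fd)
{
        int on = 1;
        int idle = 30;      /* seconds idle before the first probe */
        int interval = 5;   /* seconds between probes */
        int count = 3;      /* failed probes before the connection is dropped */

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
                return -1;
        }
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0) {
                return -1;
        }
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval,
                       sizeof(interval)) != 0) {
                return -1;
        }
        return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
}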

A possible fix might be to free the in_queue in ctdb_tcp_stop_connection().

I have an untested WIP patch and need a bug number for it.
Comment 1 Ralph Böhme 2020-02-26 09:27:08 UTC
Created attachment 15823 [details]
WIP patch for master
Comment 2 Martin Schwenke 2020-02-28 06:32:35 UTC
With this patch, if I build and test from the ctdb/ subdirectory, I do something like this:

$ ./tests/local_daemons.sh foo setup -P /dev/null
$ ./tests/local_daemons.sh foo start all
[wait a while]
$ ./tests/local_daemons.sh foo print-log all | less

I see that this is behaving in some strange pathological way, with looping disconnects.  I need to understand why this is...
Comment 3 Martin Schwenke 2020-02-28 07:07:23 UTC
I think I can see the problem...

On node A, freeing tnode->in_queue in ctdb_tcp_stop_connection() means
that an existing incoming connection from node B -> A is closed by
ctdb_tcp_node_connect() when it tries to establish the A -> B connection.
That causes B's outgoing connection B -> A to be closed.  So then B
tries to re-establish the connection B -> A, tearing down the connection
from A -> B.  Rinse and repeat...

Noel's patch at https://gitlab.com/samba-team/samba/-/merge_requests/1177
only tears down the inbound connection once, when the connection goes down,
so it doesn't exhibit the same looping behaviour.

Ralph, can you please test Noel's patch?

Thanks...
Comment 4 Ralph Böhme 2020-02-28 08:36:31 UTC
Created attachment 15826 [details]
WIP patch for master from Noel

This moves the freeing to the keepalive code, which sounds correct. I'll give it a whirl.
Comment 5 Noel Power 2020-02-28 09:13:37 UTC
(In reply to Martin Schwenke from comment #3)
Yep, my first attempt was exactly the same as Ralph's, and it didn't work for me.
Comment 6 Martin Schwenke 2020-02-28 10:55:41 UTC
Sorry, but this is a regression compared to 4.11, so tagging as such.

It doesn't result in complete breakage but does break an important feature.

The patch looks good.   I'll put it into my test branch and make sure that it survives several runs over the weekend...
Comment 7 Ralph Böhme 2020-02-28 11:15:36 UTC
Created attachment 15827 [details]
More complete patch

Here's an updated patch that avoids spreading the TALLOC_FREE of the incoming queue willy-nilly over the code, and instead introduces dedicated helper functions to do the task.

I believe we should free the incoming queue in ctdb_tcp_tnode_cb() as well, so we'd already have two places where we'd call ctdb_tcp_stop_connection() plus an explicit TALLOC_FREE(tnode->incoming).
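
As a rough sketch of that shape (not the actual patch: the struct here is a simplified stand-in for struct ctdb_tcp_node, and only talloc.h's TALLOC_FREE() is real API), the dedicated helpers would look something like this, so each queue is only ever freed in one well-named place:

#include <talloc.h>

/* Simplified stand-in for struct ctdb_tcp_node -- illustration only */
struct tcp_node_sketch {
        void *in_queue;     /* queue for the inbound connection */
        void *out_queue;    /* queue for the outbound connection */
};

/* Tear down only the inbound side, so ctdb_listen_event() will accept a
 * fresh connection from the peer instead of rejecting it. */
static void stop_incoming_connection_sketch(struct tcp_node_sketch *tnode)
{
        TALLOC_FREE(tnode->in_queue);
}

/* Tear down the outbound side; the real helper would also free the
 * connect timer and close the outgoing fd. */
static void stop_outgoing_connection_sketch(struct tcp_node_sketch *tnode)
{
        TALLOC_FREE(tnode->out_queue);
}

The point is only the structure: callers go through these helpers rather than sprinkling TALLOC_FREE() directly, which is what made the original bug easy to introduce.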

Thoughts?
Comment 8 Martin Schwenke 2020-02-29 06:44:35 UTC
I'll put an idea here because a headache is stopping me from thinking about it in detail.  We obviously can't bikeshed this one for too long, because it needs to be fixed for 4.12.

The places where both connections need to be torn down should probably all be "node dead" cases.  For the original one in ctdb_tcp_read_cb(), we could drop
the explicit TALLOC_FREE(tnode->in_queue) and depend on node_dead() -> ctdb_tcp_restart() to do it for us.  The same goes for the other case(s): we need to understand when data != NULL in ctdb_tcp_tnode_cb().

The only problem is that ctdb_node_dead() doesn't unconditionally call restart()... and if it does then I'm a little concerned about pathological behaviour...

Anyway, will try to figure this out by the end of Monday...  more input is very welcome!  :-)
Comment 9 Ralph Böhme 2020-02-29 09:21:26 UTC
(In reply to Martin Schwenke from comment #8)
Nodes are already marked connected (NODE_FLAGS_DISCONNECTED removed) in the half-connected state, so relying on ctdb_node_dead() with its dependency on NODE_FLAGS_DISCONNECTED seems a bit fragile.
Comment 10 Martin Schwenke 2020-02-29 09:41:15 UTC
(In reply to Ralph Böhme from comment #9)
Exactly!

Although I'm wondering if it would be sane to re-order the code in ctdb_node_dead() so that it unconditionally tears down the connection(s) first and then changes the state.  As long as nothing goes to the event loop, this should be a safe change.

My concern is that always tearing down both sides of the connection will result in more of that pathological behaviour, but I'm too tired and headachey to get that deep into it today.

My overall goal is to streamline the code so that the connections are only torn down in one place (i.e. no stray TALLOC_FREE()s) and it is easier to reason about.  At the moment the changes mean that ctdb_tcp_read_cb() frees the incoming queue, then calls node_dead(), which calls restart(), which redundantly frees the incoming queue... though I feel that the first free is the redundant one.

However, this may not be the time for that...  :-)
Comment 11 Ralph Böhme 2020-02-29 11:07:52 UTC
(In reply to Martin Schwenke from comment #10)
Ah, I was wrong about the half-connected state; 8c98c10f242bc722beffc711e85c0e4f2e74cd57 fixed this.

I guess this means we can safely replace the calls to ctdb_tcp_stop_connection() with an upcall to ->node_dead().
Comment 12 Ralph Böhme 2020-02-29 11:40:40 UTC
Created attachment 15828 [details]
Patch for master v4

OK, so what about this one?

* additionally adds ctdb_tcp_stop_incoming_connection()

* lets ctdb_node_dead() restart the transport in the half-connected state

* relies on node_dead() at the transport layer and removes the calls to ctdb_tcp_stop_*connection()

Pipeline running at https://gitlab.com/samba-team/samba/-/merge_requests/1180
Comment 13 Noel Power 2020-02-29 16:28:49 UTC
(In reply to Ralph Böhme from comment #12)

Hi, I agree with Martin that node_dead is the place that should be responsible for tearing down both connections, and I also agree with your idea that TALLOC_FREE(in_queue) should be managed in one place and not peppered over the code.

However, I think the way the patch set is ordered is quite confusing, because it mixes a rename of an old function, the addition of 2 new functions, and a partial bug fix, all intermingled.

Also, in the interim (e.g. by patch 6/9), IMO we have exchanged a situation where TALLOC_FREE(in_queue) is peppered around the code for one where ctdb_tcp_stop_connection() is.

By the end of the patch set ctdb_tcp_stop_connection() is only called by ctdb_tcp_restart(), and ctdb_tcp_stop_incoming_connection() is not called externally at all.
To be honest, from Martin's descriptions I guess the intent is to prevent calling these functions at all outside the node_dead/ctdb_tcp_restart handling.

How about something like the following (untested) patch?

Further rework to rename and reorganize functions and whatnot can then be done with follow-on patches, with less churn, if you think it is necessary.

PS: not a NAK, just my thoughts on an alternative (and, for me, less confusing) way.
Comment 14 Noel Power 2020-02-29 16:29:38 UTC
Created attachment 15829 [details]
alternative patch
Comment 15 Ralph Böhme 2020-02-29 18:00:25 UTC
(In reply to Noel Power from comment #14)
>However, I think the way the patch set is ordered is quite confusing, because it mixes a rename of an old function, the addition of 2 new functions, and a partial bug fix, all intermingled

Remember, these are micro-commits which are meant to represent minimal, digestible logical changes per commit to get us from A to B.

>Also, in the interim (e.g. by patch 6/9), IMO we have exchanged a situation where TALLOC_FREE(in_queue) is peppered around the code for one where ctdb_tcp_stop_connection() is.

That's just the interim while we move from A to B. It can be done differently of course, but I think we need explicit ctdb_tcp_stop_*connection() functions, even if in the end ctdb_tcp_stop_incoming_connection() is only called from ctdb_tcp_stop_connection() and could be made static.

What's missing in ctdb_tcp_stop_[outgoing|incoming]_connection are comments saying that TALLOC_FREE([in|out]_queue) should only be called from these places.

With your minimal change we still have this code

 	TALLOC_FREE(tnode->in_queue);
 	ctdb_tcp_stop_connection(node);

which is exactly what brought us here.

But let's see what other version Martin comes up with. :)
Comment 16 Martin Schwenke 2020-03-01 06:10:19 UTC
Created attachment 15830 [details]
WIP patch from Martin

I think I have a nice compromise that does nearly everything Ralph's patch series does, but without some of the churn.  :-D

Compared to Noel's patches, I have moved the change to ctdb_node_dead() to the beginning.  This makes the series bisectable.  I've added a cosmetic patch that is intended to be squashed in with the ctdb_node_dead() one.

Other than that, I think the first (real) 3 commits are the same as Noel's latest patch, though I have done some commit message tweaking.

Then one redundant call to ctdb_tcp_stop_connection() is dropped via a cut-down version of Ralph's redundancy-removing patch.

Then, I basically use Ralph's patches to add ctdb_tcp_stop_outgoing() (shorter name, says as much) and ctdb_tcp_stop_incoming().  The latter is added and used in a single patch because I think it is simple enough.  Doing this at the end results in the least churn.

I didn't add ctdb_tcp_stop_connection() because it isn't really necessary, since it would only be used once and the code seems nice and clear without it.

The only place I added a Signed-off-by: is in the last patch because I thought it was a good idea to add & use the function at the same time.  Happy to drop that...

We *could* backport just the first 2 or 3 commits to get a minimal fix.  However, I think that it is worth improving the code while we fix the bug, so I think we should push and backport it all.

Obviously, if you think I have committed atrocities with your patches and/or commit messages then please say so and we'll come to another arrangement.

I have smoke tested this with a few scenarios using local_daemons.sh.  I'm also running a pipeline at:

  https://gitlab.com/samba-team/devel/samba/pipelines/122223866

and running it through our local regression testing.

It would be good if you could both test in your original scenarios.
Comment 17 Ralph Böhme 2020-03-01 11:17:15 UTC
(In reply to Martin Schwenke from comment #16)
LGTM, but I'd keep the squash commit as a separate commit.

Pushed to <https://gitlab.com/samba-team/samba/-/merge_requests/1180>.

Please push if happy.
Comment 18 Ralph Böhme 2020-03-01 11:20:36 UTC
(In reply to Ralph Böhme from comment #17)
Oops, please wait before pushing. The result from testing is still outstanding. :)
Comment 19 Martin Schwenke 2020-03-02 02:25:55 UTC
I have tested this by doing:

  iptables -I OUTPUT -p tcp -s 127.0.0.2 --dport 4379 -j DROP

Without the fix I see:

==> ./foo/node.0/log.ctdb <==
2020/03/02 12:39:02.258590 ctdbd[1210065]: ctdb_listen_event: Incoming queue active, rejecting connection from 127.0.0.2

==> ./foo/node.1/log.ctdb <==
2020/03/02 12:39:02.258588 ctdbd[1210086]: 127.0.0.2:4379: node 127.0.0.1:4379 is already marked connected: 1 connected
2020/03/02 12:39:02.258755 ctdbd[1210086]: 127.0.0.2:4379: node 127.0.0.1:4379 is dead: 0 connected
2020/03/02 12:39:02.258806 ctdbd[1210086]: Tearing down connection to dead node :0

==> ./foo/node.2/log.ctdb <==
2020/03/02 12:39:02.272307 ctdbd[1210106]: ctdb_listen_event: Incoming queue active, rejecting connection from 127.0.0.2

==> ./foo/node.1/log.ctdb <==
2020/03/02 12:39:02.272452 ctdbd[1210086]: 127.0.0.2:4379: connected to 127.0.0.3:4379 - 1 connected
2020/03/02 12:39:02.272535 ctdbd[1210086]: 127.0.0.2:4379: node 127.0.0.3:4379 is dead: 0 connected
2020/03/02 12:39:02.272567 ctdbd[1210086]: Tearing down connection to dead node :2

With the fix, the node is properly disconnected after ~25s and is banned when it tries to take the recovery lock.  If I remove the firewall rule:

  iptables -D OUTPUT -p tcp -s 127.0.0.2 --dport 4379 -j DROP

then the node is reconnected and is unbanned when the ban expires.

Looks good!
Comment 20 Martin Schwenke 2020-03-02 04:47:45 UTC
Also waiting for Amitay to take a look...
Comment 21 Noel Power 2020-03-02 07:49:45 UTC
(In reply to Martin Schwenke from comment #20)
Thanks a lot, Martin and Ralph. I'll test the new changes here again with the previously failing HA setup this morning.
Comment 22 Noel Power 2020-03-02 14:55:53 UTC
(In reply to Noel Power from comment #21)
The previously failing HA scenario now works well with the new patches.
Comment 23 Martin Schwenke 2020-03-11 07:21:17 UTC
Created attachment 15855 [details]
New WIP patch from Martin

New WIP patch with simplifications at the end to drop the redundant stop of the outgoing connection.  This requires the commit in the middle, called:

  SQ: Remove redundant restart in ctdb_tcp_tnode_cb()

to be squashed into the previous commit.  That took a long time to find, so I think the resulting commit becomes Signed-off-by: both Ralph and me.

Initial unsquashed pipeline at:

  https://gitlab.com/samba-team/devel/samba/pipelines/125132222

Happy for this to be squashed into:

  https://gitlab.com/samba-team/samba/-/merge_requests/1180
Comment 24 Martin Schwenke 2020-03-11 08:29:40 UTC
I'm planning to try to add a couple of tests where ctdbd on one node is sent SIGSTOP to simulate a node timing out (thanks, Volker!).  After all the other nodes notice that the node is DISCONNECTED, there are 2 scenarios to play with: SIGCONT (and see what happens) & SIGKILL + restart (and see what happens).  In both cases "what happens" should be that all nodes become connected and come out of recovery.

However, if people are happy for this to be pushed tomorrow before these tests are available then I'm fine with that.  The patches have been looked over by quite a few sets of eyes and my confidence is high.
Comment 25 Noel Power 2020-03-11 10:17:35 UTC
(In reply to Martin Schwenke from comment #24)
Nice job, thanks Martin. I'll test this again with the HA env (I still have it). I don't expect this will change anything; I am pretty confident it will work.
Comment 26 Ralph Böhme 2020-03-11 11:32:25 UTC
(In reply to Martin Schwenke from comment #24)
Lgtm. Pushed to https://gitlab.com/samba-team/samba/-/merge_requests/1180.

I've fixed the bug URL in Noel's commit and squashed your change into the previous one.

Please push if happy.
Comment 27 Martin Schwenke 2020-03-12 00:47:55 UTC
(In reply to Martin Schwenke from comment #24)

Unfortunately using SIGSTOP doesn't keep the node disconnected.  Connections to the SIGSTOP-ed daemon time out and are reconnected immediately at the TCP level.  This is due to commit 38dc6d11a26c2e9a2cae7927321f2216ceb1c5ec, which reverted behaviour to mark nodes connected when they are connected at the TCP level.  Not doing this caused other problems, so the revert is necessary.  In the current design, we can't have everything... especially these nice, artificial tests.  ;-)

Right now we don't have a good way of automatically testing this.

In the new CTDB design the transport will only mark a node as connected when it has verified the connection by validating a "connect" packet... so this sort of test may be possible in the future.
Comment 28 Martin Schwenke 2020-03-12 03:48:26 UTC
I've now pushed this to autobuild with minor commit message changes: BUG tags in the last 2 commits and fixed the commit message in the squashed patch (and re-added BUG tag there).

I don't think the last round of changes invalidates Noel's previous testing, because they're essentially a no-op - just code clean-ups, really.
Comment 29 Martin Schwenke 2020-03-12 05:51:26 UTC
I've marked all but the last patch attached here as obsolete.
Comment 30 Ralph Böhme 2020-03-13 14:27:10 UTC
Created attachment 15859 [details]
Patch for 4.12 cherry-picked from master
Comment 31 Ralph Böhme 2020-03-13 14:28:22 UTC
Created attachment 15860 [details]
Patch for 4.11 backported from master

Minimal conflicts due to node.transport_data still being node.private_data in 4.11.
Comment 32 Martin Schwenke 2020-03-14 02:54:39 UTC
Created attachment 15862 [details]
Patch for 4.11 cherry-picked from master (after some preliminary cosmetic cherry-picks)

Hi Ralph,

For 4.11, I'd rather cherry-pick the 2 extra cosmetic patches from master (which are already in 4.12) and then cleanly cherry-pick the patches for this bug on top of those.  This will make life easier and less error prone if we end up having to patch something else in this code.

If you have strong objections to this approach then please flip the "obsolete" bits on your 4.11 patch and this one to bring your patch back.  :-)

Thanks...
Comment 33 Martin Schwenke 2020-03-14 02:55:26 UTC
Should we also mark the patch for master as obsolete?  Or give it a better name?  For example, "Close relative of the patch that went into master".  :-)
Comment 34 Ralph Böhme 2020-03-16 13:13:14 UTC
Comment on attachment 15862 [details]
Patch for 4.11 cherry-picked from master (after some preliminary cosmetic cherry-picks)

Reassigning to Karolin for inclusion in 4.11 and 4.12.
Comment 35 Karolin Seeger 2020-03-19 09:37:05 UTC
(In reply to Ralph Böhme from comment #34)
Pushed to autobuild-v4-{12,11}-test.
Comment 36 Karolin Seeger 2020-03-20 07:35:54 UTC
(In reply to Karolin Seeger from comment #35)
Pushed to both branches.
Closing out bug report.

Thanks!
Comment 37 Martin Schwenke 2020-03-20 23:32:59 UTC
Created attachment 15869 [details]
Patch for 4.10
Comment 38 Martin Schwenke 2020-03-20 23:33:59 UTC
Reopening for inclusion in 4.10.
Comment 39 Karolin Seeger 2020-03-23 07:02:43 UTC
(In reply to Martin Schwenke from comment #38)
Thanks, Martin!
Pushed to autobuild-v4-10-test.
Comment 40 Karolin Seeger 2020-03-24 11:18:51 UTC
(In reply to Karolin Seeger from comment #39)
Pushed to v4-10-test.
Closing out bug report.

Thanks!