Commit 66919db3d7ab1e091223faf515b183af8bfddc83 puts the onus on the keepalive code to mark nodes as connected. However, the keepalive code only marks a node as connected when it receives packets from that node. Keepalives are not sent to disconnected nodes, so two nodes that are disconnected from each other will never send keepalives to each other. It appears that CTDB currently depends on broadcasts to generate traffic so that nodes are marked as connected to each other. Depending on the relative timing of node startup, nodes will sometimes never become connected to each other. This is more likely when starting a lot of nodes at the same time (e.g. 256). It is better to mark a node as connected when the transport connects and revert it to disconnected if no packets are received.
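The proposed behaviour can be sketched roughly as follows. This is a simplified Python model, not CTDB code: the `Node` class, the function names, and the `KEEPALIVE_LIMIT` value are all hypothetical and chosen only for illustration. It shows a node being marked connected as soon as the transport connects, and reverting to disconnected after too many keepalive intervals with no packets received.

```python
from dataclasses import dataclass

KEEPALIVE_LIMIT = 5  # assumed number of silent intervals tolerated (illustrative)


@dataclass
class Node:
    pnn: int
    connected: bool = False
    rx_count: int = 0      # packets received since the last keepalive tick
    silent_ticks: int = 0  # consecutive ticks with no packets received


def on_transport_connected(node):
    """Proposed behaviour: the transport connecting marks the node connected."""
    node.connected = True
    node.rx_count = 0
    node.silent_ticks = 0


def on_packet_received(node):
    """Record traffic from the node so the next tick sees it as alive."""
    node.rx_count += 1


def keepalive_tick(nodes, limit=KEEPALIVE_LIMIT):
    """Run one keepalive interval; return the pnns a keepalive was sent to."""
    sent = []
    for node in nodes:
        if not node.connected:
            continue  # keepalives are never sent to disconnected nodes
        if node.rx_count == 0:
            node.silent_ticks += 1
            if node.silent_ticks >= limit:
                node.connected = False  # revert: nothing heard for too long
                continue
        else:
            node.silent_ticks = 0
        node.rx_count = 0
        sent.append(node.pnn)
    return sent
```

In this model a node that was never marked connected gets no keepalives at all, which mirrors the deadlock described above; calling `on_transport_connected()` breaks it without relying on broadcast traffic.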
Created attachment 15057: Patch for 4.9 and 4.10
Hi Karolin, This is ready for v4-9 and v4-10. Thanks.
(In reply to Amitay Isaacs from comment #2) Pushed to autobuild-v4-{10,9}-test.
(In reply to Karolin Seeger from comment #3) Pushed to both branches. Closing out bug report. Thanks!