Bug 13888 - CTDB nodes sometimes do not connect to each other
Summary: CTDB nodes sometimes do not connect to each other
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: CTDB
Version: 4.8.9
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
Depends on:
Reported: 2019-04-05 05:32 UTC by Martin Schwenke
Modified: 2019-11-25 19:49 UTC

Patch for 4.9 and 4.10 (1.47 KB, patch)
2019-04-13 01:59 UTC, Martin Schwenke
amitay: review+

Description Martin Schwenke 2019-04-05 05:32:25 UTC
Commit 66919db3d7ab1e091223faf515b183af8bfddc83 puts the onus on the keepalive code to mark nodes as connected.  However, the keepalive code marks a node as connected only when it receives packets from that node.  Keepalives are not sent to disconnected nodes, so two nodes that are disconnected from each other will not send keepalives to each other.  It appears that CTDB currently depends on broadcasts to generate traffic so that nodes are marked as connected to each other.

Depending on relative timings of node startup, nodes will sometimes never become connected to each other.  This is more likely when starting a lot of nodes at the same time (e.g. 256).

It is better to mark the nodes as connected when the transport connects and revert to disconnected if no packets are received.
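
The difference between the two behaviours can be illustrated with a toy model. This is only a sketch and not CTDB code: the `Node` class, its method names and the `simulate` helper are hypothetical, the model has just two nodes, and it deliberately ignores broadcasts so that keepalives are the only traffic.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one cluster node (not a CTDB structure)."""
    pnn: int
    connected: set = field(default_factory=set)   # peers currently considered connected
    rx_count: dict = field(default_factory=dict)  # packets received per peer since last check

    def on_transport_connected(self, peer, mark_on_connect):
        # Proposed behaviour: mark the peer connected as soon as the transport is up.
        if mark_on_connect:
            self.connected.add(peer)

    def on_packet(self, peer):
        # Behaviour described in the report: receiving a packet marks the sender connected.
        self.rx_count[peer] = self.rx_count.get(peer, 0) + 1
        self.connected.add(peer)

    def keepalive_check(self, peers):
        # Revert a peer to disconnected if nothing arrived from it in this interval.
        for peer in peers:
            if self.rx_count.get(peer, 0) == 0:
                self.connected.discard(peer)
            self.rx_count[peer] = 0


def simulate(mark_on_connect, rounds=3):
    """Return True if two nodes end up connected to each other."""
    nodes = {0: Node(0), 1: Node(1)}
    nodes[0].on_transport_connected(1, mark_on_connect)
    nodes[1].on_transport_connected(0, mark_on_connect)
    for _ in range(rounds):
        # Keepalives are sent only to peers that are already marked connected.
        sends = [(n.pnn, p) for n in nodes.values() for p in sorted(n.connected)]
        for src, dst in sends:
            nodes[dst].on_packet(src)
        for n in nodes.values():
            n.keepalive_check([p for p in nodes if p != n.pnn])
    return 1 in nodes[0].connected and 0 in nodes[1].connected


if __name__ == "__main__":
    print(simulate(mark_on_connect=False))  # False: no traffic, so never connected
    print(simulate(mark_on_connect=True))   # True: keepalives flow and sustain the state
```

In this model, with `mark_on_connect=False` neither node ever sends a keepalive, so neither is ever marked connected. With `mark_on_connect=True` the first keepalive round already has targets, and a peer that stays silent is dropped at the next check.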
Comment 1 Martin Schwenke 2019-04-13 01:59:29 UTC
Created attachment 15057
Patch for 4.9 and 4.10
Comment 2 Amitay Isaacs 2019-04-15 07:23:30 UTC
Hi Karolin,

This is ready for v4-9 and v4-10.

Comment 3 Karolin Seeger 2019-04-15 07:29:59 UTC
(In reply to Amitay Isaacs from comment #2)
Pushed to autobuild-v4-{10,9}-test.
Comment 4 Karolin Seeger 2019-04-16 07:10:05 UTC
(In reply to Karolin Seeger from comment #3)
Pushed to both branches.
Closing out bug report.