CTDB's incoming queue handling does not check whether a queue already exists, so it can overwrite the pointer to the existing queue. This was harmless until commit c68b6f96f26664459187ab2fbd56767fb31767e0 changed the read callback to use a parent structure as the callback data. Instead of cleaning up the orphaned queue on disconnect, as before, the callback now frees the new queue. The situation cannot be recovered without restarting CTDB on *all* affected nodes, which may be the whole cluster.

The queue can become orphaned when the following sequence occurs:

1. Node A comes up
2. Node A accepts an incoming connection from node B
3. Node B processes a timeout before noticing that the outgoing queue is writable
4. Node B tears down the outgoing connection to node A
5. Node B initiates a new connection to node A
6. Node A accepts an incoming connection from node B

Node A then processes the disconnect of the old incoming connection from (2) but tears down the new incoming connection from (6).

The problem can occur any time CTDB is started on a node.

The fix is to avoid accepting new incoming connections when a queue for incoming connections is already present.
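For illustration, the guard described above can be sketched roughly as follows. This is a minimal sketch and not the attached patch: the `sketch_*` names and structures are hypothetical stand-ins, not CTDB's real types or functions, and plain `malloc`/`free` is assumed in place of whatever allocator the real code uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-connection incoming queue. */
struct sketch_queue {
    int fd;
};

/* Hypothetical stand-in for the parent structure that owns the queue. */
struct sketch_node {
    struct sketch_queue *in_queue; /* at most one incoming queue per peer */
};

/*
 * Accept an incoming connection only when no incoming queue exists yet.
 * Returns true if the connection was accepted, false if it was rejected.
 */
static bool sketch_accept_incoming(struct sketch_node *node, int fd)
{
    assert(node != NULL);

    if (node->in_queue != NULL) {
        /* Reject instead of overwriting the pointer to the existing queue. */
        return false;
    }

    struct sketch_queue *q = malloc(sizeof(*q));
    if (q == NULL) {
        return false;
    }
    q->fd = fd;
    node->in_queue = q;
    return true;
}

/*
 * Disconnect handling that takes the parent structure as callback data.
 * It frees whatever node->in_queue currently points to, which is why the
 * pointer must never have been overwritten by a second accept.
 */
static void sketch_on_disconnect(struct sketch_node *node)
{
    assert(node != NULL);

    free(node->in_queue);
    node->in_queue = NULL;
}
```

With this guard, the second accept in step (6) is refused while the queue from step (2) still exists, so the later disconnect frees the queue it actually belongs to. Node B is then expected to retry after node A has processed the disconnect.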
When one node is affected, its outgoing connections are also constantly torn down and retried. This is likely to induce the same bug for the incoming queue of every node being connected to. After a short amount of time it will probably affect all nodes, so the whole cluster probably needs to be restarted to recover from this bug.
Created attachment 15599
Patch for 4.11, 4.10 and possibly 4.9

Patch for v4-11-test applies cleanly to v4-10-test and v4-9-test too, so just a single patch.
As discussed, the fix for https://bugzilla.samba.org/show_bug.cgi?id=14084 introduced a serious regression. Given that this was applied to 4.9, it would be glorious if this fix could also be applied to 4.9.
Created attachment 15620
Patch for 4.11, 4.10 and possibly 4.9
Hi Karolin, This is ready for v4-11, v4-10 and hopefully v4-9. We definitely don't want to see this bug in 4.9. Thanks.
(In reply to Amitay Isaacs from comment #5) Pushed to autobuild-v4-{11,10,9}-test. There will be a bugfix release asap.
I believe this is related to what I've been seeing on AIX, but it has been present since before 4.9.
(In reply to Chris Cowan from comment #7)

This exact bug can't occur before 4.9, since it is due to the fix for https://bugzilla.samba.org/show_bug.cgi?id=14084 and that fix only went in as far back as 4.9. That fix was for issues associated with https://bugzilla.samba.org/show_bug.cgi?id=13888 which was also only in versions >=4.9. Yes, it has been a bit of a saga... :-(

If you are seeing 2 nodes simply not connecting then it could be due to the 2nd bug above. That bug existed for a very long time. Alternatively, if you're seeing other connectivity issues, with many messages logged about connectivity, then I'd love to see the logs.

Thanks...
Pushed to all branches. Closing out bug report. Thanks!