Bug 14175 - Incoming queue can be orphaned causing communication breakdown
Summary: Incoming queue can be orphaned causing communication breakdown
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: CTDB
Version: 4.9.14
Hardware: All All
Importance: P5 regression
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-29 06:07 UTC by Martin Schwenke
Modified: 2020-02-26 09:52 UTC

See Also:


Attachments
Patch for 4.11, 4.10 and possibly 4.9 (6.28 KB, patch)
2019-11-06 05:55 UTC, Martin Schwenke
no flags
Patch for 4.11, 4.10 and possibly 4.9 (8.07 KB, patch)
2019-11-14 22:22 UTC, Martin Schwenke
amitay: review+

Description Martin Schwenke 2019-10-29 06:07:29 UTC
CTDB's incoming queue handling does not check whether a queue already exists, so it can overwrite the pointer to the existing queue.  This was harmless until commit c68b6f96f26664459187ab2fbd56767fb31767e0 changed the read callback to use a parent structure as the callback data.  On disconnect, instead of cleaning up the orphaned queue as before, the callback now frees the new queue.  The situation cannot be recovered without restarting CTDB on *all* affected nodes, which may be the whole cluster.

The queue can become orphaned when the following sequence occurs:

1. Node A comes up
2. Node A accepts an incoming connection from node B
3. Node B processes a timeout before noticing that the outgoing queue is writable
4. Node B tears down the outgoing connection to node A
5. Node B initiates a new connection to node A
6. Node A accepts the new incoming connection from node B

Node A processes the disconnect of the old incoming connection from (2) but tears down the new incoming connection from (6).
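
The sequence above can be modelled in a few lines.  This is only an illustrative Python sketch, not CTDB code: `Node`, `Queue`, `accept_incoming` and `on_disconnect` are hypothetical names, and "freeing" a queue is represented by a flag.  It assumes, as described above, that the disconnect callback receives the parent structure and therefore acts on whatever queue the pointer currently refers to.

```python
class Queue:
    """Hypothetical stand-in for one incoming connection's queue."""
    def __init__(self, label):
        self.label = label
        self.freed = False


class Node:
    """Hypothetical stand-in for the parent structure holding the queue pointer."""
    def __init__(self):
        self.in_queue = None


def accept_incoming(node, label):
    # Buggy behaviour: no check for an existing queue, so the pointer
    # is overwritten and the previous queue is orphaned.
    queue = Queue(label)
    node.in_queue = queue
    return queue


def on_disconnect(node):
    # The callback data is the parent structure, so this frees whichever
    # queue the pointer refers to now, not the one that disconnected.
    if node.in_queue is not None:
        node.in_queue.freed = True
        node.in_queue = None


node_a = Node()
old = accept_incoming(node_a, "incoming connection from step 2")
new = accept_incoming(node_a, "incoming connection from step 6")

# Node A notices the disconnect of the old connection from step 2.
on_disconnect(node_a)

print("old queue freed:", old.freed)
print("new queue freed:", new.freed)
print("node A has an incoming queue:", node_a.in_queue is not None)
```

In this model the old queue is never freed, the new queue is, and node A ends up with no incoming queue at all, which matches the breakdown described above.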

The problem can occur any time CTDB is started on a node.

The fix is to avoid accepting new incoming connections when a queue for incoming connections is already present.
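
A minimal sketch of that guard, again as an illustrative Python model rather than the actual patch.  `Node`, `accept_incoming` and `on_disconnect` are hypothetical names, the queue is represented by a plain label, and the sketch assumes that the connecting node retries after a rejected connection.

```python
class Node:
    """Hypothetical stand-in for the structure holding the incoming queue."""
    def __init__(self):
        self.in_queue = None


def accept_incoming(node, label):
    """Return True if the connection was accepted, False if it was rejected."""
    if node.in_queue is not None:
        # A queue for incoming connections is already present: reject the
        # new connection instead of overwriting the pointer.
        return False
    node.in_queue = label
    return True


def on_disconnect(node):
    # Only one queue can ever be attached, so this always cleans up
    # the queue that actually disconnected.
    node.in_queue = None


node_a = Node()
first = accept_incoming(node_a, "incoming connection from step 2")
second = accept_incoming(node_a, "incoming connection from step 6")

on_disconnect(node_a)  # the old connection from step 2 goes away
retry = accept_incoming(node_a, "retried connection")

print(first, second, retry, node_a.in_queue)
```

With the guard in place the second connection is refused while the first queue exists, and a later retry succeeds once the old connection has been cleaned up.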
Comment 1 Martin Schwenke 2019-10-30 06:28:43 UTC
When one node is affected, its outgoing connections are also constantly torn down and retried.  This is likely to induce the same bug in the incoming queue of every node it connects to.  After a short amount of time it will probably affect all nodes, so the whole cluster probably needs to be restarted to recover from this bug.
Comment 2 Martin Schwenke 2019-11-06 05:55:45 UTC
Created attachment 15599
Patch for 4.11, 4.10 and possibly 4.9

The patch for v4-11-test also applies cleanly to v4-10-test and v4-9-test, so there is just a single patch.
Comment 3 Martin Schwenke 2019-11-06 05:57:30 UTC
As discussed, the fix for https://bugzilla.samba.org/show_bug.cgi?id=14084 introduced a serious regression.  Given that this was applied to 4.9, it would be glorious if this fix could also be applied to 4.9.
Comment 4 Martin Schwenke 2019-11-14 22:22:46 UTC
Created attachment 15620
Patch for 4.11, 4.10 and possibly 4.9
Comment 5 Amitay Isaacs 2019-11-15 05:52:24 UTC
Hi Karolin,

This is ready for v4-11, v4-10 and hopefully v4-9.

We definitely don't want to see this bug in 4.9.

Thanks.
Comment 6 Karolin Seeger 2019-11-19 09:28:52 UTC
(In reply to Amitay Isaacs from comment #5)
Pushed to autobuild-v4-{11,10,9}-test.
There will be a bugfix release asap.
Comment 7 Chris Cowan 2019-11-22 15:53:28 UTC
I believe this is related to what I've been seeing on AIX, but it has been present since before 4.9.
Comment 8 Martin Schwenke 2019-11-24 09:32:04 UTC
(In reply to Chris Cowan from comment #7)

This exact bug can't occur before 4.9, since it is due to the fix for 

  https://bugzilla.samba.org/show_bug.cgi?id=14084

and that fix only went in as far back as 4.9.

That fix was for issues associated with

  https://bugzilla.samba.org/show_bug.cgi?id=13888

which was also only in versions >=4.9.

Yes, it has been a bit of a saga...  :-(

If you are seeing 2 nodes simply not connecting then it could be due to the 2nd bug above.  That bug existed for a very long time.

Alternatively, if you're seeing other connectivity issues, with many messages logged about connectivity, then I'd love to see the logs.

Thanks...
Comment 9 Karolin Seeger 2019-11-26 12:25:40 UTC
Pushed to all branches.
Closing out bug report.

Thanks!