Bug 14294 - CTDB recovery corner cases can cause record resurrection and node banning
Summary: CTDB recovery corner cases can cause record resurrection and node banning
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: CTDB (show other bugs)
Version: 4.12.0rc3
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-02-26 07:21 UTC by Martin Schwenke
Modified: 2020-03-31 13:57 UTC (History)
1 user (show)

See Also:


Attachments
Patch for 4.12 (102.35 KB, patch)
2020-03-30 01:21 UTC, Martin Schwenke
amitay: review+
Details
Patch for 4.11 (185.70 KB, patch)
2020-03-30 01:31 UTC, Martin Schwenke
amitay: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Schwenke 2020-02-26 07:21:58 UTC
There are 2 similar but opposite issues.

The first scenario involves stopping a node (i.e. use of "ctdb stop").  If the recovery daemon causes a recovery before the local daemon has been updated to mark the remote node as inactive, then the inactive node can be included in the recovery, which clears the invalid records flag.  When the node is continued, deleted records may be resurrected.  This can be fixed by confirming the flags of remote nodes and dropping inactive nodes from recovery.

The second scenario involves a node becoming active between the recovery daemon creating missing databases and the recovery helper freezing databases (after it has independently fetched the state of nodes).  In this case the recovery helper may attempt to freeze a database that was not created.  The solution is to create missing databases from the recovery helper, using the same set of active nodes that are used for the rest of recovery.
Comment 1 Martin Schwenke 2020-03-30 01:21:57 UTC
Created attachment 15873 [details]
Patch for 4.12
Comment 2 Martin Schwenke 2020-03-30 01:22:40 UTC
Fix cherry-picks cleanly into v4-12-test
Comment 3 Martin Schwenke 2020-03-30 01:31:15 UTC
Created attachment 15874 [details]
Patch for 4.11

Fix cherry-picks cleanly into v4-11-test but does not build because the fix removes old client functions that are still used in v4-11-test.  Additionally, the fix changes the behaviour of one of the client functions ctdb_ctrl_createdb() in a way that is incompatible with the usage in v4-11-test.

The incompatibility exists because master (and v4-12-test) has some vacuuming improvements and some additional changes made to facilitate testing.  Due to the presence of the tests, the vacuuming code is finally tested.

So, rather than backporting this fix and potentially making a mistake, will apply it on top of the vacuuming changes (and one other cosmetic typo fix).  This gives us a well tested and reviewed fix on top of some other previously well tested and reviewed changes.

Unfortunately the vacuuming tests themselves can not be pulled into v4-11-test because there were too many test infrastructure changes made before the vacuuming tests were written.
Comment 4 Amitay Isaacs 2020-03-30 02:06:11 UTC
(In reply to Martin Schwenke from comment #3)

Are all the patches clean cherry-picks?  I thought you had to add a patch for compatibility.  Or was that the other approach?
Comment 5 Martin Schwenke 2020-03-30 02:18:03 UTC
(In reply to Amitay Isaacs from comment #4)

Completely clean.

Adding a patch for compatibility (and resolving the resulting conflicts and dropping the patches that remove the client functions) is only needed if we decide not to do this on top of the vacuuming patches.
Comment 6 Amitay Isaacs 2020-03-30 02:31:28 UTC
Hi Karolin,

This is ready for v4-11 and v4-12.

Thanks.
Comment 7 Karolin Seeger 2020-03-30 08:03:34 UTC
(In reply to Amitay Isaacs from comment #6)
Pushed to autobuild-v4-{12,11}-test.
Comment 8 Karolin Seeger 2020-03-31 13:57:11 UTC
(In reply to Karolin Seeger from comment #7)
Pushed to both branches.
Closing out bug report.

Thanks!