Bug 14466 - ctdb disable/enable can fail due to race condition
Summary: ctdb disable/enable can fail due to race condition
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: CTDB (show other bugs)
Version: 4.12.6
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-08-15 07:22 UTC by Martin Schwenke
Modified: 2020-10-07 08:17 UTC (History)
1 user (show)

See Also:


Attachments
Patch for 4.13, 4.12, 4.11 (48.93 KB, patch)
2020-08-19 04:52 UTC, Martin Schwenke
amitay: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Schwenke 2020-08-15 07:22:40 UTC
The recovery daemon's main loop pulls flags from all nodes and pushes them out using update_local_flags().  These flags are saved in the nodemap variable as the recovery master's canonical view of node states.  Later, main loop re-fetches the flags from remote nodes using get_remote_nodemaps() and then conducts a consistency check.  If the flags on the remote node are different then this is "fixed":

                                        DEBUG(DEBUG_ERR,("Use flags 0x%02x from local recmaster node for cluster update of node %d flags\n", nodemap->nodes[i].flags, i));
                                        update_flags_on_all_nodes(
                                                rec,
                                                nodemap->nodes[i].pnn,
                                                nodemap->nodes[i].flags);
                                        ctdb_set_culprit(rec, nodemap->nodes[j].pnn);
                                        do_recovery(rec, mem_ctx, pnn, nodemap, 
                                                    vnnmap);
                                        return;

However, if a node is disabled or enabled (via the ctdb tool) after the call to update_local_flags() but before the call to get_remote_nodemaps() then the recovery master will note the inconsistency and "fix" it.  This can cause the ctdb tool command to fail because the node's state is reverted to its previous state and the tool never find the node in the desired state.

Retrieving flags of a node at 2 different times in main loop and comparing them is wrong.  The flags should only be retrieved once.  If pushing the flags out fails in update_local_flags() then an error is already flagged, so all of the appropriate error handling is in place at this point.
Comment 1 Martin Schwenke 2020-08-19 04:52:05 UTC
Created attachment 16174 [details]
Patch for 4.13, 4.12, 4.11

The patch includes 12 commits for the fix and 10 other recent staging commits that simplify the fix.  BUG tags are only in the last 12 - when I discussed this recently with Karolin she said this is fine because only the last commit really needs a BUG tag.

The same patch applies to v4-13-test, v4-12-test, v4-11-test.

I've run the CTDB testsuite over all of these, including running tests on a virtual cluster.  Looks good.  Although some of the commits have been reworked in the last couple of months to make the fix more obvious, the logic behind the fix has been in my test branch for over 2 years...  and is finally seeing daylight!
Comment 2 Amitay Isaacs 2020-08-20 02:53:05 UTC
Thanks for testing all the branches. We need to consolidate our testing.
Comment 3 Amitay Isaacs 2020-08-20 02:53:56 UTC
Hi Karolin,

This is ready for the branches v4-11, v4-12 and v4-13.

Thanks.
Comment 4 Karolin Seeger 2020-08-25 11:26:42 UTC
(In reply to Amitay Isaacs from comment #3)
Hi Amitay,

pushed to autobuild-v4-{13,12,11}-test.
Comment 5 Karolin Seeger 2020-08-31 08:14:50 UTC
(In reply to Karolin Seeger from comment #4)
Pushed to all branches.
Closing out bug report.

Thanks!
Comment 6 Martin Schwenke 2020-09-26 07:20:20 UTC
A further bug is being fixed here:

https://bugzilla.samba.org/show_bug.cgi?id=14513
Comment 7 Samba QA Contact 2020-10-06 07:17:25 UTC
This bug was referenced in samba v4-11-stable (Release samba-4.11.14):

55216cda60719012109856f0e2d71de510c552c2
9e52bb0c5c613b5e4b141b8268f1e10fe66a9c18
942db2b3d27236506c3a98ce2cc8f4975e01aed2
b0d4ae271decb51485a5e6f82d9004341bf4ba0e
90c0609df4ced66ae4dcc9d486c25d1f30a1efc9
d1f01ff312b74f28975910d6ef936b2ecacb8ff9
238564c74957f5488ebdbc5d779726e83e41b1c7
8faae66a253840e403cd3f66c8fff2ae7aec680c
1956ee1f6b8d03aad37300e318af7b5aea5e11ff
7723e7d23d7d02d281584b9ca74c966a6e3dc4e1
c348d7a58786f34d03737ff89be0d26538a15452
3d7572b40324f0ca0f72a799a2fc94e0077e1fd3
Comment 8 Samba QA Contact 2020-10-07 08:17:52 UTC
This bug was referenced in samba v4-12-stable (Release samba-4.12.8):

828807f52d37ec0d4bc5ba798b4c01b4ef84071d
3b35541c13d9843035ba4bc4bd00aa58b488ba40
bedd92b813c1c30d79f8c16454de12aa1241c845
d360401282d4127b466c79cf024d4c02cd70921f
db34c22ab12e3abc8e29029205ffd8940577f9bd
ad9780853ed3a18993fa0935ea4551716030b8e2
4aae8adc7182687db73c2e40b4a8a24ce5053d16
4c5fde56c309db48a8d6e535564b8effb012c6a9
4820778cfbe1c9077818bd9f6ab507d76157a0ad
99441077417bda37872d7999ad0cd603797ab01b
087f88682099f4b936094bc678f24a4113d1190d
4955925e9125a9a16f588feb94cac9d033037ce8