The recovery daemon's main loop pulls flags from all nodes and pushes them out using update_local_flags(). These flags are saved in the nodemap variable as the recovery master's canonical view of node states. Later, the main loop re-fetches the flags from remote nodes using get_remote_nodemaps() and then performs a consistency check. If the flags on a remote node differ then they are "fixed":

  DEBUG(DEBUG_ERR,("Use flags 0x%02x from local recmaster node for cluster update of node %d flags\n",
                   nodemap->nodes[i].flags, i));
  update_flags_on_all_nodes(rec, nodemap->nodes[i].pnn, nodemap->nodes[i].flags);
  ctdb_set_culprit(rec, nodemap->nodes[j].pnn);
  do_recovery(rec, mem_ctx, pnn, nodemap, vnnmap);
  return;

However, if a node is disabled or enabled (via the ctdb tool) after the call to update_local_flags() but before the call to get_remote_nodemaps() then the recovery master will notice the inconsistency and "fix" it. This can cause the ctdb tool command to fail because the node's state is reverted to its previous state and the tool never finds the node in the desired state.

Retrieving the flags of a node at two different times in the main loop and comparing them is wrong. The flags should only be retrieved once. If pushing the flags out fails in update_local_flags() then an error is already flagged, so all of the appropriate error handling is in place at this point.
Created attachment 16174 [details]
Patch for 4.13, 4.12, 4.11

The patch includes 12 commits for the fix and 10 other recent staging commits that simplify the fix. BUG tags are only in the last 12 - when I discussed this recently with Karolin she said this is fine because only the last commit really needs a BUG tag.

The same patch applies to v4-13-test, v4-12-test, v4-11-test. I've run the CTDB testsuite over all of these, including running tests on a virtual cluster. Looks good.

Although some of the commits have been reworked in the last couple of months to make the fix more obvious, the logic behind the fix has been in my test branch for over 2 years... and is finally seeing daylight!
Thanks for testing all the branches. We need to consolidate our testing.
Hi Karolin, This is ready for the branches v4-11, v4-12 and v4-13. Thanks.
(In reply to Amitay Isaacs from comment #3)

Hi Amitay,

pushed to autobuild-v4-{13,12,11}-test.
(In reply to Karolin Seeger from comment #4) Pushed to all branches. Closing out bug report. Thanks!
A further bug is being fixed here: https://bugzilla.samba.org/show_bug.cgi?id=14513
This bug was referenced in samba v4-11-stable (Release samba-4.11.14):

55216cda60719012109856f0e2d71de510c552c2
9e52bb0c5c613b5e4b141b8268f1e10fe66a9c18
942db2b3d27236506c3a98ce2cc8f4975e01aed2
b0d4ae271decb51485a5e6f82d9004341bf4ba0e
90c0609df4ced66ae4dcc9d486c25d1f30a1efc9
d1f01ff312b74f28975910d6ef936b2ecacb8ff9
238564c74957f5488ebdbc5d779726e83e41b1c7
8faae66a253840e403cd3f66c8fff2ae7aec680c
1956ee1f6b8d03aad37300e318af7b5aea5e11ff
7723e7d23d7d02d281584b9ca74c966a6e3dc4e1
c348d7a58786f34d03737ff89be0d26538a15452
3d7572b40324f0ca0f72a799a2fc94e0077e1fd3
This bug was referenced in samba v4-12-stable (Release samba-4.12.8):

828807f52d37ec0d4bc5ba798b4c01b4ef84071d
3b35541c13d9843035ba4bc4bd00aa58b488ba40
bedd92b813c1c30d79f8c16454de12aa1241c845
d360401282d4127b466c79cf024d4c02cd70921f
db34c22ab12e3abc8e29029205ffd8940577f9bd
ad9780853ed3a18993fa0935ea4551716030b8e2
4aae8adc7182687db73c2e40b4a8a24ce5053d16
4c5fde56c309db48a8d6e535564b8effb012c6a9
4820778cfbe1c9077818bd9f6ab507d76157a0ad
99441077417bda37872d7999ad0cd603797ab01b
087f88682099f4b936094bc678f24a4113d1190d
4955925e9125a9a16f588feb94cac9d033037ce8