Near the beginning of main_loop(), the node map is fetched and the
flags of the current node are saved. Later in main_loop(), on the
leader node, decisions are made about whether a recovery is needed.
Between these times, the state of the leader node may have changed and
it may no longer be a viable leader, perhaps because it is inactive. A
state change affecting the viability of the leader may also be the
reason why recovery is needed.

Recovery sets the dmaster of records in volatile databases to the
leader. If an outgoing, inactive leader runs recovery this results in
inconsistent databases.

Here is an example showing a stopped node running recovery:

  2026-05-21T14:48:00.201838+10:00 node.0 ctdbd[5494]: Stopping node
  2026-05-21T14:48:00.202802+10:00 node.0 ctdbd[5494]: Making node INACTIVE
  2026-05-21T14:48:00.203891+10:00 node.0 ctdbd[5494]: Recovery mode set to ACTIVE
  2026-05-21T14:48:00.204981+10:00 node.0 ctdbd[5494]: Dropping all public IP addresses
  ...
  2026-05-21T14:48:00.240410+10:00 node.0 ctdbd[5494]: Freeze all
  2026-05-21T14:48:00.241286+10:00 node.0 ctdbd[5494]: Freeze db: rec_test.tdb
  2026-05-21T14:48:00.245207+10:00 node.0 ctdb-recoverd[5511]: Node:0 was in recovery mode. Start recovery process
  2026-05-21T14:48:00.245488+10:00 node.0 ctdb-recoverd[5511]: do_recovery: Starting do_recovery
  2026-05-21T14:48:00.245570+10:00 node.0 ctdb-recoverd[5511]: do_recovery: Recovery initiated due to problem with node 0
  2026-05-21T14:48:00.248315+10:00 node.0 ctdb-recoverd[5511]: do_recovery: Recovery - updated flags
  2026-05-21T14:48:00.255559+10:00 node.0 ctdbd[5494]: Connected client with pid:8656
  2026-05-21T14:48:00.258061+10:00 node.0 ctdb-recovery[8656]: Set recovery mode to ACTIVE
  2026-05-21T14:48:00.262878+10:00 node.0 ctdb-recovery[8656]: start_recovery event finished
  2026-05-21T14:48:00.263332+10:00 node.0 ctdb-recovery[8656]: updated VNNMAP
  2026-05-21T14:48:00.263363+10:00 node.0 ctdb-recovery[8656]: recover database 0x92421532
  2026-05-21T14:48:00.288708+10:00 node.0 ctdbd[5494]: ../../server/ctdb_daemon.c:323 Registered message handler for srvid=17294104044079415297
  2026-05-21T14:48:00.301158+10:00 node.0 ctdb-recovery[8656]: Pulled 1 records for db rec_test.tdb from node 1
  2026-05-21T14:48:00.301685+10:00 node.0 ctdbd[5494]: ../../server/ctdb_daemon.c:323 Registered message handler for srvid=17294104044079415298
  2026-05-21T14:48:00.314492+10:00 node.0 ctdb-recovery[8656]: Pulled 1 records for db rec_test.tdb from node 2
  2026-05-21T14:48:00.378047+10:00 node.0 ctdb-recovery[8656]: Pushed 1 records for db rec_test.tdb
  2026-05-21T14:48:00.383227+10:00 node.0 ctdb-recovery[8656]: 1 of 1 databases recovered
  2026-05-21T14:48:00.407982+10:00 node.0 ctdb-recovery[8656]: Set recovery mode to NORMAL
  2026-05-21T14:48:00.411630+10:00 node.0 ctdb-recovery[8656]: recovered event finished
  2026-05-21T14:48:00.411803+10:00 node.0 ctdb-recoverd[5511]: Takeover run starting
  ...
  2026-05-21T14:48:00.446576+10:00 node.0 ctdb-recoverd[5511]: Takeover run completed successfully
  2026-05-21T14:48:00.446784+10:00 node.0 ctdb-recoverd[5511]: do_recovery: Recovery complete
  ...
  2026-05-21T14:48:06.175857+10:00 node.0 ctdb-recoverd[5511]: Leader broadcast timeout
  2026-05-21T14:48:06.176067+10:00 node.0 ctdb-recoverd[5511]: Start election
  2026-05-21T14:48:06.178154+10:00 node.0 ctdbd[5494]: Recovery mode already set to ACTIVE
  2026-05-21T14:48:06.178519+10:00 node.0 ctdbd[5494]: Recovery mode already set to ACTIVE
  2026-05-21T14:48:06.737272+10:00 node.0 ctdb-recoverd[5511]: Received leader broadcast, leader=2

Recovery pulls records from and pushes records to the active nodes
(1, 2). However, the dmaster of all records will be 0.

This doesn't seem to occur often. I can recreate it fairly easily if
I run ctdb/tests/INTEGRATION/database/recovery.003.no_resurrect.sh
under valgrind *and* apply a ctdb tool change that delays when
"ctdb stop" sends CTDB_SRVID_TAKEOVER_RUN. Both only affect timing and
not overall recovery daemon behaviour. However, there is currently
nothing stopping this behaviour.

In the example above, some time later, an election is held, another
node becomes leader and the new leader runs recovery:

  2026-05-21T14:48:06.175072+10:00 node.2 ctdb-recoverd[5704]: Leader broadcast timeout
  2026-05-21T14:48:06.176051+10:00 node.2 ctdb-recoverd[5704]: Start election
  2026-05-21T14:48:06.176913+10:00 node.2 ctdbd[5649]: Recovery mode set to ACTIVE
  2026-05-21T14:48:06.178319+10:00 node.2 ctdbd[5649]: Recovery mode already set to ACTIVE
  2026-05-21T14:48:06.178919+10:00 node.2 ctdbd[5649]: Recovery mode already set to ACTIVE
  2026-05-21T14:48:06.179961+10:00 node.2 ctdb-recoverd[5704]: Attempting to take cluster lock (./tests/var/INTEGRATION/database/shared/.ctdb/cluster.lock)
  2026-05-21T14:48:06.182192+10:00 node.2 ctdb-recoverd[5704]: Set cluster mutex helper to "/home/martins/samba/samba/ctdb/bin/ctdb_mutex_fcntl_helper"
  2026-05-21T14:48:06.202884+10:00 node.2 ctdb-recoverd[5704]: Cluster lock taken successfully
  2026-05-21T14:48:06.203876+10:00 node.2 ctdb-recoverd[5704]: Took cluster lock, leader=2
  2026-05-21T14:48:06.756771+10:00 node.2 ctdb-recoverd[5704]: Remote node 0 had flags 0x20, local had 0x0 - updating local
  2026-05-21T14:48:06.760258+10:00 node.2 ctdb-recoverd[5704]: Pushing updated flags for node 0 (0x20)
  2026-05-21T14:48:06.766434+10:00 node.2 ctdbd[5649]: Node 0 has changed flags - 0x0 -> 0x20
  2026-05-21T14:48:06.777574+10:00 node.2 ctdb-recoverd[5704]: Node:2 was in recovery mode. Start recovery process
  2026-05-21T14:48:06.778277+10:00 node.2 ctdb-recoverd[5704]: Node:1 was in recovery mode. Start recovery process
  2026-05-21T14:48:06.779017+10:00 node.2 ctdb-recoverd[5704]: do_recovery: Starting do_recovery

This makes the affected databases consistent again... and explains
why this hasn't been noticed before.

More details...

During recovery, recbuf_filter_add() sets the dmaster of records in a
volatile database to the leader. If this node is inactive then records
with it as dmaster can't be migrated to other nodes after recovery
completes. So, until another recovery occurs, the databases are
inconsistent and any attempts to fetch records will hang.

In terms of post-recovery distributed database performance, it might
make more sense for recovery to set each record's dmaster to its
lmaster. However, that would cost an additional lmaster (i.e. hash)
calculation for each key. So, setting the dmaster of records to be the
leader might be an important recovery performance optimisation.

This bug was found while testing a "leader resignation" change, which
aims to speed up operations like "ctdb stop" by having an outgoing
leader resign, so other nodes do not have to wait for a leader
broadcast timeout. This change does not require recovery to be run
after an election if there is no other change to the cluster
(e.g. leader capability removed: orderly transfer of power). For a
stopped node, the new leader would run a recovery due to the stopped
node becoming inactive... unless the outgoing leader runs recovery (as
per this bug), which handles the cluster change so that it is no
longer exposed to the new leader. So, with this bug and the leader
resignation change, the databases stay inconsistent until a subsequent
recovery.

Although it is theoretically unnecessary, it would be possible to have
leader resignation force a full election, which would always result in
recovery, but that is a question for another day.

The current behaviour is wrong because a recovery run by an inactive
leader leaves volatile databases (at least temporarily) in an
inconsistent state. So, this needs to be fixed.

In the longer term, CTDB will hopefully become more modular. Elections
and database recovery will happen in different modules. This situation
will have to be carefully handled.
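
For illustration only, the kind of check implied above could look
roughly like the sketch below: before deciding to run recovery, look
at freshly fetched node flags (not the copy saved near the start of
main_loop()) and refuse if the would-be leader is inactive. This is
not the actual change. All sketch_* names and the struct are
hypothetical. The flag values are assumed to mirror CTDB's
NODE_FLAGS_* definitions, with 0x20 matching the stopped flag seen in
the node.2 log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror CTDB's NODE_FLAGS_BANNED/DELETED/STOPPED values */
#define SKETCH_FLAG_BANNED   0x08
#define SKETCH_FLAG_DELETED  0x10
#define SKETCH_FLAG_STOPPED  0x20
#define SKETCH_FLAGS_INACTIVE \
	(SKETCH_FLAG_BANNED | SKETCH_FLAG_DELETED | SKETCH_FLAG_STOPPED)

/* Hypothetical, simplified node map entry */
struct sketch_node {
	uint32_t pnn;
	uint32_t flags;
};

/*
 * Return true only if leader_pnn is present in a freshly fetched node
 * map and has none of the inactive flags set.  An inactive leader
 * must not run recovery, because recovery would make it the dmaster
 * of every record in the volatile databases.
 */
static bool sketch_leader_may_recover(const struct sketch_node *nodes,
				      size_t num_nodes,
				      uint32_t leader_pnn)
{
	size_t i;

	assert(nodes != NULL);

	for (i = 0; i < num_nodes; i++) {
		if (nodes[i].pnn == leader_pnn) {
			return (nodes[i].flags & SKETCH_FLAGS_INACTIVE) == 0;
		}
	}

	/* Leader not in the node map: do not recover */
	return false;
}
```

With node 0 stopped (flags 0x20) and nodes 1 and 2 active, as in the
logs above, this returns false for leader 0 and true for leader 2.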