Bug 16082 - An inactive node can run recovery resulting in inconsistent databases
Status: ASSIGNED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: CTDB
Version: 4.24.2
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Martin Schwenke
QA Contact: Samba QA Contact
Reported: 2026-05-25 06:09 UTC by Martin Schwenke
Modified: 2026-05-25 06:10 UTC

Description Martin Schwenke 2026-05-25 06:09:59 UTC
Near the beginning of main_loop(), the node map is fetched and the
flags of the current node are saved.  Later in main_loop(), on the
leader node, decisions are made about whether a recovery is needed.
Between these times, the state of the leader node may have changed and
it may no longer be a viable leader, perhaps because it is inactive.
A state change affecting the viability of the leader may also be the
reason why recovery is needed.

Recovery sets the dmaster of records in volatile databases to the
leader.  If an outgoing, inactive leader runs recovery this results in
inconsistent databases.
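
As an illustration only, the window described above could be closed by
re-checking the leader's own flags against a freshly fetched node map
immediately before deciding to run recovery.  This is a sketch, not
the fix for this bug.  Every name below (sketch_node,
sketch_may_run_recovery(), the SKETCH_* flags) is hypothetical and
does not exist in CTDB.  Only the value 0x20 for a stopped node is
taken from the logs in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch.  0x20 matches the stopped
 * node flags seen in the logs in this report; 0x08 is an assumption. */
#define SKETCH_FLAG_BANNED    0x08u
#define SKETCH_FLAG_STOPPED   0x20u
#define SKETCH_FLAGS_INACTIVE (SKETCH_FLAG_BANNED | SKETCH_FLAG_STOPPED)

/* Hypothetical, simplified node map entry */
struct sketch_node {
	uint32_t pnn;
	uint32_t flags;
};

/* Return true if node pnn is in the node map and is not inactive */
static bool sketch_node_is_active(const struct sketch_node *nodemap,
				  size_t num_nodes,
				  uint32_t pnn)
{
	size_t i;

	assert(nodemap != NULL);

	for (i = 0; i < num_nodes; i++) {
		if (nodemap[i].pnn == pnn) {
			return (nodemap[i].flags & SKETCH_FLAGS_INACTIVE) == 0;
		}
	}

	/* Unknown node: treat as not viable */
	return false;
}

/*
 * Decide whether this (leader) node may run recovery.  The node map
 * passed in is expected to be fetched immediately before this call,
 * not reused from the top of the loop.
 */
static bool sketch_may_run_recovery(bool recovery_needed,
				    const struct sketch_node *fresh_nodemap,
				    size_t num_nodes,
				    uint32_t my_pnn)
{
	if (!recovery_needed) {
		return false;
	}

	/* An inactive leader must leave recovery to its successor */
	return sketch_node_is_active(fresh_nodemap, num_nodes, my_pnn);
}
```

With a node map in which node 0 has flags 0x20, a call for node 0
returns false even when recovery is needed, while a call for node 2
returns true.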

Here is an example showing a stopped node running recovery:

  2026-05-21T14:48:00.201838+10:00 node.0 ctdbd[5494]: Stopping node
  2026-05-21T14:48:00.202802+10:00 node.0 ctdbd[5494]: Making node INACTIVE
  2026-05-21T14:48:00.203891+10:00 node.0 ctdbd[5494]: Recovery mode set to ACTIVE
  2026-05-21T14:48:00.204981+10:00 node.0 ctdbd[5494]: Dropping all public IP addresses
  ...
  2026-05-21T14:48:00.240410+10:00 node.0 ctdbd[5494]: Freeze all
  2026-05-21T14:48:00.241286+10:00 node.0 ctdbd[5494]: Freeze db: rec_test.tdb
  2026-05-21T14:48:00.245207+10:00 node.0 ctdb-recoverd[5511]: Node:0 was in recovery mode. Start recovery process
  2026-05-21T14:48:00.245488+10:00 node.0 ctdb-recoverd[5511]: do_recovery: Starting do_recovery
  2026-05-21T14:48:00.245570+10:00 node.0 ctdb-recoverd[5511]: do_recovery: Recovery initiated due to problem with node 0
  2026-05-21T14:48:00.248315+10:00 node.0 ctdb-recoverd[5511]: do_recovery: Recovery - updated flags
  2026-05-21T14:48:00.255559+10:00 node.0 ctdbd[5494]: Connected client with pid:8656
  2026-05-21T14:48:00.258061+10:00 node.0 ctdb-recovery[8656]: Set recovery mode to ACTIVE
  2026-05-21T14:48:00.262878+10:00 node.0 ctdb-recovery[8656]: start_recovery event finished
  2026-05-21T14:48:00.263332+10:00 node.0 ctdb-recovery[8656]: updated VNNMAP
  2026-05-21T14:48:00.263363+10:00 node.0 ctdb-recovery[8656]: recover database 0x92421532
  2026-05-21T14:48:00.288708+10:00 node.0 ctdbd[5494]: ../../server/ctdb_daemon.c:323 Registered message handler for srvid=17294104044079415297
  2026-05-21T14:48:00.301158+10:00 node.0 ctdb-recovery[8656]: Pulled 1 records for db rec_test.tdb from node 1
  2026-05-21T14:48:00.301685+10:00 node.0 ctdbd[5494]: ../../server/ctdb_daemon.c:323 Registered message handler for srvid=17294104044079415298
  2026-05-21T14:48:00.314492+10:00 node.0 ctdb-recovery[8656]: Pulled 1 records for db rec_test.tdb from node 2
  2026-05-21T14:48:00.378047+10:00 node.0 ctdb-recovery[8656]: Pushed 1 records for db rec_test.tdb
  2026-05-21T14:48:00.383227+10:00 node.0 ctdb-recovery[8656]: 1 of 1 databases recovered
  2026-05-21T14:48:00.407982+10:00 node.0 ctdb-recovery[8656]: Set recovery mode to NORMAL
  2026-05-21T14:48:00.411630+10:00 node.0 ctdb-recovery[8656]: recovered event finished
  2026-05-21T14:48:00.411803+10:00 node.0 ctdb-recoverd[5511]: Takeover run starting
  ...
  2026-05-21T14:48:00.446576+10:00 node.0 ctdb-recoverd[5511]: Takeover run completed successfully
  2026-05-21T14:48:00.446784+10:00 node.0 ctdb-recoverd[5511]: do_recovery: Recovery complete
  ...
  2026-05-21T14:48:06.175857+10:00 node.0 ctdb-recoverd[5511]: Leader broadcast timeout
  2026-05-21T14:48:06.176067+10:00 node.0 ctdb-recoverd[5511]: Start election
  2026-05-21T14:48:06.178154+10:00 node.0 ctdbd[5494]: Recovery mode already set to ACTIVE
  2026-05-21T14:48:06.178519+10:00 node.0 ctdbd[5494]: Recovery mode already set to ACTIVE
  2026-05-21T14:48:06.737272+10:00 node.0 ctdb-recoverd[5511]: Received leader broadcast, leader=2

Recovery pulls records from and pushes records to the active nodes (1,
2).  However, the dmaster of all records will be node 0, which is
inactive.

This doesn't seem to occur often.  I can recreate it fairly easily if
I run ctdb/tests/INTEGRATION/database/recovery.003.no_resurrect.sh
under valgrind *and* apply a ctdb tool change that delays when "ctdb
stop" sends CTDB_SRVID_TAKEOVER_RUN.  Both affect only timing, not
overall recovery daemon behaviour.  So, although this is rare, nothing
currently prevents an inactive leader from running recovery.

In the example above, some time later, an election is held, another
node becomes leader and the new leader runs recovery:

  2026-05-21T14:48:06.175072+10:00 node.2 ctdb-recoverd[5704]: Leader broadcast timeout
  2026-05-21T14:48:06.176051+10:00 node.2 ctdb-recoverd[5704]: Start election
  2026-05-21T14:48:06.176913+10:00 node.2 ctdbd[5649]: Recovery mode set to ACTIVE
  2026-05-21T14:48:06.178319+10:00 node.2 ctdbd[5649]: Recovery mode already set to ACTIVE
  2026-05-21T14:48:06.178919+10:00 node.2 ctdbd[5649]: Recovery mode already set to ACTIVE
  2026-05-21T14:48:06.179961+10:00 node.2 ctdb-recoverd[5704]: Attempting to take cluster lock (./tests/var/INTEGRATION/database/shared/.ctdb/cluster.lock)
  2026-05-21T14:48:06.182192+10:00 node.2 ctdb-recoverd[5704]: Set cluster mutex helper to "/home/martins/samba/samba/ctdb/bin/ctdb_mutex_fcntl_helper"
  2026-05-21T14:48:06.202884+10:00 node.2 ctdb-recoverd[5704]: Cluster lock taken successfully
  2026-05-21T14:48:06.203876+10:00 node.2 ctdb-recoverd[5704]: Took cluster lock, leader=2
  2026-05-21T14:48:06.756771+10:00 node.2 ctdb-recoverd[5704]: Remote node 0 had flags 0x20, local had 0x0 - updating local
  2026-05-21T14:48:06.760258+10:00 node.2 ctdb-recoverd[5704]: Pushing updated flags for node 0 (0x20)
  2026-05-21T14:48:06.766434+10:00 node.2 ctdbd[5649]: Node 0 has changed flags - 0x0 -> 0x20
  2026-05-21T14:48:06.777574+10:00 node.2 ctdb-recoverd[5704]: Node:2 was in recovery mode. Start recovery process
  2026-05-21T14:48:06.778277+10:00 node.2 ctdb-recoverd[5704]: Node:1 was in recovery mode. Start recovery process
  2026-05-21T14:48:06.779017+10:00 node.2 ctdb-recoverd[5704]: do_recovery: Starting do_recovery

This second recovery makes the affected databases consistent again...
and explains why this hasn't been noticed before.

More details...

During recovery, recbuf_filter_add() sets the dmaster of records in a
volatile database to the leader.  If this node is inactive then
records with it as dmaster can't be migrated to other nodes after
recovery completes.  So, until another recovery occurs, the databases
are inconsistent and any attempts to fetch records will hang.
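
To make the failure mode concrete, here is a toy model of the effect.
It is not the code in recbuf_filter_add(): sketch_record and both
functions are hypothetical and reduce a record to just its dmaster
field.  It shows that once every record has been stamped with an
inactive leader, no record has a dmaster among the active nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: only the dmaster field matters here */
struct sketch_record {
	uint32_t dmaster;
};

/* Model of recovery: every record's dmaster becomes the leader */
static void sketch_recovery_set_dmaster(struct sketch_record *recs,
					size_t num_recs,
					uint32_t leader)
{
	size_t i;

	assert(recs != NULL || num_recs == 0);

	for (i = 0; i < num_recs; i++) {
		recs[i].dmaster = leader;
	}
}

/* Return true if pnn appears in the list of active nodes */
static bool sketch_pnn_is_active(const uint32_t *active,
				 size_t num_active,
				 uint32_t pnn)
{
	size_t i;

	for (i = 0; i < num_active; i++) {
		if (active[i] == pnn) {
			return true;
		}
	}

	return false;
}

/* Count records whose dmaster is not an active node */
static size_t sketch_count_stranded(const struct sketch_record *recs,
				    size_t num_recs,
				    const uint32_t *active,
				    size_t num_active)
{
	size_t i;
	size_t count = 0;

	for (i = 0; i < num_recs; i++) {
		if (!sketch_pnn_is_active(active, num_active,
					  recs[i].dmaster)) {
			count++;
		}
	}

	return count;
}
```

With active nodes 1 and 2, stamping the records with leader 0 strands
all of them, whereas stamping them with leader 2 strands none.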

In terms of post-recovery distributed database performance, it might
make more sense for recovery to set each record's dmaster to its
lmaster.  However, that would cost an additional lmaster (i.e. hash)
calculation for each key.  So, setting the dmaster of records to be
the leader might be an important recovery performance optimisation.
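
The extra per-key cost can be sketched as follows.  This assumes that
the lmaster is chosen by hashing the key and indexing into the VNN map
with the result.  That mapping is an assumption made for illustration,
FNV-1a merely stands in for whatever hash CTDB really uses, and
sketch_hash() and sketch_lmaster() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in hash (32-bit FNV-1a); not the hash used by CTDB */
static uint32_t sketch_hash(const uint8_t *key, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= key[i];
		h *= 16777619u;
	}

	return h;
}

/*
 * Assumed mapping: hash the key, then index into the VNN map.  The
 * hash touches every byte of every key, which is the additional work
 * that setting dmaster to the leader avoids.
 */
static uint32_t sketch_lmaster(const uint32_t *vnnmap,
			       size_t vnnmap_size,
			       const uint8_t *key,
			       size_t len)
{
	assert(vnnmap != NULL && vnnmap_size > 0);

	return vnnmap[sketch_hash(key, len) % vnnmap_size];
}
```

By contrast, setting the dmaster to the leader is a single assignment
per record, independent of key length.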

This bug was found while testing a "leader resignation" change, which
aims to speed up operations like "ctdb stop" by having an outgoing
leader resign, so other nodes do not have to wait for a leader
broadcast timeout.  This change does not require recovery to be run
after an election if there is no other change to the
cluster (e.g. leader capability removed: orderly transfer of power).
For a stopped node, the new leader would run a recovery due to the
stopped node becoming inactive... unless the outgoing leader runs
recovery (as per this bug), which handles the cluster change so that
it is no longer exposed to the new leader.  So, with this bug and the
leader resignation change, the databases stay inconsistent until a
subsequent recovery.

Although it is theoretically unnecessary, it would be possible to have
leader resignation force a full election, which would always result in
recovery, but that is a question for another day.  The current
behaviour is wrong because a recovery run by an inactive leader leaves
volatile databases (at least temporarily) in an inconsistent state.
So, this needs to be fixed.

In the longer term, CTDB will hopefully become more modular.
Elections and database recovery will happen in different modules.
This situation will have to be carefully handled.