Created attachment 14563
We are building a 3 node CTDB cluster for HA CEPH-backed NFS / iSCSI storage,
using ctdb_mutex_ceph_rados_helper for the lock. We use rbd-nbd mapped images
exported as block devices over iSCSI or with a file system over NFS.
Note: We're not using CephFS, nor active-active iSCSI or NFS.
- Ubuntu 18.04
- CTDB 4.9.1
- Ceph Mimic
- ctdb_mutex_ceph_rados_helper set to a 10s timeout
During failure testing we found a split brain scenario:
If the current recovery master (say, node 0) loses network connectivity (both
ctdb and ceph), it ignores its lock file:
"Time out getting recovery lock, allowing recmode set anyway"
Node 0 then tries to claim any VIPs it doesn't already have, which will fail
due to no network, leading to the node getting banned and dropping all VIPs,
then a cycle of trying to recover, failing, and getting banned again.
If node 0 has all the VIPs already when connectivity is lost, nothing really
happens. It ignores the lack of a reclock and stays "online", with status OK
and all VIPs configured. I let it run for about 10 minutes with no change.
When connectivity returns, several things can happen:
Node 0 could come online with all VIPs configured while they are also
configured on the other nodes. Not good.
Sometimes, the cluster recovers.
Usually, node 0 will try to get a lock, hit contention, clear all VIPs and
bans itself. After the ban it will try again and get banned again, and again,
etc. This will go on until someone manually kills ctdb on node 0. While that
ban loop is going, the other nodes carry on as if node 0 was still offline.
Sometimes, node 0 rejoins the others, 'ctdb status' on all nodes will show all
nodes OK, but 'ctdb ip' is different on every node and all VIPs are offline
until someone kills ctdb on node 0.
So, I see a few problems in this:
1. Ignoring the lock seems really weird. What is the point of having a lock
if it is ignored when it can't be reached? The code in ctdb_recover.c implies
this is a conscious decision:
/* Timeout. Consider this a success, not a failure,
* as we failed to set the recovery lock which is what
* we wanted. This can be caused by the cluster
* filesystem being very slow to arbitrate locks
* immediately after a node failure. */
I don't understand the reasoning behind "we failed, which is what we wanted,
so it's a success". Could this behaviour at least be configurable through a
tuneable? We need an unreachable lock to be a hard failure, to prevent the
situation where multiple nodes can be online with the same VIP on them.
2. That endless contention-ban loop. I'd expect on hitting contention there
would be a check to see which node holds the lock and whether it needs to join
instead of just banning itself and trying again in a seemingly endless loop.
3. That case where all nodes show OK but everything is broken. I have no
idea what's going on there... I've attached some logs.
I notice in the original logs attached to the email to the list:
2018/10/30 16:36:19.389704 ctdb-recoverd: Attempting to take recovery lock (!/usr/local/bin/ctdb_mutex_ceph_rados_helper ceph client.test foo ctdb_sgw_recovery_lock 30)
and the original configuration was:
This means that the cluster will notice a node is dead after about 5 seconds but ctdb_mutex_ceph_rados_helper won't retest the lock for at least 15 (i.e. 30/2) seconds. I'm guessing that if the node holding the lock is disconnected then it would hold the lock for up to 30 seconds (since it may just have been retaken).
So, this is a misconfiguration. You can't expect to take the recovery lock on one of the other nodes if there's no chance it has been released! :-)
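To make the timing argument concrete, here is a minimal sketch of the arithmetic, using only the numbers quoted above (about 5 seconds to notice a dead node, a 30 second lock duration, and a retest every duration/2). The function name and the returned keys are hypothetical and are not part of CTDB.

```python
def reclock_timing(dead_node_detect_s, lock_duration_s):
    """Relate node-failure detection time to the helper's lock duration.

    Hypothetical helper for illustration only; not a CTDB API.
    """
    # As described above, the helper retests the lock every duration/2.
    renew_interval_s = lock_duration_s / 2
    # A disconnected holder may have retaken the lock just before losing
    # the network, so the stale lock can persist for the full duration.
    worst_case_stale_s = lock_duration_s
    return {
        "renew_interval_s": renew_interval_s,
        "worst_case_stale_s": worst_case_stale_s,
        # True when another node may try to take the lock before it can
        # possibly have been released.
        "takeover_may_be_premature": dead_node_detect_s < worst_case_stale_s,
    }
```

With the original settings, reclock_timing(5, 30) reports a 15 second retest interval and flags the takeover as possibly premature, which is the misconfiguration described above.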
The current configuration seems saner:
2018/11/02 11:15:15.193564 ctdb-recoverd: Attempting to take recovery lock (!/usr/local/bin/ctdb_mutex_ceph_rados_helper ceph client.BITED-158081-rbd BITED-158081-RADOSMISC ctdb_sgw_recovery_lock 10)
but those 2 values are so close that there is no wiggle room for the reclock helper to time out trying to take the lock.
I would like to understand if there is a problem if you use the default KeepAlive values and the default lock timeout for ctdb_mutex_ceph_rados_helper - I think you're already back to the latter.
In the attached logs the problem is this:
2018/11/02 11:25:57.788524 ctdbd: Eventd went away
The fact that ctdbd stays alive after that, but in a broken state, is:
Since eventd should not go away the best thing is to just shut down ctdbd. A watchdog of some sort (e.g. cron job) can be used to restart ctdbd.
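A watchdog of the kind mentioned above could look roughly like the following sketch, intended to be run periodically (e.g. from cron). It assumes the daemon process is named "ctdbd" and that the service can be started with "systemctl start ctdb"; both are assumptions about the local installation, so adjust them as needed.

```python
import subprocess


def ctdbd_running():
    # Assumption: the daemon process is named "ctdbd".
    # pgrep exits with status 0 when at least one process matches.
    result = subprocess.run(["pgrep", "-x", "ctdbd"],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


def start_ctdb():
    # Assumption: CTDB is managed by a systemd unit called "ctdb".
    subprocess.run(["systemctl", "start", "ctdb"], check=False)


def watchdog(is_running=ctdbd_running, start=start_ctdb):
    """Start ctdbd if it is not running. Returns True if a start was issued."""
    if is_running():
        return False
    start()
    return True
```

Calling watchdog() with no arguments uses the two default checks. This only helps once ctdbd actually shuts down when eventd goes away; it does not detect a ctdbd that is still running in a broken state.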
That will be fixed in 4.9.2, to be released soon.
The real question is why eventd is going away. Is it crashing? We did fix one out of memory crash some time ago. If there is a crash and you have a core file, can you please open another bug, keep the core file and attach a backtrace?
We do have at least a couple of bugs to fix for the reclock handling:
1. If the reclock is lost we try to retake it
The node may have lost it because the node is no longer part of the quorum
in the underlying locking mechanism. The node should call an election
instead. The node can still win the election with itself but when it tries
to take the reclock it should fail and then ban itself.
I haven't seen this reflected in any of the logs for this bug, though. Have
you seen the following logged?
Recovery lock helper terminated unexpectedly - trying to retake recovery lock
2. ctdb_mutex_fcntl_helper.c does not make regular attempts to try to
retake the lock. It should do this and, like ctdb_mutex_ceph_rados_helper,
it should exit if it can't retake the lock.
Again, the helper that needs fixing isn't used in the scenario for this bug.
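The behaviour wanted from a lock helper in item 2 (regularly retest the lock and exit if it can't be retaken) can be sketched as below. This is not the code of either helper; recheck, interval_s and max_checks are hypothetical names, and recheck stands in for whatever operation confirms the lock is still held.

```python
import time


def hold_lock(recheck, interval_s, sleep=time.sleep, max_checks=None):
    """Periodically retest a held lock.

    Returns the number of checks performed when the lock can no longer be
    retaken (the caller should then exit so that ctdbd notices the loss),
    or None if max_checks was reached with the lock still held.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(interval_s)
        checks += 1
        if not recheck():
            # Lock lost: stop holding and let the caller exit.
            return checks
    return None
```

The max_checks parameter exists only so the loop can be exercised in a test; a real helper would loop until the lock is lost or it is told to stop.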
I think the following is being misinterpreted:
"Time out getting recovery lock, allowing recmode set anyway"
That isn't a real attempt to take the lock. This is a sanity check that is done on all nodes at the end of recovery to make sure the lock can't be taken (because it is held by another process). This is expected to fail, so when it times out we regard it as a form of failure.
take_reclock_handler() is the place where real attempts to take the reclock are handled. Here, the timeout is handled as expected.
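The difference between the two cases can be summarised in a small sketch. It is an illustration of the logic described above, not the CTDB source; HelperResult and both function names are hypothetical.

```python
from enum import Enum


class HelperResult(Enum):
    TAKEN = "taken"          # the lock was acquired
    CONTENDED = "contended"  # the lock is held by another process
    TIMEOUT = "timeout"      # the helper did not answer in time


def sanity_check_passes(result):
    # End-of-recovery check: the lock should already be held elsewhere,
    # so the check passes whenever the lock could NOT be taken. A timeout
    # counts as "could not take it", hence "allowing recmode set anyway".
    return result is not HelperResult.TAKEN


def real_take_succeeds(result):
    # Real attempt to take the reclock: only actually acquiring the lock
    # counts as success; contention and timeout are both failures.
    return result is HelperResult.TAKEN
```

So a timeout lets the sanity check pass but makes a real attempt fail, which is why the quoted log message does not mean the lock is being ignored.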
I'm definitely not saying that we don't have bugs - I've pointed out some above - but I'd like to see if this problem still happens with default configuration, as mentioned above. If it does then please provide details. If the defaults are OK then please tweak the KeepAlive settings again but consider how they relate to the timeout setting for ctdb_mutex_ceph_rados_helper.
We may have to add a cooling off period between completion of an election and trying to take the lock, where we don't ban a node. Right now I think that can be worked around with the ElectionTimeout... but it really is the wrong knob to be turned if this is a real problem.
Can you please retest with default settings and let me know what you find? If you continue to see "Eventd went away", can you please debug it and open a bug for any crash you're seeing?