Hello,

I may have found a problem: only the recovery master seems to be able to take over the public IP from other nodes.

Details:
I forced the recmaster node into the "unhealthy" state by
a) killing winbindd
b) unplugging the network cable

In neither of these two cases was the public IP taken over by the second (non-recmaster) node. When I did the same on the second node, the IP was taken over by the first node as expected. Then I switched the recmaster to the second node and repeated the tests with the same results.

I could reproduce this behaviour in two very different environments.

Test environment 1:
- 2 nodes x86_64 based
  + node 1 with RHEL 5.3
  + node 2 with SLES 10 SP2
- IBM GPFS 3.2.1.11
- Samba 3.2.14 (from Sernet, but recompiled with clustering support)
- CTDB 1.0.88/1.0.86/1.0.84

Test environment 2 (near production):
- 2 nodes IBM p5 based (RHEL 5.3 ppc64)
- IBM GPFS 3.2.1.12
- Samba 3.2.14 (from Sernet, but recompiled with clustering support)
- CTDB 1.0.88

/etc/sysconfig/ctdb:
CTDB_RECOVERY_LOCK="/gpfs/fs1/ctdb-x86.lck"
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
CTDB_MANAGES_WINBIND=yes
CTDB_NODES=/etc/ctdb/nodes
CTDB_DBDIR=/var/lib/ctdb
CTDB_DBDIR_PERSISTENT=/var/lib/ctdb/persistent
CTDB_EVENT_SCRIPT_DIR=/etc/ctdb/events.d
CTDB_LOGFILE=/var/log/ctdb/log.ctdb
CTDB_DEBUGLEVEL=2

/etc/ctdb/nodes:
192.168.136.4
192.168.136.7

/etc/ctdb/public_addresses:
192.168.136.152/24 bond1
192.168.136.153/24 bond1

The configuration files are identical on all cluster nodes.

Please let me know if I can provide further information (logs, traces, etc.).

Christoph
Changed state to "invalid" because of an improper combination of CTDB (1.0.88) and Samba (3.2.14).
Hi - Thanks for taking the time to report this. This bug report is not invalid: the action (b) taken - unplugging a network cable - does not depend on Samba. CTDB should take care of failing over the IPs in that case. Cheers - Michael
I think the problem here was "unplugging *the* network cable". The recommended networking configuration is to have a private network, so CTDB can communicate between nodes, and a separate public network. In this case it looks like only one network was used. If this network is unplugged then the recovery master can't communicate with the other nodes... So, I'm going to mark this old bug as invalid... :-)
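
For anyone who wants to check their own setup for the single-network situation described above, here is a rough sketch (not part of CTDB) that flags node addresses lying inside a public-address subnet. The function name shared_subnets is made up for illustration, and the input values are copied from the report. A shared subnet is only a hint that both kinds of traffic may use the same network; it does not prove they share a cable or interface.

```python
import ipaddress

# Values copied from the report's /etc/ctdb/nodes and /etc/ctdb/public_addresses.
NODES = ["192.168.136.4", "192.168.136.7"]
PUBLIC_ADDRESSES = ["192.168.136.152/24 bond1", "192.168.136.153/24 bond1"]


def shared_subnets(nodes, public_addresses):
    """Return (node_ip, public_network) pairs where a node address
    falls inside the subnet of a public address.

    Hypothetical helper: a heuristic only, not a CTDB feature.
    """
    # Each public_addresses line is "<ip>/<prefix> <interface>"; keep the subnet.
    public_nets = set()
    for line in public_addresses:
        cidr = line.split()[0]
        public_nets.add(ipaddress.ip_interface(cidr).network)

    hits = []
    for node in nodes:
        addr = ipaddress.ip_address(node)
        for net in sorted(public_nets, key=str):
            if addr in net:
                hits.append((str(addr), str(net)))
    return hits


if __name__ == "__main__":
    for node_ip, net in shared_subnets(NODES, PUBLIC_ADDRESSES):
        print(f"node address {node_ip} is inside public subnet {net}")
```

With the files from the report, both node addresses land in the public /24. A nodes file on a dedicated private subnet would produce no hits.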