Hello, I am hoping for some help and suggestions. A Windows 10 client accesses a share on a 3-node Samba cluster, with CTDB managing Samba. I did the following:

1. The client opens the shared directory and copies files from it to the local disk.
2. I kill smbd on the node the client is connected to and wait a moment; the client then retransmits.
3. After the retransmission, CTDB on all nodes stays in the recovery state.

The CTDB configuration file is as follows:

CTDB_RECOVERY_LOCK=null_lock
CTDB_PUBLIC_INTERFACE=enp5s0f0
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
CTDB_MANAGES_WINBIND=yes
CTDB_SERVICE_SMB=smb
CTDB_SERVICE_NMB=nmb
CTDB_SERVICE_WINBIND=winbind
CTDB_NODES=/etc/ctdb/nodes
CTDB_SAMBA_SKIP_SHARE_CHECK=yes
CTDB_DEBUGLEVEL=INFO

The processes are as follows:

[root@inspur02 ~]# ps -ef |grep ctdb
root 796608 1 0 Jan09 ? 00:02:51 /usr/sbin/ctdbd --reclock=null_lock --pidfile=/run/ctdb/ctdbd.pid --nlist=/etc/ctdb/nodes --public-addresses=/etc/ctdb/public_addresses --public-interface=enp5s0f0 -d INFO
root 796843 796608 0 Jan09 ? 00:00:03 /usr/sbin/ctdbd --reclock=null_lock --pidfile=/run/ctdb/ctdbd.pid --nlist=/etc/ctdb/nodes --public-addresses=/etc/ctdb/public_addresses --public-interface=enp5s0f0 -d INFO
root 825768 796843 0 Jan09 ? 00:00:00 /usr/libexec/ctdb/ctdb_mutex_fcntl_helper null_lock
root 825821 796608 0 Jan09 ? 00:00:00 /usr/libexec/ctdb/ctdb_lock_helper 65 796608 63 DB /var/lib/ctdb/smbXsrv_open_global.tdb.1 0x0
root 1401016 796843 0 08:44 ? 00:00:00 /usr/libexec/ctdb/ctdb_recovery_helper 53 51 /var/run/ctdb/ctdbd.socket 1359003619
root 1401373 393898 0 08:44 pts/2 00:00:00 grep --color=auto ctdb

Attaching gdb to the ctdb_lock_helper process:

(gdb) bt
#0 0x00007f1c8bce0410 in __nanosleep_nocancel () from /lib64/libc.so.6
#1 0x00007f1c8bce02c4 in sleep () from /lib64/libc.so.6
#2 0x00007f1c8ce580c2 in ctdb_wait_for_process_to_exit ()
#3 0x00007f1c8ce5702d in main ()
(gdb)

The CTDB log is as follows:

[root@inspur03 ~]# tail -f /var/log/log.ctdb
2017/01/10 08:47:04.393652 [613650]: Not vacuuming smbXsrv_open_global.tdb (in recovery)
2017/01/10 08:47:04.393673 [613650]: Not vacuuming leases.tdb (in recovery)
2017/01/10 08:47:04.393691 [613650]: Not vacuuming locking.tdb (in recovery)
2017/01/10 08:47:04.393710 [613650]: Not vacuuming brlock.tdb (in recovery)
2017/01/10 08:47:04.393729 [613650]: Not vacuuming smbXsrv_tcon_global.tdb (in recovery)
2017/01/10 08:47:04.393748 [613650]: Not vacuuming smbXsrv_session_global.tdb (in recovery)
2017/01/10 08:47:04.393768 [613650]: Not vacuuming smbXsrv_version_global.tdb (in recovery)
2017/01/10 08:47:04.393788 [613650]: Not vacuuming serverid.tdb (in recovery)
2017/01/10 08:47:04.393808 [613650]: Not vacuuming netlogon_creds_cli.tdb (in recovery)
2017/01/10 08:47:04.393827 [613650]: Not vacuuming g_lock.tdb (in recovery)
2017/01/10 08:47:05.008675 [613650]: ../ctdb/server/ctdb_monitor.c:324 in recovery. Wait one more second
2017/01/10 08:47:06.008972 [613650]: ../ctdb/server/ctdb_monitor.c:324 in recovery. Wait one more second
2017/01/10 08:47:07.009367 [613650]: ../ctdb/server/ctdb_monitor.c:324 in recovery. Wait one more second
2017/01/10 08:47:08.010386 [613650]: ../ctdb/server/ctdb_monitor.c:324 in recovery. Wait one more second

Thank you, best wishes.
The problem also happens when I do the following:

1. The client opens the shared directory and copies files from it to the local disk.
2. I restart ctdb on the node serving the share; CTDB on all nodes then stays in the recovery state.
When you restart CTDB on a node, that node becomes DISCONNECTED from the cluster. This triggers a CTDB recovery, which recovers the databases and re-assigns the public IP addresses. Once the recovery is over, SMB clients can reconnect to Samba.

If you kill smbd on a node, that node becomes UNHEALTHY. This does not trigger a CTDB recovery; it only causes the public IP addresses to be re-assigned.

Looking at the CTDB configuration, you cannot have the following line:

CTDB_RECOVERY_LOCK=null_lock

This variable should be set to a path on the shared file system. If you don't want to use CTDB_RECOVERY_LOCK, comment the variable out. That explains why CTDB gets stuck in recovery: CTDB cannot take the recovery lock "null_lock" and ends up banning nodes.
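As a rough sketch of the two configurations described above: the fragment below shows the variable either commented out or pointing at an absolute path. The path /clusterfs/ctdb/reclock is a hypothetical placeholder for a directory on the shared file system, not something taken from the reporter's setup.

```shell
# Option 1: run without a recovery lock by commenting the variable out
# on every node.
# CTDB_RECOVERY_LOCK=null_lock

# Option 2: point the variable at an absolute path on the shared file
# system. The path below is hypothetical; substitute the real mount point
# and use the same value on all nodes.
# CTDB_RECOVERY_LOCK=/clusterfs/ctdb/reclock
```

Only one of the two options should be active, and the setting must match on all nodes.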
Thanks for the reply. Let me describe my test scenario again: a Windows 10 client accesses the shared directory and copies the files under it to the local disk, and then I restart ctdb on the node the client is currently connected to. After that, ctdb on all nodes stays in the recovery state. Sometimes, however, ctdb on all nodes returns to normal. I checked the ctdb log and did not find any error messages about the recovery lock. Best wishes.
(In reply to sunyekuan from comment #3) Please provide the logs from all the nodes after you restart CTDB on one node.
I am very sorry, but the logs from all the nodes are gone. I will reproduce the problem in a while. In the meantime, I would like to know what ctdb_lock_helper does, and when it exits. Thank you very much!
Created attachment 12810: master node ctdb log
Created attachment 12811: restart ctdb node ctdb log
Created attachment 12812: the left node ctdb log
Hello, the problem occurred again on the 3-node Samba cluster. I have attached the 3 ctdb logs; please help me check the problem. Thank you very much!
(In reply to sunyekuan from comment #9) As I mentioned before, the configuration setting of CTDB_RECOVERY_LOCK is wrong. Either specify an absolute path to a file on the shared file system, or comment out the CTDB_RECOVERY_LOCK setting on all the nodes. Setting CTDB_RECOVERY_LOCK=null_lock is equivalent to saying: use the file "null_lock" on each node (relative to the root directory for the ctdbd process) as a cluster-wide lock. This is completely WRONG. Please fix the configuration and re-create the problem.
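The absolute-path requirement can be checked before starting ctdbd. The following is a minimal sketch only; the function name is_absolute_path is made up for illustration and is not part of CTDB.

```shell
#!/bin/sh
# Return success if the given value looks like an absolute path,
# i.e. it starts with a slash. A bare name such as "null_lock" fails.
# (is_absolute_path is a hypothetical helper, not a CTDB command.)
is_absolute_path() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Example: check the value from the original report.
if is_absolute_path "null_lock"; then
    echo "absolute path"
else
    echo "not an absolute path: fix or comment out CTDB_RECOVERY_LOCK"
fi
```

This only checks the form of the value. It does not verify that the path is on a file system shared by all nodes.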
Thank you for the reply again. I did as you said and set CTDB_RECOVERY_LOCK=/mnt/nfs/ctdb_lockfile, but the problem still happens. Here is the backtrace:

#0 0x00007fe10b7f23a4 in fcntl () from /lib64/libpthread.so.0
#1 0x00007fe10b5d0e79 in fcntl_lock () from /usr/lib64/samba/libtdb.so.1
#2 0x00007fe10b5d0fac in tdb_brlock () from /usr/lib64/samba/libtdb.so.1
#3 0x00007fe10b5d1bb9 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#4 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#5 0x00007fe10b5d1c13 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#6 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#7 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#8 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#9 0x00007fe10b5d1c13 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#10 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#11 0x00007fe10b5d1c13 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#12 0x00007fe10b5d1c13 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#13 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#14 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#15 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#16 0x00007fe10b5d1c13 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#17 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#18 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#19 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#20 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#21 0x00007fe10b5d1d2f in tdb_allrecord_lock () from /usr/lib64/samba/libtdb.so.1
#22 0x00007fe10b5d2057 in tdb_lockall () from /usr/lib64/samba/libtdb.so.1
#23 0x00007fe10bc29661 in lock_db ()
#24 0x00007fe10bc29857 in main ()

I don't know why.
(In reply to sunyekuan from comment #11) You cannot use the recovery lock on an NFS-mounted file system. If you are using NFS to share the file system, then please disable the recovery lock.
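One way to spot this situation in advance is to look at the file system type of the directory that holds the lock file. The sketch below assumes GNU coreutils `stat`, whose `-f -c %T` options print the file system type name; the helper names is_nfs_type and reclock_fs_type are hypothetical and not part of CTDB.

```shell
#!/bin/sh
# Return success if a file system type name, as printed by
# GNU `stat -f -c %T`, denotes an NFS mount.
# (is_nfs_type is a hypothetical helper, not a CTDB command.)
is_nfs_type() {
    case "$1" in
        nfs|nfs4) return 0 ;;
        *)        return 1 ;;
    esac
}

# Print the file system type of the directory containing the lock file.
# Assumes GNU coreutils stat; other stat implementations use other options.
reclock_fs_type() {
    stat -f -c %T "$(dirname "$1")"
}

# Usage, with the path from the report:
#   if is_nfs_type "$(reclock_fs_type /mnt/nfs/ctdb_lockfile)"; then
#       echo "recovery lock is on NFS: disable CTDB_RECOVERY_LOCK"
#   fi
```

The type check is a heuristic for this one case. It does not prove that any other file system is suitable for the recovery lock.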
Thanks very much. Which version of Samba will support "Transparent Failover"?
Closing as "invalid", since reporter was trying to use recovery lock in a filesystem shared by NFS. The final question doesn't relate to the reported bug.