Bug 12504 - CTDB stays in the recovery state after a restart on the CentOS 7 platform
Summary: CTDB stays in the recovery state after a restart on the CentOS 7 platform
Status: RESOLVED INVALID
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: CTDB
Version: 4.5.1
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Amitay Isaacs
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-10 00:48 UTC by sunyekuan
Modified: 2017-10-26 03:21 UTC
CC List: 2 users

See Also:


Attachments
master node ctdb log (2.73 MB, text/plain)
2017-01-10 08:00 UTC, sunyekuan
restart ctdb node ctdb log (2.85 MB, text/plain)
2017-01-10 08:05 UTC, sunyekuan
remaining node ctdb log (832.50 KB, text/plain)
2017-01-10 08:05 UTC, sunyekuan

Description sunyekuan 2017-01-10 00:48:07 UTC
Hello,
    I'm hoping for some help and suggestions.

    A Windows 10 client accesses a 3-node clustered Samba share, and CTDB manages Samba. I do the following:
        1. The client accesses the shared directory and copies the files in it to the local machine.
        2. Kill smbd on the node the client is connected to; after a moment, the client retries the transfer (a rough server-side sketch of this step follows the list).
        3. After the retry, CTDB on all nodes stays in the recovery state.
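
    A minimal server-side sketch of step 2, assuming smbstatus is available on that node (the PID below is a placeholder; use the one smbstatus reports for the Windows 10 client):

        # List connected clients in brief form; each line includes the PID of
        # the smbd process serving that client.
        smbstatus -b
        # Kill the smbd serving the Windows 10 client
        # (replace 12345 with the PID reported by smbstatus).
        kill -9 12345
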
    The ctdb configuration file is as follows:

CTDB_RECOVERY_LOCK=null_lock
CTDB_PUBLIC_INTERFACE=enp5s0f0
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
CTDB_MANAGES_WINBIND=yes
CTDB_SERVICE_SMB=smb
CTDB_SERVICE_NMB=nmb
CTDB_SERVICE_WINBIND=winbind
CTDB_NODES=/etc/ctdb/nodes
CTDB_SAMBA_SKIP_SHARE_CHECK=yes
CTDB_DEBUGLEVEL=INFO

    The processes are as follows:
[root@inspur02 ~]# ps -ef |grep ctdb
root      796608       1  0 Jan09 ?        00:02:51 /usr/sbin/ctdbd --reclock=null_lock --pidfile=/run/ctdb/ctdbd.pid --nlist=/etc/ctdb/nodes --public-addresses=/etc/ctdb/public_addresses --public-interface=enp5s0f0 -d INFO
root      796843  796608  0 Jan09 ?        00:00:03 /usr/sbin/ctdbd --reclock=null_lock --pidfile=/run/ctdb/ctdbd.pid --nlist=/etc/ctdb/nodes --public-addresses=/etc/ctdb/public_addresses --public-interface=enp5s0f0 -d INFO
root      825768  796843  0 Jan09 ?        00:00:00 /usr/libexec/ctdb/ctdb_mutex_fcntl_helper null_lock
root      825821  796608  0 Jan09 ?        00:00:00 /usr/libexec/ctdb/ctdb_lock_helper 65 796608 63 DB /var/lib/ctdb/smbXsrv_open_global.tdb.1 0x0
root     1401016  796843  0 08:44 ?        00:00:00 /usr/libexec/ctdb/ctdb_recovery_helper 53 51 /var/run/ctdb/ctdbd.socket 1359003619
root     1401373  393898  0 08:44 pts/2    00:00:00 grep --color=auto ctdb

    
    gdb attached to the ctdb_lock_helper process:

(gdb) bt
#0  0x00007f1c8bce0410 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x00007f1c8bce02c4 in sleep () from /lib64/libc.so.6
#2  0x00007f1c8ce580c2 in ctdb_wait_for_process_to_exit ()
#3  0x00007f1c8ce5702d in main ()
(gdb)

    The ctdb log is as follows:
[root@inspur03 ~]# tail -f /var/log/log.ctdb 
2017/01/10 08:47:04.393652 [613650]: Not vacuuming smbXsrv_open_global.tdb (in recovery)
2017/01/10 08:47:04.393673 [613650]: Not vacuuming leases.tdb (in recovery)
2017/01/10 08:47:04.393691 [613650]: Not vacuuming locking.tdb (in recovery)
2017/01/10 08:47:04.393710 [613650]: Not vacuuming brlock.tdb (in recovery)
2017/01/10 08:47:04.393729 [613650]: Not vacuuming smbXsrv_tcon_global.tdb (in recovery)
2017/01/10 08:47:04.393748 [613650]: Not vacuuming smbXsrv_session_global.tdb (in recovery)
2017/01/10 08:47:04.393768 [613650]: Not vacuuming smbXsrv_version_global.tdb (in recovery)
2017/01/10 08:47:04.393788 [613650]: Not vacuuming serverid.tdb (in recovery)
2017/01/10 08:47:04.393808 [613650]: Not vacuuming netlogon_creds_cli.tdb (in recovery)
2017/01/10 08:47:04.393827 [613650]: Not vacuuming g_lock.tdb (in recovery)
2017/01/10 08:47:05.008675 [613650]: ../ctdb/server/ctdb_monitor.c:324 in recovery. Wait one more second
2017/01/10 08:47:06.008972 [613650]: ../ctdb/server/ctdb_monitor.c:324 in recovery. Wait one more second
2017/01/10 08:47:07.009367 [613650]: ../ctdb/server/ctdb_monitor.c:324 in recovery. Wait one more second
2017/01/10 08:47:08.010386 [613650]: ../ctdb/server/ctdb_monitor.c:324 in recovery. Wait one more second
   
    Hoping for some help and suggestions.
    Thank you, best wishes.
Comment 1 sunyekuan 2017-01-10 01:38:46 UTC
The problem also happens when I do the following:
1. The client accesses the shared directory and copies the files in it to the local machine;
2. Restart CTDB on the node serving the share; CTDB on all nodes then stays in the recovery state.
Comment 2 Amitay Isaacs 2017-01-10 06:32:31 UTC
When you restart CTDB on a node, that node gets DISCONNECTED from the cluster.  This will cause CTDB recovery to recover databases and also to re-assign the public IP addresses.  After CTDB recovery is over, SMB clients can re-connect to samba.

If you kill smbd on a node, then that node will become UNHEALTHY.  This will not cause CTDB recovery.  It will only cause reassignment of public IP addresses.
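
Node states and public IP assignments can be checked from any cluster node with the standard ctdb tool; a minimal sketch, assuming a default installation:

    # Show the state of every node (OK, UNHEALTHY, DISCONNECTED, BANNED, ...)
    # and whether the cluster is currently in recovery.
    ctdb status
    # Show which node currently hosts each public IP address; after a
    # failover, the addresses of the affected node should have moved.
    ctdb ip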

Looking at the CTDB configuration, you cannot have the following line:

  CTDB_RECOVERY_LOCK=null_lock

This variable should be set to a path on the shared file system.  If you don't want to use CTDB_RECOVERY_LOCK, then this variable should be commented out.  That explains why CTDB gets stuck in recovery: CTDB will not be able to take the recovery lock "null_lock" and will end up banning nodes.
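
A sketch of the two valid ways to set it (the path below is only an example and assumes a clustered file system mounted at /shared on all nodes):

  # Option 1: point the recovery lock at a file on the clustered file system.
  CTDB_RECOVERY_LOCK=/shared/.ctdb/reclock

  # Option 2: do not use a recovery lock at all; leave the variable commented out.
  #CTDB_RECOVERY_LOCK=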
Comment 3 sunyekuan 2017-01-10 06:43:18 UTC
Thanks for the reply.
Let me re-describe my test scenario:
A Windows 10 client accesses the shared directory and copies the files under it to the local machine; I then restart CTDB on the node the client is currently connected to.
After that, CTDB on all nodes stays in the recovery state.
Sometimes, however, CTDB on all nodes returns to normal.
I checked the ctdb log and did not find any error messages about the recovery lock.

Best wishes.
Comment 4 Amitay Isaacs 2017-01-10 06:52:00 UTC
(In reply to sunyekuan from comment #3)

Please provide the logs from all the nodes after you restart CTDB on one node.
Comment 5 sunyekuan 2017-01-10 07:25:30 UTC
I am very sorry, but the logs from all nodes are missing. I will reproduce the problem shortly.
In the meantime, what is the function of ctdb_lock_helper, and when does it exit?
Thank you very much!
Comment 6 sunyekuan 2017-01-10 08:00:54 UTC
Created attachment 12810 [details]
master node ctdb log
Comment 7 sunyekuan 2017-01-10 08:05:11 UTC
Created attachment 12811 [details]
restart ctdb node ctdb log
Comment 8 sunyekuan 2017-01-10 08:05:40 UTC
Created attachment 12812 [details]
remaining node ctdb log
Comment 9 sunyekuan 2017-01-10 08:07:19 UTC
Hello, the problem has occurred again on the 3-node Samba cluster. I have attached the ctdb logs from all 3 nodes; please help me check the problem.

thank you very much!
Comment 10 Amitay Isaacs 2017-01-11 00:51:12 UTC
(In reply to sunyekuan from comment #9)

As I mentioned before, the CTDB_RECOVERY_LOCK configuration setting is wrong.
Either specify an absolute path to a file on the shared file system, or comment out the CTDB_RECOVERY_LOCK setting on all the nodes.

Setting CTDB_RECOVERY_LOCK=null_lock is equivalent to saying: use the file "null_lock" on each node (relative to the root directory of the ctdbd process) as a cluster-wide lock.  This is completely WRONG.

Please fix the configuration and re-create the problem.
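
As a quick consistency check (a sketch; it assumes the configuration file is /etc/sysconfig/ctdb on CentOS 7 and that the onnode helper shipped with CTDB is available):

  # Every node should report the same setting: either the same absolute
  # path on the shared file system, or the line commented out everywhere.
  onnode all grep CTDB_RECOVERY_LOCK /etc/sysconfig/ctdb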
Comment 11 sunyekuan 2017-01-11 11:00:47 UTC
Thank you again for the reply.
I did as you said and set CTDB_RECOVERY_LOCK=/mnt/nfs/ctdb_lockfile,
but the problem still happens. The ctdb log shows this backtrace:
#0  0x00007fe10b7f23a4 in fcntl () from /lib64/libpthread.so.0
#1  0x00007fe10b5d0e79 in fcntl_lock () from /usr/lib64/samba/libtdb.so.1
#2  0x00007fe10b5d0fac in tdb_brlock () from /usr/lib64/samba/libtdb.so.1
#3  0x00007fe10b5d1bb9 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#4  0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#5  0x00007fe10b5d1c13 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#6  0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#7  0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#8  0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#9  0x00007fe10b5d1c13 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#10 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#11 0x00007fe10b5d1c13 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#12 0x00007fe10b5d1c13 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#13 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#14 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#15 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#16 0x00007fe10b5d1c13 in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#17 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#18 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#19 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#20 0x00007fe10b5d1c5e in tdb_chainlock_gradual () from /usr/lib64/samba/libtdb.so.1
#21 0x00007fe10b5d1d2f in tdb_allrecord_lock () from /usr/lib64/samba/libtdb.so.1
#22 0x00007fe10b5d2057 in tdb_lockall () from /usr/lib64/samba/libtdb.so.1
#23 0x00007fe10bc29661 in lock_db ()
#24 0x00007fe10bc29857 in main ()
I don't know why.
Comment 12 Amitay Isaacs 2017-01-11 12:10:29 UTC
(In reply to sunyekuan from comment #11)

You cannot use recovery lock over NFS mounted file system.  If you are using NFS to share the file system, then please disable recovery lock.
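
A sketch of that change, assuming the configuration file is /etc/sysconfig/ctdb and a systemd-managed CTDB service on CentOS 7 (repeat on every node):

  # In /etc/sysconfig/ctdb, disable the recovery lock by commenting it out:
  #CTDB_RECOVERY_LOCK=/mnt/nfs/ctdb_lockfile

  # Then restart CTDB so the change takes effect:
  systemctl restart ctdb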
Comment 13 sunyekuan 2017-01-12 03:27:33 UTC
Thanks very much.
Which version of Samba will provide "Transparent Failover"?
Comment 14 Martin Schwenke 2017-10-26 03:21:48 UTC
Closing as "invalid", since the reporter was trying to use the recovery lock on a file system shared via NFS.  The final question doesn't relate to the reported bug.