Created attachment 10989 [details]
GlusterFS NFS Event Monitor Script (Old Version Ignore)

Hello Support,

There is no CTDB monitor script for the GlusterFS NFS implementation. The normal NFS event script that comes with CTDB cannot be used because GlusterFS manages NFS itself. Without a proper monitoring script, CTDB will not initiate a failover when the GlusterFS NFS services fail. Attached is a script to solve this problem.

Please see testing below:

# ctdb status
Number of nodes:2
pnn:0 10.0.1.10 OK (THIS NODE)
pnn:1 10.0.1.11 OK
Generation:2096778561
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:1

# gluster volume status smb_br01 | grep 'NFS Server on localhost'
NFS Server on localhost 2049 Y 15479

# kill -9 15479

# gluster volume status smb_br01 | grep 'NFS Server on localhost'
NFS Server on localhost N/A N N/A

# ctdb status
Number of nodes:2
pnn:0 10.0.1.10 UNHEALTHY (THIS NODE)
pnn:1 10.0.1.11 OK
Generation:2096778561
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:1

# tail /var/log/log.ctdb
2015/04/26 14:00:29.465384 [ 2050]: Node became UNHEALTHY. Ask recovery master 1 to perform ip reallocation
2015/04/26 14:00:34.838603 [ 2050]: 60.glusternfs: ERROR: glusterfs_nfs tcp port 2049 is not responding
2015/04/26 14:00:34.841680 [ 2050]: 60.glusternfs: ERROR: glusterfs_nfs tcp port 38465 is not responding
2015/04/26 14:00:34.844732 [ 2050]: 60.glusternfs: ERROR: glusterfs_nfs tcp port 38466 is not responding
2015/04/26 14:00:45.210742 [ 2050]: 60.glusternfs: ERROR: glusterfs_nfs tcp port 2049 is not responding
2015/04/26 14:00:45.213786 [ 2050]: 60.glusternfs: ERROR: glusterfs_nfs tcp port 38465 is not responding
2015/04/26 14:00:45.216709 [ 2050]: 60.glusternfs: ERROR: glusterfs_nfs tcp port 38466 is not responding

# systemctl restart glusterd && systemctl restart glusterfsd

# gluster volume status smb_br01 | grep 'NFS Server on localhost'
NFS Server on localhost 2049 Y 18629

# ctdb status
Number of nodes:2
pnn:0 10.0.1.10 OK (THIS NODE)
pnn:1 10.0.1.11 OK
Generation:2096778561
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:1
#

Regards,
Ben Draper
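For anyone reading without the attachment, the kind of check that produces the 60.glusternfs log lines above can be sketched roughly as below. This is not the attached script. The helper name port_is_open, the GLUSTER_NFS_PORTS variable and the use of bash's /dev/tcp pseudo-device are assumptions made for illustration; only the port numbers and the error message text come from the log.

```shell
#!/bin/sh
# Sketch only: NOT the attached 60.glusternfs script.
# Ports taken from the log output above; adjust for your setup.
GLUSTER_NFS_PORTS="2049 38465 38466"

# Return 0 if something accepts TCP connections on host:port.
# Runs an explicit bash so that /dev/tcp works even when /bin/sh is
# not bash (assumes bash is installed; if it is missing the port is
# reported as not responding). There is no connect timeout here.
port_is_open() {
    bash -c ': </dev/tcp/"$0"/"$1"' "$1" "$2" 2>/dev/null
}

# Check every port given after the host. Print one ERROR line per
# port that does not answer and return non-zero if any port failed.
verify_ports() {
    host="$1"
    shift
    rc=0
    for port in "$@"; do
        if ! port_is_open "$host" "$port"; then
            echo "ERROR: glusterfs_nfs tcp port $port is not responding"
            rc=1
        fi
    done
    return $rc
}
```

In a "monitor" event this would be called along the lines of `verify_ports 127.0.0.1 $GLUSTER_NFS_PORTS || exit 1`, since a non-zero exit from the monitor event is what marks the node UNHEALTHY.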
Created attachment 10990 [details] GlusterFS NFS Event Monitor Script GlusterFS NFS Event Monitor Script
Thanks for this suggestion and sorry for taking so long to respond.

I assume that TCP ports 38465 and 38466 are RPC services? The job you're doing with verify_ports() can probably be done more reliably and completely using our existing RPC port checking code. Can you please send me the output of "rpcinfo -p" and "rpcinfo -s" so I can do a sanity check?

We're in the process of folding 60.ganesha into 60.nfs so that we only have a single, unified NFS eventscript in CTDB.

I'm in the process of reworking our RPC checking code. It will use a directory of configuration files (/etc/ctdb/nfs-checks.d/ by default), with a more extensible scheme than what we currently do in /etc/ctdb/nfs-rpc-checks.d/.

I like your idea of actually monitoring the port(s) for the portmapper itself. I'm planning to add a configuration file to do this by default. Thanks. :-)

Anything else will be done by configuring a callout for the NFS system being used. That will stop us hard-coding all sorts of rubbish in our core code. :-)

We will ship a callout for the Linux kernel NFS server and configuration files for RPC checks. These will be used by default. We will provide a sample callout for Ganesha as documentation and will expect NFS Ganesha to ship an up-to-date callout for CTDB. This way they will always be in sync (with themselves) and won't have to depend on us to merge changes to 60.ganesha.

Gluster NFS could then also provide a very simple callout and instructions (or a script) to set up configuration files in /etc/ctdb/nfs-checks.d/ for the RPC checks.
Thanks for getting back to me Martin, I really appreciate it. I'll try to provide as much information as possible to help you with this.

I like your idea of having one handler with callouts for the different NFS implementations, that would be fantastic! :-)

The issue I had is that the nfs-kernel-server service never gets started because GlusterFS has its own NFS implementation, so I needed a different script to make monitoring work properly, as you can see below. As for the RPC side of my checking code, you're probably right that there are better ways to do it :-)

# rpcinfo -p
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100005    3   tcp  38465  mountd
    100005    1   tcp  38466  mountd
    100003    3   tcp   2049  nfs
    100021    4   tcp  38468  nlockmgr
    100227    3   tcp   2049  nfs_acl
    100024    1   udp  58457  status
    100024    1   tcp  38265  status
    100021    1   udp    932  nlockmgr
    100021    1   tcp    934  nlockmgr
#

# rpcinfo -s
   program version(s) netid(s)                  service     owner
    100000  2,3,4     local,udp,tcp,udp6,tcp6   portmapper  superuser
    100005  1,3       tcp                       mountd      superuser
    100003  3         tcp                       nfs         superuser
    100021  1,4       udp,tcp                   nlockmgr    superuser
    100227  3         tcp                       nfs_acl     superuser
    100024  1         tcp6,udp6,tcp,udp         status      29
#

# fuser 111/tcp
111/tcp: 583
# fuser 38465/tcp
38465/tcp: 2446
# fuser 38466/tcp
38466/tcp: 2446
# fuser 2049/tcp
2049/tcp: 2446
# fuser 38468/tcp
38468/tcp: 2446
# fuser 58457/udp
58457/udp: 2683
# fuser 38265/tcp
38265/tcp: 2683
# fuser 932/udp
932/udp: 2446
# fuser 934/tcp
934/tcp: 2446

# ps -elf | grep 583 | grep -v grep
5 S rpc 583 1 0 80 0 - 9977 poll_s 17:05 ? 00:00:00 /sbin/rpcbind -w
# ps -elf | grep 2446 | grep -v grep
5 S root 2446 1 0 80 0 - 142206 futex_ 17:05 ? 00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/572f9a612871bae19917989c604bd09b.socket
# ps -elf | grep 2683 | grep -v grep
5 S rpcuser 2683 1 0 80 0 - 12691 poll_s 17:05 ? 00:00:00 /sbin/rpc.statd
#

# firewall-cmd --list-all
public (default, active)
  interfaces: eno67113728
  sources:
  services: nfs rpc-bind samba ssh
  ports: 38466/tcp 38465/tcp
  masquerade: no
  forward-ports:
  icmp-blocks:
  rich rules:
#

If you need any more information please let me know.

Thanks,
Ben
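As a rough illustration of how the rpcinfo -p listing above maps to ports worth checking, the following sketch filters that output for the TCP registrations of named services. The function name rpc_tcp_ports is made up for this example and is not part of CTDB; the column layout assumed is the one shown in the listing (program, vers, proto, port, service).

```shell
#!/bin/sh
# Sketch: read `rpcinfo -p` style output on stdin and print
# "service version port" for each TCP registration of the services
# named as arguments. Assumed columns: program vers proto port service.
rpc_tcp_ports() {
    awk -v wanted="$*" '
        BEGIN {
            # Turn the space-separated argument list into a lookup table
            n = split(wanted, names, " ")
            for (i = 1; i <= n; i++) want[names[i]] = 1
        }
        # The header line is skipped because its third field is "proto"
        $3 == "tcp" && ($5 in want) { print $5, $2, $4 }
    '
}
```

On a live node this would be used as `rpcinfo -p | rpc_tcp_ports nfs mountd nlockmgr`.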
The new 60.nfs with $CTDB_NFS_CALLOUT and the new /etc/ctdb/nfs-checks.d/ directory is now upstream in Samba master.

To implement your existing eventscript you would need to:

* Implement a callout with "monitor-pre" defined and have it run verify_supporting_services(). Set CTDB_NFS_CALLOUT to point to where it is installed. You probably want to define "register" too, so that the callout is only called for "monitor-pre". Take a look at nfs-linux-kernel-callout as an example.

* Implement verify_ports() using .check files in /etc/ctdb/nfs-checks.d/. If you want CTDB to become unhealthy after a single failure then you would just have files like:

  20.nfs.check:

    # nfs
    version=3
    unhealthy_after=1

    # nlockmgr
    version="1 4"
    unhealthy_after=1

  and so on. The current default is to just check for services available on "tcp" but you can also do "udp". This is done by using the "family" variable in the check file (there's a README to explain). I see I didn't implement support in ctdb_check_rpc() for properly checking IPv6 service availability. That patch is now in my queue, so you'll be able to explicitly check for "tcp6" and "udp6" if you need to. :-)

  You could either include instructions to install/create these or provide a directory which can be pointed to by the CTDB_NFS_CHECKS_DIR variable. This variable is currently undocumented but we could document it.

Will you also want to use CTDB's statd-callout? Or will we need to quickly add something to disable this? I'm not sure whether Gluster NFS has a cluster-aware lock manager or whether it needs to use CTDB's hackery to keep track of the clients that have locks.
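A minimal sketch of what such a callout might look like follows. It is an illustration under assumptions, not a tested Gluster callout: the pid file path is the one visible in the ps output earlier in this bug, the GLUSTER_NFS_PIDFILE override and the function names are made up here, and the liveness test is only a stand-in for whatever verify_supporting_services() does in the attached script. As in nfs-linux-kernel-callout, "register" prints the operations the callout implements, one per line.

```shell
#!/bin/sh
# Sketch of a Gluster NFS callout for CTDB's 60.nfs (assumptions above).
# Default pid file taken from the glusterfs command line shown earlier;
# GLUSTER_NFS_PIDFILE is a hypothetical override for testing.
GLUSTER_NFS_PIDFILE="${GLUSTER_NFS_PIDFILE:-/var/lib/glusterd/nfs/run/nfs.pid}"

# Healthy only if the pid file exists and names a process we can signal.
# kill -0 sends no signal, it only tests that the process exists
# (run as root, as CTDB event scripts are).
gluster_nfs_running() {
    [ -r "$GLUSTER_NFS_PIDFILE" ] || return 1
    pid=$(cat "$GLUSTER_NFS_PIDFILE")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

gluster_nfs_callout() {
    action="$1"
    case "$action" in
    register)
        # Only operations listed here should be passed to the callout
        echo "monitor-pre"
        ;;
    monitor-pre)
        if ! gluster_nfs_running; then
            echo "ERROR: Gluster NFS server is not running"
            return 1
        fi
        ;;
    *)
        # Anything else is not implemented by this sketch
        return 1
        ;;
    esac
}
```

An installed script would end with `gluster_nfs_callout "$@"`, and CTDB_NFS_CALLOUT in the CTDB configuration would be set to the path where that script lives.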
Hi Martin,

All those points make sense. I'll see if I can get some time to create the callout and the files required to verify ports, and then test that it all works properly from upstream with GlusterFS.

I'll have to double check on the statd-callout question. I think the GlusterFS NFS implementation takes care of locks itself, but I need to confirm that before I can be 100% sure.

Thanks,
Ben
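The "required files to verify ports" could be generated by a small script along these lines. This is a sketch under assumptions: only the 20.nfs.check name and the version/unhealthy_after settings come from Martin's comment; the other file names and the mountd versions (taken from the rpcinfo output above) are guesses, and it assumes 60.nfs derives the RPC service name from the NN.service.check file name, which should be confirmed against the README in nfs-checks.d. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch: write one check file.
# Usage: write_check DIR FILE SERVICE VERSIONS
write_check() {
    dir="$1"
    file="$2"
    service="$3"
    versions="$4"
    mkdir -p "$dir" || return 1
    # unhealthy_after=1 makes the node unhealthy after a single failure
    cat > "$dir/$file" <<EOF
# $service
version="$versions"
unhealthy_after=1
EOF
}

# Create checks for the RPC services that Gluster NFS registers.
# File names other than 20.nfs.check are assumptions.
setup_gluster_nfs_checks() {
    checks_dir="$1"
    write_check "$checks_dir" 20.nfs.check nfs "3" &&
    write_check "$checks_dir" 30.nlockmgr.check nlockmgr "1 4" &&
    write_check "$checks_dir" 40.mountd.check mountd "1 3"
}
```

Running `setup_gluster_nfs_checks /some/dir` and pointing CTDB_NFS_CHECKS_DIR at that directory would keep these files separate from the defaults shipped in /etc/ctdb/nfs-checks.d/.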
Hi Ben, The NFS callout feature is now available in Samba 4.3. Can you please check if it meets your needs? Thanks... peace & happiness, martin
I'm going to go out on a limb and claim that the helper support in 60.nfs allows GlusterFS NFS to be supported... so will now close this as "fixed" :-)