The test scenario runs CTDB (1.0 2008-05-29) with GPFS 3.2.1-4 on a cluster of 3 servers. Files are ingested concurrently into different cluster nodes from separate Windows servers via CIFS, and the test software receives the error ERROR_NETNAME_DELETED. The failure is reproducible within about 30 minutes of starting the test.

Here are some logs from dc-gpfs-gn2 around the time of failure:

/var/log/messages:

Aug 1 23:42:52 dc-gpfs-gn2 kernel: nfsd: last server has exited
Aug 1 23:42:52 dc-gpfs-gn2 kernel: nfsd: unexporting all filesystems
Aug 1 23:42:52 dc-gpfs-gn2 kernel: RPC: failed to contact portmap (errno -5).
Aug 1 23:42:52 dc-gpfs-gn2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Aug 1 23:42:52 dc-gpfs-gn2 kernel: NFSD: starting 90-second grace period
Aug 1 23:42:58 dc-gpfs-gn2 kernel: nfsd: last server has exited
Aug 1 23:42:58 dc-gpfs-gn2 kernel: nfsd: unexporting all filesystems
Aug 1 23:42:59 dc-gpfs-gn2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Aug 1 23:42:59 dc-gpfs-gn2 kernel: NFSD: starting 90-second grace period

ctdb.log:

2008/08/01 23:42:51.766257 [ 7442]: Recovery has started
2008/08/01 23:42:51.992687 [ 7442]: Takeover of IP 9.11.192.102/23 on interface bond1
2008/08/01 23:42:52.083970 [ 7442]: Sending NFS tickle ack for 9.11.192.102 to 9.11.193.163:1023
2008/08/01 23:42:52.092136 [ 7442]: Sending NFS tickle ack for 9.11.192.102 to 9.11.192.182:836
2008/08/01 23:42:52.138972 [ 7442]: Recovery has finished
2008/08/01 23:42:58.428241 [ 7442]: Recovery has started
2008/08/01 23:42:58.528365 [ 7442]: Release of IP 9.11.192.102/23 on interface bond1
2008/08/01 23:42:58.771637 [ 7442]: Recovery has finished

ctdb.log from dc-gpfs-gn1:

2008/08/01 23:42:50.973209 [ 8619]: rpcinfo: RPC: Timed out
2008/08/01 23:42:50.973569 [ 8619]: ERROR: NFS not responding to rpc requests
2008/08/01 23:42:50.973877 [ 6428]: Event script /etc/ctdb/events.d/60.nfs monitor failed with error 1
2008/08/01 23:42:50.974095 [ 8619]: monitor event failed - disabling node
2008/08/01 23:42:51.765963 [ 8619]: Recovery has started
2008/08/01 23:42:51.865051 [ 8621]: Deterministic IPs enabled. Resetting all ip allocations
2008/08/01 23:42:51.865155 [ 8619]: Release of IP 9.11.192.102/23 on interface bond1
2008/08/01 23:42:52.122118 [ 8619]: Recovery has finished
2008/08/01 23:42:58.265338 [ 8619]: monitor event OK - node re-enabled
2008/08/01 23:42:58.427697 [ 8619]: Recovery has started
2008/08/01 23:42:58.527880 [ 8621]: Deterministic IPs enabled. Resetting all ip allocations
2008/08/01 23:42:58.652277 [ 8619]: Takeover of IP 9.11.192.102/23 on interface bond1
2008/08/01 23:42:58.771024 [ 8619]: Recovery has finished

messages from dc-gpfs-gn1:

Aug 1 23:42:57 dc-gpfs-gn1 kernel: nfsd: last server has exited
Aug 1 23:42:57 dc-gpfs-gn1 kernel: nfsd: unexporting all filesystems
Aug 1 23:42:57 dc-gpfs-gn1 kernel: RPC: failed to contact portmap (errno -5).
Aug 1 23:42:57 dc-gpfs-gn1 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Aug 1 23:42:57 dc-gpfs-gn1 kernel: NFSD: starting 90-second grace period
Aug 1 23:42:58 dc-gpfs-gn1 kernel: nfsd: last server has exited
Aug 1 23:42:58 dc-gpfs-gn1 kernel: nfsd: unexporting all filesystems
Aug 1 23:42:59 dc-gpfs-gn1 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Aug 1 23:42:59 dc-gpfs-gn1 kernel: NFSD: starting 90-second grace period
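For context on the dc-gpfs-gn1 log above: the "Event script /etc/ctdb/events.d/60.nfs monitor failed" message comes from CTDB's periodic monitor event, which probes the local NFS services over RPC and, if the probe fails, marks the node unhealthy so its public IPs are moved to another node. Below is a minimal sketch of that kind of probe; the real 60.nfs script shipped with CTDB differs in detail, and the list of services checked here is an assumption.

    #!/bin/sh
    # Sketch of an NFS health probe in the spirit of CTDB's 60.nfs monitor event.
    # A non-zero exit tells ctdbd the node is unhealthy, so it is disabled and its
    # public IPs are taken over by another node (the behaviour seen in the logs).
    for svc in nfs mountd nlockmgr status ; do
        if ! rpcinfo -u localhost "$svc" > /dev/null 2>&1 ; then
            echo "ERROR: $svc not responding to rpc requests"
            exit 1
        fi
    done
    exit 0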
Dustin,

Your bug report doesn't give nearly enough information. The logs you sent are at the minimal log level, and you don't include any of the config files. Please raise the ctdb log levels and attach the output of the ctdb_diagnostics script for future bug reports.

Even so, I can take a wild guess at what the problem is. The logs you sent show that ctdb is managing NFS as well as CIFS, and that it is detecting NFS server problems on your nodes. When it detects that NFS is not working, it disables the public IP on that node. Disabling the public IPs causes disconnects on all clients, both NFS and CIFS; the idea is that the IP gets moved to a new node.

You should also check that you have set up rr-DNS correctly, and that you are using the rr-DNS public addresses in your tests. If you use non-rr-DNS IPs, then when a failover happens you will get a permanent disconnect rather than a failover.

Still, this is all complete guesswork, as from your bug report I have no idea what your config is like. I'd also suggest you look into setting up a SoFS 'autocluster' system, so you can see how a properly set up clustered Samba system should be configured.

Cheers, Tridge
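For reference, the configuration Tridge refers to normally lives in /etc/sysconfig/ctdb (or /etc/default/ctdb on Debian-style systems). The sketch below is illustrative only, not the reporter's actual setup: the recovery-lock path and debug level are assumptions, and exact variable names can differ between CTDB releases.

    # /etc/sysconfig/ctdb -- illustrative sketch, not the reporter's real config
    CTDB_RECOVERY_LOCK=/gpfs/.ctdb/recovery.lock      # lock file on the shared GPFS filesystem (path assumed)
    CTDB_NODES=/etc/ctdb/nodes                        # private cluster addresses, one per line
    CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses  # public IPs that fail over between nodes
    CTDB_MANAGES_SAMBA=yes                            # ctdb starts, stops and monitors smbd
    CTDB_MANAGES_NFS=yes                              # ctdb monitors NFS via events.d/60.nfs, as seen in the logs
    CTDB_DEBUGLEVEL=3                                 # raise for more verbose logs (name and scale may differ by release)

    # Collecting the information requested above, run on one node:
    #   ctdb_diagnostics > /tmp/ctdb_diagnostics.out
    # The debug level of a running daemon can also be raised with:
    #   ctdb setdebug <level>

On the rr-DNS point, the usual arrangement is one DNS name with an A record per public address, so clients are spread across the nodes and reconnect through a surviving node after an IP takeover; connecting to an individual node's IP gives the permanent disconnect described above.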
Required information never provided....