Bug 5672 - CTDB receives error_netname_deleted during clustered ingest test
Status: RESOLVED INVALID
Product: CTDB 2.5.x or older
Classification: Unclassified
Component: ctdb (unspecified)
Platform: x86 Linux
Importance: P3 normal
Assigned To: Martin Schwenke
QA Contact: Samba QA Contact
Reported: 2008-08-05 19:38 UTC by Dustin Massop
Modified: 2016-08-10 09:03 UTC (History)
Description Dustin Massop 2008-08-05 19:38:46 UTC
The test scenario runs CTDB (1.0 2008-05-29) with GPFS 3.2.1-4 on a cluster of 3 servers. Files are ingested concurrently into different cluster nodes from separate Windows servers via CIFS, and the test software receives the error ERROR_NETNAME_DELETED. The failure is reproducible within about 30 minutes of starting the test.

Here are some logs from dc-gpfs-gn2 around the time of failure:

/var/log/messages:

Aug  1 23:42:52 dc-gpfs-gn2 kernel: nfsd: last server has exited
Aug  1 23:42:52 dc-gpfs-gn2 kernel: nfsd: unexporting all filesystems
Aug  1 23:42:52 dc-gpfs-gn2 kernel: RPC: failed to contact portmap (errno -5).
Aug  1 23:42:52 dc-gpfs-gn2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Aug  1 23:42:52 dc-gpfs-gn2 kernel: NFSD: starting 90-second grace period
Aug  1 23:42:58 dc-gpfs-gn2 kernel: nfsd: last server has exited
Aug  1 23:42:58 dc-gpfs-gn2 kernel: nfsd: unexporting all filesystems
Aug  1 23:42:59 dc-gpfs-gn2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Aug  1 23:42:59 dc-gpfs-gn2 kernel: NFSD: starting 90-second grace period

ctdb.log:

2008/08/01 23:42:51.766257 [ 7442]: Recovery has started
2008/08/01 23:42:51.992687 [ 7442]: Takeover of IP 9.11.192.102/23 on interface bond1
2008/08/01 23:42:52.083970 [ 7442]: Sending NFS tickle ack for 9.11.192.102 to 9.11.193.163:1023
2008/08/01 23:42:52.092136 [ 7442]: Sending NFS tickle ack for 9.11.192.102 to 9.11.192.182:836
2008/08/01 23:42:52.138972 [ 7442]: Recovery has finished
2008/08/01 23:42:58.428241 [ 7442]: Recovery has started
2008/08/01 23:42:58.528365 [ 7442]: Release of IP 9.11.192.102/23 on interface bond1
2008/08/01 23:42:58.771637 [ 7442]: Recovery has finished

ctdb.log from dc-gpfs-gn1:

2008/08/01 23:42:50.973209 [ 8619]: rpcinfo: RPC: Timed out
2008/08/01 23:42:50.973569 [ 8619]: ERROR: NFS not responding to rpc requests
2008/08/01 23:42:50.973877 [ 6428]: Event script /etc/ctdb/events.d/60.nfs monitor failed with error 1
2008/08/01 23:42:50.974095 [ 8619]: monitor event failed - disabling node
2008/08/01 23:42:51.765963 [ 8619]: Recovery has started
2008/08/01 23:42:51.865051 [ 8621]: Deterministic IPs enabled. Resetting all ip allocations
2008/08/01 23:42:51.865155 [ 8619]: Release of IP 9.11.192.102/23 on interface bond1
2008/08/01 23:42:52.122118 [ 8619]: Recovery has finished
2008/08/01 23:42:58.265338 [ 8619]: monitor event OK - node re-enabled
2008/08/01 23:42:58.427697 [ 8619]: Recovery has started
2008/08/01 23:42:58.527880 [ 8621]: Deterministic IPs enabled. Resetting all ip allocations
2008/08/01 23:42:58.652277 [ 8619]: Takeover of IP 9.11.192.102/23 on interface bond1
2008/08/01 23:42:58.771024 [ 8619]: Recovery has finished

messages from dc-gpfs-gn1:

Aug  1 23:42:57 dc-gpfs-gn1 kernel: nfsd: last server has exited
Aug  1 23:42:57 dc-gpfs-gn1 kernel: nfsd: unexporting all filesystems
Aug  1 23:42:57 dc-gpfs-gn1 kernel: RPC: failed to contact portmap (errno -5).
Aug  1 23:42:57 dc-gpfs-gn1 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Aug  1 23:42:57 dc-gpfs-gn1 kernel: NFSD: starting 90-second grace period
Aug  1 23:42:58 dc-gpfs-gn1 kernel: nfsd: last server has exited
Aug  1 23:42:58 dc-gpfs-gn1 kernel: nfsd: unexporting all filesystems
Aug  1 23:42:59 dc-gpfs-gn1 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Aug  1 23:42:59 dc-gpfs-gn1 kernel: NFSD: starting 90-second grace period
Comment 1 Andrew Tridgell 2008-08-10 20:21:22 UTC
Dustin,

Your bug report doesn't give nearly enough information. The logs you sent are at the minimum log level, and you haven't included any of the config files.

Please raise the ctdb log level and attach the output of the ctdb_diagnostics script to future bug reports.
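[A sketch of what that might look like on a CTDB of this vintage; debug levels and output paths vary between versions, so treat these commands as illustrative rather than exact:]

```shell
# Raise the running ctdb daemon's log level (1.x uses numeric levels;
# later releases also accept names such as DEBUG)
ctdb setdebug 4

# Collect cluster-wide diagnostic output to attach to the report
ctdb_diagnostics > /tmp/ctdb_diagnostics.out 2>&1
```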

Even so, I can take a wild guess at what the problem is. The logs you sent show that ctdb is managing NFS as well as CIFS, and that it is detecting NFS server problems on your nodes. When ctdb detects that NFS is not working on a node, it disables that node and releases its public IPs, which disconnects all of that node's clients, both NFS and CIFS. The idea is that those IPs then get taken over by a healthy node.
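[The moving parts described here live in the ctdb configuration. A minimal sketch, assuming the conventional file locations; the failing 60.nfs monitor event in the log above implies NFS management is switched on, but verify the variable names against your version's templates:]

```
# /etc/sysconfig/ctdb (excerpt)
CTDB_MANAGES_SAMBA=yes
CTDB_MANAGES_NFS=yes           # enables the 60.nfs monitor seen in the logs
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses

# /etc/ctdb/public_addresses -- one "address/netmask interface" per line
9.11.192.102/23 bond1
```

With this in place, a failed 60.nfs monitor event disables the node and releases its public addresses, producing exactly the Release/Takeover pairs visible in the ctdb logs above.]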

You should also check that you have set up rr-DNS correctly and that your tests use the rr-DNS public names. If clients connect to a fixed non-rr-DNS IP, a failover will leave them permanently disconnected instead of letting them reconnect to a surviving node.
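[For illustration, rr-DNS here just means publishing one name with an A record per public address, so clients resolve to the whole pool rather than a single node. A hypothetical BIND-style zone fragment; the name and the first two addresses are invented, and only .102 appears in the logs above:]

```
; round-robin name for the clustered file service (illustrative)
fileserv  IN  A  9.11.192.100   ; hypothetical sibling address
fileserv  IN  A  9.11.192.101   ; hypothetical sibling address
fileserv  IN  A  9.11.192.102   ; public address seen in the logs
```

Clients that connect via the round-robin name can, after a failover, resolve and reconnect to whichever node currently holds a public IP, instead of hanging on a dead address.]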

Still, this is all complete guesswork, as from your bug report I have no idea what your config is like.

I'd also suggest you look into setting up a SoFS 'autocluster' system, so you can see how a properly set up clustered Samba system should be configured.

Cheers, Tridge
Comment 2 Martin Schwenke 2016-08-10 09:03:43 UTC
Required information was never provided.