60.nfs.script uses the checks in nfs-checks.d/ to detect that NFS RPC services have failed. For rpc.mountd, rpc.rquotad or rpc.statd, it kills the relevant process via killall and restarts it by constructing a command line for the relevant RPC daemon and running it directly. systemd later fails to start such hard-restarted services because they are already running. systemd tracks these services individually via PID files instead of using a generic kill command to stop them. When systemd is in use and is tracking these services, they should only ever be stopped and started via systemd.
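As a rough illustration of the intended behaviour (not the code from the patch set), a restart helper could hand the restart to systemd instead of running killall and the daemon by hand. The function names, the `RUN` dry-run prefix and the unit names below are assumptions for this sketch; unit names in particular vary between distributions.

```shell
#!/bin/sh
# Sketch only: function names, the RUN prefix and the unit names are
# assumptions made for illustration, not CTDB's actual code.

# Map an RPC daemon to a systemd unit name (assumed; distribution specific).
rpc_unit_for ()
{
    case "$1" in
    rpc.mountd)  echo "nfs-mountd.service" ;;
    rpc.statd)   echo "rpc-statd.service" ;;
    rpc.rquotad) echo "rpc-rquotad.service" ;;
    *) return 1 ;;
    esac
}

# True when the system was booted with systemd.
running_under_systemd ()
{
    [ -d /run/systemd/system ]
}

# Restart an RPC daemon.  $1 is the daemon name, $2 is "systemd" when
# systemd is tracking the service.  Set RUN=echo for a dry run.
rpc_restart ()
{
    _daemon="$1"
    _manager="$2"

    if [ "$_manager" = "systemd" ] ; then
        # Let systemd stop and start the service so its tracking stays valid
        _unit=$(rpc_unit_for "$_daemon") || return 1
        ${RUN:-} systemctl restart "$_unit"
    else
        # Hard restart, as described above (daemon options omitted here)
        ${RUN:-} killall -q -9 "$_daemon"
        ${RUN:-} "$_daemon"
    fi
}
```

A caller might pick the second argument with `running_under_systemd && manager=systemd`, and `RUN=echo rpc_restart rpc.mountd systemd` only prints the command that would be run.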
Additionally, the code that "corrects" the nfsd thread count when it is not at the expected level also does so when the thread count is 0. This is almost certainly a mistake because it can cause the general problem described above in the bug description.
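A minimal sketch of the guard this suggests, assuming that a thread count of 0 means nfsd is not running and should be left for the service manager to start. The function name is hypothetical and this is not the code from the patch set.

```shell
#!/bin/sh
# Sketch only: nfsd_threads_need_fix is a hypothetical helper.

# Succeed only when nfsd is running with the wrong number of threads.
# $1 is the current thread count, $2 is the expected thread count.
nfsd_threads_need_fix ()
{
    _current="$1"
    _expected="$2"

    # 0 threads: nfsd is stopped, so do not "correct" the count
    [ "$_current" -ne 0 ] || return 1

    # Otherwise a correction is needed only if the counts differ
    [ "$_current" -ne "$_expected" ]
}
```

On Linux the current count can be read from /proc/fs/nfsd/threads, for example `read -r threads </proc/fs/nfsd/threads`, before calling `nfsd_threads_need_fix "$threads" 8`; the value 8 here is only an example.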
Created attachment 15041
Patch for 4.9 and 4.10

This is the patch set that went into master, *minus* the patch that changes the default to systemd-redhat.
Comment on attachment 15041
Patch for 4.9 and 4.10

I'm going to remove this patch for now and try testing for a couple of days without the last commit. I think there's something else going on...
Created attachment 15052
Patch for 4.9 and 4.10

This is the patch set that went into master, *minus*:

* The patch that changes the default to systemd-redhat

  We don't want to change the default for released versions, but we do want to give people the ability to edit the call-back to take advantage of the changes.

* The final patch, which avoids changing the thread count when it is 0

  I originally thought this was being triggered and causing problems in cluster testing. However, I have since fixed the test environment and tested many times with the patch reverted... and have seen no problems.
Hi Karolin, This is ready for v4-9 and v4-10.
(In reply to Amitay Isaacs from comment #5) Pushed to autobuild-v4-{10,9}-test.
(In reply to Karolin Seeger from comment #6) Pushed to both branches. Closing out bug report. Thanks!