The Samba-Bugzilla – Bug 12833
Profile (ntuser.dat) locked "forever" when shutting down but not when logging off then shutting down
Last modified: 2018-01-13 08:11:37 UTC
I have had this problem in both the 3.6.X and 4.2.X streams of Samba.
I am not the only one having this problem; it is reported on a variety of distros, e.g. Ubuntu and CentOS.
It is really annoying when
* you have to reboot after an (un)install
* you have to reboot after an upgrade
* you shut down and then move to another workstation to continue working there
The behavior also differs between the two cases:
* when you SHUTDOWN while logged in, then restart and log in again, you get
the dreaded "preparing your profile" message and are logged in with a temporary
profile. This is because ntuser.dat stays locked for a long time; the lock can
last forever, or until you run "/etc/rc.d/init.d/smb reload", which frees it.
* when you LOGOUT then SHUTDOWN, the ntuser.dat file is unlocked 5s afterwards and
there is no problem logging in again.
I consider this a bug because the locking behaves differently depending on whether you shut down directly or log out and then shut down.
I also consider it a bug because it is time-dependent. When you shut down in the evening, come back the next morning and log in, there is no problem, as the lock will be gone by then. I have observed oplocks on ntuser.dat for more than three hours by running "lsof | grep -i USERNAME | grep smbd" or "lsof | grep -i ntuser" - I had to go home then.
I also consider shutting down while logged in to be essentially the same as logging out and then shutting down.
I know you can set "oplocks=no" to solve this problem, but then why is it OK when logging out and then shutting down? Also, there is nothing wrong with locks; I can see there is a lot of merit in them. For example, a person logs in at one workstation, then goes to another station, works there, shuts down and continues to work on the first station - this has never led to any problems over the last ten years .... due to locks.
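The lsof check described above can be scripted so it is easy to repeat while waiting for the lock to go away. This is only a sketch: the function names are hypothetical, it combines the two grep variants into one filter (lines mentioning both smbd and ntuser, case-insensitively), and it assumes lsof is installed and that you have enough privileges to see smbd's open files.

```python
import subprocess


def smbd_ntuser_lines(lsof_output):
    """Return the lsof lines that mention both smbd and ntuser (case-insensitive)."""
    matches = []
    for line in lsof_output.splitlines():
        lowered = line.lower()
        if "smbd" in lowered and "ntuser" in lowered:
            matches.append(line)
    return matches


def current_holders():
    """Run lsof and return the lines where smbd still has an ntuser file open."""
    result = subprocess.run(["lsof"], capture_output=True, text=True, check=False)
    return smbd_ntuser_lines(result.stdout)
```

Calling current_holders() every few seconds after the client has shut down shows how long smbd keeps the file open.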
IMHO a cleaner should be running to clean up these things, i.e. the cleaner is started when the person logs out and makes sure everything is in order ...
I have also done this test and there was no problem or corruption of the files involved:
* logged in and waited for everything to work
* just opened one file, made a change, saved
* then shutdown and started the machine again
* found the PID of the process holding on to NTUSER.DAT
* gave it a whack with "kill -9 PID"
* logged in
there was absolutely no problem ...
Why hold on to the lock in the first place?
Forgot to add:
* mailing list firstname.lastname@example.org,
Subject Domain Logout, then domain login again, profile corrupt -> replaced by TEMP profile
I guess the problem is that we send an oplock/lease break to the old connection
and don't get an ack in time. As the TCP timeouts are too long to detect the
broken connection and Windows may not send a TCP RST, smbd believes the connection
is still connected and just downgrades the oplock while keeping the file
open, which causes the NT_STATUS_SHARING_VIOLATION.
The work towards multi-channel support will hopefully fix that,
as it will detect the broken connection much sooner and close the file.
(In reply to Stefan Metzmacher from comment #2)
Something like the following in the [global] section:
socket options = TCP_KEEPCNT=4 TCP_KEEPIDLE=240 TCP_KEEPINTVL=15
might detect the broken connection sooner.
It sends the first TCP keepalive after
being idle for 4 minutes (240s) and continues to send
3 additional keepalives every 15s until the broken connection
is finally detected after 5 minutes.
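To make the three settings concrete, here is a minimal Python sketch of what they correspond to at the socket level on Linux. It is not how smbd applies them; the helper name make_keepalive_socket is hypothetical, and enabling SO_KEEPALIVE explicitly is part of the sketch because the kernel only sends keepalive probes on sockets that have it switched on.

```python
import socket


def make_keepalive_socket(idle=240, interval=15, count=4):
    """Create a TCP socket with Linux keepalive tuning applied (illustration only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Keepalive probes are only sent when SO_KEEPALIVE is enabled.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idle time before the first probe (TCP_KEEPIDLE=240 above).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between probes (TCP_KEEPINTVL=15 above).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Unanswered probes before the connection is considered dead (TCP_KEEPCNT=4 above).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

The values can be read back with getsockopt() to confirm that the kernel accepted them.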
(In reply to Stefan Metzmacher from comment #3)
I guess it's giving up after 4 minutes and 45 seconds...
Depending on how fast your machines reboot, you may need to adjust the values.
The lowest useful values would be:
socket options = TCP_KEEPCNT=5 TCP_KEEPIDLE=30 TCP_KEEPINTVL=1
As the OPLOCK_BREAK_TIMEOUT is 30 seconds and smbd forces
a downgrade after OPLOCK_BREAK_TIMEOUT*2.
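The arithmetic behind both sets of values can be checked with a small Python sketch. It assumes the Linux keepalive behaviour: the first probe goes out after TCP_KEEPIDLE seconds of idle time, further probes follow every TCP_KEEPINTVL seconds, and the connection is dropped one interval after the TCP_KEEPCNT-th unanswered probe. The function name keepalive_timeline is hypothetical.

```python
def keepalive_timeline(idle, interval, count):
    """Return (probe_times, give_up_time) in seconds since the last traffic.

    Assumes Linux semantics: probes at idle, idle+interval, ... and the
    connection is dropped one interval after the last unanswered probe.
    """
    probe_times = [idle + i * interval for i in range(count)]
    give_up_time = idle + count * interval
    return probe_times, give_up_time


# TCP_KEEPIDLE=240 TCP_KEEPINTVL=15 TCP_KEEPCNT=4
print(keepalive_timeline(240, 15, 4))  # ([240, 255, 270, 285], 300)

# TCP_KEEPIDLE=30 TCP_KEEPINTVL=1 TCP_KEEPCNT=5
print(keepalive_timeline(30, 1, 5))    # ([30, 31, 32, 33, 34], 35)
```

Under this assumption the first set of values sends its last probe at 4 minutes 45 seconds and drops the connection at 5 minutes, while the second set drops it after 35 seconds, which is inside the 60 seconds (OPLOCK_BREAK_TIMEOUT*2) mentioned above.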
(In reply to Stefan Metzmacher from comment #2)
I know that roaming profiles can be a) a PITA and b) of different sizes depending on what people store in them. So far I have succeeded in storing almost everything in the HOME share, but there is always that one stupid programmer using a hard-coded path, and I cannot win, meaning loads of data end up stored in the profile. I know some games that do this :-(((((
So because of the different sizes, Windows needs to transfer more or less data when logging off (which is part of shutting down), which is why the whole process doesn't always take the same time. That, granted, makes it tricky.
I would guess that Windows has some sort of "hey I am finished now" flag that is sent from the workstation to the server after logoff - so why not capture this flag and release the lock?
Why does the server need to keep the lock? A logoff is a logoff, meaning that the user does not want to be logged in anymore. Also once you click that button for logoff there is no going back ....
I checked what happens to the file (lsof). When you logoff the lock is released immediately ... when you shutdown it is not? That seems strange to me.
(In reply to Jobst Schmalenbach from comment #5)
If the client really sends the logoff, everything is fine.
The problem comes when the client reboots and neither
sends a logoff nor closes the TCP connection.
(In reply to Stefan Metzmacher from comment #6)
When you shut down client 1 and then try to log in at client 2, the profile is still locked long after client 1 has been shut down.
This issue seems to cause trouble for many people (see links below). On TechNet some people report that this also happens when the server is a Windows Server 2012 R2 (I haven't verified this myself) - so this seems to be an issue with at least Windows 10, occurring when it is shut down or restarted on fast computers (with an SSD).
Some people claim that a shutdown script sleeping for about 15 seconds has helped them (Windows 10 probably closes the connection then/releases the locks).
Tried it with samba 4.7.4:
After shutting down a Windows 10 1703 x64 client with current monthly rollups (fast startup disabled, without the workaround mentioned above), a
netstat -a -n -o | grep IP_OF_SHUTDOWN_COMPUTER
still shows the connection in the state ESTABLISHED with a very long keepalive timeout. The connection remained visible in netstat on the server for much longer than OPLOCK_BREAK_TIMEOUT * 2 (if that really defaults to 30 seconds).
I don't know whether the suggested TCP_KEEPALIVE options have any negative side effects (e.g. for Windows clients going to standby/hibernate without releasing the locks), but they seem to help for real shutdowns/reboots. After the keepalive has expired, the connection is closed, the lock is freed and you can log in on the next client without any noticeable issues.
If they do have negative impacts, maybe it would be possible to create an smb.conf option so that these keepalive options are only applied to connections to the profile share? Users having trouble with standby/hibernating clients could then use another share path without the short keepalive delay.
The best solution would be for Microsoft to fix this on the Windows clients, but nobody knows whether that will happen.