Samba 3.3.7 client and server side, mounted using mount.cifs. The "serverino" option is enabled on the mount, but that doesn't help - though it does make it easier to see what's going on. % ls -lai symlinktodelete t 3146959 lrwxrwxrwx 1 dave users 1 Sep 29 18:57 symlinktodelete -> t 3081135 -rw-rw-rw- 1 dave users 0 Sep 29 18:50 t % rm symlinktodelete % ls -lai symlinktodelete 3146959 lrwxrwxrwx 1 dave users 1 Sep 29 18:57 symlinktodelete -> t At some point it appears to have confused the symlink itself (inode 3146959 = 0x3004CF) and the target of the symlink (inode 3081135 = 0x2F03AF). Just to be clear, the expected behaviour is of course to unlink the symlink, NOT the target. In this case unlink returns success, but nothing is unlinked. If the target doesn't exist we get different bad behaviour: % ls -lai symlinktodelete 3146959 lrwxrwxrwx 1 dave users 1 Sep 29 18:57 symlinktodelete -> t % rm symlinktodelete rm: cannot remove `symlinktodelete': No such file or directory % ls -lai symlinktodelete 3146959 lrwxrwxrwx 1 dave users 1 Sep 29 18:57 symlinktodelete -> t In the log where target exists: [2009/09/30 14:38:20, 5] smbd/close.c:close_remove_share_mode(330) close_remove_share_mode: file ./symlinktodelete. Delete on close was set - del eting file. [2009/09/30 14:38:20, 5] smbd/close.c:close_remove_share_mode(383) close_remove_share_mode: file ./symlinktodelete. Delete on close was set and d ev and/or inode does not match [2009/09/30 14:38:20, 5] smbd/close.c:close_remove_share_mode(388) close_remove_share_mode: file ./symlinktodelete. stored file_id 804:3004cf, st at file_id 804:2f03af [2009/09/30 14:38:20, 2] smbd/close.c:close_normal_file(606) dave closed file ./symlinktodelete (numopen=315) NT_STATUS_OK [2009/09/30 14:38:20, 5] smbd/files.c:file_free(471) freed files structure 12312 (315 used) log where target doesn't exist: [2009/09/30 14:46:54, 3] smbd/dosmode.c:unix_mode(124) unix_mode(./symlinktodelete) returning 0744 [2009/09/30 14:46:54, 4] smbd/open.c:open_file_ntcreate_internal(1884) calling open_file with flags=0x0 flags2=0x0 mode=0744, access_mask = 0x10000, open_access_mask = 0x10000 [2009/09/30 14:46:54, 5] smbd/files.c:file_free(471) freed files structure 12620 (315 used) [2009/09/30 14:46:54, 3] smbd/error.c:error_packet_set(61) error packet at smbd/reply.c(2643) cmd=6 (SMBunlink) NT_STATUS_OBJECT_NAME_NOT_FOUND
Can you get me a wireshark network capture of this, and also a debug level 10 log from the server please ? All codepaths on the server should be using the symlink aware versions of the POSIX calls (lstat etc.) when dealing with UNIX extensions and pathnames, so this should work. I'm trying to track down if this is a client or server bug. Thanks ! Jeremy.
Created attachment 4770 [details] Log level 10 server log of an unlink operation of a hanging symlink Symlink is pointing at a missing target, as this seems to be the simplest case - hopefully both cases have a common cause.
Created attachment 4771 [details] tcpdump of delete of hanging symlink Note this is a dump of a similar unlink, but not exactly the same one as the level 10 log documents.
Created attachment 4772 [details] Patch for 3.3.x. Can you try this and see if it fixes your problem please ? Jeremy.
(In reply to comment #4) > Created an attachment (id=4772) [details] > Patch for 3.3.x. > > Can you try this and see if it fixes your problem please ? > Jeremy. Sorry, this seemed to have no effect at all. I had a look at the code a little further, and as a wild stab in the dark changed the SMB_VFS_STAT in source/smbd/posix_acls.c:posix_get_nt_acl() to an SMB_VFS_LSTAT (which perhaps it should be?) This logged slightly different output, but still ultimately had no measurable effect. The log seems to suggest that the target file does not exist in source/smbd/close.c:close_remove_share_mode(). This makes me think that perhaps fsp->posix_open is false for the symlink, where it should be true... Does that shed any light?
I reproduced your problem on the latest git head branch (master) and the attached patch fixed it there. The "file not found" problem in close was due to the streaminfo call using stat(), not lstat() and thus returning a "ENOENT" error. I didn't have time to test this in the 3.3.x branch code, but I'll reproduce and finish the patch for that branch today. Jeremy.
Created attachment 4792 [details] git-am format patch for 3.3.x.
Created attachment 4793 [details] git-am format patch for 3.4.x.
Karolin please pull the two git patches into v3-3-test and v3-4-test. Thanks ! Jeremy.
(In reply to comment #7) > Created an attachment (id=4792) [details] > git-am format patch for 3.3.x. > Please note that the 3.3.x patch appears to be wrong. I'm testing the obvious intended change at the moment (removing the old line that the new code is intended to replace).
I'm afraid I see no change in final behaviour after applying the patch to a 3.3.8 server. That's including correcting patch 4792 to remove the old: ret = SMB_VFS_STAT(handle->conn, fname, &sbuf); Oh, actually, I now realise that the corrected patch 4792 is actually the same as patch 4772! I'll attach level 10 logs in a minute.
Created attachment 4796 [details] Log: Level 10, Samba 3.3.8, hanging symlink unlink
Comment on attachment 4796 [details] Log: Level 10, Samba 3.3.8, hanging symlink unlink Modified patch 4792, essentially patch 4772
Created attachment 4797 [details] Log: Level 10, Samba 3.3.8, target exists symlink unlink
(In reply to comment #11) > I'll attach level 10 logs in a minute. Here are the relevant inode numbers for the two logs. 3146959 lrwxrwxrwx 1 dave users 1 Sep 29 18:57 symlinktodelete -> t 20742452 -rw-rw-rw- 1 dave users 0 Oct 7 16:21 t
Yes, the 3.3.x patch is incorrect, I forgot to remove the old VFS_STAT call. I'll update this tomorrow. However I perfectly reproduced the symlink pointing to a bad link problem with the existing code in v3-3-test, and confirmed the problem was fixed with the patch replacing the VFS_STAT with VFS_LSTAT, so I'm not convinced by your claims the problem remains (at least with the dangling symlink issue). I'll re-test tomorrow with the valid symlink delete. Jeremy.
Karolin please hold off on merging any patches to 3.3 or 3.4 until I've done more testing, thanks ! Jeremy.
(In reply to comment #16) > However I perfectly reproduced the symlink pointing > to a bad link problem with the existing code in v3-3-test, and confirmed the > problem was fixed with the patch replacing the VFS_STAT with VFS_LSTAT, so I'm > not convinced by your claims the problem remains (at least with the dangling > symlink issue). Certainly odd. But I'm quite happy to accept I may have blundered in my testing :) However, I repeated the build several times, and double checked that I was running updated smbd's and so on. I kind of get the impression there may be several bugs involved though. E.g. the log suggests that the path I take goes through smbd/posix_acls.c:posix_get_nt_acl), which does indeed have an unconditional SMB_VFS_STAT operation in 3.3.8.
Probably if you're running on a system with POSIX ACLs active on the share. I've fixed this completely in v3-5-test and master by adding a new call vfs_stat_fsp() which replaces all the hand-calls to stat/lstat, but this is a bigger change than I'd wanted to make in 3.4.x or 3.3.x. I may just hand fix these paths for these releases rather than do the full back-port of the patch. If you have git access I'd appreciate it if you could confirm the problem is fixed in v3-5-test and master trees. Jeremy.
Karolin, I think this is a blocker for 3.3.9 - I should have a tested patch today. Please lower the proiority if you disagree. Jeremy.
Created attachment 4801 [details] More correct git-am patch for 3.3.9.
Created attachment 4802 [details] git-am format patch for 3.3.9. I missed one STAT -> LSTAT in the previous attachment. This one looks ok. Jeremy.
Comment on attachment 4802 [details] git-am format patch for 3.3.9. Looks ok to me. I don't see any STAT/LSTAT copy&paste errors anymore. Volker
(In reply to comment #22) > Created an attachment (id=4802) [details] > git-am format patch for 3.3.9. > > I missed one STAT -> LSTAT in the previous attachment. This one looks ok. > Jeremy. Sorry about this :) It still doesn't fix the problem... I rebuilt twice, once with patch 4802, and a second time with 4802 and I also modified the posix_get_nt_acl debug message to say *PATCHED WITH 4802*, mainly to prove to myself that I really was testing the patch properly. I'm patching 3.3.8 for my testing. I'll attach some logs shortly.
Well I'm afraid you're going to have to show me exactly where this fails, you can add more debug statements if you like. I've never been able to reproduce your problem here - this may be due to a different cifsfs client, I'm using the ones shipped with Ubuntu 8.10 and 9.04. I need to see a codepath that's still using stat() instead of lstat() for a POSIX unlink. I simply don't see one right now (and can't reproduce your problem at all). Jeremy.
Actually, I miss-spoke in my last update. I was able to reproduce your problem before I fixed the STAT/LSTAT bug in the vfswrap_streaminfo() function in modules/vfs_default.c. But since I fixed that I haven't been able to see it. Jeremy.
(In reply to comment #25) > Well I'm afraid you're going to have to show me exactly where this fails, you > can add more debug statements if you like. I've discovered a related bug. My testing has been to restart the server, but not remount - since my home directory is on CIFS. It appears there's a bug in how the mount renegotiates. I guess the same bug might also turn up if network conditions cause the mount to reattach. I applied patch 4802, and added debug messages before the STAT calls in posix_get_nt_acl. If the mount happened after the most recent server restart I see: [2009/10/08 15:34:10, 10] smbd/posix_acls.c:posix_get_nt_acl(3185) posix_get_nt_acl: *PATCHED WITH 4802* called for file testdir/symlinktodelete [2009/10/08 15:34:10, 10] smbd/posix_acls.c:posix_get_nt_acl(3189) posix_get_nt_acl: Definitely LSTAT file testdir/symlinktodelete BUT, if I restart the server I see: [2009/10/08 15:38:29, 10] smbd/posix_acls.c:posix_get_nt_acl(3185) posix_get_nt_acl: *PATCHED WITH 4802* called for file testdir/symlinktodelete [2009/10/08 15:38:29, 10] smbd/posix_acls.c:posix_get_nt_acl(3192) posix_get_nt_acl: Definitely STAT file testdir/symlinktodelete So, what seems to be happening is that restarting the server causes the negotiated connection to at least partially forget it should be using POSIX pathname rules. That means of course that most of your patch is disabled once I restart the server. The bad news is that even when it's logging that it's using LSTAT - and thus your patch is functional, I STILL can't unlink the symlink :) :) :) I'm still looking at that...
Ok, that would definitely cause the problem - Windows clients are not aware of symlinks and so for them we must always follow the link (STAT vs LSTAT). If the client isn't re-connecting with POSIX then that's a separate bug you should log here under the client cifsfs code. Still don't see how the LSTAT case can be failing the delete, but I'll await more info from you. Jeremy.
Created attachment 4811 [details] Force POSIX semantics on unlink Tested on Samba 3.3.8 + patch 4802. Not tested without patch 4802.
(In reply to comment #29) > Created an attachment (id=4811) [details] > Force POSIX semantics on unlink > > Tested on Samba 3.3.8 + patch 4802. Not tested without patch 4802. > Oh, that wasn't all I wanted to say :) It appears in do_unlink() that POSIX semantics are always desired, since it unconditionally does an SMB_VFS_LSTAT at the start. Requesting POSIX semantics on the unlink arranges for posix_open to be set when close_remove_share_mode() gets called, which means that SMB_VFS_LSTAT will get called instead of SMB_VFS_STAT. I'm totally mystified about how this works for you though, it just seems like it should never ever work. I'm not sure whether FILE_FLAG_POSIX_SEMANTICS should be conditional on lp_posix_pathnames(), but given the existing LSTAT in do_unlink() it seems to me like it should always be on. That said, posix_open seems to be overloaded with other meanings that might make this change problematic. But it does fix my issue.
I still would still like to see the debug level 10 log for this. Unconditionally setting FILE_FLAG_POSIX_SEMANTICS is incorrect here, it should be dependent on lp_posix_pathnames(). The LSTAT in do_unlink was done for cases where there is a dangling symlink and Windows clients want to delete it (which can happen). I'm not sure the logic there is still correct, but I'll test that case to make sure it's still covered. I think that part of your patch is correct, but conditional on posix pathname semantics being selected. I still want the debug log please :-). I'll provide updated patches shortly. Jeremy.
Ok, this is very interesting. You must be using an earlier cifsfs client that still uses the Windows SMBunlink call to delete a file. More modern cifsfs clients use the trans2:posix_unlink level to delete a file, which does not go through the do_unlink() code in smbd/reply.c but handles the POSIX semantics directly. This is why I can't reproduce your the problem. Can you send me a wireshark packet capture trace to confirm ? However, we still need to cope with older cifsfs clients using SMBunlink as there are still a lot of them out there, so I've got a modified variant of the git-am patches that use the FILE_FLAG_POSIX_SEMANTICS on the open for delete in the SMBunlink path. I've also confirmed that with 3.3.9, Windows clients can no longer see a dangling symlink (as it's filtered out on directory listing as the file cannot be correctly VFS_STAT'ed). Updated patches for 3.3.9 and 3.4.3 to follow. Jeremy.
(In reply to comment #32) > Ok, this is very interesting. You must be using an earlier cifsfs client that > still uses the Windows SMBunlink call to delete a file. Is that determined by Samba bits or kernel bits? The kernel is a bit dated, the Samba client installed is unmodified 3.3.8 I believe. > This is why I can't reproduce your the problem. Can you send me a > wireshark packet capture trace to confirm ? I think attachment 4771 [details] should have what you want?
This is completely dependent on kernel bits, not Samba. The version of the Samba client libs is irrelevent. I'll look at that trace. Jeremy.
Yes, as I suspected it's using the non-POSIX SMBunlink call. I have a fix that should work for this. Patches for 3.3.9 and 3.4.3 to follow. Jeremy.
Created attachment 4822 [details] git-am patch for 3.3.9 Fixes the earlier problems plus the issue with old CIFSFS clients using SMBunlink instead of trans2:posix_unlink. Jeremy.
Created attachment 4823 [details] git-am patch for 3.3.9. Final fix for 3.3.9 - removed unneeded changes in rmdir_internals(). Jeremy.
Created attachment 4824 [details] git-am patch for 3.4.3. Matches the 3.3.9 patch. Jeremy.
Ok David, if you can confirm the patch in attachment #37 [details]: https://bugzilla.samba.org/attachment.cgi?id=4823&action=view Then I'll re-assign to Karolin for inclusion in 3.3.9. If you could check this asap I'd appreciate it so we can get it into the tree for release testing. Thanks ! Jeremy.
(In reply to comment #39) > Ok David, if you can confirm the patch in: > > https://bugzilla.samba.org/attachment.cgi?id=4823&action=view > > Then I'll re-assign to Karolin for inclusion in 3.3.9. If you could check this > asap I'd appreciate it so we can get it into the tree for release testing. That's a success :) Tested on samba server 3.3.8 + patch 4823 and no other changes. As discussed before, I need to remount after restarting the server, but this is an unrelated client side problem - no doubt a side effect of the vintage of the kernel. Thanks :)
W00t! Success. Re-assigning to Karolin for inclusion in 3.3.9 and 3.4.3. Thanks. Jeremy.
Pushed to v3-4-test and v3-3-test. Will be included in the next bug fix releases. Closing out bug report. Thanks!
Created attachment 4862 [details] Further fix for 3.4.3. Here is a subsequent patch for 3.4.3, which fixes the symlink calls in all the modules/vfs*.c code. I missed this in the initial patch, was caught by the build farm tests for posix unlink. Most of these modules are not often used, the only one that might be an issue is the xattr_tdb store module (this was the one the buildfarm code used ad why I detected the problem). As we're after the freeze date for 3.4.3 I'm going to ask someone else to review. Jeremy.
Re-opened bug to get review for patch in attachment #4862 [details] https://bugzilla.samba.org/attachment.cgi?id=4862&action=view Jeremy.
Look good, thanks Volker
Pushed patch to v3-4-test. (There are some blockers left for 3.4.3, so I think we will delay it.) Closing out bug report. Thanks!