We've seen some sporadic reports of busy inodes after unmounting cifs filesystems. This bug is to track progress on identifying and fixing this. I've personally seen it frequently when running fsstress from the LTP suite on cifs filesystems against samba.

Suresh mentioned that he was able to reproduce it by doing this:

* mount the Samba-exported cifs share (unix extensions enabled)
* Run on the server, on the share:
    while true; do touch 1 2; echo 1 > 1; echo 2 > 2; rm 1 2; done
* Run on the client, on the cifs mount point:
    while true; do touch 1 2; echo 1 > 1; echo 2 > 2; rm 1 2; done
* Stop the script on the client
* Stop the script on the server
* cd to a dir other than the mountpoint on the client
* umount the mountpoint on the client
* The "VFS: Busy inode" error appears in dmesg/syslog

...so far, I've *not* been able to reproduce it when using '-o noserverino' with the patches that Steve committed yesterday. Prior to those, I'm fairly certain that noserverino didn't work correctly, so the problem was still present.

I've also noticed that I occasionally see server inode numbers being disabled automatically in the cifs_iget code, even though the share I'm using has no submounts. That probably indicates that reuse of inode numbers may be a factor in this.

Finally, I've noticed that when running fsstress on a cifs filesystem with unix extensions enabled, the size of the cifs_inode_cache seems to grow without bound, whereas if I disable unix extensions it eventually shrinks due to memory pressure. That may indicate an inode refcount leak somewhere.
Time to poke around in memory. Got the machine into a "busy inodes after umount" state and ran crash on the live kernel:

crash> kmem -S cifs_inode_cache
CACHE            NAME               OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff88003c3f3000 cifs_inode_cache      1264          3      9      3     4k
SLAB              MEMORY            TOTAL  ALLOCATED  FREE
ffff880006dda080  ffff880006dda0c0      3          1     2
FREE / [ALLOCATED]
   ffff880006dda0c0
  [ffff880006dda5b0]
   ffff880006ddaaa0
SLAB              MEMORY            TOTAL  ALLOCATED  FREE
ffff880030e6a000  ffff880030e6a040      3          1     2
FREE / [ALLOCATED]
   ffff880030e6a040
   ffff880030e6a530
  [ffff880030e6aa20]
SLAB              MEMORY            TOTAL  ALLOCATED  FREE
ffff880030f8c000  ffff880030f8c040      3          1     2
FREE / [ALLOCATED]
  [ffff880030f8c040]
   ffff880030f8c530
   ffff880030f8ca20

...addresses in brackets are still allocated. This kernel is using the slab allocator, so we have to add 8 bytes to each address to get past the metadata on each object:

crash> struct cifsInodeInfo ffff880030f8c048

Some interesting snippets. This is a regular file:

  i_mode = 0x81a4, (100644)

...the dentry list is empty:

  i_dentry = {
    next = 0xffff880030f8c0c0,
    prev = 0xffff880030f8c0c0
  },

...openFileList is not empty:

  openFileList = {
    next = 0xffff88003c714770,
    prev = 0xffff88003c714770
  },

...looking at the cifsFileInfo:

crash> struct cifsFileInfo -l cifsFileInfo.flist 0xffff88003c714770

...count is non-zero:

  count = {
    counter = 0x1
  },

...so it seems somewhat plausible that the problem is a cifsFileInfo refcount leak.
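For reference while reading the count above, this is roughly the refcounting pair from fs/cifs/cifsglob.h in this era (abridged; see the tree for the real thing). If a get is never matched by a put, the count stays above zero, the final iput never happens, and the inode stays busy past umount:

	static inline void cifsFileInfo_get(struct cifsFileInfo *cifs_file)
	{
		atomic_inc(&cifs_file->count);
	}

	static inline void cifsFileInfo_put(struct cifsFileInfo *cifs_file)
	{
		/* the last put frees the struct and drops the inode
		   reference; a leaked get here pins the inode */
		if (atomic_dec_and_test(&cifs_file->count)) {
			iput(cifs_file->pInode);
			kfree(cifs_file);
		}
	}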
The 0x6c indicates an increment of some bit field in a stale structure still being referenced by cifs despite having freed it; somebody else then allocated and freed it. This is with the git tree I pulled a couple of hours ago from Steve's branch.

- No unix extensions, noserverino mount option
- No errors
---------------------------------------------------------
- unix extensions, noserverino mount option
- errors

CIFS VFS: Send error in Close = -2
CIFS VFS: Send error in Close = -2

- After unmount

VFS: Busy inodes after unmount of cifs. Self-destruct in 5 seconds.  Have a nice day...
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 3
Modules linked in: cifs nls_utf8 ipv6 microcode fuse loop dm_mod rtc_cmos ibmpex iTCO_wdt ibmaem ioatdma iTCO_vendor_support tpm_tis i2c_i801 bnx2 rtc_core sr_mod tpm ipmi_msghandler i5k_amb tpm_bios dca button rtc_lib i2c_core cdrom ses pcspkr shpchp enclosure sg i5000_edac pci_hotplug joydev edac_core usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic piix ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon aacraid scsi_mod [last unloaded: cifs]
Pid: 19, comm: bdi-default Not tainted 2.6.34-rc7-1-default-1 #1 System Planar/IBM System x3550 -[7978AC1]-
RIP: 0010:[<ffffffffa0342822>]  [<ffffffffa0342822>] cifs_writepages+0x49/0x7d9 [cifs]
RSP: 0018:ffff88013a9fdae0  EFLAGS: 00010282
RAX: 6b6b6b6b6b6b6b6b RBX: ffff880137b49af0 RCX: ffff88013583a5b0
RDX: ffff880137b49aa8 RSI: ffff88013a9fdd00 RDI: ffff880137b49c08
RBP: ffff88013a9fdc30 R08: 0000000000000000 R09: ffff88013a9fdcb0
R10: ffffffff81743c50 R11: ffffffff81743c60 R12: ffffffff81743b20
R13: ffff88013a9fdd00 R14: ffff880136fbb650 R15: ffff880137b49c08
FS:  0000000000000000(0000) GS:ffff880006180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fae3264a4f8 CR3: 000000013972c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bdi-default (pid: 19, threadinfo ffff88013a9fc000, task ffff88013a9fa580)
Stack:
 0000000000000000 0000000000000000 0000000000000200 0000000000000400
<0> ffff88013a9fdd00 ffff880137b49c08 0000000200000001 ffff880006194ac0
<0> 6b6b6b6b6b6b6b6b ffff88013fc23680 ffff88013a9fdb60 ffffffff8103ef51
Call Trace:
 [<ffffffff8103ef51>] ? finish_task_switch+0x34/0x95
 [<ffffffff810be76b>] do_writepages+0x1c/0x25
 [<ffffffff81111188>] writeback_single_inode+0xdd/0x2d2
 [<ffffffff81111e8d>] writeback_inodes_wb+0x417/0x51d
 [<ffffffff811120d3>] wb_writeback+0x140/0x1bb
 [<ffffffff8111234b>] wb_do_writeback+0x145/0x15c
 [<ffffffff810cc663>] bdi_forker_task+0x89/0x2e3
 [<ffffffff810cc5da>] ? bdi_forker_task+0x0/0x2e3
 [<ffffffff8105f3ae>] kthread+0x7d/0x85
 [<ffffffff81003b24>] kernel_thread_helper+0x4/0x10
 [<ffffffff8105f331>] ? kthread+0x0/0x85
 [<ffffffff81003b20>] ? kernel_thread_helper+0x0/0x10
Code: d0 fe ff ff 48 8b 07 4c 8b 67 68 48 8d 50 b8 48 89 95 18 ff ff ff 48 8b 80 00 01 00 00 48 8b 80 88 02 00 00 48 89 85 f0 fe ff ff <81> 78 24 ff 0f 00 00 76 3e 48 8b 8d f0 fe ff ff 48 8b 01 48 8b
RIP  [<ffffffffa0342822>] cifs_writepages+0x49/0x7d9 [cifs]
 RSP <ffff88013a9fdae0>
---[ end trace 5634406e87f60c44 ]---

Slab corruption: size-1024 start=ffff880136fbb650, len=1024
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [<ffffffff8126bda0>](__sk_free+0xd3/0xe4)
060: 6b 6b 6b 6b 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b
0a0: 6b 6b 6b 6b 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b
Prev obj: start=ffff880136fbb238, len=1024
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [<ffffffff81213eab>](tty_buffer_request_room+0xd2/0x155)
000: 58 00 3e 37 01 88 ff ff 60 b2 fb 36 01 88 ff ff
010: 60 b3 fb 36 01 88 ff ff fc 00 00 00 00 01 00 00
Next obj: start=ffff880136fbba68, len=1024
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [<ffffffff810fcead>](alloc_pipe_info+0x22/0xa0)
000: 07 00 07 00 00 00 00 00 10 3d 70 39 01 88 ff ff
010: 10 3d 70 39 01 88 ff ff 00 00 00 00 02 00 00 00

- After rmmod

slab error in kmem_cache_destroy(): cache `cifs_inode_cache': Can't free all objects
Pid: 4790, comm: rmmod Tainted: G      D    2.6.34-rc7-1-default-1 #1
Call Trace:
 [<ffffffff810e907e>] __slab_error+0x24/0x26
 [<ffffffff810eae68>] kmem_cache_destroy+0x96/0xee
 [<ffffffffa0353b94>] exit_cifs+0x5c/0x7b [cifs]
 [<ffffffff81074f22>] sys_delete_module+0x1d5/0x249
 [<ffffffff810d5b3b>] ? do_munmap+0x30a/0x32e
 [<ffffffff81002d02>] system_call_fastpath+0x16/0x1b
Some more notes about this core. The "busy" cifsInodeInfo structs are:

ffff880006dda5b8
ffff880030e6aa28
ffff880030f8c048

In all 3 cases, the openFileList is non-empty, but the pfile pointer is NULL:

  pfile = 0x0,

...which suggests that something happened and cifs_fill_filedata() never found this struct and set that pointer. Could this be a problem such that the create occurs, but something happens and the VFS never calls down into the ->open op?
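To illustrate the path I mean, here's a simplified sketch of the kind of matching cifs_fill_filedata() does at open time. This is not the verbatim fs/cifs/file.c code, and the matching criteria here (unattached entry from the same task) are an approximation:

	/* sketch: find the fileinfo that lookup/create left on the
	   inode's openFileList and attach it to the struct file */
	static struct cifsFileInfo *fill_filedata_sketch(struct file *file)
	{
		struct cifsInodeInfo *cinode =
			CIFS_I(file->f_path.dentry->d_inode);
		struct cifsFileInfo *cfile;

		read_lock(&GlobalSMBSeslock);
		list_for_each_entry(cfile, &cinode->openFileList, flist) {
			/* assumed criteria: not yet attached, same task */
			if (cfile->pfile == NULL &&
			    cfile->pid == current->tgid) {
				cfile->pfile = file;
				file->private_data = cfile;
				read_unlock(&GlobalSMBSeslock);
				return cfile;
			}
		}
		read_unlock(&GlobalSMBSeslock);
		return NULL;	/* not found: pfile stays NULL, ref leaks */
	}

If the VFS never calls ->open (or the match never happens), nothing ever takes ownership of the list entry, which fits what we're seeing in this core.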
The writepages crash that Shirish saw is probably just "normal" splatter that you see with a busy-inodes-after-umount problem. This info is also typical. We have "busy" inode objects in this cache still, so the kernel can't deallocate the cache:

slab error in kmem_cache_destroy(): cache `cifs_inode_cache': Can't free all objects
Pid: 4790, comm: rmmod Tainted: G      D    2.6.34-rc7-1-default-1 #1
Call Trace:
 [<ffffffff810e907e>] __slab_error+0x24/0x26
 [<ffffffff810eae68>] kmem_cache_destroy+0x96/0xee
 [<ffffffffa0353b94>] exit_cifs+0x5c/0x7b [cifs]
 [<ffffffff81074f22>] sys_delete_module+0x1d5/0x249
 [<ffffffff810d5b3b>] ? do_munmap+0x30a/0x32e
 [<ffffffff81002d02>] system_call_fastpath+0x16/0x1b

...I'm leaning toward the problem being that cifs_fill_filedata isn't getting called on the cifsFileInfo for some reason. I'll probably try to spin up a test patch in a bit that adds a flag or something so we can verify whether that is the problem.
All of the problems I've seen so far are due to files being placed on the openFileList for an inode (and refcounts taken, of course) and then those files never actually being found on the list. Curiously, that's exactly the theoretical situation we were discussing on list a week or two ago:

http://lists.samba.org/archive/linux-cifs-client/2010-May/006007.html

Looks like there may be some related problems here as well. Even if the open file is found on this list, if cifs_open returns an error it's not clear to me where the inode reference gets put. I don't think the ->release op will get called in that case.
The other difference from older code is that cifsFileInfo now holds a reference to the inode (in order to ensure that it doesn't disappear if the filp is put before an oplock break is completed). Prior to that, it's quite plausible that this code was broken and these structs were being leaked; it's just silently b0rked since that didn't cause an inode refcount leak.

I think it's pretty much imperative that we fix this the right way. The creates should be using lookup_instantiate_filp properly instead of depending on cifs_open to take care of them.
Wonder if ENOENT errors on close are contributing to this problem, i.e. the openFileList not being cleaned up!

CIFS VFS: Send error in Close = -2
(In reply to comment #7)
> Wonder if ENOENT errors on close are contributing to this problem,
> i.e. the openFileList not being cleaned up!
>
> CIFS VFS: Send error in Close = -2

I don't think so. The list is cleaned up regardless of what that call returns. I'm pretty sure I understand the problem at this point (or at least _a_ problem; no guarantee that there aren't more).
Reassigning this to Shirish for now, since this is somewhat related to the create-on-lookup work. Reassign it back to me or to Suresh if you don't feel you'll be able to tackle it in the near future.

The right solution here will probably be to ditch the code that makes cifs_open look for open files via the openFileList. Instead, whatever function creates the file (cifs_lookup or cifs_create) should use lookup_instantiate_filp to pass an instantiated file pointer back to the caller in the nameidata. That way, everything should get properly cleaned up in an error situation. That will probably mean a substantial overhaul of cifs_create, cifs_lookup and cifs_open (and some of their associated functions too).
I see cifs_lookup calling lookup_instantiate_filp:

	if (posix_open)
		filp = lookup_instantiate_filp(nd, direntry, NULL);

Need to go through this code. Continuing the investigation and corresponding code change in a short while.
(In reply to comment #10)
> I see cifs_lookup calling lookup_instantiate_filp:
>
> 	if (posix_open)
> 		filp = lookup_instantiate_filp(nd, direntry, NULL);
>
> Need to go through this code. Continuing the investigation and
> corresponding code change in a short while.

Right, but only when create on lookup actually occurs. That's a very limited set of operations, and in some cases this code punts (like the O_EXCL case) and expects cifs_create to handle it.

The big part that makes this so difficult is that the functions and code surrounding all of this are such a mess. What would be best is a clean separation of code (sketched below):

1) functions that actually do the on-the-wire operations (the actual create/open operations)

2) functions that do the VFS ops to hook the results of that open and create to files/inodes/dentries.

...IOW, increasing the complexity of these functions with more special cases, etc. is the worst possible approach here.
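Something like this, interface-wise. These are hypothetical names (nothing like this exists in the tree yet), just to show the shape of the split:

	/* 1) wire level: do the SMB open/create; touches no VFS objects */
	int cifs_wire_open(struct cifsTconInfo *tcon, const char *full_path,
			   int disposition, int oflags, __u16 *pnetfid,
			   int xid);

	/* 2) VFS level: called from ->lookup/->create with an open intent;
	      hooks the wire result up to a filp via lookup_instantiate_filp
	      and is the one place that has to clean up on error */
	struct file *cifs_instantiate_open(struct nameidata *nd,
					   struct dentry *direntry,
					   __u16 netfid);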
Basically, as I see it, we have to consider:

- open, create, exclusive create

- lookup finds an inode on the client

- lookup does not find an inode on the client
  - lookup intent or not, unix extension support, posix open supported
  - lookup intent or not, unix extension support, posix open not supported
  - lookup intent or not, unix extensions not supported

As I see it, cifs_open() will get called only when cifs_lookup() finds an inode for the dentry on the client. For the rest of the cases, either cifs_lookup(), or cifs_lookup() and cifs_create(), should take care of creating an inode, creating a file pointer, creating a cifs private info structure, associating the inode with the dentry, associating the file pointer with the inode (?) and/or dentry (?), associating the cifs private data structure with the file pointer, and querying (from the server) and filling the inode and file structures.
(In reply to comment #12)
> Basically, as I see it, we have to consider:
>
> - open, create, exclusive create

I'm not 100% convinced we need to treat create and exclusive create that much differently. We do now, but that was basically a hack to work around other problems, right?

> - lookup finds an inode on the client
>
> - lookup does not find an inode on the client
>   - lookup intent or not, unix extension support, posix open supported
>   - lookup intent or not, unix extension support, posix open not supported
>   - lookup intent or not, unix extensions not supported

Yep, it would be good to encapsulate all of that away into a separate function or set of functions that the lookup and create ops could call.

> As I see it, cifs_open() will get called only when cifs_lookup() finds an
> inode for the dentry on the client.

That's the goal, I think. It's not like that today, where it also gets called if cifs_lookup doesn't find an inode and cifs_create makes one without using lookup_instantiate_filp.

> For the rest of the cases, either cifs_lookup(), or cifs_lookup() and
> cifs_create(), should take care of creating an inode, creating a file
> pointer, creating a cifs private info structure, associating the inode
> with the dentry, associating the file pointer with the inode (?) and/or
> dentry (?), associating the cifs private data structure with the file
> pointer, and querying (from the server) and filling the inode and file
> structures.

Right. The idea here is that since NTCreateAndX is basically an open call, you'll end up with the file remaining open after you create it. At that point you want to hook it up to a filp, and lookup_instantiate_filp is how that happens. When you hook it up to a filp in the lookup or create ops, the open op is bypassed by the VFS.

But there are some gotchas here. We need to consider what happens if a lookup doesn't occur because there's already a dentry in cache. You may need to fix the d_revalidate op to always return 0 if the nameidata implies a create/open (see the sketch below). It's probably worthwhile to study how the nfs client code handles this to get a feel for the model you're shooting for.
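Something along these lines for d_revalidate. This is an untested sketch against the current nameidata API, not a patch:

	static int cifs_d_revalidate_sketch(struct dentry *direntry,
					    struct nameidata *nd)
	{
		/* force a fresh ->lookup on open/create intent so that
		   cifs_lookup gets the chance to instantiate the filp */
		if (nd && (nd->flags & (LOOKUP_OPEN | LOOKUP_CREATE)))
			return 0;

		/* otherwise, trust a positive dentry */
		return direntry->d_inode != NULL;
	}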
(In reply to comment #1)
> crash> struct cifsFileInfo -l cifsFileInfo.flist 0xffff88003c714770
>
> ...count is non-zero:
>
>   count = {
>     counter = 0x1
>   },
>
> ...so it seems somewhat plausible that the problem is a cifsFileInfo
> refcount leak.

A cifsFileInfo refcount leak means there is no matching close for an open. Otherwise the count would be zero, the iput on that inode would happen, and so no busy inodes. If so, I'm wondering when there can be a mismatch between an open and a close. Even if there is an error between cifs_lookup (or cifs_create) and cifs_open (at the VFS level), when can cifs_close not get called? Because even if cifs_open does not get called and the file pointer's private data is not assigned to the cifsFileInfo structure, the cifsFileInfo structure is still on the openFileList.
> Even if there is an error between cifs_lookup (or cifs_create) and cifs_open
> (at the VFS level), when can cifs_close not get called?

cifs_close is the release file operation. If there is no struct file, then how will it ever be called?
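To make that concrete: cifs_close is only wired up as the ->release op (abridged from the era's fs/cifs/cifsfs.c), and ->release only runs on the final fput() of a struct file:

	const struct file_operations cifs_file_ops = {
		.read    = do_sync_read,
		.write   = do_sync_write,
		.open    = cifs_open,
		.release = cifs_close,	/* never runs without a struct file */
		/* ... */
	};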
Jeff, that is right.

I am going to see what the openFileList looks like when we are unmounting for that tcon and, if there are any entries, issue close commands for those open files and do the necessary associated bookkeeping before freeing the tcon, keeping in mind the bigger picture (as roughly outlined in comments #12 and #13).
(In reply to comment #16)
> Jeff, that is right.
>
> I am going to see what the openFileList looks like when we are unmounting
> for that tcon and, if there are any entries, issue close commands for
> those open files and do the necessary associated bookkeeping before
> freeing the tcon, keeping in mind the bigger picture (as roughly outlined
> in comments #12 and #13).

NAK. That is a hack, plain and simple, to paper over the bug. It won't be that difficult to fix this the right way.
Reassigning this back to myself after discussion with Shirish on IRC. I think I have a better idea of what needs to happen here.
RFC patchset sent to linux-cifs-client and linux-fsdevel:

Subject: [linux-cifs-client] [PATCH 0/4] cifs: fix "Busy inodes after umount" issues (RFC)

...it seems to fix the issue for me, but I could use some help testing it.
Perhaps an important observation: I'm able to reproduce this error only when I run the reproducer as the root user on the server and as a normal user on the client. I could not reproduce it with the same user on both the server and the client with Jeff's recent patches. So it might have something to do with the "permission denied" errors (from touch and echo) on the client.
(In reply to comment #1)
> Time to poke around in memory. Got the machine into a "busy inodes after
> umount" state and ran crash on the live kernel:
>
> crash> struct cifsInodeInfo ffff880030f8c048
>
> Some interesting snippets. This is a regular file:
>
>   i_mode = 0x81a4, (100644)
>
> ...the dentry list is empty:
>
>   i_dentry = {
>     next = 0xffff880030f8c0c0,
>     prev = 0xffff880030f8c0c0
>   },
>
> ...openFileList is not empty:
>
>   openFileList = {
>     next = 0xffff88003c714770,
>     prev = 0xffff88003c714770
>   },

The openFileList is empty here, right? Did you mean it is not empty in other cifsInodeInfo structs, or am I missing something? I'm always seeing the openFileList empty.
With the recent patches, I see inode->i_state value 131 (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_CLEAR) which should not happen, I think. Any ideas when it could be set like this?
No, the list was not empty. There was a single entry on it, which is why the next and prev pointers look the same:

$ echo $(( 0xffff88003c714770 - 0xffff880030f8c048 ))
192448296

...which is much larger than a cifsInodeInfo struct.
(In reply to comment #22)
> With the recent patches, I see inode->i_state value 131 (I_DIRTY_SYNC |
> I_DIRTY_DATASYNC | I_CLEAR) which should not happen, I think. Any ideas
> when it could be set like this?

Looks like it was redirtied after clear_inode was called on it. I'm not sure how that could happen; the VFS should only call clear_inode on something with no more references. The only place that I know of where cifs takes an inode reference is in cifs_new_fileinfo, and that is done via igrab, which should return a NULL pointer...
(In reply to comment #24)
> The only place that I know of where cifs takes an inode reference is in
> cifs_new_fileinfo, and that is done via igrab, which should return a NULL
> pointer...

...that is, when I_FREEING or I_CLEAR is set.
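For reference, the generic igrab() from this era looks roughly like this (abridged from fs/inode.c):

	struct inode *igrab(struct inode *inode)
	{
		spin_lock(&inode_lock);
		if (!(inode->i_state & (I_FREEING | I_CLEAR | I_WILL_FREE)))
			__iget(inode);
		else
			/* inode is being torn down; refuse a new ref */
			inode = NULL;
		spin_unlock(&inode_lock);
		return inode;
	}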
I just tried reverting the change that added an inode reference in cifs_new_fileinfo, just to get more information. It looks like the change exposed bugs that would otherwise have gone undetected. We seem to be missing a call to cifsFileInfo_put when a race condition occurs while running the reproducer. I was not able to reproduce the VFS busy inode errors during my initial reproduction attempts. I'm definitely not saying we should not be adding the inode ref, but we got a clue about where things might be going wrong.

Index: linux-2.6/fs/cifs/cifsglob.h
===================================================================
--- linux-2.6.orig/fs/cifs/cifsglob.h
+++ linux-2.6/fs/cifs/cifsglob.h
@@ -376,7 +376,6 @@ static inline void cifsFileInfo_get(stru
 static inline void cifsFileInfo_put(struct cifsFileInfo *cifs_file)
 {
 	if (atomic_dec_and_test(&cifs_file->count)) {
-		iput(cifs_file->pInode);
 		kfree(cifs_file);
 	}
 }
Index: linux-2.6/fs/cifs/dir.c
===================================================================
--- linux-2.6.orig/fs/cifs/dir.c
+++ linux-2.6/fs/cifs/dir.c
@@ -154,7 +154,7 @@ cifs_new_fileinfo(struct inode *newinode
 	pCifsFile->netfid = fileHandle;
 	pCifsFile->pid = current->tgid;
-	pCifsFile->pInode = igrab(newinode);
+	pCifsFile->pInode = newinode;
 	pCifsFile->mnt = mnt;
 	pCifsFile->pfile = file;
 	pCifsFile->invalidHandle = false;
Created attachment 5746 [details]
possible patch 1

I was finally able to reproduce this with Suresh's reproducer. Unfortunately, crash and "virsh dump" aren't behaving for me, so I wasn't able to poke at the busy inodes and see what was wrong.

I did notice this though: if cifs_open_inode_helper returns an error, I think the cifsFileInfo is currently leaked. This patch isn't even compile tested, but I think it's a bug that probably ought to be fixed. Suresh, it might be good to test this out when you get a chance.
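The gist of the patch is a reordering in cifs_open(), roughly like this (simplified, arguments abridged; see the attachment for the real thing):

	/* do the fallible open-time inode work *before* allocating the
	   fileinfo, so the error path has no cifsFileInfo (and no inode
	   reference) to leak */
	rc = cifs_open_inode_helper(inode, tcon, &oplock, buf,
				    full_path, xid);
	if (rc != 0)
		goto out;	/* nothing allocated yet, nothing to put */

	pCifsFile = cifs_new_fileinfo(inode, netfid, file,
				      file->f_path.mnt, oflags);
	if (pCifsFile == NULL) {
		rc = -ENOMEM;
		goto out;
	}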
Just tested the patch. It compiles fine and seems to work. I let Suresh's reproducer run for quite a while and didn't see busy inodes after umount. Suresh, please test that patch when you have time and see whether it's still reproducible.
Yay! That change (moving cifs_open_inode_helper before cifs_new_fileinfo and checking for the error) seems to have done the trick. I can no longer reproduce the "VFS busy inode" errors with my reproducer. I don't see allocated-but-not-released structures (via crash) after running the tests either. (BTW, my tests were on top of Jeff's tree.)

I think this patch is good to go, except for a minor nit: the Subject talks about only the cleanup, but it seems to have fixed a real leak that occurs when the stars align right (or wrong). Please add my 'Reviewed-and-Tested-by:' tag to this one.
Patches for this are now in mainline.