Bug 7792 - cifsd stays running after umount
cifsd stays running after umount
Status: RESOLVED FIXED
Product: CifsVFS
Classification: Unclassified
Component: kernel fs
2.6
Other Linux
: P3 normal
: ---
Assigned To: Jeff Layton
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-11-11 07:00 UTC by Jeff Layton
Modified: 2010-11-12 08:21 UTC (History)
4 users (show)

See Also:


Attachments
patch -- make sure cifs_get_tlink is done under spinlock (1.06 KB, patch)
2010-11-11 07:40 UTC, Jeff Layton
no flags Details
patch -- instrument cifs_get_tlink (505 bytes, patch)
2010-11-11 08:28 UTC, Jeff Layton
no flags Details
cifs FYI with Jeff's instrumentation patch (235.83 KB, text/plain)
2010-11-11 08:45 UTC, Suresh Jayaraman (mail address is dead)
no flags Details
Debug message (31.95 KB, text/plain)
2010-11-11 11:37 UTC, Suresh Jayaraman (mail address is dead)
no flags Details
patch - don't take extra tlink reference in cifs_initiate_search (2.28 KB, patch)
2010-11-11 13:32 UTC, Jeff Layton
no flags Details
patch - don't take extra tlink reference in cifs_initiate_search (try #2) (2.28 KB, patch)
2010-11-11 14:53 UTC, Jeff Layton
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jeff Layton 2010-11-11 07:00:09 UTC
Thus spake Shirish:

I think there is some kind of (I think refcount) leak.
Even after running basic connectathon test (-b -f) without
cifsacl mount option, after unmounting all the shares,
cannot unload cifs module.

[...]

After test4 of basic [connectathon] tests, cifs module does not unload, it is fine till test3, I can unmount and unload but not after test4

./test4 $TESTARG
+ ./test4 -f
./test4: setattr, getattr, and lookup
        20 chmods and stats on 10 files
        ./test4 ok.
+ exit 0

So even if the test passes, something leaks for cifs.ko to not unload.

[...]

Suresh also said he could reproduce this at will with test4, and when he tried reducing the number and count on the test the problem went away.
Comment 1 Jeff Layton 2010-11-11 07:40:24 UTC
Created attachment 6058 [details]
patch -- make sure cifs_get_tlink is done under spinlock

This is sort of a stab in the dark, but we don't generally hold the spinlock when we do this particular get. It might be interesting to test this patch and see whether it helps.

If it does, we'll still want to do some investigation to make sure we understand the race involved here.
Comment 2 Suresh Jayaraman (mail address is dead) 2010-11-11 08:10:11 UTC
I added a few printks in cifs_umount and found that after the test4, the tlink->tl_count =1 just before I leave cifs_umount which shows a problem here.
Comment 3 Jeff Layton 2010-11-11 08:28:37 UTC
Created attachment 6059 [details]
patch -- instrument cifs_get_tlink

Ok, so it seems we have too many "gets" and not enough "puts".

This patch might help shed some light on it. It pops a printk and does a dump_stack when the tl_count goes above 1.

I see this pop normally during the initial stages of test4 when files are being created. This is expected. I think what we want to know is whether this ever pops during the chmod/stat calls.
Comment 4 Suresh Jayaraman (mail address is dead) 2010-11-11 08:45:45 UTC
Created attachment 6060 [details]
cifs FYI with Jeff's instrumentation patch
Comment 5 Jeff Layton 2010-11-11 10:08:29 UTC
One thing I notice is that test4 in this case is forking off a rm process to clean up the old directory. I wonder whether the problem is more closely related to the recursive rm command rather than test4 specifically.

Suresh did a little testing and hinted at that on IRC, but it wasn't 100% clear to me what his results were. Suresh, could you elaborate?

I'm still unable to reproduce this at will so far...
Comment 6 Suresh Jayaraman (mail address is dead) 2010-11-11 10:12:42 UTC
I ran a simple test script as suggested by Jeff and I was able to reproduce the problem.

NFSTESTDIR is set /mnt/cifs/newtests (and the directory was existing before the test was run).

#!/bin/sh

rm -rf $NFSTESTDIR
mkdir $NFSTESTDIR

for i in `seq 0 9`
do
echo  '' > $NFSTESTDIR/file.$i
done
Comment 7 Suresh Jayaraman (mail address is dead) 2010-11-11 10:21:25 UTC
To narrow down the test further, I'm able to hit it just by doing

rm -rf $NFSTESTDIR
mkdir $NFSTESTDIR
Comment 8 Suresh Jayaraman (mail address is dead) 2010-11-11 11:37:52 UTC
Created attachment 6061 [details]
Debug message
Comment 9 Shirish S. Pargaonkar 2010-11-11 11:56:01 UTC
Not sure what this magic number of 10.
If I change 10 files to 9 or less in test4.c, 
cifs module can be unloaded, always.
Does not make any sense but that is the observation.
Comment 10 Shirish S. Pargaonkar 2010-11-11 12:38:55 UTC
I also think cifs_umount should not always return with 0 (meaning success)
but perhaps return tcount.
Comment 11 Shirish S. Pargaonkar 2010-11-11 12:46:11 UTC
Disregard comment #9. It is not true.
Comment 12 Jeff Layton 2010-11-11 13:32:20 UTC
Created attachment 6063 [details]
patch - don't take extra tlink reference in cifs_initiate_search

I think I found the bug. Suresh's latest debug output pointed it out:

[  784.131227] fs/cifs/readdir.c: CIFS VFS: leaving cifs_readdir (xid = 114) rc = 0
[  784.131235] fs/cifs/readdir.c: CIFS VFS: in cifs_readdir as Xid: 115 with uid: 1001
[  784.131237] fs/cifs/readdir.c: search backing up - close and restart search
[  784.131239] fs/cifs/readdir.c: freeing SMB ff cache buf on search rewind
[  784.131240] tl_count=2
[  784.131242] Pid: 2542, comm: rm Not tainted 2.6.37-rc1-12-default #3
[  784.131244] Call Trace:
[  784.131250]  [<ffffffffa036d3d1>] cifs_sb_tlink+0xf1/0x570 [cifs]
[  784.131254]  [<ffffffff81143f20>] ? filldir64+0x0/0xd0
[  784.131260]  [<ffffffffa0381f62>] initiate_cifs_search+0x32/0x2b0 [cifs]
[  784.131262]  [<ffffffff81143f20>] ? filldir64+0x0/0xd0
[  784.131295]  [<ffffffff810e48d2>] ? mempool_free_slab+0x12/0x20
[  784.131298]  [<ffffffff810e4a12>] ? mempool_free+0x52/0xa0
[  784.131301]  [<ffffffff81143f20>] ? filldir64+0x0/0xd0
[  784.131306]  [<ffffffffa0382f20>] cifs_readdir+0x1f0/0x880 [cifs]
[  784.131309]  [<ffffffff81143f20>] ? filldir64+0x0/0xd0
[  784.131312]  [<ffffffff81143f20>] ? filldir64+0x0/0xd0
[  784.131314]  [<ffffffff81144230>] vfs_readdir+0xb0/0xd0
[  784.131318]  [<ffffffff81133576>] ? generic_file_llseek+0x56/0x70
[  784.131320]  [<ffffffff81144410>] sys_getdents64+0x80/0xe0
[  784.131324]  [<ffffffff81002e02>] system_call_fastpath+0x16/0x1b
[  784.131326] tl_count=3
[  784.131327] Pid: 2542, comm: rm Not tainted 2.6.37-rc1-12-default #3
[  784.131329] Call Trace:
[  784.131334]  [<ffffffffa03821d0>] initiate_cifs_search+0x2a0/0x2b0 [cifs]
[  784.131337]  [<ffffffff81143f20>] ? filldir64+0x0/0xd0
[  784.131339]  [<ffffffff810e4a12>] ? mempool_free+0x52/0xa0
[  784.131342]  [<ffffffff81143f20>] ? filldir64+0x0/0xd0
[  784.131347]  [<ffffffffa0382f20>] cifs_readdir+0x1f0/0x880 [cifs]
[  784.131350]  [<ffffffff81143f20>] ? filldir64+0x0/0xd0
[  784.131352]  [<ffffffff81143f20>] ? filldir64+0x0/0xd0
[  784.131355]  [<ffffffff81144230>] vfs_readdir+0xb0/0xd0
[  784.131357]  [<ffffffff81133576>] ? generic_file_llseek+0x56/0x70
[  784.131360]  [<ffffffff81144410>] sys_getdents64+0x80/0xe0
[  784.131363]  [<ffffffff81002e02>] system_call_fastpath+0x16/0x1b

I think this patch will fix it, but I haven't tested it yet. Could one of you that are able to reproduce this at will please do so?
Comment 13 Shirish S. Pargaonkar 2010-11-11 14:01:46 UTC
Jeff, I am doublechecking the patch, my modified scripts but I run into this
problem with the patch you posted.

+ ./test4 -f
./test4: setattr, getattr, and lookup
        ./test4: (/mnt/smb_c/testdir) runtests: line 41:  4313 Segmentation fault      ./test4 $TESTARG
+ '[' 139 -ne 0 ']'
+ echo basic tests failed
basic tests failed
+ exit 1
cifstest6:/usr/src/cthon04 # dmesg
CIFS VFS: Unexpected SMB signature
CIFS VFS: Unexpected SMB signature
test4[4313]: segfault at 4c6e4ca0 ip 00007f9dd3613580 sp 00007fff4c6e0a88 error 4 in libc-2.9.so[7f9dd3597000+14f000]

Comment 14 Jeff Layton 2010-11-11 14:53:53 UTC
Created attachment 6064 [details]
patch - don't take extra tlink reference in cifs_initiate_search (try #2)

That is odd...not sure what would cause a segfault there. Might be interesting to strace the program and figure out where it's actually segfaulting. It may be that we have another bug in this area, as I don't see how this would cause that to happen. It seems a little odd that we'd need to restart the search in either case.

Either way, here's a slightly revised patch (gets rid of the unneeded setting of cifsFile). I don't think that will affect the function of it, but it's slightly cleaner.
Comment 15 Shirish S. Pargaonkar 2010-11-11 16:57:39 UTC
This is absolutely strange.  I just downloaded nfstests.tar.gz and build
connectathon tests and with the a cifs module from the cloned tree without
any changes, I still run into core dump.

cifstest6:/usr/src/cthon111110/cthon04 # ./runtests -b -f

Starting BASIC tests: test directory /mnt/smb_c/testdir (arg: -f)

./test1: File and directory creation test
        created 6 files 6 directories 2 levels deep
        ./test1 ok.

./test2: File and directory removal test
        removed 6 files 6 directories 2 levels deep
        ./test2 ok.

./test3: lookups across mount point
        2 getcwd and stat calls
        ./test3 ok.

./test4: setattr, getattr, and lookup
        ./test4: (/mnt/smb_c/testdir) runtests: line 40:  3622 Segmentation fault      ./test4 $TESTARG
basic tests failed
cifstest6:/usr/src/cthon111110/cthon04 # dmesg
test4[3622]: segfault at 32bdf410 ip 00007fcbcdbeb580 sp 00007fff32bdb1f8 error 4 in libc-2.9.so[7fcbcdb6f000+14f000]

The mount is off of a Windows share.

A command like this returns error

cifstest6:/usr/src/cthon111110/cthon04 # strace -o /tmp/strfile ./runtests -b -f
strace: exec: Exec format error



cifstest6:/usr/src/cthon111110/cthon04/basic # ./runtests -f
Starting BASIC tests: test directory /mnt/smb_c/testdir (arg: -f)
mkdir: cannot create directory `/mnt/smb_c/testdir': File exists

./test1: File and directory creation test
        created 6 files 6 directories 2 levels deep
        ./test1 ok.

./test2: File and directory removal test
        removed 6 files 6 directories 2 levels deep
        ./test2 ok.

./test3: lookups across mount point
        2 getcwd and stat calls
        ./test3 ok.

./test4: setattr, getattr, and lookup
        ./test4: (/mnt/smb_c/testdir) ./runtests: line 40:  3765 Segmentation fault      ./test4 $TESTARG

Runing tests in basic directory does not help either.

And again strace errors out

strace -o /tmp/strfile ./runtests -f
strace: exec: Exec format error

Comment 16 Jeff Layton 2010-11-11 18:37:29 UTC
(In reply to comment #15)

> 
> ./test4: setattr, getattr, and lookup
>         ./test4: (/mnt/smb_c/testdir) runtests: line 40:  3622 Segmentation
> fault      ./test4 $TESTARG
> basic tests failed
> cifstest6:/usr/src/cthon111110/cthon04 # dmesg
> test4[3622]: segfault at 32bdf410 ip 00007fcbcdbeb580 sp 00007fff32bdb1f8 error
> 4 in libc-2.9.so[7fcbcdb6f000+14f000]
> 
> The mount is off of a Windows share.
> 
> A command like this returns error
> 
> cifstest6:/usr/src/cthon111110/cthon04 # strace -o /tmp/strfile ./runtests -b
> -f
> strace: exec: Exec format error
> 

strace only works on binaries, not on shell scripts (which is what runtests is). You'd need to run it on test4 directly.

I'm not sure why you'd be hitting segfaults there. This patch is fairly straightforward. Perhaps you should back it out and ensure your test rig is working correctly before you try to verify it?
Comment 17 Shirish S. Pargaonkar 2010-11-11 21:48:19 UTC
No, even without this patch test4 hits segfault. It is possible that I had
modified test4.c. Now that I have downloaded compiled testsuite, test4.c
is hitting this error. I will run strace on test4 binary directly to identify
the problem.
Comment 18 Suresh Jayaraman (mail address is dead) 2010-11-11 22:08:56 UTC
Hurray! I didn't see any segfault and the seem to fix the issue for me during my initial testing. I'll do little more testing and confirm.
Comment 19 Shirish S. Pargaonkar 2010-11-11 22:35:33 UTC
This line in test4.c causes segmentation fault, there are two of them, 
both do (if I comment out the first one, second segfaults).

if ((statb.st_mode & CHMOD_MASK) != CHMOD_NONE) {
Comment 20 Suresh Jayaraman (mail address is dead) 2010-11-11 23:03:25 UTC
The patch fixes the problem. I tried possible all combinations that yielded in
"Module in use" error and I could not reproduce the problem with this fix.

I also see the patch being obviously correct. We call initiate_cifs_search()
from cifs_readdir() when file->private_data is NULL and then call once down
below from find_cifs_entry() by that time we would have file->private_data set,
but we still take a extra reference..  You can add my "Reviewed-and-Tested-by"
as well.

During my testing I also found tl_count never went up to 3 from the debug
messages (as in the previous case) and it cifs_umount had tl_count of 0 before
returning.

Kudos! Jeff for nailing down this hard problem.
Comment 21 Shirish S. Pargaonkar 2010-11-11 23:08:58 UTC
Mine too.  The patch is definitely correct.
I am encoutering the segfault at
 error("%s has mode %o after chmod 0", str, (statb.st_mode & 0777));
but that has nothing to do with this bug and the patch certainly fixes the bug.
Comment 22 Jeff Layton 2010-11-12 05:52:56 UTC
Patch sent to Steve and the list. I think we can call this FIXED...
Comment 23 Shirish S. Pargaonkar 2010-11-12 08:17:31 UTC
The original problem was, in my cthon test suite, in test4.c, I had
commented out offending lines, the check following stat() call.
Now that I uncommented them, I started running into this problem (segfault).

I need to brush up my gdb skills to do proper debugging. I have 
compiled test4.c with -g option, will do the rest and log the findings.

I see this problem against a Windows server.  
 ./test4 -f
./test4: setattr, getattr, and lookup
        ./test4: (/mnt/smb_c/testdir) Segmentation fault

#dmesg 
test4[3843]: segfault at f1de2fe0 ip 00007ff5b51d1580 sp 00007ffff1ddedc8 
error 4 in libc-2.9.so[7ff5b5155000+14f000]

Against a Samba server, this is not a problem.

./test4 -f
./test4: setattr, getattr, and lookup
        20 chmods and stats on 10 files
        ./test4 ok.

# umount /mnt/smb_c
# rmmod cifs.ko

But this is a separate issue than this bug was openend for. So I will
log the results just for the information's sake.
Jeff's patch definitely fixes the problem reported by this bug.
Comment 24 Shirish S. Pargaonkar 2010-11-12 08:21:11 UTC
Oh, if I mount a Windows server share with mount option  cifsacl  ,
no segfaults, test4 completes

./test4 -f
./test4: setattr, getattr, and lookup
        20 chmods and stats on 10 files
        ./test4 ok.