Thus spake Shirish: I think there is some kind of (I think refcount) leak. Even after running basic connectathon test (-b -f) without cifsacl mount option, after unmounting all the shares, cannot unload cifs module. [...] After test4 of basic [connectathon] tests, cifs module does not unload, it is fine till test3, I can unmount and unload but not after test4 ./test4 $TESTARG + ./test4 -f ./test4: setattr, getattr, and lookup 20 chmods and stats on 10 files ./test4 ok. + exit 0 So even if the test passes, something leaks for cifs.ko to not unload. [...] Suresh also said he could reproduce this at will with test4, and when he tried reducing the number and count on the test the problem went away.
Created attachment 6058 [details] patch -- make sure cifs_get_tlink is done under spinlock This is sort of a stab in the dark, but we don't generally hold the spinlock when we do this particular get. It might be interesting to test this patch and see whether it helps. If it does, we'll still want to do some investigation to make sure we understand the race involved here.
I added a few printks in cifs_umount and found that after the test4, the tlink->tl_count =1 just before I leave cifs_umount which shows a problem here.
Created attachment 6059 [details] patch -- instrument cifs_get_tlink Ok, so it seems we have too many "gets" and not enough "puts". This patch might help shed some light on it. It pops a printk and does a dump_stack when the tl_count goes above 1. I see this pop normally during the initial stages of test4 when files are being created. This is expected. I think what we want to know is whether this ever pops during the chmod/stat calls.
Created attachment 6060 [details] cifs FYI with Jeff's instrumentation patch
One thing I notice is that test4 in this case is forking off a rm process to clean up the old directory. I wonder whether the problem is more closely related to the recursive rm command rather than test4 specifically. Suresh did a little testing and hinted at that on IRC, but it wasn't 100% clear to me what his results were. Suresh, could you elaborate? I'm still unable to reproduce this at will so far...
I ran a simple test script as suggested by Jeff and I was able to reproduce the problem. NFSTESTDIR is set /mnt/cifs/newtests (and the directory was existing before the test was run). #!/bin/sh rm -rf $NFSTESTDIR mkdir $NFSTESTDIR for i in `seq 0 9` do echo '' > $NFSTESTDIR/file.$i done
To narrow down the test further, I'm able to hit it just by doing rm -rf $NFSTESTDIR mkdir $NFSTESTDIR
Created attachment 6061 [details] Debug message
Not sure what this magic number of 10. If I change 10 files to 9 or less in test4.c, cifs module can be unloaded, always. Does not make any sense but that is the observation.
I also think cifs_umount should not always return with 0 (meaning success) but perhaps return tcount.
Disregard comment #9. It is not true.
Created attachment 6063 [details] patch - don't take extra tlink reference in cifs_initiate_search I think I found the bug. Suresh's latest debug output pointed it out: [ 784.131227] fs/cifs/readdir.c: CIFS VFS: leaving cifs_readdir (xid = 114) rc = 0 [ 784.131235] fs/cifs/readdir.c: CIFS VFS: in cifs_readdir as Xid: 115 with uid: 1001 [ 784.131237] fs/cifs/readdir.c: search backing up - close and restart search [ 784.131239] fs/cifs/readdir.c: freeing SMB ff cache buf on search rewind [ 784.131240] tl_count=2 [ 784.131242] Pid: 2542, comm: rm Not tainted 2.6.37-rc1-12-default #3 [ 784.131244] Call Trace: [ 784.131250] [<ffffffffa036d3d1>] cifs_sb_tlink+0xf1/0x570 [cifs] [ 784.131254] [<ffffffff81143f20>] ? filldir64+0x0/0xd0 [ 784.131260] [<ffffffffa0381f62>] initiate_cifs_search+0x32/0x2b0 [cifs] [ 784.131262] [<ffffffff81143f20>] ? filldir64+0x0/0xd0 [ 784.131295] [<ffffffff810e48d2>] ? mempool_free_slab+0x12/0x20 [ 784.131298] [<ffffffff810e4a12>] ? mempool_free+0x52/0xa0 [ 784.131301] [<ffffffff81143f20>] ? filldir64+0x0/0xd0 [ 784.131306] [<ffffffffa0382f20>] cifs_readdir+0x1f0/0x880 [cifs] [ 784.131309] [<ffffffff81143f20>] ? filldir64+0x0/0xd0 [ 784.131312] [<ffffffff81143f20>] ? filldir64+0x0/0xd0 [ 784.131314] [<ffffffff81144230>] vfs_readdir+0xb0/0xd0 [ 784.131318] [<ffffffff81133576>] ? generic_file_llseek+0x56/0x70 [ 784.131320] [<ffffffff81144410>] sys_getdents64+0x80/0xe0 [ 784.131324] [<ffffffff81002e02>] system_call_fastpath+0x16/0x1b [ 784.131326] tl_count=3 [ 784.131327] Pid: 2542, comm: rm Not tainted 2.6.37-rc1-12-default #3 [ 784.131329] Call Trace: [ 784.131334] [<ffffffffa03821d0>] initiate_cifs_search+0x2a0/0x2b0 [cifs] [ 784.131337] [<ffffffff81143f20>] ? filldir64+0x0/0xd0 [ 784.131339] [<ffffffff810e4a12>] ? mempool_free+0x52/0xa0 [ 784.131342] [<ffffffff81143f20>] ? filldir64+0x0/0xd0 [ 784.131347] [<ffffffffa0382f20>] cifs_readdir+0x1f0/0x880 [cifs] [ 784.131350] [<ffffffff81143f20>] ? filldir64+0x0/0xd0 [ 784.131352] [<ffffffff81143f20>] ? filldir64+0x0/0xd0 [ 784.131355] [<ffffffff81144230>] vfs_readdir+0xb0/0xd0 [ 784.131357] [<ffffffff81133576>] ? generic_file_llseek+0x56/0x70 [ 784.131360] [<ffffffff81144410>] sys_getdents64+0x80/0xe0 [ 784.131363] [<ffffffff81002e02>] system_call_fastpath+0x16/0x1b I think this patch will fix it, but I haven't tested it yet. Could one of you that are able to reproduce this at will please do so?
Jeff, I am doublechecking the patch, my modified scripts but I run into this problem with the patch you posted. + ./test4 -f ./test4: setattr, getattr, and lookup ./test4: (/mnt/smb_c/testdir) runtests: line 41: 4313 Segmentation fault ./test4 $TESTARG + '[' 139 -ne 0 ']' + echo basic tests failed basic tests failed + exit 1 cifstest6:/usr/src/cthon04 # dmesg CIFS VFS: Unexpected SMB signature CIFS VFS: Unexpected SMB signature test4[4313]: segfault at 4c6e4ca0 ip 00007f9dd3613580 sp 00007fff4c6e0a88 error 4 in libc-2.9.so[7f9dd3597000+14f000]
Created attachment 6064 [details] patch - don't take extra tlink reference in cifs_initiate_search (try #2) That is odd...not sure what would cause a segfault there. Might be interesting to strace the program and figure out where it's actually segfaulting. It may be that we have another bug in this area, as I don't see how this would cause that to happen. It seems a little odd that we'd need to restart the search in either case. Either way, here's a slightly revised patch (gets rid of the unneeded setting of cifsFile). I don't think that will affect the function of it, but it's slightly cleaner.
This is absolutely strange. I just downloaded nfstests.tar.gz and build connectathon tests and with the a cifs module from the cloned tree without any changes, I still run into core dump. cifstest6:/usr/src/cthon111110/cthon04 # ./runtests -b -f Starting BASIC tests: test directory /mnt/smb_c/testdir (arg: -f) ./test1: File and directory creation test created 6 files 6 directories 2 levels deep ./test1 ok. ./test2: File and directory removal test removed 6 files 6 directories 2 levels deep ./test2 ok. ./test3: lookups across mount point 2 getcwd and stat calls ./test3 ok. ./test4: setattr, getattr, and lookup ./test4: (/mnt/smb_c/testdir) runtests: line 40: 3622 Segmentation fault ./test4 $TESTARG basic tests failed cifstest6:/usr/src/cthon111110/cthon04 # dmesg test4[3622]: segfault at 32bdf410 ip 00007fcbcdbeb580 sp 00007fff32bdb1f8 error 4 in libc-2.9.so[7fcbcdb6f000+14f000] The mount is off of a Windows share. A command like this returns error cifstest6:/usr/src/cthon111110/cthon04 # strace -o /tmp/strfile ./runtests -b -f strace: exec: Exec format error cifstest6:/usr/src/cthon111110/cthon04/basic # ./runtests -f Starting BASIC tests: test directory /mnt/smb_c/testdir (arg: -f) mkdir: cannot create directory `/mnt/smb_c/testdir': File exists ./test1: File and directory creation test created 6 files 6 directories 2 levels deep ./test1 ok. ./test2: File and directory removal test removed 6 files 6 directories 2 levels deep ./test2 ok. ./test3: lookups across mount point 2 getcwd and stat calls ./test3 ok. ./test4: setattr, getattr, and lookup ./test4: (/mnt/smb_c/testdir) ./runtests: line 40: 3765 Segmentation fault ./test4 $TESTARG Runing tests in basic directory does not help either. And again strace errors out strace -o /tmp/strfile ./runtests -f strace: exec: Exec format error
(In reply to comment #15) > > ./test4: setattr, getattr, and lookup > ./test4: (/mnt/smb_c/testdir) runtests: line 40: 3622 Segmentation > fault ./test4 $TESTARG > basic tests failed > cifstest6:/usr/src/cthon111110/cthon04 # dmesg > test4[3622]: segfault at 32bdf410 ip 00007fcbcdbeb580 sp 00007fff32bdb1f8 error > 4 in libc-2.9.so[7fcbcdb6f000+14f000] > > The mount is off of a Windows share. > > A command like this returns error > > cifstest6:/usr/src/cthon111110/cthon04 # strace -o /tmp/strfile ./runtests -b > -f > strace: exec: Exec format error > strace only works on binaries, not on shell scripts (which is what runtests is). You'd need to run it on test4 directly. I'm not sure why you'd be hitting segfaults there. This patch is fairly straightforward. Perhaps you should back it out and ensure your test rig is working correctly before you try to verify it?
No, even without this patch test4 hits segfault. It is possible that I had modified test4.c. Now that I have downloaded compiled testsuite, test4.c is hitting this error. I will run strace on test4 binary directly to identify the problem.
Hurray! I didn't see any segfault and the seem to fix the issue for me during my initial testing. I'll do little more testing and confirm.
This line in test4.c causes segmentation fault, there are two of them, both do (if I comment out the first one, second segfaults). if ((statb.st_mode & CHMOD_MASK) != CHMOD_NONE) {
The patch fixes the problem. I tried possible all combinations that yielded in "Module in use" error and I could not reproduce the problem with this fix. I also see the patch being obviously correct. We call initiate_cifs_search() from cifs_readdir() when file->private_data is NULL and then call once down below from find_cifs_entry() by that time we would have file->private_data set, but we still take a extra reference.. You can add my "Reviewed-and-Tested-by" as well. During my testing I also found tl_count never went up to 3 from the debug messages (as in the previous case) and it cifs_umount had tl_count of 0 before returning. Kudos! Jeff for nailing down this hard problem.
Mine too. The patch is definitely correct. I am encoutering the segfault at error("%s has mode %o after chmod 0", str, (statb.st_mode & 0777)); but that has nothing to do with this bug and the patch certainly fixes the bug.
Patch sent to Steve and the list. I think we can call this FIXED...
The original problem was, in my cthon test suite, in test4.c, I had commented out offending lines, the check following stat() call. Now that I uncommented them, I started running into this problem (segfault). I need to brush up my gdb skills to do proper debugging. I have compiled test4.c with -g option, will do the rest and log the findings. I see this problem against a Windows server. ./test4 -f ./test4: setattr, getattr, and lookup ./test4: (/mnt/smb_c/testdir) Segmentation fault #dmesg test4[3843]: segfault at f1de2fe0 ip 00007ff5b51d1580 sp 00007ffff1ddedc8 error 4 in libc-2.9.so[7ff5b5155000+14f000] Against a Samba server, this is not a problem. ./test4 -f ./test4: setattr, getattr, and lookup 20 chmods and stats on 10 files ./test4 ok. # umount /mnt/smb_c # rmmod cifs.ko But this is a separate issue than this bug was openend for. So I will log the results just for the information's sake. Jeff's patch definitely fixes the problem reported by this bug.
Oh, if I mount a Windows server share with mount option cifsacl , no segfaults, test4 completes ./test4 -f ./test4: setattr, getattr, and lookup 20 chmods and stats on 10 files ./test4 ok.