Bug 3219 - cifsoplockd oops in 2.6.13.4
Summary: cifsoplockd oops in 2.6.13.4
Status: RESOLVED WORKSFORME
Alias: None
Product: CifsVFS
Classification: Unclassified
Component: kernel fs
Version: 2.6
Hardware: x86 Linux
Importance: P3 major
Target Milestone: ---
Assignee: Steve French
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-10-26 20:35 UTC by Ian Berry
Modified: 2006-04-06 13:09 UTC

See Also:


Description Ian Berry 2005-10-26 20:35:51 UTC
I am submitting this bug at the request of Steven French regarding a problem that
I recently encountered with cifs in the 2.6 kernel. Below are the contents of my
report to the linux-kernel mailing list.

======
I have been experiencing problems with cifs-mounted Windows (2k/2k3) shares
running under the 2.6.13.4 kernel. I am running a piece of software that
traverses a remote cifs mount for backup purposes. To prevent locking issues,
the script obtains a shared lock on each file before trying to read it and
releases it afterwards. After making this change, running the script for an
extended period of time causes the cifs process in the kernel to oops and
eventually require a reboot to fully recover.
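
For anyone trying to reproduce this, the access pattern described above can be sketched roughly as follows. This is not the actual backup script: the function names, the chunk size, and the choice of fcntl-style shared locks over the whole file are all assumptions made for illustration.

```python
import fcntl
import os


def read_with_shared_lock(path, chunk_size=65536):
    """Read a file while holding a shared (read) lock; return bytes read.

    Hypothetical helper: the real script's locking call is not known.
    """
    total = 0
    with open(path, "rb") as f:
        # Take a shared lock on the whole file before reading.
        fcntl.lockf(f, fcntl.LOCK_SH)
        try:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
        finally:
            # Release the lock once the read is finished.
            fcntl.lockf(f, fcntl.LOCK_UN)
    return total


def traverse(root):
    """Walk a directory tree and read every regular file under a shared lock."""
    stats = {"files": 0, "bytes": 0}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue
            stats["bytes"] += read_with_shared_lock(path)
            stats["files"] += 1
    return stats
```

Calling traverse() on the cifs mount point in a loop for an extended period would approximate the workload; whether it triggers the same oops is untested.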

As I *was* using a stock 2.6.11 kernel, the changelogs for 2.6.12 and 2.6.13
looked promising with new code that would prevent an oops when the oplockd
thread dies and another fix for an oops when closing a file with open locks.
Unfortunately, upgrading the kernel to 2.6.13.4 has not corrected the problem
although it seems to take longer for the problem to manifest itself.

Here is what appears in dmesg:

Process cifsoplockd (pid: 192, threadinfo=f7cd2000 task=c1be7020)
Stack: f380940c ffffffff c013fc03 f3809400 f7cd3f4c 00000000 00000000 f7cd3f44
      00000000 f7cd2000 c014a16e f38093fc 00000000 0000000e f7cd3f4c 00000000
      c014a6e9 f7cd3f44 f38093fc 00000000 0000000e 00000000 00000000 00000000
Call Trace:
 [<c013fc03>] find_get_pages+0x53/0x60
 [<c014a16e>] pagevec_lookup+0x2e/0x40
 [<c014a6e9>] invalidate_mapping_pages+0x59/0x100
 [<c022ce5b>] CIFSSMBLock+0x7b/0x1e0
 [<c022b452>] cifs_oplock_thread+0x112/0x214
 [<c022b340>] cifs_oplock_thread+0x0/0x214

I have confirmed that the cifsoplockd process does in fact die after this
occurs. I also saw a note in the 2.6.13 changelog that the cifs locking code is
still not perfect, but improved. Can anyone confirm this?

The box in question is a Dell 2850 w/ 2x2.8 GHz Xeon processors. I have been
able to reproduce this on multiple servers as well.
Comment 1 Ian Berry 2005-10-28 10:02:12 UTC
Just a quick update. We have ordered a Windows Server 2003 license for various
uses, so when that arrives I will set up a test server in an attempt to reproduce
this problem in an isolated environment.
Comment 2 Steve French 2005-12-02 09:10:47 UTC
If you can recreate this, let me know, and we may be able to add debug code. The cifs call stack here is pretty harmless, though: invalidate_mapping_pages does not really have much to do with cifs per se. Perhaps there is some race with two threads doing it at once, but that sounds like something I can't do much about.

If you reproduce it, please attach the full oops data (I think part of it got truncated in your earlier post).

This may be a corruption of the mapping. Presumably the mapping could be locked (or was not so corrupt as to prevent the lock) a few lines earlier (about line 508 of mm/filemap.c), so it is strange that it would oops on the unlock. In any case, the chance of this occurring seems to be near zero unless the machine is SMP, since presumably nothing will get in between the lock and unlock.
Comment 3 Ian Berry 2005-12-20 18:45:05 UTC
Adding debug code to the kernel would be ideal for tracking this issue down. I don't think there will be much of a problem reproducing the error, as it has happened across multiple clients and servers.

I have been really busy lately, so it is going to be another couple of weeks before I can put some time into tracking this down. Unfortunately, these errors were occurring on live customer servers at various locations, which makes debugging difficult. Once I get everything together, I will kick this bug again.
Comment 4 Steve French 2006-04-05 22:21:19 UTC
This should not occur on more recent code - but it would require installing newer cifs (cifs-1.42b.tar.gz is current on the download site) in order to prove it.

I may return this bug, as it is not being actively worked on and no other similar reports have come in.
Comment 5 Ian Berry 2006-04-06 08:46:41 UTC
Thanks, Steve. I will let you know if I experience any more problems with this using a newer version of CIFS. We have yet to migrate to a newer kernel on some of our customer boxes so it will be difficult to test until we do.
Comment 6 Steve French 2006-04-06 09:38:52 UTC
If you prefer you can just build cifs.ko (rather than upgrade the kernel) e.g. from http://us1.samba.org/samba/ftp/cifs-cvs/cifs-1.42b.tar.gz
Comment 7 Ian Berry 2006-04-06 13:02:59 UTC
We currently have CIFS compiled into the kernel, which for reasons like this probably does not make sense. Is the current version of CIFS in 2.6.16?
Comment 8 Steve French 2006-04-06 13:04:47 UTC
CIFS in 2.6.16 has a problem with servers which mandate cifs packet signing (fixed in 2.6.17-rc1) and is missing a handful of other minor bug fixes, but is otherwise current enough.
Comment 9 Ian Berry 2006-04-06 13:09:35 UTC
That would probably cause issues for us since we connect to a number of Windows Server 2003 clients which require packet signing by default. Either way, the modular approach is probably the best way to go here. That or wait until 2.6.17, which might be quite a leap to make in terms of running code that has not been entirely "broken in" on lots of different hardware.