Bug 7402 - File Corruption when reading a file
Summary: File Corruption when reading a file
Alias: None
Product: CifsVFS
Classification: Unclassified
Component: kernel fs
Version: 2.6
Hardware: Other Linux
Importance: P3 critical
Target Milestone: ---
Assignee: Jeff Layton
QA Contact: Samba QA Contact
Depends on:
Reported: 2010-04-30 13:50 UTC by jeevan kodali
Modified: 2012-06-29 08:58 UTC
CC List: 1 user

See Also:


Description jeevan kodali 2010-04-30 13:50:52 UTC

I am mapping a Windows drive from a Linux machine. I was trying to download an XML file (82 KB) and the file was corrupted. I tried to download the file using the cp command and was able to reproduce the corruption every time I used cp. But if I copy the XML file into a different folder on the Windows machine and then copy it from Linux, everything is fine. If I cp again from the first location, the file is corrupted again.

If I unmount the drive and mount it again, everything is fine.

I am afraid some files might be corrupted in ways I don't know about (I only noticed this one because it is an XML file and parsing it produced an error). The fact that remounting fixed the problem suggests that something gets corrupted after the mount has been up for a long time.
Comment 1 Jeremy Allison 2010-04-30 13:56:03 UTC
This is a CIFS VFS problem, not a Samba one. Changing to Linux kernel VFS.
Comment 2 Jeremy Allison 2010-04-30 13:56:24 UTC
Re-assigning to Jeff.
Comment 3 Jeff Layton 2010-04-30 14:13:46 UTC
What kernel version are you running? Do you have a way to reproduce this at will?
Comment 4 jeevan kodali 2010-04-30 14:27:42 UTC
Linux version 2.6.18-164.11.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #

That is the Linux version we are using. I used to be able to reproduce this at will for that file in that location, but once I unmounted and remounted the drive, I could not reproduce it again.
Comment 5 Jeff Layton 2010-04-30 14:40:18 UTC
Hmmm...questionable whether we'll be able to tell much given that the box is no longer in that state. One possibility is that a read was done against the file while it was in an intermediate state of modification and the client didn't realize that it needed to invalidate the cache.

When you say it was corrupt...did you mean that it was actually just garbage? Stuck at an earlier version? Can you be more specific?
Comment 6 jeevan kodali 2010-04-30 15:19:47 UTC
When we started getting the corrupted file, it was not all junk. About three-quarters of the file was good; after that it was not junk either, but it looked as if some characters (probably 5-6) were lost. After that most of the file was good, and then at the end some data appeared that was not in the original file.

You are probably right that it is a caching problem. Is there any way to handle or fix it, say by not caching at all or something like that?
Comment 7 Jeff Layton 2010-05-01 06:11:03 UTC
Hmm ok. Sounds almost like confusion about the file size, or some sort of page cache corruption. If you get the box in this state again what may be interesting is to try and change the mtime (aka LastWriteTime) on the file from another client or on the server and see whether that "fixes" the problem (it probably will).

There are also a number of fixes that went into the RHEL 5.5 kernels that may help with this as well (primarily those having to do with dentry caches).
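A minimal sketch of the mtime test suggested above, run here against a scratch file so it is self-contained; in practice you would touch the suspect file from another client or on the server (the share path is hypothetical):

```shell
# Demo on a scratch file; against the real share you would run
# something like: touch -m /mnt/winshare/file.xml  (path is a placeholder)
f=$(mktemp)
before=$(stat -c %Y "$f")    # mtime as seconds since the epoch (GNU stat)
sleep 1
touch -m "$f"                # bump only the modification time (LastWriteTime)
after=$(stat -c %Y "$f")
echo "mtime moved forward by $((after - before))s"
```

If stale cached pages are the culprit, the changed mtime should make the CIFS client notice the file changed and re-read it from the server on the next access.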
Comment 8 Jeff Layton 2010-05-01 06:12:08 UTC
Also, if you see this again maybe consider saving a copy of the corrupt file somewhere so you can compare it to the original. It would also be interesting to know whether the corrupt section starts on a page boundary.
Comment 9 jeevan kodali 2010-05-01 08:18:39 UTC
I have copies of the original and the corrupted file. Do you know how I can check whether the corruption starts at a page boundary?
Comment 10 Jeff Layton 2010-05-01 10:43:01 UTC
Compare it to the uncorrupted original, determine the byte offset where the problem starts and see whether that byte offset is a multiple of 4096.
Comment 11 jeevan kodali 2010-05-03 15:43:19 UTC
It was an exact multiple (14 * 4096 = 57344). I am not really sure I understood your suggestion about fixes. Are you saying that I have to update the kernel?

Comment 12 Jeff Layton 2010-05-04 09:26:32 UTC
Ok, that's an interesting data point. Updating the kernel probably won't hurt, but I don't think we know enough to be able to state whether it'll help or not. Unless this happens again, we probably won't be able to get much farther with it.

The ideal thing would be a way to reliably reproduce this.
Comment 13 jeevan kodali 2010-05-04 10:12:42 UTC
Thanks for your help. Next time something like this happens I will contact you again before fixing the issue.
Comment 14 Jeff Layton 2010-05-04 11:15:10 UTC
Sounds good. When/if it occurs you should probably collect:

Copies of the "good" and "bad" files
Metadata from the file on the client and server (run /bin/stat against it on both sides and save off the output)
Also, consider turning up cifsFYI info on the client while accessing the file. See this page for info on doing that:

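The collection steps above could look roughly like this; the demo uses a scratch file so it runs anywhere, and the share path is a placeholder. The /proc/fs/cifs/cifsFYI path is the real debug switch for cifs.ko, but it needs root and the module loaded, so it is guarded here:

```shell
# 1. Keep copies of both versions of the file for comparison
#    (in practice: cp /mnt/winshare/file.xml /tmp/bad-copy.xml).

# 2. Save the metadata. Demo on a scratch file; run the same stat against
#    the file on the CIFS mount AND against the server's copy, keep both.
f=$(mktemp)
echo 'sample' > "$f"
stat "$f" > /tmp/stat.client.txt
grep 'Size:' /tmp/stat.client.txt

# 3. Turn on cifsFYI debug logging while reproducing the problem;
#    the messages appear in the kernel log (dmesg). Guarded so the
#    demo still runs on machines without the cifs module.
if [ -w /proc/fs/cifs/cifsFYI ]; then
    echo 1 > /proc/fs/cifs/cifsFYI
    # ... access the file, reproduce the corruption ...
    echo 0 > /proc/fs/cifs/cifsFYI
fi
```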
Comment 15 jeevan kodali 2010-05-05 13:29:39 UTC
This happened again, and this time it doesn't look like the corruption starts at a 4096-byte boundary. I have the good and bad files. Running stat on the file as seen from Linux shows this:

  Size: 34676           Blocks: 72         IO Block: 16384  regular file
Device: 17h/23d Inode: 6310308     Links: 1
Access: (2767/-rwxrwSrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-05-04 22:14:08.078000400 -0500
Modify: 2010-05-04 22:14:16.062477600 -0500
Change: 2010-05-05 02:32:06.018902400 -0500

When I look at this file on the Windows server, I see the time as 5/4/2010 10:14 PM. This file might have been updated twice between 10:14:09 and 10:14:17, which may be what caused this corruption.

I haven't turned on that cifsFYI logging yet.
Comment 16 jeevan kodali 2010-05-11 10:52:16 UTC
I changed the mount to use the "directio" option with mount.cifs. It looks like I am not hitting this problem right now; I will keep an eye on it.
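For reference, a mount along those lines might look like the following; server, share, mount point, and username are all placeholders. The directio option makes the client bypass its page cache and go to the server for every read and write:

```shell
# Placeholder server/share/user; directio disables client-side caching.
mount -t cifs //winserver/share /mnt/winshare -o directio,username=someuser
```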
Comment 17 Jeff Layton 2012-05-05 10:43:00 UTC
Possibly this is due to the semi-broken caching model in cifs.ko. Are you still using RHEL5 here, and have you seen this issue since switching to "forcedirectio"?
Comment 18 David H. Durgee 2012-06-28 13:05:25 UTC
I believe I am seeing a re-emergence of this bug.  I can recreate it at will and have used traceSMB and cifsFYI on systems both with and without the bug.

As I have already opened a bug report on Launchpad, let me give you a link to it rather than reposting everything here:


Comment 19 David H. Durgee 2012-06-28 15:09:52 UTC
Based upon reading the comments here I added the "directio" option to my fstab entries for the shares. I got mixed results with this. Two of the shares are with peers, and the problem appears to be solved for them. One of the shares is with a Warp Server, and the behavior there is not as clear. Here is an example:

dhdurgee@DG41TY ~/Downloads $ unzip -t /mnt/n_/monthly/BD120605.zip 
Archive:  /mnt/n_/monthly/BD120605.zip
error [/mnt/n_/monthly/BD120605.zip]:  reported length of central directory is
  0 bytes too long (Atari STZip zipfile?  J.H.Holm ZIPSPLIT 1.1
  zipfile?).  Compensating...
file #1:  bad zipfile offset (lseek):  0
file #2:  bad zipfile offset (local header sig):  863
error:  zipfile read error
file #3:  bad zipfile offset (EOF):  4106
error:  zipfile read error
file #4:  bad zipfile offset (EOF):  7315
error:  zipfile read error
file #5:  bad zipfile offset (EOF):  7702
error:  zipfile read error
file #6:  bad zipfile offset (EOF):  8111
file #7:  bad zipfile offset (lseek):  8192
file #8:  bad zipfile offset (local header sig):  13164
At least one error was detected in /mnt/n_/monthly/BD120605.zip.
dhdurgee@DG41TY ~/Downloads $ cp  /mnt/n_/monthly/BD120605.zip ./BD120605.zip
cp: reading `/mnt/n_/monthly/BD120605.zip': Invalid argument
cp: failed to extend `./BD120605.zip': Invalid argument
dhdurgee@DG41TY ~/Downloads $ ls -l BD120605.zip 
-rwxr-xr-x 1 dhdurgee dhdurgee 0 Jun 28 10:56 BD120605.zip

{used FC/L to copy file from mount to directory here}

dhdurgee@DG41TY ~/Downloads $ unzip -t ./BD120605.zip 
Archive:  ./BD120605.zip
    testing: XH.TXT                   OK
    testing: XD.TXT                   OK
    testing: EQ2.CSV                  OK
    testing: EQ.CSV                   OK
    testing: LT.CSV                   OK
    testing: IU.CSV                   OK
    testing: BD2.CSV                  OK
    testing: BD1.CSV                  OK
No errors detected in compressed data of ./BD120605.zip.
dhdurgee@DG41TY ~/Downloads $ 

So as you can see above, using unzip directly against the file on the share has problems, as does the cp command for some reason. Yet I can use FC/L to copy the file to a local directory, and the copy is valid! Weird.
Comment 20 Jeff Layton 2012-06-29 08:58:49 UTC
David, the problem you're seeing is almost certainly different from the one originally reported. Could you open a new bug with this info and cc me on it? I'm going to go ahead and close this as INVALID since the original problem was never reproducible and he was able to work around it with "directio".