Hi, I am mounting a Windows drive from a Linux machine. I tried to download one XML file (82 KB) and the file was corrupted. I downloaded it using the cp command and was able to reproduce the corruption every time I used cp. However, if I copy the XML file into a different folder on the Windows machine and then copy it from Linux, everything is fine. But if I cp again from the first location, the file is corrupted again. If I unmount the drive and mount it again, everything is fine. I am afraid some files might be corrupted in a way I don't know about (I only noticed this one because the XML parser reported an error). The fact that remounting fixes the problem suggests that something gets corrupted after the mount has been up for a long time.
This is a CIFS VFS problem, not a Samba one. Changing to Linux kernel VFS. Jeremy.
Re-assigning to Jeff.
What kernel version are you running? Do you have a way to reproduce this at will?
Linux version 2.6.18-164.11.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) # That is the Linux version we are using. I used to be able to reproduce it at will for that file in that location, but once I unmounted and remounted the drive, I could no longer reproduce it.
Hmmm...questionable whether we'll be able to tell much given that the box is no longer in that state. One possibility is that a read was done against the file while it was in an intermediate state of modification and the client didn't realize that it needed to invalidate the cache. When you say it was corrupt...did you mean that it was actually just garbage? Stuck at an earlier version? Can you be more specific?
When we started getting the corrupted file, it was not all junk. About three quarters of the file was good; after that it was not junk either, but it looked as if some characters (probably 5-6) were lost. Most of the rest of the file was good, and then at the end some data appeared that was not in the original file. You are probably right that this is a caching problem. Is there any way to handle/fix it, say by not caching at all or something like that?
Hmm, ok. Sounds almost like confusion about the file size, or some sort of page cache corruption. If you get the box into this state again, what may be interesting is to try to change the mtime (aka LastWriteTime) on the file from another client or on the server and see whether that "fixes" the problem (it probably will). There are also a number of fixes that went into the RHEL5.5 kernels that may help this as well (primarily those having to do with dentry caches).
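A minimal local sketch of that mtime bump (the file name and timestamp here are arbitrary; on a real share you would touch the file from another client or directly on the server so its LastWriteTime changes):

```shell
# Create a throwaway file just for illustration (demo.xml is hypothetical).
printf '<doc/>' > demo.xml

# Force a new mtime; the CIFS client compares this against its cached
# attributes and should invalidate its cached pages when it differs.
touch -m -d '2020-01-01 00:00:00 UTC' demo.xml

# Print the mtime as seconds since the epoch to confirm it changed.
stat -c %Y demo.xml
```

On the client side, re-reading the file after the server-side mtime change should then fetch fresh data instead of serving stale cached pages.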
Also, if you see this again maybe consider saving a copy of the corrupt file somewhere so you can compare it to the original. It would also be interesting to know whether the corrupt section starts on a page boundary.
I have the copy of the original and the corrupted file with me, do you know how I can check whether it is a page boundary?
Compare it to the uncorrupted original, determine the byte offset where the problem starts and see whether that byte offset is a multiple of 4096.
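One way to do that comparison is with cmp, which reports the first differing byte. A sketch with synthetic files (the file names, sizes, and the 57344 offset are made up for illustration):

```shell
# Build an 80 KiB "good" file and a "bad" copy that diverges at
# byte offset 57344 (hypothetical values for the sketch).
head -c 81920 /dev/zero | tr '\0' 'A' > good.xml
{ head -c 57344 good.xml; printf 'ZZZZ'; tail -c +57349 good.xml; } > bad.xml

# cmp reports the first differing byte 1-based, so subtract 1
# to get the zero-based offset.
offset=$(( $(cmp good.xml bad.xml | grep -Eo '[0-9]+' | head -1) - 1 ))
echo "corruption starts at offset $offset"
echo "page: $(( offset / 4096 )), remainder: $(( offset % 4096 ))"
```

A remainder of 0 means the corruption starts exactly on a 4096-byte page boundary, which would point toward page-cache trouble rather than wire corruption.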
It was an exact multiple (14*4096 = 57344). I am not really sure I understood your suggestion of fixes. Are you saying that I have to update the kernel? Thanks
Ok, that's an interesting data point. Updating the kernel probably won't hurt, but I don't think we know enough to be able to state whether it'll help or not. Unless this happens again, we probably won't be able to get much further with it. The ideal thing would be a way to reliably reproduce this.
Thanks for your help. Next time something like this happens I will contact you again before trying to fix the issue.
Sounds good. When/if it occurs you should probably collect:
- copies of the "good" and "bad" files
- metadata from the file on the client and server (run /bin/stat against it on both sides and save off the output)
Also, consider turning up cifsFYI info on the client while accessing the file. See this page for info on doing that: http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting
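For reference, the debug toggles live under /proc/fs/cifs; a sketch of turning them on (requires root, and the cifs module must be loaded, so this is a config fragment rather than something runnable anywhere):

```shell
# Enable verbose CIFS debug logging; messages go to the kernel log.
echo 1 > /proc/fs/cifs/cifsFYI

# Optionally also dump SMB protocol traffic to the kernel log.
echo 1 > /proc/fs/cifs/traceSMB

# Reproduce the problem, then capture the log output.
dmesg > cifs-debug.log

# Turn the logging back off when done.
echo 0 > /proc/fs/cifs/cifsFYI
echo 0 > /proc/fs/cifs/traceSMB
```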
This happened again, and this time it doesn't look like the corruption starts at a multiple of 4096. I have the good and bad files. Running stat on the file as seen from Linux shows this:

  Size: 34676   Blocks: 72   IO Block: 16384   regular file
  Device: 17h/23d   Inode: 6310308   Links: 1
  Access: (2767/-rwxrwSrwx)   Uid: (0/root)   Gid: (0/root)
  Access: 2010-05-04 22:14:08.078000400 -0500
  Modify: 2010-05-04 22:14:16.062477600 -0500
  Change: 2010-05-05 02:32:06.018902400 -0500

When I look at this file on the Windows server, I see the time as 5/4/2010 10:14 PM. The file might have been updated twice between 10:14:09 and 10:14:17, which would have caused this corruption. I haven't turned on the cifsFYI debugging yet.
I changed the mount to use the "directio" option in mount.cifs. It looks like I am not getting this problem right now; I will keep an eye on it.
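For reference, a sketch of what that mount looks like (server, share, mount point, and user names are placeholders); "directio" bypasses the client-side page cache so every read and write goes to the server, trading performance for coherence:

```shell
# One-off mount with client-side caching disabled (names are hypothetical).
sudo mount -t cifs //winserver/share /mnt/share -o username=me,directio

# Equivalent /etc/fstab entry:
# //winserver/share  /mnt/share  cifs  username=me,directio  0  0
```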
Possibly this is due to the semi-broken caching model in cifs.ko. Are you still using RHEL5 here, have you seen this issue since switching to "forcedirectio" ?
I believe I am seeing a re-emergence of this bug. I can recreate it at will and have used traceSMB and cifsFYI on systems both with and without the bug. As I have already opened a bug report on launchpad let me give you a link to it as opposed to reposting everything here: https://bugs.launchpad.net/linuxmint/+bug/1017660 Dave
Based upon reading the comments here I added the "directio" option to my fstab entries for the shares. I got mixed results with this. Two of the shares are with peers and the problem appears to be solved for them. One of the shares is with a Warp Server and the behavior there is not as clear. Here is an example:

dhdurgee@DG41TY ~/Downloads $ unzip -t /mnt/n_/monthly/BD120605.zip
Archive: /mnt/n_/monthly/BD120605.zip
error [/mnt/n_/monthly/BD120605.zip]: reported length of central directory is 0 bytes too long (Atari STZip zipfile? J.H.Holm ZIPSPLIT 1.1 zipfile?). Compensating...
file #1: bad zipfile offset (lseek): 0
file #2: bad zipfile offset (local header sig): 863
error: zipfile read error
file #3: bad zipfile offset (EOF): 4106
error: zipfile read error
file #4: bad zipfile offset (EOF): 7315
error: zipfile read error
file #5: bad zipfile offset (EOF): 7702
error: zipfile read error
file #6: bad zipfile offset (EOF): 8111
file #7: bad zipfile offset (lseek): 8192
file #8: bad zipfile offset (local header sig): 13164
At least one error was detected in /mnt/n_/monthly/BD120605.zip.

dhdurgee@DG41TY ~/Downloads $ cp /mnt/n_/monthly/BD120605.zip ./BD120605.zip
cp: reading `/mnt/n_/monthly/BD120605.zip': Invalid argument
cp: failed to extend `./BD120605.zip': Invalid argument
dhdurgee@DG41TY ~/Downloads $ ls -l BD120605.zip
-rwxr-xr-x 1 dhdurgee dhdurgee 0 Jun 28 10:56 BD120605.zip

{used FC/L to copy file from mount to directory here}

dhdurgee@DG41TY ~/Downloads $ unzip -t ./BD120605.zip
Archive: ./BD120605.zip
testing: XH.TXT OK
testing: XD.TXT OK
testing: EQ2.CSV OK
testing: EQ.CSV OK
testing: LT.CSV OK
testing: IU.CSV OK
testing: BD2.CSV OK
testing: BD1.CSV OK
No errors detected in compressed data of ./BD120605.zip.
dhdurgee@DG41TY ~/Downloads $

So as you can see above, using unzip directly against the file on the share has problems, as does the cp command for some reason. Yet I can use FC/L to copy the file to a local directory and the copy is valid! Weird.
David, the problem you're seeing is almost certainly different from the one originally reported. Could you open a new bug with this info and cc me on it? I'm going to go ahead and close this as INVALID since the original problem was never reproducible and he was able to work around it with "directio".