Bug 5911 - Data loss while writing with OOo over slow networks (SDSL 1MBit)
Status: RESOLVED WORKSFORME
Product: CifsVFS
Classification: Unclassified
Component: kernel fs
Version: 2.6
Hardware: x86 Linux
Importance: P3 major
Assigned To: Steve French
Reported: 2008-11-20 08:22 UTC by Dominik Fischer
Modified: 2009-08-24 18:16 UTC (History)
Description Dominik Fischer 2008-11-20 08:22:46 UTC
A Linux client with an SDSL 1MBit network connection (symmetric DSL with 1MBit upstream and 1MBit downstream) mounts a share from a Windows 2003 fileserver via cifs.
After opening and editing a (medium-sized) document (e.g. 10MB), the write
fails after approx. 3min with an EA-error. After clicking "ok" on the dialog
it takes another 3min, then the same error is shown again. After clicking
on "ok" again, a third EA-error dialog pops up immediately.

At this point the file is unreadable. Every open on this file with OOo gives
an EA-Error, so the data is lost!

I have a tcpdump in which you can see that, after a while, a second process
(pdflush) writes over the same fid as the OOo process. These write operations
were blocked by the server because the file is locked.

Can I somehow attach the tcpdump to this ticket?
Comment 1 Dominik Fischer 2008-11-21 03:07:38 UTC
TCPDUMP at http://www.digitalparanoid.de/cifs-ooo-dsl-write.pcap
Comment 2 Shirish S. Pargaonkar 2009-03-16 08:57:32 UTC
A couple of questions: what is an EA error and what is OOo?
What is the version of cifs that is being used?  Would you be able to try
to recreate the bug with a mainline kernel, i.e. something like 2.6.29?

You can clear the syslog buffer (dmesg -c), turn on cifs debugging by doing this
 echo 7 > /proc/fs/cifs/cifsFYI
recreate the problem, and send the output of the dmesg command as soon as the
error is noticed (i.e. when the very first EA error is encountered)!
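
The capture steps above could be wrapped in two small helpers, roughly as sketched below. This is only an illustration, to be run as root: the function names, the CIFSFYI and OUT variables, and the log path are assumptions and not part of the original request. Writing 0 back to cifsFYI afterwards turns the extra logging off again.

```shell
#!/bin/sh
# Sketch only: CIFSFYI and OUT are illustrative variables (assumptions).
CIFSFYI="${CIFSFYI:-/proc/fs/cifs/cifsFYI}"
OUT="${OUT:-/tmp/cifs-debug.log}"

# Call before reproducing the problem.
start_capture() {
    dmesg -c > /dev/null      # clear the kernel ring buffer
    echo 7 > "$CIFSFYI"       # turn on verbose cifs debugging
}

# Call as soon as the first EA error shows up.
stop_capture() {
    dmesg > "$OUT"            # save everything logged since start_capture
    echo 0 > "$CIFSFYI"       # turn cifs debugging off again
}
```

Usage would be: start_capture, reproduce the failing save in OOo, then stop_capture and attach the resulting log file.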
Comment 3 Dominik Fischer 2009-03-16 12:44:25 UTC
EA-Error -> Input/Output-Error
OOo -> OpenOffice.org
CIFS kernel module has version number 1.50cRH.

I only have a Red Hat Enterprise Linux Version 5 system, which does not include
kernel version 2.6.29. I've tested with kernel version 2.6.18-78 from Red Hat.

I can try to reproduce this error with the latest RHEL Kernel. It includes
version 1.54RH of the CIFS module. Would this help?

Did you look at the tcpdump output? IMHO it seems to be a bug concerning caching and CIFS. Within the CIFS data structure the PID should not change if pdflush writes data to the server. The fileserver has locked the file for the first PID. Attempts to write data to the same file with a different PID result in a "file is locked" error.
Comment 4 Shirish S. Pargaonkar 2009-03-16 14:30:51 UTC
(In reply to comment #3)

> I can try to reproduce this error with the latest RHEL Kernel. It includes
> version 1.54RH of the CIFS module. Would this help?

I think that would be great.  The socket behaviour has changed from
non-blocking to blocking between those versions, and there are a couple of
other data integrity fixes in the cifs module.
So attempting to reproduce with 1.54RH of the cifs module would be really helpful.

Going through the tcpdump output meanwhile...

Comment 5 Shirish S. Pargaonkar 2009-03-16 21:47:37 UTC
After going through the tcpdump, what I see is that for some (4K) byte ranges
the write is successful, but for some the write fails with STATUS_FILE_LOCK_CONFLICT.

I also see that just before the file (/temp/cifstest/test-large.ppt) is opened
and written, with some writes succeeding and some failing with
STATUS_FILE_LOCK_CONFLICT, the file was opened, the entire file was locked
and unlocked, and then closed.  So I do not know how the locks on certain
byte ranges were placed in between (between fid 0xc02e and fid 0x405f).
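
To pick the failing requests out of the capture, something like the following sketch might help. It assumes a tshark build that supports the -Y display filter option, and the function name is made up for illustration. 0xc0000054 is the NT status code for STATUS_FILE_LOCK_CONFLICT; smb.pid and smb.fid are the Wireshark field names for the SMB process ID and file ID.

```shell
# Sketch only: lock_conflicts is an illustrative helper name (assumption).
# Prints frame number, SMB PID and FID for every lock-conflict response
# found in the capture file given as the first argument.
lock_conflicts() {
    tshark -r "$1" -Y 'smb.nt_status == 0xc0000054' \
        -T fields -e frame.number -e smb.pid -e smb.fid
}
```

Called as lock_conflicts cifs-ooo-dsl-write.pcap, this would list the frames to compare against the PID that holds the lock.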

You can also try using the forcedirectio mount option during the cifs mount and
then attempt what you are attempting.
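
A mount with that option might look roughly like the sketch below. The helper name and its arguments (server, share, mount point, user) are placeholders chosen for illustration, not values from this report; it needs root and will prompt for the password.

```shell
# Sketch only: mount_directio and its arguments are hypothetical.
# $1 = server, $2 = share, $3 = mount point, $4 = user name
mount_directio() {
    mount -t cifs "//$1/$2" "$3" -o "user=$4,forcedirectio"
}
```

For example, mount_directio fileserver share /mnt/share someuser, followed by the same open, edit and save sequence in OOo.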
Comment 6 Steve French 2009-08-24 18:16:56 UTC
No response in a while. Please reopen if you can reproduce this, and we can look at whether the tcp socket retry changes fixed this.