Bug 4386 - timed out descriptors become bad ones
Status: CLOSED FIXED
Alias: None
Product: CifsVFS
Classification: Unclassified
Component: kernel fs
Version: 2.6
Hardware: x86 Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Steve French
Reported: 2007-02-09 10:54 UTC by Dworkin Muller
Modified: 2009-03-07 11:03 UTC

Description Dworkin Muller 2007-02-09 10:54:17 UTC
When reading the same file tree several times in parallel, I'm seeing
file I/O time out (EAGAIN), and soon afterwards the remote server
returns ERRbadfid for the affected descriptors.  This does not happen
with only one copy of the tree in progress, and only rarely with two.
Five at once triggers it very quickly, however.  To reproduce, run:

% tar cf /tmp/t.tar /shares/diskc/many-files & tar cf /tmp/t.tar /shares/diskc/many-files & tar cf /tmp/t.tar /shares/diskc/many-files & tar cf /tmp/t.tar /shares/diskc/many-files & tar cf /tmp/t.tar /shares/diskc/many-files & 

and monitor /var/log/messages.  You'll see things like:

Feb  9 09:06:27 cs220a kernel:  CIFS VFS: No response to cmd 46 mid 25849
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: No response to cmd 46 mid 25846
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: No response to cmd 46 mid 25848
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: Send error in read = -11
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: Send error in read = -11
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: Send error in read = -11
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: Send error in read = -9
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: Send error in read = -9
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: Send error in read = -9
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: Send error in Close = -9
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: Send error in Close = -9
Feb  9 09:06:27 cs220a kernel:  CIFS VFS: Send error in Close = -9

(-11 == EAGAIN, -9 == EBADF).  The EAGAINs always happen first.  The
EBADFs are always sent by the remote server.
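
For anyone decoding similar log lines, here is a minimal sketch of mapping the negative return codes back to their symbolic names. It uses only Python's standard errno and os modules; the helper name describe_errno is made up for this illustration, and the numeric values (11, 9) are the Linux ones quoted above.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Turn a kernel-style negative return code into 'code: NAME (message)'."""
    number = abs(code)
    # errno.errorcode maps the numeric value to its symbolic name on this platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{code}: {name} ({os.strerror(number)})"


if __name__ == "__main__":
    # The two codes seen in the log excerpt above (Linux numbering).
    for code in (-11, -9):
        print(describe_errno(code))
```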

Theory: a request times out and marks the connection for reconnect.
While a second request is still timing out, a different thread completes
the reconnect.  However, because it also saw a timeout, the second thread
marks the connection for rebuilding again.  Any requests issued on the
second connection are then orphaned by the second reconnect (the third
connection in total).
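
The sequence in this theory can be replayed with a small toy model. This is only a sketch of the theory as stated, not the CIFS client's actual logic: the Session class, its fields, and the "generation" counter are all hypothetical names invented for the illustration, and the model assumes the server rejects any file id that was opened on an earlier connection.

```python
import errno


class Session:
    """Toy model of one client session (hypothetical, not the kernel's structures)."""

    def __init__(self):
        self.generation = 1        # 1 = first connection, 2 = after one reconnect, ...
        self.need_reconnect = False
        self.fid_generation = {}   # file id -> connection generation that opened it

    def request_timed_out(self):
        # Every timeout marks the connection for rebuilding and reports EAGAIN.
        self.need_reconnect = True
        return -errno.EAGAIN

    def _reconnect_if_marked(self):
        if self.need_reconnect:
            self.generation += 1
            self.need_reconnect = False

    def open(self, fid):
        self._reconnect_if_marked()
        self.fid_generation[fid] = self.generation

    def read(self, fid):
        self._reconnect_if_marked()
        # Assumption: the server only knows file ids opened on the current connection.
        if self.fid_generation.get(fid) != self.generation:
            return -errno.EBADF
        return 0


def replay_theory():
    s = Session()
    events = []
    events.append(s.request_timed_out())  # request A times out on connection 1
    s.open(7)                             # another thread reconnects (connection 2), opens fid 7
    events.append(s.read(7))              # read succeeds on connection 2
    events.append(s.request_timed_out())  # late timeout of request B, sent on connection 1
    events.append(s.read(7))              # forces connection 3; fid 7 is now orphaned
    return s.generation, events


if __name__ == "__main__":
    print(replay_theory())
```

In this model the stale timeout is what forces the third connection, and the EBADF only shows up after an EAGAIN, which is the ordering seen in the log.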

My fumblings haven't been able to prove/fix things based on the above
theory, but it's the only one I've been able to come up with that explains
the behaviour I'm seeing.

Any help would be greatly appreciated....

Dworkin
Comment 1 shirishpargaonkar@gmail.com 2009-01-20 14:55:06 UTC
This is fixed in the mainline kernel by switching the sockets from
non-blocking to blocking mode.
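
For readers unfamiliar with the difference, here is a userspace sketch of how the two socket modes behave when no data has arrived yet. It uses Python's standard socket module on a local socketpair and is not the kernel code involved in the fix; the function name and the 0.2 second delay are arbitrary choices for the illustration.

```python
import errno
import socket
import threading
import time


def compare_socket_modes():
    """Return (errno from a non-blocking recv with no data, bytes from a blocking recv)."""
    a, b = socket.socketpair()
    try:
        # Non-blocking mode: recv() with nothing to read fails at once with EAGAIN.
        a.setblocking(False)
        try:
            a.recv(16)
            nonblocking_errno = None
        except BlockingIOError as exc:
            nonblocking_errno = exc.errno

        # Blocking mode: recv() waits until the (slow) peer answers.
        a.setblocking(True)

        def slow_peer():
            time.sleep(0.2)  # simulate a server that is slow to respond
            b.sendall(b"reply")

        peer = threading.Thread(target=slow_peer)
        peer.start()
        blocking_data = a.recv(16)
        peer.join()
        return nonblocking_errno, blocking_data
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print(compare_socket_modes())
```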
Comment 2 Steve French 2009-01-20 16:29:15 UTC
Fixed now, as Shirish notes.