When reading the same file tree multiple times in parallel, I'm seeing file I/O time out (EAGAIN), and soon after that the remote server returns ERRbadfid for the relevant descriptors. This does not happen if I only have one copy of the tree in progress, and only rarely with two. Five at once triggers it very quickly, however.

To reproduce, run:

    % tar cf /tmp/t.tar /shares/diskc/many-files &
      tar cf /tmp/t.tar /shares/diskc/many-files &
      tar cf /tmp/t.tar /shares/diskc/many-files &
      tar cf /tmp/t.tar /shares/diskc/many-files &
      tar cf /tmp/t.tar /shares/diskc/many-files &

and monitor /var/log/messages. You'll see things like:

    Feb 9 09:06:27 cs220a kernel: CIFS VFS: No response to cmd 46 mid 25849
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: No response to cmd 46 mid 25846
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: No response to cmd 46 mid 25848
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: Send error in read = -11
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: Send error in read = -11
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: Send error in read = -11
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: Send error in read = -9
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: Send error in read = -9
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: Send error in read = -9
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: Send error in Close = -9
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: Send error in Close = -9
    Feb 9 09:06:27 cs220a kernel: CIFS VFS: Send error in Close = -9

(-11 == EAGAIN, -9 == EBADF). The EAGAINs always happen first. The EBADFs are always sent by the remote server.

Theory: it looks like a request times out and marks the connection for reconnect. While another request is still timing out, the reconnect is completed by a different thread. However, since its own request timed out, the second thread marks the connection for rebuilding again. Any requests generated during the second connection are then orphaned by the second reconnect (the third connection in total).
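The sequence in the theory above can be written down as a small, single-threaded toy model. This is only a sketch of the suspected ordering, not the CIFS client code: `ToyServer`, `ToyClient`, `on_timeout`, `reconnect_if_needed` and `run_scenario` are all hypothetical names, and the assumption that a file id is only valid on the session that opened it is taken from the theory, not verified against the server.

```python
import errno


class ToyServer:
    """Hypothetical server: a file id is valid only on the session that opened it."""

    def __init__(self):
        self.session = 0
        self.open_fids = set()

    def new_session(self):
        # A (re)connect starts a fresh session and drops every earlier fid.
        self.session += 1
        self.open_fids = set()
        return self.session

    def open(self, name):
        fid = (self.session, name)
        self.open_fids.add(fid)
        return fid

    def read(self, fid):
        return 0 if fid in self.open_fids else -errno.EBADF


class ToyClient:
    """Hypothetical client that marks the connection on every timeout."""

    def __init__(self, server):
        self.server = server
        self.need_reconnect = False
        self.session = server.new_session()  # connection 1

    def on_timeout(self):
        # Assumed behaviour: any timeout marks the connection for reconnect,
        # even if the request was sent on an older connection.
        self.need_reconnect = True
        return -errno.EAGAIN

    def reconnect_if_needed(self):
        if self.need_reconnect:
            self.session = self.server.new_session()
            self.need_reconnect = False


def run_scenario():
    server = ToyServer()
    client = ToyClient(server)
    events = []

    # Two reads are in flight on connection 1. The first one times out.
    events.append(("read A", client.on_timeout()))
    client.reconnect_if_needed()            # another thread: connection 2

    fid = server.open("file")               # new request on connection 2

    # The second read from connection 1 times out late and marks again.
    events.append(("read B", client.on_timeout()))
    client.reconnect_if_needed()            # connection 3, fid is orphaned

    events.append(("read file", server.read(fid)))
    return events, client.session


if __name__ == "__main__":
    events, sessions = run_scenario()
    for name, rc in events:
        print(name, rc)
    print("connections used:", sessions)
```

In this model the two timeouts come back as -EAGAIN and the later read on the orphaned file id comes back as -EBADF, the same ordering as in the log (on Linux these are -11 and -9).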
My fumblings haven't been able to prove/fix things based on the above theory, but it's the only one I've been able to come up with that explains the behaviour I'm seeing. Any help would be greatly appreciated.

Dworkin
This is fixed in the mainline kernel by switching the sockets from non-blocking to blocking mode.
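For readers unfamiliar with the difference, here is a minimal userspace sketch of why a non-blocking socket can surface EAGAIN while a blocking one simply waits. It uses an ordinary local socket pair and is only an illustration; it is not the kernel change itself, and the function names are made up for the example.

```python
import errno
import socket
import threading


def recv_nonblocking_errno():
    """Read from a non-blocking socket before any data has arrived."""
    a, b = socket.socketpair()
    try:
        a.setblocking(False)
        try:
            a.recv(16)
        except BlockingIOError as exc:
            # No data yet: the read fails immediately instead of waiting.
            return exc.errno
        return 0
    finally:
        a.close()
        b.close()


def recv_blocking_waits(delay=0.05):
    """Read from a blocking socket whose peer replies after a short delay."""
    a, b = socket.socketpair()
    # The "slow server": sends its reply only after `delay` seconds.
    sender = threading.Timer(delay, b.sendall, args=(b"reply",))
    sender.start()
    try:
        # Blocking mode (the default): recv sleeps until the reply shows up.
        return a.recv(16)
    finally:
        sender.join()
        a.close()
        b.close()


if __name__ == "__main__":
    print("non-blocking errno:", errno.errorcode[recv_nonblocking_errno()])
    print("blocking recv got:", recv_blocking_waits())
```

The first function fails straight away with EAGAIN because nothing has arrived yet; the second waits for the delayed reply and returns it.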
Fixed now, as Shirish notes.