I have updated my 2.6.14 kernel with the CIFS 1.39 patch taken from the git tree. The patch applied without any rejects. The good news is that it is apparently faster; unfortunately, it also prevents the machine from shutting down. When the init 6 scripts run umount -a and try to unmount the CIFS filesystems, I get the following messages:

  CIFS VFS : Calculated size 0x126 vs actual length 0x27
  CIFS VFS : Bad smb size detected for Mid = 204

cat /proc/fs/cifs/DebugData

  Display Internal CIFS Data Structures for Debugging
  ---------------------------------------------------
  CIFS Version 1.39
  Active VFS Requests: 0
  Servers:
  1) Name: 10.193.118.75 Domain: FTRD Mounts: 1 OS: Windows 5.0
     NOS: Windows 2000 LAN Manager Capability: 0xd3fd
     SMB session status: 1 TCP status: 1
     Local Users To Server: 1 SecMode: 0x3 Req On Wire: 0
     MIDs:
  2) Name: 10.193.21.130 Domain: FTRD Mounts: 1 OS: Windows 5.0
     NOS: Windows 2000 LAN Manager Capability: 0xd3fd
     SMB session status: 1 TCP status: 1
     Local Users To Server: 1 SecMode: 0x3 Req On Wire: 0
     MIDs:
  3) Name: 10.194.51.11 Domain: FTRD Mounts: 1 OS: Windows 5.0
     NOS: Windows 2000 LAN Manager Capability: 0xd3fd
     SMB session status: 1 TCP status: 1
     Local Users To Server: 1 SecMode: 0x3 Req On Wire: 0
     MIDs:
  Shares:
  1) \\p-europa\maps Uses: 1 Type: NTFS DevInfo: 0x20 Attributes: 0x4000f
     PathComponentMax: 255 Status: 1 type: DISK
  2) \\PINGOUIN\inter-drd Uses: 1 Type: NTFS DevInfo: 0x20 Attributes: 0x4000f
     PathComponentMax: 255 Status: 1 type: DISK
  3) \\r-canaris\ceva6380 Uses: 1 Type: NTFS DevInfo: 0x20 Attributes: 0x4000f
     PathComponentMax: 255 Status: 1 type: DISK
Here is a trace from manually unmounting the CIFS share \\p-europa\maps:

  umount /network/maps     <=== stuck here

dmesg shows:

  CIFS VFS: No response for cmd 114 mid 3112
  CIFS VFS: No response for cmd 50 mid 219
  CIFS VFS: No response for cmd 50 mid 12
  CIFS VFS: Send error in Close = -9
  CIFS VFS: Send error in Close = -9
  CIFS VFS: Send error in Close = -9
  CIFS VFS: Calculated size 0x126 vs actual length 0x27
  CIFS VFS: bad smb size detected for Mid=30
  Bad SMB: : dump of 48 bytes of data at 0xf73ae180
   00000023 424d53ff 00000074 00018800 # . . . ÿ S M B t . . . . . . .
   00000000 00000000 00000000 31100000 . . . . . . . . . . . . . . . 1
   001e0000 0000ff00 00f00000 00000001 . . . . . ÿ . . . . ð . . . . .
Also happens with 2.6.15-rc2-git1
Created attachment 1585: Umount trace

It is not easy to get the backtrace while the mount command is hung.
Very odd - I see a malformed frame sent from your server: an SMB logoffx command with the minimum SMB size (0x23 = 0x20 bytes of header + 1 byte "word count" + 2 bytes "byte count", i.e. the length of the data area).

For an SMB of this size wct and bcc must both be zero, but bcc has one corrupt byte set to 0xFF.

I could relax the check by special-casing minimum-size SMB frames and zeroing their bcc area explicitly if the server corrupted it, but I am also interested in where the umount or mount is hung. I don't see that in your dmesg, so I assume that your kernel message buffer (whose size is a kernel config compile option in the debugging section of the kernel configure menu) was too small to hold the stack traces.
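For anyone who wants to check the arithmetic, here is a rough Python sketch that decodes the 48-byte "Bad SMB" dump from the dmesg output and reproduces the two numbers in the error message. It is not the kernel's check. It assumes the dump was printed as 32-bit words in little-endian host order (the character column, "# . . . ÿ S M B t", suggests this) and that the 4-byte length prefix had already been converted to host order in memory. The function name decode_dump and the constants are made up for this illustration.

```python
import struct

# The twelve 32-bit words of the "Bad SMB" dump, in the order printed.
DUMP_WORDS = [
    0x00000023, 0x424D53FF, 0x00000074, 0x00018800,
    0x00000000, 0x00000000, 0x00000000, 0x31100000,
    0x001E0000, 0x0000FF00, 0x00F00000, 0x00000001,
]

RFC1001_HEADER_LEN = 4   # length prefix in front of the SMB itself
SMB_HEADER_LEN = 0x20    # fixed SMB header size quoted in the comment above
MID_OFFSET = RFC1001_HEADER_LEN + 30  # mid is the last 2 bytes of the header


def decode_dump(words):
    """Decode the dumped words into the fields the size check relies on."""
    # Assumption: the words were printed in little-endian host order.
    raw = struct.pack("<%dI" % len(words), *words)
    (length_field,) = struct.unpack_from("<I", raw, 0)
    wct = raw[RFC1001_HEADER_LEN + SMB_HEADER_LEN]
    # bcc follows the word count byte and wct 16-bit parameter words.
    bcc_offset = RFC1001_HEADER_LEN + SMB_HEADER_LEN + 1 + 2 * wct
    (bcc,) = struct.unpack_from("<H", raw, bcc_offset)
    (mid,) = struct.unpack_from("<H", raw, MID_OFFSET)
    return {
        "magic": raw[4:8],
        "command": raw[8],
        "mid": mid,
        "wct": wct,
        "bcc": bcc,
        # What the server actually sent: prefix plus the announced length.
        "actual": RFC1001_HEADER_LEN + length_field,
        # What the frame claims to need once bcc is taken at face value.
        "calculated": bcc_offset + 2 + bcc,
    }


fields = decode_dump(DUMP_WORDS)
print("cmd 0x%x mid %d wct %d bcc 0x%x"
      % (fields["command"], fields["mid"], fields["wct"], fields["bcc"]))
print("calculated 0x%x vs actual 0x%x"
      % (fields["calculated"], fields["actual"]))
# prints: cmd 0x74 mid 30 wct 0 bcc 0xff
# prints: calculated 0x126 vs actual 0x27
```

Under those assumptions the frame carries command 0x74 with mid 30, wct is 0, and the byte count reads 0xff, so the frame claims 0x126 bytes while only 0x27 arrived, which matches the logged message.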
(In reply to comment #4)
> Very odd - I see a malformed frame sent from your server - an SMB logoffx
> command with the minimum smb size (0x23 = 0x20 bytes of header + 1 byte "word
> count" + 2 bytes "byte count" (ie length of the data area)
>
> For an SMB of this size wct and bcc must be zero - but bcc has one corrupt byte
> set to 0xFF.
>
> I could relax the check by special casing minimum size SMB frames and zeroing
> their bcc area explicity if the server corrupted it, but I am also interested
> in where the umount or mount is hung - I don't see that in your dmesg so I
> assume that your kernel message buffer (which is a kernel config compile option
> in the debugging section of the Kernel configure menu) was too small to hold
> the stacktraces.
>

This makes me really wonder how you test your code! Please use a real-life configuration, that is, Windows servers and not Samba servers. If I had a Unix server in front of me, I bet I would be using NFS.
Just to be sure, I asked the admins about the CIFS servers I am using for my various filesystems. The file servers are NetApp servers with the following OS versions installed:

  r-canaris   6.5.2R1
  pingouin    6.5.2R1P12
  p-europa    6.5.4
Clearly a server bug (illegal length), but the last change I made should help in two regards:

1) umount --force will now wake up all blocked commands on the responseq (it had only been waking up those waiting to be sent, on the requestq). This kicks the timeout logic in waitevent and should cause the logoff to complete. Later I am going to add periodic wake_up calls for that queue, about every 5 or 10 seconds, which will help prevent the need for this.

2) The easier-to-recreate bug (Windows XP sending too much data on a FindFirst SMB response and causing an illegal smb length message) is fixed. cifs now allows an illegal length as long as the actual tcp data sent is as long as or longer than the smb (within reason: it may be no more than 512 bytes too long); the junk at the end is ignored.

In your case I think you have a worse corruption: if I trusted the server here I would go beyond the end of the frame, which is a very dangerous thing to allow in the general case and could be a major security exposure (in this case it can be worked around, but not in the general case of a frame that is too short).

I don't have a NetApp server on which to recreate the server bug, but I really would like a short ethereal or tcpdump trace (network capture) so I could double-check the log data. In any case shutdown and umount --force will work, so this will be better. I also presume that you could open a bug report against NetApp. I do not mind finding a workaround, but I don't want to check something in to mainline until someone can volunteer to test it.
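To make the relaxed length rule concrete, here is a minimal Python sketch of the rule as described in this comment. It is not the code that went into cifs; the function name smb_length_acceptable and the constant MAX_TRAILING_JUNK are hypothetical, and only the 512-byte limit comes from the comment.

```python
MAX_TRAILING_JUNK = 512  # from the comment: no more than 512 bytes too long


def smb_length_acceptable(calculated, actual, slack=MAX_TRAILING_JUNK):
    """Return True if a frame may be processed under the relaxed rule.

    calculated: size the SMB claims, derived from its wct and bcc fields
    actual:     number of bytes that really arrived over TCP
    """
    if actual < calculated:
        # Frame too short: trusting bcc would read past the end of the data.
        return False
    # Extra bytes at the end are tolerated, within reason, and ignored.
    return actual - calculated <= slack


# Windows XP style case: some junk after the SMB is tolerated.
print(smb_length_acceptable(0x100, 0x100 + 40))   # True
# The frame from this report: claims 0x126 bytes, only 0x27 arrived.
print(smb_length_acceptable(0x126, 0x27))         # False
```

The asymmetry is the point: too much data can be ignored safely, while too little data means the length fields cannot be trusted.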
Patch is now in the cifs development tree http://kernel.org/git/?p=linux/kernel/git/sfrench/cifs-2.6.git;a=commitdiff;h=3cc75176039474820e4cdf6a6938673628715974;hp=757521d3eef271a406ea115eeb70fb73d713b246
I'm afraid the fix is not quite right. I used 2.6.15-rc3 plus the CIFS patch mentioned in comment #8, and the bug is still there. I took a backtrace by hand:

  kernel_sendmsg
  smb_send
  SendReceive
  small_smb_init
  autoremove_wake_function
  CIFSSMBLogoff
  cifs_umount
  destroy_inode
  cifs_put_super
  generic_shutdown_super
  deactivate_super
  sys_umount
The kernel trace buffer was too small in your trace attachment to catch the process of interest (umount), but the call stack you just appended has a few plausible symbols. small_smb_init cannot call SendReceive, so the call stack is not exact, but it would also be very odd for this to block in kernel_sendmsg (which is outside of cifs, in the tcp code). If that were the case, I wonder whether the NetApp server had prematurely killed the session and whether I should be adding a check for a dead tcp session before line 539 in CIFSSMBLogoff in fs/cifs/cifssmb.c.

Similarly, at about line 3309 in fs/cifs/connect.c in cifs_umount, we could add a check for whether the SMB and TCP sessions are dead before calling CIFSSMBLogoff, but I think the check has to be in CIFSSMBLogoff, after the usecount check.
Found a NetApp server. Verified that the bug does exist in their server: bcc (the length of the data area of the smb) really is larger than the total tcp length, which is illegal. The cifs code does handle this now - the dnotify thread wakes up in about 15 seconds and the request times out. The message is of course logged to the kernel error log, as it is a server bug and could be the kind of thing that a rogue server might try to do to corrupt a client (although that is not the case here). Due to SMB signing issues, it is not really possible to change the frame on the fly and try to "fix" the smb by overwriting the bad byte.

The fix to not hang umount is in cifs 1.39a or later and in cifs-2.6.git.

Let me know if you see any subsequent problems.
(In reply to comment #11)
> Found a netapp server. Verified that the bug does exist in their server. bcc
> (length of the data area of the smb) really is larger than the total tcp length
> which is illegal. cifs code does handle this now - dnotify thread wakes up in
> about 15 seconds and the request times out. The message is logged to the
> kernel error log of course as it is a server bug and could be the kind of thing
> that a roque server might try to do to corrupt a client (although that is not
> the case here). Due to SMB signing issues, it is not really possible to change
> the frame on the fly and try to "fix" the smb by overwriting the bad byte. The
> fix to not hang umount is in cifs 1.39a or later and in cifs-2.6.git
>
> Let me know if you see any subsequent problems.
>

I am not sure I understand your answer about the --force option: if I use fstab to mount the network filesystem, I am not sure I can control the umount options. What can I do in that case?
I don't know if you can specify force on umounts done via fstab (although I suspect that root can still specify "umount /mnt --force" and it will work no matter how the filesystem was mounted), but I thought shutdown did it that way if umount did not respond (not certain). In any case, it is not necessary to specify --force: despite the corrupt response from the server, umount will time out after between 15 and 30 seconds.

I did find two other likely NetApp server bugs, although I am not certain how recent the test server is (on the wire their server returns Windows 2000 as its OS and NOS version rather than their actual operating system version). First, they return the wrong return code on NTRename (which is used to create hard links from Windows clients and of course from the Linux cifs client): instead of returning "error not supported" they return bad directory (apparently they do not support hard links via cifs, perhaps only via nfs at the moment). Second, I see one additional server problem, which I believe is the server failing to allow rename of open files (again, there is at least one way in which this is allowed and can occur from Windows clients, although it is not as common as it is from the Linux cifs client).

But in general it did fine in the tests that I ran against it yesterday. There may be newer versions of their server which are even better.
Fix is also in 2.6.15-rc4 released yesterday evening.