Bug 3237 - CIFS 1.39 prevents shutdown
Summary: CIFS 1.39 prevents shutdown
Status: RESOLVED FIXED
Alias: None
Product: CifsVFS
Classification: Unclassified
Component: kernel fs
Version: 2.6
Hardware: x86 Linux
Importance: P3 critical
Target Milestone: ---
Assignee: Steve French
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-03 06:39 UTC by Eric Valette
Modified: 2005-12-01 09:34 UTC

See Also:


Attachments
Umount trace (15.06 KB, text/plain)
2005-11-21 10:44 UTC, Eric Valette

Description Eric Valette 2005-11-03 06:39:23 UTC
I have updated my 2.6.14 kernel with the CIFS 1.39 patch taken from the git tree.
The patch applied without any rejects. The good news is that it is apparently
faster, but unfortunately it also prevents shutdown.

When trying to unmount the CIFS-mounted filesystems as a result of the umount -a
done by the init 6 scripts, I get the following messages:

CIFS VFS : Calculated size 0x126 vs actual length 0x27
CIFS VFS : Bad smb size detected for Mid = 204

cat /proc/fs/cifs/DebugData
Display Internal CIFS Data Structures for Debugging
---------------------------------------------------
CIFS Version 1.39
Active VFS Requests: 0
Servers:
1) Name: 10.193.118.75  Domain: FTRD Mounts: 1 OS: Windows 5.0
        NOS: Windows 2000 LAN Manager   Capability: 0xd3fd
        SMB session status: 1   TCP status: 1
        Local Users To Server: 1 SecMode: 0x3 Req On Wire: 0
MIDs:

2) Name: 10.193.21.130  Domain: FTRD Mounts: 1 OS: Windows 5.0
        NOS: Windows 2000 LAN Manager   Capability: 0xd3fd
        SMB session status: 1   TCP status: 1
        Local Users To Server: 1 SecMode: 0x3 Req On Wire: 0
MIDs:

3) Name: 10.194.51.11  Domain: FTRD Mounts: 1 OS: Windows 5.0
        NOS: Windows 2000 LAN Manager   Capability: 0xd3fd
        SMB session status: 1   TCP status: 1
        Local Users To Server: 1 SecMode: 0x3 Req On Wire: 0
MIDs:

Shares:
1) \\p-europa\maps Uses: 1 Type: NTFS DevInfo: 0x20 Attributes: 0x4000f
PathComponentMax: 255 Status: 1 type: DISK
2) \\PINGOUIN\inter-drd Uses: 1 Type: NTFS DevInfo: 0x20 Attributes: 0x4000f
PathComponentMax: 255 Status: 1 type: DISK
3) \\r-canaris\ceva6380 Uses: 1 Type: NTFS DevInfo: 0x20 Attributes: 0x4000f
PathComponentMax: 255 Status: 1 type: DISK
Comment 1 Eric Valette 2005-11-18 03:52:38 UTC
Here is a trace when manually unmounting a CIFS share
\\p-europa\maps

umount /network/maps  <=== stuck there

dmesg shows : 


 CIFS VFS: No response for cmd 114 mid 3112
 CIFS VFS: No response for cmd 50 mid 219
 CIFS VFS: No response for cmd 50 mid 12
 CIFS VFS: Send error in Close = -9
 CIFS VFS: Send error in Close = -9
 CIFS VFS: Send error in Close = -9
 CIFS VFS: Calculated size 0x126 vs actual length 0x27
 CIFS VFS: bad smb size detected for Mid=30
Bad SMB: : dump of 48 bytes of data at 0xf73ae180

 00000023 424d53ff 00000074 00018800 # . . . ÿ S M B t . . . . . . .
 00000000 00000000 00000000 31100000 . . . . . . . . . . . . . . . 1
 001e0000 0000ff00 00f00000 00000001 . . . . . ÿ . . . . ð . . . . .


Comment 2 Eric Valette 2005-11-21 05:21:44 UTC
Also happens with 2.6.15-rc2-git1
Comment 3 Eric Valette 2005-11-21 10:44:06 UTC
Created attachment 1585
Umount trace

Not easy to find the backtrace while the mount command is hung
Comment 4 Steve French 2005-11-21 14:12:32 UTC
Very odd - I see a malformed frame sent from your server: an SMB logoffx command with the minimum SMB size (0x23 = 0x20 bytes of header + 1 byte "word count" + 2 bytes "byte count", i.e. the length of the data area).

For an SMB of this size, wct and bcc must be zero, but bcc has one corrupt byte set to 0xFF.

I could relax the check by special-casing minimum-size SMB frames and zeroing their bcc area explicitly if the server corrupted it, but I am also interested in where the umount or mount is hung. I don't see that in your dmesg, so I assume that your kernel message buffer (its size is a kernel config compile option in the debugging section of the kernel configure menu) was too small to hold the stack traces.
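
The size arithmetic above can be checked against the 48-byte dump in comment 1. The following is a minimal Python sketch, not the kernel's code. It assumes that the dump prints 32-bit words in little-endian host order, that the first word is the length field already converted to host order, and that the frame follows the usual SMB1 layout (4-byte length prefix, 32-byte header, 1-byte word count, 2-byte byte count). The function and constant names are made up for illustration.

```python
import struct

# Words copied from the "Bad SMB" dump in comment 1
# (assumed to be printed in little-endian host order).
DUMP_WORDS = [
    0x00000023, 0x424d53ff, 0x00000074, 0x00018800,
    0x00000000, 0x00000000, 0x00000000, 0x31100000,
    0x001e0000, 0x0000ff00, 0x00f00000, 0x00000001,
]

LEN_PREFIX = 4    # length field in front of the SMB header
SMB_HEADER = 32   # 0x20 bytes of SMB header


def describe_frame(words):
    """Return the fields needed to compare calculated and actual frame size."""
    buf = struct.pack("<%dI" % len(words), *words)
    smb_len = struct.unpack_from("<I", buf, 0)[0]            # length of the SMB itself
    command = buf[LEN_PREFIX + 4]                            # byte after "\xffSMB"
    mid = struct.unpack_from("<H", buf, LEN_PREFIX + 30)[0]  # last header field
    wct_off = LEN_PREFIX + SMB_HEADER
    wct = buf[wct_off]                                       # word count
    bcc = struct.unpack_from("<H", buf, wct_off + 1 + 2 * wct)[0]  # byte count
    return {
        "magic": buf[LEN_PREFIX:LEN_PREFIX + 4],
        "command": command,
        "mid": mid,
        "wct": wct,
        "bcc": bcc,
        "actual": LEN_PREFIX + smb_len,
        "calculated": wct_off + 1 + 2 * wct + 2 + bcc,
    }


frame = describe_frame(DUMP_WORDS)
print("cmd 0x%x mid %d wct %d bcc 0x%x" % (
    frame["command"], frame["mid"], frame["wct"], frame["bcc"]))
print("calculated 0x%x vs actual 0x%x" % (frame["calculated"], frame["actual"]))
```

Under those assumptions the sketch reproduces the logged values (calculated 0x126, actual 0x27, Mid 30): the 0xFF byte in bcc makes the frame claim 255 bytes of data that were never sent.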
Comment 5 Eric Valette 2005-11-22 03:00:42 UTC
(In reply to comment #4)
> Very odd - I see a malformed frame sent from your server - an SMB logoffx
> command with the minimum smb size (0x23 = 0x20 bytes of header + 1 byte "word
> count" + 2 bytes "byte count" (ie length of the data area)
> 
> For an SMB of this size wct and bcc must be zero - but bcc has one corrupt byte
> set to 0xFF.
> 
> I could relax the check by special casing minimum size SMB frames and zeroing
> their bcc area explicity if the server corrupted it, but I am also interested
> in where the umount or mount is hung - I don't see that in your dmesg so I
> assume that your kernel message buffer (which is a kernel config compile option
> in the debugging section of the Kernel configure menu) was too small to hold
> the stacktraces.
> 


This makes me really wonder how you test your code!!! Please use real-life configurations, that is, Windows servers and not Samba servers. If I had a Unix machine in front of me, I bet I would be using NFS.

 
Comment 6 Eric Valette 2005-11-25 06:58:09 UTC
Just to be sure, I asked the admins about the CIFS servers I'm using for my
various filesystems: the file servers are NetApp servers with the
following OS versions installed:

r-canaris 6.5.2R1
pingouin  6.5.2R1P12
p-europa  6.5.4

Comment 7 Steve French 2005-11-28 23:42:00 UTC
Clearly a server bug (illegal length), but the last change I made should help in two regards:

1) umount --force will now wake up all blocked commands on the responseq (it had only been waking up those waiting to be sent, on the requestq), which kicks the timeout logic in waitevent and should cause the logoff to complete. Later I am going to add periodic wake_up calls for that queue, about every 5 or 10 seconds, which will help prevent the need for this.

2) The easier-to-recreate bug (Windows XP sending too much data on a FindFirst SMB response and causing an illegal SMB length message) is fixed: cifs now allows an illegal length as long as the actual TCP data sent is as long as or longer than the SMB (within reason: it can be no more than 512 bytes too long), and the junk at the end is ignored. In your case I think you have a worse corruption: if I trusted the server here I would go beyond the end of the frame, which is a very dangerous thing to allow in the general case and could be a major security exposure (in this case it can be worked around, but not in the general case of a frame that is too short).
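
The relaxed check described in point 2 can be sketched as follows. This is a hypothetical Python illustration, not the code in fs/cifs: the function name, the argument names and the `MAX_TRAILING_JUNK` constant are assumptions. Only the 512-byte limit and the rule that a too-short frame is never accepted come from the description above.

```python
MAX_TRAILING_JUNK = 512  # limit quoted in the comment above


def smb_length_acceptable(actual_len, calculated_len):
    """Decide whether a received frame may be processed.

    actual_len     -- bytes actually received for the frame
    calculated_len -- size implied by the header, word count and byte count
    """
    if actual_len < calculated_len:
        # Trusting the header would mean reading past the received data.
        return False
    # Exact match, or a bounded amount of trailing junk that gets ignored.
    return actual_len - calculated_len <= MAX_TRAILING_JUNK


# The frame from this bug: the header claims 0x126 bytes, only 0x27 arrived.
print(smb_length_acceptable(0x27, 0x126))
# A response padded with 100 extra bytes.
print(smb_length_acceptable(0x126 + 100, 0x126))
```

The asymmetry is the point of the design: extra bytes can be ignored safely, while missing bytes cannot be invented, so the frame reported here is still rejected.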

I don't have a NetApp server on which to recreate the server bug, but I would really like a short ethereal or tcpdump trace (network capture) so I could double-check the log data.

In any case shutdown and umount --force will work, so this will be better. I also presume that you could open a bug report against NetApp.

I do not mind finding a workaround, but I don't want to check in something to mainline until someone can volunteer to test it.
Comment 9 Eric Valette 2005-11-29 08:16:49 UTC
I'm afraid the fix is not quite right. I used 2.6.15-rc3 + the CIFS patch as mentioned in comment #8. The bug is still there.

I have taken a manual backtrace:

kernel_sendmsg
smb_send
SendReceive
small_smb_init
autoremove_wake_function
CIFSSMBLogoff
cifs_umount
destroy_inode
cifs_put_super
generic_shutdown_super
deactivate_super
sys_umount

Comment 10 Steve French 2005-11-29 10:27:07 UTC
The kernel trace buffer was too small in your trace attachment to catch the process of interest (umount), but the call stack you just appended has a few plausible symbols. small_smb_init cannot call SendReceive, so the call stack is not exact, but it would also be very odd for this to block in kernel_sendmsg (which is outside of cifs, in the TCP code). If that were the case, I wonder whether the NetApp server had prematurely killed the session and whether I should add a check for a dead TCP session before line 539 in CIFSSMBLogoff in fs/cifs/cifssmb.c. Similarly, at about line 3309 in fs/cifs/connect.c in cifs_umount, we could add a check for whether the SMB and TCP sessions are dead before calling CIFSSMBLogoff, but I think the check has to be in CIFSSMBLogoff after the usecount check.
Comment 11 Steve French 2005-11-30 15:53:47 UTC
Found a NetApp server and verified that the bug does exist in their server: bcc (the length of the data area of the SMB) really is larger than the total TCP length, which is illegal. The cifs code handles this now: the dnotify thread wakes up in about 15 seconds and the request times out. The message is of course logged to the kernel error log, as it is a server bug and could be the kind of thing that a rogue server might try to do to corrupt a client (although that is not the case here). Due to SMB signing issues, it is not really possible to change the frame on the fly and try to "fix" the SMB by overwriting the bad byte. The fix to not hang umount is in cifs 1.39a or later and in cifs-2.6.git.

Let me know if you see any subsequent problems.   
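
The waiting behaviour described here and in comment 7 (a blocked request gives up after a timeout, and a forced unmount wakes every waiter) can be illustrated with a small user-space analogy. This is a Python sketch built on `threading.Condition`, not kernel code. The class and method names are hypothetical, and the 15-second default only mirrors the figure quoted above.

```python
import threading


class ResponseQueue:
    """Toy stand-in for a queue of requests that await a server reply."""

    def __init__(self):
        self._cond = threading.Condition()
        self._responses = {}
        self._forced = False

    def wait_for_response(self, mid, timeout=15.0):
        """Block until a reply for `mid` arrives, a forced wake-up happens,
        or `timeout` seconds pass. Returns the reply, or None if there is none."""
        with self._cond:
            self._cond.wait_for(
                lambda: mid in self._responses or self._forced,
                timeout=timeout,
            )
            return self._responses.pop(mid, None)

    def deliver(self, mid, payload):
        """Called by the receiving side when a valid reply has been read."""
        with self._cond:
            self._responses[mid] = payload
            self._cond.notify_all()

    def force_wake(self):
        """Analogue of a forced unmount: release every blocked waiter."""
        with self._cond:
            self._forced = True
            self._cond.notify_all()


queue = ResponseQueue()
# No reply is ever delivered for mid 30, so this returns once the timeout expires.
print(queue.wait_for_response(mid=30, timeout=0.1))
```

A rejected frame never reaches `deliver`, so in this model the waiter for the logoff reply returns only through the timeout or through `force_wake`.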
Comment 12 Eric Valette 2005-12-01 03:10:33 UTC
(In reply to comment #11)
> Found a netapp server.  Verified that the bug does exist in their server.   bcc
> (length of the data area of the smb) really is larger than the total tcp length
> which is illegal.   cifs code does handle this now - dnotify thread wakes up in
> about 15 seconds and the request times out.   The message is logged to the
> kernel error log of course as it is a server bug and could be the kind of thing
> that a roque server might try to do to corrupt a client (although that is not
> the case here).  Due to SMB signing issues, it is not really possible to change
> the frame on the fly and try to "fix" the smb by overwriting the bad byte.  The
> fix to not hang umount is in cifs 1.39a or later and in cifs-2.6.git
> 
> Let me know if you see any subsequent problems.   
> 


I'm not sure I understand your answer about the --force option: if I use fstab to mount the network file system, I'm not sure I can control the umount options. What can I do in that case?
Comment 13 Steve French 2005-12-01 09:32:29 UTC
I don't know if you can specify force on umounts done via fstab (although I suspect that root can still specify "umount /mnt --force" and it will work no matter how the share was mounted), but I thought shutdown did it that way if umount did not respond (not certain). In any case, it is not necessary to specify --force: despite the corrupt response from the server, umount will time out after between 15 and 30 seconds.

I did find two other likely NetApp server bugs, although I am not certain how recent the test server is (on the wire their server returns Windows 2000 as its OS and NOS version rather than their actual operating system version). They return the wrong return code on NTRename (which is used to create hard links from Windows clients and of course from the Linux cifs client): instead of returning "error not supported" they return bad directory (apparently they do not support hard links via cifs, perhaps only via NFS at the moment). In addition, I see one more server problem, which I believe is the server failing to allow rename of open files (again, there is at least one way in which this is allowed and can occur from Windows clients, although it is not as common as it is from the Linux cifs client). But in general it did fine in the tests that I ran against it yesterday. There may be newer versions of their server which are even better.
Comment 14 Steve French 2005-12-01 09:34:24 UTC
Fix is also in 2.6.15-rc4 released yesterday evening.