Bug 7860 - cifs hangs on Netapp DFS shares
Status: RESOLVED FIXED
Product: CifsVFS
Classification: Unclassified
Component: user space tools
Version: 2.6
Hardware: x86 Linux
Importance: P3 major
Target Milestone: ---
Assigned To: Jeff Layton
Reported: 2010-12-10 08:52 UTC by Marcus Schopen
Modified: 2012-04-29 12:16 UTC (History)


Attachments
session dump (18.71 KB, application/octet-stream)
2010-12-11 11:02 UTC, Marcus Schopen
patch -- ignore extra junk after the end of SMB (2.10 KB, patch)
2010-12-21 11:45 UTC, Jeff Layton
cifs debug info (10.28 KB, text/plain)
2010-12-22 07:22 UTC, Burkhard Obergoeker

Description Marcus Schopen 2010-12-10 08:52:26 UTC
When connecting to a NetApp file server that has DFS enabled on the users' home shares, the shares can be mounted successfully with

 mount -t cifs //server/home/username /mnt/ -o user=username,domain=AD

but the connection hangs the moment a directory listing or a directory change is started. The strange thing is that only shares with DFS enabled show this problem. I do not maintain the NetApp file server, so I can't post more information about that system. The newest Linux version I have tested on the client side is a fresh Ubuntu 10.04.1 LTS installation without any kernel modifications.

/var/log/kernel.log:
--------------------
Dec 10 14:41:12 kramer kernel: [11431.779224]  CIFS VFS: RFC1001 size 164 bigger than SMB for Mid=17
Dec 10 14:41:12 kramer kernel: [11431.779239] Bad SMB: : dump of 48 bytes of data at 0xe93b8e00
Dec 10 14:41:12 kramer kernel: [11431.779252]  000000a4 424d53ff 00000032 80018800 € . . . ÿ S M B 2 . . . . . . .
Dec 10 14:41:12 kramer kernel: [11431.779265]  00000000 00000000 00000000 4dfe0040 . . . . . . . . . . . . @ . þ M
Dec 10 14:41:12 kramer kernel: [11431.779277]  00110800 7e00020a 02000000 00003800 . . . . . . . ~ . . . . . 8 . .
Dec 10 14:41:28 kramer kernel: [11447.928077]  CIFS VFS: server not responding
Dec 10 14:41:28 kramer kernel: [11447.928091]  CIFS VFS: No response for cmd 50 mid 17
Dec 10 14:41:33 kramer kernel: [11452.995093]  CIFS VFS: RFC1001 size 164 bigger than SMB for Mid=21
Dec 10 14:41:33 kramer kernel: [11452.995108] Bad SMB: : dump of 48 bytes of data at 0xe93b9c00
Dec 10 14:41:33 kramer kernel: [11452.995121]  000000a4 424d53ff 00000032 80018800 € . . . ÿ S M B 2 . . . . . . .
Dec 10 14:41:33 kramer kernel: [11452.995134]  00000000 00000000 00000000 4dfe0040 . . . . . . . . . . . . @ . þ M
Dec 10 14:41:33 kramer kernel: [11452.995147]  00150800 7e00020a 02000000 00003800 . . . . . . . ~ . . . . . 8 . .
Dec 10 14:41:58 kramer kernel: [11477.929102]  CIFS VFS: server not responding
Dec 10 14:41:58 kramer kernel: [11477.929116]  CIFS VFS: No response for cmd 50 mid 21
Dec 10 14:42:07 kramer kernel: [11486.772315]  CIFS VFS: RFC1001 size 164 bigger than SMB for Mid=25
--------------------
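
For readers who want to interpret the "Bad SMB" dump above, the following is a minimal Python sketch that decodes the first 48 bytes. It assumes the kernel printed the buffer as 32-bit words read on a little-endian host, with the RFC1001 length already converted to host byte order (the ASCII column of the dump suggests this); the helper names `dump_to_bytes` and `parse_header` are made up for this illustration.

```python
import struct

# Words copied from the first "Bad SMB" dump in the log above.
DUMP = """
000000a4 424d53ff 00000032 80018800
00000000 00000000 00000000 4dfe0040
00110800 7e00020a 02000000 00003800
"""


def dump_to_bytes(dump):
    # Assumption: each word is a 32-bit value as seen by a little-endian host,
    # so packing it little-endian restores the original byte order in memory.
    return b"".join(struct.pack("<I", int(word, 16)) for word in dump.split())


def parse_header(buf):
    # Layout: 4-byte length, then the 32-byte SMB1 header, then WordCount.
    (rfc1001_len,) = struct.unpack_from("<I", buf, 0)
    (flags2,) = struct.unpack_from("<H", buf, 14)
    tid, pid, uid, mid = struct.unpack_from("<4H", buf, 28)
    return {
        "rfc1001_len": rfc1001_len,  # length announced by the NetBIOS layer
        "magic": buf[4:8],           # b"\xffSMB" for SMB1
        "command": buf[8],           # 0x32 (50) is TRANS2
        "flags": buf[13],
        "flags2": flags2,
        "tid": tid,
        "pid": pid,
        "uid": uid,
        "mid": mid,
        "word_count": buf[36],
    }


header = parse_header(dump_to_bytes(DUMP))
print(header["rfc1001_len"], header["command"], header["mid"])  # 164 50 17
```

Under those assumptions the decoded values agree with the log lines: RFC1001 size 164, cmd 50 and Mid=17.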

A dump of a session can be downloaded from: 

 http://wwwhomes.uni-bielefeld.de/schoppa/dump.pcap

The dump shows the following: 
-----------------
mount -t cifs //fs-home.xxx/home/jurtestmount /mnt/ -o user=xxx,domain=AD
Password: 
# ls -al
insgesamt 17
drwxrwxrwx  4 root root    0 2010-12-10 14:37 .
drwxr-xr-x 21 root root 4096 2010-11-18 17:57 ..
drwxrwxrwx  1 root root 4096 2010-12-10 14:38 testfolder001
drwxrwxrwx  1 root root 4096 2010-12-10 14:37 testfolder002
drwxrwxrwx  1 root root   29 2010-12-10 14:33 WWW
# cd testfolder001/
-----------------

The moment I try to change the directory to testfolder001, the connection hangs.

Ciao!
Comment 1 Marcus Schopen 2010-12-11 11:02:39 UTC
Created attachment 6129 [details]
session dump
Comment 2 Burkhard Obergoeker 2010-12-15 06:45:00 UTC
We already discussed this problem with our support partners at Novell and NetApp. During this analysis we found that the problem does not depend on DFS, since we reproduced it on a NetApp share without DFS enabled. Nevertheless, NetApp generates "bad" SMB headers with a wrong packet length when sending a QUERY_PATH_INFO response.
This occurs when accessing file names of a particular length.

Example:

I mounted a share on /mnt 
mount -tcifs //fs-home/home/<username>/ /mnt -o username=<username>,noserverino

and tried to access some file names

# ls -ald /mnt/1234567890/12345
-rwxr-xr-x 1 root root 803 2010-01-14 08:07 /mnt/1234567890/12345
# ls -ald /mnt/1234567890/1234
-rwxr-xr-x 1 root root 803 2010-01-14 08:07 /mnt/1234567890/1234
# ls -ald /mnt/1234567890/123
 <hang>

Shorter file names do not work either, whereas "ls -ald" on the directory itself works.

If this happens, cifs.ko drops the SMB connection and initializes it again. After the restart, NetApp sends the malformed SMB packet again, so this cycle repeats over and over. (See kernel source, file fs/cifs/misc.c, line 731 ff.)

Our Novell support discovered that the NetApp server appears to be sending packets that are *shorter* than declared, depending on the length of the file name plus path. We have asked NetApp how this can happen, but the answer is still pending.

Maybe there is a way for cifs.ko to handle these malformed SMB headers without resetting the connection, to prevent those "hangs"?
Comment 3 Jeff Layton 2010-12-15 06:58:01 UTC
cc'ing Suresh since you're working with Novell. Reassigning bug to myself for now, but I probably won't get to it for a little while...

The CIFS client is definitely overzealous about aborting the connection. I'm working on a patchset to fix up some other places where it does this, but they don't really deal with malformed packets.

If the server is sending extra junk at the end of the SMB, we probably ought to just ignore that part (IOW, read that part off the socket and discard it). We'll have to look at packet traces and see if there are length fields that we can trust in there. There probably are if this works with windows...



Comment 4 Suresh Jayaraman (mail address is dead) 2010-12-15 11:22:29 UTC
I think I'm aware of such an issue, but I'm not really sure whether it is identical to this one. In the one we encountered, the packet header looked like an SMB2 header. I guess NetApp has perhaps made SMB2 the default since ONTAP version 7.3.3. One suggestion is to revert the default to CIFS and see whether the problem is still reproducible. I do not have more information about this, but I'll keep looking and update.
Comment 5 Jeff Layton 2010-12-21 11:45:13 UTC
Created attachment 6156 [details]
patch -- ignore extra junk after the end of SMB

This patch may or may not help.

Currently, we allow the netbios length to be a little larger than expected by the SMB packet. If it's "too big" however we drop the reply on the floor. The definition of "too big" seems to have changed over time, but I don't see any good reason to reject a packet that's too large outright. If we have extra junk at the end of the SMB then we ought to just ignore it.

So, you may want to give this patch a go and let me know whether it helps. It may not though -- some of the responses coming from this server look quite malformed. For instance, frame 112 has a FileNameLen of 54, but the trailing FileName is 32 bytes long.

So there seem to be some frames in this capture that are shorter than expected and you may still run afoul of that.
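
The two cases discussed here (extra bytes after the end of the SMB versus a frame that is shorter than the SMB claims) can be illustrated with a small Python sketch. This is not the attached patch, only an illustration of the policy under the assumption that the whole frame has already been read into memory; the function name `trim_to_smb` is hypothetical.

```python
def trim_to_smb(frame, smb_len):
    """Split a received frame into the SMB itself and any trailing junk.

    frame   -- the bytes received, as announced by the RFC1001 length
    smb_len -- the length the SMB computes for itself from its own fields
    """
    if len(frame) < smb_len:
        # The SMB claims more bytes than were received; trusting it would
        # mean reading past the end of the buffer, so refuse.
        raise ValueError(
            f"frame is {smb_len - len(frame)} bytes shorter than the SMB claims"
        )
    # Anything beyond smb_len is ignored rather than treated as an error.
    return frame[:smb_len], frame[smb_len:]


smb, junk = trim_to_smb(b"A" * 10 + b"JUNK", 10)
print(len(smb), junk)  # 10 b'JUNK'
```

The too-short case is the one that cannot safely be tolerated, which is why it raises instead of returning a partial message.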

Still Marcus, if you can test this, please do and let me know if it helps.
Comment 6 Burkhard Obergoeker 2010-12-22 04:44:03 UTC
It helps!

I just applied this patch against SLES 11 SP1, kernel 2.6.32.23-0.3 (x86_64), and tried the same commands I described above.
Although the mismatch of the SMB header length is still being displayed, it's possible to access even the files that caused cifs.ko to reset the connection. This means I am able to get not only the file information but also the correct contents.

Nevertheless I still see that NetApp has to solve the core problem of the malformed smb packets, but this will work for me so far.

Thanks a lot for your help!

Best regards

Burkhard
Comment 7 Jeff Layton 2010-12-22 06:24:29 UTC
(In reply to comment #6)

> Although the mismatch of the smb header length is still being displayed, it's
> possible to access even the files that causes cifs.ko to reset the line.

Which messages are still being displayed? We probably ought to not spam the ring buffer with messages about this as well.
Comment 8 Burkhard Obergoeker 2010-12-22 07:22:39 UTC
Created attachment 6158 [details]
cifs debug info

I added the debug output I gathered while running the commands:
# ls -ld /mnt/1234567890/12
# cat /mnt/1234567890/12

To get this verbose form, I switched it on by using
echo 1 > /proc/fs/cifs/cifsFYI


Best regards

Burkhard
Comment 9 Jeff Layton 2010-12-22 07:37:46 UTC
Ahh ok, so you're just getting the extra printk's when turning up cifsFYI? If so, then that's fine, I just wanted to make sure we weren't spamming the ring buffer with printk's without debugging turned up.

I'll plan to send the patch to the list soon.

Comment 10 Jeff Layton 2010-12-22 07:39:58 UTC
Ok, patch sent to linux-cifs mailing list. Hopefully we'll get this into 2.6.38.
Comment 11 Jeff Layton 2011-01-24 13:27:38 UTC
Actually, this patch is bogus since it discards the check for an RFC1001 length that's *smaller* than the SMB length, and that seems to be the actual problem here according to the capture.

This means that we are likely not going to be able to fix this problem. The BCC value in the packet should basically be the "distance" in bytes to the end of the SMB, starting with the byte following the BCC. In frame 78, that length should be ~108 bytes or so, but the packet has it as 131 bytes -- 23 bytes too short. If we trust that value we could read off the end of the received frame which is rather dangerous.

Steve suggested that we could consider "fudging" the BCC and Trans2 data length after we check the signature (if any). That might work, so I'll leave this open for now until we can investigate it.
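
To make the BCC arithmetic concrete, here is a hedged Python sketch of what "fudging" could look like: locate the BCC field, compare the declared value with the bytes actually received, and clamp it. This is only an illustration, not kernel code. The frame below is synthetic (WordCount 0, numbers chosen to echo the 131 versus ~108 bytes mentioned above), and a real Trans2 response would also need its data length adjusted and its signature checked first. The names `bcc_location` and `clamp_bcc` are hypothetical.

```python
import struct

SMB_HEADER_LEN = 32  # fixed SMB1 header; the 4-byte RFC1001 length is excluded


def bcc_location(smb):
    """Return (offset of the BCC field, declared BCC) for an SMB1 message."""
    if len(smb) < SMB_HEADER_LEN + 1:
        raise ValueError("too short to hold the header and WordCount")
    word_count = smb[SMB_HEADER_LEN]
    bcc_off = SMB_HEADER_LEN + 1 + 2 * word_count
    if len(smb) < bcc_off + 2:
        raise ValueError("too short to hold the BCC field")
    (declared,) = struct.unpack_from("<H", smb, bcc_off)
    return bcc_off, declared


def clamp_bcc(smb):
    """Return (declared BCC, usable BCC); usable never passes the buffer end."""
    bcc_off, declared = bcc_location(smb)
    available = len(smb) - (bcc_off + 2)  # bytes that really follow the BCC
    return declared, min(declared, available)


# Synthetic frame: the BCC declares 131 bytes, but only 108 bytes follow it.
frame = b"\xffSMB" + bytes(28) + b"\x00" + struct.pack("<H", 131) + bytes(108)
declared, usable = clamp_bcc(frame)
print(declared, usable, declared - usable)  # 131 108 23
```

Clamping keeps every later read inside the received buffer, at the cost of silently accepting a response whose own length fields are wrong.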
Comment 12 Suresh Jayaraman (mail address is dead) 2011-03-04 04:21:26 UTC
Jeff: was a fix proposed for this issue and I missed it? Or are we still investigating this?
Comment 13 Jeff Layton 2011-03-04 12:28:23 UTC
There's no fix for this problem, and I'm not really investigating it since we understand the problem. It's really a server-side issue. We have discussed workarounds for this problem, but no one has generated any patches for it. Feel free to take this bug over if you want to work on it.
Comment 14 Linda Walsh 2011-10-31 20:48:55 UTC
(In reply to comment #13)
> There's no fix for this problem, and I'm not really investigating it since we
> understand the problem. It's really a server-side issue. We have discussed
> workarounds for this problem, but no one has generated any patches for it. Feel
> free to take this bug over if you want to work on it.

---
Netapp is not a very responsible company.  They are a bit like the Symantec of the anti-virus & security world.  They get by on new customers and customers who don't know any better.   I had to deal with one of their servers while helping a non-profit.  It had slower performance and considerably more restricted options than my home PC (a dual P-II at the time).  It was pathetic... yet they charged huge amounts of money to sell and license these things that they got for free -- very profitable.   But technical knowledge -- didn't see it in the product design.  It was commodity PCs dressed in fancy packages with suboptimal performance.

They likely bought the code from someone else (who is long gone), or had it done
by a contractor (who is long gone)...  So there's the distinct possibility that they may not have the intellectual capital on hand needed to fix this.
Comment 15 Jeff Layton 2012-04-06 11:13:22 UTC
This should now be "fixed" to some degree. If we get a frame that we end
up discarding but can identify the task that issued it, then we now wake
up the sleeping process.

That means that you'll still get printk spam in the logs from this
server but it shouldn't hang anymore with recent kernels. Please reopen
this if you're still seeing the problem with 3.3 kernels or later.
Comment 16 Michael Letzgus 2012-04-28 21:17:14 UTC
Still the same problem with kernel 3.3.2.
Lots of scrambled and/or missing file names and directory entries.
Comment 17 Jeff Layton 2012-04-29 11:25:46 UTC
(In reply to comment #16)
> Still the same problem with kernel 3.3.2.
> Lots of scrambled and/or missing file names and directory entries.

Is that the same problem? The problem originally reported concerned the client
hanging when talking to a Netapp server. Are you still seeing hangs too? If not,
then I'd suggest opening a new bug and we'll take a look at that when we're
able.
Comment 18 Michael Letzgus 2012-04-29 11:45:27 UTC
I think it's the same problem - only without the hangs.

I'll open a new bug for this.
Comment 19 Jeff Layton 2012-04-29 12:16:12 UTC
That is indeed a different problem then. When I said that this problem was fixed
to some degree, I meant that it should no longer hang when malformed responses
come in.

The responses will still get thrown out if they're malformed, it's just
that the processes waiting on them should be notified immediately when that
happens -- they'll generally get EIO back in those cases. If you open a
new bug then we can take a look, but it sounds likely that this is a
server bug.
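
As a rough userspace analogy for the behaviour described above (this is not the kernel implementation), the Python sketch below shows a receiver that wakes the waiting caller even when it discards a malformed frame, so the caller gets EIO instead of sleeping until a timeout. The class and function names are invented for this illustration.

```python
import errno
import threading


class PendingRequest:
    """One outstanding request, identified by its multiplex ID (mid)."""

    def __init__(self, mid):
        self.mid = mid
        self.done = threading.Event()
        self.response = None
        self.malformed = False


class Demultiplexer:
    """Matches incoming frames to the callers waiting for them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}

    def submit(self, mid):
        req = PendingRequest(mid)
        with self._lock:
            self._pending[mid] = req
        return req

    def on_frame(self, mid, payload, well_formed):
        with self._lock:
            req = self._pending.pop(mid, None)
        if req is None:
            return False  # nobody is waiting for this mid
        if well_formed:
            req.response = payload
        else:
            req.malformed = True  # the frame itself is thrown away
        req.done.set()  # wake the waiter in both cases
        return True


def wait_for_response(req, timeout=None):
    if not req.done.wait(timeout):
        raise TimeoutError(f"no response for mid {req.mid}")
    if req.malformed:
        raise OSError(errno.EIO, "malformed response discarded")
    return req.response
```

The key point is that `on_frame` sets the event whether or not the frame is usable; before that change, in this analogy, a discarded frame would leave the waiter blocked until the timeout.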

You'll probably want to get some network captures that show the traffic that
causes those messages to pop in the ring buffer if you do so:

    http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Wire_Captures