Bug 9356 - [regression] CIFS timeouts on Linux clients with Samba 4
Status: RESOLVED DUPLICATE of bug 9422
Product: Samba 4.0
Classification: Unclassified
Component: File services
Version: 4.0.0rc5
Hardware: x64 Linux
Importance: P5 regression
Target Milestone: ---
Assigned To: Jeremy Allison
QA Contact: Samba QA Contact
Depends on:
Blocks:
Reported: 2012-11-05 08:52 UTC by frederik.vogelsang
Modified: 2012-12-03 19:56 UTC (History)

See Also:


Attachments
smbd.log (201.59 KB, text/x-log)
2012-11-05 08:52 UTC, frederik.vogelsang
Log/config from samba4 with s3fs (277.69 KB, application/octet-stream)
2012-12-02 20:50 UTC, frederik.vogelsang
Log/config from samba4 with ntvfs (135.60 KB, application/octet-stream)
2012-12-02 20:50 UTC, frederik.vogelsang
Log/config from samba3 (298.24 KB, application/octet-stream)
2012-12-02 20:51 UTC, frederik.vogelsang

Description frederik.vogelsang 2012-11-05 08:52:19 UTC
Created attachment 8150 [details]
smbd.log

I've been running Samba 4 since rc2 and I've been encountering lockups and CIFS timeouts on Linux clients ever since. These timeouts seem to happen only when accessing multiple files at once. Windows clients are fine, btw. Downgrading to Samba 3.6.8 on the server-side solves this issue, so this must be a regression.

Steps to reproduce:
1. Set up test share on Samba server
2. Put some FLAC files in test share (this is the easiest way to reproduce)
3. Mount test share on Linux client
4. On client run `metaflac --add-replay-gain *.flac`

Result:
All mounted CIFS shares from this server time out somehow. The metaflac process freezes. This is what dmesg shows on the client after a while:
> CIFS VFS: Server horst has not responded in 120 seconds. Reconnecting...
> CIFS VFS: Server horst has not responded in 120 seconds. Reconnecting...
> CIFS VFS: Error -32 sending data on socket to server


I have attached the smbd.log file. When the timeouts happen on the client-side I always get this message on the server:
> ../source3/smbd/server.c:436(remove_child_pid)
> Could not find child 20541 -- ignoring

Setup:
- Samba 4.0-rc4 on Linux (tried both kernel 3.6 and 3.7-master on the server)
- Samba acts as a standalone AD DC
- Linux clients: tried kernel 3.6 and 3.7-master
- Linux clients: tried with SMB2 and without SMB2 kernel option
- Linux clients: cifs-utils 5.6
Comment 1 frederik.vogelsang 2012-11-19 13:52:57 UTC
Also applies to rc5
Comment 2 Stefan Metzmacher 2012-11-20 07:47:10 UTC
We need logs with "log level = 10", "debug hires timestamp = yes" and "debug pid = yes", plus network captures.

Both for 3.6 and 4.0.

Please also upload your smb.conf

See also
https://wiki.samba.org/index.php/Client_specific_Log
https://wiki.samba.org/index.php/Capture_Packets
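
The logging settings requested above belong in the [global] section of smb.conf, and the capture can be taken with tcpdump while the problem is reproduced. The following is only a sketch: the function names, the output path and the default interface "any" are assumptions made for illustration, not something specified in this report. Note that smb.conf(5) spells the timestamp parameter "debug hires timestamp" (singular).

```shell
#!/bin/sh
# Print the smb.conf settings requested in comment 2.
# (Hypothetical helper; merge the output into the existing [global] section.)
print_debug_settings() {
    cat <<'EOF'
[global]
    log level = 10
    debug hires timestamp = yes
    debug pid = yes
EOF
}

# Capture SMB traffic while reproducing the problem (run as root).
# $1 = output file, $2 = interface (assumed default: "any").
# Ports 445 and 139 are the standard SMB ports.
capture_smb() {
    tcpdump -i "${2:-any}" -s 0 -w "$1" port 445 or port 139
}
```

A possible workflow is to add the output of `print_debug_settings` to smb.conf, restart Samba, run `capture_smb /tmp/bug9356.pcap` in a second terminal, reproduce the timeout, and stop the capture with Ctrl-C.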
Comment 3 frederik.vogelsang 2012-12-02 09:45:49 UTC
After debugging this a bit more I found out a few things:

* This is definitely NOT related to network/NIC issues, as the problem even occurs when mounting a share directly on the server (through //127.0.0.1/TestShare).
* After provisioning Samba4 with --use-ntvfs the issue disappears.

Since I was able to solve the issue with "--use-ntvfs" I can now no longer provide any debugging information. If none of the Samba devs have time to look into the remove_child_pid-error I was getting you may close this bug. As far as I understand the NTVFS mode is being used in future versions of Samba4 anyway, so I guess I'll be fine then.
Comment 4 Stefan Metzmacher 2012-12-02 20:49:49 UTC
(In reply to comment #3)
> After debugging this a bit more I found out a few things:
> 
> * This is definitely NOT related to network/NIC issues, as the problem even
> occurs when mounting a share directly on the server (through
> //127.0.0.1/TestShare).
> * After provisioning Samba4 with --use-ntvfs the issue disappears.
> 
> Since I was able to solve the issue with "--use-ntvfs" I can now no longer
> provide any debugging information. If none of the Samba devs have time to look
> into the remove_child_pid-error I was getting you may close this bug. As far as
> I understand the NTVFS mode is being used in future versions of Samba4 anyway,
> so I guess I'll be fine then.

No it won't, we'll keep the current default and use the 'smbd' file server.
Comment 5 frederik.vogelsang 2012-12-02 20:50:26 UTC
Created attachment 8245 [details]
Log/config from samba4 with s3fs

Sorry, my last comment was based on wrong information. While ntvfs works for me, I figured that s3fs really is the way to go. Since I am also interested in getting this thing sorted out for s3fs I added the debug parameters to the smb.conf file and performed the following steps directly on the Samba server for (Samba3|Samba4-ntvfs|Samba4-s3fs):

1. mount -t cifs //127.0.0.1/TestShare /mnt/smb -o username=Administrator
2. metaflac --add-replay-gain /mnt/smb/cdparanoia/*.flac

The 2nd step works great with Samba3 and Samba4-ntvfs, but causes timeouts on Samba4-s3fs. I hope you'll be able to fix the bug with this information.
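
The two steps above can be wrapped in a small script that also checks the kernel log for the CIFS timeout message quoted in the description. This is a sketch under assumptions: the function names and the 300-second limit are invented for illustration, and a process blocked in uninterruptible I/O on a hung CIFS mount may not react to the signal sent by `timeout`.

```shell
#!/bin/sh
# Count CIFS "has not responded" messages in kernel log text read from stdin.
count_cifs_timeouts() {
    grep -c 'CIFS VFS: Server .* has not responded'
}

# Reproduction as described in comment 5 (run as root on the Samba server).
# The 300-second limit is an arbitrary assumption so the script does not wait forever.
reproduce() {
    mount -t cifs //127.0.0.1/TestShare /mnt/smb -o username=Administrator || return 1
    timeout 300 metaflac --add-replay-gain /mnt/smb/cdparanoia/*.flac
    rc=$?
    # Report how many timeout messages the kernel has logged so far.
    dmesg | count_cifs_timeouts
    return "$rc"
}
```

Running `reproduce` once per server configuration (Samba3, Samba4-ntvfs, Samba4-s3fs) gives a comparable count for each; clearing the kernel log between runs keeps the counts separate.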
Comment 6 frederik.vogelsang 2012-12-02 20:50:50 UTC
Created attachment 8246 [details]
Log/config from samba4 with ntvfs
Comment 7 frederik.vogelsang 2012-12-02 20:51:25 UTC
Created attachment 8247 [details]
Log/config from samba3
Comment 8 frederik.vogelsang 2012-12-02 22:30:43 UTC
Sorry for the noise, but I made an interesting discovery:

As soon as I disable UNIX extensions (either in smb.conf or as a mount option for mount.cifs) the timeouts are no longer an issue. BTW: Only using "nobrl,noposixpaths,noacl" as mount options is not sufficient to work around the issue - one definitely has to disable UNIX extensions completely.

Is there any further information I could provide? Do you still need network captures (maybe with and without UNIX extensions enabled)?
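
For reference, the two ways of disabling UNIX extensions mentioned above are not spelled out in this comment. A sketch of both follows, assuming the `nounix` option from mount.cifs(8) on the client and the `unix extensions = no` parameter from smb.conf(5) on the server; the function names are hypothetical.

```shell
#!/bin/sh
# Build a mount.cifs option string.
# $1 = user name, $2 = "nounix" to disable UNIX extensions (optional).
cifs_mount_opts() {
    opts="username=$1"
    if [ "${2:-}" = "nounix" ]; then
        opts="$opts,nounix"
    fi
    printf '%s\n' "$opts"
}

# Client side: mount the test share without UNIX extensions (run as root).
mount_without_unix_ext() {
    mount -t cifs //127.0.0.1/TestShare /mnt/smb \
        -o "$(cifs_mount_opts Administrator nounix)"
}

# Server side alternative: in the [global] section of smb.conf set
#     unix extensions = no
# and restart Samba.
```

Either setting alone should be enough for a test, since the extensions are only used when both client and server agree to them.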
Comment 9 Stefan Metzmacher 2012-12-03 09:05:12 UTC
(In reply to comment #8)
> Sorry for the noise, but I made an interesting discovery:
> 
> As soon as I disable UNIX extensions (either in smb.conf or as a mount option
> for mount.cifs) the timeouts are no longer an issue. BTW: Only using
> "nobrl,noposixpaths,noacl" as mount options is not sufficient to work around
> the issue - one definitely has to disable UNIX extensions completely.
> 
> Is there any further information I could provide? Do you still need network
> captures (maybe with and without UNIX extensions enabled)?

Ok, thanks for debugging this!

I think network captures would be really good together with level 10 logs.

3.6 as server (with and without unix extension)
4.0 as server (with and without unix extension)
Comment 10 Jeff Layton 2012-12-03 10:57:57 UTC
When you mount with unix extensions, it's likely that the client is negotiating a larger rsize and wsize with the server. It's possible that this bug is a duplicate of this one:

    https://bugzilla.samba.org/show_bug.cgi?id=9422

You may want to make sure your samba binaries have the patch for that bug and retest before going to great lengths to debug this.
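
To check whether a larger rsize and wsize were actually negotiated, the values can be read from /proc/mounts on the client and compared between a mount with and a mount without UNIX extensions. The sketch below assumes the CIFS mount options are listed there in the usual comma-separated form; the function names are made up for illustration.

```shell
#!/bin/sh
# Extract the value of one option (e.g. rsize) from a comma-separated
# mount option string such as the 4th field of /proc/mounts.
# $1 = option name, $2 = option string.
mount_opt_value() {
    printf '%s\n' "$2" | tr ',' '\n' | sed -n "s/^$1=//p"
}

# Print rsize and wsize for every mounted CIFS share.
show_cifs_io_sizes() {
    awk '$3 == "cifs" { print $2, $4 }' /proc/mounts |
    while read -r mountpoint opts; do
        echo "$mountpoint rsize=$(mount_opt_value rsize "$opts") wsize=$(mount_opt_value wsize "$opts")"
    done
}
```

Calling `show_cifs_io_sizes` after each mount prints one line per CIFS mount point, which makes a difference in the negotiated sizes easy to spot.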
Comment 11 frederik.vogelsang 2012-12-03 19:52:51 UTC
(In reply to comment #10)
> You may want to make sure your samba binaries have the patch for that bug and
> retest before going to great lengths to debug this.
Thanks a lot for this hint. I've applied this patch to rc5:
> [PATCH] Fix Bug 9422 - large read requests cause server to issue malformed reply

The patch fixes the nasty timeouts and Samba4 now works as expected with UNIX extensions enabled.
Comment 12 Jeremy Allison 2012-12-03 19:56:43 UTC

*** This bug has been marked as a duplicate of bug 9422 ***