14988 – smbd pauses activity for 10s of seconds before resuming, silently

Bug 14988 - smbd pauses activity for 10s of seconds before resuming, silently

Summary: smbd pauses activity for 10s of seconds before resuming, silently

Status:	RESOLVED INVALID

Alias:	None

Product:	Samba 4.1 and newer
Classification:	Unclassified
Component:	File services (show other bugs)
Version:	4.13.17
Hardware:	Other Linux

Importance:	P5 normal (vote)
Target Milestone:	---
Assignee:	Samba QA Contact
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2022-02-23 12:40 UTC by Nicolas Will
Modified:	2022-03-03 11:44 UTC (History)
CC List:	0 users

See Also:

Attachments
network diagram (90.10 KB, image/png) 2022-02-23 12:40 UTC, Nicolas Will	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Nicolas Will 2022-02-23 12:40:14 UTC

Created attachment 17175 [details]
network diagram

Hello!

Setup is per attached picture.

Server: LibreELEC 11 nightly running smbd 4.13.17

Client (current version of DietPi, kept up-to-date) mounts CIFS using:

//10.25.25.2/content on /mnt/content type cifs (rw,relatime,vers=3.1.1,cache=strict,username=guest,uid=1000,forceuid,gid=1000,forcegid,addr=10.25.25.2,file_mode=0770,dir_mode=0770,soft,nounix,serverino,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1,x-systemd.automount)

On a very irregular basis, can be in minutes or hours, the client reports:

Feb 14 21:46:10 grabber kernel: CIFS: Attempting to mount //10.25.25.2/content
Feb 14 21:55:20 grabber kernel: CIFS: VFS: \\10.25.25.2 sends on sock 00000000d18678b6 stuck for 15 seconds
Feb 14 21:55:20 grabber kernel: CIFS: VFS: \\10.25.25.2 Error -11 sending data on socket to server
Feb 14 21:58:32 grabber kernel: CIFS: VFS: \\10.25.25.2 has not responded in 180 seconds. Reconnecting...
Feb 14 21:58:58 grabber kernel: CIFS: VFS: \\10.25.25.2 sends on sock 000000009546efff stuck for 15 seconds
Feb 14 21:58:58 grabber kernel: CIFS: VFS: \\10.25.25.2 Error -11 sending data on socket to server
Feb 14 21:59:27 grabber kernel: CIFS: VFS: \\10.25.25.2 sends on sock 0000000022e8f049 stuck for 15 seconds
Feb 14 21:59:27 grabber kernel: CIFS: VFS: \\10.25.25.2 Error -11 sending data on socket to server

Needless to say, this breaks havoc with the software using the mounted network share!

I has setup client specific logging at level 10 on the server:

https://youplala.net/~nico/log.10.25.25.1

Both client and server are on NTP, so timestamps are valid. 

You will notice that smbd appears to just hang for a while, logging nothing, and then resume operations as if nothing has happened… Which probably will not help troubleshooting…

I have used the same client machine under the same workflow and load against another Samba server (old Lenovo/EMC NAS) with no issue, which leads me to believe that the server has a problem, not the CIFS client.

Thanks!

Comment 1 Volker Lendecke 2022-02-23 14:32:55 UTC

Can you try without sendfile and recvfile?

min receivefile size = 0
use sendfile = no

Comment 2 Nicolas Will 2022-02-23 22:17:17 UTC

With

min receivefile size = 0
use sendfile = no

Feb 23 21:57:51 grabber kernel: CIFS: VFS: \\10.25.25.2 sends on sock 0000000017c2b08d stuck for 15 seconds
Feb 23 21:57:51 grabber kernel: CIFS: VFS: \\10.25.25.2 Error -11 sending data on socket to server
Feb 23 21:57:51 grabber kernel: CIFS: VFS: Send error in read = -11
Feb 23 21:57:51 grabber kernel: CIFS: VFS: Send error in read = -11
Feb 23 21:57:51 grabber kernel: CIFS: VFS: Send error in read = -11
Feb 23 22:00:59 grabber kernel: CIFS: VFS: \\10.25.25.2 has not responded in 180 seconds. Reconnecting...
Feb 23 22:00:59 grabber kernel: CIFS: VFS: Send error in read = -11

Client specific level 10 log on server:

https://youplala.net/~nico/log.10.25.25.1%20-%202021022301

Comment 3 Volker Lendecke 2022-02-24 08:59:20 UTC

Next try: Also disable async I/O. It seems that your disk subsystem is very slow in writing. Please set

aio write size = 0

Comment 4 Nicolas Will 2022-02-24 10:49:38 UTC

Oooooh... Indeed, disk I/O perf *is* very bad, for some reason.

Investigating...

I've set aio write size = 0 in the meantime and will report too.

Comment 5 Volker Lendecke 2022-02-24 10:56:37 UTC

If the aio write size setting does not help, the next step will be "sync always = yes" and "strict sync = yes". We need to avoid that the kernel accumulates unwritten data in RAM, which can lead to these large timeouts that Samba has no control over. If you do the "sync always = yes", my hope is that the disk slowness is smoothed out over the time of all writes and is not batched when the kernel decides that it's enough to flush everything to the very slow disk.

Comment 6 Nicolas Will 2022-02-24 11:01:50 UTC

Understood, and logical. Thanks for the explanation.

My attention is now mainly taken away from Samba for the foreseeable future.

That disk subsystem is abnormally slow. I makes sense to understand why and fix the situation, instead of pointing fingers at smbd and trying to adapt its settings to conform to a broken underlying disk subsystem.

So... Bear with me! Thanks for the reactive and informative response so far.

I'll make sure to report back here and not disappear into the sunset.

Comment 7 Nicolas Will 2022-03-03 11:44:09 UTC

Coming back to this with more info after more testing, as promised. It just took some time to juggle a large amount of data in order to shift filesystems on large disks.

RESOLVED INVALID is indeed correct.

The cause was a very slow disk I/O.

The root cause, so far, is the use of the Paragon ntfs3 kernel driver.

This is further correlated with the date of the switch from the ntfs-3g FUSE driver to the ntfs3 driver.

Using the same hardware, an ext4 filesystem solved the slow disk I/O issue and Samba works as expected without any non-standard configuration option.

Thanks to everyone who shined in, you caught the issue!