Bug 8498 - smbclient: cli_push returned NT_STATUS_IO_TIMEOUT
Summary: smbclient: cli_push returned NT_STATUS_IO_TIMEOUT
Status: NEW
Alias: None
Product: Samba 3.6
Classification: Unclassified
Component: Client Tools (show other bugs)
Version: 3.6.0
Hardware: All All
: P5 normal
Target Milestone: ---
Assignee: Volker Lendecke
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-09-29 11:38 UTC by Bram
Modified: 2015-07-16 17:08 UTC (History)
1 user (show)

See Also:


Attachments
test script (298 bytes, text/plain)
2011-09-29 11:38 UTC, Bram
no flags Details
work around 1 (417 bytes, patch)
2011-09-29 11:41 UTC, Bram
no flags Details
work around 2 (964 bytes, patch)
2011-09-29 11:42 UTC, Bram
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Bram 2011-09-29 11:38:54 UTC
Created attachment 6964 [details]
test script

When I pipe data to smbclient and use 'put - /some/file' then smbclient regurally returns 'cli_push returned NT_STATUS_IO_TIMEOUT'.

I initially reported this to the user mailinglist:
	https://lists.samba.org/archive/samba/2011-September/164165.html
	[Samba] cli_push returned NT_STATUS_IO_TIMEOUT


The use case in production:
	$ tar -cz /some/directory | smbclient '\\server_ip\share' -N -c 'put - "\directory\filename"'

I managed to reproduce this using a test case which sleeps 20 seconds when outputting data.

The test case (attached):
	$ perl test.pl 20 | smbclient '\\server_ip\share' -N -c 'put - "\directory\filename"'

I believe the problem is that 'push_source' in 'source3/client/client.c' performs a blocking read.

Looking at the code:

'cli_push' in 'source3/libsmb/clireadwrite.c' is called,
this function calls 'cli_push_send' in 'source3/libsmb/clireadwrite.c' which creates (in my case) 5 requests.

'cli_push_write_setup' is called for each request which calls 'push_source' which reads 'chunk_size' of bytes (in a blocking mode).
After the read it adds an event to write the data via the socket.
This event eventually ends up in 'cli_smb_req_create' in 'source3/libsmb/async_smb.c' which adds a write-request with a timeout

Next: the write request is complete and the callback 'cli_push_written' is called;
This calls 'cli_push_write_setup' again and performs the same tasks as above until all the data is written.


What I believe is happening:

A. cli_push_write_setup is called with idx = 0 -> performs a blocking read and adds a write event with a timeout
B. cli_push_write_setup is called with idx = 1 -> performs a blocking read and adds a write event with a timeout
C. cli_push_write_setup is called with idx = 2 -> performs a blocking read and adds a write event with a timeout
D. cli_push_write_setup is called with idx = 3 -> performs a blocking read and adds a write event with a timeout
E. cli_push_write_setup is called with idx = 4 -> performs a blocking read and adds a write event with a timeout
F. write event with idx = 0 is handled -> data is written -> callback cli_push_written called (somewhere/somehow) -> cli_push_write_setup is called ->  performs a blocking read and adds a write event with a timeout
G. write event with idx = 1 is handled -> data is written -> callback cli_push_written called (somewhere/somehow) -> cli_push_write_setup is called ->  performs a blocking read and adds a write event with a timeout
H. write event with idx = 2 is handled -> data is written -> callback cli_push_written called (somewhere/somehow) -> cli_push_write_setup is called ->  performs a blocking read and adds a write event with a timeout
I. write event with idx = 3 is handled -> data is written -> callback cli_push_written called (somewhere/somehow) -> cli_push_write_setup is called ->  performs a blocking read and adds a write event with a timeout
J. write event with idx = 4 is handled -> data is written -> callback cli_push_written called (somewhere/somehow) -> cli_push_write_setup is called ->  performs a blocking read and adds a write event with a timeout

=>

in 'F' a blocking read is performed. In the test case it takes 20 seconds for this to complete (because the test script waits 20 seconds with outputting data).
When the read completes the queue continues and it starts processing 'G'. BUT: it detects that it is already on the queue for more then 20 seconds
Result: the timer/timeout has passed so it stops with the status NT_STATUS_IO_TIMEOUT.
Comment 1 Bram 2011-09-29 11:41:24 UTC
Forgot to mention in the report: I first experienced this with samba-3.5.3, but samba-3.5.11 and samba-3.6.0 behave the same.


I can think of two possible work arounds for the problem.

The proper fix is probably to remove the blocking read with an event that waits until the fh is readable.



Work-around-1: set the number of requests created in 'cli_push_send' to 1.
This prevents multiple write events from appearing on the queue which avoid the problem.
See also work-around-1.patch (patch tested with samba-3.5.3 but should also work on git-master)

I've been running smbclient in production with this patch applied and since then the error has not occured.
I do assume that this work-around affects the performance of the transfer since it's not using the full window.


Work-around-2: set 'max_mux' to 1 when smbclient reads from stdin.
See also work-around-2.patch (patch tested with samba-3.5.3 but should also work on git-master)

I do not know what other side effects changing cli->max_mux may have..
For that reason I only tested this very briefly/shortly.
Comment 2 Bram 2011-09-29 11:41:56 UTC
Created attachment 6965 [details]
work around 1
Comment 3 Bram 2011-09-29 11:42:07 UTC
Created attachment 6966 [details]
work around 2
Comment 4 Jeremy Allison 2011-09-29 23:18:02 UTC
Very good analysis. It's certainly the blocking read from stdin that's causing the problem.

I'll examine this more closely. Volker and Metze may also have some ideas here.

Jeremy.
Comment 5 Glyn 2011-10-17 11:59:13 UTC
Hi,

I think I'm hitting this bug with Samba 3.5.6, the problem seems to arise when copying to a server which has quite a slow disk array.

All the clients are debian squeeze, is there any way to work around the bug without a custom compiled version?  Even a selotape & string soloution would suffice.

Thanks
Comment 6 Bram 2011-10-17 16:58:50 UTC
(In reply to comment #5)
>
> All the clients are debian squeeze, is there any way to work around the bug
> without a custom compiled version?  Even a selotape & string soloution would
> suffice.

A possible work around (untested!) is to include
[global]
  max mux = 1

in your samba config file but I do not know what side effect this may have...

(This is the reason why I chose to patch the code and not change the config variable since I know what happens there. I have no clue about what happens in other parts of the code with the max mux variable)
Comment 7 Andreas Richter 2012-06-15 14:24:13 UTC
Why the bug was not fixed?

I've installed the version 3.5.6 and even the version 3.6.5 but the problem with smbclient still exist. Why? I thought it was been fixed ...

I tried the solution with the max mux variable in smb.conf under the section [global] but it didn't work. And the other solutions I don't want to try. Because I don't know what happens if I change the source code.

In my case the bug comes with sending messages to Windows 7 clients with an installed messenger that's able to handle such messages (because in Windows 7 itself there isn't a net send such in Windows XP)

Please send me some informations about this.
I don't want to install third party versions of samba (it means from other mirrors as the default ...)

thanks and greetings

A. Richter
Comment 8 Mikhail T. 2013-05-07 12:46:05 UTC
"Me too" is disappointed, this remains an open problem. Like the original reporter, I set up a new backup job piping the output to smbclient to be uploaded to a remote computer.

I'm using dump rather than tar and the uploads writes a few megabytes before failing with NT_STATUS_IO_TIMEOUT. Using smbclient 3.6.13.

The remote server (a media-entertainment device with embedded Linux) is running Samba-3.0.30.

I will try the suggested workarounds.
Comment 9 Edmar Wiggers 2015-01-09 17:31:29 UTC
Same problem here. Like Andreas Richter, I've also tried smb.conf with [global] max mux = 1 , but it did not solve the problem. I have not tried patching smbclient myself, yet.
Comment 10 Jeremy Allison 2015-01-09 19:27:35 UTC
This isn't fixed yet as the fix is *really* *really* hard in the general case.

Inside cli_push() we expect to be able to call a blocking function 'source' supplied to it, which returns with the buffer to send.

As this is just a generic function pointer it doesn't expose a file descriptor to wait on.

Let me think about this some more.
Comment 11 Edmar Wiggers 2015-04-21 21:04:30 UTC
For the record: The "workaround 1" smbclient patch (state->num_reqs = 1) solved the problem here. I implemented it in 2015-01-09, and I've got no NT_STATUS_IO_TIMEOUT since.

// BTM6964
Comment 12 Armands Liepins 2015-07-16 08:34:21 UTC
Is this fixed in samba 4? I am getting this with 4.1.17 on Fedora 21. This happens on slow links, 512 kbit e.g.
Comment 13 Jeremy Allison 2015-07-16 17:08:40 UTC
(In reply to Armands Liepins from comment #12)

No, it isn't fixed. Note my comment:

> This isn't fixed yet as the fix is *really* *really* hard in the general case.
>
> Inside cli_push() we expect to be able to call a blocking function 'source' 
> supplied to it, which returns with the buffer to send.
>
> As this is just a generic function pointer it doesn't expose a file descriptor 
> to wait on.

This isn't going to be a quick or easy fix.