Created attachment 10676 [details] Checking the connection alive status before reusing the cached connection handle

Libsmbclient maintains a connection handle (cli) cache per SMBC context. This connection cache is looked up when connecting to a server and share combination. We found an issue when the SMBC-context-specific cached server connection is re-used: the server connection is looked up and found in the cache, but before it is used there is no check whether the connection is still alive. In SMB2 mode, the Windows file server (2008 R2) has a timer called the 'Session Expiration Timer', which is 45 secs. If we hold the SMBC context for longer than that (45 secs) and then re-use it, libsmbclient fails the operation with NT_STATUS_CONNECTION_RESET.

Proposing a patch to resolve this issue. The patch attempts an SMB echo (keepalive) request before re-using the connection. This is done in SMBC_check_server(), which is called from SMBC_find_server(), which in turn is called from SMBC_server_internal() to get the cached server connection. If the connection has been closed from the file server end, the SMB echo request detects that. The code in SMBC_find_server() then takes care of returning NULL to SMBC_server_internal(), which will re-attempt the connection afresh.
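For anyone skimming the thread, here is a minimal sketch of the liveness check the patch performs before handing back a cached SMBCSRV. The real change lives in SMBC_check_server() in the attachment; the helper name below is made up for this illustration and only shows the idea, not the exact patch:

--------------------------------------------------------------------------
/* Sketch only: probe the cached connection with an SMB echo before
 * re-using it.  If the server has already torn the session down, the
 * echo fails (typically with NT_STATUS_CONNECTION_RESET) and the
 * cached entry should be dropped so the caller reconnects. */
static bool cached_connection_is_alive(SMBCSRV *server)
{
	unsigned char garbage[16];
	NTSTATUS status;

	memset(garbage, 0xf0, sizeof(garbage));
	status = cli_echo(server->cli, 1,
			  data_blob_const(garbage, sizeof(garbage)));
	return NT_STATUS_IS_OK(status);
}
--------------------------------------------------------------------------

If this returns false, SMBC_find_server() should return NULL so that SMBC_server_internal() attempts a fresh connection.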
Created attachment 10690 [details] Checking the connection alive status before reusing the cached connection handle Updated patch file.
Can you please provide a network capture showing the problem? Which server (exact OS and version) are you connecting to?
(In reply to Stefan (metze) Metzmacher from comment #2)

We are finding it hard to trace this in packet captures, as they get far too large by the time the problem starts happening. I enabled smbclient logging and got this kind of error:

************************************
[2015/01/17 16:04:30.092963, 2, pid=1085, effective(0, 0), real(0, 0)] ../lib/util/tevent_debug.c:63(samba_tevent_debug)
  samba_tevent: EPOLL_CTL_DEL EBADF for fde[0x7a2a30] mpx_fde[(nil)] fd[1194] - disabling
[2015/01/17 16:04:30.654751, 5, pid=1085, effective(0, 0), real(0, 0)] ../source3/libsmb/libsmb_xattr.c:910(cacl_get)
  cacl_get failed to open \share-name\somefile.doc: NT_STATUS_CONNECTION_RESET
[2015/01/17 16:04:30.654833, 10, pid=1085, effective(0, 0), real(0, 0)] ../source3/lib/errmap_unix.c:265(map_errno_from_nt_status)
  map_errno_from_nt_status: 32 bit codes: code=c000020c
************************************

The packet captures are too large to attach to the bugzilla (> 10 GB). I will reproduce the problem in a contained environment and get a smaller packet trace to attach here.

The server we are using is "Windows 2008 R2 Standard".
*** Bug 10615 has been marked as a duplicate of this bug. ***
I'm wondering if we can reduce the impact of this patch (doing an echo call before each pathname activity) by adding a 'last checked' timestamp, and only calling the echo if that exceeds a threshold (30 seconds ?).
Created attachment 10884 [details] Proposed patch for master.

Hi, this fixes the smbstat.c test for me. Can someone check if it fixes the issues with the GNOME VFS ?

This won't fix situations where the server goes away whilst valid resources (open files or directories) are held, but it should auto-reconnect in situations where the server drops the connection due to idle timeouts (no open resources). It only does the echo check every cli->timeout milliseconds, not on every use of the SMBCSRV *server struct.

Thanks a *lot* to <shargagan@novell.com> who came up with the idea for this !

Jeremy.
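Extending the sketch from the description above, the throttling works roughly like this (the last_echo_time field and the helper name are inventions for this sketch; the attached patch is authoritative):

--------------------------------------------------------------------------
/* Illustration only: probe the cached connection at most once per
 * cli->timeout milliseconds instead of on every path-based call. */
static bool cached_entry_still_usable(SMBCSRV *server)
{
	unsigned char garbage[16];
	NTSTATUS status;

	if (timeval_elapsed(&server->last_echo_time) * 1000.0 <
	    (double)server->cli->timeout) {
		/* Checked recently enough - trust the cached entry. */
		return true;
	}

	memset(garbage, 0xf0, sizeof(garbage));
	status = cli_echo(server->cli, 1,
			  data_blob_const(garbage, sizeof(garbage)));
	if (!NT_STATUS_IS_OK(status)) {
		/* Server dropped the idle connection - discard it. */
		return false;
	}

	server->last_echo_time = timeval_current();
	return true;
}
--------------------------------------------------------------------------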
(In reply to Stefan (metze) Metzmacher from comment #2) Metze, I can reproduce this problem using the smbstat.c code from bug 10615, with deadtime = 1 on smbd. Let me know if you want traces of the failure case + the success case with this patch applied. Jeremy.
Created attachment 10885 [details] Test case as reproducer.
Comment on attachment 10885 [details] Test case as reproducer.

>#include <stdio.h>
>#include <stdlib.h>
>#include <unistd.h>
>#include <string.h>
>#include <time.h>
>#include <sys/stat.h>
>#include <libsmbclient.h>
>
>static void
>get_auth_data_fn(const char * pServer,
>                 const char * pShare,
>                 char * pWorkgroup,
>                 int maxLenWorkgroup,
>                 char * pUsername,
>                 int maxLenUsername,
>                 char * pPassword,
>                 int maxLenPassword)
>{
>    strcpy(pUsername, "USER");
>    strcpy(pPassword, "PASSWORD");
>}
>
>static void do_stat(void)
>{
>    char m_time[32];
>    char c_time[32];
>    char a_time[32];
>    struct stat st;
>
>    if (smbc_stat("smb://127.0.0.1/tmp/a", &st) < 0)
>    {
>        perror("smbc_stat");
>        fprintf(stderr, "SMBC_STAT failed\n");
>        exit(1);
>    } else {
>        printf("\nSAMBA\n mtime:%lu/%s ctime:%lu/%s atime:%lu/%s\n",
>            st.st_mtime, ctime_r(&st.st_mtime, m_time),
>            st.st_ctime, ctime_r(&st.st_ctime, c_time),
>            st.st_atime, ctime_r(&st.st_atime, a_time));
>    }
>}
>
>int main(int argc, char * argv[])
>{
>    smbc_init(get_auth_data_fn, 10);
>
>    do_stat();
>    sleep(130);
>    do_stat();
>    do_stat();
>    do_stat();
>
>    return 0;
>}
Created attachment 10886 [details] Test case as reproducer.
The content of attachment 10885 [details] has been deleted
Jeremy, I've given it a test and it works fine against Samba servers. However, against a Windows 7 server, it causes a 20 second hang. See how after it gets ECONNRESET, an epoll_wait is performed which times out after 20 seconds.

[pid 4983] 1426719864.344717 epoll_create(64) = 7
[pid 4983] 1426719864.344730 fcntl(7, F_GETFD) = 0
[pid 4983] 1426719864.344742 fcntl(7, F_SETFD, FD_CLOEXEC) = 0
[pid 4983] 1426719864.344768 epoll_ctl(7, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=403003680, u64=140192430512416}}) = 0
[pid 4983] 1426719864.344784 epoll_wait(7, {{EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=403003680, u64=140192430512416}}}, 1, 20000) = 1
[pid 4983] 1426719864.344801 ioctl(10, FIONREAD, [0]) = 0
[pid 4983] 1426719864.344827 epoll_ctl(7, EPOLL_CTL_DEL, 10, 7f812711e7a0) = 0
[pid 4983] 1426719864.344839 writev(10, [{"\0\0\0005", 4}, {"\377SMB+\0\0\0\0\30C\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\f\0"..., 33}, {"\1\0", 2}, {"\20\0", 2}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16}], 5) = -1 ECONNRESET (Connection reset by peer)
[pid 4983] 1426719864.344869 close(8) = 0
[pid 4983] 1426719864.344881 close(10) = 0
[pid 4983] 1426719864.344898 epoll_wait(7, {}, 1, 20000) = 0
[pid 4983] 1426719884.349196 close(7) = 0
[pid 4983] 1426719884.349320 epoll_create(64) = 7
[pid 4983] 1426719884.349438 fcntl(7, F_GETFD) = 0
[pid 4983] 1426719884.349511 fcntl(7, F_SETFD, FD_CLOEXEC) = 0
[pid 4983] 1426719884.349588 close(7) = 0
[pid 4983] 1426719884.349615 write(1, "smbc_remove_usused_server: 0x7f8"..., 51) = 51
[pid 4983] 1426719884.349642 write(1, "SMBC_server: server_n=[192.168.1"..., 59) = 59
[pid 4983] 1426719884.349657 write(1, " -> server_n=[192.168.1.51] serv"..., 50) = 50

The difference is that with Samba servers, the remote closes the connection properly (FIN) and the tevent system handles that correctly. With Windows, it just issues an RST and tevent doesn't handle it properly.
Log against Samba:

[pid 5320] 1426720838.557344 epoll_create(64) = 7
[pid 5320] 1426720838.557358 fcntl(7, F_GETFD) = 0
[pid 5320] 1426720838.557371 fcntl(7, F_SETFD, FD_CLOEXEC) = 0
[pid 5320] 1426720838.557398 epoll_ctl(7, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=1611002240, u64=140610250338688}}) = 0
[pid 5320] 1426720838.557416 epoll_wait(7, {{EPOLLIN|EPOLLOUT, {u32=1611002240, u64=140610250338688}}}, 1, 20000) = 1
[pid 5320] 1426720838.557434 ioctl(10, FIONREAD, [0]) = 0
[pid 5320] 1426720838.557460 epoll_ctl(7, EPOLL_CTL_MOD, 10, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=1611002240, u64=140610250338688}}) = 0
[pid 5320] 1426720838.557474 writev(10, [{"\0\0\0005", 4}, {"\377SMB+\0\0\0\0\30C\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0*\0"..., 33}, {"\1\0", 2}, {"\20\0", 2}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16}], 5) = 57
[pid 5320] 1426720838.557514 epoll_ctl(7, EPOLL_CTL_DEL, 10, 7fe26d29b5e0) = 0
[pid 5320] 1426720838.557531 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1611002064, u64=140610250338512}}) = 0
[pid 5320] 1426720838.557546 epoll_wait(7, {{EPOLLIN, {u32=1611002064, u64=140610250338512}}}, 1, 20000) = 1
[pid 5320] 1426720838.557561 recvfrom(8, "", 4, 0, NULL, NULL) = 0
[pid 5320] 1426720838.557580 epoll_ctl(7, EPOLL_CTL_DEL, 8, 7fe26d29b5d0) = 0
[pid 5320] 1426720838.557601 close(8) = 0
[pid 5320] 1426720838.557614 close(10) = 0
[pid 5320] 1426720838.557639 close(7) = 0
[pid 5320] 1426720838.557657 epoll_create(64) = 7
[pid 5320] 1426720838.557670 fcntl(7, F_GETFD) = 0
[pid 5320] 1426720838.557683 fcntl(7, F_SETFD, FD_CLOEXEC) = 0
[pid 5320] 1426720838.557711 close(7) = 0
[pid 5320] 1426720838.557729 write(1, "smbc_remove_usused_server: 0x7fe"..., 51) = 51
[pid 5320] 1426720838.557750 write(1, "SMBC_server: server_n=[192.168.1"..., 59) = 59
[pid 5320] 1426720838.557765 write(1, " -> server_n=[192.168.1.48] serv"..., 50) = 50

Thanks for looking at this.
It helps to set the idle timeout on Windows to 1 minute rather than 15 and then reboot:

\HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\
Value type: Dword
Value name: Autodisconnect
Default: 15 (minutes)
(In reply to Ross Lagerwall from comment #12) Ah - ok. This is a tevent bug then. In the hope of shortening my testing time, do you know offhand what flag we need to pass to EPOLL_CTL_ADD in order to get an epoll_wait() to return on a RST on a TCP socket ? I'll try and work this out tomorrow... Thanks ! Jeremy.
Created attachment 10888 [details] Raw horrible hack patch for tevent :-).

Ross, from reading the man page it seems we might need to add EPOLLRDHUP to the tevent epoll lists. Can you try the test again with this (hacky :-) raw patch to tevent? If it works, then I've got to write the test code that proves to Metze we need EPOLLRDHUP added here :-). I've added metze in as a reviewer (just to get him to look at this, not really asking for a +1 :-) to give him a heads-up.

Thanks ! Jeremy.
(In reply to Jeremy Allison from comment #14)

> Ah - ok. This is a tevent bug then.

Or maybe not. Looking at the strace really carefully again I see in the Windows failure case we close the sockets 8 and 10 once we get the ECONNRESET after the write, and then the epoll_wait is being done with *just* a timeout - so no wonder it's not getting a return on RST. That means the "Raw horrible hack patch for tevent" attachment is almost certainly wrong :-). That might be a logic bug in the way libcli/smb/smbXcli_base.c drives tevent instead.

I think I'll need to reproduce this against Windows and gdb the smbstat.c to find out *exactly* what is going on. Can you get me a debug level 10 log of the client reproducing this against Windows so I can make a start ?
Comment on attachment 10888 [details] Raw horrible hack patch for tevent :-). Withdrawing this patch - it's clearly wrong (and the strace shows EPOLLERR on RST so this isn't it).
Created attachment 10889 [details] raw patch for master.

Ok, I'm pretty sure I've found it. In the SMB1 case, if a writev fails we notice it inside libcli/smb/smbXcli_base.c:smb1cli_req_writev_done() when

	nwritten = writev_recv(subreq, &err);

comes back as -1. We then call:

	if (nwritten == -1) {
		NTSTATUS status = map_nt_error_from_unix_common(err);
		smbXcli_conn_disconnect(state->conn, status);

smbXcli_conn_disconnect() calls tevent_req_nterror(req, status) on all the reqs on the pending queue. However, in the SMB1 case - because we can have one-way writes - we don't add the req onto the pending queue until *after* the writev_recv() has completed *successfully*. We can see this in the original code for smb1cli_req_writev_done(), which looks like:

--------------------------------------------------------------------------
static void smb1cli_req_writev_done(struct tevent_req *subreq)
{
	struct tevent_req *req =
		tevent_req_callback_data(subreq,
		struct tevent_req);
	struct smbXcli_req_state *state =
		tevent_req_data(req,
		struct smbXcli_req_state);
	ssize_t nwritten;
	int err;

	nwritten = writev_recv(subreq, &err);
	TALLOC_FREE(subreq);
	if (nwritten == -1) {
		NTSTATUS status = map_nt_error_from_unix_common(err);
		smbXcli_conn_disconnect(state->conn, status);
		return;
	}

	if (state->one_way) {
		state->inbuf = NULL;
		tevent_req_done(req);
		return;
	}

	if (!smbXcli_req_set_pending(req)) {
		tevent_req_nterror(req, NT_STATUS_NO_MEMORY);
		return;
	}
}
--------------------------------------------------------------------------

Note how smbXcli_req_set_pending(req) isn't called until *after* we would have returned early if nwritten == -1. So the call to smbXcli_conn_disconnect() in this case doesn't cancel the current outstanding tevent req, and so it's left on the event queue until it times out with the default timeout of 20 seconds.

The solution is the given patch, which adds a:

	tevent_req_nterror(req, status);

just after the call to smbXcli_conn_disconnect(state->conn, status) and so correctly cancels the current request.

Note this isn't needed in any other calls that use smbXcli_conn_disconnect() on error, as they are either SMB1 reads (which by definition already have the original sender on the pending queue) or are SMB2 requests - and in the SMB2 code path we add the tevent reqs to the pending queue *before* we do the writev_send - there are no one-way sends in SMB2.

I haven't tested this (I'll do that tomorrow), but I'm pretty sure this will fix the problem you discovered. Thanks for the help. That would have been a very difficult bug to find without that strace !
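For clarity, the failure branch in smb1cli_req_writev_done() with that one extra call looks roughly like this (a sketch based on the description above, not a verbatim copy of the attached patch):

--------------------------------------------------------------------------
	nwritten = writev_recv(subreq, &err);
	TALLOC_FREE(subreq);
	if (nwritten == -1) {
		NTSTATUS status = map_nt_error_from_unix_common(err);
		/* Fails every request already on the pending queue... */
		smbXcli_conn_disconnect(state->conn, status);
		/* ...but this req isn't pending yet (SMB1 one-way sends
		 * only go pending after a successful writev), so fail it
		 * explicitly rather than leaving it to hit the default
		 * 20 second timeout. */
		tevent_req_nterror(req, status);
		return;
	}
--------------------------------------------------------------------------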
The additional patch works properly against Samba and Windows servers. Thanks
(In reply to Ross Lagerwall from comment #19)

Thanks for the testing ! I'm going to split this patch out into a separate bug, as it's a generic issue in the client code, not reconnect specific. I'll make this bug dependent on that new bug and submit both fixes for master; then once they're in, get the back-ports done.

Cheers,

Jeremy.
Created attachment 10898 [details] git-am cherry-pick from master for 4.2.next, 4.1.next. Applies cleanly to 4.2.next, 4.1.next - cherry-picked from master.
Comment on attachment 10898 [details] git-am cherry-pick from master for 4.2.next, 4.1.next. LGTM.
@Karolin, please merge to the maintenance branches.
Re-assigning to Karolin for inclusion in 4.2.next, 4.1.next. Note this also depends on the patch in bug #11173.
(In reply to Jeremy Allison from comment #24)

One question regarding the changes. Will this patch also work for SMB2-negotiated connections? Or would we need to make a call to smb2cli_echo() after checking the protocol level of the connection?

Thanks,
Har Gagan Sahai
cli_echo() takes care of sending the echo over both SMB1 and SMB2, depending on which protocol the connection negotiated, so it covers both cases.
Pushed to autobuild-v4-[1|2]-test.
(In reply to Jeremy Allison from comment #26) Thanks for the details.
(In reply to Karolin Seeger from comment #27) Pushed to both branches. Closing out bug report. Thanks!