This bug report is a little unusual. Working from the top down--I report the symptoms and someone else finds the bug--proved to be too difficult. The manifestations occurred too infrequently. Therefore I carefully analysed the library code and added additional trace messages to assist in pin-pointing it. Consequently I have a good understanding of the problem and can give you a detailed description of what is happening. In the description that follows, I will use smbc_opendir() to exemplify the problem. The call chain that results from a call to this function is where I concentrated my diagnostic efforts. I suspect, but have only confirmed with a few runs of grep, that the problem is manifest in any library function that results in I/O at the transport layer (i.e. most of the library's interface layer). To describe the problem, I will break it into two: the trigger and the consequence. The consequence will probably disappear once the trigger has been eliminated--but I'm not sufficiently familiar with the code to say that for certain. Let's do a code walk-through of a trigger event. I'll use the sources distributed with 2.2.8a for line number references. I've had a peek at later versions in the CVS and they all seem to have the same problem. You won't be able to confirm the execution profile unless you use a source-level debugger. There are not enough trace messages in the distributed code. (You can trust me though. I've spent about a full-time week on this). // Here's a library client code fragment that attempts to open // a *directory* on a w2k server. Assume that smbc_init() has been // called and passed a valid authenticator. if ((dfd = smbc_opendir("smb://w2ksvr/directory")) < 0) printf("opendir failed - %s\n", strerror(errno)); else { printf("opendir succeeded\n"); smbc_closedir(dfd); } The above fragment results in the following call chain (line numbers are in parentheses): libsmb/libsmbclient.c:smbc_opendir(1820) libsmb/clilist.c:cli_list(465) libsmb/clilist.c:cli_list_new(206) Note that the above line is inside a loop. This is important because it leads to another aspect of this bug that I haven't encountered but that is there nonetheless. This will ultimately effect the value of total_received. I'll discuss that in a moment. libsmb/clitrans.c:cli_receive_trans(155 or 269) libsmb/clientgen.c:cli_receive_smb(50) lib/util_sock.c:client_receive_smb(670) lib/util_sock.c:receive_smb(607) lib/util_sock.c:read_smb_length_return_keepalive(545) lib/util_sock.c:read_socket_with_timeout(285 or 288) lib/system.c:sys_read(85) libc:read() If read(2) returns an error (let's say we get a connection reset) it returns -1 and sets errno to ECONNRESET. If we unwind the stack, we'll get to the trigger event. lib/system.c:sys_read(87) return ret; // ret == -1 [2003/11/07 17:00:16, 0] lib/util_sock.c:read_socket_with_timeout(300) read_socket_with_timeout: timeout read. read error = Connection reset by peer. lib/util_sock.c:read_socket_with_timeout(301) smb_read_error = READ_ERROR; return -1; lib/util_sock.c:read_smb_length_return_keepalive(550) return(-1); lib/util_sock.c:receive_smb(619) return(False); libsmb/clientgen.c:cli_receive_smb(69) /* If the server is not responding, note that now */ if (!ret) { cli->smb_rw_error = smb_read_error; close(cli->fd); cli->fd = -1; } return ret; This is the trigger. Note that we invalidate the socket and we record the reason in the client state structure (smb_rw_error). Note too that we do not reset errno. Keep unwinding. libsmb/clitrans.c:cli_receive_trans(156) return False; libsmb/clilist.c:cli_list_new(220) if (cli_is_error(cli) || !rdata || !rparam) break; We break out here because both rdata and rparam are zero, *not* because cli_is_error(cli) is TRUE as you might expect. (They're zero as a side-effect of the error though). I'll come back to cli_is_error() in a moment. Let's finish the unwind. The above break takes us out of the while (ff_eos == 0) loop. There's two possible scenarios here. If this is the first pass through the loop, total_received == -1. If it's a subsequent pass--possible if the number of files in the directory requires more than one trans--total_received will hold the number we've fetched so far. libsmb/clilist.c:cli_list_new(297) We're out of the loop and we process any file names we've received. If we baled out of the loop because of an error on the first pass, total_received will be - 1; otherwise it will hold the number of files we've read so far (AFAICS it won't be zero but I can't say for sure. I'm no expert in this SMB stuff). libsmb/clilist.c:cli_list_new(304) return(total_received); libsmb/clilist.c:cli_list(465) return cli_list_new(cli, Mask, attribute, fn, state); libsmb/libsmbclient.c:smbc_opendir(1820) We have our two scenarios to contend with here too. Since I've only seen cli_list_new() fail on the first pass, I can only speculate about what might happen if it partially read the directory. Let's follow the concrete case and return to the speculation in a moment. libsmb/libsmbclient.c:smbc_opendir(1823) if (smbc_file_table[slot]) { SAFE_FREE(smbc_file_table[slot]->fname); SAFE_FREE(smbc_file_table[slot]); } errno = smbc_errno(&srv->cli); return -1; We tidy up the file table entry and (attempt to) set errno. Unfortunately, smbc_errno() is broken so errno == 0. Here we obliterate the real errno, which up to this point was holding the reason that the transport failed. We then pass an error indication back to the caller. Note that we don't do anything with the server structure (*srv) which contains the invalidated client state (srv->cli). This is the real effect of the trigger. The library interface only allows the caller to expunge the server structure from the cache indirectly--through a call to smbc_closedir(). They won't be able to call closedir because we're giving them back -1. Let me summarise the trigger: Transport error occurs. The client socket is invalidated in cli_receive_smb(). The error is caught by a side-effect in cli_list_new(). The error cleanup in smbc_opendir() overlooks the (now invalid) server structure. The library client gets no indication of the nature of the error (errno == 0). Let's look at the consequential failure. It becomes evident on subsequent calls to smbc_opendir for the *same* share. libsmb/libsmbclient.c:smbc_opendir(1802) srv = smbc_server(server, share, workgroup, user, password); This retrieves our server structure (with the invalid client state). Now we continue down the same call chain as before: libsmb/libsmbclient.c:smbc_opendir(1820) libsmb/clilist.c:cli_list(465) libsmb/clilist.c:cli_list_new(206) libsmb/clitrans.c:cli_receive_trans(155) libsmb/clientgen.c:cli_receive_smb(46) if (cli->fd == -1) return False; The invalid client state is trapped and we unwind the stack much like before. At this point our caller is deadlocked. The invalid server structure will remain in cache forever. There's nothing the caller can do, other than terminate the process, that will expunge it. Let me return to the speculation about what will happen if cli_list_new() is partially successful (i.e. the transport failure occurs after some filenames have been returned). smbc_opendir() will return a valid directory descriptor to its caller. The functions smbc_readdir() and smbc_getdents(), which don't (appear to) perform any actual I/O, will operate on the incomplete file list. The caller has no way of knowing that the list is incomplete. Eventually the caller will call smbc_closedir() which will trigger the expulsion of the invalid server structure from the cache. If the caller performs other operations on that share *before* they call smbc_closedir(), the invalid server structure will be retrieved from the cache and those I/O attempts will fail (because the socket fd is invalid). Again they'll have no indication as to what caused the failure. As I said, it's speculation not backed up by observation. But it might explain a few other problems that library users have reported (I'm not thinking of anything specific). There's some unfinished business: smbc_errno() and cli_is_error() both fail for the same ultimate reason. They only consider server-side errors. Client-side errors, presumably as indicated by a non-zero cli->smb_rw_error are ignored. This is probably true of all the cli*error functions. I've only looked at the above two because they're implicated in the failure I'm describing here. NOTE: even if they did grok client-side errors, the bug(s) I've described in this report would remain. This is because there are many break-out points from the while (ff_eos == 0) loop in cli_list_new() but the subsequent code doesn't handle all the conditions that lead to the break. One final, possibly useful, observation. Aside from the places where it is set in clientgen.c, cli->smb_rw_error is *only* ever accessed in libsmb/clierror.c:cli_errstr() and only then for the purposes of creating an error string. This bug seems to be more a design problem than a simple one-line coding error. When I started researching it, I was flirting with the idea of fixing it myself and offering a patch. By the time I had finished I had come to the conclusion that any fix I could offer would be naive. The author(s) of the library have left us no insights into their error-handling intentions. And since transport errors can impinge on almost every library interface function, the fix does not appear to be simple. If I can assist any further in the elimination of this bug, drop me a line. Cameron Paine cbp@null.net
Does this same problem exist in 3.0 ? If so, I would like to forward it there so we can fix it.
Jerry, based on the version 3 modules I downloaded (for inspiration) I'd say that this exists in 3.0. I'm not running 3 on any of the sites I support. If it will help to get it fixed, I'm happy to bench test the code from 3.0. I know the problem profile quite well so I reckon I could give you an authoratative answer. Is that of any use to you? Cameron PS: I have four or five other aberrant behaviours in 2.2.8a to report. Should I check them against 3 before I submit them?
Sorry, but the 2.2 is not under development any longer. If you can reproduce this bug against the latest 3.0 release, please reopen this bug and change the version in the report. Thanks.
database cleanup