Bug 816 - client-side transport errors lead to invalid client state structure
Summary: client-side transport errors lead to invalid client state structure
Status: RESOLVED WONTFIX
Alias: None
Product: Samba 2.2
Classification: Unclassified
Component: libsmbclient (show other bugs)
Version: 2.2.8a
Hardware: All Solaris
: P3 major
Target Milestone: ---
Assignee: Gerald (Jerry) Carter (dead mail address)
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-11-25 15:36 UTC by Cameron Paine
Modified: 2005-11-14 09:31 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Cameron Paine 2003-11-25 15:36:32 UTC
This bug report is a little unusual. Working from the top down--I report the 
symptoms and someone else finds the bug--proved to be too difficult. The 
manifestations occurred too infrequently. Therefore I carefully analysed the 
library code and added additional trace messages to assist in pin-pointing it. 
Consequently I have a good understanding of the problem and can give you a 
detailed description of what is happening.

In the description that follows, I will use smbc_opendir() to exemplify the 
problem. The call chain that results from a call to this function is where I 
concentrated my diagnostic efforts. I suspect, but have only confirmed with a 
few runs of grep, that the problem is manifest in any library function that 
results in I/O at the transport layer (i.e. most of the library's interface 
layer).

To describe the problem, I will break it into two: the trigger and the 
consequence. The consequence will probably disappear once the trigger has been 
eliminated--but I'm not sufficiently familiar with the code to say that for 
certain.

Let's do a code walk-through of a trigger event. I'll use the sources 
distributed with 2.2.8a for line number references. I've had a peek at later 
versions in the CVS and they all seem to have the same problem. You won't be 
able to confirm the execution profile unless you use a source-level debugger. 
There are not enough trace messages in the distributed code. (You can trust me 
though. I've spent about a full-time week on this).

// Here's a library client code fragment that attempts to open
// a *directory* on a w2k server. Assume that smbc_init() has been
// called and passed a valid authenticator.
if ((dfd = smbc_opendir("smb://w2ksvr/directory")) < 0)
  printf("opendir failed - %s\n", strerror(errno));
else
{
  printf("opendir succeeded\n");
  smbc_closedir(dfd);
}

The above fragment results in the following call chain (line numbers are in 
parentheses):

libsmb/libsmbclient.c:smbc_opendir(1820)
  libsmb/clilist.c:cli_list(465)
    libsmb/clilist.c:cli_list_new(206)

Note that the above line is inside a loop. This is important because it leads 
to another aspect of this bug that I haven't encountered but that is there 
nonetheless. This will ultimately effect the value of total_received. I'll 
discuss that in a moment.

      libsmb/clitrans.c:cli_receive_trans(155 or 269)
        libsmb/clientgen.c:cli_receive_smb(50)
          lib/util_sock.c:client_receive_smb(670)
            lib/util_sock.c:receive_smb(607)
              lib/util_sock.c:read_smb_length_return_keepalive(545)
                lib/util_sock.c:read_socket_with_timeout(285 or 288)
                  lib/system.c:sys_read(85)
                    libc:read()

If read(2) returns an error (let's say we get a connection reset) it returns -1 
and sets errno to ECONNRESET. If we unwind the stack, we'll get to the trigger 
event.

                  lib/system.c:sys_read(87)
                  return ret; // ret == -1

[2003/11/07 17:00:16, 0] lib/util_sock.c:read_socket_with_timeout(300)
  read_socket_with_timeout: timeout read. read error = Connection reset by peer.

                lib/util_sock.c:read_socket_with_timeout(301)
                smb_read_error = READ_ERROR;
                return -1;

              lib/util_sock.c:read_smb_length_return_keepalive(550)
              return(-1);

            lib/util_sock.c:receive_smb(619)
            return(False);

        libsmb/clientgen.c:cli_receive_smb(69)
        /* If the server is not responding, note that now */
        if (!ret) {
                cli->smb_rw_error = smb_read_error;
                close(cli->fd);
                cli->fd = -1;
        }

        return ret;

This is the trigger. Note that we invalidate the socket and we record the 
reason in the client state structure (smb_rw_error). Note too that we do not 
reset errno. Keep unwinding.

      libsmb/clitrans.c:cli_receive_trans(156)
      return False;

    libsmb/clilist.c:cli_list_new(220)
    if (cli_is_error(cli) || !rdata || !rparam)
      break;

We break out here because both rdata and rparam are zero, *not* because 
cli_is_error(cli) is TRUE as you might expect. (They're zero as a side-effect 
of the error though). I'll come back to cli_is_error() in a moment. Let's 
finish the unwind.

The above break takes us out of the while (ff_eos == 0) loop. There's two 
possible scenarios here. If this is the first pass through the loop, 
total_received == -1. If it's a subsequent pass--possible if the number of 
files in the directory requires more than one trans--total_received will hold 
the number we've fetched so far.

    libsmb/clilist.c:cli_list_new(297)

We're out of the loop and we process any file names we've received. If we baled 
out of the loop because of an error on the first pass, total_received will be -
1; otherwise it will hold the number of files we've read so far (AFAICS it 
won't be zero but I can't say for sure. I'm no expert in this SMB stuff).

    libsmb/clilist.c:cli_list_new(304)
    return(total_received);

  libsmb/clilist.c:cli_list(465)
  return cli_list_new(cli, Mask, attribute, fn, state);

libsmb/libsmbclient.c:smbc_opendir(1820)

We have our two scenarios to contend with here too. Since I've only seen 
cli_list_new() fail on the first pass, I can only speculate about what might 
happen if it partially read the directory. Let's follow the concrete case and 
return to the speculation in a moment.

libsmb/libsmbclient.c:smbc_opendir(1823)
if (smbc_file_table[slot]) {
  SAFE_FREE(smbc_file_table[slot]->fname);
  SAFE_FREE(smbc_file_table[slot]);
}
errno = smbc_errno(&srv->cli);
return -1;

We tidy up the file table entry and (attempt to) set errno. Unfortunately, 
smbc_errno() is broken so errno == 0. Here we obliterate the real errno, which 
up to this point was holding the reason that the transport failed. We then pass 
an error indication back to the caller.

Note that we don't do anything with the server structure (*srv) which contains 
the invalidated client state (srv->cli). This is the real effect of the 
trigger. The library interface only allows the caller to expunge the server 
structure from the cache indirectly--through a call to smbc_closedir(). They 
won't be able to call closedir because we're giving them back -1.

Let me summarise the trigger:

Transport error occurs.
The client socket is invalidated in cli_receive_smb().
The error is caught by a side-effect in cli_list_new().
The error cleanup in smbc_opendir() overlooks the (now invalid) server 
structure.
The library client gets no indication of the nature of the error (errno == 0).

Let's look at the consequential failure. It becomes evident on subsequent calls 
to smbc_opendir for the *same* share.

libsmb/libsmbclient.c:smbc_opendir(1802)
srv = smbc_server(server, share, workgroup, user, password);

This retrieves our server structure (with the invalid client state). Now we 
continue down the same call chain as before:

libsmb/libsmbclient.c:smbc_opendir(1820)
  libsmb/clilist.c:cli_list(465)
    libsmb/clilist.c:cli_list_new(206)
      libsmb/clitrans.c:cli_receive_trans(155)
        libsmb/clientgen.c:cli_receive_smb(46)
        if (cli->fd == -1)
                return False;

The invalid client state is trapped and we unwind the stack much like before. 
At this point our caller is deadlocked. The invalid server structure will 
remain in cache forever. There's nothing the caller can do, other than 
terminate the process, that will expunge it.

Let me return to the speculation about what will happen if cli_list_new() is 
partially successful (i.e. the transport failure occurs after some filenames 
have been returned).

smbc_opendir() will return a valid directory descriptor to its caller. The 
functions smbc_readdir() and smbc_getdents(), which don't (appear to) perform 
any actual I/O, will operate on the incomplete file list. The caller has no way 
of knowing that the list is incomplete. Eventually the caller will call 
smbc_closedir() which will trigger the expulsion of the invalid server 
structure from the cache.

If the caller performs other operations on that share *before* they call 
smbc_closedir(), the invalid server structure will be retrieved from the cache 
and those I/O attempts will fail (because the socket fd is invalid). Again 
they'll have no indication as to what caused the failure.

As I said, it's speculation not backed up by observation. But it might explain 
a few other problems that library users have reported (I'm not thinking of 
anything specific).

There's some unfinished business:

smbc_errno() and cli_is_error() both fail for the same ultimate reason. They 
only consider server-side errors. Client-side errors, presumably as indicated 
by a non-zero cli->smb_rw_error are ignored. This is probably true of all the 
cli*error functions. I've only looked at the above two because they're 
implicated in the failure I'm describing here.

NOTE: even if they did grok client-side errors, the bug(s) I've described in 
this report would remain. This is because there are many break-out points from 
the while (ff_eos == 0) loop in cli_list_new() but the subsequent code doesn't 
handle all the conditions that lead to the break.

One final, possibly useful, observation. Aside from the places where it is set 
in clientgen.c, cli->smb_rw_error is *only* ever accessed in 
libsmb/clierror.c:cli_errstr() and only then for the purposes of creating an 
error string.

This bug seems to be more a design problem than a simple one-line coding error. 
When I started researching it, I was flirting with the idea of fixing it myself 
and offering a patch. By the time I had finished I had come to the conclusion 
that any fix I could offer would be naive. The author(s) of the library have 
left us no insights into their error-handling intentions. And since transport 
errors can impinge on almost every library interface function, the fix does not 
appear to be simple.

If I can assist any further in the elimination of this bug, drop me a line.

Cameron Paine
cbp@null.net
Comment 1 Gerald (Jerry) Carter (dead mail address) 2003-11-25 15:59:58 UTC
Does this same problem exist in 3.0 ?  If so, I would like 
to forward it there so we can fix it.
Comment 2 Cameron Paine 2003-11-25 19:22:09 UTC
Jerry, based on the version 3 modules I downloaded (for inspiration) I'd say 
that this exists in 3.0. I'm not running 3 on any of the sites I support. If it 
will help to get it fixed, I'm happy to bench test the code from 3.0. I know 
the problem profile quite well so I reckon I could give you an authoratative 
answer. Is that of any use to you?

Cameron

PS: I have four or five other aberrant behaviours in 2.2.8a to report. Should I 
check them against 3 before I submit them?
Comment 3 Gerald (Jerry) Carter (dead mail address) 2004-02-17 08:45:53 UTC
Sorry, but the 2.2 is not under development any longer.
If you can reproduce this bug against the latest 3.0 release, 
please reopen this bug and change the version in the report.
Thanks.
Comment 4 Gerald (Jerry) Carter (dead mail address) 2005-11-14 09:31:00 UTC
database cleanup