Bug 3271 - Rsync instances stay in memory when running in daemon mode
Summary: Rsync instances stay in memory when running in daemon mode
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.8
Hardware: x86 Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-18 06:14 UTC by Andrei Kalach
Modified: 2006-10-15 13:33 UTC
CC List: 2 users

See Also:


Attachments
log file of strace, log file of rsync and log.c file (497.99 KB, application/x-bzip2)
2005-11-18 06:21 UTC, Andrei Kalach
patch to prevent lingering processes (1.55 KB, patch)
2005-11-26 10:08 UTC, James Ranson
different way of handling a broken socket (1.83 KB, patch)
2005-11-26 12:23 UTC, James Ranson
even better way of handling a dead socket (1.35 KB, patch)
2005-12-01 19:24 UTC, James Ranson
avoids select() call when possible (1.99 KB, patch)
2005-12-02 22:35 UTC, James Ranson

Description Andrei Kalach 2005-11-18 06:14:36 UTC
Hello, all

We are using rsync as a file transfer tool to perform run-time
synchronization of two remote servers, so rsync is started each time a
file is modified to propagate the changes to the remote peer.

Receiving side.
   rsync runs in daemon mode (rsync --daemon).

Sending side.
   There is a frequently modified file which rsync sends to the
receiving side after each modification.
   Before sending the file, our system checks for an existing rsync
process that is already sending the same file.
   If one exists, that rsync process is terminated with a TERM signal,
and then a new rsync process is started.

Problem.
   Sometimes child rsync processes appear on the receiving side that
hang in memory permanently and never finish. I observed that these
processes write the string "rsync: read error: Connection reset by
peer" to the log file and then hang.
   After I removed logging from rsync (see the attached log.c file),
the problem no longer appeared.
   This problem occurs only when rsync is in daemon mode. When we use
rsync in standalone mode, everything works fine.

   I assume that the cause of the problem is the use of
string-formatting functions in the signal handlers in the rsync source
code. Many such formatting functions use static buffers/variables, and
when invoked from a signal handler they may produce unpredictable
results.

All testing was done on SuSE Linux 9.0 x86 with the rsync 2.6.6
release and with the latest CVS snapshot.


:/var/log/trace/rsync> ps ax|grep rsync
18998 pts/8    S      0:36 strace -f rsync --daemon --no-detach
18999 pts/8    S      0:00 rsync --daemon --no-detach
19450 pts/8    S      0:00 rsync --daemon --no-detach
19945 pts/8    S      0:00 rsync --daemon --no-detach
20049 pts/8    S      0:00 rsync --daemon --no-detach
20077 pts/8    S      0:00 rsync --daemon --no-detach
Comment 1 Andrei Kalach 2005-11-18 06:21:37 UTC
Created attachment 1580 [details]
log file of strace, log file of rsync and log.c file
Comment 2 James Ranson 2005-11-23 22:07:32 UTC
I have found that adding "#define SHUTDOWN_ALL_SOCKETS" to config.h (at least on the client side) solves the problem.  Without it, the connection is closed with an error (connection reset), rather than with an end-of-file.  The server's receiving code handles the error, and it goes into _exit_cleanup and gets all the way to the exit() call, but exit() hangs because it tries to flush a socket that is in an error state.  If the client properly closes its sockets, the problem does not occur.

Of course, this is only a workaround.  It does not handle the case where the connection is legitimately reset because of network errors.  It just keeps rsync from foiling itself.  A more complete solution in the receiving code is required.
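
For illustration, a minimal sketch of the difference, assuming a plain
TCP client; the helper name is hypothetical and this is not rsync's
actual cleanup code.  With shutdown(fd, SHUT_WR) the kernel sends a
FIN, so the peer's read() returns 0 (end-of-file); exiting with unread
data still pending instead makes the kernel send an RST, and the
peer's read() fails with ECONNRESET:

#include <sys/socket.h>
#include <unistd.h>

/* Graceful close (hypothetical helper): the peer reads EOF
 * instead of getting "Connection reset by peer". */
static void close_gracefully(int fd)
{
    char buf[4096];

    shutdown(fd, SHUT_WR);                  /* send FIN: peer sees EOF */
    while (read(fd, buf, sizeof buf) > 0)   /* drain what the peer sent */
        ;
    close(fd);
}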
Comment 3 James Ranson 2005-11-24 19:30:12 UTC
OK, further research has shown that I was not quite right about the cause of the problem.  In my tests, the network i/o process does die, and it's the "generator" process that hangs.  I traced it all the way into the exit_cleanup routine, and it hangs when it calls log_exit().
Comment 4 James Ranson 2005-11-26 10:05:28 UTC
(In reply to comment #3)
> In my tests, the network i/o process does die, and it's the
> "generator" process that hangs.  I traced it all the way into the exit_cleanup
> routine, and it hangs when it calls log_exit().

The generator process tries to write its error messages over the network socket, which has gone into an error state.  However, it does not detect that because select() keeps saying that the descriptor is not available for writing, so it enters an infinite loop.
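
To make the failure mode concrete, here is a minimal sketch of a
select()-based write loop of the kind described above; it is
illustrative, not rsync's actual writefd_unbuffered():

#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Spins forever if the socket's transmit buffer is full and the peer
 * has reset the connection: select() keeps returning 0, the loop
 * keeps continuing, and no error is ever surfaced. */
static int write_all(int fd, const char *buf, size_t len)
{
    size_t total = 0;

    while (total < len) {
        fd_set w_fds;
        struct timeval tv = { 60, 0 };

        FD_ZERO(&w_fds);
        FD_SET(fd, &w_fds);
        int count = select(fd + 1, NULL, &w_fds, NULL, &tv);
        if (count == 0)
            continue;               /* timeout: try again, forever */
        if (count < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        ssize_t n = write(fd, buf + total, len - total);
        if (n < 0)
            return -1;
        total += n;
    }
    return 0;
}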
Comment 5 James Ranson 2005-11-26 10:08:37 UTC
Created attachment 1594 [details]
patch to prevent lingering processes

This causes the writefd_unbuffered routine to check if the socket has died.  However, it seems to break syslog when chrooted, so let's be careful.
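
One conventional way to ask the kernel whether a socket has entered an
error state (which may or may not be how the attached patch does it)
is getsockopt() with SO_ERROR:

#include <errno.h>
#include <sys/socket.h>

/* Returns the socket's pending error (e.g. ECONNRESET) and clears it,
 * or 0 if the socket is still healthy. */
static int socket_error(int fd)
{
    int err = 0;
    socklen_t errlen = sizeof err;

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
        return errno;   /* getsockopt itself failed */
    return err;
}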
Comment 6 James Ranson 2005-11-26 12:23:08 UTC
Created attachment 1595 [details]
different way of handling a broken socket

Perhaps this is better.  The generator process doesn't use the timeout given in the config file anyway.  It is designed to block!  The other patch may give up on the socket prematurely.  This one tries to write, though it may block for a long time.
Comment 7 James Ranson 2005-12-01 19:24:38 UTC
Created attachment 1604 [details]
even better way of handling a dead socket

This patch doesn't break the contract of writefd_unbuffered because it risks a blocking write only when no timeout has been specified and there is no io_error_fd to read.  It avoids the messy attempts to detect a dead socket.
Comment 8 Wayne Davison 2005-12-02 18:05:50 UTC
Thanks for the suggested patches.  I agree that those continue statements after select() returns -1 look like they should be improved, but I'm not sure why you allowed the code in the read_timeout() function to fall through after an error: the bit-field values will be left undefined by the error, so I think we need more code to figure out what to do.  Perhaps we should change the code to only continue if count is 0 or if errno is EINTR, EWOULDBLOCK, or EAGAIN (and exit with an error on all other errno values).
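
A sketch of the handling suggested above, as a fragment inside
read_timeout()'s select loop; check_timeout(), rsyserr(),
exit_cleanup() and RERR_SOCKETIO are rsync's existing helpers, while
the surrounding shape is illustrative:

count = select(maxfd + 1, &r_fds, NULL, NULL, &tv);
if (count <= 0) {
    if (count == 0 || errno == EINTR
     || errno == EWOULDBLOCK || errno == EAGAIN) {
        check_timeout();
        continue;       /* harmless: just retry the select() */
    }
    /* any other errno leaves the fd_set bits undefined -- bail out */
    rsyserr(FERROR, errno, "select error in read_timeout");
    exit_cleanup(RERR_SOCKETIO);
}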

In the writefd_unbuffered() function, what errno is being returned when your new code is being triggered?
Comment 9 James Ranson 2005-12-02 21:23:56 UTC
(In reply to comment #8)
> Thanks for the suggested patches.  I agree that those continue statements after
> select() returns -1 look like they should be improved, but I'm not sure why you
> allowed the code in the read_timeout() function to fall through after an error:


I didn't change read_timeout().


> In the writefd_unbuffered() function, what errno is being returned when your
> new code is being triggered?


Good point.  When select() returns an error, we ought not to write.  So, my code could be put right after check_timeout().  The bug is caused by the fact that select() doesn't return an error.  It keeps returning 0.  In a perfect world, select() would detect that the socket is in an error state, but it doesn't.  It sees a full transmit buffer and says that the socket is not ready.

How about this:

if (count == 0) {
    check_timeout();
    /* --- my patch --- */
}

if (count <= 0) {
    /* --- leave this as is --- */
}
Comment 10 James Ranson 2005-12-02 22:35:57 UTC
Created attachment 1605 [details]
avoids select() call when possible

Forget what I said in my last post.  It doesn't fix the bug!!

Perhaps this patch is clearer?  It avoids the select() call entirely when it is safe to block.
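
The shape of that idea as a sketch, assuming SIGPIPE is ignored as
rsync does; io_timeout and io_error_fd follow the names used in
comment 7, and the fragment lives inside the write loop:

/* When we are allowed to block (no --timeout in effect and no
 * io_error_fd to service), skip select() entirely: a blocking
 * write() on a dead socket fails with EPIPE/ECONNRESET instead of
 * spinning on a select() that never reports readiness. */
if (io_timeout == 0 && io_error_fd < 0) {
    ssize_t n = write(fd, buf + total, len - total);
    if (n < 0) {
        if (errno == EINTR)
            continue;
        return -1;      /* dead socket: let the caller clean up */
    }
    total += n;
    continue;
}
/* ... otherwise fall back to the select()-based path ... */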
Comment 11 Wayne Davison 2006-02-01 12:53:30 UTC
Sorry for forgetting to get back to this bug for so long.  I've finally analyzed the strace output, and it shows that the generator received an error about the socket closing from the receiver (who was obviously trying to read it), but when the generator tries to select on that socket to write out the error message, the select never returns that the dead socket is writeable.

I have chosen to deal with this situation by adding a new sibling message:  when the receiver is reporting a socket-read error, the generator will note this fact and avoid writing an error down the socket, allowing it to close down gracefully when the pipe from the receiver closes.  I have checked this fix into CVS.
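
A rough sketch of that idea; MSG_SOCKERR and the flag are hypothetical
names, not the actual identifiers checked into CVS:

/* Receiver side: on a socket-read error, report it to the generator
 * over the local pipe before dying. */
send_msg(MSG_SOCKERR, "", 0);          /* hypothetical message code */

/* Generator side, in the message dispatcher: note the report so the
 * exit path skips writing its error down the dead socket. */
case MSG_SOCKERR:
    receiver_got_sock_error = 1;       /* hypothetical flag */
    break;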

However, one can imagine a situation where the receiver got a read
error on the socket and then died before being able to tell this to
the generator.  To handle this case, I am looking into a modification
of the idea that James was working on in his patches.
Comment 12 Wayne Davison 2006-10-15 13:33:07 UTC
The latest code now has the select() return exceptions, which should hopefully stop this from happening.
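
For reference, "return exceptions" means also registering the
descriptor in select()'s third fd_set, so an error condition on the
socket wakes the call up instead of it timing out forever.  A minimal
sketch of the idea, not the actual io.c change:

fd_set w_fds, e_fds;
struct timeval tv = { 60, 0 };

FD_ZERO(&w_fds);
FD_ZERO(&e_fds);
FD_SET(fd, &w_fds);
FD_SET(fd, &e_fds);     /* also watch for exceptional conditions */

int count = select(fd + 1, NULL, &w_fds, &e_fds, &tv);
if (count > 0 && FD_ISSET(fd, &e_fds)) {
    /* the socket is in trouble; attempt the write anyway so the
     * real error (e.g. ECONNRESET) surfaces and we can exit */
}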