Hello, all We are using rsync as file transfer tool to make run-time synchronization of two remote servers. So rsync is started each time a file is modified to make changes be propagated to the remote peer. Receiving side. The rsync is working in daemon mode (rsync --daemon) Sending site. There is a frequently modified file, which is sended by rsync to receiving side after each modification. Before sending the file our system performs check for existing rsync process, which are already sending the same file. If it is so, this rsync process is terminated by sending to it TERM signal, and next starting new rsync process. Problem. Sometimes on the receiving side appear child rsync processes which permanently hang in memory and never finish. I observed that these processes write to log file string "rsync: read error: Connection reset by peer" and then hang. After I removed logging from rsync (see log.c file) this problem is not appeared anymore. This problem is encountered only when rsync is in the daemon mode. When we are using rsync in the standalone mode all works fine. I assume that the cause of the problem is the usage of string formatting functions in the signal handlers in the rsync source code. Many of such formattings functions are using static buffers/variables and when invoked in signal handler may caused to inpredicable results. All testing was done on the SuSe 9.0 Linux x86 with rsync 2.6.6 release and with the latest CVS snapshot. :/var/log/trace/rsync> ps ax|grep rsync 18998 pts/8 S 0:36 strace -f rsync --daemon --no-detach 18999 pts/8 S 0:00 rsync --daemon --no-detach 19450 pts/8 S 0:00 rsync --daemon --no-detach 19945 pts/8 S 0:00 rsync --daemon --no-detach 20049 pts/8 S 0:00 rsync --daemon --no-detach 20077 pts/8 S 0:00 rsync --daemon --no-detach
Created attachment 1580 [details] log file of strace, log file of rsync and log.c file
I have found that adding "#define SHUTDOWN_ALL_SOCKETS" to config.h (at least on the client side) solves the problem. Without it, the connection is closed with an error (connection reset), rather than with an end-of-file. The server's receiving code handles the error, and it goes into _exit_cleanup and gets all the way to the exit() call, but exit() hangs because it tries to flush a socket that is in an error state. If the client properly closes its sockets, the problem does not occur. Of course, this is only a workaround. It does not handle the case where the connection is legitimately reset because of network errors. It just keeps rsync from foiling itself. A more complete solution in the receiving code is required.
OK, further research has shown that I was not quite right about the cause of the problem. In my tests, the network i/o process does die, and it's the "generator" process that hangs. I traced it all the way into the exit_cleanup routine, and it hangs when it calls log_exit().
(In reply to comment #3) > In my tests, the network i/o process does die, and it's the > "generator" process that hangs. I traced it all the way into the exit_cleanup > routine, and it hangs when it calls log_exit(). The generator process tries to write its error messages over the network socket, which has gone into an error state. However, it does not detect that because select() keeps saying that the descriptor is not available for writing, so it enters an infinite loop.
Created attachment 1594 [details] patch to prevent lingering processes This causes the writefd_unbuffered routine to check if the socket has died. However, it seems to break syslog when chrooted, so let's be careful.
Created attachment 1595 [details] different way of handling a broken socket Perhaps this is better. The generator process doesn't use the timeout given in the config file anyway. It is designed to block! The other patch may give up on the socket prematurely. This one tries to write, though it may block for a long time.
Created attachment 1604 [details] even better way of handling a dead socket This patch doesn't break the contract of writefd_unbuffered because it risks a blocking write only when no timeout has been specified and there is no io_error_fd to read. It avoids the messy attempts to detect a dead socket.
Thanks for the suggested patches. I agree that those continue statements after select() returns -1 look like they should be improved, but I'm not sure why you allowed the code in the read_timeout() function to fall through after an error: the bit-field values will be left undefined by the error, so I think we need more code to figure out what to do. Perhaps we should change the code to only continue if count is 0 or if errno is EINTR, EWOULDBLOCK, or EAGAIN (and exit with an error on all other errno values). In the writefd_unbuffered() function, what errno is being returned when your new code is being triggered?
(In reply to comment #8) > Thanks for the suggested patches. I agree that those continue statements after > select() returns -1 look like they should be improved, but I'm not sure why you > allowed the code in the read_timeout() function to fall through after an error: I didn't change read_timeout(). > In the writefd_unbuffered() function, what errno is being returned when your > new code is being triggered? Good point. When select() returns an error, we ought not to write. So, my code could be put right after check_timeout(). The bug is caused by the fact that select() doesn't return an error. It keeps returning 0. In a perfect world, select() would detect that the socket is in an error state, but it doesn't. It sees a full transmit buffer and says that the socket is not ready. How about this: if (count == 0) { check_timeout(); --- my patch --- } if (count <= 0) { --- leave this as is --- }
Created attachment 1605 [details] avoids select() call when possible Forget what I said in my last post. It doesn't fix the bug!! Perhaps this patch is clearer? It avoids the select() call entirely when it is safe to block.
Sorry for forgetting to get back to this bug for so long. I've finally analyzed the strace output, and it shows that the generator received an error about the socket closing from the receiver (who was obviously trying to read it), but when the generator tries to select on that socket to write out the error message, the select never returns that the dead socket is writeable. I have chosen to deal with this situation by adding a new sibling message: when the receiver is reporting a socket-read error, the generator will note this fact and avoid writing an error down the socket, allowing it to close down gracefully when the pipe from the receiver closes. I have checked this fix into CVS. However, one can imagine a situation where the generator got a read-error on the socket and then died before being able to tell this to the generator. To handle this case, I am looking into a modification of the idea that James was working on in his patches.
The latest code now has the select() return exceptions, which should hopefully stop this from happening.