Bug 2766 - rsync endless loop
Summary: rsync endless loop
Status: CLOSED INVALID
Alias: None
Product: rsync
Classification: Unclassified
Component: core (show other bugs)
Version: 2.6.5
Hardware: x86 Linux
: P3 normal (vote)
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-06-02 08:27 UTC by Cord Seele
Modified: 2006-03-12 02:56 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Cord Seele 2005-06-02 08:27:58 UTC
Hello,

for me rsync seems to get stuck in an endless loop (burning cpu cycles at 100%)
when syncing between two local directories. This hang only occurs once in a
while (every other week) while the same command succeeds many times daily. 

Here are the details I gathered so far:
backup # ps axf
  PID TTY      STAT   TIME COMMAND
[...]
24691 ?        S      0:00  |   |   \_ /bin/bash /root/bin/backup.external
/backup/admin/log/backup_2005-06-02-external.log
 2226 ?        R    532:18  |   |       \_ /root/bin/rsync -v --archive --delete
/backup/zserv/ daily.0/zserv
 2227 ?        S      0:00  |   |           \_ /root/bin/rsync -v --archive
--delete /backup/zserv/ daily.0/zserv

The log-file (from the -v output) is stuck right after
 building file list ... 

backup # netstat -anp|grep rsync
unix  3      [ ]         STREAM     CONNECTED     5569523 2227/rsync          
unix  3      [ ]         STREAM     CONNECTED     5569522 2226/rsync          
unix  3      [ ]         STREAM     CONNECTED     5569521 2226/rsync          
unix  3      [ ]         STREAM     CONNECTED     5569520 2227/rsync          

backup # strace -p 2226
Process 2226 attached - interrupt to quit
Process 2226 detached

backup # strace -p 2227
Process 2227 attached - interrupt to quit
select(1, [0], [], NULL, {40, 118000})  = 0 (Timeout)
select(1, [0], [], NULL, {60, 0} <unfinished ...>
Process 2227 detached

backup # rsync --version
rsync  version 2.6.5cvs  protocol version 29
Copyright (C) 1996-2005 by Andrew Tridgell and others
<http://rsync.samba.org/>
Capabilities: 64-bit files, socketpairs, hard links, symlinks, batchfiles, 
              inplace, IPv6, 64-bit system inums, 64-bit internal inums

rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
are welcome to redistribute it under certain conditions.  See the GNU
General Public Licence for details.
(pulled from CVS on 2005-5-9)

But I experienced this behaviour also on older rsync versions.

Any further information I should gather?

Thanks,

Cord
Comment 1 Paul Slootman 2005-06-02 08:47:13 UTC
It seems to be looping in userspace (no system calls, hence strace doesn't show
anything).  Could you try ltrace instead of strace, that should show what
library functions (if any) are being called which may help.
Comment 2 Cord Seele 2005-06-02 09:27:52 UTC
This is ltrace and strace again (until the segfault...?)
Hope this helps a bit further?

backup:~ # ltrace -p 2227 -S
--- SIGSTOP (Stopped (signal)) ---
--- SIGSTOP (Stopped (signal)) ---
SYS__newselect(1, 0xbfffc6d0, 0xbfffc650, 0, 0xbfffc648 <unfinished ...>
backup:~ # ltrace -S -p 22276
Cannot attach to pid 22276: No such process
backup:~ # ltrace -S -p 2226
--- SIGSTOP (Stopped (signal)) ---
--- SIGSTOP (Stopped (signal)) ---
--- SIGCHLD (Child exited) ---
waitpid(-1, 0xbfff763c, 1 <unfinished ...>
SYS_waitpid(-1, 0xbfff763c, 1, 1, 0x0809e728)                                  
                   = 0
<... waitpid resumed> )                                                        
                   = 0
SYS_sigreturn(0xb7fc8ff4, 0xbfff763c, 1, 1, 0x0809e728)                        
                   = 5
backup:~ # ltrace -S -p 2227
--- SIGSTOP (Stopped (signal)) ---
--- SIGSTOP (Stopped (signal)) ---
SYS__newselect(1, 0xbfffc6d0, 0xbfffc650, 0, 0xbfffc648 <unfinished ...>
backup:~ # strace -p 2227
Process 2227 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
select(1, [0], [], NULL, {21, 780000} <unfinished ...>
Process 2227 detached
backup:~ # strace -p 2226
Process 2226 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGCHLD (Child exited) @ 0 (0) ---
waitpid(-1, 0xbfff763c, WNOHANG)        = 0
sigreturn()                             = ? (mask now [])
Process 2226 detached
backup:~ # strace -p 2226
Process 2226 attached - interrupt to quit
Process 2226 detached
backup:~ # strace -p 2226
Process 2226 attached - interrupt to quit
Process 2226 detached
backup:~ # strace -p 2227
Process 2227 attached - interrupt to quit
select(1, [0], [], NULL, {0, 188000})   = 0 (Timeout)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 2227 detached
backup:~ # strace -p 2227
attach: ptrace(PTRACE_ATTACH, ...): No such process
backup:~ # strace -p 2226
attach: ptrace(PTRACE_ATTACH, ...): No such process
Comment 3 Cord Seele 2005-06-08 06:23:03 UTC
Hello again,

here's another 'wild' rsync. Same symtoms as before. What's wondering me is,
that after ltrace, there's something to see with strace - until both processes
segfault. (The final strace commands were run in paralell)

backup:~ # ps axf
  PID TTY      STAT   TIME COMMAND
 7416 ?        S      0:00  |   |   \_ /bin/bash /root/bin/backup.external
/backup/admin/log/backup_2005-06-07-external.log
 9065 ?        R    581:58  |   |       \_ /root/bin/rsync -v --archive --delete
/backup/pserv/ daily.0/pserv
 9066 ?        S      0:00  |   |           \_ /root/bin/rsync -v --archive
--delete /backup/pserv/ daily.0/pserv

backup:~ # netstat -anp|grep rsync
unix  3      [ ]         STREAM     CONNECTED     2344808 9066/rsync          
unix  3      [ ]         STREAM     CONNECTED     2344807 9065/rsync          
unix  3      [ ]         STREAM     CONNECTED     2344806 9065/rsync          
unix  3      [ ]         STREAM     CONNECTED     2344805 9066/rsync          

backup:~ # strace -p 9065
Process 9065 attached - interrupt to quit
Process 9065 detached
backup:~ # strace -p 9066
Process 9066 attached - interrupt to quit
select(1, [0], [], NULL, {44, 568000} <unfinished ...>
Process 9066 detached

backup:~ # ltrace -p 9065 -S
backup:~ # ltrace -p 9066 -S
SYS__newselect(1, 0xbfffc700, 0xbfffc680, 0, 0xbfffc678)                       
                   = 0
__errno_location()                                                             
                   = 0xb7eb6600
select(1, 0xbfffc700, 0xbfffc680, 0, 0xbfffc678 <unfinished ...>
SYS__newselect(1, 0xbfffc700, 0xbfffc680, 0, 0xbfffc678 <unfinished ...>

backup:~ # strace -p 9065
Process 9065 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGCHLD (Child exited) @ 0 (0) ---
waitpid(-1, 0xbffee24c, WNOHANG)        = 0
sigreturn()                             = ? (mask now [])
--- SIGCHLD (Child exited) @ 0 (0) ---
waitpid(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], WNOHANG) = 9066
waitpid(-1, 0xbffee25c, WNOHANG)        = -1 ECHILD (No child processes)
sigreturn()                             = ? (mask now [])
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 9065 detached

backup:~ # strace -p 9066
Process 9066 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
select(1, [0], [], NULL, {38, 186000})  = 0 (Timeout)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 9066 detached
Comment 4 Wayne Davison 2005-06-08 09:07:12 UTC
The best thing you can do is to run a version with debug symbols intact: rsync
builds with debug symbols by default, but strips them from the installed
version, so either run a non-stripped binary from an rsync build dir, or make
sure that the stripped version that is running is based on the non-stripped
version (so that it differs only in the stripped symbols).  Then, when the
processes hang, attach gdb to each process using a non-stripped rsync binary as
the first arg and the process number as the second arg.  You'll then be able to
run "bt" (backtrace) and see exactly where rsync is in the code.  You can look
around, printing out any variables via the "p" command, etc.
Comment 5 Cord Seele 2005-07-25 09:55:45 UTC
We had to set up the machine again and now I don't see this error again. Don't
know why :-( Sorry to have bothered you - I set the bug to INVALID. Thanks for
your time!

Cord