Hello, for me rsync seems to get stuck in an endless loop (burning cpu cycles at 100%) when syncing between two local directories. This hang only occurs once in a while (every other week) while the same command succeeds many times daily. Here are the details I gathered so far: backup # ps axf PID TTY STAT TIME COMMAND [...] 24691 ? S 0:00 | | \_ /bin/bash /root/bin/backup.external /backup/admin/log/backup_2005-06-02-external.log 2226 ? R 532:18 | | \_ /root/bin/rsync -v --archive --delete /backup/zserv/ daily.0/zserv 2227 ? S 0:00 | | \_ /root/bin/rsync -v --archive --delete /backup/zserv/ daily.0/zserv The log-file (from the -v output) is stuck right after building file list ... backup # netstat -anp|grep rsync unix 3 [ ] STREAM CONNECTED 5569523 2227/rsync unix 3 [ ] STREAM CONNECTED 5569522 2226/rsync unix 3 [ ] STREAM CONNECTED 5569521 2226/rsync unix 3 [ ] STREAM CONNECTED 5569520 2227/rsync backup # strace -p 2226 Process 2226 attached - interrupt to quit Process 2226 detached backup # strace -p 2227 Process 2227 attached - interrupt to quit select(1, [0], [], NULL, {40, 118000}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0} <unfinished ...> Process 2227 detached backup # rsync --version rsync version 2.6.5cvs protocol version 29 Copyright (C) 1996-2005 by Andrew Tridgell and others <http://rsync.samba.org/> Capabilities: 64-bit files, socketpairs, hard links, symlinks, batchfiles, inplace, IPv6, 64-bit system inums, 64-bit internal inums rsync comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions. See the GNU General Public Licence for details. (pulled from CVS on 2005-5-9) But I experienced this behaviour also on older rsync versions. Any further information I should gather? Thanks, Cord
It seems to be looping in userspace (no system calls, hence strace doesn't show anything). Could you try ltrace instead of strace, that should show what library functions (if any) are being called which may help.
This is ltrace and strace again (until the segfault...?) Hope this helps a bit further? backup:~ # ltrace -p 2227 -S --- SIGSTOP (Stopped (signal)) --- --- SIGSTOP (Stopped (signal)) --- SYS__newselect(1, 0xbfffc6d0, 0xbfffc650, 0, 0xbfffc648 <unfinished ...> backup:~ # ltrace -S -p 22276 Cannot attach to pid 22276: No such process backup:~ # ltrace -S -p 2226 --- SIGSTOP (Stopped (signal)) --- --- SIGSTOP (Stopped (signal)) --- --- SIGCHLD (Child exited) --- waitpid(-1, 0xbfff763c, 1 <unfinished ...> SYS_waitpid(-1, 0xbfff763c, 1, 1, 0x0809e728) = 0 <... waitpid resumed> ) = 0 SYS_sigreturn(0xb7fc8ff4, 0xbfff763c, 1, 1, 0x0809e728) = 5 backup:~ # ltrace -S -p 2227 --- SIGSTOP (Stopped (signal)) --- --- SIGSTOP (Stopped (signal)) --- SYS__newselect(1, 0xbfffc6d0, 0xbfffc650, 0, 0xbfffc648 <unfinished ...> backup:~ # strace -p 2227 Process 2227 attached - interrupt to quit --- SIGSTOP (Stopped (signal)) @ 0 (0) --- --- SIGSTOP (Stopped (signal)) @ 0 (0) --- select(1, [0], [], NULL, {21, 780000} <unfinished ...> Process 2227 detached backup:~ # strace -p 2226 Process 2226 attached - interrupt to quit --- SIGSTOP (Stopped (signal)) @ 0 (0) --- --- SIGSTOP (Stopped (signal)) @ 0 (0) --- --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(-1, 0xbfff763c, WNOHANG) = 0 sigreturn() = ? (mask now []) Process 2226 detached backup:~ # strace -p 2226 Process 2226 attached - interrupt to quit Process 2226 detached backup:~ # strace -p 2226 Process 2226 attached - interrupt to quit Process 2226 detached backup:~ # strace -p 2227 Process 2227 attached - interrupt to quit select(1, [0], [], NULL, {0, 188000}) = 0 (Timeout) --- SIGSEGV (Segmentation fault) @ 0 (0) --- Process 2227 detached backup:~ # strace -p 2227 attach: ptrace(PTRACE_ATTACH, ...): No such process backup:~ # strace -p 2226 attach: ptrace(PTRACE_ATTACH, ...): No such process
Hello again, here's another 'wild' rsync. Same symtoms as before. What's wondering me is, that after ltrace, there's something to see with strace - until both processes segfault. (The final strace commands were run in paralell) backup:~ # ps axf PID TTY STAT TIME COMMAND 7416 ? S 0:00 | | \_ /bin/bash /root/bin/backup.external /backup/admin/log/backup_2005-06-07-external.log 9065 ? R 581:58 | | \_ /root/bin/rsync -v --archive --delete /backup/pserv/ daily.0/pserv 9066 ? S 0:00 | | \_ /root/bin/rsync -v --archive --delete /backup/pserv/ daily.0/pserv backup:~ # netstat -anp|grep rsync unix 3 [ ] STREAM CONNECTED 2344808 9066/rsync unix 3 [ ] STREAM CONNECTED 2344807 9065/rsync unix 3 [ ] STREAM CONNECTED 2344806 9065/rsync unix 3 [ ] STREAM CONNECTED 2344805 9066/rsync backup:~ # strace -p 9065 Process 9065 attached - interrupt to quit Process 9065 detached backup:~ # strace -p 9066 Process 9066 attached - interrupt to quit select(1, [0], [], NULL, {44, 568000} <unfinished ...> Process 9066 detached backup:~ # ltrace -p 9065 -S backup:~ # ltrace -p 9066 -S SYS__newselect(1, 0xbfffc700, 0xbfffc680, 0, 0xbfffc678) = 0 __errno_location() = 0xb7eb6600 select(1, 0xbfffc700, 0xbfffc680, 0, 0xbfffc678 <unfinished ...> SYS__newselect(1, 0xbfffc700, 0xbfffc680, 0, 0xbfffc678 <unfinished ...> backup:~ # strace -p 9065 Process 9065 attached - interrupt to quit --- SIGSTOP (Stopped (signal)) @ 0 (0) --- --- SIGSTOP (Stopped (signal)) @ 0 (0) --- --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(-1, 0xbffee24c, WNOHANG) = 0 sigreturn() = ? (mask now []) --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], WNOHANG) = 9066 waitpid(-1, 0xbffee25c, WNOHANG) = -1 ECHILD (No child processes) sigreturn() = ? (mask now []) --- SIGSEGV (Segmentation fault) @ 0 (0) --- Process 9065 detached backup:~ # strace -p 9066 Process 9066 attached - interrupt to quit --- SIGSTOP (Stopped (signal)) @ 0 (0) --- --- SIGSTOP (Stopped (signal)) @ 0 (0) --- select(1, [0], [], NULL, {38, 186000}) = 0 (Timeout) --- SIGSEGV (Segmentation fault) @ 0 (0) --- Process 9066 detached
The best thing you can do is to run a version with debug symbols intact: rsync builds with debug symbols by default, but strips them from the installed version, so either run a non-stripped binary from an rsync build dir, or make sure that the stripped version that is running is based on the non-stripped version (so that it differs only in the stripped symbols). Then, when the processes hang, attach gdb to each process using a non-stripped rsync binary as the first arg and the process number as the second arg. You'll then be able to run "bt" (backtrace) and see exactly where rsync is in the code. You can look around, printing out any variables via the "p" command, etc.
We had to set up the machine again and now I don't see this error again. Don't know why :-( Sorry to have bothered you - I set the bug to INVALID. Thanks for your time! Cord