I'm syncing a rather big list of small files using rsync. My client is rsync 3.0.7-1ubuntu1 running on Ubuntu Lucid, and the server is a backport for rsync 3.0.0-1 for Debian Etch (although I had the same problem with rsync 2.6.9 as server, so it seems to be a problem on the client side). Some time in the middle of the sync, the client freezes. When I kill it (using ctrl+C or kill), the client returns with an error "undefined error" but the process keeps running in state D (as seen in ps), so it cannot be killed. The servers returns, too, and is not seen anymore in the process list. Here are the last lines of the strace tracks on both the client and the server: client: ================= lstat("archives/private/sun-exploit-build/2009-August/date.html", 0x7fff4c604ec0) = -1 ENOENT (No such file or directory) lstat("archives/private/sun-exploit-build/2009-August/index.html", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0 lstat("archives/private/sun-exploit-build/2009-August/subject.html", 0x7fff4c604ec0) = -1 ENOENT (No such file or directory) lstat("archives/private/sun-exploit-build/2009-August/thread.html", 0x7fff4c604ec0) = -1 ENOENT (No such file or directory) select(5, [3], [4], [4], {60, 0}) = 1 (out [4], left {59, 999997}) write(4, "\37\3\0\7\0\1\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\240\0\0\0\0\0"..., 803) = 803 gettimeofday({1274270586, 245243}, NULL) = 0 select(4, [3], [], NULL, {60, 0}) = 0 (Timeout) select(4, [3], [], NULL, {60, 0}) = 0 (Timeout) select(4, [3], [], NULL, {60, 0}) = 0 (Timeout) select(4, [3], [], NULL, {60, 0}) = 0 (Timeout) select(4, [3], [], NULL, {60, 0}) = 0 (Timeout) select(4, [3], [], NULL, {60, 0}) = 0 (Timeout) select(4, [3], [], NULL, {60, 0}) = 0 (Timeout) select(4, [3], [], NULL, {60, 0}) = 0 (Timeout) select(4, [3], [], NULL, {60, 0}) = 0 (Timeout) select(4, [3], [], NULL, {60, 0}^C <unfinished ...> ================= server: ================= select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\374\17\0\7", 4) = 4 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\"> (Priorité supervision:"..., 4092) = 4092 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\374\17\0\7", 4) = 4 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "AHx8fA6H3JUKXLFmiAGzelQDAw527\nAl"..., 4092) = 4092 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\374\17\0\7", 4) = 4 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "itOSosZqrN5to9mRnA\nOGBL3x59BrQUd"..., 4092) = 4092 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\374\17\0\7", 4) = 4 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "eric@lists.sourceforge.x-echo.co"..., 4092) = 4092 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\374\17\0\7", 4) = 4 select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout) select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout) select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout) select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout) select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout) select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout) select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout) select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout) select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {51, 216000}) write(1, ".131.30])\n\tby gwprod01m1.infra.m"..., 4092) = -1 EPIPE (Broken pipe) --- SIGPIPE (Broken pipe) @ 0 (0) --- write(2, "rsync: writefd_unbuffered failed"..., 79) = -1 EPIPE (Broken pipe) --- SIGPIPE (Broken pipe) @ 0 (0) --- rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0 rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0 write(2, "rsync error: errors with program"..., 83) = -1 EPIPE (Broken pipe) --- SIGPIPE (Broken pipe) @ 0 (0) --- rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0 rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0 gettimeofday({1274271146, 160025}, NULL) = 0 select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) gettimeofday({1274271146, 259470}, NULL) = 0 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) gettimeofday({1274271146, 263457}, NULL) = 0 exit_group(13) = ? Process 20281 detached ================= I can't find where this problem comes from, but I get it every time I relaunch rsync.
A non-killable process typically means that it is stuck in the kernel, and can't be killed until it comes out again. That is also likely to be the cause of the hang -- it is trying to perform an operation that is hanging in the kernel. You should check the system calls that rsync is doing, perhaps by running it under an "strace" or "truss" type of program. That can help you to find out what is triggering the problem.