Bug 7438 - rsync timeouts and stays in state D
rsync timeouts and stays in state D
Status: RESOLVED WORKSFORME
Product: rsync
Classification: Unclassified
Component: core
3.0.7
Other Linux
: P3 normal
: ---
Assigned To: Wayne Davison
Rsync QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-05-19 07:31 UTC by Raphaël Pinson
Modified: 2010-06-19 11:54 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Raphaël Pinson 2010-05-19 07:31:45 UTC
I'm syncing a rather big list of small files using rsync. My client is rsync 3.0.7-1ubuntu1 running on Ubuntu Lucid, and the server is a backport for rsync 3.0.0-1 for Debian Etch (although I had the same problem with rsync 2.6.9 as server, so it seems to be a problem on the client side).

Some time in the middle of the sync, the client freezes. When I kill it (using ctrl+C or kill), the client returns with an error "undefined error" but the process keeps running in state D (as seen in ps), so it cannot be killed. The servers returns, too, and is not seen anymore in the process list.

Here are the last lines of the strace tracks on both the client and the server:

client:
=================
lstat("archives/private/sun-exploit-build/2009-August/date.html", 0x7fff4c604ec0) = -1 ENOENT (No such file or directory)
lstat("archives/private/sun-exploit-build/2009-August/index.html", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0
lstat("archives/private/sun-exploit-build/2009-August/subject.html", 0x7fff4c604ec0) = -1 ENOENT (No such file or directory)
lstat("archives/private/sun-exploit-build/2009-August/thread.html", 0x7fff4c604ec0) = -1 ENOENT (No such file or directory)
select(5, [3], [4], [4], {60, 0}) = 1 (out [4], left {59, 999997})
write(4, "\37\3\0\7\0\1\0\240\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\240\0\0\0\0\0"..., 803) = 803
gettimeofday({1274270586, 245243}, NULL) = 0
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
select(4, [3], [], NULL, {60, 0}^C <unfinished ...>
=================

server:
=================
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\"> (Priorit&eacute; supervision:"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "AHx8fA6H3JUKXLFmiAGzelQDAw527\nAl"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "itOSosZqrN5to9mRnA\nOGBL3x59BrQUd"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "eric@lists.sourceforge.x-echo.co"..., 4092) = 4092
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
write(1, "\374\17\0\7", 4) = 4
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 0 (Timeout)
select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {51, 216000})
write(1, ".131.30])\n\tby gwprod01m1.infra.m"..., 4092) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
write(2, "rsync: writefd_unbuffered failed"..., 79) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
write(2, "rsync error: errors with program"..., 83) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
gettimeofday({1274271146, 160025}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
gettimeofday({1274271146, 259470}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
gettimeofday({1274271146, 263457}, NULL) = 0
exit_group(13) = ?
Process 20281 detached
=================

I can't find where this problem comes from, but I get it every time I relaunch rsync.
Comment 1 Wayne Davison 2010-06-19 11:54:06 UTC
A non-killable process typically means that it is stuck in the kernel, and can't be killed until it comes out again.  That is also likely to be the cause of the hang -- it is trying to perform an operation that is hanging in the kernel.

You should check the system calls that rsync is doing, perhaps by running it under an "strace" or "truss" type of program.  That can help you to find out what is triggering the problem.