Bug 5701 - deadlock on local rsyncing, bisected to commit f303b749f2843433c9acd8218a4b9096d0d1bb8d
Summary: deadlock on local rsyncing, bisected to commit f303b749f2843433c9acd8218a4b90...
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core (show other bugs)
Version: 3.1.0
Hardware: All Linux
: P3 major (vote)
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-08-18 18:08 UTC by Sami Farin
Modified: 2008-08-31 12:24 UTC (History)
0 users

See Also:


Attachments
Test patch (463 bytes, patch)
2008-08-24 12:32 UTC, Wayne Davison
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Sami Farin 2008-08-18 18:08:35 UTC
$ rsync -v --delete --progress -a --exclude-from=/a/file some directories here /mnt/backup/some/
sending incremental file list

There it stays.

With older (working) rsync I get:
sent 899850 bytes  received 260 bytes  1800220.00 bytes/sec
total size is 9547899853  speedup is 10607.48


f303b749f2843433c9acd8218a4b9096d0d1bb8d is first bad commit
commit f303b749f2843433c9acd8218a4b9096d0d1bb8d
Author: Wayne Davison <wayned@samba.org>
Date:   Sat Aug 2 09:14:36 2008 -0700

    Added logic to the receiving side to ensure that the --delete-during
    code will not delete in a directory prior to receiving an I/O error
    for that directory (or not receiving it, as the case may be).



$ gstack 22366
#0  0x0000000000a24ad3 in __select_nocancel () from /lib64/libc.so.6
#1  0x00007fa36509b660 in read_timeout () from /proc/22366/exe
#2  0x00007fa36509bb86 in read_loop () from /proc/22366/exe
#3  0x00007fa36509a241 in readfd_unbuffered () from /proc/22366/exe
#4  0x00007fa36509bd33 in readfd () from /proc/22366/exe
#5  0x00007fa36509c956 in read_ndx () from /proc/22366/exe
#6  0x00007fa36507cc20 in read_ndx_and_attrs () from /proc/22366/exe
#7  0x00007fa365085ede in send_files () from /proc/22366/exe
#8  0x00007fa36508e738 in client_run () from /proc/22366/exe
#9  0x00007fa36508f58c in main () from /proc/22366/exe

$ gstack 22367
#0  0x0000000000a24ad3 in __select_nocancel () from /lib64/libc.so.6
#1  0x00007fa36509b660 in read_timeout () from /proc/22367/exe
#2  0x00007fa36509a40a in readfd_unbuffered () from /proc/22367/exe
#3  0x00007fa36509bd33 in readfd () from /proc/22367/exe
#4  0x00007fa36509c0f6 in read_msg_fd () from /proc/22367/exe
#5  0x00007fa3650834d5 in generate_files () from /proc/22367/exe
#6  0x00007fa36508e29c in do_recv () from /proc/22367/exe
#7  0x00007fa36508ed82 in start_server () from /proc/22367/exe
#8  0x00007fa36508fc55 in child_main () from /proc/22367/exe
#9  0x00007fa3650a943e in local_child () from /proc/22367/exe
#10 0x00007fa36508cb45 in do_cmd () from /proc/22367/exe
#11 0x00007fa36508f53f in main () from /proc/22367/exe

$ gstack 22368
#0  0x0000000000a24ad3 in __select_nocancel () from /lib64/libc.so.6
#1  0x00007fa36509b660 in read_timeout () from /proc/22368/exe
#2  0x00007fa36509bb86 in read_loop () from /proc/22368/exe
#3  0x00007fa36509a241 in readfd_unbuffered () from /proc/22368/exe
#4  0x00007fa36509bd33 in readfd () from /proc/22368/exe
#5  0x00007fa36509c956 in read_ndx () from /proc/22368/exe
#6  0x00007fa36507cc20 in read_ndx_and_attrs () from /proc/22368/exe
#7  0x00007fa3650845ca in recv_files () from /proc/22368/exe
#8  0x00007fa36508e17c in do_recv () from /proc/22368/exe
#9  0x00007fa36508ed82 in start_server () from /proc/22368/exe
#10 0x00007fa36508fc55 in child_main () from /proc/22368/exe
#11 0x00007fa3650a943e in local_child () from /proc/22368/exe
#12 0x00007fa36508cb45 in do_cmd () from /proc/22368/exe
#13 0x00007fa36508f53f in main () from /proc/22368/exe
Comment 1 Wayne Davison 2008-08-18 19:36:30 UTC
Thanks for the report.  I'll see about reproducing this.
Comment 2 Wayne Davison 2008-08-24 12:32:29 UTC
Created attachment 3504 [details]
Test patch

I haven't been able to get rsync to hang for me yet.  If you're still seeing the problem, does this patch fix it?
Comment 3 Sami Farin 2008-08-24 12:50:41 UTC
Unfortunately the patch did not help...

(gdb) bt
#0  0x0000003000cdcad3 in __select_nocancel () from /lib64/libc.so.6
#1  0x000000000043620b in read_timeout (fd=5, buf=0x7fffd49a38d0 "\001", len=4) at io.c:692
#2  0x000000000043710a in read_loop (fd=5, buf=0x7fffd49a38d0 "\001", len=4) at io.c:1025
#3  0x000000000043728e in readfd_unbuffered (fd=5, buf=0x7fffd49a4f50 "", len=1) at io.c:1062
#4  0x0000000000437a46 in readfd (fd=5, buffer=0x7fffd49a4f50 "", N=1) at io.c:1191
#5  0x0000000000439459 in read_ndx (f=5) at io.c:1797
#6  0x0000000000411507 in read_ndx_and_attrs (f_in=5, iflag_ptr=0x7fffd49a5154, 
    type_ptr=0x7fffd49a515f "", buf=0x7fffd49a5160 "\200��, len_ptr=0x7fffd49a5150)
    at rsync.c:255
#7  0x000000000041d02a in send_files (f_in=5, f_out=4) at sender.c:185
#8  0x000000000042714b in client_run (f_in=5, f_out=4, pid=12999, argc=5, argv=0x1ab1240)
    at main.c:1053
#9  0x0000000000427c85 in start_client (argc=5, argv=0x1ab1240) at main.c:1301
#10 0x0000000000428347 in main (argc=6, argv=0x1ab1240) at main.c:1528

(gdb) bt
#0  0x0000003000cdcad3 in __select_nocancel () from /lib64/libc.so.6
#1  0x000000000043620b in read_timeout (fd=3, buf=0x1ab35c0 "", len=8184) at io.c:692
#2  0x00000000004371d5 in readfd_unbuffered (fd=3, buf=0x7fffd49a2790 "H", len=4) at io.c:1048
#3  0x0000000000437a46 in readfd (fd=3, buffer=0x7fffd49a2790 "H", N=4) at io.c:1191
#4  0x0000000000435235 in read_msg_fd () at io.c:375
#5  0x0000000000435e92 in wait_for_receiver () at io.c:578
#6  0x0000000000419768 in generate_files (f_out=1, local_name=0x0) at generator.c:2242
#7  0x0000000000426a32 in do_recv (f_in=0, f_out=1, local_name=0x0) at main.c:846
#8  0x0000000000426e7e in do_server_recv (f_in=0, f_out=1, argc=1, argv=0x7fffd49a5258)
    at main.c:960
#9  0x0000000000426f69 in start_server (f_in=0, f_out=1, argc=2, argv=0x7fffd49a5250)
    at main.c:990
#10 0x0000000000426eba in child_main (argc=2, argv=0x7fffd49a5250) at main.c:967
#11 0x0000000000448ac3 in local_child (argc=2, argv=0x7fffd49a5250, f_in=0x7fffd49a7288, 
    f_out=0x7fffd49a7284, child_main=0x426e95 <child_main>) at pipe.c:169
#12 0x0000000000425b3b in do_cmd (cmd=0x0, machine=0x0, user=0x0, remote_argv=0x1ab1270, 
    remote_argc=0, f_in_p=0x7fffd49a7288, f_out_p=0x7fffd49a7284) at main.c:472
#13 0x0000000000427c30 in start_client (argc=5, argv=0x1ab1240) at main.c:1289
#14 0x0000000000428347 in main (argc=6, argv=0x1ab1240) at main.c:1528

(gdb) bt
#0  0x0000003000cdcad3 in __select_nocancel () from /lib64/libc.so.6
#1  0x000000000043620b in read_timeout (fd=0, buf=0x7fffd499e6d0 "\001", len=4) at io.c:692
#2  0x000000000043710a in read_loop (fd=0, buf=0x7fffd499e6d0 "\001", len=4) at io.c:1025
#3  0x000000000043728e in readfd_unbuffered (fd=0, buf=0x7fffd499fd50 "", len=1) at io.c:1062
#4  0x0000000000437a46 in readfd (fd=0, buffer=0x7fffd499fd50 "", N=1) at io.c:1191
#5  0x0000000000439459 in read_ndx (f=0) at io.c:1797
#6  0x0000000000411507 in read_ndx_and_attrs (f_in=0, iflag_ptr=0x7fffd49a003c, 
    type_ptr=0x7fffd49a004f "", buf=0x7fffd49a2050 "", len_ptr=0x7fffd49a0038) at rsync.c:255
#7  0x000000000041aaab in recv_files (f_in=0, local_name=0x0) at receiver.c:416
#8  0x000000000042684e in do_recv (f_in=0, f_out=1, local_name=0x0) at main.c:785
#9  0x0000000000426e7e in do_server_recv (f_in=0, f_out=1, argc=1, argv=0x7fffd49a5258)
    at main.c:960
#10 0x0000000000426f69 in start_server (f_in=0, f_out=1, argc=2, argv=0x7fffd49a5250)
    at main.c:990
#11 0x0000000000426eba in child_main (argc=2, argv=0x7fffd49a5250) at main.c:967
#12 0x0000000000448ac3 in local_child (argc=2, argv=0x7fffd49a5250, f_in=0x7fffd49a7288, 
    f_out=0x7fffd49a7284, child_main=0x426e95 <child_main>) at pipe.c:169
#13 0x0000000000425b3b in do_cmd (cmd=0x0, machine=0x0, user=0x0, remote_argv=0x1ab1270, 
    remote_argc=0, f_in_p=0x7fffd49a7288, f_out_p=0x7fffd49a7284) at main.c:472
#14 0x0000000000427c30 in start_client (argc=5, argv=0x1ab1240) at main.c:1289
#15 0x0000000000428347 in main (argc=6, argv=0x1ab1240) at main.c:1528
Comment 4 Sami Farin 2008-08-24 13:13:53 UTC
With "--info 2 --debug 2", it does not hang, but spits out some errors, and exits abruptly:
(I gave five dirs as parameters -- let's call them dir0, dir1, dir2, dir3, dir4)

...
[sender] make_file(dir0/filexyz,*,2)
rsync: writefd_unbuffered failed to write 4 bytes [sender]: Broken pipe (32)
server_recv(2) starting pid=19852
uid 500(safari) maps to 500
process has 5 gids:  37 500 509 546 598
gid 500(safari) maps to 500
recv_file_name(dir0)
recv_file_name(dir2)
gid 0() maps to 0
recv_file_name(dir3)
recv_file_name(dir4)
received 4 names
recv_file_list done
get_local_name count=4 /dest/dir/
[receiver] change_dir(/dest/dir)
generator starting pid=19852
delta-transmission disabled for local transfer or --whole-file
recv_generator(dir0,1)
recv_generator(dir4,2)
recv_files(4) starting
recv_generator(dir3,3)
recv_generator(dir2,4)
[receiver] receiving flist for dir 0
rsync: connection unexpectedly closed (9 bytes received so far) [sender]
_exit_cleanup(code=12, file=io.c, line=639): entered
rsync error: error in rsync protocol data stream (code 12) at io.c(639) [sender=3.1.0dev]
_exit_cleanup(code=12, file=io.c, line=639): about to call exit(12)
sending incremental file list

Comment 5 Wayne Davison 2008-08-24 16:15:14 UTC
The abnormal exit was a logging issue in 3.1.0 that I just checked-in a fix for (see latest git and "nightly" tar file).

I have checked in some flushing changes that I hope will take care of the issue.  If not, I can code up a work-around to avoid the hang.
Comment 6 Wayne Davison 2008-08-24 16:21:33 UTC
> I have checked in some flushing changes that I hope will take care of
> the issue.

It was probably not clear that the issue mentioned in the second paragraph was referring to the original issue of the hang.
Comment 7 Sami Farin 2008-08-24 16:41:05 UTC
git 1d891835e7933f2c82bf636b0f27a936dc432e5c still hangs without --info or --debug,
but with "--info 2 --debug 2" it finishes OK without errors.
Comment 8 Wayne Davison 2008-08-31 12:24:02 UTC
This is now fixed in the latest 3.1.0dev version in the git repository and "nightly" tar file.

I also removed the code that could hang from the 3.0.x branch, but the fix for the I/O error notification issue (that the hanging code was targeting) will now be handled only in the 3.1.0 release.