Bug 8265 - Various corruption of devices and sockets
Summary: Various corruption of devices and sockets
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core (show other bugs)
Version: 3.0.8
Hardware: All FreeBSD
: P3 major (vote)
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-06-25 04:25 UTC by grarpamp
Modified: 2020-07-26 07:14 UTC (History)
2 users (show)

See Also:


Attachments
Add debug output for the rdev major corruption (3.17 KB, patch)
2011-06-25 22:18 UTC, Matt McCutchen
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description grarpamp 2011-06-25 04:25:32 UTC
Hi. I sent two mails to the list on this date
or the day before entitled...
 Regression: sockets for 3.0.9pre1 ?

Rather than include them here, please refer
to the list.

Due to the nature of this bug seeming to be
doing odd things with lengths, devices, etc,
I've marked it as a P1 blocker.
Feel free to adjust as appropriate. Thanks :)
Comment 1 Wayne Davison 2011-06-25 05:42:52 UTC
There's no reason rsync would be doing anything differently from prior versions, so check what's wrong with your system.
Comment 2 grarpamp 2011-06-25 06:26:39 UTC
Hi :) I replied on list since everything ended
up over there first. If you prefer here, I can
copy it all in. Thanks.
Comment 3 Matt McCutchen 2011-06-25 14:44:01 UTC
https://lists.samba.org/archive/rsync/2011-June/026506.html

If you want to help solve this, you could do some experiments with a single socket or special file, find the shortest path length at which the problems start to occur, and use a system call tracer to see what is going wrong.
Comment 4 grarpamp 2011-06-25 17:55:55 UTC
> https://lists.samba.org/archive/rsync/2011-June/026506.html
> 
> If you want to help solve this, you could do some experiments with a single
> socket or special file, find the shortest path length at which the problems
> start to occur, and use a system call tracer to see what is going wrong.

Hi :) That was the first note. I've done more testing since then.
In the above spirit of listing email links, here are the rest
of them to date... the last one includes the testing you speak of.

I'm happy to do more checks, just tell me what you want me to dig up.

https://lists.samba.org/archive/rsync/2011-June/026506.html
https://lists.samba.org/archive/rsync/2011-June/026507.html
https://lists.samba.org/archive/rsync/2011-June/026510.html
https://lists.samba.org/archive/rsync/2011-June/026511.html
https://lists.samba.org/archive/rsync/2011-June/026513.html
https://lists.samba.org/archive/rsync/2011-June/026522.html
Comment 5 grarpamp 2011-06-25 18:15:47 UTC
Noting exact version...
3.0.8 release tarball, plus this compilation patch to flist.c:
http://gitweb.samba.org/?p=rsync.git;a=commitdiff;h=8686d3abba4c5b4baf443871cd19c673056a3392
Comment 6 Matt McCutchen 2011-06-25 22:18:05 UTC
Created attachment 6626 [details]
Add debug output for the rdev major corruption

Let's keep the rest of the discussion on this bug.

I saw all the messages on the list, but you tried so many different cases that I overlooked the one with the short path where the rdev major number gets corrupted from 9 to 78.  Since that case seems relatively simple, let's look at it first.  Please apply the attached patch and run the case with -vvv to enable the existing mknod debug output so we can try to see where in the process the corruption occurs.  Also see if --no-i-r makes the problem go away.
Comment 7 grarpamp 2011-06-26 02:17:04 UTC
Hi :) Patch 6397 and 6626, configure, compiles, and exec's ok.
Here's the build log...

rsync.c:378: warning: unused parameter `fnamecmp'
generator.c:216: warning: `desc' might be used uninitialized in this function
generator.c:622: warning: unused parameter `fname'
generator.c:665: warning: unused parameter `fnamecmp'
generator.c:1112: warning: `type' might be used uninitialized in this function
sender.c:143: warning: unused parameter `fname'
sender.c:143: warning: unused parameter `file'
io.c:470: warning: unused parameter `convert'


I remade the entire real /dev to make sure nothing crept in over
the years, rebooted, still broken. Also ran a few tar's of / | sha1
to make sure that all the kernel stuff is reliably readable, it is.


Not sure if the patch is ok, rsync stalls in some different cases
and needs ctrl-c. Search stall below.

Empty is an empty dir, full is a freshly MAKEDEV'd dir.


### With patch 6397 and 6626, both cases stall.
# Additionally, in both cases, --no-i-r stalls.
# DEBUG is without --no-i-r.
# rsync -nHaxi --delete /dev/ ./empty/
...
[sender] i=4302 /dev zs0 mode=020600 len=0 uid=0 gid=0 flags=0
[sender] i=4303 /dev fd/ mode=040555 len=1024 uid=0 gid=0 flags=4
send_file_list done
file list sent

^C[sender] _exit_cleanup(code=20, file=rsync.c, line=549): entered
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(549) [sender=3.0.8]
[sender] _exit_cleanup(code=20, file=rsync.c, line=549): about to call exit(20)
rsync: writefd_unbuffered failed to write 96 bytes to socket [Receiver]: Broken pipe (32)

DEBUG 9: make_file(./fd1.1440): major(tmp_rdev) = 9
DEBUG 1: send_file_entry(fd1.1440): major(rdev) = 9
DEBUG 3: send_file_entry(fd1.1440): XMIT_SAME_RDEV_MAJOR
DEBUG 9: make_file(./rfd1.1440): major(tmp_rdev) = 9
DEBUG 1: send_file_entry(rfd1.1440): major(rdev) = 9
DEBUG 6: recv_file_entry(fd1.1440): XMIT_SAME_RDEV_MAJOR, rdev_major was 3
DEBUG 7: recv_file_entry(fd1.1440): after assembly, major(rdev) = 3
DEBUG 8: recv_file_entry(fd1.1440): DEV_MAJOR(F_RDEV_P(file)) = 3
DEBUG 4: recv_file_entry(rfd1.1440): took major(rdev) = 3 from previous file fd1.1440
DEBUG 8: recv_file_entry(rfd1.1440): DEV_MAJOR(F_RDEV_P(file)) = 3

# rsync -nHaxi --delete /dev/ ./full/
...
[sender] i=4302 /dev zs0 mode=020600 len=0 uid=0 gid=0 flags=0
[sender] i=4303 /dev fd/ mode=040555 len=1024 uid=0 gid=0 flags=4
send_file_list done
file list sent

^C[sender] _exit_cleanup(code=20, file=rsync.c, line=549): entered
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(549) [sender=3.0.8]
[sender] _exit_cleanup(code=20, file=rsync.c, line=549): about to call exit(20)
rsync: writefd_unbuffered failed to write 96 bytes to socket [Receiver]: Broken pipe (32)

DEBUG 9: make_file(./fd1.1440): major(tmp_rdev) = 9
DEBUG 1: send_file_entry(fd1.1440): major(rdev) = 9
DEBUG 3: send_file_entry(fd1.1440): XMIT_SAME_RDEV_MAJOR
DEBUG 9: make_file(./rfd1.1440): major(tmp_rdev) = 9
DEBUG 1: send_file_entry(rfd1.1440): major(rdev) = 9
DEBUG 6: recv_file_entry(fd1.1440): XMIT_SAME_RDEV_MAJOR, rdev_major was 3
DEBUG 7: recv_file_entry(fd1.1440): after assembly, major(rdev) = 3
DEBUG 8: recv_file_entry(fd1.1440): DEV_MAJOR(F_RDEV_P(file)) = 3
DEBUG 4: recv_file_entry(rfd1.1440): took major(rdev) = 3 from previous file fd1.1440
DEBUG 8: recv_file_entry(rfd1.1440): DEV_MAJOR(F_RDEV_P(file)) = 3


### With only patch 6397, stalls on empty case.
# In both cases, --no-i-r does not stall.
# rsync -nHaxi --delete /dev/ ./empty/
...
recv_files(fd/8)
recv_files(fd/9)

^C[sender] _exit_cleanup(code=20, file=rsync.c, line=549): entered
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(549) [sender=3.0.8]
[sender] _exit_cleanup(code=20, file=rsync.c, line=549): about to call exit(20)

# rsync -nHaxi --delete /dev/ ./full/
...
recv_files(fd/8)
recv_files(fd/9)
recv_files phase=1
generate_files phase=2
send_files phase=2
send files finished
total: matches=0  hash_hits=0  false_alarms=0 data=0
recv_files phase=2
generate_files phase=3
recv_files finished
generate_files finished
client_run waiting on 19840

sent 72365 bytes  received 18680 bytes  9583.68 bytes/sec
total size is 49778  speedup is 0.55 (DRY RUN)
[sender] _exit_cleanup(code=0, file=main.c, line=1052): entered
[sender] _exit_cleanup(code=0, file=main.c, line=1052): about to call exit(0)
Comment 8 Matt McCutchen 2011-06-26 16:11:35 UTC
(In reply to comment #7)
> Not sure if the patch is ok, rsync stalls in some different cases
> and needs ctrl-c.

As you can see, all the patch does is add print statements, so the hang must be the result of a timing-dependent bug that was triggered by the addition of the debug output.

> # rsync -nHaxi --delete /dev/ ./empty/
> ...
> [sender] i=4302 /dev zs0 mode=020600 len=0 uid=0 gid=0 flags=0
> [sender] i=4303 /dev fd/ mode=040555 len=1024 uid=0 gid=0 flags=4
> send_file_list done
> file list sent
> 
> ^C[sender] _exit_cleanup(code=20, file=rsync.c, line=549): entered
> rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(549)
> [sender=3.0.8]
> [sender] _exit_cleanup(code=20, file=rsync.c, line=549): about to call exit(20)
> rsync: writefd_unbuffered failed to write 96 bytes to socket [Receiver]: Broken
> pipe (32)
> 
> DEBUG 9: make_file(./fd1.1440): major(tmp_rdev) = 9
> DEBUG 1: send_file_entry(fd1.1440): major(rdev) = 9
> DEBUG 3: send_file_entry(fd1.1440): XMIT_SAME_RDEV_MAJOR
> DEBUG 9: make_file(./rfd1.1440): major(tmp_rdev) = 9
> DEBUG 1: send_file_entry(rfd1.1440): major(rdev) = 9
> DEBUG 6: recv_file_entry(fd1.1440): XMIT_SAME_RDEV_MAJOR, rdev_major was 3

Well that's interesting, the receiver's rdev_major value got out of sync with the sender's.  recv_file_entry::rdev_major is zero-initialized and the only assignment to it is instrumented with "DEBUG 5", so unless you redacted a previous device with major number 3 from the output, the value 3 could only appear via memory corruption.

The fastest way to find the problem is probably to use a gdb watchpoint, like this:

$ gdb --args rsync -nHaxi --delete /dev/ ./empty/
(gdb) set follow-fork-mode child
(gdb) watch recv_file_entry::rdev_major
(gdb) run
...
(gdb) bt

You could also try running rsync under valgrind if convenient.
Comment 9 grarpamp 2011-07-08 07:59:07 UTC
> assignment to it is instrumented with "DEBUG 5"

yes, seems there was such an assignment. i'll have that and
at least gdb to post when at keyboard next week. thanks.
Comment 10 Wayne Davison 2013-06-02 23:42:21 UTC
Any more info on this?
Comment 11 grarpamp 2013-06-03 07:11:56 UTC
Way back then... I think I modified the patch a bit, ran against just a few files in a dir, and reconfirmed it. The workproduct is still here somewhere... a bet from memory says don't close this till someone can recheck. Perhaps me when that box gets rebuilt.
Comment 12 Wayne Davison 2020-07-26 07:14:53 UTC
I found a bug in the handling of the hard-linked data where the sender sets the rdev_major value for hard-linked devices, but the receiver did not.  So, hopefully this fixes the case where the major rdev number was getting set wrong.

If there are other issues, either start one or more new bugs on GitHub or add more information here.