Bug 1529 - 32bit rollover problem rsyncing files greater than 4GB in size
Status: CLOSED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.2
Hardware: x86 Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
Reported: 2004-07-14 07:45 UTC by Evan Harris
Modified: 2005-09-07 01:36 UTC

Description Evan Harris 2004-07-14 07:45:58 UTC
Tested and discovered between two machines running fairly recently updated Debian
unstable with rsync 2.6.2-2.

I've been trying to rsync a large 7gig file across a slow network connection
during off-peak hours, and came across a nasty bug in rsync relating to
32-bit rollover of the file size.

I was restarting the rsync every night, and killing it every morning, and
expected to eventually get the whole file.  But much to my surprise, it
never did seem to get anywhere, though when it was restarted, it seemed to
be working fine.

Unless a single rsync run is allowed to transfer all of the data past the
4gig mark, rsync will never be able to complete this transfer.

A verbose output of the start of a test rsync that makes the problem
apparent follows:

> rsync -avvvP --bwlimit=30 test2:bigtestfile ./
opening connection using ssh test2 rsync --server --sender -vvvlogDtpr
--bwlimit=30 --partial . "bigtestfile"
receiving file list ...
server_sender starting pid=22545
[sender] make_file(bigtestfile,*,2)
[sender] expand file_list to 131072 bytes, did move
recv_file_name(bigtestfile)
received 1 names
1 file to consider
recv_file_list done
get_local_name count=1 ./
recv_files(1) starting
generator starting pid=25013 count=1
delta transmission enabled
recv_generator(bigtestfile,0)
send_file_list done
send_files starting
generating and sending sums for 0
count=56894 rem=56488 blength=56888 s2length=4 flength=3236585472
Killed by signal 2.

The interesting thing is that the actual file size on both ends is
7531552768, but the size claimed by rsync is exactly 4gig (2^32 bytes) less
than that.  In other words, rsync is reporting the real size modulo 2^32, which
looks like a 32bit rollover or variable-size error.

I will happily provide any further information or try fixes.
Comment 1 Wayne Davison 2004-07-14 09:55:41 UTC
Your bug report made the problem easy to find.  I've checked in a fix to CVS,
but you can fix your current source by changing the "size_t" to "OFF_T" in the
generate_and_send_sums() function in generator.c.  It should look like this:

static void generate_and_send_sums(struct map_struct *buf, OFF_T len, int f_out)

Then, recompile, install, and the problem should be gone.
Comment 2 Ludi de Souza 2004-07-31 03:48:38 UTC
I was getting bitten by the same bug, in my case a 4.1GB file rolling over in
100MBs.  I've applied the patch and the problem in generate_and_send_sums is
indeed fixed, but it appears to have just moved onto another much graver bug
which causes nothing to get transferred:

rsync rsync.planetmirror.com::fedora/linux/core/test/2.90/i386/iso/FC3-test1-i386-DVD.iso -azvvv .

opening tcp connection to rsync.planetmirror.com port 873

  Welcome to PlanetMirror's rsync service.


  You can find a web front end to this archive at:

       http://planetmirror.com

  You can also access this service via FTP at

        ftp://ftp.planetmirror.com


  If you are a regular PM user, please consider supporting PM by
  subscribing to a Premium or PremiumDownload account via the
  web front end.


receiving file list ...
recv_file_name(FC3-test1-i386-DVD.iso)
received 1 names
done
recv_file_list done
get_local_name count=1 .
recv_files(1) starting
generator starting pid=2243 count=1
delta transmission enabled
recv_generator(FC3-test1-i386-DVD.iso,0)
generating and sending sums for 0
count=66400 rem=4729 blength=66384 s2length=4 flength=4407835945
generate_files phase=1
recv_files(FC3-test1-i386-DVD.iso)
FC3-test1-i386-DVD.iso
recv mapped FC3-test1-i386-DVD.iso of size 4407835945
rsync: connection unexpectedly closed (89 bytes read so far)
rsync error: error in rsync protocol data stream (code 12) at io.c(359)
_exit_cleanup(code=12, file=io.c, line=359): about to call exit(12)
rsync: connection unexpectedly closed (69 bytes read so far)
rsync error: error in rsync protocol data stream (code 12) at io.c(359)
_exit_cleanup(code=12, file=io.c, line=359): about to call exit(12)





Comment 3 Evan Harris 2004-07-31 09:43:40 UTC
(In reply to comment #2)
> I was getting bitten by the same bug, in my case a 4.1GB file rolling over in
> 100MBs.  I've applied the patch and the problem in generate_and_send_sums is
> indeed fixed, but it appears to have just moved onto another much graver bug
> which causes nothing to get transferred:

Keep in mind, BOTH ends of the rsync connection need to apply this fix in order
to work correctly, since both ends call generate_and_send_sums() to decide what
needs to be transferred.

So unless you've also gotten planetmirror.com to apply this fix to their rsync
server, you will still have related rollover problems.  This is probably what is
causing your problem.

Evan
Comment 4 Wayne Davison 2004-07-31 12:18:12 UTC
> Keep in mind, BOTH ends of the rsync connection need to apply
> this fix in order to work correctly

Only the receiving side calls generate_and_send_sums(), so only the receiver is
affected by this fix.  As long as Ludi is just pulling files, it shouldn't
matter if their rsync is patched or not.

I'm going to do some testing and determine if I need to reopen the bug.
Comment 5 Evan Harris 2004-07-31 12:42:20 UTC
I was under the impression that both sides needed it, because after I
patched one end, it still didn't work for me, but after I patched both ends,      
it did.  But maybe I screwed up something in the test.   

Evan
Comment 6 Wayne Davison 2004-08-01 13:45:37 UTC
(In reply to comment #2)
> I've applied the patch and the problem in generate_and_send_sums is
> indeed fixed, but it appears to have just moved onto another much graver bug
> which causes nothing to get transferred:

I too saw an error trying to copy that file from planetmirror.com when I have a
4GB basis file on the receiving system, but it is the rsync daemon on the
planetmirror.com server that is going away, so the only way to see what is wrong
is to ask someone at that site to look in the rsync log and see if there are any
errors.  I tried a similar copy from two local rsync daemons (one running 2.6.2
and one running CVS), and neither one exhibited this failure, so it may be some
kind of a resource problem on the planetmirror server (e.g. perhaps the daemon
exceeded the ulimit memory limit).
Comment 7 Ludi de Souza 2004-08-04 05:52:11 UTC
OK, so assuming the fault was with rsync.planetmirror.com I retried this time
using rsync.kernel.org, but seem to have hit another problem:

After taking an eternity the file was downloaded, size was ultimately correct,
upon which it barfed, deleted the tempfile and restarted.  Because the file was
automatically deleted I cannot confirm or deny if it actually had a bad checksum.

...
chunk[66398] of size 66384 at 4407764832 offset=4407675447
chunk[66399] of size 4729 at 4407831216 offset=4407741831
got file_sum
WARNING: FC3-test1-i386-DVD.iso failed verification -- update discarded (will
try again).
recv_generator(FC3-test1-i386-DVD.iso,0)
recv_files phase=1
gen mapped FC3-test1-i386-DVD.iso of size 4407835945
generating and sending sums for 0
count=66400 rem=4729 blength=66384 s2length=16 flength=4407835945
chunk[0] offset=0 len=66384 sum1=48387eea
chunk[1] offset=66384 len=66384 sum1=f402c97e




Comment 8 Thomas Biege 2005-09-06 06:37:20 UTC
Now that len is an OFF_T, is it possible that sum->count (which is a size_t) in
sum_sizes_sqroot() will roll over too, at the line:
        sum->count      = (len + (blength - 1)) / blength; ?

When we assume all variables have all bits set:
2^64 + (2^32 - 1) / 2^32 = 2^32 + 1

But sum->count can only represent 2^32 distinct values and therefore sum->count
will be 0.

Did I make a mistake here?


Comment 9 Wayne Davison 2005-09-06 11:10:13 UTC
(In reply to comment #8)
> Did I make a mistake here?

I have some minor quibbles with what you stated (which I will describe below),
but, as you point out, this value can theoretically overflow.  However, for the
overflow to happen the file must be nearly an exbibyte in size (over a billion
GiB), and anyone who is trying to rsync anywhere near that large of a file will
experience rsync grinding to a halt due to the hash collisions in its checksum
search algorithm (which will happen long before reaching rsync's maximum
file-size limit).

So, yes -- that value could theoretically overflow, but it is not likely that it
will in actual practice.

As promised (threatened?? :-) ), I'll point out my quibbles with your logic,
just for completeness:

If we just ignore that 2^32+1 really would truncate to 1 instead of 0, the
remaining problems are due to size differences from the variables in the latest
rsync, which may stem from you looking at an older version of the source:

- "len" is an int64 whose maximum positive value is 2^63 - 1.
- "blength" has an enforced upper bound of 2^29.
- "count" is an int32 whose maximum positive value is 2^31 - 1.

The program logic as written (adding blength - 1 to the length) could cause a
signed value to overflow into the negative when it is near its limit, so, let's
pretend the code is written like this (which it probably will be soon*):

    sum->remainder  = len % blength;
    sum->count      = len / blength + (sum->remainder != 0);

OK.  That makes the maximum result prior to truncation:

    (2^63-1) / (2^29) + 1 = 2^34

As noted before, this is larger than the 2^31 - 1 value that "count" can hold. 
I calculate that the maximum file size is therefore (2^31-1) * 2^29, or
1,152,921,504,069,976,064 bytes (or slightly less than 2^60, which is an exbibyte).

* The reason I will probably change this to avoid the potential overflow of
adding (blength-1) is that rsync allows itself to be compiled on a system that
does not have 64-bit integers, which means that the "len" value might really be
an int32, and we don't want the length to be able to overflow into negative values.
Comment 10 Thomas Biege 2005-09-06 11:38:50 UTC
thomas@spiral:~> cat rsync-intoverflowtest.c
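#include <limits.h>
#include <stdio.h>
#include <stdlib.h>     /* declares exit() */

int main(void)
{
        /* Unsigned arithmetic wraps modulo 2^N, so UINT_MAX + 1 is 0. */
        unsigned i = UINT_MAX;

        printf("i = %u, i+1 = %u\n", i, i+1);

        exit(0);
}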
thomas@spiral:~> ./rsync-intoverflowtest
i = 4294967295, i+1 = 0
thomas@spiral:~>

Yes, I looked at an older version (just the one we ship with our enterprise
server :).
I'll have a look at the rest of your comment tomorrow. :)
Comment 11 Thomas Biege 2005-09-07 01:36:26 UTC
Sorry, Wayne, you are right: 2^32+1 truncated to 32 bits really is 1. :)

Yes, 2^60 seems much too big to be practical today, but maybe, to avoid problems
in the future, it would be better to make all of this type-clean.

BTW, AFAICR overflow of signed integer types is undefined behavior in the C
standard, so the result depends on the compiler.