Tested and discovered between two machines running fairly recently updated Debian unstable with rsync 2.6.2-2. I've been trying to rsync a large 7 GB file across a slow network connection during off-peak hours, and came across a nasty bug in rsync relating to 32-bit rollover of the file size. I was restarting the rsync every night and killing it every morning, expecting to eventually get the whole file. Much to my surprise, it never seemed to get anywhere, even though each restarted run appeared to be working fine. Unless a single run of rsync is allowed to complete the transfer of all data past the 4 GB mark, rsync will never be able to finish this transfer.

A verbose output of the start of a test rsync that makes the problem apparent follows:

> rsync -avvvP --bwlimit=30 test2:bigtestfile ./
opening connection using ssh test2 rsync --server --sender -vvvlogDtpr --bwlimit=30 --partial . "bigtestfile"
receiving file list ...
server_sender starting pid=22545
[sender] make_file(bigtestfile,*,2)
[sender] expand file_list to 131072 bytes, did move
recv_file_name(bigtestfile)
received 1 names
1 file to consider
recv_file_list done
get_local_name count=1 ./
recv_files(1) starting
generator starting pid=25013 count=1
delta transmission enabled
recv_generator(bigtestfile,0)
send_file_list done
send_files starting
generating and sending sums for 0
count=56894 rem=56488 blength=56888 s2length=4 flength=3236585472
Killed by signal 2.

The interesting thing is that the actual file size on both ends is 7531552768, but the size claimed by rsync (the flength above) is exactly 4 GB less than that, which is also the remainder of the size that fits in a 32-bit unsigned int. That looks like a 32-bit rollover or variable-size error. I will happily provide any further information or try fixes.
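To spell out the arithmetic: 2^32 = 4294967296, and 7531552768 - 4294967296 = 3236585472, which is exactly the flength in the log above -- the true size truncated modulo 2^32.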
Your bug report made the problem easy to find. I've checked in a fix to CVS, but you can fix your current source by changing the "size_t" to "OFF_T" in the generate_and_send_sums() function in generator.c. It should look like this:

static void generate_and_send_sums(struct map_struct *buf, OFF_T len, int f_out)

Then recompile and install, and the problem should be gone.
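For anyone curious about the failure mode, here is a minimal standalone sketch (hypothetical code, not the rsync source) of what happens when a 64-bit file length is squeezed through a 32-bit size_t parameter on a 32-bit system:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Stand-in for the old size_t parameter: a 32-bit length silently
 * truncates whatever it receives modulo 2^32. */
static void checksum_pass(uint32_t len)
{
    printf("length as seen by the sum code: %" PRIu32 "\n", len);
}

int main(void)
{
    int64_t file_len = INT64_C(7531552768);  /* actual size from the report */
    checksum_pass((uint32_t)file_len);       /* prints 3236585472 = 7531552768 - 2^32 */
    return 0;
}

Compiled and run, it prints 3236585472 -- the bogus flength from the original report.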
I was getting bitten by the same bug; in my case a 4.1 GB file was rolling over to roughly 100 MB. I've applied the patch and the problem in generate_and_send_sums() is indeed fixed, but it appears to have just moved on to another, much graver bug which causes nothing to get transferred:

rsync rsync.planetmirror.com::fedora/linux/core/test/2.90/i386/iso/FC3-test1-i386-DVD.iso -azvvv .
opening tcp connection to rsync.planetmirror.com port 873
Welcome to PlanetMirror's rsync service.

You can find a web front end to this archive at: http://planetmirror.com
You can also access this service via FTP at ftp://ftp.planetmirror.com

If you are a regular PM user, please consider supporting PM by subscribing to a Premium or PremiumDownload account via the web front end.

receiving file list ...
recv_file_name(FC3-test1-i386-DVD.iso)
received 1 names
done
recv_file_list done
get_local_name count=1 .
recv_files(1) starting
generator starting pid=2243 count=1
delta transmission enabled
recv_generator(FC3-test1-i386-DVD.iso,0)
generating and sending sums for 0
count=66400 rem=4729 blength=66384 s2length=4 flength=4407835945
generate_files phase=1
recv_files(FC3-test1-i386-DVD.iso)
FC3-test1-i386-DVD.iso
recv mapped FC3-test1-i386-DVD.iso of size 4407835945
rsync: connection unexpectedly closed (89 bytes read so far)
rsync error: error in rsync protocol data stream (code 12) at io.c(359)
_exit_cleanup(code=12, file=io.c, line=359): about to call exit(12)
rsync: connection unexpectedly closed (69 bytes read so far)
rsync error: error in rsync protocol data stream (code 12) at io.c(359)
_exit_cleanup(code=12, file=io.c, line=359): about to call exit(12)

(In reply to comment #1)
> Your bug report made the problem easy to find. I've checked in a fix to CVS,
> but you can fix your current source by changing the "size_t" to "OFF_T" in the
> generate_and_send_sums() function in generator.c. It should look like this:
>
> static void generate_and_send_sums(struct map_struct *buf, OFF_T len, int f_out)
>
> Then recompile and install, and the problem should be gone.
(In reply to comment #2)
> I was getting bitten by the same bug; in my case a 4.1 GB file was rolling
> over to roughly 100 MB. I've applied the patch and the problem in
> generate_and_send_sums() is indeed fixed, but it appears to have just moved
> on to another, much graver bug which causes nothing to get transferred:

Keep in mind, BOTH ends of the rsync connection need to apply this fix in order to work correctly, since both ends call generate_and_send_sums() to decide what needs to be transferred. So unless you've also gotten planetmirror.com to apply this fix to their rsync server, you will still have related rollover problems. This is probably what is causing your problem.

Evan
> Keep in mind, BOTH ends of the rsync connection need to apply
> this fix in order to work correctly

Only the receiving side calls generate_and_send_sums(), so only the receiver is affected by this fix. As long as Ludi is just pulling files, it shouldn't matter if their rsync is patched or not. I'm going to do some testing and determine if I need to reopen the bug.
I was under the impression that both sides needed it, because after I patched one end, it still didn't work for me, but after I patched both ends, it did. But maybe I screwed up something in the test. Evan
(In reply to comment #2)
> I've applied the patch and the problem in generate_and_send_sums() is
> indeed fixed, but it appears to have just moved on to another, much graver
> bug which causes nothing to get transferred:

I too saw an error trying to copy that file from planetmirror.com when I have a 4GB basis file on the receiving system, but it is the rsync daemon on the planetmirror.com server that is going away, so the only way to see what is wrong is to ask someone at that site to look in the rsync log and see if there are any errors. I tried a similar copy from two local rsync daemons (one running 2.6.2 and one running CVS), and neither one exhibited this failure, so it may be some kind of resource problem on the planetmirror server (e.g. perhaps the daemon exceeded its ulimit memory limit).
OK, so assuming the fault was with rsync.planetmirror.com, I retried, this time using rsync.kernel.org, but seem to have hit another problem: after taking an eternity, the file was downloaded and the size was ultimately correct, upon which rsync barfed, deleted the tempfile, and restarted (I notice s2length has grown from 4 to 16 below, so the retry pass appears to be using full-length block checksums). Because the file was automatically deleted, I cannot confirm or deny whether it actually had a bad checksum.

...
chunk[66398] of size 66384 at 4407764832 offset=4407675447
chunk[66399] of size 4729 at 4407831216 offset=4407741831
got file_sum
WARNING: FC3-test1-i386-DVD.iso failed verification -- update discarded (will try again).
recv_generator(FC3-test1-i386-DVD.iso,0)
recv_files phase=1
gen mapped FC3-test1-i386-DVD.iso of size 4407835945
generating and sending sums for 0
count=66400 rem=4729 blength=66384 s2length=16 flength=4407835945
chunk[0] offset=0 len=66384 sum1=48387eea
chunk[1] offset=66384 len=66384 sum1=f402c97e

(In reply to comment #6)
> I too saw an error trying to copy that file from planetmirror.com when I have
> a 4GB basis file on the receiving system, but it is the rsync daemon on the
> planetmirror.com server that is going away, so the only way to see what is
> wrong is to ask someone at that site to look in the rsync log and see if
> there are any errors. I tried a similar copy from two local rsync daemons
> (one running 2.6.2 and one running CVS), and neither one exhibited this
> failure, so it may be some kind of resource problem on the planetmirror
> server (e.g. perhaps the daemon exceeded its ulimit memory limit).
Now that len is OFF_T, is it possible that sum->count (which is size_t) in sum_sizes_sqroot() will roll over too, at this line?

sum->count = (len + (blength - 1)) / blength;

If we assume all variables have all bits set:

(2^64 + (2^32 - 1)) / 2^32 = 2^32 + 1

But sum->count can only represent up to 2^32, and therefore sum->count will be 0. Did I make a mistake here?
(In reply to comment #8)
> Did I make a mistake here?

I have some minor quibbles with what you stated (which I will describe below), but, as you point out, this value can theoretically overflow. However, for the overflow to happen the file must be nearly an exbibyte in size (over a billion GiB), and anyone who is trying to rsync anywhere near that large a file will experience rsync grinding to a halt due to the hash collisions in its checksum search algorithm (which will happen long before reaching rsync's maximum file-size limit). So, yes -- that value could theoretically overflow, but it is not likely to in actual practice.

As promised (threatened?? :-) ), I'll point out my quibbles with your logic, just for completeness. If we just ignore that 2^32+1 really would truncate to 1 instead of 0, the remaining problems are due to size differences from the variables in the latest rsync, which may stem from you looking at an older version of the source:

- "len" is an int64 whose maximum positive value is 2^63 - 1.
- "blength" has an enforced upper bound of 2^29.
- "count" is an int32 whose maximum positive value is 2^31 - 1.

The program logic as written (adding blength - 1 to the length) could cause a signed value to overflow into the negative when it is near its limit, so let's pretend the code is written like this (which it probably will be soon*):

sum->remainder = len % blength;
sum->count = len / blength + (sum->remainder != 0);

OK. That makes the maximum result prior to truncation:

(2^63 - 1) / 2^29 + 1 = 2^34

As noted before, this is larger than the 2^31 - 1 value that "count" can hold. I calculate that the maximum file size is therefore (2^31 - 1) * 2^29, or 1,152,921,504,069,976,064 bytes (slightly less than 2^60, which is an exbibyte).

* The reason I will probably change this to avoid the potential overflow of adding (blength - 1) is that rsync allows itself to be compiled on a system that does not have 64-bit integers, which means that the "len" value might really be an int32, and we don't want the length to be able to overflow into negative values.
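If you want to double-check that arithmetic, here is a small standalone program (an illustration only, not rsync source) that computes the worst-case block count and the largest file size a 32-bit count can handle:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    const int64_t max_len = INT64_MAX;             /* 2^63 - 1, the largest OFF_T */
    const int64_t max_blength = INT64_C(1) << 29;  /* rsync's enforced block-size cap */

    /* The overflow-safe form shown above. */
    int64_t remainder = max_len % max_blength;
    int64_t count = max_len / max_blength + (remainder != 0);

    printf("worst-case count: %" PRId64 " (INT32_MAX is %" PRId32 ")\n",
           count, INT32_MAX);
    printf("max file size with a 32-bit count: %" PRId64 " bytes\n",
           (int64_t)INT32_MAX * max_blength);
    return 0;
}

Running it prints a worst-case count of 2^34 (17179869184), well past INT32_MAX, and a maximum file size of 1152921504069976064 bytes, matching the figures above.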
thomas@spiral:~> cat rsync-intoverflowtest.c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned i = UINT_MAX;

    printf("i = %u, i+1 = %u\n", i, i+1);
    exit(0);
}
thomas@spiral:~> ./rsync-intoverflowtest
i = 4294967295, i+1 = 0
thomas@spiral:~>

Yes, I looked at an older version (just the one we ship with our enterprise server :). I'll have a look at the rest of your comment tomorrow. :)
Sorry, Wayne, you are right: 2^32+1 really is 1. :) Yes, 2^60 seems much too big to be practical today, but maybe, to avoid problems in the future, it would be better to have all of this type-clean. BTW, AFAICR overflow of signed integer types is undefined behavior in C -- only unsigned types are guaranteed to wrap -- so the result can depend on the compiler.
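To illustrate the distinction (a minimal example, assuming nothing beyond standard C):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    unsigned u = UINT_MAX;

    /* Unsigned arithmetic wraps modulo 2^N by definition; this prints 0. */
    printf("UINT_MAX + 1u = %u\n", u + 1u);

    /* INT_MAX + 1, by contrast, is undefined behavior: the compiler may
     * assume it never happens, so it is best avoided rather than tested. */
    return 0;
}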