I've noticed that with the -z option enabled, comparing two files that are identical (or nearly so) but have different modification times takes significantly longer than without the -z option. It looks like the entire file is compressed as the checksum is calculated even when no data needs to be transmitted to the receiver.

To replicate:

Create test folders/files (using incompressible data to maximize effect):
    mkdir a b
    dd if=/dev/urandom of=a/a.tst bs=1M count=250
    cp a/a.tst b/

run rsync without compression:
    touch a/a.tst
    time rsync -av a/ b

run rsync with compression:
    touch a/a.tst
    time rsync -avz a/ b

The second time with the -z option will take significantly longer, even though the source and destination files are identical apart from the modification times. I've also found that the latter uses much more CPU, but only on one of the two processes - the sender process, I believe.

Even when I added --skip-compress=tst, it made rsync much faster than -z alone, but it still took about 30% longer than omitting -z entirely.
Confirmed:

Without -z, avg over 3 runs: real 0m5.998s
With -z, avg over 3 runs: real 0m8.490s

On Mon, Apr 14, 2014 at 4:10 PM, <samba-bugs@samba.org> wrote:
> https://bugzilla.samba.org/show_bug.cgi?id=10552
>
> Summary: Sender checksum calculation significantly slower with compression enabled
> Product: rsync
> Version: 3.1.1
> Platform: All
> OS/Version: All
> Status: NEW
> Severity: normal
> Priority: P5
> Component: core
> AssignedTo: wayned@samba.org
> ReportedBy: dougmiles@cox.net
> QAContact: rsync-qa@samba.org
Just adding that when using fast storage such as an SSD and/or a slow processor like in some NAS boxes, the gap can widen quite a lot. Here are my times on my 2.6GHz Core i7 Ivy Bridge Mac Mini with an SSD, for example:

Without -z:
    real 0m1.789s
    user 0m0.968s
    sys  0m0.713s

With -z:
    real 0m10.549s
    user 0m10.039s
    sys  0m0.967s
Yes, this is part of the way that rsync trades CPU (and disk I/O) to reduce transfer I/O. When compressing, both sides of the connection "prime the pump" for a compressed file's transfer by including matching data in the compression stream. This ensures that by the time a difference is found, the data will compress better. It can't know in advance that there will be no differences in the whole file, since by the time it finds that out the transfer is done.

If this is causing you problems, you might try --checksum, but that can be slower too if there are a lot of unchanged files in the transfer that match in size & mtime. You can improve the speed of --checksum through one of the rsync patches that caches checksum data.

There has been some thought to disable the shared-data part of the compression, since it complicates the compression-lib usage and (as you saw) is sometimes wasteful. I'm marking this as an enhancement request to add such a compression mode.
I've committed new-style compression that avoids compressing matching file data. If both sides are at least 3.1.1, then using -zz will give you the new compress method.
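For anyone who wants to compare the three modes side by side, here is a rough sketch that reruns the test from the original report with no compression, -z, and -zz. The function names (make_fixture, time_noop_sync), the SIZE_MB variable, and the 16 MB default are placeholders of mine, not anything from rsync; the report used 250 MB. It assumes bash (for the time keyword) and that the local rsync accepts -zz. The one-second sleep is only there so that each touch lands on a different mtime second than the previous sync.

```shell
#!/usr/bin/env bash
# Sketch only: time a no-op resync of one incompressible file under
# three compression modes. Helper names and SIZE_MB are hypothetical.
set -eu

# make_fixture DIR SIZE_MB: create DIR/a and DIR/b holding identical random data.
make_fixture() {
    mkdir -p "$1/a" "$1/b"
    # bs is given in bytes because the 1M suffix is not portable across dd variants.
    dd if=/dev/urandom of="$1/a/a.tst" bs=1048576 count="$2" 2>/dev/null
    cp "$1/a/a.tst" "$1/b/"
}

# time_noop_sync DIR OPTS: bump the source mtime, then time the resync.
time_noop_sync() {
    sleep 1                 # make sure the new mtime differs by a full second
    touch "$1/a/a.tst"      # same content, newer mtime: forces the checksum pass
    echo "== rsync $2 =="
    time rsync "$2" "$1/a/" "$1/b"
}

if command -v rsync >/dev/null 2>&1; then
    work="$(mktemp -d)"
    make_fixture "$work" "${SIZE_MB:-16}"
    for opts in -a -az -azz; do
        time_noop_sync "$work" "$opts"
    done
    rm -rf "$work"
else
    echo "rsync not found; skipping timings"
fi
```

Run it with something like SIZE_MB=250 to match the original report. If the new compression behaves as described above, the -azz timing should be much closer to the -a timing than to the -az one, but I have not included numbers here because they depend heavily on CPU and storage speed.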