Bug 10552 - Add optional compress mode that skips including unchanged data in the compression stream
Status: RESOLVED FIXED
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.1
Hardware: All All
Importance: P5 enhancement
Target Milestone: ---
Assigned To: Wayne Davison
QA Contact: Rsync QA Contact
Depends on:
Blocks:
Reported: 2014-04-14 20:10 UTC by dougmiles
Modified: 2014-04-19 19:29 UTC (History)
Description dougmiles 2014-04-14 20:10:57 UTC
I've noticed that with the -z option enabled, comparing two files that are identical (or nearly so) but have different modification times takes significantly longer than without the -z option.  It looks like the entire file is compressed as the checksum is calculated even when no data needs to be transmitted to the receiver.

To replicate:

Create test folders/files (using incompressible data to maximize effect):
mkdir a b
dd if=/dev/urandom of=a/a.tst bs=1M count=250
cp a/a.tst b/

Run rsync without compression:
touch a/a.tst
time rsync -av a/ b

Run rsync with compression:
touch a/a.tst
time rsync -avz a/ b

The second run, with the -z option, takes significantly longer, even though the source and destination files are identical apart from their modification times.  I've also found that the -z run uses much more CPU, but only in one of the two processes (the sender, I believe).

Adding --skip-compress=tst made rsync much faster than -z alone, but it still took about 30% longer than omitting -z entirely.
Comment 1 John Pierman 2014-04-17 12:34:03 UTC
Confirmed:

Without -z avg over 3 runs (real    0m5.998s)
With -z avg over 3 runs (real    0m8.490s)


On Mon, Apr 14, 2014 at 4:10 PM, <samba-bugs@samba.org> wrote:

> https://bugzilla.samba.org/show_bug.cgi?id=10552
>
>            Summary: Sender checksum calculation significantly slower with
>                     compression enabled
>            Product: rsync
>            Version: 3.1.1
>           Platform: All
>         OS/Version: All
>             Status: NEW
>           Severity: normal
>           Priority: P5
>          Component: core
>         AssignedTo: wayned@samba.org
>         ReportedBy: dougmiles@cox.net
>          QAContact: rsync-qa@samba.org
Comment 2 dougmiles 2014-04-17 15:26:35 UTC
Just adding that when using fast storage such as an SSD and/or a slow processor like in some NAS boxes, the gap can widen quite a lot.  Here are my times on my 2.6GHz Core i7 Ivy Bridge Mac Mini with an SSD, for example:

Without -z:
real	0m1.789s
user	0m0.968s
sys	0m0.713s

With -z:
real	0m10.549s
user	0m10.039s
sys	0m0.967s
Comment 3 Wayne Davison 2014-04-19 16:57:45 UTC
Yes, this is part of the way that rsync trades CPU (and disk I/O) to reduce transfer I/O.  When compressing, both sides of the connection "prime the pump" for a compressed file's transfer by including matching data in the compression stream.  This ensures that by the time a difference is found, the data will compress more effectively.  Rsync can't know in advance that there will be no differences in the whole file, since by the time it finds that out the transfer is done.  If this is causing you problems, you might try --checksum, but that can be slower too if there are a lot of unchanged files in the transfer that match in size & mtime.  You can improve the speed of --checksum through one of the rsync patches that caches checksum data.

Some thought has been given to disabling the shared-data part of the compression, since it complicates the compression-lib usage and (as you saw) is sometimes wasteful.  I'm marking this as an enhancement request to add such a compression mode.
Comment 4 Wayne Davison 2014-04-19 19:29:35 UTC
I've committed new-style compression that avoids compressing matching file data.  If both sides are at least 3.1.1, then using -zz will give you the new compress method.