Bug 4162 - Wanted: a mechanism to prevent rsync network compression of compressed files
Status: CLOSED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.0.0
Hardware: PPC Mac OS X
Importance: P3 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-10-11 00:57 UTC by Maynard Handley
Modified: 2008-07-26 10:05 UTC

Description Maynard Handley 2006-10-11 00:57:02 UTC
rsync, of course, offers the -z flag to compress file data as it is transferred over the network during a backup.
My experience using this flag (on a 1GHz PPC laptop connected over 802.11g) has been:
- without the -z flag, the average data rate over the connection is about 2.5MB/s, which is about where you'd expect 802.11g to max out, all things considered
- with the -z flag, the average data rate over the connection is about 1MB/s, and the CPU is maxed out.

Now, if what was being transferred was a stream of text (compression ratio of say 4 or so), this would still be a win. But on a modern personal system, the bulk of the material transferred (by bytes, not by file count) is going to be photos, audio files and video files, i.e. already compressed stuff, so the mean compression ratio over the entire stream of data is going to be just a bit over 1, and using -z is a loss.
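
The break-even point implied by these numbers can be worked out with a few lines of Python. This is only a rough sketch: it assumes the 1MB/s figure is the rate of compressed bytes on the wire (the report does not say so explicitly), and the function and constant names are made up for illustration.

```python
def effective_rate(wire_rate_mb_s: float, compression_ratio: float) -> float:
    """Uncompressed MB delivered per second for a given on-the-wire rate."""
    return wire_rate_mb_s * compression_ratio


PLAIN_RATE = 2.5   # MB/s without -z, as reported above
Z_WIRE_RATE = 1.0  # MB/s with -z; ASSUMED to be compressed bytes on the wire

# -z only pays off once the compression ratio exceeds this value.
break_even_ratio = PLAIN_RATE / Z_WIRE_RATE

for ratio in (1.05, break_even_ratio, 4.0):
    verdict = "win" if effective_rate(Z_WIRE_RATE, ratio) > PLAIN_RATE else "no win"
    print(f"ratio {ratio}: {effective_rate(Z_WIRE_RATE, ratio)} MB/s effective ({verdict})")
```

Under that assumption, text at a ratio of about 4 comes out ahead, while already-compressed media at a ratio just over 1 falls well short of the uncompressed transfer.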

The obvious issue, then, is how can we get the goodness of -z for text files, while avoiding the cost of the CPU to compress files that aren't going to compress much. 

Two obvious strategies spring to mind:
* We could track the progress of the compression and bail out if the ratio is less than some lower limit, maybe 1.2 or so. Maybe run compression till about 8KiB into the file, see how things are going, and either continue compressed or switch to uncompressed. AND/OR
* We could simply allow for a user-supplied list of files (presumably the same syntax as backup_excludes) that we would not bother to try to compress. This may not be as robust a strategy as the first scheme, but it is much easier to program, and should be good enough for most purposes. I'd recommend in addition that rsync ship with a starter template for this file that includes all the usual suspects from *.gzip through *.mp3, *.mov, *.ogg etc etc.
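
Both strategies above can be sketched in Python. This is an illustration of the idea only, not how rsync does or would implement it: the helper names, the starter suffix set, and the use of zlib for the probe are all assumptions.

```python
import os
import zlib

PROBE_BYTES = 8 * 1024  # look at roughly the first 8KiB, as suggested above
MIN_RATIO = 1.2         # the suggested lower limit for "worth compressing"

# Hypothetical starter list; a real template would be much longer.
SKIP_SUFFIXES = {"gz", "gzip", "bz2", "jpg", "mp3", "mov", "ogg"}


def has_skipped_suffix(filename: str, skip=SKIP_SUFFIXES) -> bool:
    """True if the file's final extension is on the do-not-compress list."""
    ext = os.path.splitext(filename)[1]  # e.g. ".mp3", or "" if none
    return ext[1:].lower() in skip


def probe_compresses_well(head: bytes, min_ratio: float = MIN_RATIO) -> bool:
    """Compress the start of the file and check the ratio achieved."""
    sample = head[:PROBE_BYTES]
    if not sample:
        return False
    compressed = zlib.compress(sample)
    return len(sample) / len(compressed) >= min_ratio


def should_compress(filename: str, head: bytes) -> bool:
    """Cheap suffix check first, then the more expensive probe."""
    return not has_skipped_suffix(filename) and probe_compresses_well(head)
```

Checking the suffix first keeps the common case cheap; the probe then catches compressed data hiding behind an unlisted or missing extension.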
Comment 1 Boris Folgmann (dead mail address) 2007-07-12 05:14:54 UTC
I would also like to see such a function. On Linux the file command could be used to determine the exact file type, which is more robust than using a built-in or user-supplied list of file extensions (like .gz, .bz2, .jpg and so on).

Instead of calling the file command, rsync can directly use libmagic. I think that should also be possible on non-UNIX systems, since the library is surely portable.

The other solution, testing compression on the first 8k, is also a very good one, and might be even faster to implement.
Comment 2 Wayne Davison 2007-07-14 14:46:32 UTC
Rsync was already skipping the list of file suffixes that were listed under the "dont compress" option in the daemon manpage (even though that wasn't clear from the docs).

I added a --skip-compress=LIST option that allows the user to specify a list of file suffixes to not compress.  When this option is specified, it overrides the default list except when pulling from a daemon, where it is appended to the daemon's rules.

I also added several suffixes to the default list and made the suffix-matching code much faster (so that the expanded list will not slow things down).

If anyone has suggestions for more suffixes that should be skipped by default, let me know.
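
For anyone scripting backups, here is a hedged sketch of passing the new option from Python. It assumes LIST is written as suffixes separated by slashes, which is how the rsync man page describes it; the helper name and the example paths are hypothetical, and the function only builds the argument list without running anything.

```python
def rsync_command(src: str, dest: str, skip_suffixes: list[str]) -> list[str]:
    """Build an rsync argv that compresses in transit but skips given suffixes."""
    args = ["rsync", "-a", "-z"]
    if skip_suffixes:
        # ASSUMPTION: LIST is suffixes joined by "/" (per the rsync man page).
        args.append("--skip-compress=" + "/".join(skip_suffixes))
    args += [src, dest]
    return args


# Hypothetical source and destination, for illustration only.
print(" ".join(rsync_command("Photos/", "backuphost:Photos/", ["gz", "jpg", "mp3"])))
```

The resulting list can be handed to subprocess.run() as-is, which avoids shell quoting issues with the suffix list.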