Bug 5482 - look into a hierarchical checksum algorithm that would help to efficiently transmit really large files
look into a hierarchical checksum algorithm that would help to efficiently tr...
Product: rsync
Classification: Unclassified
Component: core
All All
: P3 enhancement
: ---
Assigned To: Wayne Davison
Rsync QA Contact
Depends on:
  Show dependency treegraph
Reported: 2008-05-21 22:14 UTC by Dave Yost
Modified: 2008-05-25 09:31 UTC (History)
1 user (show)

See Also:


Note You need to log in before you can comment on or make changes to this bug.
Description Dave Yost 2008-05-21 22:14:08 UTC
These files are huge, and being able to drastically cut back on retransmission for such files would be a really big win. Many changes to these files involve changes to a tiny part of the file that changes in size and thus shifts the giant media data portions to a new offset not on a block boundary.

When rsync encounters one of these files, it should checksum each atom of the file instead of each block.
Comment 1 Matt McCutchen 2008-05-21 22:44:41 UTC
Rsync's delta-transfer algorithm already checks for each destination block at every byte offset in the source file (not just multiples of the block size), so the algorithm can match data across an insertion or deletion of an arbitrary number of bytes with a loss of about one additional block.  I don't think there is any need for special processing of .mov or .mp4 files.
Comment 2 Jamie Lokier 2008-05-24 17:46:18 UTC

Although it's true the delta-transfer algorithm detects the small changed regions, and block boundary is not an issue, even the delta-transfer algorithm is slow for _very_ large files with small changes.

Think about a 4GB media file with 10k of changes near the start.  In this, the delta-transfer algorithm transmits a lot of block checksum data just to do the comparisons - enough to be substantially affected by network bandwidth.  Alternatively, if the blocks are large to reduce the number of checksums, transmitting a single block of data is significant.

I doubt if the scenario described in this bug report is all that common.  How often do you change the header of a huge video file, without transcoding the contents as well?  However, if it is, can the delta-transfer algorithm be tuned better for this by using smaller blocks near the start of the file?

(More generally, a hierarchical delta algorithm (checksums of blocks of checksums of blocks - in a tree structure, but 2 or 3 levels may be plenty) would solve this in a general way for a number of things involving small changes in very large files.  If you concatenate all the files and metadata to make a single structured data stream to be delta-transferred, it may also be a good optimisation on data sets consisting of large numbers of files with only a few changed.  The logical extreme is transferring a single checksum being enough to compare the whole data set in just a few bytes, followed by a top down breadth-first checksum tree traversal.  This is what I am attempting in a project of mine.)
Comment 3 Dave Yost 2008-05-24 19:59:51 UTC
QuickTime and MPEG4 (as well as others, I think) use this file format

If rsync would understand this format, then for large atoms it could send a checksum for each atom instead of for each block.
Comment 4 Mike Frysinger 2008-05-25 02:07:52 UTC
making rsync understand random formats and parsing them really doesnt sound like an appropriate or scalable solution
Comment 5 Jamie Lokier 2008-05-25 08:33:45 UTC
I agree that it's not appropriate to parse file formats, unless there are a small number of commonly used, generic formats where it makes a very big difference.  I'm not sure if the media container formats would qualify or if it's a common scenario to have edited large media files.

However, what do you think of a _generic_ hierarchical checksum algorithm, for finding small changes in very large streams? 
Comment 6 Wayne Davison 2008-05-25 09:31:51 UTC
I also think coding a special transfer algorithm for media files is not something that would be appropriate for rsync (though someone may want to code their own custom version if this is important to them).

A general hierarchical checksum would be quite interesting, but we'd need to balance the costs of disk I/O and memory use, and avoid round-trip latency.  Rsync deals with latency by pipe-lining the checksum generation and file reconstruction in separate processes, but each one requires a separate read of the basis file.  So, we'd want to make rsync still able to process more than one active transfer and not do a read-read for each sub-level of checksum generation.

If rsync were made to send a first level set of checksums for each file, computing a certain number of sub-level checksums into memory, that might work for the generator at the expense of holding the hierarchical sums in memory for N active files at once (perhaps topping out after a certain selectable memory limit was reached).

The biggest unknown to me would be how to make the sender work efficiently for such an algorithm.  The sender is the part that must compute a new checksum on each character boundary (using a rolling checksum) to find relocation matches for each block.  Computing a rolling checksum using the current method would require holding huge amount of the file in memory at once (but that could be changed to use mem-mapped files, potentially triggering re-reads as the window slides forward).  We'd also need to move away from the current bifurcated checksum algorithm into a single, stronger rolling checksum to avoid having to recompute the strong checksum over a huge amount of data to double-check a match.  Also, each successive level of transfer would likely trigger a re-read of the part of the file that didn't match anything as we divide it into smaller checksums.

We'd also want to limit how hierarchical the checksums are based on file size, as smaller files just need the current method, while really large files might benefit from several levels.

We may want to even consider a reversal of the jobs of the sender and generator to lighten the load on the server.  This would slow things down latency-wise, but could make it possible for a server to deal with more clients, especially if this makes the per-byte processing more intensive.

Sounds like an intriguing idea to investigate further and see how feasible it would be.  Comments?  Jamie, do you have code to share?