The Samba-Bugzilla – Bug 9814
--cache parameter for storing recent file data
Last modified: 2013-04-18 19:42:14 UTC
I know rsync is generally stateless, but caching of recent data is something that could significantly speed it up by skipping checksumming entirely.
The idea is that a file's absolute path (not its contents) is checksummed and then looked up in a folder structure of cached details, or maybe even a database. A filesystem solution can optimise by grouping the entries for multiple files into a single cache file, with each line acting as the final index into the cache, until that file gets too large. This works because checksums are a fixed size, and timestamps can be as well.
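To make the filesystem layout a bit more concrete, here is a rough Python sketch of one way it could work. Nothing here is existing rsync behaviour: the record layout, the bucket naming and the function names (`path_key`, `bucket_file`, `store`, `lookup`) are all made up for illustration, and it uses fixed-size binary records rather than text lines so that every entry has the same length.

```python
import hashlib
import os
import struct

# Hypothetical fixed-size record: 16-byte path digest, 16-byte content
# checksum, 8-byte mtime in nanoseconds, 8-byte file size (48 bytes total).
RECORD = struct.Struct("<16s16sqq")


def path_key(path):
    """Hash the absolute path (not the file contents)."""
    data = os.path.abspath(path).encode("utf-8")
    return hashlib.blake2b(data, digest_size=16).digest()


def bucket_file(cache_dir, key):
    """The first byte of the key selects one of 256 bucket files."""
    return os.path.join(cache_dir, key[:1].hex() + ".bin")


def store(cache_dir, path, checksum, mtime_ns, size):
    """Append a record for this path to its bucket file."""
    key = path_key(path)
    os.makedirs(cache_dir, exist_ok=True)
    with open(bucket_file(cache_dir, key), "ab") as f:
        f.write(RECORD.pack(key, checksum, mtime_ns, size))


def lookup(cache_dir, path):
    """Return (checksum, mtime_ns, size) for the path, or None if not cached."""
    key = path_key(path)
    found = None
    try:
        with open(bucket_file(cache_dir, key), "rb") as f:
            while True:
                chunk = f.read(RECORD.size)
                if len(chunk) < RECORD.size:
                    break
                k, checksum, mtime_ns, size = RECORD.unpack(chunk)
                if k == key:
                    # Records are appended, so the last match is the newest.
                    found = (checksum, mtime_ns, size)
    except FileNotFoundError:
        return None
    return found
```

A real implementation would need to split or compact buckets once they get too large; this sketch only appends.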
Ideally we'd get support for at least the file-system method.
Necessary options would include thresholds for discarding cache entries, based on how long ago the entry was modified and/or how many times it has been accessed. The latter would allow rsync to recheck files only on every second pass, for example.
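As a sketch of how the two thresholds might interact (again just an illustration in Python; `CacheEntry`, `use_cached`, `max_age_seconds` and `max_hits` are hypothetical names, not proposed option names):

```python
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    checksum: bytes
    written_at: float  # when the entry was last written (seconds since epoch)
    hits: int = 0      # times the entry has been trusted since it was written


def use_cached(entry, max_age_seconds, max_hits, now=None):
    """Return True if the cached data may be trusted for this pass.

    The entry is rejected once it is older than max_age_seconds or has
    already been used max_hits times; the caller is then expected to
    recheck the file and write a fresh entry with hits reset to 0.
    """
    now = time.time() if now is None else now
    if now - entry.written_at > max_age_seconds:
        return False
    if entry.hits >= max_hits:
        return False
    entry.hits += 1
    return True
```

With `max_hits=1` an entry is trusted once and then rejected, which gives the "recheck on every second pass" behaviour.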
For continuous incremental updates, a cache that does a good job of balancing speed and size should allow comparisons to be performed extremely quickly. It could also be used to skip files entirely on the client side if some kind of cache comparison can be performed (so that the client can quickly decide whether a file has probably already been backed up).
I just wanted to add that I mean caching in a fairly broad sense, rather than simply caching of checksums alone.
My main concern is the seemingly vast difference in speed between OS X's Time Machine and rsync. I think Time Machine uses a Spotlight database to compare the source and destination. I know rsync is fundamentally stateless, so it probably shouldn't try to hook into anything quite like that, but it should really be able to use some kind of cache to accelerate its comparison.
A possibility would be to allow creation of an SQLite database containing details of all files copied to a destination (and conversely another in the source to record everything that was copied from that location). These can then be used for accelerated lookups. If rsync could then hook into common systems such as Spotlight and equivalents then it could avoid having to manage a database of its own at one or both ends.
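For the SQLite variant, something along these lines is what I have in mind. This is only a sketch using Python's standard `sqlite3` module; the table name `copied`, its columns and the helper names are assumptions for illustration, not anything rsync provides.

```python
import sqlite3


def open_db(db_path):
    """Open (or create) the hypothetical database of copied files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS copied ("
        " path TEXT PRIMARY KEY,"
        " size INTEGER NOT NULL,"
        " mtime_ns INTEGER NOT NULL)"
    )
    return conn


def record_copy(conn, path, size, mtime_ns):
    """Remember that this file was copied with the given size and mtime."""
    conn.execute(
        "INSERT OR REPLACE INTO copied (path, size, mtime_ns) VALUES (?, ?, ?)",
        (path, size, mtime_ns),
    )
    conn.commit()


def probably_backed_up(conn, path, size, mtime_ns):
    """True if the recorded size and mtime match the file's current values."""
    row = conn.execute(
        "SELECT size, mtime_ns FROM copied WHERE path = ?", (path,)
    ).fetchone()
    return row == (size, mtime_ns)
```

The sending side could call `probably_backed_up` with the values from a `stat` of each file and skip any file that matches, falling back to the normal comparison otherwise.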