Bug 9812 - Lookahead file-list loading and comparison
Summary: Lookahead file-list loading and comparison
Status: NEW
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.0
Hardware: All All
Importance: P5 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-04-18 17:25 UTC by Haravikk
Modified: 2013-04-22 11:34 UTC
Description Haravikk 2013-04-18 17:25:35 UTC
I've been using rsync for various things for some time now, but only recently have I properly begun using it with a remote server. In my case that means creating redundant copies of very large backup structures (almost a million files, ~3 TB in total), which of course is trying for most software to manage.


However, the main problem I've noticed with rsync is that it takes a *very* long time to detect changes that can start being synced to the server, even with incremental file lists. Presumably this is a result of having to build a list of the current X files, send it to the other server and then await a response.

I think the best way to resolve this is to provide more look-ahead on the file list exchanges. Basically what would happen is that once the client has sent the parameters to the receiver, both will start loading all matching files in order to get timestamps/checksums ready for comparison. As soon as the first file-list segment is ready the client will send it. Hopefully by the time it does the server already has a full set of file-data in the same basic order to compare against, allowing it to rapidly detect changed, deleted or new files.
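To illustrate the comparison step, here is a minimal sketch in Python. It is not rsync's actual code: the `Entry` record layout and the `compare_sorted` name are assumptions made for the example. It walks two path-sorted listings in lockstep, so each side only needs the next entry of the other to classify a file as new, deleted or changed.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical record layout: (relative path, mtime, size).
Entry = Tuple[str, float, int]


def compare_sorted(sender: Iterable[Entry],
                   receiver: Iterable[Entry]) -> Iterator[Tuple[str, str]]:
    """Merge-walk two path-sorted listings, yielding (status, path)."""
    s_iter, r_iter = iter(sender), iter(receiver)
    s: Optional[Entry] = next(s_iter, None)
    r: Optional[Entry] = next(r_iter, None)
    while s is not None or r is not None:
        if r is None or (s is not None and s[0] < r[0]):
            # Path exists only on the sending side.
            yield ("new", s[0])
            s = next(s_iter, None)
        elif s is None or r[0] < s[0]:
            # Path exists only on the receiving side.
            yield ("deleted", r[0])
            r = next(r_iter, None)
        else:
            # Same path on both sides: compare the remaining metadata.
            if s[1:] != r[1:]:
                yield ("changed", s[0])
            s, r = next(s_iter, None), next(r_iter, None)
```

Because both inputs are consumed as iterators, the comparison can start as soon as the first segment of each listing is available rather than waiting for the full list.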

This process can also be optimised: if the file data for an entire directory is loaded before the next segment/comparison is required, it can be condensed into a single timestamp/checksum for the directory. The client can then send any available directory times/checksums to the receiver for rapid comparison; if the receiver's directory doesn't match, it requests the file data from the client, which should still have it cached.
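A rough sketch of what condensing a directory could look like, in Python. The `directory_digest` name, the choice of SHA-256 and the `(name, mtime_ns, size)` fields are all assumptions for illustration, not anything rsync does today.

```python
import hashlib
from typing import Iterable, Tuple


def directory_digest(entries: Iterable[Tuple[str, int, int]]) -> str:
    """Condense (name, mtime_ns, size) records for one directory into a digest.

    Entries are sorted first so both ends get the same digest regardless of
    the order in which the filesystem returned them.
    """
    h = hashlib.sha256()
    for name, mtime_ns, size in sorted(entries):
        # NUL separators keep adjacent fields from running together.
        h.update(f"{name}\0{mtime_ns}\0{size}\n".encode("utf-8", "surrogateescape"))
    return h.hexdigest()
```

If the two ends compute equal digests for a directory, the per-file records for it never need to cross the wire; only a mismatch triggers the request for the full file data.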

The whole mechanism would operate within a reasonably sized buffer, to conserve memory while still holding enough file data at each end for quick sending/comparison as required.
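The buffering idea could be sketched like this in Python, assuming a background thread does the scanning. The `prefetch` helper and its `max_items` parameter are hypothetical names for the example.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the source


def prefetch(source: Iterable[T], max_items: int) -> Iterator[T]:
    """Read ahead from `source` in a background thread, holding at most
    `max_items` entries in memory at a time."""
    buf: queue.Queue = queue.Queue(maxsize=max_items)

    def worker() -> None:
        try:
            for item in source:
                buf.put(item)  # blocks while the buffer is full
        finally:
            buf.put(_DONE)  # always signal the end so the consumer can stop

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item
```

Wrapping a file-list scan in something like `prefetch(scan, 10000)` would let the scan run ahead of the network exchange without the buffered metadata growing past the chosen limit.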


Basically the idea is to get each end of the connection doing as much work as it can without actually having to communicate with each other, so that when communication does occur it is as optimised as possible.
Comment 1 Paul Slootman 2013-04-22 11:19:23 UTC
> Basically what would happen is that once the client has sent
> the parameters to the receiver, both will start loading all matching files in
> order to get timestamps/checksums ready for comparison. As soon as the first
> file-list segment is ready the client will send it. Hopefully by the time it
> does the server already has a full set of file-data in the same basic order to
> compare against, allowing it to rapidly detect changed, deleted or new files.

Didn't you just basically describe the incremental file transfer already implemented by rsync?
Comment 2 Haravikk 2013-04-22 11:34:40 UTC
If that's the case, then it doesn't seem to be very effective. What about ahead-of-time comparisons? For example, if the first file in the transfer is huge, then rsync should be looking at all the files after it in order to find out what needs to be transferred next, possibly generating checksums in advance.

Basically, if the sender is currently sending a file, then does the receiver continue sending file data in return for the sender to process in advance?