Bug 8615 - feature request 'update by reference'
Summary: feature request 'update by reference'
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.0.9
Hardware: All All
Importance: P5 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-11-16 16:07 UTC by itpp11
Modified: 2011-11-23 21:19 UTC
0 users

See Also:


Attachments

Description itpp11 2011-11-16 16:07:41 UTC
This is a new feature request for rsync.
I call it an 'update by reference' option.

What is it supposed to do?
Update (or create) a target file by using a file that is local to the server and similar to the target being updated.

Scenario:
On an rsync server:
/Folder1/huge_file_1

On an rsync client:
/FolderLocal/huge_file_2

Command-line example (--byref is the proposed new option):
rsync -rvt --inplace /FolderLocal/huge_file_2 root@127.0.0.1::Folder1/huge_file_2 --byref Folder1/huge_file_1

The file 'huge_file_2' will be created on the server as '/Folder1/huge_file_2', but the actual delta is computed from the difference between '/FolderLocal/huge_file_2' and '/Folder1/huge_file_1' (the reference file).
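The idea of taking the delta against a reference file can be illustrated with a small Python sketch. This is not rsync's actual algorithm (rsync uses a rolling checksum and its own wire protocol); the helper names `make_delta`, `apply_delta`, and `literal_bytes` and the block size are assumptions made up for this illustration.

```python
import hashlib


def make_delta(reference: bytes, new: bytes, block: int = 2048):
    """Describe `new` as copies of blocks from `reference` plus literal data.

    Returns a list of ("copy", offset) and ("data", bytes) operations.
    Simplified illustration only: every window is re-hashed instead of
    using a rolling checksum.
    """
    # Index each full, block-aligned chunk of the reference file by its hash.
    index = {}
    for off in range(0, len(reference) - block + 1, block):
        digest = hashlib.md5(reference[off:off + block]).digest()
        index.setdefault(digest, off)

    ops = []
    literal = bytearray()
    i = 0
    while i < len(new):
        window = new[i:i + block]
        off = None
        if len(window) == block:
            off = index.get(hashlib.md5(window).digest())
            # Guard against hash collisions by comparing the actual bytes.
            if off is not None and reference[off:off + block] != window:
                off = None
        if off is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal.clear()
            ops.append(("copy", off))
            i += block
        else:
            # No matching block here: this byte has to be transferred.
            literal.append(new[i])
            i += 1
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def apply_delta(reference: bytes, ops, block: int = 2048) -> bytes:
    """Rebuild the new file on the side that holds the reference file."""
    out = bytearray()
    for kind, arg in ops:
        if kind == "copy":
            out += reference[arg:arg + block]
        else:
            out += arg
    return bytes(out)


def literal_bytes(ops) -> int:
    """Number of bytes that actually have to cross the network."""
    return sum(len(arg) for kind, arg in ops if kind == "data")
```

Only the "data" operations carry file content, so the more blocks the new file shares with the reference file, the less has to be sent.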

Rationale:
A remote site holds large backup files that are created far away each day. These backup files do not differ a lot from day to day, but they have to be made because of a backup policy.

Each day's backup is a new file that the remote rsync site knows nothing about, so the entire backup file has to be sent across. This wastes a lot of bandwidth and transfer time.

However, the remote rsync site does know about previous backup files, which may contain many similar data blocks.

If we could tell the remote rsync site (the server) to 'create' the new backup file but compute the delta against another file, we might save a lot of bandwidth and time.

Personally, I have been doing this manually with DVD images: I copy the DVD image most similar to the one being rsynced to the new target name and then 'delta overwrite' it with the real image file. It is not perfect, but it saves me about 35% bandwidth and 45% time. Those are just 4-5 GB files; backup files run into the hundreds of GB, so a 'quick copy' like in the DVD example is not an option.
Comment 1 itpp11 2011-11-18 11:35:57 UTC
While trying to find a workaround via ssh, like:
plink -ssh -v -C -L 875:localhost:873 -l root %NASDEST% -pw %rootPW% -m sshcmds.txt
where sshcmds.txt contains cp (copy) commands that prepare future copies of backup files for a delta overwrite on the remote (server) side, I stumbled on rsync's --fuzzy option, which does what I was looking for!

For example:
rsync -vtrz --inplace --delete-after --fuzzy --copy-dest="/88_John_Cleese2" "/88_John_Cleese" "test@127.0.0.1::test/test/"

This copies a new file to its destination, and when a similar file is found in the destination directory, rsync uses that existing file as the delta base. The --copy-dest option also tells the rsync server to look there for possible matches to the new file.

One thing that isn't documented: is it possible to use
--copy-dest="/*"
so that the entire destination is searched for a match?
And how do you provide multiple destinations?
E.g., is this allowed:
--copy-dest="/88_John_Cleese2";"/88_John_Cleese4"
Or do we need to repeat the option?
--copy-dest="/88_John_Cleese2" --copy-dest="/88_John_Cleese4"
Comment 2 Wayne Davison 2011-11-23 20:40:07 UTC
Yeah, --fuzzy helps out in this situation. The current option only scans the destination directory for similar/moved files that have a different name. I have committed an enhancement to the 3.1.0dev git that lets you repeat the --fuzzy option (e.g. -yy) to ask rsync to also look through any matching alt-dest dirs that were specified. Hopefully that will meet your needs?
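A sketch of how such an invocation could be assembled, using the paths from comment 1. The helper name `build_rsync_args` and its parameters are hypothetical. It assumes an rsync new enough to accept a repeated --fuzzy (3.1.0 or later, per the comment above) and relies on rsync accepting several alternate directories when --copy-dest is repeated, as its man page describes.

```python
def build_rsync_args(source, destination, alt_dirs, fuzzy_level=2):
    """Build an rsync argument list with repeated --fuzzy and --copy-dest.

    Hypothetical helper: it only assembles the arguments, it does not run rsync.
    """
    args = ["rsync", "-vtrz", "--inplace", "--delete-after"]
    # Two --fuzzy options are equivalent to -yy: also search the alt-dest dirs.
    args += ["--fuzzy"] * fuzzy_level
    # One --copy-dest option per alternate directory.
    args += [f"--copy-dest={d}" for d in alt_dirs]
    args += [source, destination]
    return args


args = build_rsync_args(
    "/88_John_Cleese",
    "test@127.0.0.1::test/test/",
    ["/88_John_Cleese2", "/88_John_Cleese4"],
)
```

The resulting list can be handed to `subprocess.run(args, check=True)` on a machine where rsync is installed.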
Comment 3 itpp11 2011-11-23 21:19:41 UTC
Thanks! Yes, it would help if fuzzy had a wider search area. For example, ntbackup files are single files, whereas wbadmin backup files are stored inside their own folder structure, which changes with each new backup. If possible, allow "../" to narrow the fuzzy search to the tree in which the backups are stored, e.g.:

ntbackup:
\backupFULL\server-021\20111120
\backupFULL\server-021\20111130

While you are in 20111130, a fuzzy search including ../ would cover all known FULL backup files, but only those of server-021, without having to know the folder names.

wbadmin:
\backupFULL\server-041\20111114\windowsimagebackup\server-041\backup 2011-11-14 013016

The principle is the same: the next backup's rsync position would be "\backupFULL\server-041\201111xx", so a fuzzy search including ../ would eventually search upwards into the tree of the last FULL backups and find a similar VHD backup file to compute the delta against.

Hopefully you can think of some kind of dynamic solution for this.
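
To make the "../" request concrete, here is a rough Python sketch of the kind of search meant. It is purely hypothetical and does not reflect how rsync's fuzzy matching is implemented; the function name `fuzzy_candidates`, the `levels_up` parameter, and the scoring by file-name similarity alone are all assumptions.

```python
import difflib
from pathlib import Path


def fuzzy_candidates(target: Path, levels_up: int = 1, limit: int = 3):
    """List existing files that might serve as a delta base for `target`.

    Hypothetical sketch: climb `levels_up` directories above the target's
    folder, then rank every file below that point by name similarity.
    """
    root = target.parent
    for _ in range(levels_up):
        root = root.parent

    scored = []
    for path in root.rglob("*"):
        if not path.is_file() or path == target:
            continue
        score = difflib.SequenceMatcher(None, path.name, target.name).ratio()
        scored.append((score, str(path), path))

    # Best name match first; ties are broken by path for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, _, path in scored[:limit]]
```

With the target in \backupFULL\server-021\20111130 and `levels_up=1`, the search starts at \backupFULL\server-021, so backups of other servers are never considered.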