The Samba-Bugzilla – Bug 4768
problem sync big filesystem over slow connection
Last modified: 2007-07-13 22:49:42 UTC
I have a big problem with rsync.
I use it to back up a lot of data over a slow internet connection from one system to another.
If too much data has changed, so that rsync has too much to sync, the connection breaks or not all of the changed data can be copied before the timeout is reached.
So it can happen that only 10% of the data is synced. After this break, rsync must copy all data from 10% to 100%, which is not needed, because 80% of this data is already on the backup system.
I use the option --link-dest=dir to get one backup for every day of the month.
An option that remembers all files that must be copied, and copies them only after all hard links have been made, would speed up the process.
With such an option, rsync would only have to copy the changed files, and no longer all the files that were not synced the day before, even if they exist in older copies.
This option would also save space, because no file would have to exist more than once.
It would make more sense to fix the problem by using a script that runs rsync repeatedly with the same destination until rsync exits with a code other than 30 (timeout).
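A minimal sketch of such a wrapper script, assuming a POSIX shell. The function name run_until_not_timeout and the RETRY_DELAY variable are made up for this illustration, and the rsync arguments in the example are placeholders:

```shell
#!/bin/sh
# Re-run the given command for as long as it exits with code 30
# (the timeout exit code mentioned above). Any other exit code,
# success or failure, ends the loop and is returned to the caller.
# RETRY_DELAY is a hypothetical knob: seconds to wait between attempts.
run_until_not_timeout() {
    while :; do
        "$@"
        status=$?
        if [ "$status" -ne 30 ]; then
            return "$status"
        fi
        sleep "${RETRY_DELAY:-60}"
    done
}

# Example call (placeholder paths, not run here):
# run_until_not_timeout rsync -a --timeout=300 --link-dest=/backup/prev src/ /backup/today/
```

Because every attempt uses the same destination, each run only has to deal with whatever the previous one did not finish.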
One thing you can do to avoid a gap causing a problem in the hard-linking of files is to use multiple --link-dest options. If the previous night's rsync did not complete, there may not be a file in yesterday's directory for an unchanged file to hard-link with, but if it's in the prior day's directory, it will be found and hard-linked. So, order the days like this (ignore the fact that the paths should be either absolute or .. relative):
rsync -av --link-dest=july10 --link-dest=july09 --link-dest=july08 ...
This will solve the problem where rsync tries to copy too much data if you don't have time to re-run it to completion as Matt suggested.
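When there are many days to list, the --link-dest options can be generated instead of typed out. The following is only a sketch: link_dest_opts is a hypothetical helper, and it assumes the snapshot directories sit side by side under one base directory and that their names contain no whitespace.

```shell
#!/bin/sh
# Print one --link-dest option per snapshot directory that actually exists,
# in the order the names are given (put the most recent day first).
# $1 is the base directory; the remaining arguments are directory names.
link_dest_opts() {
    base=$1
    shift
    for day in "$@"; do
        if [ -d "$base/$day" ]; then
            printf '%s\n' "--link-dest=$base/$day"
        fi
    done
}

# Example call (placeholder paths, not run here; relies on word splitting,
# so it breaks if a path contains spaces):
# rsync -av $(link_dest_opts /backup july10 july09 july08) source/ /backup/july11/
```

Days whose directory is missing are skipped, so a night on which no snapshot was started at all does not put a nonexistent path on the command line.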
Also note that if you do rerun a --link-dest copy, there is the risk that an already-hard-linked file could get its attributes changed if a file has changed since the start of the first rsync run (since the destination directory is no longer empty). Because of this, it is a good idea to use the option --ignore-existing when re-running a partial --link-dest rsync copy; this tells rsync to skip the files that have already been duplicated into the destination hierarchy.
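One way to apply this in a script is to add --ignore-existing only when the destination already has something in it. This is a sketch under that assumption; rerun_opts is a hypothetical helper name, and "non-empty destination" is just a simple stand-in for "this is a re-run of a partial copy".

```shell
#!/bin/sh
# Print --ignore-existing if the destination directory exists and is not
# empty (i.e. an earlier run already put files there); print nothing for
# a first run into a missing or empty directory.
rerun_opts() {
    dest=$1
    if [ -d "$dest" ] && [ -n "$(ls -A "$dest" 2>/dev/null)" ]; then
        printf '%s\n' "--ignore-existing"
    fi
}

# Example call (placeholder paths, not run here):
# rsync -av $(rerun_opts /backup/july11) --link-dest=/backup/july10 source/ /backup/july11/
```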
(In reply to comment #2)
> Also note that if you do rerun a --link-dest copy, there is the risk that an
> already-hard-linked file could get its attributes changed if a file has changed
> since the start of the first rsync run (since the destination directory is no
> longer empty). Because of this, it is a good idea to use the option
> --ignore-existing when re-running a partial --link-dest rsync copy; this tells
> rsync to skip the files that have already been duplicated into the destination
> hierarchy.
My proposed --no-tweak-hlinked option (bug 4561) would make it possible to safely update already-copied files when re-running a --link-dest copy. :)
I can't find the option --no-tweak-hlinked in rsync.
The idea of using more than one link-dest is good, but in this special situation I need over 14 days to sync the filesystem, so I would have to set a link-dest for every one of those days.
I think I will use this together with a master dir, to keep rsync from copying files that are already here and have not changed, once the synchronisation is complete.
This helps if some big files need too much time to be copied.
This helps me, but it is not a good solution for this problem.
Once the synchronisation is complete, I have a directory for every day of the month with all files, and every new synchronisation has all the old files in the new destination directory.
But this must first be built completely once, and that takes a lot of time.
I get by by first syncing only one local dir with the remote dir, but I must do this manually and can only activate the automatic synchronisation after this is complete.
(In reply to comment #4)
> i can't find the option --no-tweak-hlinked in rsync.
That's because there is no such option in rsync. It's a proposed patch. It's also not needed if you use --ignore-existing, as I suggested.
I'm not planning to add the option you propose.
(In reply to comment #5)
> (In reply to comment #4)
> > i can't find the option --no-tweak-hlinked in rsync.
> That's because there is no such option in rsync. It's a proposed patch. It's
> also not needed if you use --ignore-existing, as I suggested.
> I'm not planning to add the option you propose.
There's a problem with --ignore-existing in this context.
If you are making multiple snapshots of a directory (as prompted this
little discussion), often they are backups of some kind, and that
means you often want the snapshots to preserve the hard-link structure
of the source directory.
I.e. you'll want to use --ignore-existing with -H.
But that doesn't work very usefully:
ln source/1 source/2
Let's make a copy:
rsync -avH --ignore-existing source/ backup/
But simulate as if it was aborted before finishing:
Now let's redo the copy, to finish. Using --ignore-existing as suggested:
rsync -avH --ignore-existing source/ backup/
ls -l backup
>>> total 0
>>> -rw-r--r-- 1 jamie jamie 0 2007-07-13 21:07 1
Oh dear! It doesn't copy a/2.
This is logical but it's not useful for this operation.
Thus --ignore-existing is not suitable for finishing previously
aborted copies when you use -H, whether it's to save space or preserve
structure. And not using -H makes those backups much less like backups.
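For anyone who wants to try this on their own rsync version, here is a self-contained sketch of the reproduction. The setup (an empty source/1) and the step that simulates the abort (deleting backup/2) are assumptions filled in for illustration, not necessarily the exact commands used above; -v is dropped to keep the output short.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is touched.
work=$(mktemp -d) || exit 1
cd "$work" || exit 1

mkdir source
: > source/1              # empty file, matching the 0-byte listing above
ln source/1 source/2      # source/1 and source/2 now share one inode

if command -v rsync >/dev/null 2>&1; then
    rsync -aH --ignore-existing source/ backup/   # first copy
    rm -f backup/2        # assumed stand-in for a run aborted before finishing
    rsync -aH --ignore-existing source/ backup/   # re-run to "finish" the copy
    ls -l backup          # check: is backup/2 present and linked to backup/1?
fi
```

If backup/2 is missing after the second run, or present but not sharing an inode with backup/1, the installed version has the problem described here.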
I use this command line to sync the backup:
rsync -l -H -r -t -D -p -v -P --bwlimit=10 -z --delete-after --force --timeout=300 --exclude-from=/cron/backup_server.not --partial --link-dest=/backup/server/master server::server/ /backup/server/13/
Normally this works fine. But I had to reinstall the server, and somebody put too much data on the server in one day, so my backup is out of sync and I must resync it.
At the moment, I restart the program every time it finishes, to get the rest of the data to my backup system.
rsync works fine for this, and the backups on local systems cause no problems. With a 100 Mbit network, the synchronisation is fast enough to copy all data in one day.
But the remote system must sync over the internet at a maximum of 10 kBytes/s with data compression, and it is very slow if a lot of data must be copied.
And this sometimes causes problems if a user thinks he must make a backup of a directory with 2 GB of files on the server.
After more than 2 weeks, 92% of the files are synchronised, and I think I need over 10 days for the rest.
I hope you change your mind and add an option in the future that makes it easy to add the next part of the data every time the program runs, without the risk of losing an existing part if the program is terminated before all data is copied.
An option that tells rsync to only make hard links and not copy any data would also help to solve this problem.
Then I could make all hard links in the first run and copy all missing data in the second run.
(In reply to comment #6)
> Oh dear! It doesn't copy a/2.
That was indeed the case in 2.6.9, but fortunately not in 3.0.0 (where the hard-link code is much improved). So, my suggestion is apparently only going to work reliably when the new version is released.
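A backup script could check the installed major version before relying on this. The sketch below assumes the first line of `rsync --version` looks like "rsync  version 2.6.9  protocol version 29" (some releases print a "v" before the number); rsync_major_version is a hypothetical helper name.

```shell
#!/bin/sh
# Extract the major version number from the first line of `rsync --version`.
# $1 is that line; prints nothing if the line does not match the assumed format.
rsync_major_version() {
    printf '%s\n' "$1" |
        sed -n 's/^rsync  *version  *v\{0,1\}\([0-9][0-9]*\)\..*/\1/p'
}

# Example use (not run here):
# major=$(rsync_major_version "$(rsync --version | head -n 1)")
# if [ "${major:-0}" -ge 3 ]; then echo "re-running with -H and --ignore-existing should be OK"; fi
```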