The Samba-Bugzilla – Bug 10678
performance problem with lots of hard links?
Last modified: 2014-06-27 15:50:15 UTC
We use rsync to do backups. Each backed up file is hard linked to the corresponding file from the day before unless the file differs.
On the backup disk there are 3 directory trees, one per backed-up machine, each containing 45 daily snapshot trees. The backup disk holds about 1TB of backup content; df reports that each disk being backed up is about 90GB.
At the moment, we're copying an old backup disk to a new, larger backup disk. It's taking forever and seems to be slowing down: after 8 hours it has copied only 37%. Maybe this is normal; I don't know.
Thinking that perhaps a later version of rsync would do better, we started again, overwriting what was already copied. Eight hours later rsync hasn't started writing any new files yet. CPU time is about 1/10 of real time, so rsync seems to be I/O-bound.
It's possible that rsync is thrashing the L2 cache, but this Xeon chip has a pretty big L3 cache.
I wonder if you have ever profiled a case like this to see if there is anything that could be better in the algorithm that manages huge numbers of hard links.
Perhaps yet another option flag could tell rsync that it's working on this kind of workload, and rsync could throw away bookkeeping data for hard links farther back than yesterday's backup tree.
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 0 190742 152692 20 0 189568 3728 poll_s S+ pts/5 0:00 | \_ sudo /usr/local/rsync/3.1.1/bin/rsync -aSHAX /oldbackup/ /backup
4 0 190756 190742 20 0 937948 760856 poll_s S+ pts/5 18:45 | \_ /usr/local/rsync/3.1.1/bin/rsync -aSHAX /oldbackup/ /backup
5 0 190757 190756 20 0 1369872 1202448 sync_b D+ pts/5 19:44 | \_ /usr/local/rsync/3.1.1/bin/rsync -aSHAX /oldbackup/ /backup
1 0 190758 190757 20 0 792392 605480 poll_s S+ pts/5 0:49 | \_ /usr/local/rsync/3.1.1/bin/rsync -aSHAX /oldbackup/ /backup
Filesystem 1K-blocks Used Available Use% Mounted_on
/dev/sdb3 1,906,881,372 377,934,500 1,432,082,872 21% /backup
/dev/sdc2 1,442,142,772 986,617,308 382,268,788 73% /oldbackup
This is a common problem. The thing is that rsync has to stat() every single one of those hard links and remember an inode->filename mapping for each. If you have lots of backups this can mean hundreds of millions of calls to stat() and many megabytes of memory to hold the table.
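You can get a feel for the size of that table with GNU find, which does essentially the same walk: stat() every file and group multiply-linked paths by inode (the /backup path is just an example):

```shell
#!/bin/sh
# List every multiply-linked file under /backup as "inode nlinks path",
# sorted by inode. rsync has to keep an equivalent inode->filename
# table in memory for the whole run. -xdev: stay on one filesystem,
# since inode numbers are only unique per filesystem.
find /backup -xdev -type f -links +1 -printf '%i %n %p\n' | sort -n
```

Counting the unique inode numbers in that output tells you how many entries rsync's hard-link table needs for one tree.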
I highly recommend using lvm2 storage (I am assuming Linux here) for rsync backups. That way, when you need to move your backups to new storage, you can do it with pvmove instead of rsync. That isn't going to help you right now, but if you move to lvm2 storage now you won't have this problem the next time you need to upgrade your storage.
BTW, I have had to do this before and have never once completed it. Every time, a few days into the process, it was decided to abandon the old backups and copy only the most recent ones so that backing up could resume. pvmove, on the other hand, could do the move in a couple of hours without interfering with usage while it runs.
BTW, it is possible to move to lvm2-based storage without rsync. You could set up the lvm2 volume and then use dd or ddrescue (better status display, and it can be restarted if you specify a logfile) to copy the filesystem at the block level. Both filesystems would have to be unmounted, of course. Then the filesystem can be resized to use the extra space.
Yes, I know this means copying the unused portion of the filesystem, but that will probably still be faster than making all those stat() and link() calls.