The Samba-Bugzilla – Bug 5324
Reduce the performance penalty of --xattrs on Mac OS X
Last modified: 2016-08-14 21:48:53 UTC
System on both servers:
Mac OS X Server 10.4.11
When I use the -X (--xattrs) option to synchronize about 400000 files between two servers, the run takes four times longer than without it (2 hours vs. half an hour).
rsync -aAX --del --force /source/ server2:/dest/ (400000 files -> 2hours)
rsync -aA --del --force /source/ server2:/dest/ (400000 files -> half an hour)
Asking rsync to do more (preserve the xattrs) will inevitably make it take longer, but a 4x slowdown does seem excessive.
Do your files have a lot of differing xattrs? One thing I never much liked about the xattrs patch (from the very beginning) is that the code attempts to do a very simplistic linear search through all the prior xattrs looking for a matching set of attributes (to share matching attributes between files). If your files have a lot of xattr entries, that search will eat up more and more time as the list of unique attributes grows.
One solution to this might be to create a hash of all the names and xattr data, and then store the xattrs in a hash lookup. That should speed things up quite a bit when the list of unique xattr values grows large.
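As a rough illustration of the hash-lookup idea above, here is a minimal Python sketch. It is not rsync's code: the `XattrStore` class, its `intern` method, and the use of MD5 as the key are all assumptions made for the example. Each file's xattrs are modeled as a dict of name to bytes value, the whole set is reduced to one digest, and the digest is used as a dictionary key so that finding a matching set no longer requires walking every prior entry.

```python
import hashlib


class XattrStore:
    """Shares identical xattr sets between files via a hash lookup."""

    def __init__(self):
        self._by_digest = {}  # digest -> indices of sets with that digest
        self._sets = []       # index -> stored xattr set

    @staticmethod
    def _digest(xattrs):
        h = hashlib.md5()
        for name, value in sorted(xattrs.items()):
            # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
            for field in (name.encode(), value):
                h.update(len(field).to_bytes(4, "big"))
                h.update(field)
        return h.digest()

    def intern(self, xattrs):
        """Return the index of a stored set equal to xattrs, adding it if new."""
        key = self._digest(xattrs)
        # Only sets with the same digest are compared in full.
        for idx in self._by_digest.get(key, []):
            if self._sets[idx] == xattrs:
                return idx
        self._sets.append(dict(xattrs))
        idx = len(self._sets) - 1
        self._by_digest.setdefault(key, []).append(idx)
        return idx


store = XattrStore()
first = store.intern({"com.apple.FinderInfo": b"\x00" * 32})
other = store.intern({"user.tag": b"x"})
again = store.intern({"com.apple.FinderInfo": b"\x00" * 32})
```

In this sketch `first` and `again` refer to the same stored entry, so two files with identical xattrs share one copy, and the cost of each lookup no longer grows with the number of unique sets.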
Note: if you are using the osx-create-time.diff, please switch to the crtime.diff instead. The osx-create-time.diff patch is known to be slow, quite possibly due to the bloating of unique xattr values in the list.
> One solution to this might be to create a hash of all the names and xattr data,
> and then store the xattrs in a hash lookup. That should speed things up quite
> a bit when the list of unique xattr values grows large.
But how would I do this hash?
Thanks a lot.
(I am now using rsync 3.0.2.)
Mac OS X makes extensive use of xattrs, and I've seen hundreds of thousands of unique xattrs on several end-user systems. In those cases rsync eventually runs out of memory and bails.
rsync should probably stop calling find_matching_xattr when the list reaches a specified size. And at least for Mac OS X, xattrs shouldn't be cached in rsync_xal_l, they should probably be compared on the fly by the generator somewhere near generate_and_send_sums.
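The size cap suggested above could look roughly like the following Python sketch. It only models the idea: `find_or_add` and `MAX_CACHED_XATTR_SETS` are hypothetical names, not rsync identifiers, and the list stands in for rsync_xal_l. Once the list is full the search is skipped entirely, which bounds both memory and search time at the cost of no longer sharing xattrs for later files.

```python
MAX_CACHED_XATTR_SETS = 1000  # hypothetical limit, not an rsync constant


def find_or_add(cache, xattrs, limit=MAX_CACHED_XATTR_SETS):
    """Linear search over cache that gives up once cache holds limit entries.

    Returns the index of a shared entry, or None when the cache is full and
    the caller has to handle these xattrs without sharing.
    """
    if len(cache) >= limit:
        return None  # list is full: skip the search and do not grow it
    for idx, known in enumerate(cache):
        if known == xattrs:
            return idx
    cache.append(xattrs)
    return len(cache) - 1


cache = []
a = find_or_add(cache, {"user.a": b"1"}, limit=2)
b = find_or_add(cache, {"user.b": b"2"}, limit=2)
c = find_or_add(cache, {"user.a": b"1"}, limit=2)
```

Note that `c` is None here even though a matching entry exists, because the sketch follows the suggestion literally and stops searching once the limit is reached.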
Also, to address Fabrice's concern, performance will ultimately be directly linked to the lack of a quick heuristic for determining whether xattrs have been modified. xattrs don't have modification dates, so you can use either the size or a checksum to determine if they've changed. In my analysis, rsync spent 20% of its time in md5_process on a task involving many xattrs with lots of data. Unless we come up with something faster than md5 (potentially at the cost of reliability), this is just a performance hit we'll have to live with.
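To make the size-or-checksum comparison concrete, here is a small Python sketch. The function name `xattr_changed` and its parameters are assumptions for illustration and do not correspond to rsync's internals. It checks the cheap signal (size) first and only computes an MD5 digest when the sizes match, which is where the md5_process cost described above would be paid.

```python
import hashlib


def xattr_changed(src_value: bytes, dst_size: int, dst_md5: bytes) -> bool:
    """Decide whether an xattr value differs from the receiver's copy.

    dst_size and dst_md5 describe the value already present on the target.
    """
    if len(src_value) != dst_size:
        return True  # sizes differ, no need to hash anything
    # Same size: fall back to the (more expensive) checksum comparison.
    return hashlib.md5(src_value).digest() != dst_md5


value = b"hello"
digest = hashlib.md5(value).digest()
unchanged = xattr_changed(value, len(value), digest)
same_size_edit = xattr_changed(b"hellp", len(value), digest)
grown = xattr_changed(b"hello!", len(value), digest)
```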
I'm going to take a stab at this, but I'm curious whether you've put any more thought into how xattr support is implemented since your last comment here.
How about adding an option like "--use-ctime-before-xattr-compares", which only reads and compares EAs for files where the ctime on the source side is newer than the ctime on the target side? EA modifications usually update the ctime. This would be a way to speed up syncing with EAs quite a lot, I think.
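The proposed ctime check might be sketched as follows in Python. The option does not exist in rsync, and both function names here are hypothetical. The sketch also assumes the two ctimes are comparable, that is, that the clocks on the source and target hosts are reasonably in sync.

```python
import os


def needs_xattr_compare(src_ctime_ns, dst_ctime_ns):
    """Return True when the EAs should be read and compared.

    dst_ctime_ns is None when the target file does not exist yet.
    """
    if dst_ctime_ns is None:
        return True
    # Changing an EA normally bumps the ctime, so an older source ctime
    # suggests the EAs were not touched since the target was written.
    return src_ctime_ns > dst_ctime_ns


def needs_xattr_compare_for_paths(src_path, dst_path):
    """Same decision, reading the ctimes with lstat (symlinks not followed)."""
    src_ctime_ns = os.lstat(src_path).st_ctime_ns
    try:
        dst_ctime_ns = os.lstat(dst_path).st_ctime_ns
    except FileNotFoundError:
        dst_ctime_ns = None
    return needs_xattr_compare(src_ctime_ns, dst_ctime_ns)
```

Files that fail the check would skip the EA read entirely, which is where the time saving would come from.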
Created attachment 12285 [details]
A first work in progress patch to add a hashtable
I need to clean this up and do more tests.
But with 1000 unique xattrs I got 50% fewer CPU instructions in callgrind.
Created attachment 12286 [details]
valgrind clean work in progress patch
Created attachment 12287 [details]
Possible patch for master
I hope this is an acceptable patchset to fix the problem for rsync master.
The patchset looks very nice. Thanks!
I've made some very minor tweaks and committed it.