The Samba-Bugzilla – Bug 5124
Parallelize the rsync run using multiple threads and/or connections
Last modified: 2019-02-07 16:50:19 UTC
I would love to see rsync grow up to working better. One huge lesson that could be learned would be in taking ideas from other tools. One such tool would be lftp. I would totally go bananas if rsync could scan dirs while syncing in parallel. The gains from doing this are phenomenal. Since lftp already has a very good working model on how to accomplish this feat, I suggest that it is taken a look at. if you have never experienced how much better lftp works over wget, give it a try, and you will see exactly what I am talking about. Granted, the way rsync does it's job is vastly different, and for good reasons, however with today's more modern systems, that have dual cores, hyper threads, and plain amazing speeds, I see no reason not to at least offer the option to do such a very useful thing. Having limits on the server side, and limits set from the client side is also a good thing too, which is something ftpd's and httpd's have.
Thanks to all who have helped to develop this killer tool! Let's make it get faster now!
How does the feature you want differ from the incremental recursion mode that is being added to rsync 3.0.0?
It does so in parallel, via fork()... Have not looked to see if rsync is doint the same thing, but the point is that it opens multiple sockets to do it's job.
A stab at a more meaningful summary, and some thoughts:
My first reaction to the suggestion to use multiple connections is that it's a gimmick to get a higher total bandwidth allocation from routers that allocate bandwidth per connection; IMO, that would not be an appropriate goal. But there's another more fundamental benefit, even if the total bandwidth were to remain the same: loss of a single packet won't stall the rsync run because the other connections can continue (at least for a while) without that packet.
But why stop at several streams? Rsync could use datagrams (UDP) and just act on packets as they arrive, so that loss of a packet doesn't affect /any/ of the other packets. The only drawback is that we have really good tooling for working with streams (pipes, nc, port forwarding, TLS, etc.), while the tooling for datagrams is nonexistent or less mature (there is Datagram TLS, but I've never tried it).
Rather than implement the UDP stuff ad-hoc for rsync, I would like to see it adopt an application-level scheduler that maintains a list of active tasks (scanning a directory, transferring a file, etc.) and handles the rudiments of accepting a packet and calling the appropriate routine to take the next step on that task. If the scheduler would support asynchronous I/O, rsync could use that to dramatically cut time blocked on I/O by letting the OS decide the order in which to fulfill requests based on the actual layout of the files on disk. Once rsync exposes a set of available tasks to the scheduler, it becomes trivial to vary the number of OS threads in which the tasks run. This would be awesome but is probably better pursued in a successor to rsync.
I see this is quite old, and to be honest I'm not completely familiar with rsync's implementation, but more rsync performance is of benefit to everyone so I thought I'd chip in my thoughts.
While UDP would be a good option, it's a fairly complex one to implement as you'd essentially be reinventing the wheel when it came to re-requesting packets etc., though I believe there may be new libraries out there that could help with this; many Bittorrent clients for example now use µTP (micro Transport Protocol) which is basically just UDP with some failure tolerance, though this would still require some form of SSL support for widespread adoption.
Personally I don't think the number of TCP connections is the problem though as a single connection should be capable of utilising all available bandwidth. That said, one of the problems with TCP is the self-adjusting frame-size, so to get the most out of a connection you really need to utilise it at a constant rate, otherwise the window size will go down, this means any long pauses waiting for the next chunk of the file-list can result in performance dropping until the next file starts being sent.
An alternative fix for this problem is to do something similar to Google's SPDY protocol for HTTP, which is to multiplex several TCP connections together. Basically, rsync would add its own information to packets, allowing them to be quickly routed to/from multiple threads at each end, while sending all packets over a single connection. This means you can have file-list packets mixed in with multiple packets from various different file being transferred etc.; TCP will continue to ensure they arrive in the correct order etc., and all rsync has to do is setup an appropriate number of threads for generating chunks of the file-list, performing delta comparisons, and transferring files. Basically you end up with one thread acting a message dispatch service for this single connection, taking all messages received and sending them to appropriate worker threads, and packing outgoing messages ready to send down the TCP connection; the worker threads then perform file/folder comparison for different parts of the sync operation.
Not that this latter option isn't still complex and a lot of work, but IMO it's the best way to do things (if rsync isn't already), and it allows rsync to run multiple file/folder comparisons simultaneously depending upon the hardware at each end and the current speed of the sync operation.
Actually having two or three TCP streams at the same time has proven to be faster, because it can scan ahead. It is proven that if you download one large file, while downloading several smaller ones, that the entire transfer is faster because the handshake turn-around is hidden. It has nothing to do with getting around per-connection bandwidth limiters, although in some cases it can help with that too. Another proven case is your typical modern web browser. There is a very good reason why multiple connections are used to load in those pretty pictures you see. It is all about getting around the latency by using TCP as a double buffer. What is needed is the ability to be scanning on one side while transferring a file, and if we have a match as we go with the other process, start sending it on a second stream. Again, look at how lftp does it. the concept simply works fantastic. You get multiple dir scans in parallel, and data when it is to be updated, while still scanning.
UDP? Interesting idea but, not needed. Just do > 1 scan and send process.
I would also love to see multi-stream rsync. A quick Google shows many examples of people hacking up unsatisfactory versions of multi-stream rsync with find, xargs, parallel, and for-in. Check out this ugly-but-effective hack, for example:
Or this one, written in Perl:
Or these many suggestions:
You can also try this, if you want:
$threads=24; $src=/src/; $dest=/dest/ rsync -aL -f"+ */" -f"- *" $src $dest && (cd $src && find . -type f | xargs -n1 -P$threads -I% rsync -az % $dest/% )
As you can see, lots of people want to do this, even if they're not saying so in this Bugzilla thread. I'm currently tripling my transfer speed by using xargs to launch 4 rsync process in parallel, after failing to get any improvement from playing with TCP socket options.
But trying to create a general purpose multi-stream script to do it leads to ugly, ugly hacks. All of them have at least one of these faults:
- They don't take advantage of the new, faster directory-traversal code in recent rsync versions.
- They don't work with a remote rsync server.
- Their parallelism gets much worse if one of the subdirectories ends up with much more data than the others.
- They choke on unexpected characters in filenames (e.g. space, newline).
All of these problems would go away if rsync had native multi-streaming.
I also vote for this feature. Using multiple connections, rsync can use multiples internet connections at the same time.
+1 from me on this.
We have several situations where we need to copy a large number of very small files, and I expect that having multiple file transfer threads, allowing say ~5 transfers concurrently, would speed up the process considerably. I expect that this would also make better use of the available network bandwidth as each transfer appears to have an overhead for starting and completing the transfer which makes the effective transfer rate far less than the available network bandwidth. This is the method one of our pieces of backup software uses to speed up backups and is also implemented in FileZilla for file transfers. Consider a very large file that needs to be transferred, along with a number of small files. In a single transfer mode, all other files would need to wait while the large file is transferred. If there are multiple transfers happening concurrently, the smaller files will continue transferring while the large file transfers. I have seen the benefits of this sort of implementation in other software.
I can also see benefits in having file transfers begin whilst rsync is comparing files. This could logically work if you consider rsync makes a 'list' of files to be transferred and that it begins transferring files as soon as this list begins to be populated. In situations where there are a large number of files and few of these files changed, the sync could effectively be completed by the time rsync is finished comparing files (given the few changed files may have already been transferred during the file comparison). This also is effectively implemented in FileZilla (consider copying a directory in which FileZilla has to recurse into each directory and add each file to copy into the queue).
Interestingly, I assumed this was already an option for rsync, so I went looking to find the necessary option. However, all I found were the previously mentioned hacks, which weren't what I was going for.
The issue when copying a large number of small files is disk IO / seeking. Check the wait for IO values using top / whatever when doing such a transfer.
Running multiple threads in such a situation will only cause the disk to thrash
Multiple threads makes sense on high latency links.
(In reply to Paul Slootman from comment #9)
Multiple connections also makes sense on high bandwidth links. I’ve never been able to rsync at wire speed on a 40G link using only one connection.
(In reply to Haravikk from comment #4)
SPDY has apparently evolved into QUIC. QUIC supports multiple streams, which can be created by either end. There can be a huge number of these. It seems like a sender of files could create a stream per file it wanted to send, then send to that stream as async reads completed. The reads that complete first are sent first. Complete files on fast storage might be sent as one on slow storage streamed out at a lower rate. This should also allow the receiver to consume the incoming streams at different rates, as they might do if their destination media had different write performance.