Parallelizing RSYNC Processes August 18, 2007

Rsync: receiving file list…

If you’ve ever used RSYNC, you are probably a dedicated fan. It’s a fast, stable, mature, easy-to-use black-box tool for backing up or mirroring directories exactly between machines over the network. The attraction of RSYNC is that it transfers only what it has to in order to synchronize file sets. It does lots more, but backup and mirroring seem to be among its most popular applications.

However, synchronizing huge file systems across networks can cause some issues. It will work every time, but the initial “receiving file list…” operation can sometimes take an entire day to complete if you have a file system with 3.2 million small files on either side, like one that I work with.

The RSYNC daemon/client negotiates a list of files to “deal with” when you first start the operation. This initial negotiation runs at a constant pace, in my case around 10,000 files/minute. Not at all bad, but for 3,200,000 files this equates to about 5.3 hours (320 minutes). If you use the --delete option, this is doubled, since for some reason it does the same operation on your local receiving side after the remote file list is gathered.

Parallelizing RSYNC Processes

The good news is that RSYNC is very good at running parallel instances. With a little planning, you can reduce the amount of time it takes for RSYNC to gather its file list. The trick is to have multiple processes each get a section of the directory tree that you are trying to back up. These processes all run at the same time, with little impact on each other. For example, a filesystem with 4,000,000 files would take around 6.7 hours to list at 10,000 files/minute. However, if you go through the contents of your directory tree and pick out some criteria to split that into 2 x 2,000,000 file operations, and run a separate RSYNC process for each, the time until the RSYNC processes actually start transferring files is cut in half, to around 3.3 hours.

This does increase the amount of CPU and I/O that both your sending and receiving side use, but I’ve been able to run ~25 parallel instances without remotely degrading the rest of the system or slowing down the other RSYNC instances.
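The timing arithmetic in this section can be sketched as a couple of small shell functions. This is only a rough estimate under two assumptions: the listing rate per process stays constant (10,000 files/minute was my figure, yours will differ), and the files are split evenly across the processes. The function names are made up for this sketch.

```shell
#!/bin/bash
# Rough estimate of how long the "receiving file list" phase takes.
# Assumes a constant per-process rate and an even split of files.

# Usage: estimate_minutes TOTAL_FILES FILES_PER_MINUTE [PARALLEL_PROCESSES]
estimate_minutes() {
    local files=$1 rate=$2 procs=${3:-1}
    # Files handled by the busiest process (ceiling division)
    local per_proc=$(( (files + procs - 1) / procs ))
    # Minutes needed for that process, rounded up
    echo $(( (per_proc + rate - 1) / rate ))
}

# Usage: minutes_to_hours MINUTES   (prints hours with one decimal place)
minutes_to_hours() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 60 }'
}
```

For example, `minutes_to_hours "$(estimate_minutes 4000000 10000 2)"` gives the 3.3 hours quoted above. If you use --delete, double the result to account for the second pass on the receiving side.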

The key is to use the --include and --exclude command line switches to create selection criteria.

Example

drwxr-xr-x   2 root     root         179 Jul 19 16:22 directory_a
drwxr-xr-x   2 root     root         179 Aug 12 00:08 directory_b

If directory_a has 2,000,000 files underneath it, and directory_b also has 2,000,000 files, use the following idea to split them up. The --exclude="/*" option, placed after the --include, says in essence to “exclude everything that is not explicitly included”.

#!/bin/bash
rsync -av --include="/directory_a*" --exclude="/*" --progress remote::/ /localdir/ >  /tmp/myoutputa.log &
rsync -av --include="/directory_b*" --exclude="/*" --progress remote::/ /localdir/ >  /tmp/myoutputb.log &

The following will take about twice as long to gather the file list as the above:

#!/bin/bash
rsync -av --progress remote::/ /localdir/ > /tmp/myoutput.log &
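With more than a couple of directories, the split version can be written as a loop. The following is a minimal sketch, not a tested production script. It reuses the remote::/ and /localdir/ placeholders from the examples here; the RSYNC and LOGDIR variables and the function name are hypothetical additions for this sketch. It also uses the pattern "/NAME/" instead of "/NAME*", so that two directories sharing a prefix (say, data and data2) are not both picked up by the same process.

```shell
#!/bin/bash
# Launch one rsync per top-level directory, then wait for all of them.
# SRC and DEST are the placeholders from the post; RSYNC and LOGDIR are
# hypothetical knobs added for this sketch.
SRC=${SRC:-remote::/}
DEST=${DEST:-/localdir/}
LOGDIR=${LOGDIR:-/tmp}
RSYNC=${RSYNC:-rsync}   # set RSYNC=echo to preview the arguments only

# Usage: parallel_rsync DIR [DIR...]
parallel_rsync() {
    local dir
    for dir in "$@"; do
        # "/$dir/" matches only that directory at the root of the transfer;
        # "/*" then excludes every other top-level entry.
        "$RSYNC" -av --include="/$dir/" --exclude="/*" --progress \
            "$SRC" "$DEST" > "$LOGDIR/rsync_$dir.log" 2>&1 &
    done
    wait    # block until every background process has exited
}
```

Calling `parallel_rsync directory_a directory_b` starts both transfers at once and returns when both are done, with one log file per directory. Running it first with `RSYNC=echo` writes the arguments into the log files instead of starting any transfer, which is a cheap way to check the patterns.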

This is sure to save you a lot of time, especially if you have services quiesced and down while you are doing your rsync. You just need to be careful not to leave anything out when you are explicitly including files, as this is rather easy to do.
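One way to guard against leaving something out is to compare the full list of top-level entries with the list of names you actually included. The sketch below assumes you have both lists as plain text files, one name per line; how you produce them (for example with ls on the source host) is up to you, and the function name is made up for this example.

```shell
#!/bin/bash
# Print every entry that appears in ALL_FILE but not in INCLUDED_FILE.
# Both files hold one top-level name per line; order does not matter.

# Usage: missing_entries ALL_FILE INCLUDED_FILE
missing_entries() {
    local all_sorted inc_sorted
    all_sorted=$(mktemp)
    inc_sorted=$(mktemp)
    # comm needs sorted input; use the C locale so both sorts agree
    LC_ALL=C sort -u "$1" > "$all_sorted"
    LC_ALL=C sort -u "$2" > "$inc_sorted"
    # -23 suppresses lines unique to the second file and lines in both,
    # leaving only the names that no rsync process would cover
    LC_ALL=C comm -23 "$all_sorted" "$inc_sorted"
    rm -f "$all_sorted" "$inc_sorted"
}
```

If the function prints nothing, every top-level entry is covered by some include.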

At some point, I’d like to see the RSYNC daemon have an option to automatically split up the workload into manageable chunks at the expense of your CPU and I/O. However, I don’t know exactly how this would be accomplished, since RSYNC would somehow need to know what the directories looked like before it started its process.

One option I could think of would be to add a feature to RSYNC to gather, save and use a “statistics file”. It could generate this statistics file on a remote directory tree, saving general information like how many files are in each part of the tree. Then it could read this file to decide how to split up the file gathering tasks to minimize the total time. This file could then be reused during subsequent operations. Of course, this would only work if data was not constantly moving around between those directories and they stayed relatively proportional with respect to each other over time. This could also be dealt with entirely outside of the RSYNC process, perhaps in a script that would gather statistics and fork RSYNC processes based on what it finds.
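The script-based variant of that idea might look something like the sketch below. It is not an RSYNC feature, just one possible approach: count the files under each top-level directory, then hand the directories out largest-first to whichever group currently has the fewest files. The function names are made up, hidden directories are skipped by the glob, and directory names are assumed to contain no whitespace. The counts have to be gathered on whichever machine holds the tree.

```shell
#!/bin/bash
# Gather per-directory file counts and split them into balanced groups.

# Usage: count_files ROOT   (prints "COUNT NAME" per top-level directory)
count_files() {
    local root=$1 d
    for d in "$root"/*/; do
        [ -d "$d" ] || continue     # glob matched nothing
        d=${d%/}                    # strip the trailing slash
        printf '%s %s\n' "$(find "$d" -type f | wc -l | tr -d ' ')" "${d##*/}"
    done
}

# Usage: split_groups N < counts   (reads "COUNT NAME" lines on stdin)
split_groups() {
    LC_ALL=C sort -rn | awk -v n="$1" '
        {
            # Pick the group with the smallest running total so far
            best = 1
            for (g = 2; g <= n; g++)
                if (total[g] + 0 < total[best] + 0) best = g
            total[best] += $1
            members[best] = members[best] " " $2
        }
        END {
            for (g = 1; g <= n; g++)
                printf "group %d (%d files):%s\n", g, total[g], members[g]
        }'
}
```

Something like `count_files /localdir | split_groups 4` would then print four groups of directory names, and each group could be given to its own RSYNC process. Saving the output of count_files to a file gives you the reusable “statistics file”.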

Well, until then, the above method works great for me.
