rsync - Synchronize content of two buckets/directories
Synopsis
gsutil rsync [-c] [-C] [-d] [-e] [-n] [-p] [-r] [-R] src_url dst_url
Description
The gsutil rsync command makes the contents under dst_url the same as the contents under src_url, by copying any missing files/objects, and (if the -d option is specified) deleting any extra files/objects. For example, to make gs://mybucket/data match the contents of the local directory “data” you could do:
gsutil rsync -d data gs://mybucket/data
To recurse into directories use the -r option:
gsutil rsync -d -r data gs://mybucket/data
To copy only new/changed files without deleting extra files from gs://mybucket/data leave off the -d option:
gsutil rsync -r data gs://mybucket/data
If you have a large number of objects to synchronize you might want to use the gsutil -m option, to perform parallel (multi-threaded/multi-processing) synchronization:
gsutil -m rsync -d -r data gs://mybucket/data
The -m option typically will provide a large performance boost if either the source or destination (or both) is a cloud URL. If both source and destination are file URLs the -m option will typically thrash the disk and slow synchronization down.
To make the local directory “data” the same as the contents of gs://mybucket/data:
gsutil rsync -d -r gs://mybucket/data data
To make the contents of gs://mybucket2 the same as gs://mybucket1:
gsutil rsync -d -r gs://mybucket1 gs://mybucket2
You can also mirror data across local directories:
gsutil rsync -d -r dir1 dir2
To mirror your content across clouds:
gsutil rsync -d -r gs://my-gs-bucket s3://my-s3-bucket
Note: If you are synchronizing a large amount of data between clouds you might consider setting up a Google Compute Engine account and running gsutil there. Since cross-provider gsutil data transfers flow through the machine where gsutil is running, doing this can make your transfer run significantly faster than running gsutil on your local workstation.
Checksum Validation And Failure Handling
At the end of every upload or download, the gsutil rsync command validates that the checksum of the source file/object matches the checksum of the destination file/object. If the checksums do not match, gsutil will delete the invalid copy and print a warning message. This very rarely happens, but if it does, please contact gs-team@google.com.
The rsync command will retry when failures occur, but if enough failures happen during a particular copy or delete operation the command will skip that object and move on. At the end of the synchronization run if any failures were not successfully retried, the rsync command will report the count of failures, and exit with non-zero status. At this point you can run the rsync command again, and it will attempt any remaining needed copy and/or delete operations.
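The retry-then-skip behavior described above can be sketched as follows. This is an illustrative model, not gsutil's actual implementation; the `MAX_RETRIES` value, the `run_sync` function, and the operation list are all hypothetical names introduced for this sketch.

```python
# Illustrative sketch: retry each copy/delete operation a few times, skip
# it after repeated failures, and report the count of operations that were
# never completed (a non-zero count maps to a non-zero exit status).
MAX_RETRIES = 3  # hypothetical value; gsutil's real retry policy differs

def run_sync(operations):
    """Run each (name, func) operation, retrying on exception.

    Returns the number of operations that never succeeded.
    """
    failures = 0
    for name, func in operations:
        for attempt in range(MAX_RETRIES):
            try:
                func()
                break  # this operation succeeded
            except Exception:
                pass  # transient failure; retry
        else:
            failures += 1  # all retries exhausted; skip this object
    return failures

attempts = {"count": 0}
def flaky():
    # Succeeds on the third attempt, modeling a transient network error.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("transient")

def always_fails():
    # Models a permanent error, e.g. no write permission on the destination.
    raise IOError("no write permission")

failed = run_sync([("obj1", flaky), ("obj2", always_fails)])
print(failed)  # 1: only the permanently failing operation remains
```

Re-running the synchronization would retry only the remaining failed operation, mirroring how a second `gsutil rsync` invocation attempts any outstanding copies or deletes.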
Note that there are cases where retrying will never succeed, such as if you don’t have write permission to the destination bucket or if the destination path for some objects is longer than the maximum allowed length.
For more details about gsutil’s retry handling, please see gsutil help retries .
Change Detection Algorithm
To determine if a file or object has changed gsutil rsync first checks whether the source and destination sizes match. If they match, it next checks if their checksums match, using whatever checksums are available (see below). Unlike the Unix rsync command, gsutil rsync does not use timestamps to determine if the file/object changed, because the GCS API does not permit the caller to set an object’s timestamp (hence, timestamps of identical files/objects cannot be made to match).
Checksums will not be available in two cases:
- When synchronizing to or from a file system. By default, gsutil does not checksum files, because of the slowdown caused when working with large files. You can cause gsutil to checksum files by using the gsutil rsync -c option, at the cost of increased local disk I/O and run time when working with large files.
- When comparing composite GCS objects with objects at a cloud provider that does not support CRC32C (which is the only checksum available for composite objects). See gsutil help compose for details about composite objects.
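The size-then-checksum rule above can be sketched in Python. This is an illustrative model, not gsutil's code; `Item` and `needs_copy` are hypothetical names, and a `None` checksum models the two cases above where no checksum is available, in which case (with matching sizes) the objects are treated as unchanged.

```python
# Illustrative sketch of the change-detection rule: compare sizes first,
# then compare checksums only when both sides have one available.
from typing import NamedTuple, Optional

class Item(NamedTuple):
    size: int
    checksum: Optional[str]  # None models "no checksum available"

def needs_copy(src: Item, dst: Item) -> bool:
    if src.size != dst.size:
        return True   # sizes differ: the file/object has changed
    if src.checksum is None or dst.checksum is None:
        return False  # no checksum to compare; sizes match, assume unchanged
    return src.checksum != dst.checksum  # sizes match: decide by checksum

print(needs_copy(Item(10, "abc"), Item(10, "abc")))  # False
print(needs_copy(Item(10, "abc"), Item(10, "xyz")))  # True
print(needs_copy(Item(10, None), Item(10, "abc")))   # False: size-only comparison
```

Note how timestamps play no role anywhere in the comparison, consistent with the explanation above.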
Copying In The Cloud And Metadata Preservation
If both the source and destination URL are cloud URLs from the same provider, gsutil copies data “in the cloud” (i.e., without downloading to and uploading from the machine where you run gsutil). In addition to the performance and cost advantages of doing this, copying in the cloud preserves metadata (like Content-Type and Cache-Control). In contrast, when you download data from the cloud it ends up in a file, which has no associated metadata. Thus, unless you have some way to hold on to or re-create that metadata, synchronizing a bucket to a directory in the local file system will not retain the metadata.
Note that by default, the gsutil rsync command does not copy the ACLs of objects being synchronized and instead will use the default bucket ACL (see gsutil help defacl ). You can override this behavior with the -p option (see OPTIONS below).
Slow Checksums
If you find that CRC32C checksum computation runs slowly, this is likely because you don’t have a compiled CRC32C implementation installed on your system. Try running:
gsutil ver -l
If the output contains:
compiled crcmod: False
you are running a Python library for computing CRC32C, which is much slower than using the compiled code. For information on getting a compiled CRC32C implementation, see gsutil help crc32c .
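To illustrate what the non-compiled path looks like, here is a minimal pure-Python CRC32C (Castagnoli) implementation of the same general kind a Python fallback library must use. Byte-at-a-time Python loops like this are what make the non-compiled path much slower than native code; this sketch is not gsutil's or crcmod's actual code.

```python
def crc32c(data: bytes) -> int:
    """Table-driven CRC32C (Castagnoli), reflected, in pure Python.

    Interpreted per-byte loops like the one below are far slower than a
    compiled extension, which is why "compiled crcmod: False" in the
    `gsutil ver -l` output matters for large transfers.
    """
    poly = 0x82F63B78  # reflected Castagnoli polynomial
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ poly if c & 1 else c >> 1
        table.append(c)
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ table[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

# Standard CRC-32C check value for the test vector "123456789":
print(hex(crc32c(b"123456789")))  # 0xe3069283
```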
Limitations
- The gsutil rsync command doesn’t make the destination object’s timestamps match those of the source object (it can’t; timestamp setting is not allowed by the GCS API).
- The gsutil rsync command ignores versioning, synchronizing only the live object versions in versioned buckets.
Options
-c
  Causes the rsync command to compute checksums for files if the sizes of the source and destination match, and then compare checksums. This option increases local disk I/O and run time if either src_url or dst_url are on the local file system.
-C
  If an error occurs, continue to attempt to copy the remaining files. If errors occurred, gsutil’s exit status will be non-zero even if this flag is set. This option is implicitly set when running “gsutil -m rsync...”. Note: -C only applies to the actual copying operation. If an error occurs while iterating over the files in the local directory (e.g., invalid Unicode file name) gsutil will print an error message and abort.
-d
  Delete extra files under dst_url not found under src_url. By default extra files are not deleted.
-e
  Exclude symlinks. When specified, symbolic links will be ignored.
-n
  Causes rsync to run in “dry run” mode, i.e., just outputting what would be copied or deleted without actually doing any copying/deleting.
-p
  Causes ACLs to be preserved when synchronizing in the cloud. Note that this option has performance and cost implications when using the XML API, as it requires separate HTTP calls for interacting with ACLs. The performance issue can be mitigated to some degree by using gsutil -m rsync to cause parallel synchronization. Also, this option only works if you have OWNER access to all of the objects that are copied. If you want all objects in the destination bucket to end up with the same ACL, you can avoid the additional performance and cost of rsync -p by setting a default object ACL on that bucket instead. See gsutil help defacl.
-R, -r
  Causes directories, buckets, and bucket subdirectories to be synchronized recursively. If you neglect to use this option gsutil will make only the top-level directory in the source and destination URLs match, skipping any sub-directories.