Scripting Production Transfers

Overview

If you use gsutil in large production tasks (such as uploading or downloading many GBs of data each night), there are a number of things you can do to help ensure success. Specifically, this section discusses how to script large production tasks around gsutil’s resumable transfer mechanism.

Background On Resumable Transfers

First, it’s helpful to understand gsutil’s resumable transfer mechanism and how your script needs to be implemented around it to work reliably. gsutil uses resumable transfer support when you attempt to upload or download a file larger than a configurable threshold (by default, 2 MB). When a transfer fails partway through (e.g., because of an intermittent network problem), gsutil uses a truncated randomized binary exponential backoff-and-retry strategy that by default retries transfers up to 6 times over a 63-second period (see gsutil help retries for details). If the transfer fails on each of these attempts with no intervening progress, gsutil gives up on the transfer but keeps a “tracker” file for it in a configurable location (the default location is ~/.gsutil/), named using a combination of the SHA1 hash of the bucket and object name being transferred and the last 16 characters of the file name. When transfers fail in this fashion, you can rerun gsutil at some later time (e.g., after the networking problem has been resolved), and the resumable transfer picks up where it left off.
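
Both the resumable threshold and the tracker file location are set in the .boto configuration file. The following is a minimal sketch of the relevant [GSUtil] entries; the option names (in particular resumable_tracker_dir) and values shown here are assumptions to verify against the configuration documentation for your gsutil version.

    [GSUtil]
    # Transfers larger than this many bytes use the resumable mechanism
    # described above (the documented default threshold is 2 MB).
    resumable_threshold = 2097152
    # Directory where resumable transfer tracker files are kept
    # (option name assumed here; the default location is ~/.gsutil).
    resumable_tracker_dir = /home/youruser/.gsutil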

Scripting Data Transfer Tasks

To script large production data transfer tasks around this mechanism, you can implement a script that runs periodically, determines which file transfers have not yet succeeded, and runs gsutil to copy them. Below, we offer a number of suggestions about how this type of scripting should be implemented:

  1. When resumable transfers fail without any progress 6 times in a row over the course of up to 63 seconds, it probably won’t work to simply retry the transfer immediately. A more successful strategy is to have a cron job that runs every 30 minutes, determines which transfers still need to be run, and runs them (a sketch of such a script appears after this list). If the network experiences intermittent problems, the script picks up where it left off and eventually succeeds once the network problem has been resolved.

  2. If your business depends on timely data transfer, you should consider implementing some network monitoring. For example, you can implement a task that attempts a small download every few minutes and raises an alert if several attempts in a row fail (tune the frequency and threshold to your requirements), so that your IT staff can investigate problems promptly; a minimal probe sketch appears after this list. As with any monitoring implementation, experiment with the alerting thresholds to avoid false positives that lead your staff to start ignoring the alerts.

  3. There are a variety of ways you can determine what files remain to be transferred. We recommend that you avoid attempting to get a complete listing of a bucket containing many objects (e.g., tens of thousands or more). One strategy is to structure your object names in a way that represents your transfer process, and use gsutil prefix wildcards to request partial bucket listings. For example, if your periodic process involves downloading the current day’s objects, you could name objects using a year-month-day-object-ID format and then find today’s objects by using a command like gsutil ls “gs://bucket/2011-09-27-*”. Note that it is more efficient to have a non-wildcard prefix like this than to use something like gsutil ls “gs://bucket/*-2011-09-27”. The latter command actually requests a complete bucket listing and then filters in gsutil, while the former asks Google Storage to return the subset of objects whose names start with everything up to the “*”.

    For data uploads, another technique would be to move local files from a “to be processed” area to a “done” area as your script successfully copies files to the cloud. You can do this in parallel batches by using a command like:

    gsutil -m cp -R to_upload/subdir_$i gs://bucket/subdir_$i

    where i is a shell loop variable. Make sure to check that the shell’s exit status variable ($? in bash and other Bourne-style shells, $status in csh) is 0 after each gsutil cp command, to detect whether some of the copies failed, and rerun the affected copies.

    With this strategy, the file system keeps track of all remaining work to be done.

  4. If you have a really large number of objects in a single bucket (say hundreds of thousands or more), you should consider tracking your objects in a database instead of using bucket listings to enumerate them. For example, the database could track the state of your downloads, so your periodic download script can determine which objects still need to be downloaded by querying the database locally instead of performing a bucket listing.

  5. Make sure you don’t delete partially downloaded files after a transfer fails: gsutil picks up where it left off (and performs an MD5 check of the final downloaded content to ensure data integrity), so deleting partially transferred files loses that progress and wastes network bandwidth. You should also make sure that whatever process is waiting to consume the downloaded data doesn’t get pointed at the partially downloaded files. One way to do this is to download into a staging directory and then move successfully downloaded files into a directory where consumer processes will read them; a sketch of this staging approach appears after this list.

  6. If you have a fast network connection, you can speed up the transfer of large numbers of files by using the gsutil -m (multi-threading / multi-processing) option. Be aware, however, that gsutil doesn’t keep track of which files were downloaded successfully in cases where some files failed to download. For example, if you use multi-threaded transfers to download 100 files and 3 fail, it is up to your scripting process to determine which transfers didn’t succeed and to retry them. A periodic check-and-run approach like the one outlined earlier would handle this case.

    If you use parallel transfers (gsutil -m), you might want to experiment with the number of threads being used (via the parallel_thread_count setting in the .boto config file). By default, gsutil uses 10 threads on Linux and 24 threads on other operating systems. Depending on your network speed, available memory, CPU load, and other conditions, the default may or may not be optimal; try higher or lower values to find the best thread count for your environment.
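
To make suggestions 1 and 3 concrete, here is a sketch of an upload script that could be run from cron every 30 minutes. The directory layout (to_upload/ and done/), the bucket name, and the script path are placeholder assumptions for illustration: each remaining subdirectory is copied with gsutil -m cp, the exit status is checked, and only subdirectories that copied successfully are moved to the “done” area; anything that failed is simply retried on the next run, with gsutil’s resumable support picking up where it left off.

    #!/bin/bash
    # Sketch of a periodic upload script (suggestions 1 and 3).
    # Paths and bucket name below are placeholders, not real defaults.
    TO_UPLOAD=/data/to_upload
    DONE=/data/done
    BUCKET=gs://your-bucket

    for dir in "$TO_UPLOAD"/*/; do
      [ -d "$dir" ] || continue            # nothing left to upload
      subdir=$(basename "$dir")
      # -m parallelizes the copy; -R recurses into the subdirectory.
      if gsutil -m cp -R "$TO_UPLOAD/$subdir" "$BUCKET/$subdir"; then
        # All copies succeeded: record completion by moving the files aside.
        mv "$TO_UPLOAD/$subdir" "$DONE/$subdir"
      else
        # Some copies failed: leave the files in place so the next run
        # retries them (resumable transfers continue where they stopped).
        echo "Upload of $subdir failed; will retry on next run" >&2
      fi
    done

    # Example crontab entry to run the script every 30 minutes:
    #   */30 * * * * /usr/local/bin/upload_to_gcs.sh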
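
For the monitoring idea in suggestion 2, a minimal probe along these lines could run every few minutes from cron. The canary object, state file, failure threshold, and mail-based alert are all placeholder assumptions; in practice you would hook this into whatever alerting system your IT staff already uses.

    #!/bin/bash
    # Sketch of a small download probe (suggestion 2).
    # Object name, state file, threshold, and alert address are placeholders.
    CANARY=gs://your-bucket/canary-object
    STATE=/var/tmp/gcs_probe_failures
    MAX_FAILURES=3

    tmp=$(mktemp)
    if gsutil cp "$CANARY" "$tmp" > /dev/null 2>&1; then
      # Download worked: reset the consecutive-failure counter.
      echo 0 > "$STATE"
    else
      failures=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
      echo "$failures" > "$STATE"
      if [ "$failures" -ge "$MAX_FAILURES" ]; then
        # Several failures in a row: raise an alert (placeholder command).
        echo "Google Storage download probe failed $failures times in a row" \
          | mail -s "Storage transfer alert" itstaff@example.com
      fi
    fi
    rm -f "$tmp"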
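
And for suggestions 5 and 6, the sketch below downloads a day’s objects into a staging directory and only moves them into the consumers’ directory when the whole batch succeeded; partially downloaded files from a failed run stay in the staging area so a later run can resume them. The bucket, the year-month-day naming scheme from suggestion 3, and the directory paths are again placeholder assumptions.

    #!/bin/bash
    # Sketch of a staged download (suggestions 5 and 6).
    # Bucket, object naming scheme, and directories are placeholders.
    STAGING=/data/staging
    READY=/data/ready
    TODAY=$(date +%Y-%m-%d)

    # Download today's objects into the staging area. Partially downloaded
    # files left by an earlier failed run are resumed, not restarted.
    if gsutil -m cp "gs://your-bucket/$TODAY-*" "$STAGING/"; then
      # Every copy succeeded: hand the files to the consumer processes.
      mv "$STAGING"/* "$READY"/
    else
      # One or more copies failed: keep the staging files (including any
      # partial downloads) so the next run can finish them.
      echo "Some downloads failed; leaving staging files for retry" >&2
    fi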

Running Gsutil On Multiple Machines

When running gsutil on multiple machines that all use the same OAuth2 refresh token, it is possible to encounter rate limiting errors on the refresh requests (especially if all of these machines are likely to start running gsutil at the same time). To account for this, gsutil automatically retries OAuth2 refresh requests with a truncated randomized exponential backoff strategy like the one described in the “Background On Resumable Transfers” section above. The number of retries attempted for OAuth2 refresh requests can be controlled via the “oauth2_refresh_retries” variable in the .boto config file.
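
If you hit these refresh rate limits frequently, the retry count can be raised in the .boto file. A minimal sketch follows; the section name and the value shown are assumptions to check against your gsutil version’s configuration documentation.

    [OAuth2]
    # Number of times gsutil retries a failed OAuth2 token refresh request.
    # The value here is illustrative, not a recommended setting.
    oauth2_refresh_retries = 6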
