


Handling Datastore Errors

Nick Johnson
December 2009

This is one of a series of in-depth articles discussing App Engine's datastore. To see the other articles in the series, see Related links.

Inevitably, a very small percentage of datastore requests will result in errors. We are constantly working to minimize the occurrence of errors, but your application needs to be able to handle them when they do occur in order to present the best experience to users.

This article will explain why errors occur, and what you can do when they do, in order to minimize disruption to your users.

Note: If your app receives an exception when submitting a transaction, it does not always mean that the transaction failed.

You can receive the following exceptions in cases where transactions have been committed and eventually will be applied successfully:

  • In Python, Timeout, TransactionFailedError, or InternalError.
  • In Java, DatastoreTimeoutException, ConcurrentModificationException, or DatastoreFailureException.

Whenever possible, make your datastore transactions idempotent, so that if you repeat a transaction, the end result will be the same; a minimal sketch of such a transaction follows.
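For example, here is a minimal sketch of an idempotent transaction in Python. The Order model, its fields, and apply_payment are hypothetical, but the pattern of recording an operation ID so that a replayed commit becomes a no-op applies generally:

    from google.appengine.ext import db

    class Order(db.Model):
        # Hypothetical model: remembers the last payment applied so
        # that replaying the same transaction has no further effect.
        balance = db.IntegerProperty(default=0)
        last_payment_id = db.StringProperty()

    def apply_payment(order_key, payment_id, amount):
        def txn():
            order = db.get(order_key)
            if order.last_payment_id == payment_id:
                return  # Already applied; a retried commit changes nothing.
            order.balance -= amount
            order.last_payment_id = payment_id
            order.put()
        db.run_in_transaction(txn)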

Causes of errors

There are two major reasons that datastore errors occur.

Timeouts due to write contention

The first type of timeout occurs when you attempt to write to a single entity group too quickly. Writes to a single entity group are serialized by the App Engine datastore, and thus there's a limit on how quickly you can update one entity group. In general, this works out to somewhere between 1 and 5 updates per second; a good guideline is that you should consider rearchitecting if you expect an entity group to have to sustain more than one update per second for an extended period. Recall that an entity group is a set of entities with the same ancestor—thus, an entity with no children is its own entity group, and this limitation applies to writes to individual entities, too. For details on how to avoid datastore contention, see Avoiding datastore contention. Timeout errors that occur during a transaction are raised as a google.appengine.ext.db.TransactionFailedError instead of a Timeout.
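As an illustration of how this surfaces in Python (the Page model and its views property are hypothetical), contention during a transactional update can be caught like this:

    from google.appengine.ext import db

    class Page(db.Model):
        views = db.IntegerProperty(default=0)  # hypothetical counter field

    def increment_views(page_key):
        def txn():
            page = db.get(page_key)
            page.views += 1
            page.put()
        try:
            db.run_in_transaction(txn)
        except db.TransactionFailedError:
            # Raised instead of Timeout when the transaction cannot
            # commit, for example due to write contention.
            pass  # e.g. drop the update, or defer it to a task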

The most common way this limitation gets encountered is when you update an entity with every request—for example, counting the number of views of a page on your site. There are several approaches you can employ to avoid this: the most common is sharded counters. Another approach is to buffer updates in memcache, flushing them to the datastore periodically. This risks losing some updates, but greatly improves the efficiency of updates.
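A minimal sharded counter might look like the following sketch; the shard count and model are illustrative choices, not a prescribed implementation:

    import random

    from google.appengine.ext import db

    NUM_SHARDS = 20  # more shards allow a higher sustained write rate

    class CounterShard(db.Model):
        name = db.StringProperty(required=True)
        count = db.IntegerProperty(default=0)

    def increment(name):
        # Each increment updates one randomly chosen shard, spreading
        # writes across NUM_SHARDS entity groups instead of one.
        index = random.randint(0, NUM_SHARDS - 1)
        key_name = '%s-%d' % (name, index)
        def txn():
            shard = CounterShard.get_by_key_name(key_name)
            if shard is None:
                shard = CounterShard(key_name=key_name, name=name)
            shard.count += 1
            shard.put()
        db.run_in_transaction(txn)

    def get_count(name):
        # Reads sum across all shards; reads scale much better than
        # serialized writes to a single entity group.
        return sum(s.count for s in CounterShard.all().filter('name =', name))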

With the introduction of the Task Queue, another option is to enqueue a task to perform the update later; this allows you to make it through high-traffic periods without degrading the user experience. In exceptional circumstances, the task queue itself can return a transient error, which you also need to handle.
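For instance, deferring a counter update to a task might look like this sketch (the handler URL and parameter are made up; in SDKs of this era the import path was google.appengine.api.labs.taskqueue):

    from google.appengine.api import taskqueue

    def record_page_view(page_id):
        try:
            # The task's handler performs the datastore write later,
            # smoothing out bursts of traffic.
            taskqueue.add(url='/tasks/count_view', params={'page_id': page_id})
        except taskqueue.TransientError:
            # The queue itself hiccuped; for a view counter, dropping
            # one increment is an acceptable response.
            pass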

Timeouts due to datastore issues

A very small number of datastore operations—generally less than 1 in 3,000—will result in a timeout in normal operation. This is due to the distributed nature of Bigtable, which the datastore is built on: occasionally your datastore request will happen to occur just as the data it concerns is being moved between servers or is otherwise briefly unavailable, typically because the tablet holding it is being moved, split, or merged.

Some things your app does can increase the occurrence of tablet unavailability. For example, if you're inserting large amounts of data, that will cause tablet splits, which causes brief bursts of unavailability. Likewise, deleting large amounts of data will result in brief periods of unavailability as tablets are merged.

Datastore errors due to the above reasons are highly clustered: when a tablet is being moved, split, or merged, it's generally unavailable for anywhere from a few hundred milliseconds to a second or two, and during that period, all reads and writes for that data will fail. During that time, your requests may return immediately with a timeout error. (Because the tablet is not currently loaded, Bigtable returns an error immediately, which the datastore treats the same as a regular timeout.) As a result, an exponential backoff strategy, sketched below, is advisable—retrying repeatedly as fast as you can will simply waste CPU time. In the future, we may provide a way to distinguish regular timeouts from tablet unavailability.
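A backoff loop in Python might look like this sketch; the retry count and initial delay are arbitrary choices, and put_with_backoff is not part of the SDK:

    import time

    from google.appengine.ext import db

    def put_with_backoff(entity, retries=5, initial_delay=0.1):
        delay = initial_delay
        for attempt in range(retries):
            try:
                return entity.put()
            except db.Timeout:
                if attempt == retries - 1:
                    raise  # Out of retries; let the caller handle it.
                time.sleep(delay)
                delay *= 2  # Wait twice as long before each new attempt.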

Another related cause of timeouts is known as "hot tablets." Each tablet is hosted on only one Bigtable tablet server at a time, which means that one server is responsible for handling every read and write for the row range covered by that tablet. Too high a rate of updates to the same tablet can cause timeouts as the tablet server struggles to keep up with the requests for that tablet. Bigtable is fairly smart about splitting hot tablets to spread the load, but if all the updates are for a single row, or are consecutive, this isn't enough to relieve the load.

The most common example of this occurs when you rapidly insert a large number of entities of the same kind with auto-generated IDs. In this case, most inserts hit the same range of the same tablet, and the single tablet server is overwhelmed with writes. Most apps never have to worry about this: it only becomes a problem at write rates of several hundred writes per second and above. If this does affect your app, the easiest solution is to use more evenly distributed IDs instead of the auto-allocated ones. For example, you can use Python's uuid module to generate a UUID for each entity as its key name, as sketched below.
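For example (the LogEntry model is hypothetical), uuid4-based key names spread inserts across the key range instead of clustering them at one end of a tablet:

    import uuid

    from google.appengine.ext import db

    class LogEntry(db.Model):
        message = db.StringProperty()

    def store_message(message):
        # uuid4().hex yields a random 32-character key name, so
        # consecutive inserts land on different parts of the key range.
        LogEntry(key_name=uuid.uuid4().hex, message=message).put()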

The Bigtable behaviors we described above all relate to a single tablet, but many datastore operations involve several tablets at once. For example, when you write a new or updated entity to the datastore, in addition to the entity itself being written, the indexes—both built-in and custom—have to be updated, which also requires separate Bigtable writes. When you execute a query, the datastore scans the index—one read—and then fetches the matching entities from the Entities table, which requires a read for each entity being returned, each of which could be on a separate tablet. All of these operations are performed in parallel, so the operation returns quickly, but tablet unavailability for any of them could cause the operation as a whole to time out.

The 1 in 3,000 figure we originally mentioned, then, is an average: simpler operations are less likely to cause a timeout than more involved ones, because they touch fewer tablets in Bigtable. Further, a tablet move can cause a whole cluster of correlated timeouts—most of which can be avoided by being smart about backing off and trying again.

Finally, as with any service, occasional issues and downtime will occur; these can also cause elevated rates of errors in your app. When issues occur with the datastore, they'll be reported on our status site.

Telling the two apart

Determining the cause of errors in your app is generally fairly straightforward. If timeouts happen more frequently when updating a particular entity or group of entities, you're likely running into contention issues. If your timeouts are more randomly distributed, they're likely just the "background noise" of low-level timeouts.

Handling datastore timeouts

Internally, all datastore operations are automatically retried if they time out, but if the timeouts persist, the error will be returned to your code in the form of a google.appengine.ext.db.Timeout exception in Python, or a com.google.appengine.api.datastore.DatastoreTimeoutException in Java. For more details about server-side retries, see the Life of a Datastore Write article.
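In Python, catching and reporting such an exception might look like the following sketch (the handler and save_user_data are illustrative):

    from google.appengine.ext import db, webapp

    class SaveHandler(webapp.RequestHandler):
        def post(self):
            try:
                save_user_data(self.request)  # hypothetical datastore write
            except db.Timeout:
                # The datastore's automatic retries were exhausted;
                # report the problem rather than letting the request fail.
                self.error(503)
                self.response.out.write('The datastore is busy; please try again.')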

You have three main options for handling an exception:
