This is one of a series of in-depth articles discussing App Engine's datastore. To see the other articles in the series, see Related links.
Inevitably, a very small percentage of datastore requests will result in errors. We are constantly working to minimize the occurrence of errors, but your application needs to be able to handle them when they do occur in order to present the best experience to users.
This article will explain why errors occur, and what you can do when they do, in order to minimize disruption to your users.
Note:
If your app receives an exception when submitting a transaction, it does not always mean that the transaction failed. You can receive an exception even in cases where the transaction has been committed and will eventually be applied successfully.
Whenever possible, make your datastore transactions idempotent so that if you repeat a transaction, the end result will be the same.
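For example, here is a minimal sketch (the model and field names are illustrative, not from this article) of an idempotent transactional update: it sets an absolute value, so applying the transaction a second time leaves the entity in the same state, whereas an increment would double-count.

from google.appengine.ext import db

class Invoice(db.Model):
    status = db.StringProperty(default='open')

def mark_paid(invoice_key):
    invoice = db.get(invoice_key)
    invoice.status = 'paid'  # setting an absolute value is safe to repeat
    invoice.put()

# Create an example entity and update it transactionally.
invoice_key = Invoice(key_name='example-invoice').put()
db.run_in_transaction(mark_paid, invoice_key)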
Causes of errors
There are two major reasons that datastore errors occur.
Timeouts due to write contention
The first type of timeout occurs when you attempt to write to a single entity group too quickly. Writes to a single entity group are serialized by the App Engine datastore, and thus there's a limit on how quickly you can update one entity group. In general, this works out to somewhere between 1 and 5 updates per second; a good guideline is that you should consider rearchitecting if you expect an entity group to have to sustain more than one update per second for an extended period. Recall that an entity group is a set of entities with the same ancestor; thus, an entity with no children is its own entity group, and this limitation applies to writes to individual entities, too. For details on how to avoid datastore contention, see Avoiding datastore contention. Timeout errors that occur during a transaction will be raised as a google.appengine.ext.db.TransactionFailedError instead of a Timeout.
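To illustrate the distinction, here is a short sketch (the model and amounts are illustrative, not from this article) of catching the error raised when contention prevents a transaction from committing:

from google.appengine.ext import db

class Account(db.Model):
    balance = db.IntegerProperty(default=0)

def credit(account_key, amount):
    account = db.get(account_key)
    account.balance += amount
    account.put()

account_key = Account(key_name='example-account').put()
try:
    db.run_in_transaction(credit, account_key, 10)
except db.TransactionFailedError:
    # The entity group was too busy to commit the transaction;
    # back off and retry, or report the problem to the user.
    pass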
The most common way this limitation gets encountered is when you update an entity with every request—for example, counting the number of views to a page on your site. There are several approaches you can employ to avoid this: the most common is sharded counters. Another approach is to make the updates in memcache, flushing them to the datastore periodically. This risks losing some updates, but greatly improves the efficiency of updates.
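As a rough sketch of the memcache approach (the key naming and flush mechanism here are assumptions, not prescribed by this article), you might accumulate counts in memcache and fold them into the datastore from a periodic job:

from google.appengine.api import memcache
from google.appengine.ext import db

class PageViews(db.Model):
    count = db.IntegerProperty(default=0)

def record_view(page_id):
    # incr() is atomic; initial_value creates the key if it is absent.
    memcache.incr('views:%s' % page_id, initial_value=0)

def flush_views(page_id):
    # Run this from a cron job or task. Counts can be lost if memcache
    # evicts the key before a flush happens, as noted above.
    pending = memcache.get('views:%s' % page_id)
    if pending:
        def txn():
            counter = (PageViews.get_by_key_name(page_id) or
                       PageViews(key_name=page_id))
            counter.count += int(pending)
            counter.put()
        db.run_in_transaction(txn)
        memcache.decr('views:%s' % page_id, delta=int(pending))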
With the introduction of the Task Queue, another option is to create a task queue item to do the update later; this allows you to make it past high traffic periods without degrading the user experience. In exceptional circumstances, the task queue can also return a transient error, which you also need to handle.
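For instance, a hedged sketch of deferring the write (the handler URL and parameter name are assumptions, not from this article) might look like this:

from google.appengine.api import taskqueue

def enqueue_update(entity_id):
    try:
        # The task's handler performs the actual datastore write later,
        # and the task queue retries the task if that write fails.
        taskqueue.add(url='/tasks/apply-update', params={'id': entity_id})
    except taskqueue.TransientError:
        # Adding the task itself can occasionally fail transiently;
        # retry once, or surface the error to the caller.
        taskqueue.add(url='/tasks/apply-update', params={'id': entity_id})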
Timeouts due to datastore issues
A very small number of datastore operations—generally less than 1 in 3,000—will result in a timeout in normal operation. This is due to the distributed nature of Bigtable, which the datastore is built on: occasionally your datastore request will happen to occur just as the data it concerns is being moved between servers or is otherwise briefly unavailable. This typically happens for one of several reasons:
The tablet containing some of your data is being moved between Bigtable tablet servers for load-balancing at the time you try to access it.
The tablet containing some of your data is being split. This happens when the tablet becomes excessively large—over about 300MB—or when it receives more traffic than a single tablet server can handle. As a result of this, you will see slightly elevated timeout rates when your application is writing large amounts of data to the datastore.
The tablet is being merged with other tablets. This happens when a lot of data is deleted from your app's datastore.
Some things your app does can increase the occurrence of tablet unavailability. For example, if you're inserting large amounts of data, that will cause tablet splits, which causes brief bursts of unavailability. Likewise, deleting large amounts of data will result in brief periods of unavailability as tablets are merged.
Datastore errors due to the above reasons are highly clustered: when a tablet is being moved, split, or merged, it's generally unavailable for anywhere from a few hundred milliseconds to a second or two, and during that period, all reads and writes for that data will fail. During that time, your requests may return immediately with a timeout error. (Because the tablet is currently not loaded, Bigtable returns an error immediately, which the datastore treats the same as a regular timeout.) As a result, the exponential backoff strategy we cover below is advisable—retrying repeatedly as fast as you can will simply waste CPU time. In the future, we may provide a way to distinguish regular timeouts from tablet unavailability.
Another related cause of timeouts is known as "hot tablets." Each tablet is hosted on only one Bigtable tablet server at a time, which means that one server is responsible for handling every read and write for the row range covered by that tablet. Too high a rate of updates to the same tablet can cause timeouts as the tablet server struggles to keep up with the requests for that tablet. Bigtable is fairly smart about splitting hot tablets to spread the load, but if all the updates are for a single row, or are consecutive, this isn't enough to relieve the load.
The most common example of this occurs when you are rapidly inserting a large number of entities of the same kind, with auto-generated IDs. In this case, most inserts hit the same range of the same tablet, and the single tablet server is overwhelmed with writes. Most apps never have to worry about this: it only becomes a problem at write rates of several hundred queries per second and above. If this does affect your app, the easiest solution is to use more evenly distributed IDs instead of the auto-allocated ones. For example, you can use Python's uuid module to generate a uuid for each entity as its key name.
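A brief sketch of that suggestion (the model is illustrative, not from this article): generate a UUID key name so consecutive inserts are scattered across the key space rather than landing in one auto-ID range.

import uuid
from google.appengine.ext import db

class LogEntry(db.Model):
    message = db.StringProperty()

# uuid4().hex gives an evenly distributed 32-character key name.
entry = LogEntry(key_name=uuid.uuid4().hex, message='example message')
entry.put()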
The Bigtable behaviors we described above all relate to a single tablet, but many datastore operations involve several tablets at once. For example, when you write a new or updated entity to the datastore, in addition to the entity itself being written, the indexes—both built-in and custom—have to be updated, which also requires separate Bigtable writes. When you execute a query, the datastore scans the index—one read—and then fetches the matching entities from the Entities table, which requires a read for each entity being returned, each of which could be on a separate tablet. All of these operations are performed in parallel, so the operation returns quickly, but tablet unavailability for any of them could cause the operation as a whole to time out.
The 1 in 3,000 figure we originally mentioned, then, is an average: Simpler operations are less likely to cause a timeout than more involved ones, because they touch fewer tablets in Bigtable. Further, a tablet move can cause a whole cluster of correlated timeouts—most of which can be avoided by being smart about backing off and trying again.
Finally, as with any service, occasional issues and downtime will occur; these can also cause elevated rates of errors in your app. When issues occur with the datastore, they'll be reported on our status site.
Telling the two apart
Determining the cause of errors in your app is generally fairly straightforward. If timeouts happen more frequently when updating a particular entity or group of entities, you're likely running into contention issues. If your timeouts are more randomly distributed, they're likely just the "background noise" of low-level timeouts.
Handling datastore timeouts
Internally, all datastore operations are automatically retried if they time out, but if the timeouts persist, the error will be returned to your code in the form of a google.appengine.ext.db.Timeout exception in Python, or a com.google.appengine.api.datastore.DatastoreTimeoutException in Java. For more details about server-side retries, see the Life of a Datastore Write article.
You have three main options for handling an exception:
Ignore the exception. This is the default, and results in a 500 Internal Server Error being returned to the user.
Catch the exception and return an error response to the user. In Python, if you're using the webapp framework, you can do this by overriding your handler's handle_exception method, as in the sketch below; in Java, you can catch the exception in your servlet:
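Python
A minimal sketch, assuming a webapp RequestHandler (the handler class and error messages are illustrative, not from this article):

from google.appengine.ext import db
from google.appengine.ext import webapp

class MainHandler(webapp.RequestHandler):
    def get(self):
        # Code that could result in a timeout
        pass

    def handle_exception(self, exception, debug_mode):
        if isinstance(exception, db.Timeout):
            # Display a timeout-specific error page
            self.error(503)
            self.response.out.write('Please try again in a moment.')
        else:
            # Fall back to webapp's default handling (500 Server Error)
            super(MainHandler, self).handle_exception(exception, debug_mode)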
Java

@Override
public void doGet(HttpServletRequest req, HttpServletResponse resp)
    throws IOException {
  try {
    // Code that could result in a timeout
  } catch (DatastoreTimeoutException e) {
    // Display a timeout-specific error page
  } catch (Exception e) {
    // Display a generic 500 Server Error page
  } finally {
    // Code that should be run regardless of whether the request succeeds,
    // e.g. closing the PersistenceManager
  }
}
Consider retrying the datastore operation, if it is idempotent. Since App Engine already retries operations for you, it's likely that the timeout exception was raised because of a transient issue with the row(s) in question, and retrying further may not help. In certain circumstances, though, it's worth retrying anyway:
Python

import time

from google.appengine.api import datastore_errors
from google.appengine.ext import db
from google.appengine.runtime import apiproxy_errors

# 'entities' is the list of model instances to store, defined elsewhere.
try:
  timeout_ms = 100
  while True:
    try:
      db.put(entities)
      break
    except datastore_errors.Timeout:
      # Back off exponentially between retries (timeout_ms is milliseconds).
      time.sleep(timeout_ms / 1000.0)
      timeout_ms *= 2
except apiproxy_errors.DeadlineExceededError:
  # Ran out of retries -- display an error message to the user
  pass
Java
// Java low-level API
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
int timeoutMs = 100;
while (true) {
  try {
    datastore.put(entities);
    break;
  } catch (DatastoreTimeoutException e) {
    // Back off exponentially between retries.
    try {
      Thread.sleep(timeoutMs);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      break;
    }
    timeoutMs *= 2;
  }
}
A further option is to examine your use of the datastore: Can you move some of your work out of the datastore? Memcache is one good candidate for this: by caching your data, you can reduce the number of datastore operations you make, and thus the number of opportunities for a datastore timeout to occur. You can also use the Task Queue to do the write at a later time, which has the added benefit that the Task Queue automatically retries failures.
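As a rough sketch of the caching idea (the model, key prefix, and expiry time are assumptions, not from this article), you could serve reads from memcache and fall back to the datastore only on a cache miss:

from google.appengine.api import memcache
from google.appengine.datastore import entity_pb
from google.appengine.ext import db

class Article(db.Model):
    body = db.TextProperty()

def get_article(article_id):
    cache_key = 'article:%s' % article_id
    data = memcache.get(cache_key)
    if data is not None:
        # Rebuild the entity from its cached protocol buffer encoding.
        return db.model_from_protobuf(entity_pb.EntityProto(data))
    article = Article.get_by_key_name(article_id)
    if article is not None:
        memcache.set(cache_key, db.model_to_protobuf(article).Encode(),
                     time=600)  # cache for ten minutes
    return article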