February 27, 2009
Updated April 2011 by Johan Euphrosine
Note: This article uses the Python runtime. Documentation is also available for the Java runtime.
Introduction
Often when developing or supporting an App Engine application, it is useful
to be able to manipulate the datastore in ways that are not well suited to
the request/response model that works so well for serving Web applications.
Previously, doing this sort of operation has entailed workarounds such as
app3 or App Rocket. Starting with release 1.1.9 of the App Engine SDK,
however, there's a new way to interact with the datastore, in the form of
the remote_api module. This module allows remote access to the App Engine
datastore, using the same APIs you know and love from writing App Engine
apps.
In this article, we'll introduce you to the remote_api module, describe its basic functionality, and show you how to get an interactive console with access to your app's datastore. Then, we'll give an overview of the limitations of the remote_api module. Finally, we'll walk through a more sophisticated example: an implementation of the 'Map' part of a map/reduce operation, allowing you to execute a function on every entity of a kind.
About the remote_api module
The remote_api module consists of two parts: A 'handler', which you install on the server to handle remote datastore requests, and a 'stub', which you set up on the client to translate datastore requests into calls to the remote handler. remote_api works at the lowest level of the datastore, so once you've set up the stub, you don't have to worry about the fact that you're operating on a remote datastore: With a few caveats, it works exactly the same as if you were accessing the datastore directly.
Installing the handler is easy. Simply add the following lines to your
app.yaml, as a top-level directive (not under the 'handlers' key):

builtins:
- remote_api: on

This installs the remote_api handler under the URL '/_ah/remote_api'.
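For reference, here's how that directive fits into a complete app.yaml for a hypothetical guestbook app. Only the builtins section is new; the application name, script name, and the rest of the configuration are placeholders for whatever your app already uses:

```yaml
application: your_app_id
version: 1
runtime: python
api_version: 1

builtins:
- remote_api: on

handlers:
- url: /.*
  script: helloworld.py
```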
Once you've updated the app.yaml file, you'll need to execute appcfg.py
update for your app to upload the new mapping.
Running the remote_api shell
By running remote_api_shell.py from the command line, you can interact with
a Python shell that has access to your production datastore.
Since you will probably want to access modules defined by your app, such as
the model definitions, you'll need to make sure that your app is in the
Python path. The easiest way to do this is to change directory to your
app's root directory (the one with app.yaml in it) before running the
remote_api shell.
Then, just execute:
python $GAE_SDK_ROOT/remote_api_shell.py -s your_app_id.appspot.com
Replace 'your_app_id' with the app ID of your app, and you should get a Python interactive console prompt.
For demonstration purposes, we'll use the Guestbook app from the Getting Started documentation. Assuming we're in the root directory for the guestbook app, issue:
>>> import helloworld
>>> from google.appengine.ext import db
Now that we have access to the contents of the guestbook app, we can issue commands just as we would if we were writing code to run on the server:
>>> # Fetch the most recent 10 guestbook entries
>>> entries = helloworld.Greeting.all().order("-date").fetch(10)
>>>
>>> # Create our own guestbook entry
>>> helloworld.Greeting(content="A greeting").put()
In general, the console will act exactly as if you were accessing the datastore directly, but because the script is running on your own machine, you don't have to worry about how long it takes to run, and you can access all the files and resources on your local machine as you normally would!
Limitations of remote_api
The remote_api module goes to great lengths to make sure that as far as possible, it behaves exactly like the native App Engine datastore. In some cases, this means doing things that are less efficient than they might otherwise be. When using remote_api, here's a few things to keep in mind:
Every datastore request requires a round-trip
Since you're accessing the datastore over HTTP, there's a bit more overhead and latency than when you access it locally. In order to speed things up and decrease load, try to limit the number of round-trips you do by batching gets and puts, and fetching batches of entities from queries. This is good advice not just for remote_api, but for using the datastore in general, since a batch operation is only considered to be a single Datastore operation. For example, instead of this:
for key in keys:
  rec = MyModel.get(key)
  rec.foo = bar
  rec.put()
you can do this:
records = MyModel.get(keys)
for rec in records:
  rec.foo = bar
db.put(records)
Both examples have the same effect, but the latter requires only two round-trips in total, while the former requires two round-trips for each entity.
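The difference is easy to quantify with a toy stand-in for the datastore. All the names here are invented for illustration - this is not the App Engine API - but the round-trip arithmetic is the same:

```python
# Toy stand-in for the datastore that just counts round-trips.
class FakeDatastore(object):
    def __init__(self, records):
        self.records = dict(records)
        self.round_trips = 0

    def get(self, keys):
        self.round_trips += 1  # one round-trip per call, however many keys
        return [self.records[k] for k in keys]

    def put(self, entities):
        self.round_trips += 1  # likewise: one call, one round-trip
        for key, value in entities:
            self.records[key] = value

store = FakeDatastore({'a': 1, 'b': 2, 'c': 3})

# Unbatched: one get and one put per entity.
for key in ['a', 'b', 'c']:
    rec = store.get([key])[0]
    store.put([(key, rec + 1)])
unbatched = store.round_trips  # 6 round-trips for 3 entities

store.round_trips = 0

# Batched: one get and one put in total.
keys = ['a', 'b', 'c']
records = store.get(keys)
store.put([(k, v + 1) for k, v in zip(keys, records)])
batched = store.round_trips  # 2 round-trips, regardless of entity count
```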
Requests to remote_api use quota
Since remote_api operates over HTTP, every datastore call you make incurs quota usage for HTTP requests, bytes in and out, as well as the usual datastore quota you would expect. Bear this in mind if you're using remote_api to do bulk updates.
1 MB API limits apply
As when running natively, the 1MB limit on API requests and responses still applies. If your entities are particularly large, you may need to limit the number you fetch or put at a time to keep below this limit. This conflicts with minimising round-trips, unfortunately, so the best advice is to use the largest batches you can without going over the request or response size limitations. For most entities, this is unlikely to be an issue, however.
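One way to stay under the limit while still keeping batches large is to group entities by estimated size before each put. Here's a minimal sketch - the helper name and the caller-supplied size estimate are invented for illustration, not part of the SDK:

```python
def size_limited_batches(entities, sizer, max_bytes=1000000):
    """Split entities into batches whose estimated total size stays under
    max_bytes. `sizer` returns the estimated serialized size of one entity."""
    batch, batch_size = [], 0
    for entity in entities:
        size = sizer(entity)
        # Start a new batch if adding this entity would exceed the budget.
        if batch and batch_size + size > max_bytes:
            yield batch
            batch, batch_size = [], 0
        batch.append(entity)
        batch_size += size
    if batch:
        yield batch

# Example: entities estimated at 300KB each against a 1MB budget
# split into batches of at most three entities.
batches = list(size_limited_batches(range(7), lambda e: 300000))
```

You would then call db.put() once per yielded batch instead of once for the whole list.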
Avoid iterating over queries
One common pattern with datastore access is the following:
q = MyModel.all()
for entity in q:
  # Do something with entity
When you do this, the SDK fetches entities from the datastore in batches of 20, fetching a new batch whenever it uses up the existing ones. Because each batch has to be fetched in a separate request by remote_api, it's unable to do this as efficiently. Instead, remote_api executes an entirely new query for each batch, using the offset functionality to get further into the results.
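To see why offset-based batching gets expensive, here's a toy model (not the actual datastore internals): serving a query with offset N still requires walking over the first N results, so each successive batch re-examines everything before it.

```python
def fetch_with_offset(rows, limit, offset):
    """Toy query: returns a batch, plus the number of rows the
    'datastore' had to examine to serve it (skipped rows included)."""
    batch = rows[offset:offset + limit]
    return batch, offset + len(batch)

rows = list(range(100))
total_examined = 0
offset = 0
while True:
    batch, examined = fetch_with_offset(rows, 20, offset)
    total_examined += examined
    if not batch:
        break
    offset += len(batch)
# 100 results in batches of 20, but 20+40+60+80+100 rows examined for the
# full batches, plus a final scan to discover there's nothing left: 400 total.
```

A cursor-based iteration over the same 100 rows would examine each row only once.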
If you know how many entities you need, you can do the whole fetch in one request by asking for the number you need:
entities = MyModel.all().fetch(100)
for entity in entities:
  # Do something with entity
If you don't know how many entities you will want, you can use cursors to efficiently iterate over large result sets. This also allows you to avoid the 1000 entity limit imposed on normal datastore queries:
query = MyModel.all()
entities = query.fetch(100)
while entities:
  for entity in entities:
    # Do something with entity
  query.with_cursor(query.cursor())
  entities = query.fetch(100)
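That fetch/cursor loop packages up nicely as a reusable generator. The FakeQuery class below is a pure-Python mock invented for illustration - it mimics just the fetch(), cursor(), and with_cursor() methods the loop relies on, so you can see the pattern work end to end:

```python
class FakeQuery(object):
    """Mock of the query methods used by the cursor loop."""
    def __init__(self, items):
        self._items = list(items)
        self._pos = 0
        self._cursor = 0

    def fetch(self, n):
        batch = self._items[self._pos:self._pos + n]
        self._cursor = self._pos + len(batch)  # remember where we stopped
        return batch

    def cursor(self):
        return self._cursor

    def with_cursor(self, cursor):
        self._pos = cursor
        return self

def iterate_with_cursor(query, batch_size=100):
    """Yield every matching entity, fetching batch_size at a time."""
    entities = query.fetch(batch_size)
    while entities:
        for entity in entities:
            yield entity
        query.with_cursor(query.cursor())
        entities = query.fetch(batch_size)

results = list(iterate_with_cursor(FakeQuery(range(250)), batch_size=100))
```

With a real db.Query in place of FakeQuery, the generator body is identical to the loop above.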
Transactions are less efficient
In order to implement transactions via remote_api, it accumulates information on entities fetched inside the transaction, along with copies of entities that were put or deleted inside the transaction. When the transaction is committed, it sends all of this information off to the App Engine server, where it has to fetch all the entities that were used in the transaction again, verify that they have not been modified, then put and delete all the changes the transaction made and commit it. If there's a conflict, the server rolls back the transaction and notifies the client end, which then has to repeat the process all over again.
This approach works, and exactly duplicates the functionality provided by transactions on the local datastore, but is rather inefficient. By all means use transactions where they are necessary, but try to limit the number and complexity of the transactions you execute in the interest of efficiency.
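The optimistic scheme described above can be sketched in miniature. Everything here is invented for illustration - it is not the remote_api implementation - but it follows the same shape: the client records what it read, the "server" re-checks those versions at commit time, and a conflict sends the client back to the start:

```python
class Conflict(Exception):
    pass

class Server(object):
    def __init__(self):
        self.data = {}      # key -> value
        self.versions = {}  # key -> version number

    def commit(self, read_versions, puts):
        # Verify nothing read in the transaction has changed since.
        for key, version in read_versions.items():
            if self.versions.get(key, 0) != version:
                raise Conflict(key)
        for key, value in puts.items():
            self.data[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1

def run_in_transaction(server, func, retries=3):
    for _ in range(retries):
        read_versions, puts = {}, {}
        def get(key):
            read_versions[key] = server.versions.get(key, 0)
            return server.data.get(key)
        def put(key, value):
            puts[key] = value  # buffered until commit
        try:
            func(get, put)
            server.commit(read_versions, puts)
            return
        except Conflict:
            continue  # discard the buffered work and start over
    raise Conflict('too many retries')

server = Server()
def increment(get, put):
    put('counter', (get('counter') or 0) + 1)
run_in_transaction(server, increment)
run_in_transaction(server, increment)
```

Note that every entity touched inside func has to travel to the server at commit time, which is the overhead the advice above is about.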
Putting remote_api to work
Now that we've demonstrated the power of remote_api and outlined its limitations, it's time to put what we've learned to work with a practical tool. Frequently it would be useful to be able to iterate over every entity of a given kind, be it to extract their data, or to modify them and store the updated entities back to the datastore.
In order to achieve this, we're going to implement a simple 'map' framework.
We'll define a class, Mapper, that exposes a map() method for subclasses to
extend, and a couple of fields - KIND and FILTERS - for them to define what
kind to map over, and any filters to apply.
class Mapper(object):
  # Subclasses should replace this with a model class (eg, model.Person).
  KIND = None

  # Subclasses can replace this with a list of (property, value) tuples to filter by.
  FILTERS = []

  def map(self, entity):
    """Updates a single entity.

    Implementers should return a tuple containing two iterables
    (to_update, to_delete).
    """
    return ([], [])

  def get_query(self):
    """Returns a query over the specified kind, with any appropriate filters applied."""
    q = self.KIND.all()
    for prop, value in self.FILTERS:
      q.filter("%s =" % prop, value)
    return q

  def run(self, batch_size=100):
    """Executes the map procedure over all matching entities."""
    q = self.get_query()
    entities = q.fetch(batch_size)
    while entities:
      to_put = []
      to_delete = []
      for entity in entities:
        map_updates, map_deletes = self.map(entity)
        to_put.extend(map_updates)
        to_delete.extend(map_deletes)
      if to_put:
        db.put(to_put)
      if to_delete:
        db.delete(to_delete)
      q.with_cursor(q.cursor())
      entities = q.fetch(batch_size)
As you can see, there's not much to it. First, we define a convenience
method, get_query(), that returns a query that matches the kind and filters
specified in the class definition. This method could optionally be
overridden by a subclass, for example to support varying the filters at
runtime, as long as it uses only equality filters. Then, we define an
instance method, run(), which iterates over every matching entity in
batches, calling the map() function on each one, and updating or deleting
the entity as appropriate.
One caveat to our Mapper: The map process does not work from a snapshot of
the datastore. So if you return new entities from map() that themselves
meet the criteria for mapping, you may get them passed in to the map()
function later in the process. Whether or not they do depends on where
their key sorts compared to the current record's key. As a general rule, if
you're going to create new entities of the same type in a map() function,
you need some way to distinguish them from the original entities so you
don't process them a second time.
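This caveat can be illustrated in miniature. The names below are invented for demonstration - run_map is a stripped-down stand-in for Mapper.run() that visits entities in key order, including entities added along the way, just as remote_api's re-executed queries can:

```python
def run_map(entities, map_fn):
    """Visit entities in key order, including any added during the run.
    Returns the visit order."""
    visited = []
    while True:
        pending = [k for k in sorted(entities) if k not in visited]
        if not pending:
            break
        key = pending[0]
        visited.append(key)
        # The map step may add new entities, which later iterations will see.
        for new_key, new_entity in map_fn(key, entities[key]):
            entities[new_key] = new_entity
    return visited

def copy_unless_marked(key, entity):
    # The marker property is what lets us tell our own output apart.
    if entity.get('generated'):
        return []
    return [(key + '-copy', {'generated': True})]

entities = {'a': {}, 'b': {}}
order = run_map(entities, copy_unless_marked)
# The generated copies are visited too, but the marker stops them from
# spawning copies of copies; without it, this loop would never terminate.
```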
In order to use this class, we define a subclass that implements the map()
function. In this example, we're going to add the phrase 'Bar!' to any
guestbook entry that contains the phrase 'foo':
class GuestbookUpdater(Mapper):
  KIND = Greeting

  def map(self, entity):
    if entity.content.lower().find('foo') != -1:
      entity.content += ' Bar!'
      return ([entity], [])
    return ([], [])
Then, we instantiate our class and call run():
mapper = GuestbookUpdater()
mapper.run()
You can try this out for yourself easily: Just enter the code in the interactive console we set up earlier.
Finally, here's a practical - though trivial - example of where our new framework can be useful: Deleting all the entities of a given kind.
class MyModelDeleter(Mapper):
  KIND = MyModel

  def map(self, entity):
    return ([], [entity])
Simple! Because Mapper takes care to always access KIND and FILTERS as
instance variables, we can even generalize this to allow you to select the
kind and filters at runtime:
class BulkDeleter(Mapper):
  def __init__(self, kind, filters=None):
    self.KIND = kind
    if filters:
      self.FILTERS = filters

  def map(self, entity):
    return ([], [entity])
Of course, this is only the start of what you can do with remote_api and the Mapper framework. If you have a novel use you've come up with, please post it to the group - we'd love to hear about it.