December 2008
Updated January 2013 by Danny Hermes
This is part two of a five-part series on effectively scaling your App Engine-based apps. To see the other articles in the series, see Related links .
This article was originally written for SDK version 1.1.7 . The method for paging described here has been obsolete as of release 1.3.1. In this release, query cursors ( Java | Python ) superseded the techniques described below and are now the recommended method for paging through large datasets. A sample that shows paging with cursors is also included with the code for this article.
One of the typical web functions you may need to do is pagination, that is, displaying large datasets one page at a time. If you have too many items to display on a single web page you will display a subset of them, say 20, with a link to the next page of 20 items. For example, the comments on a blog would be paged if there are a lot of comments.
Fetch
Now you may be tempted to use
fetch(limit, offset=0)
to do your
paging, but this has limitations. From the documentation:
Note: fetch() returns a maximum of 1000 results. If more than 1000 entities match the query, and either no limit is specified or a limit larger than 1000 is used, only the first 1000 results are returned by fetch().
So you can use
fetch()
only if there are going to be less than 1000
entities. Even if you have less than 1000 entities you may still want to avoid it as the time
it takes grows as the value of offset increases. From the documentation:
The query has performance characteristics that correspond linearly with the offset amount plus the limit.
Paging on a property
If your application needs more than that then you will need to do something different. We will use a suggestion box as our working example. We will start with a naive implementation and slowly evolve it into a good working example of paging.
PAGE_SIZE = 10 class Suggestion(ndb.Model): suggestion = ndb.StringProperty() when = ndb.DateTimeProperty(auto_now_add=True)
The first requirement for paging is having an indexed property that you can run an inequality filter against. In this case let's construct a way to page through the suggestions in reverse chronological order. To get the first set of 10 suggestions:
def get(self): suggestions = Suggestion.query().order(-Suggestion.when).fetch(PAGE_SIZE) # ..render template..
But we need to know if we should provide a
next
link to the
next page if there happens to be enough data for another page, so
let's request
PAGE_SIZE + 1
entities, and if we actually get back
PAGE_SIZE + 1
entities then we know there are enough entities for
another page:
def get(self): next_result = None suggestions = Suggestion.query().order(-Suggestion.when).fetch(PAGE_SIZE + 1) if len(suggestions) == PAGE_SIZE + 1: next_result = suggestions[-1].when suggestions = suggestions[:PAGE_SIZE] # ..render template..
Now when we generate our HTML for the page of suggestions we can
use the value of
next_result
to determine whether or not to present
a
next
link. We also need a way to handle a request for the next
page, which we do by adding a query parameter
bookmark
that
contains the value of
when
that we should start the next page
at:
def get(self): next_result = None bookmark = self.request.get('bookmark') query = Suggestion.query().order(-Suggestion.when) if bookmark: query = query.filter(Suggestion.when <= bookmark) suggestions = query.fetch(PAGE_SIZE + 1) if len(suggestions) == PAGE_SIZE + 1: next_result = suggestions[-1].when suggestions = suggestions[:PAGE_SIZE] # ..render template..
If you are paging and using a property that is unique across all
entities then you're done and the above code should work for
you. Unfortunately the code presented above has a problem for
this particular example because the value of the
when
field isn't
guaranteed to be unique across all the entities, that is, it's
possible that more than one suggestion could come in at the exact
same time, as in this example:
offset |
suggestion
|
when
|
---|---|---|
... | ... | ... |
9 | Stock Jolt | 2008-10-26 04:38:00 |
10 | Allow dogs in the office | 2008-10-26 03:35:58 |
11 | Allow cats in the office | 2008-10-26 03:35:58 |
12 | Buy some multicolored exercise balls | 2008-10-26 01:10:03 |
13 | ... | ... |
If we present a page with the 10 most recent suggestions
you would end up with
2008-10-26 04:38:02
as the place to pick up the next 10 in the list.
Unfortunately #10 and #11 have the same value for
when
and you would get overlap,
the item at spot #10 appears as the last element on
the first page and also as the first element on the second page.
You can see how this problem would get worse the more duplicates
there were.
While we want to page in reverse chronological order, the
when
field may not be unique. What we need is a way to guarantee that it
is unique. One way to guarantee uniqueness is to have a single counter
that we increment each time we add a Suggestion entity to the
datastore. The value of the counter is appended to each
when
value
and guarantees they are unique:
offset |
suggestion
|
when
|
---|---|---|
... | ... | ... |
9 | Stock Jolt | 2008-10-26 04:38:00|09 |
10
|
Allow dogs in the office |
2008-10-26 03:35:58|10
|
11
|
Allow cats in the office | 2008-10-26 03:35:58|11 |
12 | Buy some multicolored exercise balls | 2008-10-26 01:10:03|12 |
13 | ... | ... |
Let's update our model to reflect this change. We will have a
created
property which is the date and time that the Suggestion
was created and change
creation_token
to be a
StringProperty
since
it will now contain the date and time concatenated with another
string that makes it unique. The
creation_token
property is the one we will
sort by when paginating.
class Suggestion(ndb.Model): suggestion = ndb.StringProperty() created = ndb.DateTimeProperty(auto_now_add=True) creation_token = ndb.StringProperty()
If the rate at which you add suggestions is low then this is an
acceptable solution, but if they are coming in quickly then the
counter ends up becoming a bottleneck as it can only be updated so
quickly. We need a way to generate uniqueness but also allow quick
updates, and if you've read the article on counter sharding then this is a
familiar problem with a familiar solution. In this case we are
going to shard our counters over the users that are adding
suggestions and use the user id as part of the unique value we add
to each value of
creation_token
.
So we need a
Contributor
:
class Contributor(ndb.Model): count = ndb.IntegerProperty(default=0) @classmethod @ndb.transactional def unique_id(cls, email): """Increments contributor suggestion count and creates unique ID.""" contributor = cls.get_by_id(email) if contributor == None: contributor = cls(id=email) contributor.count += 1 contributor.put() return '{}|{:d}'.format(email, contributor.count)
Now this looks good:
offset |
suggestion
|
creation_token
|
---|---|---|
... | ... | ... |
9 | Stock Jolt | 2008-10-26 04:38:00|[email protected]|1 |
10 | Allow dogs in the office | 2008-10-26 03:35:58|[email protected]|1 |
11 | Allow cats in the office | 2008-10-26 03:35:58|[email protected]|2 |
12 | Buy some multicolored exercise balls | 2008-10-26 01:10:03|[email protected]|3 |
13 | ... | ... |
But we do have one more problem, and that is that we now expose the
email address of the person making the suggestion in our
creation_token
value, which will get exposed in our
next
link URI. To get around
that final problem we can obfuscate the data we append to each
creation_token
value by putting it through an MD5 hash:
... @classmethod @ndb.transactional def unique_id(cls, email): """Increments contributor suggestion count and creates unique ID.""" ... md5_hash = hashlib.md5('{}|{:d}'.format(email, contributor.count)) return md5_hash.hexdigest()
offset |
suggestion
|
creation_token
|
---|---|---|
... | ... | ... |
9 | Stock Jolt | 2008-10-26 04:38:00|aee15ab24b7b3718596e3acce04fba85 |
10 | Allow dogs in the office | 2008-10-26 03:35:58|404a3235076f6651914358680acf3cb5 |
11 | Allow cats in the office | 2008-10-26 03:35:58|7574b989df099d4e2b95619c9cf0c2a0 |
12 | Buy some multicolored exercise balls | 2008-10-26 01:10:03|c675e87cc990a718979afecc93a77bc1 |
13 | ... | ... |
Code
All the source for this article can be found in the in our samples repository, and it contains a full example of creating a unique index and also a full example of paging with cursors, which was enabled as of release 1.3.1.
Summary
In summary, to do paging you need a property that has a unique
value for each entity and is indexed in the direction you want to
page. Always fetch
N + 1
entities to determine in you have enough
data for a
next
page. And finally, if you don't have uniqueness
in the property then you can use a single counter, or counters
sharded over some attribute such as the user, to make that property
unique.
Further Reading
- Watch Brett Slatkin's presentation at Google I/O: Building Scalable Web Applications with Google App Engine