Retrieving, Deleting, and Reindexing (Updating) Documents

Retrieving a Document by Its Document ID

You may sometimes need to retrieve a document by its document ID, instead of using a query. In the example application, this comes up in the context of product review creation, where a new review triggers some Datastore-related bookkeeping, after which the relevant product document is updated with its new average rating.

You can retrieve a document with the Index.list_documents method. Pass the document ID as the start_doc_id parameter, and limit the returned list to size 1 for efficiency. If the specified document ID is not in the index, the first document in the index will be returned instead, so you’ll need to check that this case did not occur:


@classmethod
def getDoc(cls, doc_id):
  """Return the document with the given doc id. One way to do this is via
  the list_documents method, as shown here. If the doc id is not in the
  index, the first doc in the index will be returned instead, so we need
  to check for that case."""
  if not doc_id:
    return None
  try:
    index = cls.getIndex()
    response = index.list_documents(
        start_doc_id=doc_id, limit=1, include_start_doc=True)
    if response.results and response.results[0].doc_id == doc_id:
      return response.results[0]
    return None
  except search.InvalidRequest: # catches ill-formed doc ids
    return None
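
To see why the doc_id comparison in getDoc matters, here is a minimal sketch that runs without App Engine. FakeIndex, Doc, and Response are hypothetical stand-ins, not part of the Search API; the stand-in simply mimics the behavior described above, where a missing ID causes the first document in the index to come back instead.

```python
import collections

# Hypothetical stand-ins for illustration only; not the real Search API.
Doc = collections.namedtuple("Doc", ["doc_id", "fields"])
Response = collections.namedtuple("Response", ["results"])


class FakeIndex(object):
    """In-memory index that mimics the lookup behavior described above."""

    def __init__(self, docs):
        self._docs = sorted(docs, key=lambda d: d.doc_id)

    def list_documents(self, start_doc_id=None, limit=100,
                       include_start_doc=True):
        ids = [d.doc_id for d in self._docs]
        found = start_doc_id in ids
        # Assumption of this sketch: a missing ID falls back to the
        # first document in the index.
        start = ids.index(start_doc_id) if found else 0
        if found and not include_start_doc:
            start += 1
        return Response(results=self._docs[start:start + limit])


def get_doc(index, doc_id):
    """Return the document with exactly this ID, or None."""
    if not doc_id:
        return None
    response = index.list_documents(
        start_doc_id=doc_id, limit=1, include_start_doc=True)
    # Without this comparison, a lookup for a missing ID would hand
    # back an unrelated document.
    if response.results and response.results[0].doc_id == doc_id:
        return response.results[0]
    return None


index = FakeIndex([Doc("book-1", {"title": "A"}),
                   Doc("book-2", {"title": "B"})])
found = get_doc(index, "book-2")    # the requested document
missing = get_doc(index, "book-9")  # None, not book-1
```

The raw list_documents call for "book-9" still returns a document here, which is exactly the case the equality check filters out.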

Deleting a Document from an Index

To delete a document from an index, simply pass its document ID to the index’s remove() method. Don’t forget to catch a search.Error exception:


@classmethod
def removeDocById(cls, doc_id):
  """Remove the doc with the given doc id."""
  try:
    cls.getIndex().remove(doc_id)
  except search.Error:
    logging.exception("Error removing doc id %s.", doc_id)

To delete all documents from a given index, list their IDs with the index’s list_documents method and pass those IDs to remove(). For efficiency, set the ids_only parameter to True, meaning that the returned document objects will contain only their IDs and not the document fields, which you don’t need here. Remove each batch of documents based on the returned IDs:


@classmethod
def deleteAllInIndex(cls):
  """Delete all the docs in the given index."""
  docindex = cls.getIndex()

  try:
    while True:
      # until no more documents, get a list of documents,
      # constraining the returned objects to contain only the doc ids,
      # extract the doc ids, and delete the docs.
      document_ids = [document.doc_id
                      for document in docindex.list_documents(ids_only=True)]
      if not document_ids:
        break
      docindex.remove(document_ids)
  except search.Error:
    logging.exception("Error removing documents:")

Notice that this method loops until there are no more documents in the index. This is because each list_documents call returns a bounded number of documents (100 by default, and at most 1000), so multiple calls may be needed to clear the whole index.
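
The batching can be traced with a small sketch that runs outside App Engine. FakeIndex and Doc are hypothetical stand-ins, and the call counters exist only to make the loop visible; the sketch assumes the default page size of 100 mentioned above.

```python
import collections

# Hypothetical stand-ins for illustration only; not the real Search API.
Doc = collections.namedtuple("Doc", ["doc_id"])


class FakeIndex(object):
    """In-memory index that returns at most `limit` documents per call."""

    def __init__(self, doc_ids):
        self._doc_ids = list(doc_ids)
        self.list_calls = 0
        self.remove_calls = 0

    def list_documents(self, ids_only=False, limit=100):
        self.list_calls += 1
        return [Doc(doc_id) for doc_id in self._doc_ids[:limit]]

    def remove(self, doc_ids):
        self.remove_calls += 1
        doomed = set(doc_ids)
        self._doc_ids = [d for d in self._doc_ids if d not in doomed]

    def count(self):
        return len(self._doc_ids)


def delete_all(index):
    """Same loop shape as deleteAllInIndex: list a page, remove it, repeat."""
    while True:
        document_ids = [doc.doc_id
                        for doc in index.list_documents(ids_only=True)]
        if not document_ids:
            break
        index.remove(document_ids)


index = FakeIndex("product-%d" % n for n in range(250))
delete_all(index)
```

With 250 documents and a page size of 100, the loop removes three batches and makes a fourth, empty list call before stopping.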

Note: Deleting documents from an index doesn’t change the index’s schema: the schema information is updated monotonically when documents are added, and is not edited when they are removed.

Document Reindexing

To update or change an indexed document, simply add a new document object to the index with the same document ID. If the index already contains a document with that ID, the existing document will be updated and reindexed; if no document already exists in the index with that ID, the new document will simply be added with the given ID.

The example application uses product IDs from the sample product data as the document IDs. (If you look at the code, you’ll notice that it also uses the product IDs as Product entity IDs in the Datastore.) Because the document IDs are the same as the product IDs, and the product IDs come from the data source itself, it’s easy to reindex the product data when it changes: we can update the indexed documents without needing to retrieve them first.

To step through this process, first take a look at the files data/sample_data_books.csv and data/sample_data_books_update.csv from the example application. These contain the application’s sample product data. When the user clicks the link "Delete all datastore and index product data, then load in sample product data", any existing index contents are deleted and then all of the data in data/sample_data_books.csv is imported. For the purposes of this discussion, the pertinent part of the process is that when a new document is created, its ID is set to the product ID:


d = search.Document(doc_id=product_id, fields=docfields)

The document is then added to the product index.

Next, if the user clicks "Demo loading product update data", the data in data/sample_data_books_update.csv is added to the index. Some of the entries in this file update existing book documents, since their product IDs correspond to existing documents. Other entries in this file define new books, since no existing documents have their product IDs.

Since we’re using the product IDs as document IDs, we can create new documents from this data, setting their document IDs to the product IDs as above, and simply add the documents. We don’t need to know whether any documents with these product IDs already exist. If documents with those same product IDs do exist, they will be updated with the new content and reindexed; if not, new documents are indexed.
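
The add-or-replace behavior can be sketched with a dictionary-backed stand-in. FakeIndex, its add method, the "updated"/"added" return value, and the sample rows are all hypothetical and exist only to make the outcome observable; they are not the Search API or the contents of the CSV files.

```python
class FakeIndex(object):
    """Hypothetical in-memory index keyed by document ID."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, fields):
        # Same ID replaces the stored document; a new ID creates one.
        # The returned status is an invention of this sketch.
        existed = doc_id in self._docs
        self._docs[doc_id] = dict(fields)
        return "updated" if existed else "added"

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def count(self):
        return len(self._docs)


# Made-up rows standing in for the initial and update data sets.
initial_rows = [
    ("book-1", {"title": "First Book", "price": 10}),
    ("book-2", {"title": "Second Book", "price": 12}),
]
update_rows = [
    ("book-2", {"title": "Second Book", "price": 15}),  # existing product ID
    ("book-3", {"title": "Third Book", "price": 8}),    # new product ID
]

index = FakeIndex()
for product_id, fields in initial_rows:
    index.add(product_id, fields)

# The caller never checks whether a document already exists.
outcomes = [index.add(product_id, fields)
            for product_id, fields in update_rows]
```

Because the product ID doubles as the document ID, the loading code is identical for new and changed products.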

(If you look at the example application code, you’ll notice that this is not quite the whole story: in some situations, there is information in an existing document that we need to retain and set in the updated document, so in those cases we do need to access the old document if it exists.)
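
As a rough illustration of that last point, the sketch below carries a field over from the old document before replacing it. The plain dict used as an index, the reindex_keeping helper, and the avg_rating field name are assumptions made for this example; the example application's actual code and field names may differ.

```python
def reindex_keeping(index, doc_id, new_fields, keep=("avg_rating",)):
    """Replace the document for doc_id, retaining selected old fields.

    `index` is a plain dict standing in for a search index (hypothetical).
    """
    old_fields = index.get(doc_id)
    merged = dict(new_fields)
    if old_fields is not None:
        for name in keep:
            # Only copy a retained field if the update does not set it.
            if name in old_fields and name not in merged:
                merged[name] = old_fields[name]
    index[doc_id] = merged
    return merged


index = {"book-1": {"title": "First Book", "price": 10, "avg_rating": 4.5}}

# Updated product data has no rating, so the old one is carried over.
updated = reindex_keeping(index, "book-1",
                          {"title": "First Book", "price": 9})

# A brand-new product has nothing to retain.
created = reindex_keeping(index, "book-2",
                          {"title": "Second Book", "price": 12})
```

This is the one case where the old document has to be read first, which is where retrieval by document ID comes in.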

Summary and Review

In this lesson, we’ve seen how to retrieve documents by document IDs and how to delete and update them.

This lesson concludes the Deeper Look at the Python Search API class. In the course of this class and its precursor, you’ve accumulated the basic toolkit for building applications that use the Search API. Try creating a simple application of your own, or making additional modifications to the example app!

You can get help on Stack Overflow using the google-app-engine tag, or in the App Engine Google Group.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies.

Last updated April 3, 2014.
