Amy Unruh, Oct 2012
Google Developer Relations
This lesson covers the basics of using the Search API: indexing content and making queries on an index.
App Engine’s Search API operates through an Index object. This object lets you store data via an index document, retrieve documents using search queries, modify documents, and delete documents.
Each index has an index name and, optionally, a namespace. The name uniquely identifies the index within a given namespace. It must be a visible, printable ASCII string not starting with ‘!’. Whitespace characters are excluded. You can create multiple Index objects, but any two such objects that have the same index name in the same namespace reference the same index.
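The naming rule above can be checked before an index is created. The following is a minimal sketch in plain Python, not part of the Search API: the helper name is hypothetical, and treating an empty string as invalid is an assumption, since the text does not mention it.

```python
def is_valid_search_name(name):
    """Return True if `name` follows the naming rule described in the text.

    Hypothetical helper for illustration only. Assumption: an empty
    string is rejected (the text does not cover that case).
    """
    if not name or name.startswith("!"):
        return False
    # Visible, printable ASCII runs from '!' (33) to '~' (126). That range
    # excludes the space character and every control character.
    return all(33 <= ord(ch) <= 126 for ch in name)
```

The same rule is stated later for document IDs, so a check like this could be reused for both.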
You can use namespaces and indexes to organize your documents. For the example product search application, all the product documents are in one index, with another index containing information about store locations. We can filter a query on the product category if we want to search for, say, only books.
In your code, you create an Index object by specifying the index name:
from google.appengine.api import search
index = search.Index(name='productsearch1')
or
index = search.Index(name='yourindex', namespace='yournamespace')
The underlying document index will be created at first access if it does not already exist; you don’t have to create it explicitly.
You can’t currently delete indexes, though you can delete documents from them, as will be described in the next class, A Deeper Look at the Python Search API.
Documents hold an index’s searchable content. A document is a container for structuring indexable data. From a technical point of view, a Document object represents a collection of fields, uniquely identified by a document ID. Fields are named, typed values. Documents do not have “kinds” in the same sense as Datastore entities.
In our example application, for instance, our product categories are books and HD televisions. (The store has a rather limited selection of products!) Each product document in the example application always includes the following core fields, defined by docs.Product class variables:
CATEGORY (set to books or hd_televisions)
PID (product ID)
PRODUCT_NAME
DESCRIPTION
PRICE
AVG_RATING
UPDATED (date of last update)
Figure 1: Product document fields.
The books and HD televisions categories each have some additional fields of their own. For books, the extra fields are:
title
author
publisher
pages
isbn
For HD televisions, they are:
brand
tv_type
size
The application itself enforces an application-level semantic consistency for documents of each product type. That is, all product documents will always include the same core fields, all books have the same set of additional fields, and so on. However, a search index doesn’t impose any cross-document schematic consistency on the fields that are used, so there is no explicit concept of querying for “product” documents specifically.
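Because the index does not enforce a schema, the consistency described above has to live in application code. Here is a minimal sketch of such a check in plain Python. The category-specific field names come from the lists above; the lower-case core field names are assumptions, since the actual values of the docs.Product class variables are not shown here, and the helper itself is hypothetical.

```python
# Assumed string values for the core fields; the real values are whatever
# the docs.Product class variables are set to in the example application.
CORE_FIELDS = {"category", "pid", "name", "description",
               "price", "avg_rating", "updated"}

# Category-specific fields, as listed in the text.
EXTRA_FIELDS = {
    "books": {"title", "author", "publisher", "pages", "isbn"},
    "hd_televisions": {"brand", "tv_type", "size"},
}


def missing_fields(category, field_names):
    """Return the sorted required field names that `field_names` lacks."""
    if category not in EXTRA_FIELDS:
        raise ValueError("unknown product category: %r" % (category,))
    required = CORE_FIELDS | EXTRA_FIELDS[category]
    return sorted(required - set(field_names))
```

An application could run a check like this just before building the Document object and refuse to index a product whose field list is incomplete.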
Each document field has a field type. The type can be any of the following (defined in the Python module search):
TextField: A plain text string.
HtmlField: HTML-formatted text. If your string is HTML, use this field type, as the Search API can take the markup into account when creating result snippets and in document scoring.
AtomField: A string treated as a single token. A query will not match if it includes only a substring rather than the full field value.
NumberField: A numeric (integer or floating-point) value.
DateField: A date with no time component.
GeoField: A geographical location, denoted by a GeoPoint object specifying latitude and longitude coordinates.
For text fields (TextField, HtmlField, and AtomField), the values should be Unicode strings.
To construct a Document object, you build a list of its fields, define its document ID if desired, and then pass this information to the Document constructor.
Our example application uses the TextField, AtomField, NumberField, and DateField field types for product documents.
The core product fields (those which are included in all product documents) look like this, where we assume the value arguments of the constructors below are set to appropriate values:
import datetime
from google.appengine.api import search
...
fields = [
    search.TextField(name=docs.Product.PID, value=pid),  # the product id
    # The 'updated' field is set to the current date.
    search.DateField(name=docs.Product.UPDATED,
                     value=datetime.datetime.now().date()),
    search.TextField(name=docs.Product.PRODUCT_NAME, value=name),
    search.TextField(name=docs.Product.DESCRIPTION, value=description),
    # The category names are atomic
    search.AtomField(name=docs.Product.CATEGORY, value=category),
    # The average rating starts at 0 for a new product.
    search.NumberField(name=docs.Product.AVG_RATING, value=0.0),
    search.NumberField(name=docs.Product.PRICE, value=price)]
Note that the category field is typed as AtomField. Atom fields are useful for things like categories, where exact matches are desired; text fields are better for strings like titles or descriptions. One of our example categories is hd televisions. If we search for just televisions, we will not get a match (assuming that string is not contained in another product field). But if we search for the full field string, hd televisions, we will match on the category field.
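The difference between the two matching styles can be modeled in a few lines of plain Python. This is only a simplified sketch, not how the Search API is implemented: the tokenizer is deliberately crude, lower-casing both sides is an assumption made for the illustration, and all three function names are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lower-cased word tokens (a crude stand-in tokenizer)."""
    return re.findall(r"\w+", text.lower())


def matches_text_field(field_value, query):
    """Text-style match: every query token appears among the field's tokens."""
    field_tokens = set(tokenize(field_value))
    query_tokens = tokenize(query)
    return bool(query_tokens) and all(t in field_tokens for t in query_tokens)


def matches_atom_field(field_value, query):
    """Atom-style match: the query must equal the entire field value."""
    return field_value.lower() == query.lower()
```

In this model, the query televisions matches the value hd televisions when it is treated as text, but not when it is treated as an atom, which mirrors the behavior described above.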
The example application also includes fields specific to individual product categories. These are added to the field list as well, depending on the category. For example, for the television category, there are additional fields for size (a number field), brand, and tv_type (text fields). Books have a different set of fields.
Given the field list, we can create a document object. For each product document, we’ll set its document ID to be the predefined unique ID of that product:
d = search.Document(doc_id=product_id, fields=fields)
This design has some advantages for us (as we’ll discuss in the follow-on class to this one), but if we didn’t specify the document ID, one would be generated for us automatically when the document is added to an index.
The Search API supports Geosearch on documents that include fields of type GeoField. If your documents contain such fields, you can query an index for matches based on distance comparisons.
A location is defined by the GeoPoint class, which stores latitude and longitude coordinates. The latitude specifies the angular distance, in degrees, north or south of the equator. The longitude specifies the angular distance, again in degrees, east or west of the prime meridian. For example, the location of the Opera House in Sydney is defined by GeoPoint(-33.857, 151.215). To store a geopoint in a document, you need to add a GeoField field with a GeoPoint object set as its value.
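To make the idea of a distance comparison between two latitude/longitude pairs concrete, here is a minimal sketch in plain Python using the haversine formula. It is an illustration only: the Search API performs its own distance computation when you query, and nothing here is part of that API. The function names and the Earth-radius constant are assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def check_coordinates(latitude, longitude):
    """Raise ValueError if the pair is outside the valid degree ranges."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError("latitude must be in [-90, 90]: %r" % (latitude,))
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude must be in [-180, 180]: %r" % (longitude,))
    return latitude, longitude


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    check_coordinates(lat1, lon1)
    check_coordinates(lat2, lon2)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

A "stores within N kilometers" filter then reduces to comparing this distance against N for each store location.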
Here is how the fields for the store location documents in the product search application are constructed:
from google.appengine.api import search
...
geopoint = search.GeoPoint(latitude, longitude)
fields = [search.TextField(name=docs.Store.STORE_NAME, value=storename),
          search.TextField(name=docs.Store.STORE_ADDRESS, value=store_address),
          search.GeoField(name=docs.Store.STORE_LOCATION, value=geopoint)]
Before you can query a document’s contents, you must add the document to an index, using the Index object’s add() method. Indexing allows the document to be searched with the Search API’s query language and query options.
You can specify your own document ID when constructing a document. The document ID must be a visible, printable ASCII string not starting with ‘!’. Whitespace characters are excluded. (As we’ll see later, if you index a document using the ID of an existing document, that existing document will be reindexed). If you don’t specify a document ID, a unique numeric ID will be generated automatically when the document is added to the index.
You can add documents one at a time, or alternatively you can add a list of documents in batch, which is more efficient. Here’s how to construct a document, given a fields list, and add it to an index:
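import logging
from google.appengine.api import search
# Here we do not specify a document ID, so one will be auto-generated on add.
d = search.Document(fields=fields)
try:
    add_result = search.Index(name=INDEX_NAME).add(d)
except search.Error:
    # Handle the failure here, for example by logging it.
    logging.exception('Failed to add the document to the index')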
You should catch and handle any exceptions resulting from the add() call, which will be of type search.Error.
If you want to specify the document ID, pass it to the Document constructor like this:
d = search.Document(doc_id=doc_id, fields=fields)
You can get the ID(s) of the document(s) that were added via the id property of each search.AddResult object in the list returned from the add() operation:
doc_id = add_result[0].id
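When adding documents in batch, as mentioned earlier, a long list usually has to be split into smaller lists first. Below is a minimal plain-Python sketch of that splitting step. The helper name is hypothetical, and the batch size used in the comment is an arbitrary placeholder, not a documented limit of the Search API.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Intended use inside an App Engine handler (not executed here, since it
# needs the App Engine runtime; 100 is a placeholder batch size):
#
#     for batch in chunked(documents, 100):
#         index.add(batch)
```

Each add() call on a batch should still be wrapped in the same try/except search.Error handling shown for single documents.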
Adding documents to an index makes the document content searchable. You can then perform full-text search queries over the documents in the index.
There are two ways to submit a search query. Most simply, you can pass a query string to the Index object’s search() method. Alternatively, you can create a Query object and pass that to the search() method. Constructing a query object allows you to specify query, sort, and result presentation options for your search.
In this lesson, we’ll look at how to construct simple queries using both approaches. Recall that some search queries are not fully supported on the Development Web Server (running locally), so you’ll need to run them using a deployed application.
A query string can be any Unicode string that can be parsed by the Search API’s query language. Once you’ve constructed a query string, pass it to the Index.search() method. For example:
from google.appengine.api import search
# a query string like this comes from the client
query = "stories"
try:
    index = search.Index(INDEX_NAME)
    search_results = index.search(query)
    for doc in search_results:
        # process doc ..
        pass
except search.Error:
    # ...
    pass
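Earlier, this lesson noted that a query can be filtered on the product category. One way to do that is to append a field restriction of the form field:value to the client's query string before passing it to Index.search(). The following is a minimal sketch in plain Python: the helper is hypothetical, the field name category is an assumption (the actual name is whatever docs.Product.CATEGORY is set to), and the exact query syntax should be confirmed against the Search API's query language documentation.

```python
def restrict_to_category(user_query, category, field_name="category"):
    """Append a restriction such as category:"books" to `user_query`.

    Hypothetical helper. The default field name is an assumption.
    """
    if '"' in category:
        # Escaping rules are not covered here, so quotes are rejected outright.
        raise ValueError("category must not contain double quotes")
    # Quote the value so that a multi-word category stays a single phrase.
    restriction = '%s:"%s"' % (field_name, category)
    user_query = user_query.strip()
    if not user_query:
        return restriction
    return "%s %s" % (user_query, restriction)
```

The returned string can then be used wherever the plain query string was used in the example above.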
A Query object gives you more control over your query options than does a query string. In this example, we first construct a QueryOptions object. Its arguments specify that the query should return at most doc_limit results. (If you’ve looked at the product search application code, you’ll see more complex QueryOptions objects; we’ll look at these in the following class, A Deeper Look at the Python Search API.) Next we construct the Query object using the query string and the QueryOptions object. We then pass the Query object to the Index.search() method, just as we did above with the query string.
from google.appengine.api import search
# a query string like this comes from the client
querystring = "stories"
try:
    index = search.Index(INDEX_NAME)
    search_query = search.Query(
        query_string=querystring,
        options=search.QueryOptions(
            limit=doc_limit))
    search_results = index.search(search_query)
except search.Error:
    # ...
    pass
Once you’ve submitted a query, matching search results are returned to the application in an iterable SearchResults object. This object includes the number of results found, the actual results returned, and an optional query cursor object.
The returned documents can be accessed by iterating on the SearchResults object. The number of results returned is the length of the object’s results property. The number_found property is set to the number of hits found. Iterating on the returned object gives you the returned documents, which you can process as you like:
try:
    search_results = index.search("stories")
    returned_count = len(search_results.results)
    number_found = search_results.number_found
    for doc in search_results:
        doc_id = doc.doc_id
        fields = doc.fields
        # etc.
except search.Error:
    # ...
    pass
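The distinction between the number of results returned and the number found is what tells an application whether more matches exist than it received. Here is a minimal sketch of that comparison in plain Python. FakeResults is a stand-in carrying only the two attributes discussed above; it is not the real SearchResults class, and the summarize helper is hypothetical.

```python
import collections

# Stand-in with the two attributes used below; NOT the real SearchResults class.
FakeResults = collections.namedtuple("FakeResults", ["results", "number_found"])


def summarize(search_results):
    """Compare how many results came back with how many matched overall."""
    returned = len(search_results.results)
    found = search_results.number_found
    return {
        "returned": returned,
        "found": found,
        "more_available": found > returned,
    }
```

With a real SearchResults object, the same two attribute reads apply, and a true more_available value suggests raising the limit or paging through further results.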
In this lesson, we’ve learned the basics of creating indexed documents and querying their contents. To check your knowledge, try recreating these steps yourself in your own simple application:
Create an Index object.
Build a list of fields (for example, of the TextField type) and construct a Document object with that field list. Add the document to the index.
In the next lesson, we'll take a closer look at Search API indexes.