Standard Mapreduce Input Readers and Output Writers

Experimental!

Mapreduce is an experimental, innovative, and rapidly changing new feature for Google App Engine. Unfortunately, being on the bleeding edge means that we may make backwards-incompatible changes to Mapreduce. We will inform the community when this feature is no longer experimental.


The App Engine Mapreduce library provides these standard input readers and output writers:

  1. BlobstoreLineInputReader
  2. BlobstoreZipInputReader
  3. BlobstoreZipLineInputReader
  4. BlobstoreOutputWriter
  5. DatastoreInputReader
  6. DatastoreKeyInputReader
  7. FileOutputWriter
  8. NamespaceInputReader
  9. RecordsReader

About Readers and Writers

The standard input readers are designed to read data from a specific kind of storage, such as Blobstore or the datastore, and supply it to the mapper function; the standard output writers write data from the reducer function to a specific kind of storage, for example the datastore or Blobstore. You don't instantiate, invoke, or write to the input readers or output writers; all of the interaction with the readers and writers is done for you by the MapreducePipeline object. You simply tell your MapreducePipeline object which reader and which output writer to use, and you provide the MapreducePipeline with the reader and writer parameters.

The illustration below is a representation of a MapreducePipeline object with its constructor specifying a word count job, corresponding mapper and reducer functions, and the input reader and output writer to be used. Notice the "mapper_params" and "reducer_params". Those parameters are actually for the reader and writer, respectively. Notice also how the reader and writer are specified, using the Mapreduce library.
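
A constructor along those lines might look roughly like the following sketch, modeled on the word count example from the Mapreduce Made Easy demo; the handler paths (main.word_count_map, main.word_count_reduce), the blob key, and the shard count are illustrative:

from mapreduce import mapreduce_pipeline

blobkey = "my-uploaded-blob-key"  # illustrative: the blob to process

pipeline = mapreduce_pipeline.MapreducePipeline(
    "word_count",                                       # job name
    "main.word_count_map",                              # mapper function
    "main.word_count_reduce",                           # reducer function
    "mapreduce.input_readers.BlobstoreZipInputReader",  # input reader
    "mapreduce.output_writers.BlobstoreOutputWriter",   # output writer
    mapper_params={
        "blob_key": blobkey,        # parameters for the input reader
    },
    reducer_params={
        "mime_type": "text/plain",  # parameters for the output writer
    },
    shards=16)
pipeline.start()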

Standard Input Readers and Output Writers

The following reference describes each of the standard input readers and output writers and the parameters it accepts. Reader parameters are supplied through mapper_params and writer parameters through reducer_params.

BlobstoreLineInputReader
Reads a line (\n) delimited text file one line at a time from Blobstore. It calls the mapper function once with each line, passing the mapper a tuple of the byte offset in the file of the first character in the line and the line as a string, not including the trailing newline: (byte_offset, line_value).
Parameters:
  • blob_keys Either a string containing the blob key, or a list containing multiple blob keys, specifying the data to be read by the reader.
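
As a sketch, a mapper written for this reader unpacks that tuple and yields whatever key/value pairs the reduce step expects; the word count output below is illustrative:

def word_count_line_map(data):
  """Illustrative mapper for BlobstoreLineInputReader."""
  (byte_offset, line_value) = data  # line_value has no trailing newline
  for word in line_value.split():
    yield (word, "")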

BlobstoreZipInputReader
Iterates over all of the compressed files within the specified zipfile in Blobstore. It calls the mapper function once for each file, passing it a tuple of the zipfile.ZipInfo entry for the file and a parameterless function that your mapper calls to return the complete body of the file as a string: (zipinfo, file_callable). The following snippet shows how your mapper might extract each file's data in each iteration:

def word_count_map(data):
  """Word count map function."""
  (entry, text_fn) = data
  text = text_fn()

Parameters:
  • blob_key A string containing the blob key specifying the zip file data to be read by the reader.

BlobstoreZipLineInputReader
Iterates over all of the compressed files, each of which must contain line (\n) delimited data, within the specified zipfile in Blobstore. It calls the mapper function once for each line in each file, passing a tuple consisting of the byte offset in the file of the first character in the line and the line as a string, not including the trailing newline: (byte_offset, line_value).
Parameters:
  • blob_keys Either a string containing the blob key, or a list containing multiple blob keys, specifying the zip file data to be read by the reader.

BlobstoreOutputWriter
Writes data from the reducer function to Blobstore, automatically assigning a filename. To retrieve the filename, you must use the completed mapreduce pipeline, as demonstrated by the StoreOutput function in the Mapreduce Made Easy demo.
Parameters:
  • mime_type MIME content type of the output blob. For example, "text/plain".
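
A minimal sketch of that retrieval, loosely following the StoreOutput pattern from the demo (the pipeline class names and handler paths are illustrative): the completed MapreducePipeline yields the list of filenames the writer produced, which a child pipeline can then record.

from mapreduce import base_handler, mapreduce_pipeline

class StoreOutput(base_handler.PipelineBase):
  """Illustrative child pipeline that receives the writer's output."""
  def run(self, output):
    # 'output' is the list of file names the output writer created;
    # output[0] identifies the blob that holds the reduce results.
    pass  # e.g. save output[0] where your application can find it

class WordCountPipeline(base_handler.PipelineBase):
  """Illustrative parent pipeline."""
  def run(self, blobkey):
    output = yield mapreduce_pipeline.MapreducePipeline(
        "word_count",
        "main.word_count_map",
        "main.word_count_reduce",
        "mapreduce.input_readers.BlobstoreZipInputReader",
        "mapreduce.output_writers.BlobstoreOutputWriter",
        mapper_params={"blob_key": blobkey},
        reducer_params={"mime_type": "text/plain"},
        shards=16)
    yield StoreOutput(output)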

DatastoreInputReader
Iterates over and returns all instances of the specified entity kind (entity_kind) from the datastore, automatically advancing to the next unread entities. Each iteration returns the number of entities specified by the batch_size parameter. This reader does no filtering: you would need to do any required filtering in your mapper.
Parameters:
  • entity_kind The datastore kind to map over.
  • namespace The namespace that will be searched for entity_kinds.
  • batch_size The number of entities to read from the datastore with each batch get. Default is 50.
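
For instance (a sketch; the UserPhoto kind and its properties are illustrative), a mapper for this reader receives each entity instance and does any filtering itself:

def photo_size_map(entity):
  """Illustrative mapper for DatastoreInputReader with entity_kind UserPhoto."""
  # The reader passes each entity directly; filter here, since the
  # reader itself does no filtering.
  if entity.size:
    yield (entity.owner, str(entity.size))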

DatastoreKeyInputReader
Iterates over and returns all keys of the entities of the specified entity_kind in the datastore, automatically advancing to the next unread keys. Each iteration returns the number of keys specified by the batch_size parameter. This reader does no filtering: you would need to do any required filtering in your mapper.
Parameters:
  • entity_kind The datastore kind whose keys are to be returned.
  • namespace The namespace that will be searched for entity_kinds.
  • batch_size The number of keys to read from the datastore with each batch get. Default is 50.
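
A corresponding sketch for the key reader; the mapper receives only the key, so it can work without fetching the entity (the yielded pair is illustrative):

def key_kind_map(key):
  """Illustrative mapper for DatastoreKeyInputReader."""
  # 'key' is a datastore key; the entity itself is not fetched.
  yield (key.kind(), "1")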

FileOutputWriter
Writes output data to Blobstore or Google Cloud Storage, automatically assigning a filename. To retrieve the filename, you must use the completed MapreducePipeline, as demonstrated by the StoreOutput function in main.py, part of the Mapreduce Made Easy demo.
Parameters:
  • filesystem The type of output storage: blobstore or gs.
  • mime_type The MIME content type of the written data. For example, text/plain.
  • gs_bucket_name For the gs filesystem, the bucket name and directory. For example, mybucket/dir1/dir2.
  • output_sharding Controls the number of output files. Only input is supported, meaning the number of output files equals the number of input shards.
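
For example, to send output to Google Cloud Storage, the writer parameters (passed through reducer_params when using MapreducePipeline) might look like this sketch; the bucket name and path are illustrative:

reducer_params = {
    "filesystem": "gs",                      # write to Google Cloud Storage
    "gs_bucket_name": "mybucket/dir1/dir2",  # illustrative bucket and path
    "mime_type": "text/plain",
    "output_sharding": "input",              # one output file per input shard
}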

NamespaceInputReader
Iterates over and returns the available namespaces.
Parameters:
  • namespace_range The range of namespaces that will be iterated over.
  • batch_size The number of namespaces to return in each iteration. Default is 10.

RecordsReader
Reads a list of files obtained via the Files API in records format, yielding each record as a string in each iteration.
Parameters:
  • files Either a string containing the file to be read, or a list containing multiple strings of files to be read.

About Customized Readers and Writers

The standard input readers and output writers should suffice for most use cases. If you need a reader that handles a different input source or format, or a writer that writes to a different location or output format than the standard ones, contact Google to ask whether it can be added to the standard readers and writers.

Alternatively, if you want to write your own reader or writer, you can look at the open source code for the existing readers and writers to see how to do this.
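
As a rough sketch of what that involves, a custom reader subclasses mapreduce.input_readers.InputReader and implements the methods the framework uses to validate parameters, split the input into shards, read values, and checkpoint its position. The example below (a trivial reader over a range of integers) assumes that interface; method names and details may differ between library versions:

from mapreduce import input_readers

class RangeInputReader(input_readers.InputReader):
  """Illustrative custom reader that yields the integers 0..count-1."""

  def __init__(self, current, end):
    self._current = current
    self._end = end

  def next(self):
    # Return the next value for the mapper, or signal shard completion.
    if self._current >= self._end:
      raise StopIteration()
    value = self._current
    self._current += 1
    return value

  def to_json(self):
    # Serialize reader state so an interrupted shard can resume.
    return {"current": self._current, "end": self._end}

  @classmethod
  def from_json(cls, state):
    return cls(state["current"], state["end"])

  @classmethod
  def split_input(cls, mapper_spec):
    # Produce one reader per shard; a real reader would divide the work.
    count = int(mapper_spec.params.get("count", 100))
    return [cls(0, count)]

  @classmethod
  def validate(cls, mapper_spec):
    # Check mapper_spec.params here; raise the library's
    # BadReaderParamsError if a required parameter is missing or invalid.
    pass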

