


Transforming Data

Sometimes you can prepare your data for BigQuery with a simple action, such as exporting it in JSON format, but sometimes you need to transform it in a more complex way. For example, your data might be stored in XML format, and preparing it would require more than a simple export. This topic describes common scenarios where you may need to transform your data before loading it into BigQuery.


Transforming XML data

Large datasets are often represented using XML. BigQuery doesn't support directly loading XML files, but XML files can be easily converted to an equivalent JSON format or flat CSV structure.

When dealing with large XML files, it's important not to use a DOM-based parser, because it will attempt to load the entire file as an in-memory tree. Instead, we suggest using a pull-based parser or a SAX parser, both of which are far more memory efficient. Parsing a large XML file this way is straightforward in Python.

The following examples show a sample of the Wikipedia revision history data, a dataset of roughly 100 GB of XML, followed by Python code that parses it efficiently.

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en">
      <siteinfo>
        <sitename>Wikipedia</sitename>
        <base>http://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.20wmf10</generator>
        <case>first-letter</case>
        ….
      </siteinfo>
      <page>
        <title>AccessibleComputing</title>
        <id>10</id>
        <revision>
          <id>233192</id>
          <contributor>
            <id>99</id>
          </contributor>
        </revision>
        <revision>
          …
        </revision>
      </page>
    </mediawiki>
import sys
from lxml import etree

# Namespace used by the sample dump above.
NS = '{http://www.mediawiki.org/xml/export-0.7/}'

# Open the file and iterate over every <page> element without loading
# the whole document into memory.
xml_file = open(sys.argv[1], "r")
for _, element in etree.iterparse(xml_file, tag=NS + 'page'):
  title = element.find(NS + 'title').text
  id = element.find(NS + 'id').text
  wp_namespace = element.find(NS + 'ns')
  revisions = list(element.iter(NS + 'revision'))
  for revision_element in revisions:
    # We're confident that we always have a revision ID.
    revision_id = revision_element.find(NS + 'id').text
    # The comment and contributor ID are optional.
    comment = revision_element.find(NS + 'comment')
    if comment is not None and comment.text is not None:
      comment_text = comment.text
    contributor = revision_element.find(NS + 'contributor')
    if contributor is not None:
      contributor_id = contributor.find(NS + 'id')
      if contributor_id is not None:
        contributor_id_text = contributor_id.text
  # Free the memory used by the processed <page> element.
  element.clear()
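
Once the fields are extracted, each revision can be written out as newline-delimited JSON, which BigQuery can load directly. The following is a minimal sketch of that step; the output field names (title, id, revision_id, comment, contributor_id) are illustrative and should match whatever schema you define for your BigQuery table.

import json

# Write one JSON object per line (newline-delimited JSON), which is the
# layout BigQuery expects for JSON loads. The field names below are
# illustrative and must match your table's schema.
def write_revision(out_file, title, page_id, revision_id,
                   comment_text=None, contributor_id_text=None):
  row = {
      'title': title,
      'id': page_id,
      'revision_id': revision_id,
      'comment': comment_text,
      'contributor_id': contributor_id_text,
  }
  out_file.write(json.dumps(row) + '\n')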


Transforming values

It's sometimes convenient to transform individual fields of your source data from one format to another before loading it into BigQuery. For example, string data might be better expressed as a numerical value so that it can be used in aggregate queries.

Example: Convert timestamps from ISO 8601 format to POSIX format

A common human-readable timestamp format is ISO 8601 (YYYY-MM-DDTHH:MM:SS). While it's possible to store this data in BigQuery as a string field, it's inconvenient to perform queries that return data over a calculated range of time. In these cases, it's useful to convert these fields into an integer representation of the same data, such as the POSIX time format, before loading the data into BigQuery.

The following example demonstrates how to use GNU awk to transform a file containing ISO 8601 timestamps into a new file containing POSIX format timestamps.

# Example source data
201473587,737,2008,urn:nimsp:transaction,3254081d1bd4dd611e4b48d439ddd34c,"","",f,50.00,2008-10-19T10:13:29,"BOYD, RYAN"

# Use GNU awk to convert the timestamp field to posix using mktime()
cat iso8601.csv | gawk -F , 'BEGIN {FS=",";OFS=",";} {timestamp = $10; gsub(/[\-\:T]/, " ", timestamp); $10 = mktime(timestamp);} { print }' > posix.csv

# Result data
201473587,737,2008,urn:nimsp:transaction,3254081d1bd4dd611e4b48d439ddd34c,"","",f,50.00,1224436409,"BOYD, RYAN"
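
If you'd rather do the same transformation in Python, the following is a minimal sketch using the standard csv and time modules. The file names (iso8601.csv, posix.csv) and the column number are taken from the awk example above; like gawk's mktime(), time.mktime() interprets the timestamp in local time.

import csv
import time

# Read iso8601.csv, convert the ISO 8601 timestamp in the tenth column to a
# POSIX timestamp, and write the result to posix.csv. The csv module keeps
# quoted fields such as "BOYD, RYAN" intact.
with open('iso8601.csv', newline='') as src, open('posix.csv', 'w', newline='') as dst:
  reader = csv.reader(src)
  writer = csv.writer(dst)
  for row in reader:
    parsed = time.strptime(row[9], '%Y-%m-%dT%H:%M:%S')
    row[9] = str(int(time.mktime(parsed)))
    writer.writerow(row)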

