Loading Data Into BigQuery

Before you can query your data, you need to load it into BigQuery. You can bulk load the data by using a job, or stream records individually.

Load jobs support two data sources:

  1. Objects in Google Cloud Storage
  2. Data sent with the job or streaming insert

Loaded data can be written to a new table, appended to an existing table, or used to overwrite a table. Data can be represented as a flat or nested/repeated schema, as described in Data formats. Each load job can load data from multiple sources, configured with the sourceUris property.
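
For illustration, the following sketch shows a minimal load job configuration that reads two CSV objects from Cloud Storage and appends them to an existing table. The project, bucket, dataset, and table names are placeholders; writeDisposition controls whether the job appends to the table (WRITE_APPEND), overwrites it (WRITE_TRUNCATE), or only writes if the table is empty (WRITE_EMPTY, the default).

# A minimal load job configuration (illustrative placeholder values).
jobData = {
  'projectId': 'my-project',                  # placeholder project ID
  'configuration': {
    'load': {
      # A single load job may list multiple source objects.
      'sourceUris': ['gs://my_bucket/data_part_1.csv',
                     'gs://my_bucket/data_part_2.csv'],
      'destinationTable': {
        'projectId': 'my-project',
        'datasetId': 'my_dataset',
        'tableId': 'my_table'
      },
      # WRITE_APPEND appends to the table; WRITE_TRUNCATE overwrites it.
      'writeDisposition': 'WRITE_APPEND'
    }
  }
}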

It can be helpful to prepare the data before loading it into BigQuery, or transform the data if needed.


Access control

Loading data into BigQuery requires the following access levels.

  • BigQuery: WRITE access for the dataset that contains the destination table. For more information, see access control.
  • Google Cloud Storage: READ access for the object in Google Cloud Storage, if loading data from Google Cloud Storage. For more information, see Access Control - Google Cloud Storage.


Quota policy

The following limits apply for loading data into BigQuery.

  • Daily limit: 1,000 load jobs per table per day (including failures); 10,000 load jobs per project per day (including failures)
  • Maximum file sizes:

    File Type   Compressed   Uncompressed
    CSV         1 GB         4 GB (with newlines in strings)
                             1 TB (without newlines in strings)
    JSON        1 GB         1 TB

  • Maximum size per load job: 1 TB across all input files
  • Maximum number of files per load job: 10,000
  • There are several additional limits that are specific to BigQuery's supported data formats. For more information, see preparing data for BigQuery.


Additional limits

The following additional limits apply for loading data into BigQuery.

  • Maximum columns per table: 10,000
  • Data format limits: Depending on which format you use to load your data, additional limits may apply. For more information, see Data formats.


Loading data from Google Cloud Storage

To load data from Google Cloud Storage:

  1. Upload your data to Google Cloud Storage.

    The easiest way to upload your data to Google Cloud Storage is to use the Google Developers Console. Be sure to upload your data to a project that has the BigQuery service activated.

  2. Create a load job pointing to the source data in Google Cloud Storage. The source URIs must be fully qualified, in the format gs://<bucket>/<object>.
  3. Check the job status.

    Call jobs.get(<jobId>) with the ID of the job returned by the initial request, and check for status.state = DONE. If the status.errorResult property is present, the request failed, and that object includes information describing what went wrong; in that case, no table is created and no data is added. If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property, as sketched below.
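
    As a minimal sketch of this status check, using the Python client library with an authorized service object and placeholder project and job IDs:

    job = service.jobs().get(projectId='my-project', jobId='my-job-id').execute()
    if job['status']['state'] == 'DONE':
      if 'errorResult' in job['status']:
        # The job failed; errorResult describes the fatal error.
        print 'Load failed:', job['status']['errorResult']
      else:
        # The job succeeded; any non-fatal errors are listed in status.errors.
        print 'Load succeeded with', len(job['status'].get('errors', [])), 'non-fatal errors'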

Example

The following Python client example loads CSV data from a Google Cloud Storage bucket, polls the resulting load job until it completes, and prints status messages on the command line.

# Python example
# Loads CSV data from Google Cloud Storage into a BigQuery table and
# polls the load job until it completes.
import pprint
import time

from apiclient.errors import HttpError

def loadTable(service, projectId, datasetId, targetTableId, sourceCSV):
  try:
    jobCollection = service.jobs()
    jobData = {
      'projectId': projectId,
      'configuration': {
          'load': {
            'sourceUris': [sourceCSV],
            'schema': {
              'fields': [
                {
                  'name': 'Name',
                  'type': 'STRING'
                },
                {
                  'name': 'Age',
                  'type': 'INTEGER'
                },
                {
                  'name': 'Weight',
                  'type': 'FLOAT'
                },
                {
                  'name': 'IsMagic',
                  'type': 'BOOLEAN'
                }
              ]
            },
            'destinationTable': {
              'projectId': projectId,
              'datasetId': datasetId,
              'tableId': targetTableId
            },
          }
        }
      }

    insertResponse = jobCollection.insert(projectId=projectId,
                                         body=jobData).execute()

    # Poll for status until the job is done, with a short pause between calls.
    while True:
      job = jobCollection.get(projectId=projectId,
                              jobId=insertResponse['jobReference']['jobId']).execute()
      if 'DONE' == job['status']['state']:
        # A DONE job may still have failed; errorResult holds the fatal error.
        if 'errorResult' in job['status']:
          print 'Error loading table:'
          pprint.pprint(job)
        else:
          print 'Done Loading!'
        return

      print 'Waiting for loading to complete...'
      time.sleep(10)

  except HttpError as err:
    print 'Error in loadTable:'
    pprint.pprint(err.resp)
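
Assuming a BigQuery service object authorized as shown in the next section's example, the function above might be called with placeholder names like this:

loadTable(service, 'my-project', 'my_dataset', 'my_table', 'gs://my_bucket/my_data.csv')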


Loading data by sending a POST request

You can load data into BigQuery by sending a multipart/related POST request to the following URL:

https://www.googleapis.com/upload/bigquery/v2/projects/<projectId>/jobs

The request body must be in the following format:

--xxx
Content-Type: application/json; charset=UTF-8

{
  "configuration": {
    "load": {
      "sourceFormat": "<required for JSON files>",
      "schema": {
        "fields": [
          {"name":"f1", "type":"STRING"},
          {"name":"f2", "type":"INTEGER"}
        ]
      },
      "destinationTable": {
        "projectId": "projectId",
        "datasetId": "datasetId",
        "tableId": "tableId"
      }
    }
  }
}
--xxx
Content-Type: application/octet-stream

<your data>
--xxx--

The request body contains several key parts:

  • Encapsulation Boundary (e.g. "xxx")

    When you create a multipart request that contains different sets of data in a single body, you need to specify a boundary parameter to indicate where each dataset starts and ends. The boundary string must be prefixed with two dashes. To indicate the end of the entire body, you must specify the boundary string in the following format:

    --<boundary>--

    You must also specify this boundary string in the overall document's Content-Type header in the following format:

    Content-Type: multipart/related; boundary=xxx

    Note that the boundary string cannot occur within the actual data of the request.

  • Load job configuration

    The request body contains two parts. The load configuration should be the first part of the request, with a Content-Type header of application/json . This section creates a load job using the JSON fields described in the job resource documentation.

  • The data of the request

    The second part of the request contains data to be loaded into BigQuery. The part must always have a Content-Type header of application/octet-stream .

Example

The following Python client sample demonstrates one way of constructing and sending a load request from a local file using the httplib2 library. The script asks for the names of a local schema file and a local data file, appends the contents of the data file to the body of the request, and then submits the request:

import sys
import json
import time

from apiclient.discovery import build
from oauth2client.file import Storage
from oauth2client.client import AccessTokenRefreshError
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.tools import run
from apiclient.errors import HttpError
import httplib2

FLOW = OAuth2WebServerFlow(
    client_id='xxxxxxx.apps.googleusercontent.com',
    client_secret='shhhhhhhhhhhh',
    scope='https://www.googleapis.com/auth/bigquery',
    user_agent='my-program-name/1.0')

def loadTable(http, service):
  projectId = raw_input("Choose your project ID: ")
  datasetId = raw_input("Choose a dataset ID: ")
  tableId = raw_input("Choose a table name to load the data to: ")

  url = "https://www.googleapis.com/upload/bigquery/v2/projects/" + projectId + "/jobs"
  newSchemaFile = raw_input("What is your schema? ")
  schema = open(newSchemaFile, 'r')

  # Create the body of the request, separated by a boundary of xxx
  newresource = ('--xxx\n' +
            'Content-Type: application/json; charset=UTF-8\n' + '\n' +
            '{\n' +
            '   "configuration": {\n' +
            '     "load": {\n' +
            '       "schema": {\n'
            '         "fields": ' + schema.read() + '\n' +
            '      },\n' +
            '      "destinationTable": {\n' +
            '        "projectId": "' + projectId + '",\n' +
            '        "datasetId": "' + datasetId + '",\n' +
            '        "tableId": "' + tableId + '"\n' +
            '      }\n' +
            '    }\n' +
            '  }\n' +
            '}\n' +
            '--xxx\n' +
            'Content-Type: application/octet-stream\n' +
            '\n')
  newDataFile = raw_input("What is your data? ")

  # Append data from the specified file to the request body
  f = open(newDataFile, 'r')
  newresource += f.read()

  # Signify the end of the body
  newresource += ('--xxx--\n')

  headers = {'Content-Type': 'multipart/related; boundary=xxx'}
  resp, content = http.request(url, method="POST", body=newresource, headers=headers)

  if resp.status != 200:
    print 'Error creating load job: ' + content
    return

  jsonResponse = json.loads(content)
  jobId = jsonResponse['jobReference']['jobId']

  # Poll for status until the job is done, with a short pause between calls.
  jobCollection = service.jobs()
  while True:
    getJob = jobCollection.get(projectId=projectId, jobId=jobId).execute()
    currentStatus = getJob['status']['state']

    if 'DONE' == currentStatus:
      # A DONE job may still have failed; errorResult holds the fatal error.
      if 'errorResult' in getJob['status']:
        print 'Error loading table: ' + str(getJob['status']['errorResult'])
      else:
        print 'Done Loading!'
      return

    print 'Waiting to load...'
    print 'Current status: ' + currentStatus
    print time.ctime()
    time.sleep(10)

def main(argv):
  # If the credentials don't exist or are invalid, run the native client
  # auth flow. The Storage object will ensure that if successful the good
  # credentials will get written back to a file.
  storage = Storage('bigquery2.dat') # Choose a file name to store the credentials.
  credentials = storage.get()
  if credentials is None or credentials.invalid:
    credentials = run(FLOW, storage)

  # Create an httplib2.Http object to handle our HTTP requests and authorize it
  # with our good credentials.
  http = httplib2.Http()
  http = credentials.authorize(http)

  service = build('bigquery','v2', http=http)

  loadTable(http, service)

if __name__ == '__main__':
  main(sys.argv)


Loading data from other Google services

App Engine Datastore data to BigQuery

BigQuery supports direct loading of App Engine Datastore backup data dumps. This feature is currently experimental, and is in a "trusted tester" phase. Sign up as a potential trusted tester to learn more about this feature.

If your application requires custom data transformation, you can build your own Datastore-to-BigQuery data pipeline. For more information, see Extracting and Transforming App Engine Datastore Data.

App Engine log files to BigQuery

log2bq is a Python-based App Engine application that provides handlers for moving App Engine log data into BigQuery via Google Cloud Storage.

There are also some open source tools for loading App Engine log files. One example is Mache, an open source Java App Engine framework for exporting App Engine logs into Google BigQuery.

Cloud Storage access and storage logs

Google Cloud Storage provides access and storage log files in CSV format, which can be imported directly into BigQuery for analysis. To access these logs, you must set up log delivery and enable logging. The schemas are available online, in JSON format, for both the storage access logs and the storage bucket data. More information is available in the Cloud Storage access logs and storage data documentation.

To load storage and access logs into BigQuery from the command line, use a command such as:

bq load --schema=cloud_storage_usage_schema.json my_dataset.usage_2012_06_18_v0 gs://my_logs/bucket_usage_2012_06_18_14_v0
