Querying Data

This document describes how to run synchronous and asynchronous queries using the BigQuery API.

Overview

Queries are written in BigQuery's SQL dialect. BigQuery supports both synchronous and asynchronous query methods. Both methods are handled by a job, but the "synchronous" method exposes a timeout value and waits until the job has finished (or the timeout expires) before returning.

All query results are saved to a table, which can be either persistent or temporary:

  • A persistent table can be a new or existing table in any dataset in which you have WRITE privileges.
  • A temporary table is a randomly named table saved in a special dataset; the table has a lifetime of approximately 24 hours. Temporary tables are not available for sharing, and are not visible using any of the standard list or other table manipulation methods.

The following combinations of query type and results table lifetime are possible:

Permanent table, synchronous query

Not available in a single call. You can call jobs.insert and specify a results table, then call jobs.getQueryResults on that job, which will wait for the query to finish and then return the results.

Permanent table, asynchronous query

Create a query job and specify the table in which to store the results. This can be a new table, or an existing table that you can either overwrite or append results to.

  • To run the query - Call jobs.insert( query_type_job ) with a results table specified. The method returns a job ID.
  • To view the results - Poll the job for DONE status; when done, call tabledata.list to get table data, or run additional queries against the new table.

Temporary table, synchronous query

Data stored in a temporary table is available only to the current user. Table lifetime is about 24 hours.

  • To run the query - Call jobs.query.
  • To view the results - If the query finishes within the specified timeout, the first page of results is returned by jobs.query. For additional pages, or if the query has expired, or to examine the results again, call jobs.getQueryResults.

Temporary table, asynchronous query

Create a query job but do not specify an output table. Results are stored in a temporary table with a lifetime of about 24 hours.

  • To run the query - Call jobs.insert( query_type_job ) without specifying a results table, and store the returned job ID.
  • To view the results - Call jobs.getQueryResults with the job ID.

Additional limits

The following additional limits apply when querying data.

  • Maximum tables per query: 1,000
  • Maximum query length: 256 KB


Synchronous queries

All synchronous query results are saved to an anonymous table with a lifetime of about 24 hours. If you want to save the data for longer than that, re-run the query asynchronously and specify a permanent destination table. Synchronous queries are handled by a job, as all queries are, and will appear in your jobs history.

  1. Call jobs.query with your query string. This method takes an optional timeout period; if set to 0 or not specified, the method will default to 10 seconds.
    • If the query returns within the specified timeout period, the method will return the first page of results. For additional results, call jobs.getQueryResults as described next.
    • If the query timeout expires before the query has finished, the method will return jobComplete=false, and you must call jobs.getQueryResults as described next. The query job will continue to run even after the timeout period until it finishes, either successfully or because an error has occurred.
  2. [If necessary] Call jobs.getQueryResults to page through additional results, to get results from a query that exceeded its timeout, or to examine any query results until the temporary table is deleted. This method lets you specify a start row, and also takes a timeout that behaves the same as the jobs.query timeout to allow waiting if the job is not yet complete.

Synchronous Query Examples

Java

This sample uses the Google APIs Client Library for Java.

  /**
   * Runs a synchronous BigQuery query and displays the result.
   *
   * @param bigquery An authorized BigQuery client
   * @param projectId The current project id
   * @param query A String containing a BigQuery SQL statement
   * @param out A PrintStream for output, normally System.out
   */
  static void runQueryRpcAndPrint(
      Bigquery bigquery, String projectId, String query, PrintStream out) throws IOException {
    QueryRequest queryRequest = new QueryRequest().setQuery(query);
    QueryResponse queryResponse = bigquery.jobs().query(projectId, queryRequest).execute();
    if (queryResponse.getJobComplete()) {
      printRows(queryResponse.getRows(), out);
      if (null == queryResponse.getPageToken()) {
        return;
      }
    }
    // This loop polls until results are present, then loops over result pages.
    // Start from the first response's page token so that rows already printed
    // above are not fetched a second time.
    String pageToken = queryResponse.getPageToken();
    while (true) {
      GetQueryResultsResponse queryResults = bigquery.jobs()
          .getQueryResults(projectId, queryResponse.getJobReference().getJobId())
          .setPageToken(pageToken).execute();
      if (queryResults.getJobComplete()) {
        printRows(queryResults.getRows(), out);
        pageToken = queryResults.getPageToken();
        if (null == pageToken) {
          return;
        }
      }
    }
  }

  private static void printRows(List<TableRow> rows, PrintStream out) {
    if (rows != null) {
      for (TableRow row : rows) {
        for (TableCell cell : row.getF()) {
          // Data.isNull() is the recommended way to check for the 'null object' in TableCell.
          out.printf("%s, ", Data.isNull(cell.getV()) ? "null" : cell.getV());
        }
        out.println();
      }
    }
  }

Python

This sample uses the Google APIs Client Library for Python.

# Run a synchronous query, polling until the job is complete, and print
# every page of results. The timeout (in milliseconds) is passed through
# to jobs.query; results are kept in a temporary table for about 24 hours.
def runSyncQuery(service, projectId, datasetId, timeout=0):
  try:
    print 'timeout:%d' % timeout
    jobCollection = service.jobs()
    queryData = {'query':'SELECT word,count(word) AS count FROM publicdata:samples.shakespeare GROUP BY word;',
                 'timeoutMs':timeout}

    queryReply = jobCollection.query(projectId=projectId,
                                     body=queryData).execute()

    jobReference=queryReply['jobReference']

    # Timeout exceeded: keep polling until the job is complete.
    while(not queryReply['jobComplete']):
      print 'Job not yet complete...'
      queryReply = jobCollection.getQueryResults(
                          projectId=jobReference['projectId'],
                          jobId=jobReference['jobId'],
                          timeoutMs=timeout).execute()

    # If the result has rows, print the rows in the reply.
    if('rows' in queryReply):
      print 'has a rows attribute'
      printTableData(queryReply, 0)
      currentRow = len(queryReply['rows'])

      # Loop through each page of data
      while('rows' in queryReply and currentRow < queryReply['totalRows']):
        queryReply = jobCollection.getQueryResults(
                          projectId=jobReference['projectId'],
                          jobId=jobReference['jobId'],
                          startIndex=currentRow).execute()
        if('rows' in queryReply):
          printTableData(queryReply, currentRow)
          currentRow += len(queryReply['rows'])

  except AccessTokenRefreshError:
    print ("The credentials have been revoked or expired, please re-run"
    "the application to re-authorize")

  except HttpError as err:
    print 'Error in runSyncQuery:'
    pprint.pprint(err.content)

  except Exception as err:
    print 'Undefined error: %s' % err

Asynchronous queries

Asynchronous queries are run by calling bigquery.jobs.insert and including the configuration.query property in the request. The method returns immediately with a job ID, and you must either poll the job periodically to check its status, or call jobs.getQueryResults, which returns when the job is complete.

Asynchronous queries are always saved to a table: this can be a new table, an existing table, or a temporary table. You can choose whether to append or overwrite data in an existing table, and whether to create a new table if none exists by that name; a configuration sketch follows the steps below.

  1. Create a query job by calling bigquery.jobs.insert. If saving results to a permanent table, populate the destinationTable property with target table information; if saving the results to a temporary table, omit that property. The method returns immediately. Save the jobId returned in the response.
  2. Check job status by calling bigquery.jobs.get with the jobId returned by the insert request. Check the status.state value to see if the job has completed successfully; if not, examine status.errorResult to learn why the job failed.
  3. Once the job reports a successful result, browse through the table data of the results table that you specified. If you got an error message, examine the error and rerun as appropriate.

Tip: An alternative to steps 2 and 3, polling job status and calling tabledata.list, is to call jobs.getQueryResults( jobId ) with no timeout, which will wait until the job is done and then return the query results at any row index that you like.
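
For example, the configuration for a query job that saves its results to a permanent table, overwriting any existing data, might look like the following sketch. The dataset and table names are placeholders, and service and projectId are assumed to be defined as in the samples below.

jobData = {
  'configuration': {
    'query': {
      'query': 'SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;',
      'destinationTable': {
        'projectId': projectId,      # assumed to be defined by the caller
        'datasetId': 'mydataset',    # placeholder dataset name
        'tableId': 'myresults'       # placeholder table name
      },
      # WRITE_TRUNCATE overwrites any existing data; WRITE_APPEND adds to it.
      'writeDisposition': 'WRITE_TRUNCATE',
      # CREATE_IF_NEEDED creates the table if it does not already exist.
      'createDisposition': 'CREATE_IF_NEEDED'
    }
  }
}
insertResponse = service.jobs().insert(projectId=projectId, body=jobData).execute()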

Asynchronous Query Examples

Java

This sample uses the Google APIs Client Library for Java.

  static void runQueryJobAndPrint(
      Bigquery bigquery, String projectId, String query, PrintStream out)
      throws IOException, InterruptedException {
    Job queuedJob = startQuery(bigquery, projectId, query, createJobId(query));
    pollUntilQueryDone(bigquery, queuedJob.getJobReference(), 1);
    printAllRows(bigquery, queuedJob, out);
  }

  private static void printAllRows(Bigquery bigquery, Job queuedJob, PrintStream out)
      throws IOException {
    String pageToken = null;
    while (true) {
      // queuedJob.getConfiguration().getQuery().getDestinationTable() points to a temporary table
      // automatically created to hold results.
      TableReference destinationTable =
          queuedJob.getConfiguration().getQuery().getDestinationTable();
      TableDataList tableDataList = bigquery.tabledata().list(
          destinationTable.getProjectId(),
          destinationTable.getDatasetId(),
          destinationTable.getTableId())
          .setPageToken(pageToken).execute();
      printRows(tableDataList.getRows(), out);
      pageToken = tableDataList.getPageToken();
      if (null == pageToken) {
        break;
      }
    }
  }

  /**
   * Creates and starts a job that queries a table.
   *
   * @param bigquery an authorized BigQuery client
   * @param projectId a String containing the current Project ID
   * @param query the actual query string
   * @param jobId unique identifier for the job being inserted
   */
  private static Job startQuery(Bigquery bigquery, String projectId, String query, String jobId)
      throws IOException {
    System.out.format("\nInserting Query Job: %s\n", query);
    JobConfigurationQuery queryConfig = new JobConfigurationQuery().setQuery(query);
    JobConfiguration config = new JobConfiguration().setQuery(queryConfig);
    // Set the job ID via a JobReference; the top-level id field is output-only.
    JobReference jobRef = new JobReference().setProjectId(projectId).setJobId(jobId);
    Job job = new Job().setJobReference(jobRef).setConfiguration(config);
    Job queuedJob = bigquery.jobs().insert(projectId, job).execute();
    System.out.format("\nJob ID of Query Job is: %s\n", jobId);
    return queuedJob;
  }

  /**
   * Polls BigQuery until the job's status is "DONE".
   *
   * @param bigquery an authorized BigQuery client
   * @param jobReference a JobReference, containing the job and project id
   * @param pauseSeconds seconds to wait between poll attempts
   */
  private static void pollUntilQueryDone(
      Bigquery bigquery, JobReference jobReference, int pauseSeconds)
      throws IOException, InterruptedException {
    // Variables that track the total query time
    long startTime = System.currentTimeMillis();
    long elapsedTime;

    while (true) {
      Job pollJob =
          bigquery.jobs().get(jobReference.getProjectId(), jobReference.getJobId()).execute();
      elapsedTime = System.currentTimeMillis() - startTime;
      System.out.format("Job status (%dms) %s: %s\n", elapsedTime, jobReference.getJobId(),
          pollJob.getStatus().getState());
      if (pollJob.getStatus().getState().equals("DONE")) {
        System.out.format(
            "Job done, processed %s bytes.\n", pollJob.getStatistics().getTotalBytesProcessed());
        return;
      }
      // Pause execution for pauseSeconds before polling job status again, to
      // reduce unnecessary calls to the BigQuery API and lower overall
      // application bandwidth.
      Thread.sleep(pauseSeconds * 1000);
    }
  }

Python

This sample uses the Google APIs Client Library for Python.


def runAsyncQuery(service, projectId):
  try:
    jobCollection = service.jobs()
    queryString = 'SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;'
    jobData = {
      'configuration': {
        'query': {
          'query': queryString,
        }
      }
    }

    insertResponse = jobCollection.insert(projectId=projectId,
                                         body=jobData).execute()

    # Get query results. Results will be available for about 24 hours.
    currentRow = 0
    queryReply = jobCollection.getQueryResults(
                      projectId=projectId,
                      jobId=insertResponse['jobReference']['jobId'],
                      startIndex=currentRow).execute()

    while(('rows' in queryReply) and currentRow < queryReply['totalRows']):
      printTableData(queryReply, currentRow)
      currentRow += len(queryReply['rows'])
      queryReply = jobCollection.getQueryResults(
                        projectId=projectId,
                        jobId=queryReply['jobReference']['jobId'],
                        startIndex=currentRow).execute()

  except HttpError as err:
    print 'Error in runAsyncQuery:'
    pprint.pprint(err.resp)

  except Exception as err:
    print 'Undefined error: %s' % err


Interactive and batch queries

By default, BigQuery runs interactive queries, meaning that the query is executed as soon as possible. Interactive queries count towards your concurrent rate limit and your daily rate limit.

BigQuery also offers batch queries. BigQuery queues each batch query on your behalf, and starts the query as soon as idle resources are available, usually within a few minutes. If BigQuery hasn't started the query within 3 hours, the job priority is changed to interactive. Batch queries don't count towards your concurrent rate limit, which can make it easier to start many queries at once.

To run a query as a batch query, set the priority field to BATCH in your job configuration, as shown in the following code examples.

Java

This sample uses the Google APIs Client Library for Java.

  static void runBatchedQueryAndPrint(
      Bigquery bigquery, String projectId, String query, PrintStream out)
      throws IOException, InterruptedException {
    Job queuedJob = startBatchQuery(bigquery, projectId, query, createJobId(query));
    // Thread.sleep(30 * 60 * 1000); // Wait 30 minutes before polling for a batched job.
    pollUntilQueryDone(bigquery, queuedJob.getJobReference(), 300);
    printAllRows(bigquery, queuedJob, out);
  }

  /**
   * A job ID is a unique identifier for jobs submitted to BigQuery. It can be created
   * automatically, but it is recommended practice to set it manually so that jobs can be
   * tracked later and duplicate jobs prevented.
   */
  private static String createJobId(String query) {
    return String.format("job_%d_%d", System.currentTimeMillis(), query.hashCode());
  }

  /**
   * Creates and starts a job that queries a table, with BATCH priority.
   *
   * @param bigquery an authorized BigQuery client
   * @param projectId a String containing the current Project ID
   * @param query the actual query string
   * @param jobId unique identifier for the job being inserted
   */
  private static Job startBatchQuery(
      Bigquery bigquery, String projectId, String query, String jobId) throws IOException {
    System.out.format("\nInserting Batched Query Job: %s\n", query);
    JobConfigurationQuery queryConfig = new JobConfigurationQuery().setQuery(query);
    queryConfig.setPriority("BATCH"); // Set query priority to batch.
    JobConfiguration config = new JobConfiguration().setQuery(queryConfig);
    // Set the job ID via a JobReference; the top-level id field is output-only.
    JobReference jobRef = new JobReference().setProjectId(projectId).setJobId(jobId);
    Job job = new Job().setJobReference(jobRef).setConfiguration(config);
    Job queuedJob = bigquery.jobs().insert(projectId, job).execute();
    System.out.format("\nJob ID of Query Job is: %s\n", jobId);
    return queuedJob;
  }

Python

This sample uses the Google APIs Client Library for Python.

def runAsyncQueryBatch(service, projectId):
  try:
    jobCollection = service.jobs()
    queryString = 'SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;'
    jobData = {
      'configuration': {
        'query': {
          'query': queryString,
          'priority': 'BATCH' # Set priority to BATCH
        }
      }
    }

    insertResponse = jobCollection.insert(projectId=projectId,
                                         body=jobData).execute()

    import time
    while True:
      status = jobCollection.get(projectId=projectId, jobId=insertResponse['jobReference']['jobId']).execute()
      currentStatus = status['status']['state']

      if 'DONE' == currentStatus:
        currentRow = 0
        queryReply = jobCollection.getQueryResults(
                       projectId=projectId,
                       jobId=insertResponse['jobReference']['jobId'],
                       startIndex=currentRow).execute()

        while(('rows' in queryReply) and currentRow < queryReply['totalRows']):
          printTableData(queryReply, currentRow)
          currentRow += len(queryReply['rows'])
          queryReply = jobCollection.getQueryResults(
                         projectId=projectId,
                         jobId=queryReply['jobReference']['jobId'],
                         startIndex=currentRow).execute()
        break  # All result pages have been printed; stop polling.
      else:
        print 'Waiting for the query to complete...'
        print 'Current status: ' + currentStatus
        print time.ctime()
        time.sleep(10)

  except HttpError as err:
    print 'Error in runAsyncQueryBatch:'
    pprint.pprint(err.resp)

  except Exception as err:
    print 'Undefined error: %s' % err

Query caching

BigQuery caches query results on a best-effort basis for increased performance. You aren't charged for cached queries, but cached queries are subject to the same quota policies as non-cached queries. BigQuery caches all queries that don't specify a destination table, including both interactive and batch queries.

Opting out of query caching

If you don't want your query to pull from the cache, you can take one of the following steps:

  • If using the BigQuery web UI, click the Enable Options button and ensure that Use Cached Results is unchecked.

  • If using the BigQuery API, set the useQueryCache property to false.
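
For example, a minimal sketch of a synchronous query request that bypasses the cache might look like the following; service and projectId are assumed to be defined as in the earlier samples.

queryData = {
  'query': 'SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;',
  # Setting useQueryCache to False forces BigQuery to run the query
  # rather than return a cached result.
  'useQueryCache': False
}
queryReply = service.jobs().query(projectId=projectId, body=queryData).execute()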

Ensuring cached query results

If you use the jobs.insert() function to run a query, you can ensure that the job returns a query result from the cache, if it exists, by setting the createDisposition property of the job configuration to CREATE_NEVER.

If the query result doesn't exist in the cache, a NOT_FOUND error is returned.
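
A minimal sketch of such a job configuration follows; service and projectId are assumed to be defined as in the earlier samples.

jobData = {
  'configuration': {
    'query': {
      'query': 'SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;',
      # CREATE_NEVER forbids creating a results table, so the job succeeds
      # only if the result can be served from the cache.
      'createDisposition': 'CREATE_NEVER'
    }
  }
}
insertResponse = service.jobs().insert(projectId=projectId, body=jobData).execute()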

Checking to see if BigQuery returned a cached result

There are two ways to determine if BigQuery returned a cached result:

  • If using the BigQuery web UI, the result string does not contain information about the number of processed bytes, and displays the word "cached".

  • If using the BigQuery API, the cacheHit property in the query result is set to true.
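
For example, after running a synchronous query as in the earlier samples, you might check the property like this:

queryReply = service.jobs().query(projectId=projectId, body=queryData).execute()
if queryReply.get('cacheHit'):
  print 'Result was served from the query cache.'
else:
  print 'Result was computed by running the query.'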


Returning large query results

Normally, queries have a maximum response size. If you plan to run a query that might return larger results, you can set allowLargeResults to true in your job configuration, as sketched after the list below.

Queries that return large results will take longer to execute, even if the result set is small, and are subject to additional limitations:

  • You must specify a destination table.
  • You can't specify a top-level ORDER BY clause.
  • Window functions can't return large query results.
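
A sketch of a job configuration with allowLargeResults enabled follows; the destination table names are placeholders, and queryString and projectId are assumed to be defined by the caller.

jobData = {
  'configuration': {
    'query': {
      'query': queryString,
      'allowLargeResults': True,
      # A destination table must be specified when allowLargeResults is set.
      'destinationTable': {
        'projectId': projectId,
        'datasetId': 'mydataset',       # placeholder dataset name
        'tableId': 'large_results'      # placeholder table name
      }
    }
  }
}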


Views

A view is a virtual table defined by a SQL query. You can query views in the browser tool, or by using a query job.

BigQuery supports up to four levels of nested views; if there are more than four levels, an INVALID_INPUT error is returned. Querying a view requires the READER role for all datasets that contain the tables in the view chain. For more information about roles, see access control.

You can create a view by using the BigQuery web UI, or by calling tables.insert() in the BigQuery API with a table resource that contains a view property.

You can edit a view by calling tables.patch() in the BigQuery API.
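
A minimal sketch of creating and then editing a view through the API follows; the dataset and view names are placeholders, and service and projectId are assumed to be defined as in the earlier samples.

viewData = {
  'tableReference': {
    'projectId': projectId,
    'datasetId': 'mydataset',           # placeholder dataset name
    'tableId': 'myview'                 # placeholder view name
  },
  # The view property holds the SQL query that defines the view.
  'view': {
    'query': 'SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;'
  }
}
service.tables().insert(projectId=projectId, datasetId='mydataset',
                        body=viewData).execute()

# Edit the view by patching its defining query.
service.tables().patch(
    projectId=projectId, datasetId='mydataset', tableId='myview',
    body={'view': {'query':
        'SELECT word FROM publicdata:samples.shakespeare GROUP BY word;'}}).execute()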


Table decorators

Normally, BigQuery performs a full column scan when running a query. You can use table decorators to perform a more cost-effective query of a subset of your data. For more information, see table decorators.
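
For example, a snapshot decorator references a table as of a relative time, in milliseconds. The following sketch queries the sample table as it existed one hour ago; service and projectId are assumed to be defined as in the earlier samples.

queryData = {
  # @-3600000 is a snapshot decorator: the table as of one hour ago.
  'query': 'SELECT word FROM [publicdata:samples.shakespeare@-3600000] LIMIT 10;'
}
queryReply = service.jobs().query(projectId=projectId, body=queryData).execute()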

