BigQuery connector
Zipstack Cloud features a powerful SQL query engine on top of many types of connectors, including Trino connectors, custom connectors, and connectors from the open source Airbyte project. The underlying native connectors are Trino's, and parts of the documentation for these connectors have been adapted from the connector documentation in the open source Trino project.
The BigQuery connector allows querying data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read data from tables.
BigQuery Storage API
The Storage API streams data in parallel directly from BigQuery via gRPC, without using Google Cloud Storage as an intermediary. It has a number of advantages over the previous export-based read flow that generally lead to better read performance:
Direct Streaming
It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.
Column Filtering
The new API allows column filtering to only read the data you are interested in. Backed by a columnar datastore, it can efficiently stream data without reading all columns.
Dynamic Sharding
The API rebalances records between readers until they all complete. This means that all Map phases will finish nearly concurrently. See this blog article on how dynamic sharding is similarly used in Google Cloud Dataflow.
Requirements
To connect to BigQuery, you need:
To enable the BigQuery Storage Read API.
Network access from your Zipstack Cloud to the Google Cloud API service endpoint. This endpoint uses HTTPS on port 443.
To configure BigQuery so that Zipstack Cloud has permissions in BigQuery.
To set up authentication, use a service account JSON key. Set the credentials-key catalog property to the contents of the JSON key file, encoded using base64.
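A minimal sketch of this setup follows; the file names are placeholders, and the exact catalog file layout in your deployment may differ (see the configuration properties table below):

# Encode the service account key file as base64
# (with GNU coreutils, add -w 0 to avoid line wrapping)
base64 service-account.json

Then, in the catalog properties file:

connector.name=bigquery
credentials-key=<base64-encoded contents of service-account.json>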
Configuration
Multiple GCP projects
The BigQuery connector can only access a single GCP project. Thus, if you
have data in multiple GCP projects, you need to create several catalogs,
each pointing to a different GCP project. For example, if you have two
GCP projects, one for sales and one for analytics, you can create
two catalogs named sales and analytics.
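As an illustrative sketch, assuming Trino-style catalog properties files; the file names and project IDs are placeholders:

# sales.properties
connector.name=bigquery
project-id=sales-gcp-project

# analytics.properties
connector.name=bigquery
project-id=analytics-gcp-project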
Configuring partitioning
By default, the connector creates one partition per 400 MB in the table
being read (before filtering). This roughly corresponds to the
maximum number of readers supported by the BigQuery Storage API. You can
configure this explicitly with the parallelism property.
BigQuery may limit the number of partitions based on server constraints.
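For example, to split reads into at most 100 partitions, you could set the following in the catalog properties; the value is a placeholder, and BigQuery may still cap the effective number:

parallelism=100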
Reading from views
The connector has preliminary support for reading from BigQuery views. Note the following caveats:
Reading from views is disabled by default. To enable it, set the views-enabled configuration property to true.
BigQuery views are not materialized by default, so the connector must materialize them before it can read them. This process affects read performance.
The materialization process can also incur additional costs on your BigQuery bill.
By default, materialized views are created in the same project and dataset as the source view. These can be changed with the optional view-materialization-project and view-materialization-dataset properties, respectively. The service account must have write permission to the target project and dataset in order to materialize the view.
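A hedged sketch of a catalog configured for view reads, assuming the service account can write to the placeholder project and dataset shown:

views-enabled=true
view-materialization-project=my-materialization-project
view-materialization-dataset=materialized_views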
Configuration properties
| Property | Description | Default |
|---|---|---|
| project-id | The Google Cloud Project ID where the data reside | Taken from the service account |
| parent-project-id | The Google Cloud Project ID to bill for the export | Taken from the service account |
| parallelism | The number of partitions to split the data into | The number of executors |
| views-enabled | Enables the connector to read from views and not only tables. Read the Reading from views section before enabling this feature. | false |
| view-expire-duration | Expiration duration for the materialized view. | 24h |
| view-materialization-project | The project where the materialized view is created | The view's project |
| view-materialization-dataset | The dataset where the materialized view is created | The view's dataset |
| skip-view-materialization | Use the REST API to access views instead of the Storage API. BigQuery BIGNUMERIC and TIMESTAMP types are unsupported. | false |
| views-cache-ttl | Duration for which the materialization of a view is cached and reused. Set to 0ms to disable the cache. | 15m |
| metadata.cache-ttl | Duration for which metadata retrieved from BigQuery is cached and reused. Set to 0ms to disable the cache. | 0ms |
| max-read-rows-retries | The number of retries in case of retryable server issues | 3 |
| credentials-key | The base64-encoded credentials key | None. See the Requirements section. |
| credentials-file | The path to the JSON credentials file | None. See the Requirements section. |
| case-insensitive-name-matching | Match dataset and table names case-insensitively | false |
| query-results-cache.enabled | Enable the query results cache | false |
| experimental.arrow-serialization.enabled | Enable Apache Arrow serialization when reading data from BigQuery. This is an experimental feature. | false |
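Putting several of these properties together, a hypothetical catalog configuration might look like the following; the project ID, file path, and durations are placeholders:

connector.name=bigquery
project-id=my-gcp-project
credentials-file=/etc/secrets/bigquery-service-account.json
metadata.cache-ttl=5m
case-insensitive-name-matching=true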
Type mapping
Because Trino and BigQuery each support types that the other does not,
this connector modifies some types when reading or writing data. Data
types may not map the same way in both directions between Trino and the
data source. Refer to the following sections for type mapping in each
direction.
BigQuery type to Trino type mapping
The connector maps BigQuery types to the corresponding Trino types according to the following table:
| BigQuery type | Trino type | Notes |
|---|---|---|
| BOOLEAN | BOOLEAN | |
| INT64 | BIGINT | INT, SMALLINT, INTEGER, BIGINT, TINYINT, and BYTEINT are aliases for INT64 in BigQuery. |
| FLOAT64 | DOUBLE | |
| NUMERIC | DECIMAL(P,S) | The default precision and scale of NUMERIC is (38, 9). |
| BIGNUMERIC | DECIMAL(P,S) | Precision > 38 is not supported. The default precision and scale of BIGNUMERIC is (77, 38). |
| DATE | DATE | |
| DATETIME | TIMESTAMP(6) | |
| STRING | VARCHAR | |
| BYTES | VARBINARY | |
| TIME | TIME(6) | |
| TIMESTAMP | TIMESTAMP(6) WITH TIME ZONE | Time zone is UTC |
| GEOGRAPHY | VARCHAR | In Well-known text (WKT) format |
| ARRAY | ARRAY | |
| RECORD | ROW | |
No other types are supported.
Trino type to BigQuery type mapping
The connector maps Trino types to the corresponding BigQuery types according to the following table:
| Trino type | BigQuery type | Notes |
|---|---|---|
| BOOLEAN | BOOLEAN | |
| VARBINARY | BYTES | |
| DATE | DATE | |
| DOUBLE | FLOAT | |
| BIGINT | INT64 | INT, SMALLINT, INTEGER, BIGINT, TINYINT, and BYTEINT are aliases for INT64 in BigQuery. |
| DECIMAL(P,S) | NUMERIC | The default precision and scale of NUMERIC is (38, 9). |
| VARCHAR | STRING | |
| TIMESTAMP(6) | DATETIME | |
No other types are supported.
System tables
For each Trino table that maps to a BigQuery view, there is a corresponding
system table that exposes the BigQuery view definition. Given a BigQuery view
example_view, you can run the query
SELECT * FROM "example_view$view_definition" to see the SQL that defines the
view in BigQuery.
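For example, fully qualified with the example catalog and web schema used in the queries below (all names are placeholders):

SELECT *
FROM example.web."example_view$view_definition";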
Special columns
In addition to the defined columns, the BigQuery connector exposes partition information in a number of hidden columns:
$partition_date: Equivalent to the _PARTITIONDATE pseudo-column in BigQuery
$partition_time: Equivalent to the _PARTITIONTIME pseudo-column in BigQuery
You can use these columns in your SQL statements like any other column. They can be selected directly, or used in conditional statements. For example, you can inspect the partition date and time for each record:
SELECT *, "$partition_date", "$partition_time"
FROM example.web.page_views;
Retrieve all records stored in the partition
_PARTITIONDATE = '2022-04-07':
SELECT *
FROM example.web.page_views
WHERE "$partition_date" = date '2022-04-07';
The two special partitions __NULL__ and __UNPARTITIONED__ are not
supported.
SQL support
The connector provides read and write access to data and metadata in the
BigQuery database. In addition to the
globally available and read operation statements, the connector
supports the following features:

INSERT
TRUNCATE
CREATE TABLE
CREATE TABLE AS
DROP TABLE
CREATE SCHEMA
DROP SCHEMA
COMMENT
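A brief sketch of the write path follows; the catalog, schema, and table names are placeholders:

CREATE SCHEMA example.reports;

CREATE TABLE example.reports.top_pages (page_url VARCHAR, views BIGINT);

INSERT INTO example.reports.top_pages
VALUES ('/home', 42);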
Table functions
The connector provides specific table functions to access BigQuery.
query(varchar) -> table
The query function allows you to query the underlying BigQuery database
directly. It requires syntax native to BigQuery, because the full query
is pushed down and processed by BigQuery. This can be useful for
accessing native features which are not available in Trino or for
improving query performance in situations where running a query natively
may be faster.
Note
Polymorphic table functions may not preserve the order of the query
result. If the table function contains a query with an ORDER BY
clause, the function result may not be ordered as expected.
For example, query the example catalog and group and concatenate all
employee IDs by manager ID:
SELECT
*
FROM
TABLE(
example.system.query(
query => 'SELECT
manager_id, STRING_AGG(employee_id)
FROM
company.employees
GROUP BY
manager_id'
)
);
FAQ
What is the pricing for the Storage API?
See the BigQuery pricing documentation.