Hudi connector
Zipstack Cloud features a powerful SQL query engine that runs on top of many types of connectors: Trino connectors, custom connectors, and connectors from the open source Airbyte project. The underlying native connectors are Trino's connectors, and parts of the documentation for these connectors have been adapted from the connector documentation in the open source Trino project.
The Hudi connector enables querying Hudi tables.
Requirements
To use the Hudi connector, you need:
- Network access from Zipstack Cloud to the Hudi storage.
- Access to a Hive metastore service (HMS).
- Network access from Zipstack Cloud to the HMS.
Configuration
The connector requires a Hive metastore for table metadata and supports
the same metastore configuration properties as the
Hive connector. At a minimum, hive.metastore.uri
must be configured. The connector recognizes Hudi tables synced to the
metastore by the Hudi sync
tool.
To create a catalog (data source) that uses the Hudi connector, set the hive.metastore.uri property to the URI of your
Hive metastore Thrift service:

```
hive.metastore.uri=thrift://example.net:9083
```
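Since the underlying connector is Trino's, a minimal catalog configuration can be sketched as a Trino-style properties file. The file name example.properties and the metastore host below are placeholders:

```
# example.properties (hypothetical catalog file)
# connector.name selects the Hudi connector
connector.name=hudi
hive.metastore.uri=thrift://example.net:9083
```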
Additionally, the following configuration properties can be set, depending on the use case.
| Property name | Description | Default |
|---|---|---|
| hudi.metadata-enabled | Fetch the list of file names and sizes from metadata rather than from storage. | false |
| hudi.columns-to-hide | List of column names that are hidden from the query output. It can be used to hide Hudi meta fields. By default, no fields are hidden. | |
| hudi.parquet.use-column-names | Access Parquet columns by name from the file. If disabled, columns are accessed by index. Only applicable to the Parquet file format. | true |
| parquet.optimized-reader.enabled | Whether batched column readers are used when reading Parquet files, for improved performance. Set this property to false to disable the optimized Parquet reader by default. The equivalent catalog session property is parquet_optimized_reader_enabled. | true |
| parquet.optimized-nested-reader.enabled | Whether batched column readers are used when reading ARRAY, MAP, and ROW types from Parquet files, for improved performance. Set this property to false to disable the optimized Parquet reader by default for structural data types. The equivalent catalog session property is parquet_optimized_nested_reader_enabled. | true |
| hudi.min-partition-batch-size | Minimum number of partitions returned in a single batch. | 10 |
| hudi.max-partition-batch-size | Maximum number of partitions returned in a single batch. | 100 |
| hudi.size-based-split-weights-enabled | Unlike uniform splitting, size-based splitting ensures that each batch of splits has enough data to process. Enabled by default to improve performance. | true |
| hudi.standard-split-weight-size | The split size corresponding to the standard weight (1.0) when size-based split weights are enabled. | 128MB |
| hudi.minimum-assigned-split-weight | Minimum weight that a split can be assigned when size-based split weights are enabled. | 0.05 |
| hudi.max-splits-per-second | Rate at which splits are queued for processing. The queue is throttled if this rate limit is breached. | Integer.MAX_VALUE |
| hudi.max-outstanding-splits | Maximum outstanding splits in a batch enqueued for processing. | 1000 |
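As an illustration, a catalog that hides the Hudi meta fields and reads file listings from metadata might add the entries below. This is a sketch, assuming hudi.columns-to-hide accepts a comma-separated list of column names:

```
# Hypothetical additions to the catalog configuration
hudi.metadata-enabled=true
hudi.columns-to-hide=_hoodie_commit_time,_hoodie_record_key
```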
Supported file types
The connector supports the Parquet file format.
SQL support
The connector provides read access to data in Hudi tables that have
been synced to the Hive metastore. The globally available and
read operation statements are supported.
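The catalog session properties listed in the configuration table can be changed per session with SET SESSION. As a sketch, assuming a catalog named example, the optimized Parquet reader could be disabled for the current session:

```
-- Disable the optimized Parquet reader for this session only;
-- the catalog configuration itself is unchanged.
SET SESSION example.parquet_optimized_reader_enabled = false;
```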
Supported query types
Hudi supports two types of tables, depending on how the data is indexed and laid out on the file system. Snapshot queries return the latest committed state of the table, while read-optimized queries return only the data available in the columnar base files. The following table shows the support matrix of table types and query types for the connector.
| Table type | Supported query type |
|---|---|
| Copy on write | Snapshot queries |
| Merge on read | Read-optimized queries |
Example queries
In the queries below, stock_ticks_cow is a Hudi copy-on-write table
that is referenced in the Hudi quickstart documentation.
Here are some sample queries:
```
USE example.example_schema;

SELECT symbol, max(ts)
FROM stock_ticks_cow
GROUP BY symbol
HAVING symbol = 'GOOG';
```

```
 symbol |        _col1        |
--------+---------------------+
 GOOG   | 2018-08-31 10:59:00 |
(1 row)
```
```
SELECT dt, symbol
FROM stock_ticks_cow
WHERE symbol = 'GOOG';
```

```
     dt     | symbol |
------------+--------+
 2018-08-31 | GOOG   |
(1 row)
```
```
SELECT dt, count(*)
FROM stock_ticks_cow
GROUP BY dt;
```

```
     dt     | _col1 |
------------+-------+
 2018-08-31 |    99 |
(1 row)
```
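Unless hidden with hudi.columns-to-hide, Hudi meta fields such as _hoodie_commit_time and _hoodie_record_key are exposed alongside the data columns. A hypothetical query inspecting them on the same table:

```
-- Hudi meta fields queried like ordinary columns
SELECT _hoodie_commit_time, _hoodie_record_key, symbol
FROM stock_ticks_cow
LIMIT 10;
```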