
Hive connector with Azure Storage

Zipstack Cloud features a powerful SQL querying engine on top of many types of connectors, including Trino connectors, custom connectors, and connectors from the open source Airbyte project. The underlying native connectors are Trino's connectors, and parts of the documentation for these connectors have been adapted from the connector documentation in the open source Trino project.

info

Please reach out to [email protected] if you need Hive/COS. This requires provisioning Zipstack Cloud with extra modules/properties.

All the SQL examples in this document assume that you have created a data source named hive. Replace hive with the name you used when creating your Hive data source.

The Hive connector can be configured to query Azure Standard Blob Storage and Azure Data Lake Storage Gen2 (ABFS). Azure Blobs are accessed via the Windows Azure Storage Blob (WASB) driver. This layer is built on top of the HDFS APIs and is what allows for the separation of storage from the cluster.

Trino supports both ADLS Gen1 and Gen2. With ADLS Gen2 now generally available, we recommend using ADLS Gen2. Learn more from the official documentation.

Hive connector configuration for Azure Storage credentials

To configure Trino to use the Azure Storage credentials, set the following configuration properties in the catalog properties file. It is best to use this type of configuration if the primary storage account is linked to the cluster.

The specific configuration depends on the type of storage and uses the properties from the following sections in the catalog properties file.

For more complex use cases, such as configuring multiple secondary storage accounts using Hadoop's core-site.xml, see the hive-azure-advanced-config options.

WASB storage

Property name              | Description
azure.wasb-storage-account | Storage account name of Azure Blob Storage
azure.wasb-access-key      | The decrypted access key for the Azure Blob Storage
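
For example, a minimal WASB catalog configuration might look like the following sketch; the account name exampleaccount and the key value are placeholders, not real credentials:

azure.wasb-storage-account=exampleaccount
azure.wasb-access-key=<decrypted_access_key>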

ADLS Gen2 / ABFS storage

ABFS Access Key

Property name              | Description
azure.abfs-storage-account | The name of the ADLS Gen2 storage account
azure.abfs-access-key      | The decrypted access key for the ADLS Gen2 storage account
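
The equivalent catalog configuration for ABFS with an access key follows the same pattern, again with placeholder values:

azure.abfs-storage-account=exampleaccount
azure.abfs-access-key=<decrypted_access_key>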

ABFS Service Principal OAuth

Property nameDescription
azure.abfs.oauth.endpointThe service principal / application’s OAuth 2.0 token endpoint (v1).
azure.abfs.oauth.client-idThe service principal’s client/application ID.
azure.abfs.oauth.secretA client secret for the service principal.
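
A catalog configuration using OAuth might look like the following sketch. The tenant ID, client ID, and secret are placeholders; the endpoint shown assumes the standard Azure AD v1.0 token endpoint format:

azure.abfs.oauth.endpoint=https://login.microsoftonline.com/<tenant_id>/oauth2/token
azure.abfs.oauth.client-id=<client_id>
azure.abfs.oauth.secret=<client_secret>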

When using a service principal, it must have the Storage Blob Data Owner, Contributor, or Reader role on the storage account you are using, depending on which operations you would like to use.

ADLS Gen1

While it is advised to migrate to ADLS Gen2 whenever possible, if you still choose to use ADLS Gen1, you need to include the following properties in your catalog configuration.

note

Credentials for the filesystem can be configured using the ClientCredential type. To authenticate with ADLS Gen1, you must create a new application secret for your ADLS Gen1 account's App Registration, and save this value because you won't be able to retrieve the key later. Refer to the Azure documentation for details.

ADLS properties

Property name         | Description
azure.adl-client-id   | Client (application) ID from the App Registrations for your storage account
azure.adl-credential  | Value of the new client (application) secret created
azure.adl-refresh-url | OAuth 2.0 token endpoint URL
azure.adl-proxy-host  | Proxy host and port in host:port format. Use this property to connect to an ADLS endpoint via a SOCKS proxy.
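
Putting these together, an ADLS Gen1 catalog configuration might look like the following sketch; all values are placeholders:

azure.adl-client-id=<client_id>
azure.adl-credential=<client_secret>
azure.adl-refresh-url=https://login.microsoftonline.com/<tenant_id>/oauth2/token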

Advanced configuration

All of the configuration properties for the Azure storage driver are stored in the Hadoop core-site.xml configuration file. When there are secondary storage accounts involved, we recommend configuring Trino using a core-site.xml containing the appropriate credentials for each account.

The path to the file must be configured in the catalog properties file:

config.resources=<path_to_hadoop_core-site.xml>

One way to find your account key is to ask for the connection string for the storage account. The following example uses a storage account named abfsexample; the connection string in the response contains the account key:

az storage account show-connection-string --name abfsexample
{
  "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfsexample;AccountKey=examplekey..."
}

When you have the account access key, you can add it to your core-site.xml or Java cryptography extension (JCEKS) file. Alternatively, you can have your cluster management tool set the option fs.azure.account.key.STORAGE-ACCOUNT to the account key value:

<property>
  <name>fs.azure.account.key.abfsexample.dfs.core.windows.net</name>
  <value>examplekey...</value>
</property>
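
When secondary storage accounts are involved, the core-site.xml referenced by config.resources can carry one such property per account. A minimal sketch with two hypothetical accounts, primaryexample and secondaryexample, might look like this:

<configuration>
  <property>
    <name>fs.azure.account.key.primaryexample.dfs.core.windows.net</name>
    <value>primary-account-key...</value>
  </property>
  <property>
    <name>fs.azure.account.key.secondaryexample.dfs.core.windows.net</name>
    <value>secondary-account-key...</value>
  </property>
</configuration>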

For more information, see Hadoop Azure Support: ABFS.

Accessing Azure Storage data

URI scheme to reference data

Consistent with other FileSystem implementations within Hadoop, the Azure Standard Blob and Azure Data Lake Storage Gen2 (ABFS) drivers define their own URI scheme so that resources (directories and files) may be distinctly addressed. You can access both primary and secondary storage accounts linked to the cluster with the same URI scheme. Following are example URIs for the different systems.

ABFS URI:

abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<path>/<file_name>

ADLS Gen1 URI:

adl://<data_lake_storage_gen1_name>.azuredatalakestore.net/<path>/<file_name>

Azure Standard Blob URI:

wasb[s]://<container>@<account_name>.blob.core.windows.net/<path>/<path>/<file_name>
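
As a concrete illustration, with a hypothetical storage account named exampleaccount, a container (file system) named datalake, and a file orders.orc under an orders directory, the URIs would be:

abfss://datalake@exampleaccount.dfs.core.windows.net/orders/orders.orc
adl://exampleaccount.azuredatalakestore.net/orders/orders.orc
wasbs://datalake@exampleaccount.blob.core.windows.net/orders/orders.orc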

Querying Azure Storage

You can query tables that are already configured in the Hive metastore used by your Hive catalog. To access Azure Storage data that is not yet mapped in the Hive metastore, you need to provide the schema of the data, the file format, and the data location.

For example, if you have ORC or Parquet files in an ABFS file system, you can map them with a query like the following:

-- select schema in which the table is to be defined, must already exist
USE hive.default;

-- create table
CREATE TABLE orders (
  orderkey bigint,
  custkey bigint,
  orderstatus varchar(1),
  totalprice double,
  orderdate date,
  orderpriority varchar(15),
  clerk varchar(15),
  shippriority integer,
  comment varchar(79)
) WITH (
  external_location = 'abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<path>/',
  format = 'ORC' -- or 'PARQUET'
);

Now you can query the newly mapped table:

SELECT * FROM orders;

Writing data

Prerequisites

Before you attempt to write data to Azure Storage, make sure you have configured everything necessary to read data from the storage.

Create a write schema

If the Hive metastore contains schema(s) mapped to Azure storage filesystems, you can use them to write data to Azure storage.

If you don't want to use existing schemas, or there are no appropriate schemas in the Hive metastore, you need to create a new one:

CREATE SCHEMA hive.abfs_export
WITH (location = 'abfs[s]://file_system@account_name.dfs.core.windows.net/<path>');

Write data to Azure Storage

Once you have a schema pointing to a location where you want to write the data, you can issue a CREATE TABLE AS statement and select your desired file format. The data will be written to one or more files within the abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/my_table namespace. Example:

CREATE TABLE hive.abfs_export.orders_abfs
WITH (format = 'ORC')
AS SELECT * FROM tpch.sf1.orders;
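
If the write succeeds, you can verify the result with an ordinary query against the newly created table, for example:

SELECT count(*) FROM hive.abfs_export.orders_abfs;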