

# Getting started with Trino
<a name="emr-trino-getting-started"></a>

The procedures in this section show you how to set up an Amazon EMR cluster in order to query metastore data sources with Trino. These metastores, which include the AWS Glue Data Catalog, store metadata and database objects and manage access permissions. The procedures cover prerequisites, recommended configuration settings, creating connectors, and running queries on metastore tables.

**Topics**
+ [Complete prerequisite steps for using Amazon EMR with Trino](emr-trino-getting-started-pre.md)
+ [Launch an Amazon EMR cluster with Trino](emr-trino-getting-started-launch.md)
+ [Connect to the primary node for the Amazon EMR cluster and run queries](emr-trino-getting-started-connect.md)

# Complete prerequisite steps for using Amazon EMR with Trino
<a name="emr-trino-getting-started-pre"></a>

If you haven't used AWS, or if you haven't created an Amazon EMR cluster, complete these prerequisite steps before you create an Amazon EMR cluster with Trino.

## AWS environment setup
<a name="emr-trino-getting-started-account"></a>

Complete these steps to configure your AWS account if you haven't already:

1. Sign up for an AWS account, if you don't have one already. For more information, see [Create an AWS account](https://docs.aws.amazon.com/accounts/latest/reference/manage-acct-creating.html) in the *AWS Account Management Reference Guide*.

1. Sign in to your account as an administrative user.

1. Create a group and assign users to it.

1. Create an Amazon EC2 key pair, which you can use later to secure communication between resources with SSH. This step is required if you plan to connect to the primary node to perform tasks. For more information, see [Connect to the Amazon EMR cluster primary node using SSH](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html).
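If you want to script the key pair step, the following is a minimal sketch rather than a definitive implementation. It assumes that the AWS SDK for Python (Boto3) is installed and that credentials are configured. The function name `save_key_pair` is hypothetical. The function calls the Amazon EC2 `create_key_pair` operation on a client that you pass in, then writes the returned private key to a file that only you can read.

```python
import os


def save_key_pair(ec2_client, key_name, pem_path):
    """Create an EC2 key pair and save its private key to pem_path.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    response = ec2_client.create_key_pair(KeyName=key_name)
    # O_EXCL refuses to overwrite an existing file; 0o600 keeps the key private
    # while it is being written.
    fd = os.open(pem_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as pem_file:
        pem_file.write(response["KeyMaterial"])
    # SSH clients typically reject private keys that other users can read.
    os.chmod(pem_path, 0o400)
    return pem_path
```

For example, `save_key_pair(boto3.client("ec2"), "my-emr-key", "my-emr-key.pem")` would create a key pair with the hypothetical name `my-emr-key` and save the private key in the current directory.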

# Launch an Amazon EMR cluster with Trino
<a name="emr-trino-getting-started-launch"></a>

The following describes the correct configuration choices when you create a cluster with Trino.

## Using a Hive connector to make data available for querying
<a name="emr-trino-getting-started-connect-hive"></a>

To query metastore data from your cluster, configure a Trino connector for a Hive metastore. A metastore is an abstraction layer that exposes file-based data as tables, which makes the data easy to query. The following procedure shows how to configure a connector in Amazon EMR so that the Hive metastore tables are available to the cluster:

1. Choose AWS Glue in the console and create a table based on your source data in Amazon S3. A table in the AWS Glue Data Catalog is the metadata definition for the data. For this tutorial, create the table manually and define columns that match your source data. For more information about creating tables in AWS Glue from semi-structured data in Amazon S3, see [Creating tables using the console](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#console-tables) in the *AWS Glue User Guide*.

1. Set your configuration as part of cluster creation. Select the **Configurations** tab. Configurations are optional specifications for your cluster. When you enter a configuration, add JSON like the following sample, which instructs Trino to use the AWS Glue Data Catalog as its external Hive metastore for table metadata:

   ```
   {
       "classification": "trino-connector-hive",
       "properties": {
           "hive.metastore": "glue"
       }
   }
   ```

   Alternatively, you can apply configurations in the **Software settings** section when you create a cluster.

   Additionally, you can set up other connector types, such as for connecting with Apache Iceberg. For more information, see [Use an Iceberg cluster with Trino](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-use-trino-cluster.html) in the *Amazon EMR Release Guide*. Configuring additional settings is optional.
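If you prefer to generate the connector configuration rather than type it, the following sketch builds the same JSON as the sample in the procedure by using only the Python standard library. The helper names `hive_glue_configuration` and `to_configurations_json` are hypothetical. The output is wrapped in a JSON array on the assumption that you supply it where a list of configurations is expected.

```python
import json


def hive_glue_configuration():
    """Return the Hive connector configuration that points Trino at AWS Glue."""
    return {
        "classification": "trino-connector-hive",
        "properties": {"hive.metastore": "glue"},
    }


def to_configurations_json(configurations):
    """Serialize one or more configuration objects as a JSON array."""
    return json.dumps(list(configurations), indent=4)


print(to_configurations_json([hive_glue_configuration()]))
```

Generating the JSON this way makes it easier to add other classifications later without hand-editing brackets and commas.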

To continue the getting-started steps, see [Connect to the primary node for the Amazon EMR cluster and run queries](emr-trino-getting-started-connect.md).

## Create a cluster with Trino
<a name="emr-trino-getting-started-launch-cluster-settings"></a>

The following describes the correct configuration choices when you create a cluster that you want to use with Trino.

**Important**  
Before you create your cluster, configure the AWS Glue Data Catalog as your Hive metastore, which we recommend for getting started. For more information, see [Using a Hive connector to make data available for querying](#emr-trino-getting-started-connect-hive).

1. In the AWS console, select Amazon EMR from the services. When you choose Amazon EMR, if you have existing clusters, your **EMR on EC2** clusters are listed.

1. Choose **Create cluster**. From here, you start the process for building a cluster.

1. Give your cluster a name and choose an **Amazon EMR release**. You can choose the most current release for the tutorial.

1. Choose the **Trino** bundle, which has the Trino application pre-selected. Bundles are set up for convenience when you know the purpose for the cluster ahead of time. Otherwise, you can simply select the check box for Trino.

1. For **Cluster configuration**, choose **Uniform instance groups**. Remove any additional instance groups.

1. Choose an **Instance type**. We generally recommend an instance type with at least 16 GiB of memory. For **Cluster scaling and provisioning**, choose **Set cluster size manually**.

1. At this point, set your Hive metastore configuration to point to AWS Glue. This is detailed in the section [Using a Hive connector to make data available for querying](#emr-trino-getting-started-connect-hive). Complete this before you build the cluster.

1. Choose **Create cluster**. It can take a few minutes to finish.

   The steps here don't cover all of the configuration steps in detail. More information about setting up a cluster is available at [Plan, configure and launch Amazon EMR clusters](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan.html).

**Note**  
Don't select both Presto and Trino for use on the same cluster. Running them together isn't supported. It's also recommended that if you run Trino, you don't run any other applications on the cluster, such as Spark.
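The console choices above can also be expressed as a request to the Amazon EMR `RunJobFlow` API. The following is a sketch under stated assumptions, not the only valid configuration: the function name `build_trino_cluster_request` is hypothetical, the default `m5.xlarge` instance type is one example that has 16 GiB of memory, and the role names assume that the default Amazon EMR roles exist in your account. Note that the API uses capitalized `Classification` and `Properties` keys.

```python
def build_trino_cluster_request(name, release_label, key_name,
                                instance_type="m5.xlarge", core_count=2):
    """Build keyword arguments for the Amazon EMR RunJobFlow API."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Trino only: Presto and other applications are left out on purpose.
        "Applications": [{"Name": "Trino"}],
        # Use the AWS Glue Data Catalog as the Hive metastore.
        "Configurations": [
            {
                "Classification": "trino-connector-hive",
                "Properties": {"hive.metastore": "glue"},
            }
        ],
        "Instances": {
            # Uniform instance groups with a fixed, manually chosen size.
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            "Ec2KeyName": key_name,
            # Keep the cluster running so that you can connect and run queries.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the default Amazon EMR roles already exist.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With Boto3 installed, you could pass the result to `boto3.client("emr").run_job_flow(**request)`, which returns a response that contains the new cluster's `JobFlowId`. Supply a release label for an Amazon EMR release that includes Trino.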

# Connect to the primary node for the Amazon EMR cluster and run queries
<a name="emr-trino-getting-started-connect"></a>

## Provision test data and configure permissions
<a name="emr-trino-getting-started-pre-data"></a>

You can test Amazon EMR with Trino by using AWS Glue Data Catalog and its Hive metastore. These prerequisite steps describe how to set up test data, if you haven't done so:

1. Create an Amazon EC2 key pair to use for SSH connections, if you haven't already.

1. You can choose from several file systems to store data and log files. To start, create an Amazon S3 bucket and give it a unique name. When you create the bucket, choose its encryption settings.

   **Note**  
   Create your storage bucket and the Amazon EMR cluster in the same AWS Region.

1. Choose the bucket you created. Choose **Create folder** and give the folder a memorable name. When you create the folder, choose a security configuration. You can choose the security settings for the parent, or make the security settings more specialized.

1. Add test data to your folder. For this tutorial, a .csv file of comma-separated records works well.

1. After you add data to an Amazon S3 bucket, configure a table in AWS Glue to provide an abstraction layer for querying the data.
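The last two steps can be sketched in Python as follows. This is an illustration under assumptions: the function names `write_sample_csv` and `csv_table_input` are hypothetical, the CSV file has no header row, and the table definition uses the Hive text format with the `LazySimpleSerDe` library and a comma delimiter. The second function only builds the `TableInput` structure that the AWS Glue `CreateTable` operation accepts. It doesn't call AWS.

```python
import csv


def write_sample_csv(path, rows):
    """Write rows (lists of values) to path as comma-separated records."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def csv_table_input(table_name, s3_location, columns):
    """Build a Glue TableInput for headerless CSV files under s3_location.

    columns is a list of (column_name, hive_type) pairs.
    """
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": hive_type}
                        for name, hive_type in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat":
                "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                # Fields in each record are separated by commas.
                "Parameters": {"field.delim": ","},
            },
        },
    }
```

With Boto3 installed, you could upload the file with `boto3.client("s3").upload_file(path, bucket, key)` and register the table with `boto3.client("glue").create_table(DatabaseName=database, TableInput=table_input)`, where the bucket, key, and database names are your own.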

## Connect and run queries
<a name="emr-trino-getting-started-run"></a>

The following describes how you connect to and run queries on a cluster running Trino. Before you do this, make sure you set up the Hive metastore connector, which is described in the previous procedure, so that metastore tables are visible.

1. We recommend using EC2 Instance Connect to connect to your cluster, because it provides a secure connection. Choose **Connect to the Primary node using SSH** from the cluster summary. The connection requires that the security group has an inbound rule that allows connections on port 22 from clients in the subnet. You must also connect as the user **hadoop**.

1. Start the Trino CLI by running `trino-cli`. The CLI lets you run commands and query data with Trino.

1. Run `show catalogs;`. This lists the available catalogs, which contain data stores or system settings. Check that the **hive** catalog is listed.

1. To see the schemas available, run `show schemas in hive;`. From here, you can run `use schema-name;` and include the name of your schema. Then you can run `show tables;` to list tables.

1. Query a table by running a command like `SELECT * FROM table-name`, using the name of a table in your schema. If you already ran the `USE` statement to connect to a specific schema, you don't have to use two-part notation such as *schema*.*table*.
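As an alternative to the interactive session, you can build fully qualified table names and one-shot CLI invocations programmatically. The following sketch uses hypothetical helper names (`quote_identifier`, `qualified_table`, and `trino_cli_command`) and assumes that `trino-cli` on the primary node accepts the Trino CLI options `--catalog`, `--schema`, and `--execute`. Identifiers are double-quoted so that names containing hyphens or other special characters remain valid.

```python
import shlex


def quote_identifier(name):
    """Double-quote an identifier, escaping any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def qualified_table(schema, table, catalog="hive"):
    """Return a fully qualified catalog.schema.table name."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


def trino_cli_command(sql, catalog="hive", schema=None):
    """Build a non-interactive trino-cli invocation as a shell-safe string."""
    args = ["trino-cli", "--catalog", catalog]
    if schema is not None:
        # Setting a schema plays the same role as running USE in the CLI.
        args += ["--schema", schema]
    args += ["--execute", sql]
    return shlex.join(args)
```

For example, `trino_cli_command("SELECT * FROM " + qualified_table("my-schema", "my-table"))` produces a command that queries a table without first running a `USE` statement. The schema and table names here are placeholders.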