

# Configuring an external metastore for Hive
<a name="emr-metastore-external-hive"></a>

By default, Hive records metastore information in a MySQL database on the primary node's file system. The metastore contains a description of the table and the underlying data on which it is built, including the partition names, data types, and so on. When a cluster terminates, all cluster nodes shut down, including the primary node. When this happens, local data is lost because node file systems use ephemeral storage. If you need the metastore to persist, you must create an *external metastore* that exists outside the cluster.

You have two options for an external metastore:
+ AWS Glue Data Catalog (Amazon EMR release 5.8.0 or later only).

  For more information, see [Using the AWS Glue Data Catalog as the metastore for Hive](emr-hive-metastore-glue.md).
+ Amazon RDS or Amazon Aurora.

  For more information, see [Using an external MySQL database or Amazon Aurora](emr-hive-metastore-external.md).

**Note**  
If you're using Hive 3 and encounter too many connections to the Hive metastore, configure the `datanucleus.connectionPool.maxPoolSize` parameter to a smaller value or increase the number of connections that the database server can handle. The increased number of connections is due to the way Hive computes the maximum number of JDBC connections. To calculate the optimal value for performance, see [Hive Configuration Properties](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-datanucleus.connectionPool.maxPoolSize.1).
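
For example, to lower the connection pool size, you could apply a `hive-site` classification like the following. The value `10` here is only an illustrative starting point; tune it against your database server's connection limit.

```
[
  {
    "Classification": "hive-site",
    "Properties": {
      "datanucleus.connectionPool.maxPoolSize": "10"
    }
  }
]
```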

# Using the AWS Glue Data Catalog as the metastore for Hive
<a name="emr-hive-metastore-glue"></a>

Using Amazon EMR release 5.8.0 or later, you can configure Hive to use the AWS Glue Data Catalog as its metastore. We recommend this configuration when you require a persistent metastore or a metastore shared by different clusters, services, applications, or AWS accounts.

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. The AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore. AWS Glue crawlers can automatically infer schema from source data in Amazon S3 and store the associated metadata in the Data Catalog. For more information about the Data Catalog, see [Populating the AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html) in the *AWS Glue Developer Guide*.

Separate charges apply for AWS Glue. There is a monthly rate for storing and accessing the metadata in the Data Catalog, an hourly rate billed per minute for AWS Glue ETL jobs and crawler runtime, and an hourly rate billed per minute for each provisioned development endpoint. The Data Catalog allows you to store up to a million objects at no charge. If you store more than a million objects, you are charged USD $1 for each 100,000 objects over a million. An object in the Data Catalog is a table, partition, or database. For more information, see [AWS Glue Pricing](https://aws.amazon.com/glue/pricing).

**Important**  
If you created tables using Amazon Athena or Amazon Redshift Spectrum before August 14, 2017, databases and tables are stored in an Athena-managed catalog, which is separate from the AWS Glue Data Catalog. To integrate Amazon EMR with these tables, you must upgrade to the AWS Glue Data Catalog. For more information, see [Upgrading to the AWS Glue Data Catalog](https://docs.aws.amazon.com/athena/latest/ug/glue-upgrade.html) in the *Amazon Athena User Guide*.

## Specifying AWS Glue Data Catalog as the metastore
<a name="emr-hive-glue-configure"></a>

You can specify the AWS Glue Data Catalog as the metastore using the AWS Management Console, AWS CLI, or Amazon EMR API. When you use the CLI or API, you use the configuration classification for Hive to specify the Data Catalog. In addition, with Amazon EMR 5.16.0 and later, you can use the configuration classification to specify a Data Catalog in a different AWS account. When you use the console, you can specify the Data Catalog using **Advanced Options** or **Quick Options**.

------
#### [ Console ]

**To specify AWS Glue Data Catalog as the Hive metastore with the console**

1. Sign in to the AWS Management Console, and open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Application bundle**, choose **Core Hadoop**, **HBase**, or **Custom**. If you customize your cluster, make sure that you select Hive or HCatalog as one of your applications.

1. Under **AWS Glue Data Catalog settings**, select the **Use for Hive table metadata** check box.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To specify the AWS Glue Data Catalog as the Hive metastore with the AWS CLI**

For more information about specifying a configuration classification using the AWS CLI and EMR API, see [Configure applications](emr-configure-apps.md).
+ Specify the value for `hive.metastore.client.factory.class` using the `hive-site` configuration classification as shown in the following example:

  ```
  [
    {
      "Classification": "hive-site",
      "Properties": {
        "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
      }
    }
  ]
  ```

  On Amazon EMR release versions 5.28.0, 5.28.1, 5.29.0, or 6.x, if you create a cluster that uses the AWS Glue Data Catalog as the metastore, set `hive.metastore.schema.verification` to `false`. This prevents Hive and HCatalog from validating the metastore schema against MySQL. Without this configuration, the primary instance group becomes suspended after a reconfiguration of Hive or HCatalog.

  ```
  [
    {
      "Classification": "hive-site",
      "Properties": {
        "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
        "hive.metastore.schema.verification": "false"
      }
    }
  ]
  ```

  If you already have a cluster on Amazon EMR release version 5.28.0, 5.28.1, or 5.29.0, you can set `hive.metastore.schema.verification` to `false` on the primary instance group by applying the following reconfiguration:

  ```
  Classification = hive-site
  Property       = hive.metastore.schema.verification
  Value          = false
  ```

  To specify a Data Catalog in a different AWS account, add the `hive.metastore.glue.catalogid` property as shown in the following example. Replace `acct-id` with the AWS account ID of the Data Catalog.

  ```
  [
    {
      "Classification": "hive-site",
      "Properties": {
        "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
        "hive.metastore.schema.verification": "false",
        "hive.metastore.glue.catalogid": "acct-id"
      }
    }
  ]
  ```
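
  With the classification saved to a file (here, a hypothetical `glueConfiguration.json`), you can reference it when you create the cluster, for example:

  ```
  aws emr create-cluster --release-label emr-7.12.0 --instance-type m5.xlarge --instance-count 2 \
  --applications Name=Hive --configurations file://glueConfiguration.json --use-default-roles
  ```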

------

## IAM permissions
<a name="emr-hive-glue-permissions"></a>

The EC2 instance profile for a cluster must have IAM permissions for AWS Glue actions. In addition, if you enable encryption for AWS Glue Data Catalog objects, the role must also be allowed to encrypt, decrypt, and generate data keys using the AWS KMS key used for encryption.

### Permissions for AWS Glue actions
<a name="emr-hive-glue-permissions-actions"></a>

If you use the default EC2 instance profile for Amazon EMR, no action is required. The `AmazonElasticMapReduceforEC2Role` managed policy that is attached to the `EMR_EC2_DefaultRole` allows all necessary AWS Glue actions. However, if you specify a custom EC2 instance profile and permissions, you must configure the appropriate AWS Glue actions. Use the `AmazonElasticMapReduceforEC2Role` managed policy as a starting point. For more information, see [Service role for cluster EC2 instances (EC2 instance profile)](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-for-ec2.html) in the *Amazon EMR Management Guide*.

### Permissions for encrypting and decrypting AWS Glue Data Catalog
<a name="emr-hive-glue-permissions-encrypt"></a>

Your instance profile needs permission to encrypt and decrypt data using your key. You do *not* need to configure these permissions if both of the following statements apply:
+ You enable encryption for AWS Glue Data Catalog objects using managed keys for AWS Glue.
+ You use a cluster that's in the same AWS account as the AWS Glue Data Catalog.

Otherwise, you must add a statement to the permissions policy attached to your EC2 instance profile that allows the required AWS KMS actions on your key.
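
The exact statement depends on your key configuration, but a minimal sketch looks like the following. The action list and the key ARN are assumptions for illustration; replace the ARN with that of the KMS key used to encrypt your Data Catalog.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey"
      ],
      "Resource": "arn:aws:kms:region:acct-id:key/key-id"
    }
  ]
}
```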

For more information about AWS Glue Data Catalog encryption, see [Encrypting your data catalog](https://docs.aws.amazon.com/glue/latest/dg/encrypt-glue-data-catalog.html) in the *AWS Glue Developer Guide*.

### Resource-based permissions
<a name="emr-hive-glue-permissions-resource"></a>

If you use AWS Glue in conjunction with Hive, Spark, or Presto in Amazon EMR, AWS Glue supports resource-based policies to control access to Data Catalog resources. These resources include databases, tables, connections, and user-defined functions. For more information, see [AWS Glue Resource Policies](https://docs.aws.amazon.com/glue/latest/dg/glue-resource-policies.html) in the *AWS Glue Developer Guide*.

When using resource-based policies to limit access to AWS Glue from within Amazon EMR, the principal that you specify in the permissions policy must be the role ARN associated with the EC2 instance profile that is specified when a cluster is created. For example, for a resource-based policy attached to a catalog, you can specify the role ARN for the default service role for cluster EC2 instances, *EMR\_EC2\_DefaultRole*, as the `Principal`, using the format shown in the following example:

```
arn:aws:iam::acct-id:role/EMR_EC2_DefaultRole
```

The *acct-id* can be different from the AWS Glue account ID. This enables access from EMR clusters in different accounts. You can specify multiple principals, each from a different account.
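
As a sketch, a Data Catalog resource policy that grants read access to such a role might look like the following. The action list, region, and resource ARN are illustrative assumptions only; scope them to what your clusters actually need.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::acct-id:role/EMR_EC2_DefaultRole"
      },
      "Action": [
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetPartitions"
      ],
      "Resource": "arn:aws:glue:region:glue-acct-id:*"
    }
  ]
}
```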

## Considerations when using AWS Glue Data Catalog
<a name="emr-hive-glue-considerations-hive"></a>

Consider the following items when using the AWS Glue Data Catalog as the metastore with Hive:
+ Adding auxiliary JARs using the Hive shell is not supported. As a workaround, use the `hive-site` configuration classification to set the `hive.aux.jars.path` property, which adds auxiliary JARs into the Hive classpath.
+ [Hive transactions](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions) are not supported.
+ Renaming tables from within AWS Glue is not supported.
+ When you create a Hive table without specifying a `LOCATION`, the table data is stored in the location specified by the `hive.metastore.warehouse.dir` property. By default, this is a location in HDFS. If another cluster needs to access the table, it fails unless it has adequate permissions to the cluster that created the table. Furthermore, because HDFS storage is transient, if the cluster terminates, the table data is lost, and the table must be recreated. We recommend that you specify a `LOCATION` in Amazon S3 when you create a Hive table using AWS Glue. Alternatively, you can use the `hive-site` configuration classification to specify a location in Amazon S3 for `hive.metastore.warehouse.dir`, which applies to all Hive tables. If a table is created in an HDFS location and the cluster that created it is still running, you can update the table location to Amazon S3 from within AWS Glue. For more information, see [Working with Tables on the AWS Glue Console](https://docs.aws.amazon.com/glue/latest/dg/console-tables.html) in the *AWS Glue Developer Guide*. 
+ Partition values containing quotes and apostrophes are not supported, for example, `PARTITION (owner="Doe's")`.
+ [Column statistics](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics) are supported for emr-5.31.0 and later.
+ Using [Hive authorization](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization) is not supported. As an alternative, consider using [AWS Glue Resource-Based Policies](https://docs.aws.amazon.com/glue/latest/dg/glue-resource-policies.html). For more information, see [Use Resource-Based Policies for Amazon EMR Access to AWS Glue Data Catalog](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles-glue.html).
+ [Hive constraints](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Constraints) are not supported.
+ [Cost-based Optimization in Hive](https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive) is not supported.
+ Setting `hive.metastore.partition.inherit.table.properties` is not supported. 
+ Using the following metastore constants is not supported: `BUCKET_COUNT, BUCKET_FIELD_NAME, DDL_TIME, FIELD_TO_DIMENSION, FILE_INPUT_FORMAT, FILE_OUTPUT_FORMAT, HIVE_FILTER_FIELD_LAST_ACCESS, HIVE_FILTER_FIELD_OWNER, HIVE_FILTER_FIELD_PARAMS, IS_ARCHIVED, META_TABLE_COLUMNS, META_TABLE_COLUMN_TYPES, META_TABLE_DB, META_TABLE_LOCATION, META_TABLE_NAME, META_TABLE_PARTITION_COLUMNS, META_TABLE_SERDE, META_TABLE_STORAGE, ORIGINAL_LOCATION`.
+ When you use a predicate expression, explicit values must be on the right side of the comparison operator, or queries might fail.
  + **Correct**: `SELECT * FROM mytable WHERE time > 11`
  + **Incorrect**: `SELECT * FROM mytable WHERE 11 > time`
+ Amazon EMR versions 5.32.0 and 6.3.0 and later support using user-defined functions (UDFs) in predicate expressions. When using earlier versions, your queries may fail because of the way Hive tries to optimize query execution.
+ [Temporary tables](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-TemporaryTables) are not supported.
+ We recommend creating tables using applications through Amazon EMR rather than creating them directly using AWS Glue. Creating a table through AWS Glue may cause required fields to be missing and cause query exceptions.
+ In Amazon EMR 5.20.0 or later, parallel partition pruning is enabled automatically for Spark and Hive when the AWS Glue Data Catalog is used as the metastore. This change significantly reduces query planning time by executing multiple requests in parallel to retrieve partitions. The total number of segments that can be executed concurrently ranges between 1 and 10. The default value is 5, which is the recommended setting. You can change it by specifying the property `aws.glue.partition.num.segments` in the `hive-site` configuration classification. If throttling occurs, you can turn off the feature by changing the value to 1. For more information, see [AWS Glue Segment Structure](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-partitions.html#aws-glue-api-catalog-partitions-Segment).
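
Several of the considerations above can be addressed through the `hive-site` configuration classification. For example, the following sketch sets the default warehouse location to Amazon S3 (so table data survives cluster termination) and reduces partition-retrieval parallelism to avoid throttling. The bucket name is a placeholder.

```
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.warehouse.dir": "s3://DOC-EXAMPLE-BUCKET/hive-warehouse/",
      "aws.glue.partition.num.segments": "1"
    }
  }
]
```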

# Using an external MySQL database or Amazon Aurora
<a name="emr-hive-metastore-external"></a>

To use an external MySQL database or Amazon Aurora as your Hive metastore, you override the default configuration values for the metastore in Hive to specify the external database location, either on an Amazon RDS MySQL instance or an Amazon Aurora PostgreSQL instance.

**Note**  
Hive neither supports nor prevents concurrent write access to metastore tables. If you share metastore information between two clusters, you must ensure that you do not write to the same metastore table concurrently, unless you are writing to different partitions of the same metastore table.

The following procedure shows you how to override the default configuration values for the Hive metastore location and start a cluster using the reconfigured metastore location.

**To create a metastore located outside of the EMR cluster**

1. Create a MySQL or Aurora PostgreSQL database. If you use PostgreSQL, you must configure it after you've provisioned your cluster. Only MySQL is supported at cluster creation. For information about the differences between Aurora MySQL and Aurora PostgreSQL, see [Overview of Amazon Aurora MySQL](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraMySQL.Overview.html) and [Working with Amazon Aurora PostgreSQL](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraPostgreSQL.html). For information about how to create an Amazon RDS database in general, see [https://aws.amazon.com/rds/](https://aws.amazon.com/rds/).

1. Modify your security groups to allow JDBC connections between your database and the **ElasticMapReduce-Master** security group. For information about how to modify your security groups for access, see [Working with Amazon EMR-managed security groups](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-man-sec-groups.html).
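
   For example, for a MySQL metastore listening on the default port 3306, you might add an ingress rule to the database's security group like the following. The security group IDs are placeholders for your database security group and the cluster's primary security group.

   ```
   aws ec2 authorize-security-group-ingress --group-id sg-database \
   --protocol tcp --port 3306 --source-group sg-emr-primary
   ```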

1. Set JDBC configuration values in `hive-site.xml`:
**Important**  
If you supply sensitive information, such as passwords, to the Amazon EMR configuration API, this information is displayed for accounts that have sufficient permissions. If you are concerned that this information could be displayed to other users, create the cluster with an administrative account and restrict other users (IAM users or those with delegated credentials) from accessing services on the cluster by creating a role that explicitly denies permissions to the `elasticmapreduce:DescribeCluster` API operation.

   1. Create a configuration file called `hiveConfiguration.json` containing edits to `hive-site.xml` as shown in the following example.

       Replace *hostname* with the DNS address of your Amazon RDS instance running the database, and *username* and *password* with the credentials for your database. For more information about connecting to MySQL and Aurora database instances, see [Connecting to a DB instance running the MySQL database engine](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ConnectToInstance.html) and [Connecting to an Aurora DB cluster](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Connect.html) in the *Amazon RDS User Guide*. `javax.jdo.option.ConnectionURL` is the JDBC connect string for a JDBC metastore. `javax.jdo.option.ConnectionDriverName` is the driver class name for a JDBC metastore.

      The MySQL JDBC drivers are installed by Amazon EMR. 

      The property value cannot contain any spaces or carriage returns; it must appear all on one line.

      ```
      [
          {
            "Classification": "hive-site",
            "Properties": {
              "javax.jdo.option.ConnectionURL": "jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true",
              "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
              "javax.jdo.option.ConnectionUserName": "username",
              "javax.jdo.option.ConnectionPassword": "password"
            }
          }
        ]
      ```

   1. Reference the `hiveConfiguration.json` file when you create the cluster, as shown in the following AWS CLI command. In this command, the file is stored locally; you can also upload the file to Amazon S3 and reference it there, for example, `s3://DOC-EXAMPLE-BUCKET/hiveConfiguration.json`.
**Note**  
Linux line continuation characters (\\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).

      ```
      aws emr create-cluster --release-label emr-7.12.0 --instance-type m5.xlarge --instance-count 2 \
      --applications Name=Hive --configurations file://hiveConfiguration.json --use-default-roles
      ```

1. Connect to the primary node of your cluster. 

   For information about how to connect to the primary node, see [Connect to the primary node using SSH](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html) in the *Amazon EMR Management Guide*.

1. Create your Hive tables specifying the location on Amazon S3 by entering a command similar to the following:

   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS table_name
   (
   key int,
   value int
   )
   LOCATION 's3://DOC-EXAMPLE-BUCKET/hdfs/';
   ```

1. Add your Hive script to the running cluster.

Your Hive cluster runs using the metastore located in Amazon RDS. Launch all additional Hive clusters that share this metastore by specifying the metastore location. 
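
The requirement that property values in the configuration file contain no spaces or carriage returns can be checked with a short script before you pass the file to `create-cluster`. This is a convenience sketch, not part of Amazon EMR; the function name is an assumption.

```
import json

def find_invalid_properties(path):
    """Return property names in an EMR configuration file whose values
    contain spaces or carriage returns (disallowed in hive-site values)."""
    with open(path) as f:
        classifications = json.load(f)
    bad = []
    for classification in classifications:
        for name, value in classification.get("Properties", {}).items():
            if any(ch in value for ch in (" ", "\r", "\n")):
                bad.append(name)
    return bad
```

Run it against `hiveConfiguration.json`; an empty list means the property values are safe to submit.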