

# Amazon EMR on EC2 connections in Amazon SageMaker Unified Studio

When you work in a project, you can manage that project's Amazon EMR on EC2 resources and view both monitoring and logging data for those resources. You can create and configure Amazon EMR on EC2 clusters, and you can terminate and remove them. While clusters are running, their metrics are automatically sent to Amazon CloudWatch, and logging data is preserved in the Spark UI.

# Adding a new Amazon EMR on EC2 cluster in Amazon SageMaker Unified Studio

As a data worker, you can use Amazon EMR on EC2 by adding existing or new Amazon EMR on EC2 clusters as compute instances to a project in Amazon SageMaker Unified Studio. Within a project, you can use both existing and new Amazon EMR on EC2 clusters.

Before you can create a new Amazon EMR on EC2 cluster, your admin must enable blueprints. On-demand creation isn't supported for Amazon EMR on EC2 in quick setup. 

After your admin has enabled blueprints, complete the following steps:

1. From inside the project management view, select **Compute** from the navigation bar. 

1. In the Compute panel, select the **Data processing** tab.

1. To create a new Amazon EMR on EC2 cluster, select the **Add compute** dropdown menu and then choose **New compute**.

1. In the **Add compute** modal, you can select the type of compute you would like to add to your project. Select **Create new compute resources**.

1. Select **Amazon EMR on EC2 cluster**.

1. The **Add compute** dialog box allows you to specify the name of the Amazon EMR on EC2 cluster, provide a description, and choose a release of EMR (such as EMR 7.5) that you want to install on your cluster. 

1. After configuring these settings, select **Add compute**. After some time, your Amazon EMR on EC2 cluster will be added to your project.

# Adding an existing Amazon EMR on EC2 cluster in Amazon SageMaker Unified Studio

As a data worker, you can use Amazon EMR on EC2 by adding existing or new Amazon EMR on EC2 clusters as compute instances to a project in Amazon SageMaker Unified Studio. Within a project, you can use both existing and new Amazon EMR on EC2 clusters.

Before you can connect to an Amazon EMR on EC2 cluster, you must complete the following prerequisites:
+ Your Amazon SageMaker Unified Studio admin must enable blueprints. On-demand creation isn't supported for Amazon EMR on EC2 in quick setup. In addition, if you are connecting to an Amazon EMR on EC2 cluster that is not runtime-role enabled, the admin must configure specific blueprints as described in the section below.
+ You must have a project created in Amazon SageMaker Unified Studio. If you are connecting to an Amazon EMR on EC2 cluster that is not runtime-role enabled, you must create a project that includes specific blueprint configurations in the project profile.
+ The admin that owns the Amazon EMR resource you want to connect to must complete a set of prerequisite steps to grant you access to the resource. 

More details on each of these steps are in the sections below.

## Prerequisite steps for you and your Amazon SageMaker Unified Studio admin


Amazon EMR on EC2 clusters can be runtime-role enabled or not runtime-role enabled. You can connect to both kinds of Amazon EMR on EC2 clusters in Amazon SageMaker Unified Studio. However, to use clusters that are not runtime-role enabled, you and your Amazon SageMaker Unified Studio admin must prepare to use a project with specific configurations.

**Note**  
If you are connecting to clusters that are runtime-role enabled, you can proceed to the section for prerequisite steps for Amazon EMR admins without completing the steps in this section.
+ You can use runtime-role enabled clusters to specify different IAM roles for individual jobs or steps within a cluster, with fine-grained access control tailored to specific job needs. 
+ Clusters that are not runtime-role enabled have limited granular access control for jobs. Instead, all jobs on the cluster use the same set of permissions.

Amazon EMR clusters with runtime roles enabled are considered more secure because they allow for fine-grained access control at the job level, meaning each individual job running on the cluster can be assigned a specific IAM role with only the necessary permissions to access the data and resources it needs.

To prepare to use clusters that are not runtime-role enabled, complete the following additional steps:

**Note**  
 Amazon EMR clusters that are not runtime-role enabled must have in-transit encryption enabled in order to be connected to Amazon SageMaker Unified Studio. To ensure that the Amazon EMR cluster meets this requirement, verify with your Amazon EMR admin that the cluster has a security configuration with in-transit encryption enabled. For more information, see [Create a security configuration with the Amazon EMR console or with the AWS CLI](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-create-security-configuration.html) in the Amazon EMR Management Guide. 

1. The Amazon SageMaker Unified Studio admin must configure the tooling configurations in the blueprints for a project profile so that **allowConnectionToUserGovernedEmrClusters** is set to **True** in the Amazon SageMaker Unified Studio management console. For more information, see the Amazon SageMaker Unified Studio Administrator Guide.

1. You create a project using the project profile that your admin modified in step 1.

For more information about runtime roles, see [Runtime roles for Amazon EMR steps](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-steps-runtime-roles.html) in the Amazon EMR Management Guide.

**Note**  
For clusters without runtime roles, Amazon SageMaker Unified Studio cannot provide governance on the clusters, and applications running on these clusters will not be isolated between projects or honor fine-grained access control based on project data permissions.  
Additionally, all project resources are inaccessible to the cluster unless additional permissions are granted to the IAM instance profile role attached to the Amazon EC2 instance.
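
The in-transit encryption requirement noted above can be checked from the JSON body of the cluster's security configuration. The following is a minimal sketch, assuming the JSON has already been retrieved (for example, from the `SecurityConfiguration` field returned by `aws emr describe-security-configuration`). The function name and the sample bucket are hypothetical and are used only for illustration.

```python
import json


def in_transit_encryption_enabled(security_configuration_json: str) -> bool:
    """Return True if the security configuration JSON enables in-transit encryption."""
    config = json.loads(security_configuration_json)
    encryption = config.get("EncryptionConfiguration", {})
    return bool(encryption.get("EnableInTransitEncryption", False))


# Sample input shaped like an Amazon EMR security configuration.
# The bucket name and object key are placeholders, not real resources.
example = json.dumps({
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": True,
        "EnableAtRestEncryption": False,
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://amzn-s3-demo-bucket/my-certs.zip",
            }
        },
    }
})

print(in_transit_encryption_enabled(example))
```

A result of `False` suggests asking the Amazon EMR admin to attach a security configuration with in-transit encryption before you try to connect the cluster.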

## Prerequisite steps for Amazon EMR admins


Before you can add an existing Amazon EMR on EC2 resource to your project in Amazon SageMaker Unified Studio, the admin that owns that resource must grant access to you by completing the following steps:

**Create an Amazon EMR access role with a trust policy**

1. Get the project role ARN and project ID for the Amazon SageMaker Unified Studio project that you want to grant access to. Project members can get the project role ARN and project ID from the **Project overview** page in their project.
**Note**  
If the Amazon SageMaker Unified Studio project uses a different VPC than the Amazon EMR on EC2 cluster you want to grant access to, you must also get the project VPC information from the project member and complete additional steps to connect the VPCs. For more information, see [VPC to VPC connectivity](https://docs.aws.amazon.com/whitepapers/latest/building-scalable-secure-multi-vpc-network-infrastructure/vpc-to-vpc-connectivity.html) and [Connect VPCs using VPC peering](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-peering.html).

1. Make sure that the EMR cluster you want to grant access to has an instance profile role with the `sts:AssumeRole` permission on the runtime role. For more information, see [Runtime roles for Amazon EMR steps](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-steps-runtime-roles.html#configure-ec2-profile) in the Amazon EMR Management Guide.

1. Go to the AWS IAM console.

1. On the Roles page, choose **Create role**.

1. Choose **Custom trust policy**.

1. Enter information for the trust policy as shown in the example below, and edit it according to the project information you received in step 1.
   + Change `project-role-arn` to be the project role ARN you received from the Amazon SageMaker Unified Studio project member.
   + Change `project-id` to be the project ID you received from the Amazon SageMaker Unified Studio project member.
   + Change `domain-id` to be the ID of the Amazon SageMaker Unified Studio domain that contains the project.

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": { 
                   "AWS": "project-role-arn"
               },
               "Action": "sts:AssumeRole",
               "Condition": {
                   "StringEquals": {
                       "sts:ExternalId": "project-id"
                   }
               }
           },
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "project-role-arn"
               },
               "Action": [
                   "sts:SetSourceIdentity"
               ],
               "Condition": {
                   "StringLike": {
                       "sts:SourceIdentity": "${aws:PrincipalTag/datazone:userId}"
                   }
               }
           },
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "project-role-arn"
               },
               "Action": "sts:TagSession",
               "Condition": {
                   "StringEquals": {
                       "aws:RequestTag/AmazonDataZoneProject": "project-id",
                       "aws:RequestTag/AmazonDataZoneDomain": "domain-id"
                   }
               }
           }
       ]
   }
   ```

1. Choose **Next**.

1. Under **Role name**, enter a name for the role.

1. (Optional) Enter a description for the role.

1. Choose **Create role**.
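
If you prefer to generate the trust policy rather than edit it by hand, the following minimal sketch fills the three placeholders from the example above and prints JSON that can be pasted into the **Custom trust policy** editor. The function name and the sample values are hypothetical. The printed document could also be supplied as the `AssumeRolePolicyDocument` argument of the IAM `CreateRole` API, which this sketch does not call.

```python
import json


def build_trust_policy(project_role_arn: str, project_id: str, domain_id: str) -> dict:
    """Build the trust policy from the example above for one project."""
    principal = {"AWS": project_role_arn}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": project_id}},
            },
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["sts:SetSourceIdentity"],
                "Condition": {
                    "StringLike": {
                        # IAM policy variable, kept literally as in the example above.
                        "sts:SourceIdentity": "${aws:PrincipalTag/datazone:userId}"
                    }
                },
            },
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:TagSession",
                "Condition": {
                    "StringEquals": {
                        "aws:RequestTag/AmazonDataZoneProject": project_id,
                        "aws:RequestTag/AmazonDataZoneDomain": domain_id,
                    }
                },
            },
        ],
    }


# Placeholder values; replace them with the details from the project member.
trust_policy = build_trust_policy(
    project_role_arn="arn:aws:iam::111122223333:role/example-project-role",
    project_id="example-project-id",
    domain_id="example-domain-id",
)
print(json.dumps(trust_policy, indent=4))
```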

**Attach permissions to the role**

1. Select the role you have created in the AWS IAM console.

1. Choose **Add permissions** > **Create inline policy**.

1. Enter information as shown in the example below, and edit it according to the information for the Amazon EMR clusters that you want to grant access to.
   + Change the `Resource` value in the `EmrAccess` statement to be the ARN for the cluster. You can find this on the cluster details page in the Amazon EMR console by selecting the cluster ID of the cluster that you want to share.
**Note**  
You can use an asterisk instead of the Amazon EMR cluster ID if you want to grant access to all clusters instead of just one.
   + If the cluster is not runtime-role enabled, omit the `elasticmapreduce:GetClusterSessionCredentials` action.
   + Change the `Resource` value in the `EMRSelfSignedCertAccess` statement to the certificate path defined in the Amazon EMR security configuration for that cluster in the Amazon EMR console. For more information, see [Specify a security configuration for an Amazon EMR cluster](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-specify-security-configuration.html) in the Amazon EMR Management Guide.

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "EmrAccess",
               "Effect": "Allow",
               "Action": [
                   "elasticmapreduce:ListInstances",
                   "elasticmapreduce:DescribeCluster",
                   "elasticmapreduce:ListBootstrapActions",
                   "elasticmapreduce:GetClusterSessionCredentials"
               ],
               "Resource": "arn:aws:elasticmapreduce:us-east-1:666777888999:cluster/j-AB1CDEFGHIJK"
           },
           {
               "Sid": "EMRSelfSignedCertAccess",
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject"
               ],
               "Resource": [
                   "arn:aws:s3:::666777888999-us-east-1-sam-dev/my-certs.zip"
               ]
           },
           {
               "Sid": "EMRSecurityConfigurationAccess",
               "Effect": "Allow",
               "Action": [
                   "elasticmapreduce:DescribeSecurityConfiguration"
               ],
               "Resource": [
                   "*"
               ]
           }
       ]
   }
   ```

1. Choose **Next**.

1. Under **Policy name**, enter a name for the policy.

1. Choose **Create policy**. You can then see the permissions policy listed on the page for the role you created in the IAM console.
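
The permissions policy above can also be assembled programmatically. The following minimal sketch builds the same three statements, leaves out `elasticmapreduce:GetClusterSessionCredentials` for clusters that are not runtime-role enabled, and converts an `s3://` certificate location into the ARN form that the policy expects. The helper names, the `runtime_role_enabled` parameter, and the sample bucket are hypothetical, and the conversion assumes the standard `aws` partition.

```python
import json


def s3_uri_to_arn(s3_uri: str) -> str:
    """Convert s3://bucket/key to arn:aws:s3:::bucket/key (assumes the 'aws' partition)."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"expected an s3:// URI, got {s3_uri!r}")
    return "arn:aws:s3:::" + s3_uri[len(prefix):]


def build_emr_access_policy(cluster_arn: str, cert_s3_uri: str,
                            runtime_role_enabled: bool = True) -> dict:
    """Build the inline permissions policy for the Amazon EMR access role."""
    emr_actions = [
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListBootstrapActions",
    ]
    if runtime_role_enabled:
        # Only needed for runtime-role enabled clusters.
        emr_actions.append("elasticmapreduce:GetClusterSessionCredentials")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EmrAccess",
                "Effect": "Allow",
                "Action": emr_actions,
                "Resource": cluster_arn,
            },
            {
                "Sid": "EMRSelfSignedCertAccess",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [s3_uri_to_arn(cert_s3_uri)],
            },
            {
                "Sid": "EMRSecurityConfigurationAccess",
                "Effect": "Allow",
                "Action": ["elasticmapreduce:DescribeSecurityConfiguration"],
                "Resource": ["*"],
            },
        ],
    }


# Placeholder inputs; replace them with your cluster ARN and certificate location.
access_policy = build_emr_access_policy(
    cluster_arn="arn:aws:elasticmapreduce:us-east-1:666777888999:cluster/j-AB1CDEFGHIJK",
    cert_s3_uri="s3://amzn-s3-demo-bucket/my-certs.zip",
    runtime_role_enabled=False,
)
print(json.dumps(access_policy, indent=4))
```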

**Send information to project members**

1. Copy the ARN of the EMR access role you created in the IAM console and send it to the Amazon SageMaker Unified Studio project member you want to grant access to.

1. Copy the Amazon EMR cluster ARN that you added to the permissions policy and send it to the Amazon SageMaker Unified Studio project member you want to grant access to.

1. From the Amazon EMR on EC2 cluster details page in the Amazon EMR console, copy the EC2 instance profile string and search for it on the Roles page in the IAM console to find the role that contains the Amazon EC2 instance profile ARN.

1. Select the name of the role that contains the instance profile ARN to open the role details page, then copy the ARN and send it to the Amazon SageMaker Unified Studio project member you want to grant access to.

After the Amazon EMR admin has completed these steps, project members are able to add a connection to the Amazon EMR on EC2 cluster as a compute resource in Amazon SageMaker Unified Studio.

## Adding the Amazon EMR on EC2 compute resource


1. From inside the project management view in Amazon SageMaker Unified Studio, select **Compute** from the navigation bar. 

1. On the Compute page, select the **Data processing** tab.

1. Choose **Add compute**, then choose **Connect to existing compute resources**. 

1. In the **Add compute** modal, you can select the type of compute resource you would like to add to your project. Select **EMR on EC2 cluster**.

1. To add a connection to an existing Amazon EMR on EC2 cluster, you must have the correct permissions to access the Amazon EMR on EC2 cluster. You can select the **Copy project information** button to copy the data that the Amazon EMR admin will need to grant the data worker access. If you haven't already, send the project role ARN and the project ID to your admin.
**Note**  
The Amazon EMR admin will also need the project ID, which is the penultimate string in the project ARN. To view and copy the project ID, go to the **Project overview** page of your project.

1. After the account administrator has granted you access according to the prerequisite steps above, you can specify the ARNs associated with the cluster. You must fill in the **Access role ARN**, **EMR on EC2 cluster ARN**, **Compute name**, and the **Instance profile role ARN**.

1. Choose **Add compute**. Your Amazon EMR on EC2 instance is then added to your project.

After you have added a cluster to a project, you are able to see the cluster in the list on the **Data processing** tab in the Compute panel. You can then view the cluster details by selecting the cluster you want.

# Using an Amazon EMR on EC2 cluster


After connecting to an Amazon EMR on EC2 cluster, you can begin using the cluster. To get started, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the compute connection. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. On the **Compute** page, choose the name of the compute you want to initialize. This takes you to a page with details about the cluster. Make a note of the name of the compute.

1. Choose **Actions > Open JupyterLab IDE**.

1. In the first cell, choose a connection type that you want to use from the dropdown list of connection types. Then choose the name of the compute from the dropdown list of compute options.

1. Choose the **Run** icon.

Your cluster is now initialized and configured to be a compute resource in your Amazon SageMaker Unified Studio project.

# Monitoring Amazon EMR on EC2 clusters in Amazon SageMaker Unified Studio

You can monitor the performance of your Amazon EMR on EC2 clusters to ensure optimal resource use and efficient job execution. Information on metrics is automatically collected and sent to Amazon CloudWatch during operation of an Amazon EMR cluster.

You can see [CloudWatch metrics](https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html) for a specific cluster by selecting the cluster you're interested in from the list of clusters under the Cluster tab. Selecting a cluster will bring you to the Detail view for that cluster. After you've selected a cluster, select the **Monitoring** tab.

You will be able to see a grid view of the CloudWatch Metrics for the cluster you selected.

You can see information presented through different views by using the **Dashboard View** drop-down menu: Cluster Overview, Primary Node Group, Core Node Group, Task Node Group. You can also adjust the time range.
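
The same metrics can be read outside the console. As a rough sketch, the helper below builds the parameters for a CloudWatch `GetMetricStatistics` request against the `AWS/ElasticMapReduce` namespace, using the cluster ID as the `JobFlowId` dimension. The function name, the default metric (`IsIdle`), and the default time window are illustrative choices rather than values taken from this guide. The request itself is not sent here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def emr_metric_request(cluster_id: str, metric_name: str = "IsIdle",
                       hours: int = 3, period_seconds: int = 300,
                       now: Optional[datetime] = None) -> dict:
    """Build GetMetricStatistics parameters for one Amazon EMR cluster metric."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": metric_name,
        # Amazon EMR publishes cluster metrics under the JobFlowId dimension.
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average"],
    }


params = emr_metric_request("j-AB1CDEFGHIJK")
print(params["Namespace"], params["MetricName"], params["Dimensions"])

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("cloudwatch").get_metric_statistics(**params)
```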

# Configuring trusted identity propagation

To enable trusted identity propagation (TIP) for a cluster in Amazon SageMaker Unified Studio, you or your admin must add an inline policy to the cluster's instance profile role. Before doing this, make sure you have followed the steps to [add a new EMR on EC2 cluster to your project](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/adding-new-emr-on-ec2-clusters.html).

**Note**  
Trusted identity propagation is supported for EMR on EC2 clusters that you create using Amazon SageMaker Unified Studio.

To find the name of the instance profile role for an EMR on EC2 cluster, complete the following steps:

1. Navigate to the project that contains the compute connection. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. On the **Compute** page, go to the **Data processing** tab.

1. Choose the name of the compute you want to configure TIP for. This takes you to a page with details about the cluster. The instance profile role is on this page and the admin can then search for it in the IAM console.

As an admin user who can edit IAM policies in the account that owns the project, add the following inline policy to the instance profile role.

```
{
    "Statement": [
        {
            "Sid": "IdCPermissions",
            "Effect": "Allow",
            "Action": [
                "sso-oauth:CreateTokenWithIAM",
                "sso-oauth:IntrospectTokenWithIAM",
                "sso-oauth:RevokeTokenWithIAM"
            ],
            "Resource": "*"
        }, 
        {
            "Sid": "AllowAssumeRole",
            "Effect": "Allow",
            "Action": [
                "sts:AssumeRole"
            ],
            "Resource": [
                "instance-profile-role-ARN"
            ]
        }
    ]
}
```

After updating the role’s policy, you can use the EMR on EC2 connection to initiate interactive Spark sessions.

# Configuring user background sessions for Amazon EMR on EC2

**Warning**  
 When user background sessions are enabled for Amazon EMR on EC2, Amazon SageMaker Unified Studio does not terminate interactive sessions. Interactive sessions are terminated only after all queries are completed. 

 Amazon EMR on EC2 requires additional IAM permissions to enable user background sessions. You must attach the following inline IAM role policy to the IAM role created as the project user role. 

**Note**  
 The project user role for an Amazon SageMaker Unified Studio project is named `datazone_usr_role_{project_id}`. 

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "UserBackgroundSessions",
            "Effect": "Allow",
            "Action": [
                "sso:GetApplicationSessionConfiguration"
            ],
            "Resource": "*"
        }
    ]
}
```

 For more information, see [User background sessions](https://docs.aws.amazon.com/emr/latest/ManagementGuide/user-background-sessions.html) in the Amazon EMR Management Guide. 
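
As a small illustration of the naming convention in the note above, the following sketch derives the project user role name from a project ID and serializes the inline policy. The helper name and the sample project ID are hypothetical. The commented call shows how the two values might be passed to the IAM `PutRolePolicy` API through boto3, assuming boto3 is installed and you have permission to edit the role.

```python
import json

# Inline policy from the example above.
USER_BACKGROUND_SESSIONS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "UserBackgroundSessions",
            "Effect": "Allow",
            "Action": ["sso:GetApplicationSessionConfiguration"],
            "Resource": "*",
        }
    ],
}


def project_user_role_name(project_id: str) -> str:
    """Return the project user role name for a project ID."""
    return f"datazone_usr_role_{project_id}"


role_name = project_user_role_name("example-project-id")  # placeholder project ID
policy_document = json.dumps(USER_BACKGROUND_SESSIONS_POLICY)
print(role_name)

# With boto3 installed and credentials configured, the policy could be attached as:
#   import boto3
#   boto3.client("iam").put_role_policy(
#       RoleName=role_name,
#       PolicyName="UserBackgroundSessions",  # hypothetical policy name
#       PolicyDocument=policy_document,
#   )
```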

# Terminating and removing an Amazon EMR on EC2 cluster


**Warning**  
A terminated Amazon EMR cluster cannot be recovered. Before you remove the cluster, make sure that you no longer need the resource or any data stored on HDFS or in Jupyter notebooks.

When you no longer need an Amazon EMR on EC2 cluster, the cluster can be terminated and removed.

To remove a cluster:

1. Log in to Amazon SageMaker Unified Studio and navigate to the **Data processing** tab of the Compute section. Select the name of the compute instance you would like to remove.

1. On the compute details page, select the **Terminate and remove** option.

1. A dialog box appears asking you to confirm that you want to terminate and remove the compute, which in this case is your Amazon EMR on EC2 cluster. To confirm, type "confirm" in the text box.

1. Choose **Terminate and remove compute** to begin termination and removal.

1. After a few minutes, the cluster is removed from your project.

## Spark History Server


You can use the live Spark UI in a notebook session to view details such as tasks, executors and logs about Spark jobs.

You can explore the Spark History Server for a cluster at any time. To do this, select your cluster from the list of all clusters assigned to a project, which brings up the Detail view for the cluster. On the Detail page view, select the **Applications** tab and choose the **Spark History Server** link.