

# Setup for SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup"></a>

The following section provides information on how to get set up with the Amazon CloudWatch Observability EKS and SageMaker HyperPod task governance add-ons.

Ensure that you have the minimum permission policy for HyperPod cluster administrators with Amazon EKS, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). This includes permissions to run the SageMaker HyperPod core APIs and manage SageMaker HyperPod clusters within your AWS account, performing the tasks in [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md). 

**Topics**
+ [

# Dashboard setup
](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md)
+ [

# Task governance setup
](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance.md)

# Dashboard setup
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard"></a>

Use the following information to get set up with Amazon SageMaker HyperPod Amazon CloudWatch Observability EKS add-on. This sets you up with a detailed visual dashboard that provides a view into metrics for your EKS cluster hardware, team allocation, and tasks.

If you are having issues setting up, please see [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md) for known troubleshooting solutions.

**Topics**
+ [

## HyperPod Amazon CloudWatch Observability EKS add-on prerequisites
](#hp-eks-dashboard-prerequisites)
+ [

## HyperPod Amazon CloudWatch Observability EKS add-on setup
](#hp-eks-dashboard-setup)

## HyperPod Amazon CloudWatch Observability EKS add-on prerequisites
<a name="hp-eks-dashboard-prerequisites"></a>

The following section includes the prerequisites needed before installing the Amazon EKS Observability add-on.
+ Ensure that you have the minimum permission policy for HyperPod cluster administrators, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin).
+ Attach the `CloudWatchAgentServerPolicy` IAM policy to your worker nodes. To do so, enter the following command. Replace `my-worker-node-role` with the IAM role used by your Kubernetes worker nodes.

  ```
  aws iam attach-role-policy \
  --role-name my-worker-node-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
  ```

## HyperPod Amazon CloudWatch Observability EKS add-on setup
<a name="hp-eks-dashboard-setup"></a>

Use the following options to set up the Amazon SageMaker HyperPod Amazon CloudWatch Observability EKS add-on.

------
#### [ Setup using the SageMaker AI console ]

The following permissions are required for setup and visualizing the HyperPod task governance dashboard. This section expands upon the permissions listed in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). 

To manage task governance, use the sample policy:

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListClusters",
                "sagemaker:DescribeCluster",
                "sagemaker:ListComputeQuotas",
                "sagemaker:CreateComputeQuota",
                "sagemaker:UpdateComputeQuota",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:DeleteComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "sagemaker:CreateClusterSchedulerConfig",
                "sagemaker:UpdateClusterSchedulerConfig",
                "sagemaker:DeleteClusterSchedulerConfig",
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:DescribeAddon",
                "eks:DescribeCluster",
                "eks:DescribeAccessEntry",
                "eks:ListAssociatedAccessPolicies",
                "eks:AssociateAccessPolicy",
                "eks:DisassociateAccessPolicy"
            ],
            "Resource": "*"
        }
    ]
}
```

------

To grant permissions to manage Amazon CloudWatch Observability Amazon EKS and view the HyperPod cluster dashboard through the SageMaker AI console, use the sample policy below:

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:ListComputeQuotas",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "eks:DescribeCluster",
                "cloudwatch:GetMetricData",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
```

------

Navigate to the **Dashboard** tab in the SageMaker HyperPod console to install the Amazon CloudWatch Observability EKS. To ensure task governance related metrics are included in the **Dashboard**, enable the Kueue metrics checkbox. Enabling the Kueue metrics enables CloudWatch **Metrics** costs, after free-tier limit is reached. For more information, see **Metrics** in [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/).

------
#### [ Setup using the EKS AWS CLI ]

Use the following EKS AWS CLI command to install the add-on:

```
aws eks create-addon --cluster-name cluster-name 
--addon-name amazon-cloudwatch-observability 
--configuration-values "configuration json"
```

Below is an example of the JSON of the configuration values:

```
{
    "agent": {
        "config": {
            "logs": {
                "metrics_collected": {
                    "kubernetes": {
                        "kueue_container_insights": true,
                        "enhanced_container_insights": true
                    },
                    "application_signals": { }
                }
            },
            "traces": {
                "traces_collected": {
                    "application_signals": { }
                }
            }
        },
    },
}
```

------
#### [ Setup using the EKS Console UI ]

1. Navigate to the [EKS console](https://console.aws.amazon.com/eks/home#/clusters).

1. Choose your cluster.

1. Choose **Add-ons**.

1. Find the **Amazon CloudWatch Observability** add-on and install. Install version >= 2.4.0 for the add-on. 

1. Include the following JSON, Configuration values:

   ```
   {
       "agent": {
           "config": {
               "logs": {
                   "metrics_collected": {
                       "kubernetes": {
                           "kueue_container_insights": true,
                           "enhanced_container_insights": true
                       },
                       "application_signals": { }
                   },
               },
               "traces": {
                   "traces_collected": {
                       "application_signals": { }
                   }
               }
           },
       },
   }
   ```

------

Once the EKS Observability add-on has been successfully installed, you can view your EKS cluster metrics under the HyperPod console **Dashboard** tab.

# Task governance setup
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance"></a>

This section includes information on how to set up the Amazon SageMaker HyperPod task governance EKS add-on. This includes granting permissions that allows you to set task prioritization, compute allocation for teams, how idle compute is shared, and task preemption for teams.

If you are having issues setting up, please see [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md) for known troubleshooting solutions.

**Topics**
+ [

## Kueue Settings
](#hp-eks-task-governance-kueue-settings)
+ [

## HyperPod Task governance prerequisites
](#hp-eks-task-governance-prerequisites)
+ [

## HyperPod task governance setup
](#hp-eks-task-governance-setup)

## Kueue Settings
<a name="hp-eks-task-governance-kueue-settings"></a>

HyperPod task governance EKS add-on installs [Kueue](https://github.com/kubernetes-sigs/kueue/tree/main/apis/kueue) for your HyperPod EKS clusters. Kueue is a kubernetes-native system that manages quotas and how jobs consume them. 


| EKS HyperPod task governance add-on version | Version of Kueue that is installed as part of the add-on | 
| --- | --- | 
|  v1.1.3  |  v0.12.0  | 

**Note**  
Kueue v.012.0 and higher don't include kueue-rbac-proxy as part of the installation. Previous versions might have kueue-rbac-proxy installed. For example, if you're using Kueue v0.8.1, you might have kueue-rbac-proxy v0.18.1.

HyperPod task governance leverages Kueue for Kubernetes-native job queueing, scheduling, and quota management, and is installed with the HyperPod task governance EKS add-on. When installed, HyperPod creates and modifies SageMaker AI-managed Kubernetes resources such as `KueueManagerConfig`, `ClusterQueues`, `LocalQueues`, `WorkloadPriorityClasses`, `ResourceFlavors`, and `ValidatingAdmissionPolicies`. While Kubernetes administrators have the flexibility to modify the state of these resources, it is possible that any changes made to a SageMaker AI-managed resource may be updated and overwritten by the service.

The following information outlines the configuration settings utilized by the HyperPod task governance add-on for setting up Kueue.

```
  apiVersion: config.kueue.x-k8s.io/v1beta1
    kind: Configuration
    health:
      healthProbeBindAddress: :8081
    metrics:
      bindAddress: :8443
      enableClusterQueueResources: true
    webhook:
      port: 9443
    manageJobsWithoutQueueName: false
    leaderElection:
      leaderElect: true
      resourceName: c1f6bfd2.kueue.x-k8s.io
    controller:
      groupKindConcurrency:
        Job.batch: 5
        Pod: 5
        Workload.kueue.x-k8s.io: 5
        LocalQueue.kueue.x-k8s.io: 1
        ClusterQueue.kueue.x-k8s.io: 1
        ResourceFlavor.kueue.x-k8s.io: 1
    clientConnection:
      qps: 50
      burst: 100
    integrations:
      frameworks:
      - "batch/job"
      - "kubeflow.org/mpijob"
      - "ray.io/rayjob"
      - "ray.io/raycluster"
      - "jobset.x-k8s.io/jobset"
      - "kubeflow.org/mxjob"
      - "kubeflow.org/paddlejob"
      - "kubeflow.org/pytorchjob"
      - "kubeflow.org/tfjob"
      - "kubeflow.org/xgboostjob"
      - "pod"
      - "deployment"
      - "statefulset"
      - "leaderworkerset.x-k8s.io/leaderworkerset"
      podOptions:
        namespaceSelector:
          matchExpressions:
            - key: kubernetes.io/metadata.name
              operator: NotIn
              values: [ kube-system, kueue-system ]
    fairSharing:
      enable: true
      preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
    resources:
      excludeResourcePrefixes: []
```

For more information about each configuration entry, see [Configuration](https://kueue.sigs.k8s.io/docs/reference/kueue-config.v1beta1/#Configuration) in the Kueue documentation.

## HyperPod Task governance prerequisites
<a name="hp-eks-task-governance-prerequisites"></a>
+ Ensure that you have the minimum permission policy for HyperPod cluster administrators, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). This includes permissions to run the SageMaker HyperPod core APIs, manage SageMaker HyperPod clusters within your AWS account, and performing the tasks in [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md). 
+ You will need to have your Kubernetes version >= 1.30. For instructions, see [Update existing clusters to the new Kubernetes version](https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html).
+ If you already have Kueue installed in their clusters, uninstall Kueue before installing the EKS add-on.
+ A HyperPod node must already exist in the EKS cluster before installing the HyperPod task governance add-on. 

## HyperPod task governance setup
<a name="hp-eks-task-governance-setup"></a>

The following provides information on how to get set up with HyperPod task governance.

------
#### [ Setup using the SageMaker AI console ]

The following provides information on how to get set up with HyperPod task governance using the SageMaker HyperPod console.

You already have all of the following permissions attached if you have already granted permissions to manage Amazon CloudWatch Observability EKS and view the HyperPod cluster dashboard through the SageMaker AI console in the [HyperPod Amazon CloudWatch Observability EKS add-on setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md#hp-eks-dashboard-setup). If you have not set this up, use the sample policy below to grant permissions to manage the HyperPod task governance add-on and view the HyperPod cluster dashboard through the SageMaker AI console.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "eks:DescribeCluster",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
```

------

Navigate to the **Dashboard** tab in the SageMaker HyperPod console to install the Amazon SageMaker HyperPod task governance Add-on. 

------
#### [ Setup using the Amazon EKS AWS CLI ]

Use the example [https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/create-addon.html](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/create-addon.html) EKS AWS CLI command to set up the HyperPod task governance Amazon EKS API and console UI using the AWS CLI:

```
aws eks create-addon --region region --cluster-name cluster-name --addon-name amazon-sagemaker-hyperpod-taskgovernance
```

------

You can view the **Policies** tab in the HyperPod SageMaker AI console if the install was successful. You can also use the following example [https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/describe-addon.html](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/describe-addon.html) EKS AWS CLI command to check the status. 

```
aws eks describe-addon --region region --cluster-name cluster-name --addon-name amazon-sagemaker-hyperpod-taskgovernance
```