Set up Amazon EKS cluster for AI/ML workloads using CLIs
This section walks you through the steps to create the infrastructure required to run training or inference workloads on Amazon EKS via CLI commands. The steps include creating an EKS cluster, GPU-enabled nodes with EKS Auto Mode or Karpenter, a monitoring stack with Prometheus and Grafana, and Amazon S3 storage for model weights.
For background on both provisioning options, see the documentation for EKS Auto Mode and Karpenter.
High-level architecture and workflow
The diagram shows the AWS high-level architecture for this section’s setup. The numbered steps on the right indicate the order in which you complete the configuration in the steps below.
Prerequisites
- kubectl >= 1.35. For setup instructions, see Set up kubectl and eksctl.
- AWS CLI >= 2.27. For setup instructions, see Installing.
- Helm >= 3.14. For setup instructions, see Setup Helm.
- jq. For setup instructions, see Download jq.
- eksctl >= 0.227.0. For setup instructions, see Installation in the eksctl documentation.
Verify your eksctl version:
eksctl version
If you are on a version older than 0.227.0, follow the eksctl installation guide to upgrade.
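The minimum-version check can also be scripted. The following is a minimal sketch, assuming `sort -V` (version sort, from GNU coreutils) is available; `version_ge` is a hypothetical helper name, not part of eksctl or any other tool in this guide.

```shell
# version_ge ACTUAL MINIMUM: succeeds when ACTUAL is at least MINIMUM.
# The smaller of the two versions sorts first, so ACTUAL >= MINIMUM
# exactly when MINIMUM is the first line of the sorted pair.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use (assumes `eksctl version` prints a bare version string):
#   version_ge "$(eksctl version)" 0.227.0 || echo "eksctl is too old"
```

The same helper works for the other prerequisites once you extract their version strings, for example `version_ge 3.14.2 3.14`.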
Set environment variables
Keep the following cluster name and AWS Region consistent throughout these steps. Changing either value may cause subsequent commands to target the wrong EKS cluster.
export CLUSTER_NAME=ai-eks-docs
export AWS_REGION=us-east-2
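Because later commands silently depend on these variables, it can help to fail fast when one is missing, for example after opening a new terminal. This is a small sketch; `require_vars` is a hypothetical helper name introduced here for illustration.

```shell
# require_vars NAME...: reports the first variable that is unset or empty
# and returns a non-zero status; returns 0 when all are set.
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Example use before running a step:
#   require_vars CLUSTER_NAME AWS_REGION || echo "re-export the variables first"
```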
Using all available AZs improves fault tolerance and increases the chances of obtaining GPU capacity:
export AZS=$(aws ec2 describe-availability-zones \
  --region ${AWS_REGION} \
  --query "AvailabilityZones[?ZoneId!='use1-az3' && ZoneId!='usw1-az2' && ZoneId!='cac1-az3'].ZoneName" \
  --output text | tr '\t' ',')
echo $AZS
Important
The Availability Zones use1-az3, usw1-az2, and cac1-az3 are excluded because Amazon EKS does not support control plane placement in those zones. Including one of them causes cluster creation to fail with an UnsupportedAvailabilityZoneException.
Expected output:
us-east-2a,us-east-2b,us-east-2c
The AZs in the output will vary per region. This example shows the available AZs for us-east-2 region.
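A quick sanity check on the resulting list can catch an empty or single-zone result before cluster creation. The sketch below only counts the entries in the comma-separated string; `count_azs` is a hypothetical helper, and the two-zone threshold reflects the EKS requirement that cluster subnets span at least two Availability Zones.

```shell
# count_azs CSV: prints how many comma-separated zone names CSV contains.
count_azs() {
  if [ -z "$1" ]; then
    echo 0
    return 0
  fi
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

# Example use:
#   [ "$(count_azs "$AZS")" -ge 2 ] || echo "need at least two AZs, got: $AZS"
```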
Create cluster and GPU NodePool
This section provides two paths for creating your EKS cluster and GPU-enabled nodes, shown in the following diagram. Choose only one option throughout the guide.
- EKS Auto Mode: In addition to the core networking, storage, and load balancing add-ons, EKS Auto Mode includes and manages the following capabilities for training and inference workloads: EKS node monitoring agent, automatic node repair, SOCI snapshotter for fast container pulls, and GPU readiness for the default NodeClass. The NVIDIA device plugin is included in the Bottlerocket accelerated AMI that EKS Auto Mode uses for GPU-enabled nodes.
- Self-managed Karpenter: On an EKS cluster without EKS Auto Mode, you are responsible for installing and configuring the components required for training and inference workloads. This includes networking add-ons (VPC CNI, CoreDNS, kube-proxy), Karpenter, the EKS node monitoring agent, the NVIDIA device plugin, and SOCI snapshotter for fast container pulls.
EKS cluster options: EKS Auto Mode and self-managed Karpenter
In each of the following steps, choose a path (EKS Auto Mode, Karpenter) and follow it throughout. After completing the steps for your chosen path, you’ll have an EKS cluster with a GPU NodePool ready to schedule GPU workloads.
Step 1: Create cluster
Start by creating your EKS cluster and installing the cluster components needed for GPU workloads.
With EKS Auto Mode, a single eksctl create cluster --enable-auto-mode command provisions an EKS cluster that’s ready for GPU workloads.
With self-managed Karpenter, the eksctl create cluster command provisions the core networking add-ons, then additional steps are needed to enable automatic node repair through a Karpenter feature gate, install the EKS node monitoring agent, and install the NVIDIA device plugin.
Warning
For both the EKS Auto Mode and self-managed Karpenter paths, automatic node repair behaves the same way for nodes provisioned by NodePools. Automatic node repair in EKS Auto Mode and Karpenter is a forceful disruption method that bypasses PodDisruptionBudgets, the karpenter.sh/do-not-disrupt annotation, and terminationGracePeriod. Automatic node repair waits 10 minutes before replacing a node with the AcceleratedHardwareReady condition set to False and 30 minutes for other repair conditions.
Step 2: Create dynamic GPU NodePool
Define a NodePool that dynamically provisions G-family GPU instances with generation greater than 4, using Spot capacity with On-Demand as a fallback. The EKS Auto Mode and Karpenter paths both use the same NodePool API with the only difference being the NodeClass it points to. In EKS Auto Mode, the bundled default NodeClass already selects the right AMI and configures SOCI parallel pull, so the NodePool is the only object you create. In self-managed Karpenter, you also need a custom EC2NodeClass that pins the AMI and tunes SOCI.
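To make the description concrete, here is a sketch of what such a NodePool could look like on the EKS Auto Mode path. It is an illustration under stated assumptions, not necessarily the exact manifest this guide applies: it assumes the `karpenter.sh/v1` API, the `eks.amazonaws.com/instance-category` and `eks.amazonaws.com/instance-generation` label keys, and an explicit `nvidia.com/gpu` taint. The file path is hypothetical, and the snippet only writes the file so you can review it first.

```shell
# Write a sketch of the gpu-inf NodePool to a file for review; nothing is applied here.
NODEPOOL_FILE="${TMPDIR:-/tmp}/gpu-inf-nodepool.yaml"
cat << 'EOF' > "$NODEPOOL_FILE"
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inf
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default            # bundled EKS Auto Mode NodeClass
      requirements:
        - key: eks.amazonaws.com/instance-category
          operator: In
          values: ["g"]          # G-family GPU instances
        - key: eks.amazonaws.com/instance-generation
          operator: Gt
          values: ["4"]          # generation greater than 4
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Spot preferred, On-Demand fallback
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule     # assumed; keeps non-GPU Pods off these nodes
EOF
echo "wrote ${NODEPOOL_FILE}"
```

After reviewing the file, `kubectl apply -f "$NODEPOOL_FILE"` would create the NodePool. On the self-managed Karpenter path the equivalent label keys use the `karpenter.k8s.aws/` prefix, and `nodeClassRef` points at an EC2NodeClass instead.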
Validate the NodePool was created:
kubectl get nodepool gpu-inf
Expected output:
NAME      NODECLASS   NODES   READY   AGE
gpu-inf   default     0       True    8s
On the self-managed Karpenter path, the NODECLASS column shows gpu-inf instead of default.
Step 3: Test with a sample Pod
Test your GPU NodePool setup with an nvidia-smi Pod.
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
  labels:
    guide: ai-eks-docs
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: nvidia-smi
      image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  restartPolicy: OnFailure
EOF
Verify the Pod is scheduled and completed successfully.
kubectl get pods
Expected output:
NAME         READY   STATUS      RESTARTS   AGE
nvidia-smi   0/1     Completed   0          67s
The STATUS: Completed means the nvidia-smi command ran and exited. Check the Pod logs to see the GPU detected by the node.
kubectl logs nvidia-smi
Expected output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:2B:00.0 Off |                    0 |
| N/A   30C    P0              81W / 600W |        0MiB / 97887MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
The output shows the GPU model, driver version, CUDA version, and available memory. In this example, Karpenter provisioned a G7e instance, which has an NVIDIA RTX PRO 6000 Blackwell GPU with 96 GB of memory. 30C is the current GPU temperature, and P0 is the highest performance state (the GPU is idle but ready). 81W / 600W shows the current power draw against the maximum power cap, and 0MiB / 97887MiB shows GPU memory in use against the total available. Because the Pod only ran nvidia-smi and exited, no workload is using the GPU, so memory use is 0 and power draw is at idle. The NVIDIA GPU driver version (580.126.09) comes from the Bottlerocket AMI. The CUDA version (13.0) in the header is the highest CUDA version that this driver supports; it is reported by the driver, not by a CUDA toolkit in the container image. The GPU model and memory vary with the instance type Karpenter selects: G5 instances have NVIDIA A10G GPUs (24 GB), G6e instances have NVIDIA L40S GPUs (48 GB), and G7e instances have NVIDIA RTX PRO 6000 GPUs (96 GB).
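If you want the memory figures in a script rather than reading the table, the relevant cell can be pulled out with standard text tools. This is a rough sketch that assumes the `<used>MiB / <total>MiB` cell format seen in the sample output; `gpu_memory_mib` and `mib_to_gib` are hypothetical helper names.

```shell
# gpu_memory_mib: reads nvidia-smi table text on stdin and prints
# "<used_mib> <total_mib>" for each memory-usage cell it finds.
gpu_memory_mib() {
  grep -oE '[0-9]+MiB */ *[0-9]+MiB' |
    sed -E 's#([0-9]+)MiB */ *([0-9]+)MiB#\1 \2#'
}

# mib_to_gib MIB: converts MiB to GiB, rounded to one decimal place.
mib_to_gib() {
  awk -v mib="$1" 'BEGIN { printf "%.1f\n", mib / 1024 }'
}

# Example use:
#   kubectl logs nvidia-smi | gpu_memory_mib
```

For the sample row, `mib_to_gib 97887` prints 95.6, which is the 96 GB card described above once rounding and reserved memory are taken into account.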
To understand how Karpenter and the Kubernetes scheduler coordinated to provision a node and place the Pod, check the Pod’s lifecycle events:
kubectl describe po nvidia-smi
Expected output:
Events:
  Type     Reason                  Age   From                   Message
  ----     ------                  ----  ----                   -------
  Warning  FailedScheduling        60s   default-scheduler      0/2 nodes are available: 2 node(s) had untolerated taint(s). no new claims to deallocate, preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
  Normal   Nominated               59s   eks-auto-mode/compute  Pod should schedule on: nodeclaim/gpu-inf-vxcnj
  Normal   Scheduled               24s   default-scheduler      Successfully assigned default/nvidia-smi to i-0fb17a09bc4203164
  Warning  FailedCreatePodSandBox  21s   kubelet                Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7f85e25b220c8fb245187758dbbbc8efb3d40f3e49e13054404880daf4c3b2f0": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to setup network policy
  Normal   Pulling                 7s    kubelet                spec.containers{nvidia-smi}: Pulling image "public.ecr.aws/amazonlinux/amazonlinux:2023-minimal"
  Normal   Pulled                  5s    kubelet                spec.containers{nvidia-smi}: Successfully pulled image "public.ecr.aws/amazonlinux/amazonlinux:2023-minimal" in 1.237s (1.237s including waiting). Image size: 37442701 bytes.
  Normal   Created                 5s    kubelet                spec.containers{nvidia-smi}: Container created
  Normal   Started                 5s    kubelet                spec.containers{nvidia-smi}: Container started
These events show the Pod scheduling sequence: the Pod initially fails to schedule because no GPU nodes exist (FailedScheduling), Karpenter nominates a new NodeClaim (Nominated), the scheduler assigns the Pod once the node is ready (Scheduled), and then the container image is pulled and started. EKS Auto Mode comes with SOCI (Seekable OCI) parallel pull installed and configured out of the box on G, P, and Trn instances. With SOCI parallel pull, the container image was pulled from ECR in under 2 seconds (1.237s).
A NodeClaim is a request Karpenter creates to provision a specific node. It shows the instance type, capacity type, AZ, and whether the node is ready.
kubectl get nodeclaims
Expected NodeClaim output:
NAME            TYPE          CAPACITY   ZONE         NODE              READY   AGE
gpu-inf-xxxxx   g7e.2xlarge   spot       us-east-2a   i-0xxxxxxxxxxxx   True    2m
The instance type and AZ will vary. Any G-family instance with generation > 4 is eligible.
The FailedCreatePodSandBox warning in kubectl describe pod nvidia-smi is transient and expected. The VPC CNI initializes asynchronously after the node joins, and the kubelet retries automatically. If the Pod stays in ContainerCreating, check node events with kubectl describe node <node-name>.
Tip
If no node appears, check for Insufficient Capacity Errors:
kubectl get events | grep InsufficientCapacityError
Karpenter caches unavailable offerings for 3 minutes. Widening the allowed instance types and AZs in your NodePool increases the chances of landing capacity.
Note
Spot instances launched by Karpenter will not appear in the EC2 Spot Requests console. Karpenter uses the EC2 CreateFleet API with type: instant. The instances appear in the EC2 Instances console with a spot lifecycle.
Step 4: Add reserved capacity to the NodePool (optional)
To use reserved capacity first with Spot/On-Demand fallback, create an On-Demand Capacity Reservation (ODCR) and attach it to your NodeClass, then update the dynamic NodePool from Step 2 to also allow reserved capacity. The reservation API call is the same for both paths; the NodeClass attachment differs because EKS Auto Mode and self-managed Karpenter use different NodeClass kinds.
Warning
The following command results in a charge for the reserved instance type until you cancel it with aws ec2 cancel-capacity-reservation --capacity-reservation-id <id>.
Create the capacity reservation:
CR_AZ="us-east-2a"
INSTANCE_TYPE="g6e.4xlarge"
aws ec2 create-capacity-reservation \
  --instance-type $INSTANCE_TYPE \
  --instance-platform Linux/UNIX \
  --availability-zone "$CR_AZ" \
  --instance-count 1 \
  --instance-match-criteria open \
  --end-date-type unlimited
If you get an InsufficientInstanceCapacity error, change CR_AZ to a different AZ and retry.
Look up the capacity reservation ID and store it in a shell variable for the following steps:
CAPACITY_RESERVATION_ID=$(aws ec2 describe-capacity-reservations \
  --filters "Name=state,Values=active" "Name=instance-type,Values=${INSTANCE_TYPE}" \
  --query 'CapacityReservations[0].CapacityReservationId' \
  --output text \
  --region ${AWS_REGION})
echo "Capacity reservation ID: ${CAPACITY_RESERVATION_ID}"
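When a `--query` expression matches nothing, the AWS CLI prints the literal string `None` with `--output text`, so a lookup like this can "succeed" while leaving an unusable value in the variable. A small guard avoids passing that value on to later commands. This is a sketch; `is_resolved` is a hypothetical helper name.

```shell
# is_resolved VALUE: succeeds only when VALUE looks like a real lookup result.
# An empty string, "None" (AWS CLI text output for null), and "null" all fail.
is_resolved() {
  case "$1" in
    ""|None|null) return 1 ;;
    *) return 0 ;;
  esac
}

# Example use:
#   is_resolved "$CAPACITY_RESERVATION_ID" || echo "no active reservation found"
```

The same guard applies to the other ID lookups in this guide, such as the AMP workspace ID.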
Then apply the NodeClass and NodePool changes for your path:
Karpenter treats reserved as the most cost-efficient option and launches it first. Once the reservation is full, it falls back to Spot or On-Demand.
After applying the changes, validate that Karpenter prioritizes reserved capacity and falls back to Spot or On-Demand. Deploy a 2-replica Deployment that requests 1 GPU per Pod. The ODCR is for 1 instance, so the first Pod triggers Karpenter to launch a reserved node. The second Pod cannot fit on the reserved node and triggers Karpenter to launch another node from Spot or On-Demand capacity.
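The overflow reasoning is simple arithmetic, sketched below. The helper `overflow_pods` is hypothetical, and it considers GPUs only, ignoring CPU and memory. A g6e.4xlarge has one GPU, so a reservation of one instance holds one single-GPU Pod.

```shell
# overflow_pods PODS GPUS_PER_POD RESERVED_INSTANCES GPUS_PER_INSTANCE
# Prints how many Pods do not fit on reserved capacity (GPU count only).
overflow_pods() {
  pods=$1
  per_pod=$2
  reserved=$3
  per_instance=$4
  fit_per_instance=$(( per_instance / per_pod ))   # Pods that fit on one instance
  fit=$(( reserved * fit_per_instance ))           # Pods that fit on all reserved instances
  if [ "$pods" -gt "$fit" ]; then
    echo $(( pods - fit ))
  else
    echo 0
  fi
}

# Two single-GPU Pods against one reserved single-GPU instance:
overflow_pods 2 1 1 1
```

The result of 1 is the Pod that Karpenter has to place on Spot or On-Demand capacity.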
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-overflow-test
  labels:
    guide: ai-eks-docs
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-overflow-test
  template:
    metadata:
      labels:
        app: gpu-overflow-test
        guide: ai-eks-docs
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-smi
          image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
          command: ["sh", "-c", "nvidia-smi && sleep infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
Unlike the nvidia-smi test Pod from Step 3 which ran and exited, this Deployment keeps the Pods running (sleep infinity) so they hold the GPU and don’t release the node.
Verify the Pods scheduled on different nodes:
kubectl get pods -l app=gpu-overflow-test -o wide
Expected output:
NAME                                 READY   STATUS    RESTARTS   AGE     IP                NODE                  NOMINATED NODE   READINESS GATES
gpu-overflow-test-59b97944fb-lq56c   1/1     Running   0          2m42s   192.168.186.240   i-057692590480155da   <none>           <none>
gpu-overflow-test-59b97944fb-z4zcx   1/1     Running   0          2m42s   192.168.130.64    i-0521ecd1849fa0578   <none>           <none>
The two pods are running, each on a different node.
Check the NodeClaims to see the capacity types:
kubectl get nodeclaims
Expected output:
NAME            TYPE          CAPACITY   ZONE         NODE                  READY   AGE
gpu-inf-shg5w   g6e.4xlarge   reserved   us-east-2a   i-0ea91fdeef65b8cb6   True    2m2s
gpu-inf-ssnqf   g7e.2xlarge   spot       us-east-2b   i-00ccf7ce65cf3f6ca   True    112s
The reserved node launched first, followed by a Spot or On-Demand node once the reservation was full.
Clean up the test deployment:
kubectl delete deployment gpu-overflow-test
Monitoring
Install a monitoring stack that collects cluster, node, and GPU metrics into Amazon Managed Service for Prometheus (AMP), and visualize them with Grafana. The kube-prometheus-stack Helm chart deploys Prometheus to scrape and remote-write metrics to AMP, plus a self-managed Grafana for dashboards. The NVIDIA DCGM Exporter adds GPU-specific metrics (utilization, memory, temperature, power, NVLink, tensor activity).
Prometheus, Grafana, and the operator land on non-GPU nodes by default because GPU nodes carry the nvidia.com/gpu:NoSchedule taint. Node-exporter and the DCGM Exporter both run on GPU nodes so we can scrape host and GPU metrics fleet-wide.
If you opened a new terminal, set the cluster name and region:
export CLUSTER_NAME=ai-eks-docs
export AWS_REGION=us-east-2
Create the AMP workspace
Create an AMP workspace to store metrics:
aws amp create-workspace \
  --alias "amp-ws-${CLUSTER_NAME}" \
  --region ${AWS_REGION}
Get the workspace ID:
AMP_WORKSPACE_ID=$(aws amp list-workspaces \
  --alias "amp-ws-${CLUSTER_NAME}" \
  --query 'workspaces[0].workspaceId' \
  --output text \
  --region ${AWS_REGION})
echo "AMP Workspace ID: ${AMP_WORKSPACE_ID}"
Get the remote-write endpoint:
AMP_ENDPOINT=$(aws amp describe-workspace \
  --workspace-id ${AMP_WORKSPACE_ID} \
  --query 'workspace.prometheusEndpoint' \
  --output text \
  --region ${AWS_REGION})
echo "AMP Endpoint: ${AMP_ENDPOINT}"
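The Prometheus configuration later appends `api/v1/remote_write` directly to this endpoint, which works because the workspace endpoint ends with a slash. If you build the URL yourself, normalizing the trailing slash avoids a doubled or missing separator. A minimal sketch follows; `remote_write_url` is a hypothetical helper and the workspace ID in the example is a placeholder.

```shell
# remote_write_url ENDPOINT: prints the remote-write URL for an AMP workspace
# endpoint, whether or not ENDPOINT ends with a slash.
remote_write_url() {
  endpoint=${1%/}   # drop one trailing slash if present
  printf '%s/api/v1/remote_write\n' "$endpoint"
}

# Placeholder endpoint for illustration only:
remote_write_url "https://aps-workspaces.us-east-2.amazonaws.com/workspaces/ws-EXAMPLE/"
```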
Create IAM policy and EKS Pod Identity associations
Create an IAM policy that allows Prometheus to remote-write metrics and Grafana to query them:
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
AMP_POLICY_ARN=$(aws iam create-policy \
  --policy-name "${CLUSTER_NAME}-amp-grafana-policy" \
  --policy-document "{\"Version\": \"2012-10-17\", \"Statement\": [{\"Sid\": \"AllowAMPReadWrite\", \"Effect\": \"Allow\", \"Action\": [\"aps:ListWorkspaces\", \"aps:DescribeWorkspace\", \"aps:GetMetricMetadata\", \"aps:GetSeries\", \"aps:QueryMetrics\", \"aps:RemoteWrite\", \"aps:GetLabels\"], \"Resource\": \"arn:aws:aps:${AWS_REGION}:${ACCOUNT_ID}:workspace/*\"}, {\"Sid\": \"AllowCloudWatchMetrics\", \"Effect\": \"Allow\", \"Action\": [\"cloudwatch:DescribeAlarmsForMetric\", \"cloudwatch:ListMetrics\", \"cloudwatch:GetMetricData\", \"cloudwatch:GetMetricStatistics\"], \"Resource\": \"*\"}]}" \
  --query 'Policy.Arn' \
  --output text)
echo "AMP Policy ARN: ${AMP_POLICY_ARN}"
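Escaped inline JSON is easy to break when editing. One alternative, sketched here, is to write the same policy to a file with a heredoc and pass it with the AWS CLI's `file://` prefix. The file path is hypothetical, and the account ID default is a placeholder used only when `ACCOUNT_ID` is not already set.

```shell
# Defaults are placeholders so the sketch runs standalone.
: "${AWS_REGION:=us-east-2}"
: "${ACCOUNT_ID:=123456789012}"

POLICY_FILE="${TMPDIR:-/tmp}/amp-grafana-policy.json"
cat << EOF > "$POLICY_FILE"
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAMPReadWrite",
      "Effect": "Allow",
      "Action": [
        "aps:ListWorkspaces", "aps:DescribeWorkspace", "aps:GetMetricMetadata",
        "aps:GetSeries", "aps:QueryMetrics", "aps:RemoteWrite", "aps:GetLabels"
      ],
      "Resource": "arn:aws:aps:${AWS_REGION}:${ACCOUNT_ID}:workspace/*"
    },
    {
      "Sid": "AllowCloudWatchMetrics",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:DescribeAlarmsForMetric", "cloudwatch:ListMetrics",
        "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics"
      ],
      "Resource": "*"
    }
  ]
}
EOF
echo "wrote ${POLICY_FILE}"
```

You would then replace the inline document with `--policy-document "file://${POLICY_FILE}"` in the `aws iam create-policy` call.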
Create the monitoring namespace and the service accounts for Prometheus and Grafana:
kubectl create namespace monitoring
kubectl create serviceaccount amp-iamproxy-ingest-service-account -n monitoring
kubectl create serviceaccount grafana-sa -n monitoring
Create EKS Pod Identity Associations to link the service accounts to the IAM policy:
eksctl create podidentityassociation \
  --cluster ${CLUSTER_NAME} \
  --namespace monitoring \
  --service-account-name amp-iamproxy-ingest-service-account \
  --role-name "${CLUSTER_NAME}-amp-ingest-role" \
  --permission-policy-arns ${AMP_POLICY_ARN} \
  --region ${AWS_REGION}
eksctl create podidentityassociation \
  --cluster ${CLUSTER_NAME} \
  --namespace monitoring \
  --service-account-name grafana-sa \
  --role-name "${CLUSTER_NAME}-grafana-role" \
  --permission-policy-arns ${AMP_POLICY_ARN} \
  --region ${AWS_REGION}
Verify both EKS Pod Identity associations were created:
eksctl get podidentityassociation --cluster ${CLUSTER_NAME} --region ${AWS_REGION}
Expected output should include both amp-iamproxy-ingest-service-account and grafana-sa in the monitoring namespace.
Install kube-prometheus-stack
Add the Helm repo:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
This values file omits a nodeSelector for Prometheus, Grafana, and the operator: the GPU nodes' nvidia.com/gpu:NoSchedule taint keeps them off GPU nodes, so they land on the system or general-purpose pool by default. Node-exporter uses a wildcard toleration so it runs on every node — including GPU nodes — to collect metrics fleet-wide.
Create the values file:
cat << EOF > /tmp/kube-prometheus-values.yaml
prometheus:
  serviceAccount:
    create: false
    name: amp-iamproxy-ingest-service-account
  prometheusSpec:
    serviceAccountName: amp-iamproxy-ingest-service-account
    remoteWrite:
      - url: "${AMP_ENDPOINT}api/v1/remote_write"
        sigv4:
          region: "${AWS_REGION}"
        queueConfig:
          maxSamplesPerSend: 1000
          maxShards: 200
          capacity: 2500
    retention: 5h
    scrapeInterval: 30s
    evaluationInterval: 30s
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
alertmanager:
  enabled: false
grafana:
  enabled: true
  serviceAccount:
    create: false
    name: grafana-sa
  grafana.ini:
    auth.sigv4:
      enabled: true
  sidecar:
    datasources:
      defaultDatasourceEnabled: false
  plugins:
    - grafana-amazonprometheus-datasource
  additionalDataSources:
    - name: Amazon-Managed-Prometheus
      type: grafana-amazonprometheus-datasource
      access: proxy
      url: "${AMP_ENDPOINT}"
      isDefault: true
      jsonData:
        sigV4Auth: true
        defaultRegion: "${AWS_REGION}"
        sigV4Region: "${AWS_REGION}"
      editable: true
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: 'GPU Monitoring'
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      nvidia-dcgm:
        gnetId: 25261
        revision: 1
        datasource:
          - name: DS_PROMETHEUS
            value: Amazon-Managed-Prometheus
      vllm:
        gnetId: 25263
        revision: 1
        datasource:
          - name: DS_PROMETHEUS
            value: Amazon-Managed-Prometheus
prometheus-node-exporter:
  tolerations:
    - operator: Exists
EOF
Validate the variables were populated correctly:
grep -E "url:|region:|tolerations:" /tmp/kube-prometheus-values.yaml
You should see the full AMP endpoint URL (starting with https://aps-workspaces…), your region, and the node-exporter tolerations: line. If anything is empty, re-export the variables and recreate the file.
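The visual check can be backed by a scripted one. Because the heredoc expands variables when the file is written, an unset `AMP_ENDPOINT` or `AWS_REGION` leaves recognizable empty values behind. The sketch below looks for those patterns; `values_look_populated` is a hypothetical helper, and the patterns assume the key names used in the values file above.

```shell
# values_look_populated FILE: fails when FILE shows signs that AMP_ENDPOINT or
# AWS_REGION was empty when the heredoc was expanded.
values_look_populated() {
  file=$1
  # An empty endpoint leaves url: "" or url: "api/v1/remote_write".
  if grep -qE 'url: "(api/v1/remote_write)?"' "$file"; then
    echo "AMP endpoint looks empty in $file" >&2
    return 1
  fi
  # An empty region leaves region: "", defaultRegion: "", or sigV4Region: "".
  if grep -qE '[Rr]egion: ""' "$file"; then
    echo "region looks empty in $file" >&2
    return 1
  fi
  return 0
}

# Example use:
#   values_look_populated /tmp/kube-prometheus-values.yaml || echo "recreate the file"
```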
Install the chart:
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f /tmp/kube-prometheus-values.yaml
Verify the pods are running:
kubectl get pods -n monitoring
Expected output:
NAME                                                       READY   STATUS    RESTARTS   AGE
kube-prometheus-stack-grafana-7c58f54f77-rftrj             3/3     Running   0          4m
kube-prometheus-stack-kube-state-metrics-d68dcbc84-5smxq   1/1     Running   0          4m
kube-prometheus-stack-operator-5895df479f-ttm47            1/1     Running   0          4m
kube-prometheus-stack-prometheus-node-exporter-t9q7s       1/1     Running   0          4m
kube-prometheus-stack-prometheus-node-exporter-x6vfb       1/1     Running   0          4m
prometheus-kube-prometheus-stack-prometheus-0              2/2     Running   0          4m
The stack deploys the following components:
- Prometheus (StatefulSet): scrapes metrics and remote-writes them to AMP
- Grafana: dashboards and visualization, pre-configured with the AMP datasource
- kube-state-metrics: generates metrics about Kubernetes object state (Pod status, resource requests/limits, NodeClaim states)
- node-exporter (DaemonSet, one per node): collects host-level metrics (CPU, memory, disk, network)
- operator: manages the Prometheus and Alertmanager custom resources
Alertmanager is disabled in this setup.
Access Grafana
Open a separate terminal and port-forward to access Grafana:
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
Open http://localhost:3000 and log in with the username admin and the password returned by the following command:
kubectl --namespace monitoring get secrets kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
To verify the metrics pipeline is working end to end:
- Navigate to Connections > Data sources and confirm Amazon-Managed-Prometheus is listed as the default datasource.
  Validate the AMP datasource in Grafana
- Navigate to Drilldown > Metrics and search for the up metric. You should see results from your cluster’s scrape targets.
  Validate the up metric in Grafana
If up shows results, the pipeline (cluster → Prometheus → AMP → Grafana) is working.
Deploy the DCGM Exporter for GPU metrics
The kube-prometheus-stack collects node-level CPU and memory metrics but not GPU metrics. The NVIDIA DCGM Exporter adds GPU utilization, memory usage, temperature, power draw, NVLink bandwidth, and tensor activity.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
Set the GPU node selector key for your path. EKS Auto Mode and self-managed Karpenter use different label keys for GPU manufacturer.
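As an illustration of what this step might look like, the sketch below maps each path to a label key and exports `GPU_NODE_SELECTOR_KEY`, the variable the values file expects. The two label keys are assumptions based on the well-known node labels of EKS Auto Mode and the Karpenter AWS provider, and the path names `auto-mode` and `karpenter` are hypothetical; confirm the key on your own nodes with `kubectl get nodes --show-labels` before relying on it.

```shell
# gpu_selector_key PATH: prints the node label key that carries the GPU
# manufacturer. PATH is "auto-mode" or "karpenter" (names used only here).
gpu_selector_key() {
  case "$1" in
    auto-mode) echo "eks.amazonaws.com/instance-gpu-manufacturer" ;;
    karpenter) echo "karpenter.k8s.aws/instance-gpu-manufacturer" ;;
    *) echo "unknown path: $1" >&2; return 1 ;;
  esac
}

# Pick the path you followed in Step 1; this example uses EKS Auto Mode.
export GPU_NODE_SELECTOR_KEY="$(gpu_selector_key auto-mode)"
echo "GPU node selector key: ${GPU_NODE_SELECTOR_KEY}"
```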
Create the DCGM exporter values file:
cat << EOF > /tmp/dcgm-exporter-values.yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "100m"
  limits:
    memory: "1Gi"
    cpu: "500m"
serviceMonitor:
  enabled: true
  additionalLabels:
    release: kube-prometheus-stack
nodeSelector:
  ${GPU_NODE_SELECTOR_KEY}: nvidia
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
customMetrics: |
  # Clocks
  DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
  DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
  # Temperature
  DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
  DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
  # Power
  DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
  DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
  # PCIe
  DCGM_FI_PROF_PCIE_TX_BYTES, counter, Number of bytes transmitted through PCIe TX (in KB) via NVML.
  DCGM_FI_PROF_PCIE_RX_BYTES, counter, Number of bytes received through PCIe RX (in KB) via NVML.
  DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
  # Utilization (the sample period varies depending on the product)
  DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
  DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
  DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
  DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).
  # Errors and violations
  DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
  DCGM_EXP_XID_ERRORS_COUNT, gauge, Value of count of XID errors encountered.
  DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
  DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
  DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
  DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
  DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
  DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
  # Memory usage
  DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
  DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
  # Retired pages
  DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
  DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
  DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
  # NVLink
  DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
  DCGM_FI_PROF_NVLINK_TX_BYTES, counter, The rate of data transmitted over NVLink not including protocol headers in bytes per second.
  DCGM_FI_PROF_NVLINK_RX_BYTES, counter, The rate of data received over NVLink not including protocol headers in bytes per second.
  # DCP metrics
  DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
  DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
  DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
  DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
  DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
  DCGM_FI_DEV_CLOCK_THROTTLE_REASONS, gauge, Current clock throttle reasons (bitmask of DCGM_CLOCKS_THROTTLE_REASON_*).
  DCGM_FI_DEV_GPU_NVLINK_ERRORS, gauge, Identifies a GPU NVLink error type returned by DCGM_FI_DEV_GPU_NVLINK_ERRORS.
  ## NVLink
  DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
  ## Remapped rows
  DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors.
  DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors.
  DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, whether remapping of rows has failed.
  ## Profiling metrics
  DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
  DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
  DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
  # ECC
  DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
  DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
EOF
The customMetrics field overrides the DCGM exporter’s default metric set with an extended one that includes NVLink bandwidth, tensor activity, PCIe throughput, ECC errors, and thermal throttling. For inference workloads these help you understand whether the GPU compute units are fully utilized, whether the GPU is idle between requests due to low batch sizes, whether data transfer between CPU and GPU is a bottleneck, whether thermal throttling is causing latency spikes, and how much GPU memory headroom remains for larger batches.
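Each non-comment line in customMetrics follows the pattern "DCGM field, metric type, help text", so a typo in the type column is easy to miss in a long list. The sketch below flags lines whose second field is not `gauge` or `counter`, the two types used in this guide's list. `check_metrics_csv` is a hypothetical helper and is not part of the DCGM exporter.

```shell
# check_metrics_csv: reads metric definitions on stdin, prints every
# non-comment line that lacks three fields or has an unexpected type,
# and returns non-zero if any such line was found.
check_metrics_csv() {
  awk -F',' '
    /^[ \t]*#/ { next }   # skip comment lines
    /^[ \t]*$/ { next }   # skip blank lines
    {
      type = $2
      gsub(/^[ \t]+|[ \t]+$/, "", type)
      if (NF < 3 || (type != "gauge" && type != "counter")) {
        print "line " NR ": " $0
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Example use against the metric lines of the generated values file:
#   grep 'DCGM_' /tmp/dcgm-exporter-values.yaml | check_metrics_csv
```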
Install the DCGM exporter:
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  -f /tmp/dcgm-exporter-values.yaml
The tolerations allow the exporter to run on the GPU-tainted nodes you provisioned in Step 2. The serviceMonitor with the release: kube-prometheus-stack label ensures Prometheus discovers and scrapes it automatically.
Verify the DCGM exporter DaemonSet:
kubectl get daemonset dcgm-exporter -n monitoring
Once a GPU node is running, you should see one ready Pod. To validate DCGM metrics, navigate to Drilldown > Metrics in Grafana and search for DCGM_.
Validate DCGM metrics in Grafana
To view the dashboard, navigate to Dashboards > GPU Monitoring > NVIDIA DCGM Exporter Dashboard.
NVIDIA DCGM Exporter Dashboard in Grafana
Model weights S3 bucket
Create an Amazon S3 bucket for storing model weights and configure an EKS Pod Identity Association so workload pods can read and write to it.
If you opened a new terminal, set the cluster name and region:
export CLUSTER_NAME=ai-eks-docs
export AWS_REGION=us-east-2
Create the S3 bucket
Create the bucket with a random suffix to avoid name collisions:
BUCKET_SUFFIX=$(head -c 4 /dev/urandom | od -An -tx1 | tr -d ' \n')
MODEL_BUCKET="${CLUSTER_NAME}-models-${BUCKET_SUFFIX}"
aws s3 mb s3://${MODEL_BUCKET} --region ${AWS_REGION}
New S3 buckets have server-side encryption with Amazon S3 managed keys (SSE-S3, AES-256) and Block Public Access enabled by default.
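The suffix comes from four random bytes rendered as eight hexadecimal characters. Since the bucket name is derived from `CLUSTER_NAME`, a quick check against the basic S3 naming rules can catch a bad name before the create call. The sketch below covers only the basics (3 to 63 characters; lowercase letters, digits, dots, and hyphens; a letter or digit at each end); `valid_bucket_name` is a hypothetical helper and does not implement every S3 rule.

```shell
# valid_bucket_name NAME: checks basic general-purpose bucket naming rules.
valid_bucket_name() {
  len=${#1}
  [ "$len" -ge 3 ] && [ "$len" -le 63 ] || return 1
  # Every character must be a lowercase letter, digit, dot, or hyphen.
  [ -z "$(printf '%s' "$1" | LC_ALL=C tr -d 'a-z0-9.-')" ] || return 1
  # The first and last characters must not be a dot or hyphen.
  case "$1" in
    [.-]*|*[.-]) return 1 ;;
  esac
  return 0
}

# Same suffix generation as above: 4 random bytes become 8 hex characters.
BUCKET_SUFFIX=$(head -c 4 /dev/urandom | od -An -tx1 | tr -d ' \n')
MODEL_BUCKET="${CLUSTER_NAME:-ai-eks-docs}-models-${BUCKET_SUFFIX}"
valid_bucket_name "$MODEL_BUCKET" && echo "bucket name looks valid: ${MODEL_BUCKET}"
```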
Configure EKS Pod Identity for S3 access
Create a model-storage-sa ServiceAccount in the default namespace, an IAM policy scoped to the model bucket, and an EKS Pod Identity Association that links them. Workload pods that set serviceAccountName: model-storage-sa will be able to read and write to the bucket.
kubectl create serviceaccount model-storage-sa
Create the IAM policy:
POLICY_ARN=$(aws iam create-policy \
  --policy-name "${CLUSTER_NAME}-model-storage-policy" \
  --policy-document "{\"Version\": \"2012-10-17\", \"Statement\": [{\"Effect\": \"Allow\", \"Action\": [\"s3:GetObject\", \"s3:PutObject\", \"s3:ListBucket\", \"s3:DeleteObject\"], \"Resource\": [\"arn:aws:s3:::${MODEL_BUCKET}\", \"arn:aws:s3:::${MODEL_BUCKET}/*\"]}]}" \
  --query 'Policy.Arn' \
  --output text)
echo "Policy ARN: ${POLICY_ARN}"
Note
This policy grants s3:DeleteObject and s3:PutObject for the validation step. For production inference pods that only read model weights, remove s3:PutObject and s3:DeleteObject to follow least-privilege.
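For reference, a read-only variant of the policy could look like the following sketch. It keeps only `s3:GetObject` and `s3:ListBucket`. The file path is hypothetical, and the bucket name default is a placeholder used only when `MODEL_BUCKET` is not already set.

```shell
# Placeholder bucket name so the sketch runs standalone.
MODEL_BUCKET="${MODEL_BUCKET:-ai-eks-docs-models-01234567}"

READONLY_POLICY_FILE="${TMPDIR:-/tmp}/model-storage-readonly-policy.json"
cat << EOF > "$READONLY_POLICY_FILE"
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::${MODEL_BUCKET}",
        "arn:aws:s3:::${MODEL_BUCKET}/*"
      ]
    }
  ]
}
EOF
echo "wrote ${READONLY_POLICY_FILE}"
```

Passing `--policy-document "file://${READONLY_POLICY_FILE}"` to `aws iam create-policy` would create it. Note that the validation Pod in this section writes and deletes a test object, so it needs the broader policy.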
Create the EKS Pod Identity Association. eksctl creates the IAM role with the correct trust policy and links it to the ServiceAccount:
eksctl create podidentityassociation \
  --cluster ${CLUSTER_NAME} \
  --namespace default \
  --service-account-name model-storage-sa \
  --role-name "${CLUSTER_NAME}-model-storage-role" \
  --permission-policy-arns ${POLICY_ARN} \
  --region ${AWS_REGION}
Verify the association:
eksctl get podidentityassociation --cluster ${CLUSTER_NAME} --region ${AWS_REGION}
The output should include the model-storage-sa association in the default namespace.
Run a one-off Pod with the AWS CLI image, using the model-storage-sa ServiceAccount, to confirm EKS Pod Identity is wired up and S3 access works:
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: s3-test
  labels:
    guide: ai-eks-docs
spec:
  serviceAccountName: model-storage-sa
  containers:
    - name: aws-cli
      image: public.ecr.aws/aws-cli/aws-cli:2.27.0
      command:
        - sh
        - -c
        - |
          echo "=== Caller Identity ==="
          aws sts get-caller-identity
          echo ""
          echo "=== S3 Write Test ==="
          echo "pod identity works" | aws s3 cp - s3://${MODEL_BUCKET}/test.txt
          echo ""
          echo "=== S3 List Test ==="
          aws s3 ls s3://${MODEL_BUCKET}/
          echo ""
          echo "=== S3 Delete Test ==="
          aws s3 rm s3://${MODEL_BUCKET}/test.txt
  restartPolicy: Never
EOF
Wait for the Pod to complete and check the logs:
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/s3-test --timeout=300s
kubectl logs s3-test
Expected output:
=== Caller Identity ===
{
"UserId": "AROA...:eks-ai-eks-docs-model-s-...",
"Account": "123456789012",
"Arn": "arn:aws:sts::123456789012:assumed-role/ai-eks-docs-model-storage-role/eks-ai-eks-docs-model-s-..."
}
=== S3 Write Test ===
upload: - to s3://ai-eks-docs-models-01234567/test.txt
=== S3 List Test ===
2026-05-04 12:00:00 19 test.txt
=== S3 Delete Test ===
delete: s3://ai-eks-docs-models-01234567/test.txt
The caller identity confirms that the Pod assumed the ${CLUSTER_NAME}-model-storage-role IAM role via EKS Pod Identity. The S3 commands confirm that the role can write, list, and delete objects in the bucket.
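To check the assumed role without reading the logs by eye, the ARN reported by the caller identity call can be matched against the expected role name. The following is a sketch only. The helper name arn_matches_role is hypothetical, and the grep pattern in the commented usage assumes the log contains exactly one ARN, as in the expected output above.

```shell
# Sketch: succeed only if an STS ARN is an assumed-role session of the given role.
# arn_matches_role is a hypothetical helper name.
arn_matches_role() {
  local arn="$1"
  local role="$2"
  case "$arn" in
    arn:*:sts::*:assumed-role/"$role"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here): extract the ARN from the Pod logs and compare it.
# ARN=$(kubectl logs s3-test | grep -o 'arn:[^"]*')
# if arn_matches_role "$ARN" "${CLUSTER_NAME}-model-storage-role"; then
#   echo "Pod Identity role confirmed"
# else
#   echo "Unexpected caller identity: $ARN" >&2
# fi
```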
Clean up the test Pod:
kubectl delete pod s3-test
Next steps
With your cluster ready, you can proceed to Load & Serve Model to deploy a large language model and interact with the inference endpoint.
Cleanup
Tip
If you plan to continue with the next sections of this guide, skip the full cleanup. Only run it when you are done.
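Set the environment variables again if you are working in a new shell session:
export CLUSTER_NAME=ai-eks-docs
export AWS_REGION=us-east-2
Delete the test workloads created earlier in this guide:
kubectl delete pod nvidia-smi --ignore-not-found
kubectl delete deployment gpu-overflow-test --ignore-not-found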
If you created an ODCR, cancel it first:
INSTANCE_TYPE="g6e.4xlarge"
CAPACITY_RESERVATION_ID=$(aws ec2 describe-capacity-reservations \
  --filters "Name=state,Values=active" "Name=instance-type,Values=${INSTANCE_TYPE}" \
  --query 'CapacityReservations[0].CapacityReservationId' \
  --output text \
  --region ${AWS_REGION})
aws ec2 cancel-capacity-reservation \
  --capacity-reservation-id ${CAPACITY_RESERVATION_ID} \
  --region ${AWS_REGION}
Important
Cancelling a reservation does not terminate running instances. They continue at standard On-Demand rates until terminated. Delete the deployment first to drain the reserved node before cancelling.
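When no active reservation matches, the lookup with --output text yields the literal string None, and passing that to the cancel command produces an error. A small guard can catch this case, sketched here. The helper name is_reservation_id is hypothetical, and the check relies only on Capacity Reservation IDs starting with cr-.

```shell
# Sketch: accept only values that look like a Capacity Reservation ID (cr-...).
# is_reservation_id is a hypothetical helper name.
is_reservation_id() {
  case "$1" in
    cr-?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here): cancel only when the lookup returned a real ID.
# if is_reservation_id "${CAPACITY_RESERVATION_ID}"; then
#   aws ec2 cancel-capacity-reservation \
#     --capacity-reservation-id "${CAPACITY_RESERVATION_ID}" \
#     --region "${AWS_REGION}"
# else
#   echo "No active reservation found for ${INSTANCE_TYPE}; nothing to cancel"
# fi
```

The lookup selects the first active reservation for the instance type in the Region. If the account holds other reservations of the same type, confirm the ID before cancelling.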
Look up the ARN of the IAM policy used by Prometheus and Grafana:
AMP_POLICY_ARN=$(aws iam list-policies \
  --scope Local \
  --query "Policies[?PolicyName=='${CLUSTER_NAME}-amp-grafana-policy'].Arn" \
  --output text)
echo "AMP Policy ARN: ${AMP_POLICY_ARN}"
Look up the AMP workspace ID:
AMP_WORKSPACE_ID=$(aws amp list-workspaces \
  --alias "amp-ws-${CLUSTER_NAME}" \
  --query 'workspaces[0].workspaceId' \
  --output text \
  --region ${AWS_REGION})
echo "AMP Workspace ID: ${AMP_WORKSPACE_ID}"
Uninstall the DCGM exporter Helm release:
helm uninstall dcgm-exporter -n monitoring
Uninstall the kube-prometheus-stack Helm release:
helm uninstall kube-prometheus-stack -n monitoring
Delete the EKS Pod Identity association for the Prometheus ingest service account:
eksctl delete podidentityassociation \
  --cluster ${CLUSTER_NAME} \
  --namespace monitoring \
  --service-account-name amp-iamproxy-ingest-service-account \
  --region ${AWS_REGION}
Delete the EKS Pod Identity association for the Grafana service account:
eksctl delete podidentityassociation \
  --cluster ${CLUSTER_NAME} \
  --namespace monitoring \
  --service-account-name grafana-sa \
  --region ${AWS_REGION}
Delete the IAM policy used by Prometheus and Grafana:
aws iam delete-policy --policy-arn ${AMP_POLICY_ARN}
Delete the AMP workspace:
aws amp delete-workspace --workspace-id ${AMP_WORKSPACE_ID} --region ${AWS_REGION}
Delete the monitoring namespace:
kubectl delete namespace monitoring
Look up the model bucket name:
MODEL_BUCKET=$(aws s3api list-buckets \
  --query "Buckets[?starts_with(Name, '${CLUSTER_NAME}-models-')].Name | [0]" \
  --output text)
echo "Model bucket: ${MODEL_BUCKET}"
Look up the ARN of the model storage IAM policy:
POLICY_ARN=$(aws iam list-policies \
  --scope Local \
  --query "Policies[?PolicyName=='${CLUSTER_NAME}-model-storage-policy'].Arn" \
  --output text)
echo "Policy ARN: ${POLICY_ARN}"
Delete the S3 model bucket and all of its objects:
aws s3 rb s3://${MODEL_BUCKET} --force
Delete the EKS Pod Identity association for the model storage service account:
eksctl delete podidentityassociation \
  --cluster ${CLUSTER_NAME} \
  --namespace default \
  --service-account-name model-storage-sa \
  --region ${AWS_REGION}
Delete the model storage IAM policy:
aws iam delete-policy --policy-arn ${POLICY_ARN}
Delete the Kubernetes ServiceAccount:
kubectl delete serviceaccount model-storage-sa
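Delete the GPU NodePool and its node class (NodeClass on EKS Auto Mode, EC2NodeClass on Karpenter), then delete the cluster:
kubectl delete nodepool gpu-inf --ignore-not-found
kubectl delete nodeclass gpu-inf --ignore-not-found
kubectl delete ec2nodeclass gpu-inf --ignore-not-found
eksctl delete cluster --name=$CLUSTER_NAME --region=$AWS_REGION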