Set up Amazon EKS cluster for AI/ML workloads

This section guides you through creating an Amazon EKS cluster that is ready to run inference workloads. The setup covers GPU compute, a monitoring stack, Amazon S3 storage for model weights, and the necessary AWS IAM permissions.

Architecture overview

The setup creates the following infrastructure:

  • EKS cluster with GPU-enabled nodes — A Karpenter-managed NodePool that dynamically provisions G-family GPU instances using Spot capacity with On-Demand fallback.

  • Monitoring stack — Prometheus scrapes cluster, node, and GPU metrics and remote-writes them to Amazon Managed Service for Prometheus (AMP). Grafana provides dashboards for visualization. The NVIDIA DCGM Exporter adds GPU-specific metrics including utilization, memory, temperature, power draw, NVLink bandwidth, and tensor activity.

  • Model weights S3 bucket — An Amazon S3 bucket for storing model weights, with an EKS Pod Identity association that grants workload pods read/write access.
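As a rough illustration of the Pod Identity association in the last item, the following Python sketch builds the two IAM policy documents such a role would typically need: a trust policy for the EKS Pod Identity service principal (`pods.eks.amazonaws.com`) and a permissions policy scoped to a single bucket. The bucket name `example-model-weights` and the exact list of S3 actions are assumptions made for illustration; they are not values taken from this guide.

```python
import json

# Hypothetical bucket name, used only for illustration.
BUCKET = "example-model-weights"


def trust_policy() -> dict:
    """Trust policy that lets the EKS Pod Identity service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "pods.eks.amazonaws.com"},
                "Action": ["sts:AssumeRole", "sts:TagSession"],
            }
        ],
    }


def model_weights_policy(bucket: str) -> dict:
    """Read/write access to one bucket (assumed minimal action set)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object reads and writes apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(trust_policy(), indent=2))
    print(json.dumps(model_weights_policy(BUCKET), indent=2))
```

The printed JSON could be saved to files and supplied to the IAM role you associate with the workload's service account. Adjust the action list if your workloads also need to delete objects.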

Cluster compute options

The guide provides two paths for setting up your cluster. Choose one and follow it consistently through all steps.

  • EKS Auto Mode — A single command provisions an EKS cluster with EKS Auto Mode enabled. All of the required components are provided out-of-the-box including Karpenter-based auto-scaling, the EKS node monitoring agent, fast container pulls with SOCI, and the NVIDIA device plugin.

  • Self-managed Karpenter — You install and configure each component explicitly: Karpenter via eksctl, automatic node repair through its feature gate, the EKS node monitoring agent as an EKS add-on, and the NVIDIA device plugin via Helm. You also create a custom EC2NodeClass that uses the EKS-optimized NVIDIA AL2023 AMIs and configures SOCI.
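For the self-managed Karpenter path, the GPU NodePool might look roughly like the manifest generated below. This is a sketch under stated assumptions, not the guide's actual manifest: it assumes the Karpenter `karpenter.sh/v1` API, and the NodePool name `gpu` and EC2NodeClass name `gpu` are hypothetical. EKS Auto Mode uses its own NodeClass kind and label keys, so the sketch would need adapting for that path.

```python
import json


def gpu_nodepool(name: str = "gpu", node_class: str = "gpu") -> dict:
    """Build a Karpenter NodePool manifest for G-family GPU instances.

    Both names are hypothetical placeholders.
    """
    return {
        "apiVersion": "karpenter.sh/v1",  # assumes Karpenter v1 or later
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        {
                            # Restrict provisioning to G-family instances.
                            "key": "karpenter.k8s.aws/instance-category",
                            "operator": "In",
                            "values": ["g"],
                        },
                        {
                            # Allow both capacity types; Karpenter generally
                            # prefers Spot when both are permitted.
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["spot", "on-demand"],
                        },
                    ],
                }
            }
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(gpu_nodepool(), indent=2))
```

Allowing both `spot` and `on-demand` in a single NodePool is one common way to get Spot capacity with On-Demand fallback. Many setups also add taints or resource limits, which are omitted here.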

What you’ll set up

| Step | Description |
| --- | --- |
| Create cluster | Provision the EKS control plane and cluster-level components needed for GPU workloads. |
| Create dynamically provisioned GPU nodes | Define a dynamic GPU NodePool that provisions G-family GPU instances as workloads are scheduled. |
| Test with a sample pod | Validate the setup end-to-end by running an nvidia-smi pod that triggers Karpenter to provision a GPU-enabled node. |
| Add reserved capacity (optional) | Attach an On-Demand Capacity Reservation (ODCR) to your NodeClass so that nodes use reserved capacity first and fall back to Spot or On-Demand. |
| Install monitoring | Deploy kube-prometheus-stack (Prometheus + Grafana) with remote-write to AMP, plus the NVIDIA DCGM Exporter for GPU metrics. |
| Create model weights bucket | Create an S3 bucket and configure EKS Pod Identity so workload pods can read and write model weights. |
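To make the sample-pod test step more concrete, the sketch below generates a pod manifest that requests one GPU and runs `nvidia-smi` once. The pod name and the container image `nvidia/cuda:12.4.1-base-ubuntu22.04` are assumptions chosen for illustration; the step-by-step instructions may use a different image or additional fields such as tolerations.

```python
import json

# Assumed public CUDA base image; substitute the image your guide specifies.
DEFAULT_IMAGE = "nvidia/cuda:12.4.1-base-ubuntu22.04"


def nvidia_smi_pod(name: str = "nvidia-smi", image: str = DEFAULT_IMAGE) -> dict:
    """Build a one-shot pod that requests a single GPU and runs nvidia-smi."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Run once and exit instead of restarting after completion.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "nvidia-smi",
                    "image": image,
                    "command": ["nvidia-smi"],
                    # The GPU request leaves the pod unschedulable until a
                    # GPU node exists, which prompts Karpenter to add one.
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(nvidia_smi_pod(), indent=2))
```

If the pod stays pending, one thing to check is whether your NodePool taints its GPU nodes, in which case the pod would need a matching toleration.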

Get started

For step-by-step instructions using the AWS CLI, see Set up Amazon EKS cluster for AI/ML workloads using CLIs.