Manage hardware devices on Amazon EKS
Amazon EKS supports two Kubernetes mechanisms for managing specialized hardware devices in EKS clusters: Dynamic Resource Allocation (DRA) and device plugins. Both mechanisms enable workloads to access hardware accelerators such as NVIDIA GPUs and AWS Trainium chips, and high-performance network devices such as Elastic Fabric Adapter (EFA). For new deployments on Kubernetes version 1.34 and later that use EKS managed node groups or self-managed nodes, DRA drivers are recommended because DRA provides richer device selection, topology-aware scheduling, and device sharing capabilities that are not possible with device plugins.
For more information, see Dynamic Resource Allocation in the Kubernetes documentation.
Dynamic Resource Allocation vs device plugins
Kubernetes device plugins have been the primary mechanism for exposing specialized hardware to Kubernetes workloads. Device plugins advertise devices as extended resources (for example, nvidia.com/gpu or aws.amazon.com/neuroncore) that you request in container resource requests and limits. While device plugins are widely supported and used, they have limitations:
- Devices are requested as opaque integer counts with no attribute-based filtering.
- No support for device sharing between containers or Pods.
- No expressive topology-aware allocation across device types.
- Custom scheduler extensions are often required for intelligent placement.
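As an illustration of the count-based model described above, the following minimal sketch shows a Pod that requests one GPU through the nvidia.com/gpu extended resource. The Pod name, container name, and image are hypothetical placeholders, not values from this guide.

```yaml
# Sketch only: names and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-device-plugin-example
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: registry.example.com/my-app:latest  # placeholder image
    resources:
      limits:
        # Extended resources are requested as whole-number counts.
        # The request carries no information about the GPU model,
        # memory size, or topology; any advertised GPU satisfies it.
        nvidia.com/gpu: 1
```

The request can only say how many devices the container needs, which is the first limitation in the list above.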
Dynamic Resource Allocation (DRA) is a Kubernetes feature made generally available in Kubernetes version 1.34 that addresses these limitations. With DRA, device drivers publish rich device attributes to the Kubernetes scheduler through ResourceSlice objects. You request devices using ResourceClaim and ResourceClaimTemplate objects that reference DeviceClass categories.
DRA enables:
- Attribute-based device selection using Common Expression Language (CEL) expressions.
- Topology-aware allocation that ensures devices are co-located on the same PCIe switch or NUMA domain.
- Device sharing between multiple containers or Pods through shared ResourceClaim references.
- Constraint-based scheduling that aligns different device types.
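The following is a minimal sketch of how attribute-based selection and a cross-device constraint might look in a ResourceClaimTemplate. It assumes the resource.k8s.io/v1 API that ships with Kubernetes 1.34. The DeviceClass names (gpu.example.com, nic.example.com), the attribute domain, and the productName attribute are hypothetical; the real names depend on the DRA driver you install. The constraint assumes both drivers publish the standard resource.kubernetes.io/pcieRoot attribute.

```yaml
# Sketch only: DeviceClass names and attribute names are hypothetical.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-with-local-nic
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com      # hypothetical DeviceClass
          allocationMode: ExactCount
          count: 1
          selectors:
          - cel:
              # Select by a driver-published attribute instead of a bare count.
              expression: device.attributes["gpu.example.com"].productName == "example-gpu"
      - name: nic
        exactly:
          deviceClassName: nic.example.com      # hypothetical DeviceClass
      constraints:
      # Require both allocated devices to report the same value for this
      # attribute, so the GPU and NIC share a PCIe root.
      - requests: ["gpu", "nic"]
        matchAttribute: resource.kubernetes.io/pcieRoot
```

The selector narrows which devices qualify for a request, while the constraint relates the devices chosen for different requests to each other.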
DRA drivers for Amazon EKS
The following DRA drivers are commonly used for managing specialized hardware devices in Amazon EKS clusters.
- EFA DRA driver
  The EFA DRA driver (DRANET) manages Elastic Fabric Adapter (EFA) device allocation with topology-aware scheduling that pairs EFA interfaces with their topologically-local GPUs or Neuron devices, and supports device sharing between Pods. For more information, see Manage EFA devices on Amazon EKS.
- Neuron DRA driver
  The Neuron DRA driver manages AWS Trainium and AWS Inferentia2 device allocation with topology-aware scheduling, connected device subset allocation, and Logical NeuronCore (LNC) configuration, without requiring custom scheduler extensions.
- NVIDIA DRA driver
  The NVIDIA DRA driver for GPUs enables flexible allocation and dynamic reconfiguration of NVIDIA GPUs, including support for ComputeDomain resources for Multi-Node NVLink (MNNVL) workloads on EC2 Grace-Blackwell instances. For more information on using ComputeDomains with EC2 Grace-Blackwell instances, see Use P6e-GB200 UltraServers with Amazon EKS.
Device plugins for Amazon EKS
The following device plugins are commonly used for managing specialized hardware devices in Amazon EKS clusters.
- EFA device plugin
  The EFA device plugin discovers all available EFA devices on each node and advertises EFA devices as vpc.amazonaws.com/efa extended resources.
- Neuron device plugin
  The Neuron device plugin exposes Neuron hardware as aws.amazon.com/neuroncore and aws.amazon.com/neuron extended resources. It discovers available Neuron devices on each node, advertises them as allocatable resources, and manages their lifecycle.
- NVIDIA device plugin
  The NVIDIA device plugin advertises NVIDIA GPUs as nvidia.com/gpu extended resources and tracks the health of GPUs.
Considerations
Before using DRA drivers on Amazon EKS, review the following considerations:
- DRA is available on Amazon EKS with Kubernetes version 1.33 and above, but it is recommended for Kubernetes versions 1.34 and later due to an upstream Kubernetes issue. Your cluster control plane and nodes must be running a Kubernetes version that supports DRA.
- DRA is not currently compatible with Karpenter or EKS Auto Mode provisioned compute. You must use EKS managed node groups or self-managed nodes with DRA drivers.
- DRA drivers and device plugins for the same device type must not run simultaneously on the same node. Uninstall the device plugin before installing the corresponding DRA driver, or deploy them on separate nodes. See upstream Kubernetes KEP-5004 for updates on DRA driver and device plugin compatibility.
- DRA uses different Kubernetes API resources (ResourceClaim, ResourceClaimTemplate, DeviceClass) than device plugins, which rely on the container resources.limits and resources.requests fields. Migrating from device plugins to DRA requires updating your workload specifications.
- Device plugins remain fully supported for all Kubernetes versions. If your cluster runs a Kubernetes version earlier than 1.34, or if you use Karpenter or EKS Auto Mode, continue using device plugins. The NVIDIA DRA driver is not supported on Bottlerocket; use the NVIDIA device plugin on Bottlerocket nodes. The EFA and Neuron DRA drivers are supported on Bottlerocket.
DRA ResourceClaim vs ResourceClaimTemplate
When using DRA, you request devices through ResourceClaim or ResourceClaimTemplate objects. These two resource types serve different purposes and have different lifecycle behaviors.
- ResourceClaim
  A ResourceClaim is a named Kubernetes object that you create independently of any Pod. You reference it in a Pod specification by name using the resourceClaimName field. A ResourceClaim has the following characteristics:
  - It must exist in the cluster before any Pod that references it is created. If the claim does not exist, the Pod remains in a pending state.
  - It persists until you explicitly delete it, regardless of whether any Pods reference it.
  - Multiple Pods can reference the same ResourceClaim, which enables device sharing. All Pods that reference the same claim share access to the same allocated devices and are scheduled to the same node.

  Use a ResourceClaim when you need multiple Pods to share access to the same devices, or when you need a claim to exist beyond the lifetime of a single Pod.
- ResourceClaimTemplate
  A ResourceClaimTemplate defines a template that Kubernetes uses to automatically generate a unique ResourceClaim for each Pod. You reference it in a Pod specification using the resourceClaimTemplateName field. The ResourceClaimTemplate itself is not bound to any Pod; it is a reusable template that persists independently. A ResourceClaimTemplate has the following characteristics:
  - Kubernetes creates a new ResourceClaim for each Pod that references the template. Each Pod gets its own separate set of devices.
  - Each generated ResourceClaim is bound to the lifecycle of the Pod that triggered its creation. When the Pod is deleted, the associated generated ResourceClaim is also deleted. The ResourceClaimTemplate itself is not affected and continues to generate new claims for future Pods.

  Use a ResourceClaimTemplate when each Pod in a workload needs its own dedicated devices with similar configurations. For example, use a ResourceClaimTemplate for Pods in a Job that uses parallel execution where each Pod needs its own GPU or EFA devices.
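The sharing behavior of a ResourceClaim can be sketched as follows: one claim is created first, and two Pods in the same namespace reference it by name. This is a minimal sketch that assumes the resource.k8s.io/v1 API from Kubernetes 1.34. The DeviceClass name device.example.com, the object names, and the image are hypothetical placeholders.

```yaml
# Sketch only: DeviceClass, object names, and image are hypothetical.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-device
spec:
  devices:
    requests:
    - name: device
      exactly:
        deviceClassName: device.example.com   # hypothetical DeviceClass
---
apiVersion: v1
kind: Pod
metadata:
  name: consumer-a
spec:
  resourceClaims:
  - name: shared
    resourceClaimName: shared-device   # references the existing claim by name
  containers:
  - name: app
    image: registry.example.com/my-app:latest  # placeholder image
    resources:
      claims:
      - name: shared   # matches the entry under spec.resourceClaims
---
apiVersion: v1
kind: Pod
metadata:
  name: consumer-b
spec:
  resourceClaims:
  - name: shared
    resourceClaimName: shared-device   # same claim, so the same devices
  containers:
  - name: app
    image: registry.example.com/my-app:latest  # placeholder image
    resources:
      claims:
      - name: shared
```

Because both Pods name the same claim, deleting either Pod leaves the claim and its allocation in place until you delete the claim itself.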
The following table summarizes the differences between ResourceClaim and ResourceClaimTemplate.
| Behavior | ResourceClaim | ResourceClaimTemplate |
|---|---|---|
| Creation | You create it manually before Pods reference it. | Kubernetes generates a claim automatically per Pod. |
| Lifecycle | Persists until you delete it. | The template persists until you delete it. Each generated ResourceClaim is deleted when its Pod is deleted. |
| Device sharing across Pods | Supported. Multiple Pods can reference the same claim. | Not supported. Each Pod gets a separate claim. |
| Pod specification field | resourceClaimName | resourceClaimTemplateName |
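For comparison, the per-Pod behavior of a ResourceClaimTemplate might look like the following sketch, in which a Job runs two Pods in parallel and each Pod receives its own generated claim. It assumes the resource.k8s.io/v1 API from Kubernetes 1.34. The DeviceClass name device.example.com, the object names, and the image are hypothetical placeholders.

```yaml
# Sketch only: DeviceClass, object names, and image are hypothetical.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: per-pod-device
spec:
  spec:
    devices:
      requests:
      - name: device
        exactly:
          deviceClassName: device.example.com   # hypothetical DeviceClass
---
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-workers
spec:
  parallelism: 2
  completions: 2
  template:
    spec:
      restartPolicy: Never
      resourceClaims:
      - name: device
        # Kubernetes generates one ResourceClaim per Pod from this template
        # and deletes it when that Pod is deleted.
        resourceClaimTemplateName: per-pod-device
      containers:
      - name: worker
        image: registry.example.com/my-app:latest  # placeholder image
        resources:
          claims:
          - name: device   # matches the entry under resourceClaims
```

The only difference in the Pod specification compared with a shared claim is the field used under resourceClaims: resourceClaimTemplateName instead of resourceClaimName.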
For examples of using ResourceClaim objects to share EFA devices between Pods, see Share EFA devices between multiple Pods. For examples of using ResourceClaimTemplate objects with topology-aware allocation, see Topology-aware EFA and GPU/Neuron device allocation.