


# Load & Serve Models on Amazon EKS
<a name="ml-inference-load-serve-model"></a>


The steps in this section deploy a large language model (LLM) on Amazon EKS, serve it with vLLM, and interact with the inference endpoint.

The walkthrough uses the following tools:
+  [vLLM](https://docs.vllm.ai/en/latest/) — A high-throughput inference engine optimized for LLM serving and GPU memory management.
+  [Run:ai Model Streamer](https://github.com/run-ai/runai-model-streamer) — Streams model weights directly from Amazon S3 to GPU memory, reducing load time from minutes to seconds.
+  [Open WebUI](https://openwebui.com/) — A self-hosted chat frontend that connects to vLLM’s OpenAI-compatible API.

This section uses the [Ministral-3-8B-Instruct-2512 model](https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512), but you can deploy any AI model that vLLM supports. For a list of supported models, see [Supported models](https://docs.vllm.ai/en/latest/models/supported_models/#text-generation) in the vLLM documentation.

**Important**  
Use the cluster you created in the [Set up Amazon EKS cluster for AI/ML workloads](ml-cluster-setup.md) section. The instructions in this walkthrough work for both EKS Auto Mode and self-managed Karpenter.

![Architecture diagram showing LLM inference workflow with vLLM on Amazon EKS](http://docs.aws.amazon.com/eks/latest/userguide/images/ml-inference-load-serve-model-arch.png)


The architecture diagram shows the end-to-end flow:

1. Model weights are downloaded from Hugging Face to Amazon S3.

1. vLLM streams the model directly from S3 to GPU memory using Run:ai Model Streamer.

1. Users send inference requests to the vLLM endpoint.

When you complete these steps, you have a vLLM inference endpoint that you can use to interact with a Ministral model through a chat frontend application.

## Prerequisites
<a name="_prerequisites"></a>

Complete the steps in the [Cluster setup section](ml-cluster-setup.md).

If you opened a new terminal, set the cluster name and region you used in the [Cluster Setup via CLI](ml-cluster-setup-cli.md) section:

```
export CLUSTER_NAME=ai-eks-docs
export AWS_REGION=us-east-2
```

Look up the model weights bucket you created in the [Model weights S3 bucket](ml-cluster-setup-cli.md#cluster-setup-cli-model-bucket) step:

```
MODEL_BUCKET=$(aws s3api list-buckets \
  --query "Buckets[?starts_with(Name, '${CLUSTER_NAME}-models-')].Name | [0]" \
  --output text)
echo "Model bucket: ${MODEL_BUCKET}"
```
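If no bucket name starts with the expected prefix, the query result is empty and the AWS CLI prints the literal text `None` in text output mode. Later manifests would then reference a bucket named `None`. The following is a minimal sketch of a guard against that case. The function name `check_model_bucket` is a hypothetical helper and is not part of the cluster setup steps.

```shell
# Hypothetical helper: reject an empty lookup result before it reaches a manifest.
check_model_bucket() {
  # $1 = value returned by the bucket lookup
  if [ -z "$1" ] || [ "$1" = "None" ]; then
    echo "Model bucket not found. Check CLUSTER_NAME and the cluster setup steps." >&2
    return 1
  fi
  echo "Model bucket OK: $1"
}
```

For example, run `check_model_bucket "${MODEL_BUCKET}"` after the lookup and continue only if it succeeds.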

## Step 1: Download the model from Hugging Face
<a name="_step_1_download_the_model_from_hugging_face"></a>

In this step, you deploy a Kubernetes Job that downloads the model from Hugging Face and uploads it to the S3 bucket that you created in the prerequisites section.

To download the model, apply the following Job manifest:

### Model download Job manifest
<a name="ml-inference-step1-model-download-yaml"></a>

```
cat << EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
  namespace: default
  labels:
    guide: ai-eks-docs
spec:
  backoffLimit: 10
  activeDeadlineSeconds: 3600
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: model-storage-sa
      containers:
      - name: downloader
        image: python:3.11-slim
        command: ["/bin/bash", "-c"]
        args:
        - |
          set -e
          pip install -q huggingface_hub boto3
          echo "Downloading Ministral-3-8B-Instruct-2512 from Hugging Face..."
          python3 -c "from huggingface_hub import snapshot_download; snapshot_download('mistralai/Ministral-3-8B-Instruct-2512', local_dir='/tmp/mistral', allow_patterns=['*.json', '*.txt', '*.md', 'consolidated.safetensors'], ignore_patterns=['model-*.safetensors', 'model.safetensors.index.json'])"
          echo "Uploading to S3 bucket: \${MODEL_BUCKET}"
          python3 << 'PYTHON'
          import boto3
          import os
          from pathlib import Path

          s3 = boto3.client('s3')
          bucket = os.environ.get('MODEL_BUCKET')
          local_dir = Path("/tmp/mistral")

          for file_path in local_dir.rglob("*"):
              if file_path.is_file():
                  if '.cache' in file_path.parts:
                      continue
                  s3_key = f"Ministral-3-8B-Instruct-2512/{file_path.relative_to(local_dir)}"
                  print(f"Uploading {file_path.name}...")
                  s3.upload_file(str(file_path), bucket, s3_key)
          print("Upload complete!")
          PYTHON
        env:
        - name: MODEL_BUCKET
          value: "${MODEL_BUCKET}"
        - name: HF_HUB_DISABLE_XET
          value: "1"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
EOF
```

Wait for the Job to complete. The model weights (consolidated.safetensors) are approximately 10.4 GB, and this step typically takes 3-5 minutes.

```
kubectl wait --for=condition=complete job/model-download --timeout=600s
```

Expected output:

```
job.batch/model-download condition met
```

Verify that the model weights were uploaded to S3:

```
aws s3 ls s3://$(kubectl get job model-download -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="MODEL_BUCKET")].value}')/Ministral-3-8B-Instruct-2512/ --recursive
```

Expected output:

```
2026-05-18 10:29:53      20311 Ministral-3-8B-Instruct-2512/README.md
2026-05-18 10:29:53       2361 Ministral-3-8B-Instruct-2512/SYSTEM_PROMPT.txt
2026-05-18 10:29:53       1903 Ministral-3-8B-Instruct-2512/config.json
2026-05-18 10:29:54 10420633176 Ministral-3-8B-Instruct-2512/consolidated.safetensors
2026-05-18 10:29:53        131 Ministral-3-8B-Instruct-2512/generation_config.json
2026-05-18 10:29:53       1185 Ministral-3-8B-Instruct-2512/params.json
2026-05-18 10:29:53        976 Ministral-3-8B-Instruct-2512/processor_config.json
2026-05-18 10:29:53   16753777 Ministral-3-8B-Instruct-2512/tekken.json
2026-05-18 10:29:53   17077402 Ministral-3-8B-Instruct-2512/tokenizer.json
2026-05-18 10:29:53      21168 Ministral-3-8B-Instruct-2512/tokenizer_config.json
```

The `consolidated.safetensors` file contains the model weights. The remaining files are configuration and tokenizer files that vLLM requires to serve the model.
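To confirm the total upload size without adding the rows by hand, you can total the size column of the listing. The following is a sketch that assumes the four-column listing format shown in the expected output (date, time, size in bytes, key). The function name `summarize_listing` and the sample lines are hypothetical.

```shell
# Hypothetical helper: total the size column of `aws s3 ls --recursive` output.
# Assumes each line is: date, time, size in bytes, key.
summarize_listing() {
  awk 'NF >= 4 { bytes += $3; files += 1 }
       END { printf "%d files, %.2f GB\n", files, bytes / 1000000000 }'
}

# Example with two made-up listing lines (not real output).
printf '%s\n' \
  '2026-01-01 00:00:00 1500000000 example/a.bin' \
  '2026-01-01 00:00:00  500000000 example/b.bin' | summarize_listing
# prints: 2 files, 2.00 GB
```

Piping the `aws s3 ls` command from this step into `summarize_listing` should report 10 files totaling approximately 10.45 GB for the listing shown above.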

## Step 2: Deploy the inference container
<a name="_step_2_deploy_the_inference_container"></a>

In this section, you deploy vLLM as a Kubernetes Deployment to serve the model you uploaded to Amazon S3.

This section uses [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/tree/master) (DLCs), which are Docker images preinstalled with deep learning frameworks and optimized for performance on AWS infrastructure. DLCs include security patches, validated framework versions, and optimized GPU driver configurations.

This deployment uses the following AWS DLC for [vLLM 0.21.0](https://gallery.ecr.aws/deep-learning-containers/vllm) with SOCI support:

```
public.ecr.aws/deep-learning-containers/vllm:0.21.0-gpu-py312-cu130-ubuntu22.04-ec2-v1.0-soci
```

The image tag indicates vLLM 0.21.0 with GPU support, Python 3.12, CUDA 13.0, Ubuntu 22.04, optimized for EC2-based workloads, and SOCI-enabled for faster container startup.
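If you switch to a different DLC image, it can help to split the tag into its parts and compare them against the description above. The following sketch assumes the tag keeps the same eight hyphen-separated fields as the tag used in this walkthrough. The function name `parse_dlc_tag` and the field labels are hypothetical, and the tag layout is not a documented contract.

```shell
# Hypothetical helper: split a DLC image reference into labeled tag fields.
# Assumes eight hyphen-separated fields, as in the tag used in this walkthrough.
parse_dlc_tag() {
  tag="${1##*:}"            # keep only the part after the last colon
  old_ifs="$IFS"; IFS='-'
  set -- $tag               # split the tag on hyphens into $1..$8
  IFS="$old_ifs"
  echo "vllm=$1 device=$2 python=$3 cuda=$4 os=$5 platform=$6 release=$7 feature=$8"
}

parse_dlc_tag "public.ecr.aws/deep-learning-containers/vllm:0.21.0-gpu-py312-cu130-ubuntu22.04-ec2-v1.0-soci"
# prints: vllm=0.21.0 device=gpu python=py312 cuda=cu130 os=ubuntu22.04 platform=ec2 release=v1.0 feature=soci
```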

This manifest creates a Deployment that runs vLLM on a GPU node and streams the model directly from S3 into GPU memory using Run:ai Model Streamer. The manifest also creates a ClusterIP Service that exposes the vLLM endpoint on port 8000 for in-cluster access.

Apply the manifest:

### vLLM Deployment and Service YAML
<a name="ml-inference-step2-vllm-deployment-yaml"></a>

```
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference-app
  labels:
    guide: ai-eks-docs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference-app
  template:
    metadata:
      labels:
        app: vllm-inference-app
        guide: ai-eks-docs
    spec:
      serviceAccountName: model-storage-sa
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        karpenter.sh/nodepool: gpu-inf
      containers:
      - name: vllm-inference
        image: public.ecr.aws/deep-learning-containers/vllm:0.21.0-gpu-py312-cu130-ubuntu22.04-ec2-v1.0-soci
        ports:
        - containerPort: 8000
        args:
        - "--model=s3://${MODEL_BUCKET}/Ministral-3-8B-Instruct-2512/"
        - "--host=0.0.0.0"
        - "--port=8000"
        - "--tensor-parallel-size=1"
        - "--gpu-memory-utilization=0.9"
        - "--max-model-len=8192"
        - "--max-num-seqs=1"
        - "--load-format=runai_streamer"
        - "--enforce-eager"
        - "--tokenizer_mode=mistral"
        - "--config_format=mistral"
        - "--enable-auto-tool-choice"
        - "--tool-call-parser=mistral"
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            memory: "40Gi"
            cpu: "8"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference-svc
  namespace: default
  labels:
    app: vllm-inference-app
spec:
  selector:
    app: vllm-inference-app
  ports:
  - name: http
    port: 8000
    targetPort: 8000
    protocol: TCP
EOF
```

Check that the vLLM pod is in the Ready state:

```
kubectl get pod -l app=vllm-inference-app -w
```

Expected output:

```
NAME                                  READY   STATUS    RESTARTS   AGE
vllm-inference-app-65df5fddc8-5kmjm   1/1     Running   0          86s
```

It can take approximately 2 minutes for the GPU node to start, the container image to pull, and vLLM to stream the model weights from S3 into GPU memory. Wait until the pod shows `1/1` in the READY column, and then press Ctrl+C to stop the watch before you proceed.

The combination of EKS, SOCI, and Run:ai Model Streamer enables fast pod startup. To check the startup time for each stage, view the pod events:

```
kubectl describe pod -l app=vllm-inference-app | grep -A 20 "Events:"
```

Expected output:

```
Events:
  Type     Reason            Age   From                   Message
  ----     ------            ----  ----                   -------
  Warning  FailedScheduling  86s   default-scheduler      0/2 nodes are available: 2 node(s) had untolerated taint(s).
  Normal   Nominated         85s   eks-auto-mode/compute  Pod should schedule on: nodeclaim/gpu-inf-kqkq6
  Normal   Scheduled         55s   default-scheduler      Successfully assigned default/vllm-inference-app-d9d54586d-csmd7 to i-04f8792414384d2d3
  Normal   Pulling           52s   kubelet                Pulling image "public.ecr.aws/deep-learning-containers/vllm:0.21.0-gpu-py312-cu130-ubuntu22.04-ec2-v1.0-soci"
  Normal   Pulled            4s    kubelet                Successfully pulled image "public.ecr.aws/deep-learning-containers/vllm:0.21.0-gpu-py312-cu130-ubuntu22.04-ec2-v1.0-soci" in 48.376s (48.376s including waiting). Image size: 8802823997 bytes.
  Normal   Created           4s    kubelet                Created container vllm-inference
  Normal   Started           4s    kubelet                Started container vllm-inference
```

In this example, the GPU node was provisioned in approximately 30 seconds (the gap between the `Nominated` and `Scheduled` events), and the 8.8 GB container image was pulled in approximately 48 seconds using SOCI. Fast image pulls reduce cold start times for large inference containers, which lets you scale GPU pods on demand instead of over-provisioning idle GPU capacity.

Next, check the vLLM logs to verify the model loading time:

```
kubectl logs $(kubectl get pod -l app=vllm-inference-app -o jsonpath='{.items[0].metadata.name}') | grep -i 'Model loading took'
```

Expected output:

```
INFO 05-18 18:41:49 [gpu_model_runner.py:4959] Model loading took 9.81 GiB memory and 5.023344 seconds
```

The log confirms that Run:ai Model Streamer loaded the 10.4 GB model weights directly from S3 into GPU memory in approximately 5 seconds, consuming 9.8 GiB of GPU memory.

The timings in this example were measured on a g6e.4xlarge instance, which has 20 Gbps of sustained network bandwidth. Image pull and model loading times vary on other instance types depending on the available network bandwidth.
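As a rough sanity check, you can compare the observed model loading time with the lower bound that network bandwidth alone allows. The following sketch divides the payload size in bits by the link speed. It ignores protocol overhead and S3 request latency, so real transfers take longer. The function name `min_transfer_seconds` is hypothetical.

```shell
# Hypothetical helper: lower bound on transfer time from bandwidth alone.
# $1 = payload size in bytes, $2 = network bandwidth in Gbps
min_transfer_seconds() {
  awk -v bytes="$1" -v gbps="$2" \
    'BEGIN { printf "%.2f\n", (bytes * 8) / (gbps * 1000000000) }'
}

# Model weights from Step 1 (10,420,633,176 bytes) over a 20 Gbps link.
min_transfer_seconds 10420633176 20
# prints: 4.17
```

The result of about 4.2 seconds is consistent with the approximately 5 seconds reported in the vLLM log, which suggests that the model load in this example ran close to the available network bandwidth.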

## Step 3: Run inference
<a name="_step_3_run_inference"></a>

With the vLLM Deployment running, validate the inference endpoint. You deploy a chat frontend to interact with the model in Step 5.

### Run a model validation test
<a name="_run_a_model_validation_test"></a>

Expose the inference endpoint via port forward:

```
kubectl port-forward svc/vllm-inference-svc 8000:8000
```

Open a new terminal window, and then validate that the inference container is responding:

```
curl -si http://localhost:8000/health
```

Expected output:

```
HTTP/1.1 200 OK
date: Fri, 18 May 2026 00:39:23 GMT
server: uvicorn
content-length: 0
```
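If you script this check, for example right after starting the port-forward, the endpoint might not respond on the first attempt. The following is a minimal polling sketch that retries the `/health` endpoint until it returns a success status. The function name `wait_for_health` and its default retry values are hypothetical.

```shell
# Hypothetical helper: poll a health URL until it returns a 2xx status.
# $1 = URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 5)
wait_for_health() {
  url="$1"; attempts="${2:-30}"; delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503.
    if curl -sf -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_health http://localhost:8000/health 12 5` waits up to about one minute before giving up.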

## Step 4: Monitor vLLM
<a name="ml-inference-load-serve-model-monitoring"></a>

vLLM exposes Prometheus metrics out of the box, including request rate, token throughput, end-to-end latency, and GPU KV cache utilization. In this section, you use these metrics with the monitoring stack you set up in the [Cluster setup](ml-cluster-setup.md) steps and view them on a pre-provisioned Grafana dashboard.

**Important**  
You must complete the [Monitoring](ml-cluster-setup-cli.md#cluster-setup-cli-monitoring) subsection of the [Cluster Setup via CLI](ml-cluster-setup-cli.md) section before continuing. This step depends on the kube-prometheus-stack being installed and the vLLM Grafana dashboard already provisioned in the values file.

### Apply the vLLM ServiceMonitor
<a name="ml-inference-load-serve-model-monitoring-servicemonitor"></a>

A ServiceMonitor tells Prometheus where to scrape vLLM metrics.

```
cat << EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-inference-app
  namespace: default
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: vllm-inference-app
  endpoints:
  - port: http
    path: /metrics
    interval: 15s
EOF
```

Verify that the ServiceMonitor was created:

```
kubectl get servicemonitor vllm-inference-app
```

Expected output:

```
NAME                  AGE
vllm-inference-app    5s
```

### Generate inference traffic
<a name="ml-inference-load-serve-model-monitoring-traffic"></a>

To populate the dashboard with metrics, generate inference traffic against the vLLM endpoint you already exposed via port-forward in the validation step.

Discover the served model name:

```
MODEL_NAME=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
echo "Using model: $MODEL_NAME"
```

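Send 50 chat completion requests in parallel. Because the Deployment sets `--max-num-seqs=1`, vLLM queues the requests and processes them one at a time: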

```
for i in $(seq 1 50); do
  curl -s -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a short poem about Kubernetes.\"}], \"max_tokens\": 128}" \
    > /dev/null &
done
wait
```
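The loop above discards every response, so failed requests go unnoticed. The following variant is a sketch that records the HTTP status code of each request and prints a tally. It assumes `curl`, `seq`, and `awk` are available, and the function name `send_chat_requests` is hypothetical.

```shell
# Hypothetical helper: send N parallel chat requests and tally the status codes.
# $1 = base URL, $2 = model name, $3 = number of requests
send_chat_requests() {
  base_url="$1"; model="$2"; count="$3"
  body="{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a short poem about Kubernetes.\"}], \"max_tokens\": 128}"
  for i in $(seq 1 "$count"); do
    # Print only the status code; the response body is discarded.
    curl -s -o /dev/null -w '%{http_code}\n' -X POST "$base_url/v1/chat/completions" \
      -H "Content-Type: application/json" -d "$body" &
  done | awk '{ n[$1]++ } END { for (c in n) printf "HTTP %s: %d\n", c, n[c] }'
}
```

For example, `send_chat_requests http://localhost:8000 "$MODEL_NAME" 50` prints one line per distinct status code. The pipeline finishes only after every background request has completed, so no separate `wait` is needed.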

While traffic is flowing (or immediately after), check token-throughput metrics directly from the vLLM `/metrics` endpoint:

```
curl -s http://localhost:8000/metrics | grep -E '^vllm:(prompt_tokens_total|generation_tokens_total|avg_generation_throughput_toks_per_s|avg_prompt_throughput_toks_per_s)' | head
```

The `vllm:prompt_tokens_total` and `vllm:generation_tokens_total` metrics are monotonically increasing counters of input and output tokens served. The `vllm:avg_prompt_throughput_toks_per_s` and `vllm:avg_generation_throughput_toks_per_s` metrics are rolling-average throughput gauges. These same metrics power the Grafana dashboard you open in the following subsection.
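Because the token totals are counters, you can derive throughput over any interval by sampling a counter twice and dividing the difference by the elapsed time. The following sketch shows that calculation. It assumes metric lines in Prometheus text format without trailing timestamps. The function names `metric_value` and `token_rate` and the sample data are hypothetical.

```shell
# Hypothetical helper: read Prometheus text on stdin and print the summed value
# of all series for one metric name. Labels in braces are ignored.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }
    { split($1, parts, "{") }
    parts[1] == name { sum += $NF; found = 1 }
    END { if (found) printf "%.0f\n", sum; else exit 1 }'
}

# Hypothetical helper: tokens per second between two counter samples.
# $1 = first sample, $2 = second sample, $3 = seconds between samples
token_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s }'
}

# Example with made-up data (not real vLLM output).
sample='# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="example"} 6400.0'
printf '%s\n' "$sample" | metric_value vllm:generation_tokens_total   # prints: 6400
token_rate 1000 7400 10                                               # prints: 640.0
```

Against the live endpoint, you would pipe `curl -s http://localhost:8000/metrics` into `metric_value vllm:generation_tokens_total` twice, a known number of seconds apart, and pass both values to `token_rate`.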

### View the vLLM Grafana dashboard
<a name="ml-inference-load-serve-model-monitoring-dashboard"></a>

The kube-prometheus-stack values file from the [Monitoring](ml-cluster-setup-cli.md#cluster-setup-cli-monitoring) section already provisions the community [vLLM dashboard (gnetId 25263)](https://grafana.com/grafana/dashboards/25263-vllm-metrics/) under the **GPU Monitoring** folder, so no extra import is needed.

To access Grafana, start a port-forward to the Grafana service:

```
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
```

Open [http://localhost:3000](http://localhost:3000) in your browser and navigate to **Dashboards > GPU Monitoring > vLLM Metrics**.

 **vLLM Grafana dashboard** 

![vLLM Grafana dashboard showing request rate, token throughput, end-to-end latency, and GPU KV cache utilization](http://docs.aws.amazon.com/eks/latest/userguide/images/ml-inference-load-serve-model-vllm-monitoring.png)


The dashboard displays request rate, prompt and generation token throughput, latency percentiles, and GPU KV cache utilization for the vLLM inference endpoint.

## Step 5: Deploy chat application
<a name="_step_5_deploy_chat_application"></a>

In this step, you deploy Open WebUI as a chat frontend to interact with the model. Open WebUI is an open source, self-hosted AI interface that supports OpenAI-compatible APIs and provides a chat interface with conversation history and markdown rendering. Because vLLM exposes an OpenAI-compatible API, Open WebUI connects to it directly as a backend.

To deploy the Open WebUI application, apply the following manifest:

### Open WebUI Deployment and Service YAML
<a name="ml-inference-step5-open-webui-yaml"></a>

```
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: default
  labels:
    app: open-webui
    guide: ai-eks-docs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
        guide: ai-eks-docs
    spec:
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:v0.9.2
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        env:
        - name: OPENAI_API_BASE_URLS
          value: "http://vllm-inference-svc:8000/v1"
        - name: OPENAI_API_KEY
          value: "dummy"
        - name: WEBUI_AUTH
          value: "False"
        - name: ENABLE_OLLAMA_API
          value: "False"
        - name: ENABLE_EVALUATION_ARENA_MODELS
          value: "False"
        volumeMounts:
        - name: webui-volume
          mountPath: /app/backend/data
      volumes:
      - name: webui-volume
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
  namespace: default
  labels:
    app: open-webui
spec:
  type: ClusterIP
  selector:
    app: open-webui
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
EOF
```

Wait for the Open WebUI pod to be ready:

```
kubectl wait --for=condition=ready pod -l app=open-webui --timeout=300s
```

Expected output:

```
pod/open-webui-6cbfc9867f-jf9w9 condition met
```

To access the application, set up port forwarding and open the application in your browser:

```
kubectl port-forward svc/open-webui 8080:80 &
sleep 5
echo "Open WebUI: http://localhost:8080"
```

Open [http://localhost:8080](http://localhost:8080) in your browser.

The chat interface appears, and you can interact with the Ministral model.

When you finish testing, stop the backgrounded Open WebUI port-forward by running `kill %1` (or run `jobs` to list background jobs and `kill %<jobspec>` for each). Press Ctrl+C in the terminals where the vLLM and Grafana port-forwards are running.

![Screenshot of Open WebUI chat interface showing a conversation with the Ministral model](http://docs.aws.amazon.com/eks/latest/userguide/images/ml-inference-load-serve-model-chatui.png)


## Clean up
<a name="_clean_up"></a>

To remove the workload resources that you created in this section, delete the Open WebUI application, the vLLM inference server, the vLLM ServiceMonitor, and the model-download Job:

```
kubectl delete deployment open-webui
kubectl delete service open-webui
kubectl delete deployment vllm-inference-app
kubectl delete service vllm-inference-svc
kubectl delete servicemonitor vllm-inference-app
kubectl delete job model-download
```

For instructions on removing infrastructure resources such as the cluster, NodePool, and S3 bucket, see [Cluster Setup Cleanup](ml-cluster-setup-cli.md#cluster-setup-cli-cleanup).