AWS Glue for Ray end of support

Important

AWS Glue for Ray is no longer open to new customers. Existing customers can continue to use the service as normal. For more information, see AWS Glue for Ray end of support.

After careful consideration, we have decided to close AWS Glue for Ray to new customers starting on April 30, 2026. If you would like to use AWS Glue for Ray, sign up before that date. Existing customers can continue to use the service as normal.

AWS continues to invest in security and availability improvements for AWS Glue for Ray. Note that we do not plan to introduce new features to AWS Glue for Ray, other than security and availability enhancements.

As an alternative to AWS Glue for Ray, we recommend using Amazon Elastic Kubernetes Service. Amazon Elastic Kubernetes Service is a fully managed, certified Kubernetes-conformant service that simplifies building, securing, operating, and maintaining Kubernetes clusters on AWS. It is a highly customizable option that relies on the open source KubeRay Operator to deploy and manage Ray clusters on Kubernetes, offering better resource utilization, simplified infrastructure management, and full support for Ray features.

Migrating a Ray job to Amazon Elastic Kubernetes Service

This section provides the steps for migrating from AWS Glue for Ray to Ray on Amazon Elastic Kubernetes Service. The steps apply to two migration scenarios:

Standard migration (x86/amd64): For these use cases, the migration strategy uses the open source Ray container for basic deployments and runs scripts directly on the base container.
ARM64 migration: For these use cases, the migration strategy supports custom container builds for ARM64-specific dependencies and architecture requirements.

Prerequisites for migration

Install the following CLI tools: aws, kubectl, eksctl, and helm, along with Python 3.9+. These tools are required to provision and manage your Ray on EKS environment. eksctl simplifies creating and managing EKS clusters. kubectl is the standard Kubernetes CLI for deploying and troubleshooting workloads on your cluster. helm is used to install and manage KubeRay (the operator that runs Ray on Kubernetes). Python 3.9+ is required for Ray itself and for running job submission scripts locally.
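Before continuing, it can help to confirm that the tools listed above are actually on your PATH. The following is a minimal sketch using only the Python standard library; the function names `missing_tools` and `python_ok` are illustrative and not part of any AWS tooling.

```python
import shutil
import sys

# CLI tools named in the prerequisites above.
REQUIRED_TOOLS = ["aws", "kubectl", "eksctl", "helm"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_ok(minimum=(3, 9)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing CLI tools:", ", ".join(missing))
    if not python_ok():
        print("Python 3.9+ is required, found", sys.version.split()[0])
    if not missing and python_ok():
        print("All prerequisites found.")
```

This only checks that each executable can be located; it does not verify versions or AWS credentials.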

Install eksctl

Follow the instructions in Installation options for Eksctl, or use the instructions below.

For macOS:


brew tap weaveworks/tap
brew install weaveworks/tap/eksctl

For Linux:


curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp

# Move the extracted binary to /usr/local/bin
sudo mv /tmp/eksctl /usr/local/bin

# Test the installation
eksctl version

Install kubectl

Follow the instructions in Set up kubectl and eksctl, or use the instructions below.

For macOS:


brew install kubectl

For Linux:


curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

Install helm

Follow the instructions in Installing Helm, or use the instructions below.

For macOS:


brew install helm

For Linux:


curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Step 1. Build or choose a Docker image for Ray

Option 1: Use the official Ray image (no build required)

This option uses the official Ray Docker image on Docker Hub, such as rayproject/ray:2.4.0-py39, which is maintained by the Ray project.

Note

This image is amd64-only. Use it if your dependencies are compatible with amd64 and you don't need ARM-specific builds.

Option 2: Build and publish your own arm64 Ray 2.4.0 image

This option is useful when you run on Graviton (ARM) nodes, consistent with what AWS Glue for Ray uses internally. You can build a custom image pinned to the same dependency versions as AWS Glue for Ray to reduce compatibility mismatches.

Create a Dockerfile locally:


# Build an ARM64 image
FROM --platform=linux/arm64 python:3.9-slim-bullseye
# Handy tools: wget for KubeRay probes; CA certs; keep image small
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Keep pip/setuptools modern enough for wheels resolution
RUN python -m pip install -U "pip<24" "setuptools<70" wheel

# ---- Install Ray 2.4.0 (ARM64 / Py3.9) and Glue-like dependencies ----
# 1) Download the exact Ray 2.4.0 wheel for aarch64 (no network at runtime)
RUN python -m pip download --only-binary=:all: --no-deps --dest /tmp/wheels ray==2.4.0

# 2) Core libs used in Glue (pin to Glue-era versions)
#    + the dashboard & jobs API dependencies compatible with Ray 2.4.0.
#    (Pins matter: newer major versions break 2.4.0's dashboard.)
RUN python -m pip install --no-cache-dir \
    /tmp/wheels/ray-2.4.0-*.whl \
    "pyarrow==11.0.0" \
    "pandas==1.5.3" \
    "boto3==1.26.133" \
    "botocore==1.29.133" \
    "numpy==1.24.3" \
    "fsspec==2023.4.0" \
    "protobuf<4" \
    # --- dashboard / jobs server deps ---
    "aiohttp==3.8.5" \
    "aiohttp-cors==0.7.0" \
    "yarl<1.10" "multidict<7.0" "frozenlist<1.4" "aiosignal<1.4" "async_timeout<5" \
    "pydantic<2" \
    "opencensus<0.12" \
    "prometheus_client<0.17" \
    # --- needed if using py_modules ---
    "smart_open[s3]==6.4.0"

# Optional: prove Ray & arch at container start
ENV PYTHONUNBUFFERED=1
WORKDIR /app

# KubeRay overrides the start command; this is just a harmless default
CMD ["python","-c","import ray,platform; print('Ray', ray.__version__, 'on', platform.machine())"]


# Set environment variables
export AWS_REGION=us-east-1
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REPO=ray-2-4-arm64
export IMAGE=${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO}:v1

# Create repository and login
aws ecr create-repository --repository-name $REPO >/dev/null 2>&1 || true
aws ecr get-login-password --region $AWS_REGION \
  | docker login --username AWS --password-stdin ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com

# Enable Buildx (for cross-builds on non-ARM hosts)
docker buildx create --name multi --driver docker-container --use 2>/dev/null || true

# Build & push ARM64 image
docker buildx build \
  --platform linux/arm64 \
  -t "$IMAGE" \
  . --push

# Verify the image architecture remotely
aws ecr batch-get-image \
  --repository-name $REPO \
  --image-ids imageTag=v1 \
  --accepted-media-types application/vnd.docker.distribution.manifest.v2+json \
  | jq -r '.images[0].imageManifest' \
  | jq -r 'fromjson.config.digest'
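Once the image is built, one way to confirm that the pins from the Dockerfile took effect is to compare installed versions against the expected ones from inside the container. The following is a minimal sketch using the standard library's `importlib.metadata`; the `EXPECTED` dictionary repeats a few pins from the Dockerfile above, and the function name `find_mismatches` is illustrative.

```python
from importlib import metadata

# A few of the pins from the Dockerfile above.
EXPECTED = {
    "ray": "2.4.0",
    "pyarrow": "11.0.0",
    "pandas": "1.5.3",
    "boto3": "1.26.133",
    "numpy": "1.24.3",
}


def find_mismatches(expected):
    """Return {package: (expected, installed)} for packages that differ or are absent."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(EXPECTED)
    for name, (wanted, installed) in problems.items():
        print(f"{name}: expected {wanted}, found {installed}")
    if not problems:
        print("All pinned versions match.")
```

Run it with the image's own Python interpreter so that it inspects the packages installed in the container rather than those on your workstation.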

Once finished, reference this ARM64 image in your RayCluster spec together with nodeSelector: { kubernetes.io/arch: arm64 }.


spec:
  rayVersion: "2.4.0"
  headGroupSpec:
    template:
      spec:
        containers:
        - name: ray-head
          image: <your ECR image>

Step 2. Convert the AWS Glue for Ray job configuration to Ray on Amazon Elastic Kubernetes Service

AWS Glue for Ray jobs support a set of job arguments that configure workers, dependencies, memory, and logging. When you migrate to Amazon Elastic Kubernetes Service with KubeRay, these arguments must be translated into RayCluster spec fields or Ray Job runtime environment settings.

Job argument mapping

Mapping AWS Glue for Ray arguments to Ray on EKS equivalents
AWS Glue for Ray argument	What it does in AWS Glue for Ray	Ray on Amazon Elastic Kubernetes Service equivalent
`--min-workers`	Minimum number of workers that the job must allocate.	`workerGroupSpecs[].minReplicas` in your RayCluster
`--working-dir`	Distributes a zip (S3 URI) to all nodes.	Use the Ray runtime env: `working_dir` if you submit from local files; `py_modules` for S3 zips, pointing to the S3 artifact
`--s3-py-modules`	Adds Python wheels/dists from S3.	Use the Ray runtime env: `py_modules: ["s3://.../xxx.whl", ...]`
`--pip-install`	Installs additional PyPI packages for the job.	Ray runtime env: `pip: ["pkg==ver", ...]` (Ray Job CLI `--runtime-env-json` or RayJob `runtimeEnvYAML`).
`--object_store_memory_head`	% of memory for the head node's Plasma store.	`headGroupSpec.rayStartParams.object-store-memory` in your RayCluster. Note that the value must be in bytes: AWS Glue uses a percentage, while Ray uses bytes.
`--object_store_memory_worker`	% of memory for the worker nodes' Plasma store.	Same as above, but set `rayStartParams.object-store-memory` (in bytes) in each worker group.
`--object_spilling_config`	Configures Ray object spilling.	`headGroupSpec.rayStartParams.object-spilling-config`
`--logging_configuration`	AWS Glue-managed logs (CloudWatch, S3).	Check pod stdout/stderr with `kubectl -n ray logs <pod-name> --follow`. Check logs from the Ray Dashboard (port-forward to :8265), where you can also see task and job logs.
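The dependency-related rows of the table can be sketched as a small translation helper. The following example assumes that `--pip-install` and `--s3-py-modules` hold comma-separated values and that a zipped `--working-dir` on S3 is passed through `py_modules`, as the table suggests; the function name `glue_args_to_runtime_env` and the wheel path are hypothetical.

```python
import json


def _split_csv(value):
    """Split a comma-separated argument value into a clean list."""
    return [item.strip() for item in value.split(",") if item.strip()]


def glue_args_to_runtime_env(glue_args):
    """Translate a dict of AWS Glue for Ray job arguments into a Ray runtime_env dict.

    Assumption: --pip-install and --s3-py-modules are comma-separated strings.
    """
    runtime_env = {}

    packages = _split_csv(glue_args.get("--pip-install", ""))
    if packages:
        runtime_env["pip"] = packages

    modules = _split_csv(glue_args.get("--s3-py-modules", ""))

    # A zipped working directory on S3 is passed through py_modules.
    working_dir = glue_args.get("--working-dir", "").strip()
    if working_dir:
        modules.append(working_dir)

    if modules:
        runtime_env["py_modules"] = modules
    return runtime_env


# Hypothetical example arguments; replace the bucket and paths with your own.
example = {
    "--pip-install": "pandas==1.5.3, pyarrow==11.0.0",
    "--s3-py-modules": "s3://YOUR-BUCKET/libs/helpers.whl",
}
print(json.dumps(glue_args_to_runtime_env(example)))
```

The printed JSON can be passed to the Ray Job CLI through `--runtime-env-json`. Memory and logging arguments are not covered here because they map to RayCluster fields rather than to the runtime environment.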

Job configuration mapping

Mapping AWS Glue for Ray job configurations to Ray on EKS equivalents
Configuration	What it does in AWS Glue for Ray	Ray on EKS equivalent
Worker type	Sets the type of predefined worker that is allowed when a job runs. Defaults to Z.2X (8 vCPU, 64 GB RAM).	Nodegroup instance type in EKS (for example, r7g.2xlarge ≈ 8 vCPU/64 GB for ARM, r7a.2xlarge for x86).
Maximum number of workers	The number of workers you want AWS Glue to allocate to this job.	Set `workerGroupSpecs[].maxReplicas` to the same number you used in AWS Glue. This is the upper limit for autoscaling. Similarly, set `minReplicas` as the lower limit. You can start with `replicas: 0`, `minReplicas: 0`.
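Because AWS Glue expresses object store memory as a percentage while Ray expects bytes, a small conversion is needed when sizing worker groups. The following is a minimal sketch; it assumes the percentage applies to the memory you give each Ray node (64 GiB here, matching the default worker type above), and the function names are illustrative.

```python
def object_store_bytes(node_memory_gib, percent):
    """Convert a Glue-style percentage of node memory into bytes for Ray."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total_bytes = node_memory_gib * 1024 ** 3
    return int(total_bytes * percent / 100)


def worker_group_fields(max_workers, node_memory_gib, store_percent, min_workers=0):
    """Build the scaling and rayStartParams fields for one worker group."""
    return {
        "replicas": min_workers,
        "minReplicas": min_workers,
        "maxReplicas": max_workers,
        # rayStartParams values are strings in the RayCluster spec.
        "rayStartParams": {
            "object-store-memory": str(object_store_bytes(node_memory_gib, store_percent)),
        },
    }


# Example: up to 5 workers, 64 GiB per node, 30% reserved for the object store.
print(worker_group_fields(max_workers=5, node_memory_gib=64, store_percent=30))
```

If your container memory limit is lower than the node's memory, base the calculation on the container limit instead, since Ray can only use what the pod is allowed.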

Step 3. Set up Amazon Elastic Kubernetes Service

You can either create a new Amazon Elastic Kubernetes Service cluster or reuse an existing one. If you use an existing cluster, skip the cluster creation commands and go straight to adding a node group, setting up IRSA, and installing KubeRay.

Create an Amazon Elastic Kubernetes Service cluster

Note

If you have an existing Amazon Elastic Kubernetes Service cluster, skip the commands that create a new cluster and just add a node group.


# Environment Variables
export AWS_REGION=us-east-1
export CLUSTER=ray-eks
export NS=ray # namespace for your Ray jobs (you can reuse another if you like)

# Create a cluster (OIDC is required for IRSA)
eksctl create cluster \
  --name $CLUSTER \
  --region $AWS_REGION \
  --with-oidc \
  --managed

Add a node group


# ARM/Graviton (matches Glue's typical runtime):
eksctl create nodegroup \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --name arm64-ng \
  --node-type m7g.large \
  --nodes 2 --nodes-min 1 --nodes-max 5 \
  --managed \
  --node-labels "workload=ray"

# x86/amd64 (use if your image is amd64-only):
eksctl create nodegroup \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --name amd64-ng \
  --node-type m5.large \
  --nodes 2 --nodes-min 1 --nodes-max 5 \
  --managed \
  --node-labels "workload=ray"

Note

If you use an existing Amazon Elastic Kubernetes Service cluster, use --with-oidc to enable OIDC when you add a node group.

Create a namespace and an IAM role for Service Accounts (IRSA) for S3

A Kubernetes namespace is a logical grouping of resources (pods, services, roles, and so on). You can create a new namespace or reuse an existing one. You also need to create an IAM policy for S3 that mirrors your AWS Glue job's access. Use the same custom permissions that your AWS Glue job role had (typically S3 read/write on specific buckets). To grant Amazon Elastic Kubernetes Service permissions similar to the AWSGlueServiceRole, create a Service Account (IRSA) bound to this IAM policy. For instructions on setting up this service account, see IAM Roles for Service Accounts.


# Create (or reuse) namespace
kubectl create namespace $NS || true


{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::YOUR-BUCKET",
      "arn:aws:s3:::YOUR-BUCKET/*"
    ]
  }]
}


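# Create the IAM policy from the JSON document above (saved locally as example.json)
aws iam create-policy \
  --policy-name RayS3Policy \
  --policy-document file://example.json || true

# AWS_ACCOUNT is used in the policy ARN below; set it here if you skipped the image build in Step 1
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# Create a service account (IRSA) bound to that policy.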
eksctl create iamserviceaccount \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --namespace $NS \
  --name ray-s3-access \
  --attach-policy-arn arn:aws:iam::${AWS_ACCOUNT}:policy/RayS3Policy \
  --approve \
  --override-existing-serviceaccounts
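If your AWS Glue job accessed several buckets, the policy document can be generated rather than edited by hand. The following is a minimal sketch that reproduces the structure of the JSON policy above for a list of bucket names; the function name `build_s3_policy` is illustrative, and the actions are the same three used in this section.

```python
import json


def build_s3_policy(buckets):
    """Return an IAM policy document granting read/write/list on the given buckets."""
    resources = []
    for bucket in buckets:
        resources.append(f"arn:aws:s3:::{bucket}")    # bucket-level ARN (for ListBucket)
        resources.append(f"arn:aws:s3:::{bucket}/*")  # object-level ARN (for Get/PutObject)
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
            "Resource": resources,
        }],
    }


if __name__ == "__main__":
    # Redirect the output to example.json to use it with `aws iam create-policy`.
    print(json.dumps(build_s3_policy(["YOUR-BUCKET"]), indent=2))
```

Narrow the actions or add prefixes to the object-level ARNs if your job only needs access to part of a bucket.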

Install the KubeRay operator (the controller that runs Ray on K8s)


helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm upgrade --install kuberay-operator kuberay/kuberay-operator \
  --namespace kuberay-system \
  --create-namespace

# Verify that the operator pod is in the Running state
kubectl -n kuberay-system get pods

Step 4. Launch a Ray cluster

Create a YAML file that defines a Ray cluster. The following is a sample configuration (raycluster.yaml):


apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: glue-like
  namespace: ray
spec:
  rayVersion: "2.4.0"
  headGroupSpec:
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: amd64
        serviceAccountName: ray-s3-access
        containers:
        - name: ray-head
          image: rayproject/ray:2.4.0-py39
          imagePullPolicy: Always
          resources:
            requests: { cpu: "1", memory: "2Gi" }
            limits:   { cpu: "1", memory: "2Gi" }
  workerGroupSpecs:
  - groupName: workers
    replicas: 0 # start with only a head node (like a small Glue dev job) and tune the number of replicas later
    minReplicas: 0
    maxReplicas: 5
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: amd64
        serviceAccountName: ray-s3-access
        containers:
        - name: ray-worker
          image: rayproject/ray:2.4.0-py39
          imagePullPolicy: Always
          resources:
            requests: { cpu: "1", memory: "2Gi" }
            limits:   { cpu: "1", memory: "2Gi" }
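When you need both an amd64 and an arm64 variant of this configuration, generating the manifest can be less error-prone than maintaining two YAML files. The following is a minimal sketch that builds the same structure as a Python dictionary and prints it as JSON, which kubectl also accepts; the function name `ray_cluster_manifest` and its parameters are illustrative, and the defaults mirror the sample above.

```python
import json


def ray_cluster_manifest(image, arch="amd64", max_workers=5, namespace="ray",
                         name="glue-like", service_account="ray-s3-access"):
    """Build a RayCluster manifest with the same fields as the sample YAML."""

    def pod_template(container_name):
        # Head and worker pods share the same template apart from the container name.
        return {
            "spec": {
                "nodeSelector": {"kubernetes.io/arch": arch},
                "serviceAccountName": service_account,
                "containers": [{
                    "name": container_name,
                    "image": image,
                    "imagePullPolicy": "Always",
                    "resources": {
                        "requests": {"cpu": "1", "memory": "2Gi"},
                        "limits": {"cpu": "1", "memory": "2Gi"},
                    },
                }],
            }
        }

    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "rayVersion": "2.4.0",
            "headGroupSpec": {"template": pod_template("ray-head")},
            "workerGroupSpecs": [{
                "groupName": "workers",
                "replicas": 0,
                "minReplicas": 0,
                "maxReplicas": max_workers,
                "template": pod_template("ray-worker"),
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(ray_cluster_manifest("rayproject/ray:2.4.0-py39"), indent=2))
```

For the ARM64 scenario, pass your ECR image and `arch="arm64"`, save the output to a file, and apply it the same way as the YAML file.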

Deploy the Ray cluster on the Amazon Elastic Kubernetes Service cluster


kubectl apply -n $NS -f raycluster.yaml

# Validate that the head pod turns to READY/ RUNNING state
kubectl -n $NS get pods -l ray.io/cluster=glue-like -w

If you need to modify the deployed YAML, delete the cluster first and then reapply the updated YAML:


kubectl -n $NS delete raycluster glue-like
kubectl -n $NS apply -f raycluster.yaml

Accessing the Ray Dashboard

You can access the Ray dashboard by enabling port forwarding with kubectl:


# Get service
SVC=$(kubectl -n $NS get svc -l ray.io/cluster=glue-like,ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

# Make the Ray dashboard accessible at http://localhost:8265 on your local machine.
kubectl -n $NS port-forward svc/$SVC 8265:8265
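With the port-forward running, a quick script can confirm that the dashboard answers before you submit a job. The following is a minimal sketch using only the standard library; it assumes the dashboard exposes a `/api/version` endpoint that returns JSON (as the Ray Jobs REST API does in the versions I am aware of), and the function name `dashboard_version` is illustrative.

```python
import http.client
import json
import urllib.request


def dashboard_version(base_url="http://127.0.0.1:8265", timeout=3.0):
    """Return the parsed /api/version response, or None if the dashboard is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers response bodies that are not valid JSON.
        return None


if __name__ == "__main__":
    info = dashboard_version()
    if info is None:
        print("Dashboard not reachable; is the port-forward still running?")
    else:
        print("Dashboard reachable:", info)
```

A `None` result usually means the port-forward has exited or the head pod is not yet ready.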

Step 5. Submit a Ray job

To submit a Ray job, use the Ray jobs CLI. The CLI version can be newer than the cluster because it is backward compatible. As a prerequisite, store your job script locally in a file, for example job.py.


python3 -m venv ~/raycli && source ~/raycli/bin/activate
pip install "ray[default]==2.49.2"

# Submit your Ray job, supplying all Python dependencies that were added to your Glue job
ray job submit --address http://127.0.0.1:8265 --working-dir . \
  --runtime-env-json '{
    "pip": ["boto3==1.28.*","pyarrow==12.*","pandas==2.0.*"]
  }' \
  -- python job.py

You can monitor the job on the Ray dashboard.
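The same submission can also be made from Python with Ray's job submission SDK, which is convenient when jobs are launched from a scheduler or another script. The following is a minimal sketch that mirrors the CLI call in this step; it assumes `ray[default]` is installed locally and the port-forward is active, and the helper names `build_submission` and `submit` are illustrative.

```python
def build_submission(entrypoint, pip_packages, working_dir="."):
    """Assemble keyword arguments for JobSubmissionClient.submit_job."""
    return {
        "entrypoint": entrypoint,
        "runtime_env": {"working_dir": working_dir, "pip": list(pip_packages)},
    }


def submit(address="http://127.0.0.1:8265"):
    """Submit job.py to the cluster behind the given dashboard address."""
    # Imported lazily so build_submission can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    kwargs = build_submission(
        "python job.py",
        ["boto3==1.28.*", "pyarrow==12.*", "pandas==2.0.*"],
    )
    job_id = client.submit_job(**kwargs)
    print("Submitted job:", job_id, "status:", client.get_job_status(job_id))
    return job_id
```

Call `submit()` from the directory that contains job.py; the working directory is uploaded to the cluster in the same way as with `--working-dir .` on the CLI.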
