AWS Glue for Ray end of support

Important

AWS Glue for Ray is no longer available to new customers. Existing customers can continue to use the service as normal. For more information, see AWS Glue for Ray end of support.

After careful consideration, we have decided to close AWS Glue for Ray to new customers starting April 30, 2026. If you want to use AWS Glue for Ray, sign up before that date. Existing customers can continue to use the service as normal.

AWS continues to invest in security and availability improvements for AWS Glue for Ray. Apart from those improvements, we do not plan to introduce new features to AWS Glue for Ray.

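As an alternative to AWS Glue for Ray, we recommend Amazon Elastic Kubernetes Service. Amazon Elastic Kubernetes Service is a fully managed, certified Kubernetes-conformant service for building, securing, operating, and maintaining Kubernetes clusters on AWS. It is a highly customizable option that uses the open-source KubeRay Operator to deploy and manage Ray clusters on Kubernetes, and it offers improved resource utilization, simplified infrastructure management, and full support for Ray features.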

Migrating Ray jobs to Amazon Elastic Kubernetes Service

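This section provides the steps to migrate from AWS Glue for Ray to Ray on Amazon Elastic Kubernetes Service. The steps apply to the following two migration scenarios:

Standard migration (x86/amd64): For these use cases, the migration strategy uses the open-source Ray container as the base implementation and runs scripts directly on the base container.
ARM64 migration: For these use cases, the migration strategy supports building a custom container for ARM64-specific dependencies and architecture requirements.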

Prerequisites for migration

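Install the following tools: the aws, kubectl, eksctl, and helm CLIs, and Python 3.9 or later. These tools are required to provision and manage a Ray on EKS environment. eksctl simplifies creating and managing EKS clusters. kubectl is the standard Kubernetes CLI for deploying workloads to the cluster and troubleshooting them. helm is used to install and manage KubeRay (the operator that runs Ray on Kubernetes). Python 3.9 or later is required for Ray itself and for running job submission scripts locally.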
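
Before continuing, it can help to confirm that the prerequisites are actually present. The following is a minimal sketch using only the Python standard library; the function names (`missing_tools`, `python_is_supported`) are hypothetical helpers introduced for illustration, and the tool list simply mirrors the prerequisites above.

```python
import shutil
import sys

# Tool names taken from the prerequisites above.
REQUIRED_TOOLS = ("aws", "kubectl", "eksctl", "helm")
MIN_PYTHON = (3, 9)


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    if not python_is_supported():
        print("Python 3.9 or later is required")
    if not missing and python_is_supported():
        print("All prerequisites found")
```

The check only looks for executables on PATH; it does not verify tool versions or AWS credentials.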

Install eksctl

Follow the instructions in Installation options for Eksctl, or use the instructions below.

On macOS:


brew tap weaveworks/tap
brew install weaveworks/tap/eksctl

On Linux:


curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp

# Move the extracted binary to /usr/local/bin
sudo mv /tmp/eksctl /usr/local/bin

# Test the installation
eksctl version

Install kubectl

Follow the instructions in Set up kubectl and eksctl, or use the instructions below.

On macOS:


brew install kubectl

On Linux:


curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

Install helm

Follow the instructions in Installing Helm, or use the instructions below.

On macOS:


brew install helm

On Linux:


curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Step 1. Build or choose a Docker image for Ray

Option 1: Use the official Ray image (no build required)

This option uses the official Ray Docker image on Docker Hub, maintained by the Ray project (for example, rayproject/ray:2.4.0-py39).

Note

This image is amd64-only. Use this option if your dependencies are compatible with amd64 and you do not need ARM-specific builds.

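Option 2: Build and publish your own arm64 Ray 2.4.0 image

This option is useful when you run on Graviton (ARM) nodes, which matches what AWS Glue for Ray uses internally. You can create a custom image pinned to the same dependency versions as AWS Glue for Ray, which reduces compatibility mismatches.

Create a Dockerfile locally: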


# Build an ARM64 image
FROM --platform=linux/arm64 python:3.9-slim-bullseye
# Handy tools: wget for KubeRay probes; CA certs; keep image small
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Keep pip/setuptools modern enough for wheels resolution
RUN python -m pip install -U "pip<24" "setuptools<70" wheel

# ---- Install Ray 2.4.0 (ARM64 / Py3.9) and Glue-like dependencies ----
# 1) Download the exact Ray 2.4.0 wheel for aarch64 (no network at runtime)
RUN python -m pip download --only-binary=:all: --no-deps --dest /tmp/wheels ray==2.4.0

# 2) Core libs used in Glue (pin to Glue-era versions)
#    + the dashboard & jobs API dependencies compatible with Ray 2.4.0.
#    (Pins matter: newer major versions break 2.4.0's dashboard.)
RUN python -m pip install --no-cache-dir \
    /tmp/wheels/ray-2.4.0-*.whl \
    "pyarrow==11.0.0" \
    "pandas==1.5.3" \
    "boto3==1.26.133" \
    "botocore==1.29.133" \
    "numpy==1.24.3" \
    "fsspec==2023.4.0" \
    "protobuf<4" \
    # --- dashboard / jobs server deps ---
    "aiohttp==3.8.5" \
    "aiohttp-cors==0.7.0" \
    "yarl<1.10" "multidict<7.0" "frozenlist<1.4" "aiosignal<1.4" "async_timeout<5" \
    "pydantic<2" \
    "opencensus<0.12" \
    "prometheus_client<0.17" \
    # --- needed if using py_modules ---
    "smart_open[s3]==6.4.0"

# Optional: prove Ray & arch at container start
ENV PYTHONUNBUFFERED=1
WORKDIR /app

# KubeRay overrides the start command; this is just a harmless default
CMD ["python","-c","import ray,platform; print('Ray', ray.__version__, 'on', platform.machine())"]


# Build the image and push it to Amazon ECR.
# Set environment variables
export AWS_REGION=us-east-1
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REPO=ray-2-4-arm64
export IMAGE=${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO}:v1

# Create repository and login
aws ecr create-repository --region $AWS_REGION --repository-name $REPO >/dev/null 2>&1 || true
aws ecr get-login-password --region $AWS_REGION \
  | docker login --username AWS --password-stdin ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com

# Enable Buildx (for cross-builds on non-ARM hosts)
docker buildx create --name multi --driver docker-container --use 2>/dev/null || true

# Build & push ARM64 image
docker buildx build \
  --platform linux/arm64 \
  -t "$IMAGE" \
  . --push

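# Confirm the pushed image is in ECR by printing its config digest
# (to check the architecture itself, you can run: docker buildx imagetools inspect "$IMAGE")
aws ecr batch-get-image \
  --region $AWS_REGION \
  --repository-name $REPO \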
  --image-ids imageTag=v1 \
  --accepted-media-types application/vnd.docker.distribution.manifest.v2+json \
  | jq -r '.images[0].imageManifest' \
  | jq -r 'fromjson.config.digest'
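
Because the point of Option 2 is to pin the same dependency versions as AWS Glue for Ray, it may be worth checking that the built image really contains them. The sketch below compares installed package versions against a subset of the pins from the Dockerfile above, using `importlib.metadata` from the standard library. `EXPECTED_PINS`, `installed_version`, and `pin_mismatches` are hypothetical names introduced here, and the script is assumed to be run with the Python interpreter inside the container.

```python
from importlib import metadata

# Pins copied from the Dockerfile above (subset; extend as needed).
EXPECTED_PINS = {
    "ray": "2.4.0",
    "pyarrow": "11.0.0",
    "pandas": "1.5.3",
    "boto3": "1.26.133",
    "numpy": "1.24.3",
}


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_mismatches(pins=EXPECTED_PINS, lookup=installed_version):
    """Map each mismatched package to an (expected, found) tuple."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in pin_mismatches().items():
        print(f"{package}: expected {expected}, found {found}")
```

An empty result means every listed package matches its pin; packages that are not installed are reported with `None` as the found version.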

When done, reference this ARM64 image in your RayCluster spec, together with nodeSelector: { kubernetes.io/arch: arm64 }.


spec:
  rayVersion: "2.4.0"
  headGroupSpec:
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: arm64
        containers:
        - name: ray-head
          image: <your ECR image>
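
The choice between the two options comes down to a pairing of image and node architecture. As a rough sketch of that pairing, the following builds the ECR image URI in the same form as the IMAGE variable in the build commands above and returns the matching nodeSelector. The helper names (`ecr_image_uri`, `pod_settings`) are hypothetical, and the account ID in the example is a placeholder.

```python
import json

OFFICIAL_AMD64_IMAGE = "rayproject/ray:2.4.0-py39"  # Option 1 image from above


def ecr_image_uri(account, region, repo, tag="v1"):
    """Build the ECR image URI in the same form as the IMAGE variable above."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


def pod_settings(arch, arm64_image=None):
    """Return the image and nodeSelector for the given CPU architecture."""
    if arch == "amd64":
        image = OFFICIAL_AMD64_IMAGE
    elif arch == "arm64":
        if arm64_image is None:
            raise ValueError("arm64 needs a custom image (Option 2)")
        image = arm64_image
    else:
        raise ValueError(f"unsupported architecture: {arch}")
    return {"image": image, "nodeSelector": {"kubernetes.io/arch": arch}}


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID, not a real account.
    uri = ecr_image_uri("123456789012", "us-east-1", "ray-2-4-arm64")
    print(json.dumps(pod_settings("arm64", uri), indent=2))
```

Keeping the image and the nodeSelector together in one place makes it harder to schedule an amd64-only image onto Graviton nodes by accident.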

Step 2. Convert AWS Glue for Ray job configuration to Ray on Amazon Elastic Kubernetes Service

AWS Glue for Ray jobs support a set of job arguments that configure workers, dependencies, memory, and logging. When you migrate to Amazon Elastic Kubernetes Service with KubeRay, you must translate these arguments into RayCluster spec fields or Ray Job runtime environment settings.

Job argument mapping

Mapping AWS Glue for Ray arguments to their Ray on EKS equivalents
AWS Glue for Ray argument	What it does in AWS Glue for Ray	Ray on Amazon Elastic Kubernetes Service equivalent
`--min-workers`	The minimum number of workers that must be allocated to the job.	`workerGroupSpecs[].minReplicas` in the RayCluster
`--working-dir`	Distributes a zip (S3 URI) to all nodes.	Ray runtime environment: use `working_dir` when submitting from local files; use `py_modules` to point to a zip stored in S3
`--s3-py-modules`	Adds Python wheels/dists from S3.	Ray runtime environment: use `py_modules: ["s3://.../xxx.whl", ...]`
`--pip-install`	Installs additional PyPI packages for the job.	Ray runtime environment: `pip: ["pkg==ver", ...]` (Ray Jobs CLI `--runtime-env-json` or RayJob `runtimeEnvYAML`).
`--object_store_memory_head`	The percentage of memory for the Plasma store on the head node.	`headGroupSpec.rayStartParams.object-store-memory` in the RayCluster. This value must be in bytes: AWS Glue uses a percentage, but Ray uses bytes.
`--object_store_memory_worker`	The percentage of memory for the Plasma store on worker nodes.	Same as above, but set in each worker group's `rayStartParams.object-store-memory` (bytes).
`--object_spilling_config`	Configures Ray object spilling.	`headGroupSpec.rayStartParams.object-spilling-config`
`--logging_configuration`	AWS Glue-managed logs (CloudWatch, S3).	Check pod stdout/stderr with `kubectl -n ray logs <pod-name> --follow`. You can also view logs, including task and job logs, in the Ray dashboard (port-forward to 8265).
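
The percentage-to-bytes conversion for `object-store-memory` is the one mapping in this table that needs arithmetic. The following is a minimal sketch of it, assuming the percentage is applied to the memory available to the pod (for example, its memory limit); that base is an assumption, not something the table specifies. The function names are hypothetical.

```python
GIB = 1024 ** 3


def object_store_memory_bytes(total_memory_gib, percent):
    """Convert a Glue-style percentage into a byte count for Ray."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    return int(total_memory_gib * GIB * percent / 100)


def ray_start_params(total_memory_gib, percent):
    """Build a rayStartParams fragment; the values are passed as strings."""
    return {
        "object-store-memory": str(object_store_memory_bytes(total_memory_gib, percent))
    }


if __name__ == "__main__":
    # A pod with 4 GiB of memory and a Glue setting of 25 percent.
    print(ray_start_params(4, 25))
```

The resulting dictionary can be copied into the `rayStartParams` of the head group or of a worker group, depending on which Glue argument is being translated.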

Job configuration mapping

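Mapping AWS Glue for Ray job configuration to its Ray on EKS equivalents
Configuration	What it does in AWS Glue for Ray	Ray on EKS equivalent
Worker type	Sets the predefined worker type that is allowed when the job runs. The default is Z.2X (8 vCPU, 64 GB RAM).	The node group instance type in EKS (for example, r7g.2xlarge ≈ 8 vCPU/64 GB for ARM, or r7a.2xlarge for x86).
Maximum number of workers	The number of workers you want AWS Glue to allocate to this job.	Set `workerGroupSpecs[].maxReplicas` to the same number you used in AWS Glue. This is the upper bound for autoscaling. Similarly, set `minReplicas` as the lower bound. You can start with `replicas: 0` and `minReplicas: 0`.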
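
The worker-count mapping can be sketched as a small helper that turns Glue's worker numbers into the three scaling fields of a worker group. `worker_group_scaling` is a hypothetical name, and starting at the minimum replica count is an assumption that follows the suggestion in the table to begin with `replicas: 0`.

```python
import json


def worker_group_scaling(max_workers, min_workers=0, initial=None):
    """Translate Glue worker counts into workerGroupSpecs[] scaling fields."""
    if initial is None:
        initial = min_workers  # assumption: start at the lower bound
    if not 0 <= min_workers <= initial <= max_workers:
        raise ValueError("need 0 <= min_workers <= initial <= max_workers")
    return {
        "replicas": initial,
        "minReplicas": min_workers,
        "maxReplicas": max_workers,
    }


if __name__ == "__main__":
    # A Glue job that was configured with a maximum of 5 workers.
    print(json.dumps(worker_group_scaling(5), indent=2))
```

The validation step catches the common mistake of setting a minimum that exceeds the maximum before the manifest is applied.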

Step 3. Set up Amazon Elastic Kubernetes Service

You can create a new Amazon Elastic Kubernetes Service cluster or reuse an existing one. If you use an existing cluster, skip the cluster creation command and go straight to adding a node group, setting up IRSA, and installing KubeRay.

Create an Amazon Elastic Kubernetes Service cluster

Note

If you already have an Amazon Elastic Kubernetes Service cluster, skip the command that creates a new cluster and just add a node group.


# Environment Variables
export AWS_REGION=us-east-1
export CLUSTER=ray-eks
export NS=ray # namespace for your Ray jobs (you can reuse another if you like)

# Create a cluster (OIDC is required for IRSA)
eksctl create cluster \
  --name $CLUSTER \
  --region $AWS_REGION \
  --with-oidc \
  --managed

Add a node group


# ARM/Graviton (matches Glue's typical runtime):
eksctl create nodegroup \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --name arm64-ng \
  --node-type m7g.large \
  --nodes 2 --nodes-min 1 --nodes-max 5 \
  --managed \
  --node-labels "workload=ray"

# x86/amd64 (use if your image is amd64-only):
eksctl create nodegroup \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --name amd64-ng \
  --node-type m5.large \
  --nodes 2 --nodes-min 1 --nodes-max 5 \
  --managed \
  --node-labels "workload=ray"

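Note

If you use an existing Amazon Elastic Kubernetes Service cluster, make sure that OIDC is enabled for it, because IRSA requires it. The --with-oidc flag applies when the cluster is created; for an existing cluster, you can associate an OIDC provider with eksctl utils associate-iam-oidc-provider --cluster $CLUSTER --approve.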

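Create a namespace and an IAM role for service accounts (IRSA) for S3

A Kubernetes namespace is a logical grouping for resources (pods, services, roles, and so on). You can create a new namespace or reuse an existing one. You also need to create an IAM policy for S3 that mirrors the access of your AWS Glue job. Use the same custom permissions that your AWS Glue job role had (typically S3 read/write on specific buckets). To grant Amazon Elastic Kubernetes Service permissions similar to AWSGlueServiceRole, create a service account (IRSA) bound to this IAM policy. For instructions on setting up this service account, see IAM roles for service accounts.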


# Create (or reuse) namespace
kubectl create namespace $NS || true


{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::YOUR-BUCKET",
      "arn:aws:s3:::YOUR-BUCKET/*"
    ]
  }]
}
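
If you manage several buckets, writing the policy by hand for each one gets error-prone. As a sketch, the policy document above could be generated for a given bucket name and saved locally; the file name example.json matches the one passed to aws iam create-policy in the commands that follow. `s3_policy` and `write_policy` are hypothetical helper names, and `my-bucket` is a placeholder.

```python
import json


def s3_policy(bucket):
    """Return the S3 read/write policy shown above for one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
        }],
    }


def write_policy(bucket, path="example.json"):
    """Write the policy document to `path` and return the path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(s3_policy(bucket), handle, indent=2)
    return path


if __name__ == "__main__":
    # "my-bucket" is a placeholder; use the bucket your Glue job accessed.
    print(write_policy("my-bucket"))
```

Both resource entries are needed: the bucket ARN covers `s3:ListBucket`, and the `/*` ARN covers the object-level actions.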


# Create the IAM policy from the JSON document above (saved locally as example.json) and wire IRSA:
aws iam create-policy \
  --policy-name RayS3Policy \
  --policy-document file://example.json || true

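# Create a service account (IRSA) bound to that policy.
# AWS_ACCOUNT must be set; it was exported only in the Option 2 image build.
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)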
eksctl create iamserviceaccount \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --namespace $NS \
  --name ray-s3-access \
  --attach-policy-arn arn:aws:iam::${AWS_ACCOUNT}:policy/RayS3Policy \
  --approve \
  --override-existing-serviceaccounts

Install the KubeRay operator (the controller that runs Ray on K8s)


helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm upgrade --install kuberay-operator kuberay/kuberay-operator \
  --namespace kuberay-system \
  --create-namespace

# Verify that the operator pod is Running
kubectl -n kuberay-system get pods

Step 4. Spin up a Ray cluster

Create a YAML file to define the Ray cluster. The following is a sample configuration (raycluster.yaml):


apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: glue-like
  namespace: ray
spec:
  rayVersion: "2.4.0"
  headGroupSpec:
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: amd64
        serviceAccountName: ray-s3-access
        containers:
        - name: ray-head
          image: rayproject/ray:2.4.0-py39
          imagePullPolicy: Always
          resources:
            requests: { cpu: "1", memory: "2Gi" }
            limits:   { cpu: "1", memory: "2Gi" }
  workerGroupSpecs:
  - groupName: workers
    replicas: 0 # start with just a head (like a small Glue dev job) and scale up the number of replicas later
    minReplicas: 0
    maxReplicas: 5
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: amd64
        serviceAccountName: ray-s3-access
        containers:
        - name: ray-worker
          image: rayproject/ray:2.4.0-py39
          imagePullPolicy: Always
          resources:
            requests: { cpu: "1", memory: "2Gi" }
            limits:   { cpu: "1", memory: "2Gi" }

Deploy the Ray cluster on the Amazon Elastic Kubernetes Service cluster


kubectl apply -n $NS -f raycluster.yaml

# Verify that the head pod reaches the READY/RUNNING state
kubectl -n $NS get pods -l ray.io/cluster=glue-like -w

If you need to modify the deployed yaml, delete the cluster first, and then reapply the updated yaml.


kubectl -n $NS delete raycluster glue-like
kubectl -n $NS apply -f raycluster.yaml

Access the Ray dashboard

You can access the Ray dashboard by enabling port forwarding with kubectl.


# Get service
SVC=$(kubectl -n $NS get svc -l ray.io/cluster=glue-like,ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

# Make the Ray dashboard accessible at http://localhost:8265 on your local machine.
kubectl -n $NS port-forward svc/$SVC 8265:8265
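
Once the port-forward is running, a quick reachability check can save a failed job submission later. The sketch below requests the dashboard's version endpoint with the standard library only. Treat the `/api/version` path as an assumption about the Ray dashboard's HTTP API rather than something this guide documents; `dashboard_version` is a hypothetical helper name.

```python
import json
import urllib.error
import urllib.request


def dashboard_version(base_url="http://127.0.0.1:8265", timeout=3.0):
    """Return the parsed version response, or None if the dashboard is unreachable."""
    try:
        # Assumed endpoint: /api/version on the Ray dashboard.
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP errors, and non-JSON bodies.
        return None


if __name__ == "__main__":
    info = dashboard_version()
    if info is None:
        print("Dashboard not reachable; is the port-forward still running?")
    else:
        print("Dashboard reachable:", info)
```

A `None` result usually means the kubectl port-forward has exited or the head pod is not ready yet.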

Step 5. Submit a Ray job

To submit a Ray job, use the Ray Jobs CLI. The CLI version can be newer than the cluster version; it is backward compatible. As a prerequisite, save your job script locally in a file (for example, job.py).


python3 -m venv ~/raycli && source ~/raycli/bin/activate
pip install "ray[default]==2.49.2"

# Submit your Ray job, supplying all Python dependencies that were added to your Glue job
ray job submit --address http://127.0.0.1:8265 --working-dir . \
  --runtime-env-json '{
    "pip": ["boto3==1.28.*","pyarrow==12.*","pandas==2.0.*"]
  }' \
  -- python job.py
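
When many jobs are migrated, assembling the runtime environment JSON by hand for each one is easy to get wrong. The following sketch builds the same `ray job submit` command line as above from Glue-style dependency settings. It assumes the Glue `--pip-install` and `--s3-py-modules` values have already been split into Python lists; the helper names and the example S3 URI are hypothetical.

```python
import json
import shlex


def runtime_env_from_glue(pip_install=None, s3_py_modules=None):
    """Map Glue's --pip-install / --s3-py-modules values to a Ray runtime_env."""
    env = {}
    if pip_install:
        env["pip"] = list(pip_install)
    if s3_py_modules:
        env["py_modules"] = list(s3_py_modules)
    return env


def submit_command(script, runtime_env, address="http://127.0.0.1:8265", working_dir="."):
    """Build the argument list for `ray job submit` as used above."""
    return [
        "ray", "job", "submit",
        "--address", address,
        "--working-dir", working_dir,
        "--runtime-env-json", json.dumps(runtime_env),
        "--", "python", script,
    ]


if __name__ == "__main__":
    env = runtime_env_from_glue(pip_install=["pandas==2.0.*"])
    # Print a shell-safe command line; pass the list to subprocess.run to execute it.
    print(shlex.join(submit_command("job.py", env)))
```

Building the command as a list and quoting it with `shlex.join` avoids the shell-escaping problems that inline JSON tends to cause.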

You can monitor the job in the Ray dashboard.
