

# What is AWS Glue?
<a name="what-is-glue"></a>

 AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. You can use it for analytics, machine learning, and application development. It also includes additional productivity and DataOps tooling for authoring jobs, running them, and implementing business workflows. 

 With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. 

 AWS Glue consolidates major data integration capabilities into a single service. These include data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It's also serverless, which means there's no infrastructure to manage. With flexible support for ETL, ELT, and streaming workloads in one service, AWS Glue serves a wide range of workloads and types of users. 

 Also, AWS Glue makes it easy to integrate data across your architecture. It integrates with AWS analytics services and Amazon S3 data lakes. AWS Glue has integration interfaces and job-authoring tools that are easy to use for all users, from developers to business users, with tailored solutions for varied technical skill sets. 

[![AWS Videos](http://img.youtube.com/vi/u14iVEc-C6E/0.jpg)](http://www.youtube.com/watch?v=u14iVEc-C6E)


 With the ability to scale on demand, AWS Glue helps you focus on high-value activities that maximize the value of your data. It scales for any data size, and supports all data types and schema variances. To increase agility and optimize costs, AWS Glue provides built-in high availability and pay-as-you-go billing. 

For pricing information, see [AWS Glue pricing](https://aws.amazon.com/glue/pricing).

 **AWS Glue Studio** 

 AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor data integration jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on the Apache Spark–based serverless ETL engine in AWS Glue. 

With AWS Glue Studio, you can create and manage jobs that gather, transform, and clean data. You can also use AWS Glue Studio to troubleshoot and edit job scripts. 

**Topics**
+ [AWS Glue features](#glue-features-summary)
+ [Learning about innovations in AWS Glue](#innovations-in-glue)
+ [Getting started with AWS Glue](#getting-started-with-glue)
+ [Accessing AWS Glue](#accessing-aws-glue)
+ [Related services](#what-is-glue-related-services)
+ [AWS Glue for Ray end of support](awsglue-ray-jobs-availability-change.md)

## AWS Glue features
<a name="glue-features-summary"></a>

AWS Glue features fall into three major categories: 
+  Discover and organize data 
+  Transform, prepare, and clean data for analysis 
+  Build and monitor data pipelines 

 **Discover and organize data** 
+  **Unify and search across multiple data stores** – Store, index, and search across multiple data sources and sinks by cataloging all your data in AWS. 
+  **Automatically discover data** – Use AWS Glue crawlers to automatically infer schema information and integrate it into your AWS Glue Data Catalog. 
+  **Manage schemas and permissions** – Validate and control access to your databases and tables. 
+  **Connect to a wide variety of data sources** – Tap into multiple data sources, both on premises and on AWS, using AWS Glue connections to build your data lake. 

 **Transform, prepare, and clean data for analysis** 
+  **Visually transform data with a job canvas interface** – Define your ETL process in the visual job editor and automatically generate the code to extract, transform, and load your data. 
+  **Build complex ETL pipelines with simple job scheduling** – Invoke AWS Glue jobs on a schedule, on demand, or based on an event. 
+  **Clean and transform streaming data in transit** – Enable continuous data consumption, and clean and transform it in transit. This makes it available for analysis in seconds in your target data store. 
+ **Deduplicate and cleanse data with built-in machine learning** – Clean and prepare your data for analysis without becoming a machine learning expert by using the `FindMatches` feature. This feature deduplicates and finds records that are imperfect matches for each other. 
+  **Built-in job notebooks** – AWS Glue job notebooks provide serverless notebooks with minimal setup in AWS Glue so you can get started quickly. 
+  **Edit, debug, and test ETL code** – With AWS Glue interactive sessions, you can interactively explore and prepare data. You can explore, experiment on, and process data interactively using the IDE or notebook of your choice. 
+  **Define, detect, and remediate sensitive data** – AWS Glue sensitive data detection lets you define, identify, and process sensitive data in your data pipeline and in your data lake. 

 **Build and monitor data pipelines** 
+  **Automatically scale based on workload** – Dynamically scale resources up and down based on workload. This assigns workers to jobs only when needed. 
+  **Automate jobs with event-based triggers** – Start crawlers or AWS Glue jobs with event-based triggers, and design a chain of dependent jobs and crawlers. 
+  **Run and monitor jobs** – Run AWS Glue jobs with your choice of engine, Spark or Ray. Monitor them with automated monitoring tools, AWS Glue job run insights, and AWS CloudTrail. Improve your monitoring of Spark-backed jobs with the Apache Spark UI.
+  **Define workflows for ETL and integration activities** – Define workflows for ETL and integration activities for multiple crawlers, jobs, and triggers. 

## Learning about innovations in AWS Glue
<a name="innovations-in-glue"></a>

Learn about the latest innovations in AWS Glue and hear how customers use AWS Glue to enable self-service data preparation across their organization.

[![AWS Videos](http://img.youtube.com/vi/cDDPg_XxPqc/0.jpg)](http://www.youtube.com/watch?v=cDDPg_XxPqc)


Learn about how customers scale AWS Glue beyond the traditional setup and how they configure AWS Glue for job monitoring and performance.

[![AWS Videos](http://img.youtube.com/vi/ce6t3FqB_Z4/0.jpg)](http://www.youtube.com/watch?v=ce6t3FqB_Z4)


## Getting started with AWS Glue
<a name="getting-started-with-glue"></a>

 We recommend that you start with the following sections: 
+  [ Overview of using AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/start-console-overview.html) 
+  [AWS Glue concepts ](https://docs.aws.amazon.com/glue/latest/dg/components-key-concepts.html) 
+  [ Setting up IAM permissions for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/set-up-iam.html) 
+  [ Getting started with the AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/start-data-catalog.html) 
+  [ Authoring jobs in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/author-job-glue.html) 
+  [ Getting started with AWS Glue interactive sessions ](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html) 
+  [ Orchestration in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/etl-jobs.html) 

## Accessing AWS Glue
<a name="accessing-aws-glue"></a>

 You can create, view, and manage your AWS Glue jobs using the following interfaces: 
+  **AWS Glue console** – Provides a web interface for you to create, view, and manage your AWS Glue jobs. To access the console, see [https://console.aws.amazon.com/glue](https://console.aws.amazon.com/glue). 
+  **AWS Glue Studio** – Provides a graphical interface for you to create and edit your AWS Glue jobs visually. For more information, see [Building visual ETL jobs](author-job-glue.md). 
+  **AWS Glue section of the AWS CLI Reference** – Provides AWS CLI commands that you can use with AWS Glue. For more information, see [AWS CLI Reference for AWS Glue](https://docs.aws.amazon.com/cli/latest/reference/glue/index.html). 
+  **AWS Glue API** – Provides a complete API reference for developers. For more information, see [AWS Glue API](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html). 

## Related services
<a name="what-is-glue-related-services"></a>

 Users of AWS Glue also use: 
+  **[AWS Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html)** – An authorization layer that provides fine-grained access control to resources in the AWS Glue Data Catalog. 
+  **[AWS Glue DataBrew](https://docs.aws.amazon.com/databrew/latest/dg/what-is.html)** – A visual data preparation tool that you can use to clean and normalize data without writing any code. 

# AWS Glue for Ray end of support
<a name="awsglue-ray-jobs-availability-change"></a>

**Important**  
AWS Glue for Ray will no longer be open to new customers starting April 30, 2026. If you would like to use AWS Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal. For capabilities similar to AWS Glue for Ray, explore Amazon EKS. For more information, see [AWS Glue for Ray end of support](https://docs.aws.amazon.com/glue/latest/dg/awsglue-ray-jobs-availability-change.html).

After careful consideration, we decided to close AWS Glue for Ray to new customers starting April 30, 2026. If you would like to use AWS Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal.

AWS continues to invest in security and availability improvements for AWS Glue for Ray. Note that we do not plan to introduce new features to AWS Glue for Ray, except for security and availability enhancements.

As an alternative to AWS Glue for Ray, we recommend Amazon Elastic Kubernetes Service, a fully managed, certified Kubernetes conformant service that simplifies the process of building, securing, operating, and maintaining Kubernetes clusters on AWS. It is a highly customizable option that relies on the open-source KubeRay operator to deploy and manage Ray clusters on Kubernetes, offering improved resource utilization, simplified infrastructure management, and full support for Ray features.

## Migrating a Ray job to Amazon Elastic Kubernetes Service
<a name="awsglue-ray-jobs-availability-change-migration"></a>

This section provides steps for migrating from AWS Glue for Ray to Ray on Amazon Elastic Kubernetes Service. These steps are helpful for two migration scenarios:
+ **Standard Migration (x86/amd64)**: For these use cases, the migration strategy uses the open-source Ray container image for basic implementations and runs scripts directly on the base container.
+ **ARM64 Migration**: For these use cases, the migration strategy supports custom container builds for ARM64-specific dependencies and architecture requirements.

### Prerequisites for migration
<a name="awsglue-ray-jobs-availability-change-prerequisites"></a>

Install the following CLI tools: **aws**, **kubectl**, **eksctl**, **helm**, and Python 3.9+. These CLI tools are required to provision and manage your Ray on EKS environment. **eksctl** simplifies creating and managing EKS clusters. **kubectl** is the standard Kubernetes CLI for deploying and troubleshooting workloads on your cluster. **helm** is used to install and manage KubeRay (the operator that runs Ray on Kubernetes). Python 3.9+ is required for Ray itself and to run job submission scripts locally.
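
Before continuing, you can sanity-check that the required tools are on your PATH. A quick sketch (the Python version floor matches the prerequisite above):

```
# Report which of the required tools are installed
for tool in aws kubectl eksctl helm python3; do
  command -v "$tool" >/dev/null 2>&1 && echo "$tool: found" || echo "$tool: MISSING"
done

# Confirm the local Python meets the 3.9+ requirement
python3 -c 'import sys; print("python3 OK" if sys.version_info >= (3, 9) else "python3 too old (need 3.9+)")' \
  || echo "python3: MISSING"
```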

#### Install eksctl
<a name="awsglue-ray-jobs-availability-change-install-eksctl"></a>

Follow the instructions on [Installation options for Eksctl](https://docs.aws.amazon.com/eks/latest/eksctl/installation.html) or use the instructions below for installation.

For macOS:

```
brew install eksctl
```

For Linux:

```
curl --silent --location "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp

# Move the extracted binary to /usr/local/bin
sudo mv /tmp/eksctl /usr/local/bin

# Test the installation
eksctl version
```

#### Install kubectl
<a name="awsglue-ray-jobs-availability-change-install-kubectl"></a>

Follow the instructions on [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) or use the instructions below for installation.

For macOS:

```
brew install kubectl
```

For Linux:

```
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
```

#### Install helm
<a name="awsglue-ray-jobs-availability-change-install-helm"></a>

Follow the instructions on [Installing Helm](https://helm.sh/docs/intro/install/) or use the instructions below for installation.

For macOS:

```
brew install helm
```

For Linux:

```
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```

### Step 1. Build or choose a Docker Image for Ray
<a name="awsglue-ray-jobs-availability-change-step1"></a>

**Option 1: Use the official Ray image (no build required)**

This option uses the official Ray Docker image on [Docker Hub](https://hub.docker.com/u/rayproject), for example `rayproject/ray:2.4.0-py39`, which is maintained by the Ray project.

**Note**  
This image is amd64-only. Use this if your dependencies are compatible with amd64 and you don't require ARM-specific builds.

**Option 2: Build and publish your own arm64 Ray 2.4.0 image**

This option is useful when using Graviton (ARM) nodes, consistent with what AWS Glue for Ray uses internally. You can create a custom image pinned to the same dependency versions as AWS Glue for Ray, to reduce compatibility mismatches.

Create a Dockerfile locally:

```
# Build an ARM64 image
FROM --platform=linux/arm64 python:3.9-slim-bullseye
# Handy tools: wget for KubeRay probes; CA certs; keep image small
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Keep pip/setuptools modern enough for wheels resolution
RUN python -m pip install -U "pip<24" "setuptools<70" wheel

# ---- Install Ray 2.4.0 (ARM64 / Py3.9) and Glue-like dependencies ----
# 1) Download the exact Ray 2.4.0 wheel for aarch64 (no network at runtime)
RUN python -m pip download --only-binary=:all: --no-deps --dest /tmp/wheels ray==2.4.0

# 2) Core libs used in Glue (pin to Glue-era versions)
#    + the dashboard & jobs API dependencies compatible with Ray 2.4.0.
#    (Pins matter: newer major versions break 2.4.0's dashboard.)
RUN python -m pip install --no-cache-dir \
    /tmp/wheels/ray-2.4.0-*.whl \
    "pyarrow==11.0.0" \
    "pandas==1.5.3" \
    "boto3==1.26.133" \
    "botocore==1.29.133" \
    "numpy==1.24.3" \
    "fsspec==2023.4.0" \
    "protobuf<4" \
    # --- dashboard / jobs server deps ---
    "aiohttp==3.8.5" \
    "aiohttp-cors==0.7.0" \
    "yarl<1.10" "multidict<7.0" "frozenlist<1.4" "aiosignal<1.4" "async_timeout<5" \
    "pydantic<2" \
    "opencensus<0.12" \
    "prometheus_client<0.17" \
    # --- needed if using py_modules ---
    "smart_open[s3]==6.4.0"

# Optional: prove Ray & arch at container start
ENV PYTHONUNBUFFERED=1
WORKDIR /app

# KubeRay overrides the start command; this is just a harmless default
CMD ["python","-c","import ray,platform; print('Ray', ray.__version__, 'on', platform.machine())"]
```

```
# Set environment variables
export AWS_REGION=us-east-1
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REPO=ray-2-4-arm64
export IMAGE=${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO}:v1

# Create repository and login
aws ecr create-repository --repository-name $REPO >/dev/null 2>&1 || true
aws ecr get-login-password --region $AWS_REGION \
  | docker login --username AWS --password-stdin ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com

# Enable Buildx (for cross-builds on non-ARM hosts)
docker buildx create --name multi --driver docker-container --use 2>/dev/null || true

# Build & push ARM64 image
docker buildx build \
  --platform linux/arm64 \
  -t "$IMAGE" \
  . --push

# Verify the image architecture remotely
aws ecr batch-get-image \
  --repository-name $REPO \
  --image-ids imageTag=v1 \
  --accepted-media-types application/vnd.docker.distribution.manifest.v2+json \
  | jq -r '.images[0].imageManifest' \
  | jq -r 'fromjson.config.digest'
```

Once done, reference this ARM64 image in the RayCluster spec with `nodeSelector: { kubernetes.io/arch: arm64 }`.

```
spec:
  rayVersion: "2.4.0"
  headGroupSpec:
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: arm64
        containers:
        - name: ray-head
          image: <your ECR image>
```

### Step 2. Convert AWS Glue for Ray Job Configuration to Ray on Amazon Elastic Kubernetes Service
<a name="awsglue-ray-jobs-availability-change-step2"></a>

AWS Glue for Ray jobs support a set of job arguments that configure workers, dependencies, memory, and logging. When migrating to Amazon Elastic Kubernetes Service with KubeRay, these arguments need to be translated into RayCluster spec fields or Ray Job runtime environment settings.

#### Job Argument Mapping
<a name="awsglue-ray-jobs-availability-change-job-argument-mapping"></a>


**Mapping AWS Glue for Ray Arguments to Ray on EKS Equivalents**  

| AWS Glue for Ray argument | What it does in AWS Glue for Ray | Ray on Amazon Elastic Kubernetes Service equivalent | 
| --- | --- | --- | 
| --min-workers | Minimum workers the job must allocate. | workerGroupSpecs[].minReplicas in your RayCluster | 
| --working-dir | Distributes a zip (S3 URI) to all nodes. | Use Ray runtime env: working_dir if you're submitting from local files; py_modules to point at an S3 artifact such as a zip | 
| --s3-py-modules | Adds Python wheels/dists from S3. | Use Ray runtime env: py_modules: ["s3://.../xxx.whl", ...] | 
| --pip-install | Installs extra PyPI packages for the job. | Ray runtime env: pip: ["pkg==ver", ...] (Ray Job CLI --runtime-env-json or RayJob runtimeEnvYAML). | 
| --object-store-memory-head | Percentage of memory for the head node's Plasma store. | headGroupSpec.rayStartParams.object-store-memory in your RayCluster. Note this value is in bytes: AWS Glue uses a percentage, while Ray uses bytes. | 
| --object-store-memory-worker | Percentage of memory for worker nodes' Plasma store. | Same as above, but set in each worker group's rayStartParams.object-store-memory (bytes). | 
| --object-spilling-config | Configures Ray object spilling. | headGroupSpec.rayStartParams.object-spilling-config | 
| --logging-configuration | AWS Glue-managed logs (CloudWatch, S3). | Check pod stdout/stderr with kubectl -n ray logs <pod-name> --follow, or view task and job logs in the Ray Dashboard (port-forward to :8265). | 
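
Because AWS Glue expresses object store memory as a percentage of node memory while Ray expects an absolute byte count, you need to convert the value when filling in rayStartParams. A minimal sketch, assuming a 64 GiB node and a hypothetical Glue setting of 25 percent:

```
# Hypothetical values: a 64 GiB node, --object-store-memory-worker set to 25 (percent)
NODE_MEM_BYTES=$((64 * 1024 * 1024 * 1024))       # total node memory in bytes
GLUE_PCT=25                                       # percentage from the Glue job argument
OBJECT_STORE_BYTES=$((NODE_MEM_BYTES * GLUE_PCT / 100))
echo "$OBJECT_STORE_BYTES"                        # prints 17179869184
```

Use the resulting number as the rayStartParams.object-store-memory value in the head group or the relevant worker group.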

#### Job Configuration Mapping
<a name="awsglue-ray-jobs-availability-change-job-config-mapping"></a>


**Mapping AWS Glue for Ray Job Configurations to Ray on EKS Equivalents**  

| Configuration | What it does in AWS Glue for Ray | Ray on EKS equivalent | 
| --- | --- | --- | 
| Worker type | Sets the type of predefined worker allowed when a job runs. Defaults to Z.2X (8 vCPU, 64 GB RAM). | Nodegroup instance type in EKS (e.g., r7g.2xlarge ≈ 8 vCPU / 64 GB for ARM, r7a.2xlarge for x86). | 
| Maximum number of workers | The number of workers you want AWS Glue to allocate to this job. | Set workerGroupSpecs[].maxReplicas to the same number that you used in AWS Glue; this is the upper bound for autoscaling. Similarly, set minReplicas as the lower bound. You can start with replicas: 0, minReplicas: 0. | 
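
As an illustration of the mapping above, a job that previously ran with a maximum of 5 workers might translate into a worker group like the following sketch (the instance architecture, labels, and counts are illustrative, not taken from your job):

```
workerGroupSpecs:
- groupName: workers
  replicas: 0       # start empty and scale up when jobs run
  minReplicas: 0    # lower bound, analogous to --min-workers
  maxReplicas: 5    # upper bound, analogous to "Maximum number of workers"
  template:
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64   # pin to the Graviton node group, e.g. r7g.2xlarge
```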

### Step 3. Set up Amazon Elastic Kubernetes Service
<a name="awsglue-ray-jobs-availability-change-step3"></a>

You can either create a new Amazon Elastic Kubernetes Service cluster or reuse an existing one. If you're using an existing cluster, skip the cluster creation commands and jump to adding a node group, setting up IRSA, and installing KubeRay.

#### Create an Amazon Elastic Kubernetes Service cluster
<a name="awsglue-ray-jobs-availability-change-create-cluster"></a>

**Note**  
If you have an existing Amazon Elastic Kubernetes Service cluster, skip the commands to create a new cluster and just add a node group.

```
# Environment Variables
export AWS_REGION=us-east-1
export CLUSTER=ray-eks
export NS=ray # namespace for your Ray jobs (you can reuse another if you like)

# Create a cluster (OIDC is required for IRSA)
eksctl create cluster \
  --name $CLUSTER \
  --region $AWS_REGION \
  --with-oidc \
  --managed
```

#### Add a node group
<a name="awsglue-ray-jobs-availability-change-add-nodegroup"></a>

```
# ARM/Graviton (matches Glue's typical runtime):
eksctl create nodegroup \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --name arm64-ng \
  --node-type m7g.large \
  --nodes 2 --nodes-min 1 --nodes-max 5 \
  --managed \
  --node-labels "workload=ray"

# x86/amd64 (use if your image is amd64-only):
eksctl create nodegroup \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --name amd64-ng \
  --node-type m5.large \
  --nodes 2 --nodes-min 1 --nodes-max 5 \
  --managed \
  --node-labels "workload=ray"
```

**Note**  
If you are using an existing Amazon Elastic Kubernetes Service cluster that was created without OIDC, associate an OIDC provider before creating IRSA service accounts, for example with `eksctl utils associate-iam-oidc-provider --cluster $CLUSTER --approve`.

#### Create namespace and IAM role for Service Accounts (IRSA) for S3
<a name="awsglue-ray-jobs-availability-change-irsa"></a>

A Kubernetes namespace is a logical grouping for resources (pods, services, roles, etc.). You can create or reuse an existing namespace. You will also need to create an IAM policy for S3 which mirrors your AWS Glue job's access. Use the same custom permissions your AWS Glue job role had (typically S3 read/write to specific buckets). To grant permissions to Amazon Elastic Kubernetes Service similar to the AWSGlueServiceRole, create a Service Account (IRSA) bound to this IAM policy. Refer to [IAM Roles for Service Accounts](https://docs.aws.amazon.com/eks/latest/eksctl/iamserviceaccounts.html) for instructions to setup this service account.

```
# Create (or reuse) namespace
kubectl create namespace $NS || true
```

Save the following policy document as example.json, replacing YOUR-BUCKET with your bucket name:

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::YOUR-BUCKET",
      "arn:aws:s3:::YOUR-BUCKET/*"
    ]
  }]
}
```

```
# Create the IAM policy and wire IRSA:
aws iam create-policy \
  --policy-name RayS3Policy \
  --policy-document file://example.json || true

# Create a service account (IRSA) bound to that policy.
eksctl create iamserviceaccount \
  --cluster $CLUSTER \
  --region $AWS_REGION \
  --namespace $NS \
  --name ray-s3-access \
  --attach-policy-arn arn:aws:iam::${AWS_ACCOUNT}:policy/RayS3Policy \
  --approve \
  --override-existing-serviceaccounts
```

#### Install KubeRay operator (controller that runs Ray on K8s)
<a name="awsglue-ray-jobs-availability-change-install-kuberay"></a>

```
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm upgrade --install kuberay-operator kuberay/kuberay-operator \
  --namespace kuberay-system \
  --create-namespace

# Validate the operator pod Running
kubectl -n kuberay-system get pods
```

### Step 4. Spin up a Ray cluster
<a name="awsglue-ray-jobs-availability-change-step4"></a>

Create a YAML file to define the Ray cluster. Below is a sample configuration (raycluster.yaml):

```
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: glue-like
  namespace: ray
spec:
  rayVersion: "2.4.0"
  headGroupSpec:
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: amd64
        serviceAccountName: ray-s3-access
        containers:
        - name: ray-head
          image: rayproject/ray:2.4.0-py39
          imagePullPolicy: Always
          resources:
            requests: { cpu: "1", memory: "2Gi" }
            limits:   { cpu: "1", memory: "2Gi" }
  workerGroupSpecs:
  - groupName: workers
    replicas: 0 # start with just a head (like a small Glue dev job) and tune the number of replicas later
    minReplicas: 0
    maxReplicas: 5
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: amd64
        serviceAccountName: ray-s3-access
        containers:
        - name: ray-worker
          image: rayproject/ray:2.4.0-py39
          imagePullPolicy: Always
          resources:
            requests: { cpu: "1", memory: "2Gi" }
            limits:   { cpu: "1", memory: "2Gi" }
```

#### Deploy the Ray cluster on Amazon Elastic Kubernetes Service cluster
<a name="awsglue-ray-jobs-availability-change-deploy-cluster"></a>

```
kubectl apply -n $NS -f raycluster.yaml

# Validate that the head pod reaches the READY/RUNNING state
kubectl -n $NS get pods -l ray.io/cluster=glue-like -w
```

If you need to modify the deployed YAML, delete the cluster first and then re-apply the updated file:

```
kubectl -n $NS delete raycluster glue-like
kubectl -n $NS apply -f raycluster.yaml
```

#### Accessing the Ray Dashboard
<a name="awsglue-ray-jobs-availability-change-ray-dashboard"></a>

You can access the Ray dashboard by enabling port-forwarding using kubectl:

```
# Get service
SVC=$(kubectl -n $NS get svc -l ray.io/cluster=glue-like,ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

# Make the Ray dashboard accessible at http://localhost:8265 on your local machine.
kubectl -n $NS port-forward svc/$SVC 8265:8265
```

### Step 5. Submit Ray Job
<a name="awsglue-ray-jobs-availability-change-step5"></a>

To submit a Ray job, use the Ray jobs CLI. The CLI version can be newer than the cluster; it is backward compatible. As a prerequisite, store your job script locally in a file, for example `job.py`.

```
python3 -m venv ~/raycli && source ~/raycli/bin/activate
pip install "ray[default]==2.49.2"

# Submit your Ray job, supplying all Python dependencies that were added to your Glue job
ray job submit --address http://127.0.0.1:8265 --working-dir . \
  --runtime-env-json '{
    "pip": ["boto3==1.28.*","pyarrow==12.*","pandas==2.0.*"]
  }' \
  -- python job.py
```

The job can be monitored on the Ray dashboard.
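
As an alternative to submitting with the Ray jobs CLI, KubeRay can manage the job lifecycle for you through a RayJob custom resource, where clusterSelector targets an existing RayCluster. A sketch assuming the glue-like cluster and ray namespace from above; the entrypoint, bucket path, and pip pins are illustrative:

```
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: glue-like-job
  namespace: ray
spec:
  entrypoint: python job.py
  runtimeEnvYAML: |
    working_dir: "s3://YOUR-BUCKET/job.zip"
    pip:
    - "boto3==1.28.*"
  clusterSelector:
    ray.io/cluster: glue-like
```

After applying the manifest with kubectl apply -n $NS -f rayjob.yaml, you can track progress with kubectl -n $NS get rayjob glue-like-job, or on the Ray dashboard as before.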