

# Multi-node parallel jobs on Amazon EKS
<a name="mnp-eks-jobs"></a>

You can use AWS Batch on Amazon Elastic Kubernetes Service to run multi-node parallel (MNP) jobs (also known as *gang scheduling*) on your managed Kubernetes clusters. This option is commonly used for large, tightly-coupled, high-performance jobs that can’t be run on a single Amazon Elastic Compute Cloud instance. For more information, see [Multi-node parallel jobs](multi-node-parallel-jobs.md).

You can use this feature to run high-performance computing applications, large language model training, and other artificial intelligence (AI) and machine learning (ML) jobs on Amazon EKS managed Kubernetes clusters.

**Topics**
+ [Running MNP jobs](mnp-eks-running-mnp-jobs.md)
+ [Create an Amazon EKS MNP job definition](mnp-eks-create-eks-mnp-job-definition.md)
+ [Submit an Amazon EKS MNP job](mnp-eks-submit-eks-mnp-job.md)
+ [Override an Amazon EKS MNP job definition](mnp-eks-override-eks-mnp-job-definition.md)

# Running MNP jobs
<a name="mnp-eks-running-mnp-jobs"></a>

AWS Batch supports MNP jobs on Amazon Elastic Container Service and Amazon EKS using Amazon EC2. The following sections describe the instance and container parameters for this feature.

## Instance quotas for MNP on Amazon EKS
<a name="mnp-eks-instance-quotas"></a>
+ Up to 1000 instances can be used for a single MNP job.
+ Up to 5000 instances can join a single Amazon EKS cluster.
+ Up to 5 compute environments can be clustered and attached to a job queue.

For example, you can scale up to 5 clustered compute environments in a job queue and 1000 instances in each compute environment.

In addition to these instance quotas, note that you can’t use Fargate for MNP jobs on either Amazon ECS or Amazon EKS.

You can use only one instance type in each MNP job. You can change the instance type by updating the compute environment or by defining a new compute environment. You can also specify the instance type and provide vCPU and memory requirements when you create the job definition.

## Container quotas for MNP on Amazon EKS
<a name="mnp-eks-container-quotas"></a>
+ A multi-node parallel job supports one pod per node.
+ Up to 10 containers (or 10 init containers) in each pod. For more information, see [Init Containers](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) in the *Kubernetes documentation*.
+ Up to 5 node ranges in each MNP job.
+ Up to 10 distinct container images in each node range.

For example, a single MNP job with 1000 nodes and 5 node ranges can run up to 10,000 containers (1000 pods with 10 containers each) and use up to 50 unique images (5 node ranges with 10 images each).
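The arithmetic behind these limits can be sketched as a small local pre-check. This is only an illustration: `check_mnp_quotas` is a hypothetical helper, not part of AWS Batch, and it assumes the limit of 10 applies separately to the `containers` and `initContainers` lists. AWS Batch performs its own validation when you register a job definition.

```python
# Quota values taken from the two quota lists in this section.
MAX_NODES_PER_JOB = 1000         # instances per MNP job (one pod per node)
MAX_CONTAINERS_PER_POD = 10      # assumed to apply to each container list separately
MAX_NODE_RANGES_PER_JOB = 5
MAX_IMAGES_PER_NODE_RANGE = 10


def check_mnp_quotas(job_definition):
    """Return a list of quota problems found in a job definition dict.

    Hypothetical local pre-check; it does not call AWS Batch.
    """
    problems = []
    node_props = job_definition.get("nodeProperties", {})

    if node_props.get("numNodes", 0) > MAX_NODES_PER_JOB:
        problems.append("numNodes exceeds %d" % MAX_NODES_PER_JOB)

    node_ranges = node_props.get("nodeRangeProperties", [])
    if len(node_ranges) > MAX_NODE_RANGES_PER_JOB:
        problems.append("more than %d node ranges" % MAX_NODE_RANGES_PER_JOB)

    for node_range in node_ranges:
        target = node_range.get("targetNodes", "?")
        pod = node_range.get("eksProperties", {}).get("podProperties", {})
        containers = pod.get("containers", [])
        init_containers = pod.get("initContainers", [])

        if len(containers) > MAX_CONTAINERS_PER_POD:
            problems.append("range %s: too many containers" % target)
        if len(init_containers) > MAX_CONTAINERS_PER_POD:
            problems.append("range %s: too many init containers" % target)

        # Count distinct images across both container lists in this node range.
        images = {c["image"] for c in containers + init_containers if "image" in c}
        if len(images) > MAX_IMAGES_PER_NODE_RANGE:
            problems.append("range %s: too many distinct images" % target)

    return problems


# Worst-case totals implied by the quotas.
max_containers = MAX_NODES_PER_JOB * MAX_CONTAINERS_PER_POD        # 1000 * 10
max_images = MAX_NODE_RANGES_PER_JOB * MAX_IMAGES_PER_NODE_RANGE   # 5 * 10
```

An empty list from `check_mnp_quotas` means that none of the quotas modeled here were exceeded. It doesn't guarantee that AWS Batch accepts the job definition.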

## Running MNP jobs in a private Amazon VPC and an Amazon EKS cluster
<a name="mnp-eks-running-mnp-jobs-vpc"></a>

MNP jobs can run on any Amazon EKS cluster, whether or not it has public internet access. When you use an Amazon EKS cluster with only private network access, make sure that AWS Batch can access the Amazon EKS control plane and the managed Kubernetes API server. You can grant the necessary access through Amazon Virtual Private Cloud endpoints. For more information, see [Configure an endpoint service](https://docs.aws.amazon.com/vpc/latest/privatelink/configure-endpoint-service.html).

In a private VPC without internet access, Amazon EKS cluster Pods can’t download images from a public source. Your Amazon EKS cluster must pull images from a container registry that’s within your Amazon VPC. You can create an [Amazon Elastic Container Registry (Amazon ECR)](https://docs.aws.amazon.com/AmazonECR/latest/userguide/Registries.html) in your Amazon VPC and copy container images to it so that your nodes can access them.

You can also create a pull through cache rule with Amazon ECR. After a pull through cache rule is created for an external public registry, you can pull an image from that external public registry using your Amazon ECR private registry URI. Amazon ECR then creates a repository and caches the image. When a cached image is pulled using the Amazon ECR private registry URI, Amazon ECR checks the remote registry for a new version of the image and updates your private registry at most once every 24 hours. For more information, see [Creating a pull through cache rule in Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache-creating-rule.html).
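As a rough illustration of how an image reference changes when it goes through a pull through cache rule, the following sketch rewrites an upstream image name into an Amazon ECR private registry URI. The helper name, the account ID `111122223333`, the Region, and the repository prefix `ecr-public` are assumptions for this example. The prefix is whatever you chose when you created the rule.

```python
def pull_through_cache_uri(upstream_image, upstream_registry,
                           account_id, region, prefix):
    """Rewrite an upstream image reference to an ECR private registry URI.

    Hypothetical helper: it only builds the string and does not call Amazon ECR.
    """
    registry = upstream_registry.rstrip("/") + "/"
    if not upstream_image.startswith(registry):
        raise ValueError("image is not hosted on %s" % upstream_registry)

    # Keep the repository path and tag, drop the upstream registry host.
    path = upstream_image[len(registry):]
    return "%s.dkr.ecr.%s.amazonaws.com/%s/%s" % (account_id, region, prefix, path)


# Example values (assumed): the image used in this guide and a rule
# created with the repository prefix "ecr-public".
cached_image = pull_through_cache_uri(
    upstream_image="public.ecr.aws/amazonlinux/amazonlinux:2",
    upstream_registry="public.ecr.aws",
    account_id="111122223333",
    region="us-east-1",
    prefix="ecr-public",
)
print(cached_image)
```

You would then use the resulting URI as the `image` value in the job definition in place of the public image reference.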

## Error notification
<a name="mnp-eks-error-notificaton"></a>

If your MNP jobs are blocked, you can receive notifications through the AWS Management Console and Amazon EventBridge. For example, if an MNP job is stuck at the head of the queue, you can be notified about the issue and its cause so that you can take prompt action to unblock your job queue. Optionally, you can automatically terminate the MNP job if no action is taken within a specified amount of time, which can be defined in the job queue template. For more information, see [Job queue blocked events](batch-job-queue-blocked-events.md).

# Create an Amazon EKS MNP job definition
<a name="mnp-eks-create-eks-mnp-job-definition"></a>

To define and run MNP jobs on Amazon EKS, there are new parameters within the [RegisterJobDefinition](https://docs.aws.amazon.com/batch/latest/APIReference/API_RegisterJobDefinition.html) and [SubmitJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitJob.html) API operations.
+ Use [EksProperties](https://docs.aws.amazon.com/batch/latest/APIReference/API_EksProperties.html) under the [NodeProperties](https://docs.aws.amazon.com/batch/latest/APIReference/API_NodeProperties.html) section to define your MNP job definition.
+ Use [EksPropertiesOverride](https://docs.aws.amazon.com/batch/latest/APIReference/API_EksPropertiesOverride.html) under the [NodePropertyOverride](https://docs.aws.amazon.com//batch/latest/APIReference/API_NodePropertyOverride.html) section to override the parameters defined in the job definition when submitting an MNP job.

These actions can be defined through API operations and the AWS Management Console.

## Reference: Register the Amazon EKS MNP job definition request payload
<a name="mnp-eks-register-eks-mnp-job-definition"></a>

The following example illustrates how you can register an Amazon EKS MNP job definition with two nodes.

```
{
  "jobDefinitionName": "MyEksMnpJobDefinition",
  "type": "multinode",
  "nodeProperties": {
    "numNodes": 2,
    "mainNode": 0,
    "nodeRangeProperties": [
      {
        "targetNodes" : "0:",
        "eksProperties": {
          "podProperties": {
            "containers": [
              {
                "name": "test-eks-container-1",
                "image": "public.ecr.aws/amazonlinux/amazonlinux:2",
                "command": [
                  "sleep",
                  "60"
                ],
                "resources": {
                  "limits": {
                    "cpu": "1",
                    "memory": "1024Mi"
                  }
                },
                "securityContext": {
                  "runAsUser": 1000,
                  "runAsGroup": 3000,
                  "privileged": true,
                  "readOnlyRootFilesystem": true,
                  "runAsNonRoot": true
                }
              }
            ],
            "initContainers": [
              {
                "name": "init-ekscontainer",
                "image": "public.ecr.aws/amazonlinux/amazonlinux:2",
                "command": [
                  "echo",
                  "helloWorld"
                ],
                "resources": {
                  "limits": {
                    "cpu": "1",
                    "memory": "1024Mi"
                  }
                }
              }
            ],
            "metadata": {
              "labels": {
                "environment": "test"
              }
            }
          }
        }
      }
    ]
  }
}
```

To register the job definition using the AWS CLI, copy the definition to a local file named *MyEksMnpJobDefinition.json* and run the following command.

```
aws batch register-job-definition --cli-input-json file://MyEksMnpJobDefinition.json
```

You will receive the following JSON response.

```
{
    "jobDefinitionName": "MyEksMnpJobDefinition",
    "jobDefinitionArn": "arn:aws:batch:us-east-1:0123456789:job-definition/MyEksMnpJobDefinition:1",
    "revision": 1
}
```

# Submit an Amazon EKS MNP job
<a name="mnp-eks-submit-eks-mnp-job"></a>

To submit a job using the registered job definition, enter the following command. Replace the value of `<EKS_JOB_QUEUE_NAME>` with the name or ARN of a pre-existing job queue associated with an Amazon EKS compute environment.

```
aws batch submit-job --job-queue <EKS_JOB_QUEUE_NAME> \
    --job-definition MyEksMnpJobDefinition \
    --job-name myFirstEksMnpJob
```

You will receive the following JSON response.

```
{
    "jobArn": "arn:aws:batch:region:account:job/9b979cce-9da0-446d-90e2-ffa16d52af68",
    "jobName": "myFirstEksMnpJob", 
    "jobId": "<JOB_ID>"
}
```

You can check the status of the job using the returned `jobId` with the following command.

```
aws batch describe-jobs --jobs <JOB_ID>
```
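If you save the `describe-jobs` output to a file or capture it in a script, you can read the job status from the `jobs` list in the response. The following is a minimal sketch. The sample response is trimmed and illustrative rather than real output, and `job_statuses` and `is_finished` are hypothetical helper names.

```python
import json

# Job states after which AWS Batch does no further work on a job.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def job_statuses(describe_jobs_output):
    """Map each jobId to its status from describe-jobs JSON output."""
    response = json.loads(describe_jobs_output)
    return {job["jobId"]: job["status"] for job in response.get("jobs", [])}


def is_finished(status):
    """Return True when the job has reached a terminal state."""
    return status in TERMINAL_STATES


# Trimmed, illustrative response; a real response contains many more fields.
sample_output = json.dumps({
    "jobs": [
        {"jobId": "<JOB_ID>", "jobName": "myFirstEksMnpJob", "status": "RUNNABLE"}
    ]
})

for job_id, status in job_statuses(sample_output).items():
    print(job_id, status, "finished" if is_finished(status) else "in progress")
```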

# Override an Amazon EKS MNP job definition
<a name="mnp-eks-override-eks-mnp-job-definition"></a>

Optionally, you can override the job definition details (such as changing the MNP job size or child job details). The following example JSON request payload submits a five-node MNP job and changes the command of the `test-eks-container-1` container.

```
{
  "numNodes": 5,
  "nodePropertyOverrides": [
    {
      "targetNodes": "0:",
      "eksPropertiesOverride": {
        "podProperties": {
          "containers": [
            {
              "name": "test-eks-container-1",
              "command": [
                "sleep",
                "150"
              ]
            }
          ]
        }
      }
    }
  ]
}
```

To submit a job with these overrides, save the example to a local file, *eks-mnp-job-nodeoverride.json*, and use the AWS CLI to submit the job with the overrides.
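One way to script this step is to write the override payload to the file and assemble the `submit-job` command with the `--node-overrides` option pointing at it. The sketch below only builds and prints the command. The helper names are hypothetical, and the job queue, job definition, and job name reuse the placeholders from the earlier sections.

```python
import json
import shlex

OVERRIDES_FILE = "eks-mnp-job-nodeoverride.json"


def write_overrides(overrides, path=OVERRIDES_FILE):
    """Save the node override payload as JSON (hypothetical helper)."""
    with open(path, "w") as handle:
        json.dump(overrides, handle, indent=2)


def build_submit_command(job_queue, overrides_file=OVERRIDES_FILE):
    """Assemble the AWS CLI arguments for submitting the job with overrides."""
    return [
        "aws", "batch", "submit-job",
        "--job-queue", job_queue,
        "--job-definition", "MyEksMnpJobDefinition",
        "--job-name", "myFirstEksMnpJob",
        # file:// tells the AWS CLI to read the parameter value from a local file.
        "--node-overrides", "file://" + overrides_file,
    ]


# Replace the placeholder with the name or ARN of your job queue.
command = build_submit_command("<EKS_JOB_QUEUE_NAME>")
print(shlex.join(command))
```

Printing the command instead of running it lets you review it first. To run it from Python, you could pass the same argument list to `subprocess.run`.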