

# Service jobs in AWS Batch
<a name="service-jobs"></a>

AWS Batch service jobs enable you to submit requests to AWS services through AWS Batch job queues. Currently, AWS Batch supports SageMaker Training jobs as service jobs. Unlike containerized jobs where AWS Batch manages the underlying container execution, service jobs allow AWS Batch to provide job scheduling and queuing capabilities while the target AWS service (such as SageMaker AI) handles the actual job execution.

AWS Batch for SageMaker Training jobs allows data scientists to submit training jobs with priorities to configurable queues, ensuring workloads run without intervention as soon as resources are available. This capability addresses common challenges such as resource coordination, preventing accidental overspending, meeting budget constraints, optimizing costs with reserved instances, and eliminating the need for manual coordination between team members.

Service jobs differ from containerized jobs in several key ways:
+ **Job submission**: Service jobs must be submitted using the [SubmitServiceJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitServiceJob.html) API. Service jobs cannot be submitted through the AWS Batch console.
+ **Job execution**: AWS Batch schedules and queues service jobs, but the target AWS service runs the actual job workload. 
+ **Resource identifiers**: Service jobs use ARNs that contain "service-job" instead of "job" to distinguish them from containerized jobs.

To get started with AWS Batch service jobs for SageMaker Training, see [Getting started with AWS Batch on SageMaker AI](getting-started-sagemaker.md).

**Topics**
+ [Service job payloads in AWS Batch](service-job-payload.md)
+ [Submit a service job in AWS Batch](service-job-submit.md)
+ [Mapping AWS Batch service job status to SageMaker AI status](service-job-status.md)
+ [Service job retry strategies in AWS Batch](service-job-retries.md)
+ [Monitor service jobs in an AWS Batch queue](monitor-sagemaker-job-queue.md)
+ [Terminate service jobs](terminate-service-jobs.md)

# Service job payloads in AWS Batch
<a name="service-job-payload"></a>

When you submit service jobs using [SubmitServiceJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitServiceJob.html), you provide two key parameters that define the job: `serviceJobType` and `serviceRequestPayload`.
+ The `serviceJobType` specifies which AWS service will execute the job. For SageMaker Training jobs, this value is `SAGEMAKER_TRAINING`.
+ The `serviceRequestPayload` is a JSON-encoded string that contains the complete request that would normally be sent directly to the target service. For SageMaker Training jobs, this payload contains the same parameters you would use with the SageMaker AI [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API.

For a complete list of all available parameters and their descriptions, see the SageMaker AI [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API reference. All parameters supported by `CreateTrainingJob` can be included in your service job payload.

For examples of more training job configurations, see [APIs, CLI, and SDKs](https://docs.aws.amazon.com/sagemaker/latest/dg/api-and-sdk-reference-overview.html) in the [SageMaker AI Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html).

We recommend using the SageMaker Python SDK for service job creation because it provides helper classes and utilities. For an example of using the SageMaker Python SDK, see [SageMaker AI examples](https://github.com/aws/amazon-sagemaker-examples) on GitHub.

## Example service job payload
<a name="service-job-payload-example"></a>

The following example shows a simple service job payload for a SageMaker Training job that runs a "hello world" training script. This payload is passed as a JSON string to the `serviceRequestPayload` parameter when calling `SubmitServiceJob`.

```
{
  "TrainingJobName": "my-simple-training-job",
  "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
  "AlgorithmSpecification": {
    "TrainingInputMode": "File",
    "TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310",
    "ContainerEntrypoint": [
      "echo",
      "hello world"
    ]
  },
  "ResourceConfig": {
    "InstanceType": "ml.c5.xlarge",
    "InstanceCount": 1,
    "VolumeSizeInGB": 1
  },
  "OutputDataConfig": {
    "S3OutputPath": "s3://your-output-bucket/output"
  },
  "StoppingCondition": {
    "MaxRuntimeInSeconds": 30
  }
}
```
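Because `serviceRequestPayload` must be a single JSON-encoded string rather than a nested object, one convenient approach (a minimal sketch in Python; the dictionary keys mirror the example above) is to assemble the request as a dictionary and serialize it with `json.dumps`:

```python
import json

# Build the CreateTrainingJob request as a plain dictionary.
payload = {
    "TrainingJobName": "my-simple-training-job",
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "AlgorithmSpecification": {
        "TrainingInputMode": "File",
        "TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310",
        "ContainerEntrypoint": ["echo", "hello world"],
    },
    "ResourceConfig": {
        "InstanceType": "ml.c5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 1,
    },
    "OutputDataConfig": {"S3OutputPath": "s3://your-output-bucket/output"},
    "StoppingCondition": {"MaxRuntimeInSeconds": 30},
}

# serviceRequestPayload expects one JSON string, not a nested JSON object.
service_request_payload = json.dumps(payload)
```

Serializing from a dictionary avoids hand-escaping quotes inside the JSON string.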

# Submit a service job in AWS Batch
<a name="service-job-submit"></a>

To submit service jobs to AWS Batch, you use the [SubmitServiceJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitServiceJob.html) API. You can submit jobs using the AWS CLI or SDK.

If you don't already have an execution role, you must create one before you can submit your service job. To create the SageMaker AI execution role, see [How to use SageMaker AI execution roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) in the *[SageMaker AI Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html)*.

## Service job submission workflow
<a name="service-job-submit-workflow"></a>

When you submit a service job, AWS Batch follows this workflow:

1. AWS Batch receives your [SubmitServiceJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitServiceJob.html) request and validates the AWS Batch-specific parameters. The `serviceRequestPayload` is passed through without validation.

1. The job enters the `SUBMITTED` state and is placed in the specified job queue.

1. AWS Batch evaluates whether the service environment has available capacity for `RUNNABLE` jobs at the front of the queue.

1. If capacity is available, the job moves to `SCHEDULED` and is passed to SageMaker AI.

1. When capacity has been acquired and SageMaker AI has downloaded the service job data, the job begins initialization and moves to `STARTING`.

1. When SageMaker AI starts to execute the job, its status changes to `RUNNING`.

1. While SageMaker AI executes the job, AWS Batch monitors its progress and maps service states to AWS Batch job states. For details about how service job states are mapped, see [Mapping AWS Batch service job status to SageMaker AI status](service-job-status.md).

1. When the service job completes, it moves to `SUCCEEDED` and any output is ready to download.
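The success-path progression described above can be captured as an ordered sequence (a sketch; the names match the AWS Batch job states, and failure transitions are not modeled):

```python
# Success-path state progression for an AWS Batch service job, in order.
SERVICE_JOB_LIFECYCLE = [
    "SUBMITTED",  # request validated, job placed in the queue
    "RUNNABLE",   # waiting for capacity in the service environment
    "SCHEDULED",  # handed off to SageMaker AI
    "STARTING",   # capacity acquired, data and images downloading
    "RUNNING",    # SageMaker AI executing the training job
    "SUCCEEDED",  # training complete, output ready to download
]

def comes_before(state_a: str, state_b: str) -> bool:
    """Return True if state_a precedes state_b on the success path."""
    return SERVICE_JOB_LIFECYCLE.index(state_a) < SERVICE_JOB_LIFECYCLE.index(state_b)
```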

## Prerequisites
<a name="service-job-submit-prerequisites"></a>

Before submitting a service job, ensure you have:
+ **Service environment** – A service environment that defines capacity limits. For more information, see [Create a service environment in AWS Batch](create-service-environments.md).
+ **SageMaker job queue** – A SageMaker job queue to provide job scheduling. For more information, see [Create a SageMaker Training job queue in AWS Batch](create-sagemaker-job-queue.md).
+ **IAM permissions** – Permissions to create and manage AWS Batch job queues and service environments. For more information, see [AWS Batch IAM policies, roles, and permissions](IAM_policies.md).

## Submit a service job
<a name="service-job-submit-example"></a>

The following examples show how to submit a service job using either the SageMaker Python SDK or the AWS CLI:

------
#### [ Submit using the SageMaker Python SDK ]

The [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v3-examples/training-examples/aws_batch/sm-training-queues_getting_started_with_model_trainer.html) has built-in support for submitting jobs to AWS Batch. The following examples show how to create a model trainer, create a training queue, and submit a job. For a complete example, see the [full sample notebook](https://github.com/aws/sagemaker-python-sdk/blob/master/v3-examples/training-examples/aws_batch/sm-training-queues_getting_started_with_model_trainer.ipynb) on GitHub.

Create a `ModelTrainer` that defines the training job configuration.

```
from sagemaker.train.model_trainer import ModelTrainer
from sagemaker.train.configs import SourceCode, Compute, StoppingCondition

source_code = SourceCode(command="echo 'Hello World'")

model_trainer = ModelTrainer(
    training_image="123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.5-gpu-py311",
    source_code=source_code,
    base_job_name="my-training-job",
    compute=Compute(instance_type="ml.g5.xlarge", instance_count=1),
    stopping_condition=StoppingCondition(max_runtime_in_seconds=300),
)
```

Create a `TrainingQueue` object that references your job queue by name.

```
from sagemaker.train.aws_batch.training_queue import TrainingQueue

queue = TrainingQueue("my-sagemaker-job-queue")
```

Submit a job by calling `queue.submit`.

```
job = queue.submit(
    training_job=model_trainer,
    inputs=None,
)
```

------
#### [ Submit using the AWS CLI ]

The following shows how to submit a service job using the AWS CLI:

```
aws batch submit-service-job \
    --job-name "my-sagemaker-training-job" \
    --job-queue "my-sagemaker-job-queue" \
    --service-job-type "SAGEMAKER_TRAINING" \
    --service-request-payload '{"TrainingJobName": "sagemaker-training-job-example", "AlgorithmSpecification": {"TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.0-cpu-py3", "TrainingInputMode": "File", "ContainerEntrypoint": ["sleep", "1"]}, "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole", "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/model-output/"}, "ResourceConfig": {"InstanceType": "ml.m5.large", "InstanceCount": 1, "VolumeSizeInGB": 1}}' \
    --client-token "unique-token-12345"
```

For more information about the `serviceRequestPayload` parameters, see [Service job payloads in AWS Batch](service-job-payload.md).

------
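For scripting, the same request can be assembled in Python. The sketch below only builds the request arguments; it assumes your AWS SDK version exposes the `submit_service_job` operation (for example, as `boto3.client("batch").submit_service_job(**submit_args)`), and the names and ARNs are placeholders:

```python
import json

# The CreateTrainingJob request that SageMaker AI will execute.
training_payload = {
    "TrainingJobName": "sagemaker-training-job-example",
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.0-cpu-py3",
        "TrainingInputMode": "File",
        "ContainerEntrypoint": ["sleep", "1"],
    },
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/model-output/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1,
        "VolumeSizeInGB": 1,
    },
}

# Arguments for the SubmitServiceJob call; the payload is JSON-encoded here.
submit_args = {
    "jobName": "my-sagemaker-training-job",
    "jobQueue": "my-sagemaker-job-queue",
    "serviceJobType": "SAGEMAKER_TRAINING",
    "serviceRequestPayload": json.dumps(training_payload),
    "clientToken": "unique-token-12345",
}
```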

# Mapping AWS Batch service job status to SageMaker AI status
<a name="service-job-status"></a>

When you submit jobs to a SageMaker job queue using [SubmitServiceJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitServiceJob.html), AWS Batch manages the job lifecycle and maps AWS Batch [job states](job_states.md) to equivalent SageMaker Training job states. Service jobs, such as SageMaker Training jobs, follow a different state lifecycle than traditional container jobs. While service jobs share most states with container jobs, they introduce the `SCHEDULED` state and exhibit different retry behaviors, particularly for handling insufficient capacity errors from the target service.

The following table shows the AWS Batch job state and the corresponding SageMaker Status/SecondaryStatus:


| Batch Status | SageMaker AI Primary Status | SageMaker AI Secondary Status | Description | 
| --- | --- | --- | --- | 
| SUBMITTED | N/A | N/A | Job submitted to queue, waiting for scheduler evaluation.  | 
| RUNNABLE | N/A | N/A | Job is queued and ready for scheduling. Jobs in this state are started as soon as sufficient resources are available in the service environment. Jobs can remain in this state indefinitely when sufficient resources are unavailable. | 
| SCHEDULED | InProgress | Pending | Service job successfully submitted to SageMaker AI. | 
| STARTING | InProgress | Downloading | SageMaker Training job downloading data and images. Training job capacity has been acquired and job initialization begins. | 
| RUNNING | InProgress | Training | SageMaker Training job executing algorithm  | 
| RUNNING | InProgress | Uploading | SageMaker Training job uploading output artifacts after training completion | 
| SUCCEEDED | Completed | Completed | SageMaker Training job completed successfully. Output artifacts finished uploading. | 
| FAILED | Failed | Failed | SageMaker Training job encountered an unrecoverable error. | 
| FAILED | Stopped | Stopped | SageMaker Training job was manually stopped using StopTrainingJob. | 

# Service job retry strategies in AWS Batch
<a name="service-job-retries"></a>

Service job retry strategies allow AWS Batch to automatically retry failed service jobs under specific conditions.

Service jobs may require multiple attempts for several reasons:
+ **Temporary service issues**: Internal service errors, throttling, or temporary outages can cause jobs to fail during submission or execution.
+ **Training initialization failures**: Issues during job startup, such as image pulling problems or initialization errors, may be resolved on retry.

By configuring appropriate retry strategies, you can improve job success rates and reduce the need for manual intervention, especially for long-running training workloads.

**Note**  
Service jobs automatically retry certain types of failures, such as insufficient capacity errors, without consuming your configured retry attempts. Your retry strategy primarily handles other types of failures such as algorithm errors or service issues.

## Configuring retry strategies
<a name="configuring-service-job-retries"></a>

Service job retry strategies are configured using [ServiceJobRetryStrategy](https://docs.aws.amazon.com/batch/latest/APIReference/API_ServiceJobRetryStrategy.html), which supports both simple retry counts and conditional retry logic.

### Retry configuration
<a name="basic-retry-configuration"></a>

The simplest retry strategy specifies the number of retry attempts that should be made if a service job fails:

```
{
  "retryStrategy": {
    "attempts": 3
  }
}
```

This configuration allows the service job to be attempted up to 3 times in total.

**Important**  
The `attempts` value represents the total number of times the job can be placed in the `RUNNABLE` state, including the initial attempt. A value of 3 means the job will be attempted once initially, then retried up to 2 additional times if it fails.

### Retry configuration with evaluateOnExit
<a name="advanced-retry-configuration"></a>

You can use the `evaluateOnExit` parameter to specify conditions under which jobs should be retried or allowed to fail. This is useful when different types of failures require different handling.

The `evaluateOnExit` array can contain up to 5 retry strategies, each specifying an action (`RETRY` or `EXIT`) and conditions based on status reasons:

```
{
  "retryStrategy": {
    "attempts": 5,
    "evaluateOnExit": [
      {
        "action": "RETRY",
        "onStatusReason": "Received status from SageMaker: InternalServerError*"
      },
      {
        "action": "EXIT",
        "onStatusReason": "Received status from SageMaker: ValidationException*"
      },
      {
        "action": "EXIT",
        "onStatusReason": "*"
      }
    ]
  }
}
```

This configuration:
+ Retries jobs that fail due to SageMaker AI internal server errors
+ Immediately fails jobs that encounter validation exceptions (client errors that won't be resolved by retry)
+ Includes a catch-all rule to exit for any other failure types

#### Status reason pattern matching
<a name="status-reason-patterns"></a>

The `onStatusReason` parameter supports pattern matching with up to 512 characters. Patterns can use wildcards (`*`) and match against status reasons returned by SageMaker AI.

For service jobs, status messages from SageMaker AI are prefixed with "Received status from SageMaker: " to distinguish them from AWS Batch-generated messages. Common patterns include:
+ `Received status from SageMaker: InternalServerError*` - Match internal service errors
+ `Received status from SageMaker: ValidationException*` - Match client validation errors
+ `Received status from SageMaker: ResourceLimitExceeded*` - Match resource limit errors
+ `*CapacityError*` - Match capacity-related failures

**Tip**  
Use specific pattern matching to handle different error types appropriately. For example, retry internal server errors but immediately fail on validation errors that indicate problems with job parameters.
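As a sketch of how rules like these behave, the following evaluator applies shell-style wildcard matching (an approximation of the service's matching) and assumes rules are checked in order with the first match winning:

```python
from fnmatch import fnmatchcase
from typing import Optional

def evaluate_on_exit(status_reason: str, rules: list) -> Optional[str]:
    """Return the action ('RETRY' or 'EXIT') of the first rule whose
    onStatusReason pattern matches, or None if no rule matches
    (in which case the service's default behavior applies)."""
    for rule in rules:
        if fnmatchcase(status_reason, rule["onStatusReason"]):
            return rule["action"]
    return None

# The rules from the example configuration above.
rules = [
    {"action": "RETRY", "onStatusReason": "Received status from SageMaker: InternalServerError*"},
    {"action": "EXIT", "onStatusReason": "Received status from SageMaker: ValidationException*"},
    {"action": "EXIT", "onStatusReason": "*"},
]
```

With these rules, an internal server error is retried, a validation exception exits immediately, and the catch-all `*` rule exits on anything else.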

# Monitor service jobs in an AWS Batch queue
<a name="monitor-sagemaker-job-queue"></a>

You can monitor the status of jobs in your SageMaker Training job queue using the `list-service-jobs` and `get-job-queue-snapshot` AWS CLI commands.

View running jobs in your queue:

```
aws batch list-service-jobs \
  --job-queue my-sm-training-fifo-jq \
  --job-status RUNNING
```

View jobs waiting in the queue:

```
aws batch list-service-jobs \
  --job-queue my-sm-training-fifo-jq \
  --job-status RUNNABLE
```

View jobs that have been submitted to SageMaker but not yet running:

```
aws batch list-service-jobs \
  --job-queue my-sm-training-fifo-jq \
  --job-status SCHEDULED
```

Get a snapshot of jobs at the front of your queue:

```
aws batch get-job-queue-snapshot --job-queue my-sm-training-fifo-jq
```

This command shows the order of upcoming service jobs in your queue.

## Get detailed service job information
<a name="describe-service-job"></a>

Use the [DescribeServiceJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_DescribeServiceJob.html) operation to get comprehensive information about a specific service job, including its current status, service resource identifiers, and detailed attempt information.

View detailed information about a specific job:

```
aws batch describe-service-job \
  --job-id a4d6c728-8ee8-4c65-8e2a-9a5e8f4b7c3d
```

This command returns comprehensive information about the job, including:
+ Job ARN and current status
+ Service resource identifiers (such as SageMaker Training job ARN)
+ Scheduling priority and retry configuration
+ Service request payload containing the original service parameters
+ Detailed attempt information with start and stop times
+ Status messages from the target service

## Monitor SageMaker Training jobs
<a name="monitor-sagemaker-training-jobs"></a>

When monitoring SageMaker Training jobs through AWS Batch, you can access both AWS Batch job information and the underlying SageMaker Training job details.

The service resource identifier in the job details contains the SageMaker Training job ARN:

```
{
  "latestAttempt": {
    "serviceResourceId": {
      "name": "TrainingJobArn",
      "value": "arn:aws:sagemaker:us-east-1:123456789012:training-job/my-training-job"
    }
  }
}
```
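The training job name needed for a direct SageMaker AI call is the final path segment of that ARN. A small helper (a sketch; the sample `latestAttempt` structure mirrors the JSON above) can extract it:

```python
def training_job_name_from_attempt(latest_attempt: dict) -> str:
    """Extract the SageMaker Training job name from a service job's
    latest attempt, as returned by describe-service-job."""
    resource = latest_attempt["serviceResourceId"]
    if resource["name"] != "TrainingJobArn":
        raise ValueError(f"unexpected resource identifier: {resource['name']}")
    # The job name is the final segment of the training-job ARN.
    return resource["value"].rsplit("/", 1)[-1]

latest_attempt = {
    "serviceResourceId": {
        "name": "TrainingJobArn",
        "value": "arn:aws:sagemaker:us-east-1:123456789012:training-job/my-training-job",
    }
}
```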

You can use this ARN to get additional details directly from SageMaker:

```
aws sagemaker describe-training-job \
  --training-job-name my-training-job
```

Monitor job progress by checking both AWS Batch status and SageMaker Training job status. The AWS Batch job status shows the overall job lifecycle, while the SageMaker Training job status provides service-specific details about the training process.

# Terminate service jobs
<a name="terminate-service-jobs"></a>

Use the [TerminateServiceJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_TerminateServiceJob.html) operation to stop a running service job.

Terminate a specific service job:

```
aws batch terminate-service-job \
  --job-id a4d6c728-8ee8-4c65-8e2a-9a5e8f4b7c3d \
  --reason "Job terminated by user request"
```

When you terminate a service job, AWS Batch stops the job and notifies the target service. For SageMaker Training jobs, this will stop the training job in SageMaker AI as well.