# Training plans utilization for SageMaker training jobs

You can use a SageMaker training plan for your training jobs by specifying the plan of your choice when creating a training job.

**Note**
The training plan must be in the `Scheduled` or `Active` status to be used by a training job.

If the required capacity is not immediately available for a training job, the job waits until the capacity becomes available, the `StoppingCondition` is met, or the job has been `Pending` for capacity for 2 days, whichever comes first. If the stopping condition is met, the job is stopped. If a job has been pending for 2 days, it is terminated with an `InsufficientCapacityError`.

**Important**
**Reserved Capacity termination process:** You have full access to all reserved instances until 30 minutes before the Reserved Capacity end time. When there are 30 minutes remaining in your Reserved Capacity, SageMaker training plans begin the process of terminating any running instances within that Reserved Capacity. To ensure you don't lose progress due to these terminations, we recommend checkpointing your training jobs.

## Checkpoint your training job

When using SageMaker training plans for your SageMaker training jobs, make sure to implement checkpointing in your training script. This allows you to save your training progress before a Reserved Capacity expires. Checkpointing is especially important when working with reserved capacities, because it enables you to resume training from the last saved point if your job is interrupted between two reserved capacities or when your training plan reaches its end date.

To achieve this, you can use the `SAGEMAKER_CURRENT_CAPACITY_BLOCK_EXPIRATION_TIMESTAMP` environment variable. This variable helps determine when to initiate the checkpointing process. By incorporating this logic into your training script, you ensure that your model's progress is saved before the reserved capacity is reclaimed.
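Resuming from the last saved point, as described above, requires that the saved state can be read back when the job restarts. The following is a minimal sketch of that save-and-resume step using only the Python standard library. The helper names (`save_checkpoint`, `load_checkpoint`, `resume_epoch`), the JSON state format, and the `state.json` file name are assumptions for illustration and not part of SageMaker; a real script would typically save model and optimizer state with its framework's own utilities.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(state: dict, checkpoint_dir: str, filename: str = "state.json") -> Path:
    """Write training state atomically (hypothetical helper, not a SageMaker API)."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / filename
    # Write to a temporary file in the same directory, then rename it over
    # the target so an interrupted write never leaves a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp_file:
        json.dump(state, tmp_file)
    os.replace(tmp_path, target)
    return target


def load_checkpoint(checkpoint_dir: str, filename: str = "state.json") -> Optional[dict]:
    """Return the saved state, or None when no checkpoint exists yet."""
    target = Path(checkpoint_dir) / filename
    if not target.exists():
        return None
    with target.open() as checkpoint_file:
        return json.load(checkpoint_file)


def resume_epoch(checkpoint_dir: str) -> int:
    """Return the epoch to start from: 0 on a fresh run, last saved epoch + 1 otherwise."""
    state = load_checkpoint(checkpoint_dir)
    return 0 if state is None else state["epoch"] + 1
```

In a training script, you would call `resume_epoch(checkpoint_dir)` once at startup and begin the loop at the returned epoch. The checkpoint directory is whatever local path your job syncs to durable storage; `/opt/ml/checkpoints` is a common choice in SageMaker training containers, but treat that as an assumption to confirm against your job's checkpoint configuration.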
Here's an example of how you can implement the expiration check around `SAGEMAKER_CURRENT_CAPACITY_BLOCK_EXPIRATION_TIMESTAMP` in your Python training script:

```
import os
import time
from datetime import datetime, timedelta, timezone

def is_close_to_expiration(threshold_minutes=30):
    # Retrieve the expiration timestamp from the environment variable
    expiration_time_str = os.environ.get('SAGEMAKER_CURRENT_CAPACITY_BLOCK_EXPIRATION_TIMESTAMP', '0')

    # If the timestamp is not set (default '0'), return False
    if expiration_time_str == '0':
        return False

    # Convert the timestamp string (in milliseconds) to a timezone-aware UTC datetime
    expiration_time = datetime.fromtimestamp(int(expiration_time_str) / 1000, tz=timezone.utc)

    # Calculate the time difference between now (also in UTC) and the expiration time.
    # Comparing in UTC avoids an offset error when the container's local time zone is not UTC.
    time_difference = expiration_time - datetime.now(timezone.utc)

    # Return True if we're within the threshold time of expiration
    return time_difference < timedelta(minutes=threshold_minutes)

def start_checkpointing():
    # Placeholder function for checkpointing logic
    print("Starting checkpointing process...")
    # TODO: Implement actual checkpointing logic here
    # For example:
    # - Save model state
    # - Save optimizer state
    # - Save current epoch and iteration numbers
    # - Save any other relevant training state

# Main training loop
num_epochs = 100
final_checkpointing_done = False

for epoch in range(num_epochs):
    # TODO: Replace this with your actual training code
    # For example:
    # - Load a batch of data
    # - Forward pass
    # - Calculate loss
    # - Backward pass
    # - Update model parameters

    # Check if we're close to capacity expiration and haven't done final checkpointing
    if not final_checkpointing_done and is_close_to_expiration():
        start_checkpointing()
        final_checkpointing_done = True

    # Simulate some training time (remove this in actual implementation)
    time.sleep(1)

print("Training completed.")
```

**Note**
Training job provisioning follows a First-In-First-Out (FIFO) order, but a smaller cluster job created later might be assigned capacity before a larger cluster job created earlier, if the larger job cannot be fulfilled.

SageMaker training managed warm pools are compatible with SageMaker training plans. To reuse the same cluster, provide an identical `TrainingPlanArn` value in subsequent `CreateTrainingJob` requests.

**Topics**
+ [Checkpoint your training job](#training-jobs-checkpointing)
+ [Create a training job using the SageMaker AI console](use-training-plan-for-training-jobs-using-console.md)
+ [Create a training job using the API, AWS CLI, SageMaker SDK](use-training-plan-for-training-jobs-using-api-cli-sdk.md)

# Create a training job using the SageMaker AI console

You can use SageMaker training plans for your training jobs through the SageMaker AI console. When you create a training job, the available plans are suggested to you if your instance choice and Region match them.

To create a training job using a training plan's reserved capacity in the SageMaker AI console:

1. Navigate to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Training**, and then **Training jobs**.

1. Choose the **Create training job** button.

1. When configuring the resources for your training job, look for the **Instance capacity** section. If there are plans available that match your chosen instance type and Region, they are displayed here. Select a training plan that aligns with your compute capacity needs. If no suitable plans are available, you can either adjust your instance type or Region, or proceed without using a training plan.

1. After selecting a training plan (or choosing to proceed without one), complete the rest of your training job configuration and choose **Create training job** to start the process.

![\[SageMaker AI console page for creating a new training job.
The page displays various configuration options including job settings, algorithm options, resource configuration, training plan selection, and stopping conditions.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-plans/tp-create-training-job.png)

Review and launch your job. Your job starts running as soon as the training plan becomes `Active` and capacity is available.

# Create a training job using the API, AWS CLI, SageMaker SDK

To use SageMaker training plans for your SageMaker training job, specify the `TrainingPlanArn` parameter of the desired plan in the `ResourceConfig` when calling the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API operation. You can use exactly one plan per job.

**Important**
The `InstanceType` field set in the `ResourceConfig` section of the `CreateTrainingJob` request must match the `InstanceType` of your training plan.

## Run a training job on a plan using the CLI

The following example demonstrates how to create a SageMaker training job and associate it with a provided training plan using the `TrainingPlanArn` attribute in the `create-training-job` AWS CLI command.

For more information about creating a training job with the AWS CLI, see [create-training-job](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-training-job.html) in the AWS CLI Command Reference. The command maps to the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API operation.

```
# Create a training job
aws sagemaker create-training-job \
  --training-job-name training-job-name \
  ...
  --resource-config '{
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 8,
        "VolumeSizeInGB": 10,
        "TrainingPlanArn": "training-plan-arn"
    }' \
  ...
```

The following AWS CLI example command creates a new training job in SageMaker AI, passing a training plan in the `--resource-config` argument.

```
aws sagemaker create-training-job \
  --training-job-name job-name \
  --role-arn arn:aws:iam::111122223333:role/DataAndAPIAccessRole \
  --algorithm-specification '{"TrainingInputMode": "File","TrainingImage": "111122223333.dkr.ecr.us-east-1.amazonaws.com/algo-image:tag", "ContainerArguments": [" "]}' \
  --input-data-config '[{"ChannelName":"training","DataSource":{"S3DataSource":{"S3DataType":"S3Prefix","S3Uri":"s3://bucketname/input","S3DataDistributionType":"ShardedByS3Key"}}}]' \
  --output-data-config '{"S3OutputPath": "s3://bucketname/output"}' \
  --resource-config '{"VolumeSizeInGB":10,"InstanceCount":4,"InstanceType":"ml.p5.48xlarge", "TrainingPlanArn" : "arn:aws:sagemaker:us-east-1:111122223333:training-plan/plan-name"}' \
  --stopping-condition '{"MaxRuntimeInSeconds": 1800}' \
  --region us-east-1
```

After creating the training job, you can verify that it was properly assigned to the training plan by calling the `DescribeTrainingJob` API.

```
aws sagemaker describe-training-job --training-job-name training-job-name
```

## Run a training job on a plan using the SageMaker AI Python SDK

Alternatively, you can create a training job associated with a training plan using the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html).

If you are using the SageMaker Python SDK from JupyterLab in Studio to create a training job, ensure that the execution role used by the space running your JupyterLab application has the required permissions to use SageMaker training plans. To learn about the required permissions to use SageMaker training plans, see [IAM for SageMaker training plans](training-plan-iam-permissions.md).
The following example demonstrates how to create a SageMaker training job and associate it with a provided training plan using the `training_plan` attribute in the `Estimator` object when using the SageMaker Python SDK.

For more information on the SageMaker Estimator, see [Use a SageMaker estimator to run a training job](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-adapt-your-own-private-registry-estimator.html).

```
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Set up the boto3 session, the SageMaker SDK session, and a low-level SageMaker client.
# The Estimator expects a sagemaker.Session, not a boto3 client.
boto_session = boto3.Session()
region = boto_session.region_name
sagemaker_session = sagemaker.Session(boto_session=boto_session)
sagemaker_client = boto_session.client('sagemaker')

# Get the execution role for the training job
role = get_execution_role()

# Placeholder name for the training job
job_name = 'training-job-name'

# Define the input data configuration
training_input = TrainingInput(
    s3_data='s3://input-path',
    distribution='ShardedByS3Key',
    s3_data_type='S3Prefix'
)

estimator = Estimator(
    entry_point='train.py',
    image_uri=f"123456789123.dkr.ecr.{region}.amazonaws.com/image:tag",
    role=role,
    instance_count=4,
    instance_type='ml.p5.48xlarge',
    training_plan="training-plan-arn",
    volume_size=20,
    max_run=3600,
    sagemaker_session=sagemaker_session,
    output_path="s3://output-path"
)

# Create the training job
estimator.fit(inputs=training_input, job_name=job_name)
```

After creating the training job, you can verify that it was properly assigned to the training plan by calling the `DescribeTrainingJob` API.

```
# Check job details
sagemaker_client.describe_training_job(TrainingJobName=job_name)
```
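The verification step can be made explicit by inspecting the response rather than reading it by eye. The sketch below assumes that the `DescribeTrainingJob` response echoes the plan under `ResourceConfig.TrainingPlanArn`, mirroring the request shape; the helper names `assigned_training_plan_arn` and `is_assigned_to_plan` are hypothetical and exist only for this illustration. The functions operate on a plain dictionary, so they can be checked without calling AWS.

```python
from typing import Optional


def assigned_training_plan_arn(describe_response: dict) -> Optional[str]:
    """Return the plan ARN recorded on the job, or None if the job has no plan.

    Assumes the ARN is reported under ResourceConfig.TrainingPlanArn.
    """
    return describe_response.get("ResourceConfig", {}).get("TrainingPlanArn")


def is_assigned_to_plan(describe_response: dict, expected_plan_arn: str) -> bool:
    """Check that the job is associated with the expected training plan."""
    return assigned_training_plan_arn(describe_response) == expected_plan_arn


# Trimmed, illustrative response; real responses contain many more fields.
sample_response = {
    "TrainingJobName": "training-job-name",
    "ResourceConfig": {
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 4,
        "VolumeSizeInGB": 20,
        "TrainingPlanArn": "arn:aws:sagemaker:us-east-1:111122223333:training-plan/plan-name",
    },
}

plan_arn = "arn:aws:sagemaker:us-east-1:111122223333:training-plan/plan-name"
print(is_assigned_to_plan(sample_response, plan_arn))
```

With a live job, you would pass the dictionary returned by a boto3 SageMaker client's `describe_training_job(TrainingJobName=...)` call in place of `sample_response`.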