Training job submission - Amazon SageMaker AI

Launching Training Jobs

After your agent is deployed and your dataset is in S3, create a training job using one of the following methods.

SageMaker AI Studio

  • Navigate to Models in the navigation pane and select JumpStart Base Models.

  • Select a model that supports multi-turn RL (see the supported models table) and choose Customize model, then Customize with UI.

  • Select Multi-Turn Reinforcement Learning as the customization technique.

  • Configure your agent environment — select your Bedrock AgentCore runtime or provide your Lambda forwarder ARN.

  • Provide your training dataset as an S3 URI or registered dataset.

  • Adjust hyperparameters as needed.

  • Review your configuration and choose Submit.

SageMaker AI Python SDK

Discover supported models

    from sagemaker.train.multi_turn_rl_trainer import MultiTurnRLTrainer

    supported_models = MultiTurnRLTrainer.list_supported_models()
    print(f"Supported MTRL models ({len(supported_models)}):")
    for m in supported_models:
        print(f" - {m}")

Set up your agent environment

Option 1: Bedrock AgentCore runtime

    # List available runtimes
    runtimes = MultiTurnRLTrainer.list_bedrock_agentcore_runtimes()
    for rt in runtimes:
        print(f" - {rt['name']} ({rt['status']}) → {rt['arn']}")

Option 2: Custom Lambda agent

    from sagemaker.train.agent_lambda import AgentLambda

    # Create from inline code
    adapter = AgentLambda.create(
        source='''
    import json

    def handler(event, context):
        prompt = event.get("prompt", "")
        return {"statusCode": 200, "body": json.dumps({"status": "ok", "agentResponse": prompt})}
    ''',
        role="arn:aws:iam::123456789012:role/AgentLambdaRole",
    )

    # Create from a local file
    adapter = AgentLambda.create(
        source="~/my_agent_handler.py",
        role="arn:aws:iam::123456789012:role/AgentLambdaRole",
    )

    # Create from S3
    adapter = AgentLambda.create(
        source="s3://my-bucket/agent_handler.py",
        role="arn:aws:iam::123456789012:role/AgentLambdaRole",
    )

    # Wrap an existing Lambda
    adapter = AgentLambda.get("arn:aws:lambda:us-west-2:123456789012:function:my-agent")

Register your dataset (optional)

    from sagemaker.ai_registry.dataset import DataSet

    dataset = DataSet.create(
        name="my-mtrl-dataset",
        source="s3://my-bucket/prompts/training_prompts.parquet"
    )
    print(f"Dataset ARN: {dataset.arn}")

Create Restricted Model Package Group for Nova (optional)

If you choose a Nova model (nova-textgeneration-lite-v2), you can optionally create the Restricted Model Package Groups before submitting a training job (next step). If you skip this step, the SDK creates them for you automatically.

A Restricted Model Package Group (RMPG) is a Model Package Group with ManagedStorageType: Restricted. It is required for closed-source models such as Nova, where the model weights are managed by AWS and are not directly accessible to the customer.

The RFT job schema requires two separate Restricted MPGs:

  • Output MPG — stores the final fine-tuned model package

  • Intermediate Checkpoint MPG — reserved for intermediate training checkpoints (must differ from the Output MPG)

    from sagemaker.core.resources import Job, ModelPackageGroup
    from sagemaker.core.shapes import ManagedConfiguration

    model_name = "nova-textgeneration-lite-v2"

    # Restricted configuration
    managed_config = ManagedConfiguration(managed_storage_type="Restricted")

    # Output Model package group
    output_mpg_name = f"{model_name}-mtrl-output-mpg"
    create_kwargs = {
        "model_package_group_name": output_mpg_name,
        "region": "us-east-1",
        "managed_configuration": managed_config
    }
    output_mpg = ModelPackageGroup.create(**create_kwargs)

    # Intermediate Model package group
    intermediate_mpg_name = f"{model_name}-mtrl-inter-mpg"
    create_kwargs = {
        "model_package_group_name": intermediate_mpg_name,
        "region": "us-east-1",
        "managed_configuration": managed_config
    }
    intermediate_mpg = ModelPackageGroup.create(**create_kwargs)

Once both Model Package Groups are created, pass them to the trainer when you submit the training job in the next step.

Submit a training job with Bedrock AgentCore

    from sagemaker.train.multi_turn_rl_trainer import MultiTurnRLTrainer

    trainer = MultiTurnRLTrainer(
        model="openai-reasoning-gpt-oss-20b",
        agent_env="arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my-agent-runtime",
        training_dataset="s3://my-bucket/prompts/prompts.parquet",
        mlflow_app_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-app/mlflow-app-id",
        s3_output_path="s3://my-bucket/output/",
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        accept_eula=True,
    )

    # View and adjust hyperparameters
    trainer.hyperparameters.get_info()
    trainer.hyperparameters.max_epochs = 1
    trainer.hyperparameters.global_batch_size = 32
    trainer.hyperparameters.max_steps = 12

    job = trainer.train(wait=True)
    print(f"Job: {job.job_name}")
    print(f"Status: {job.job_status}")
    print(f"Output Model Package: {job.output_model_package_arn}")

Submit a training job with a custom Lambda agent

    trainer = MultiTurnRLTrainer(
        model="openai-reasoning-gpt-oss-20b",
        agent_env=adapter,  # AgentLambda object or Lambda ARN string
        training_dataset="s3://my-bucket/prompts/prompts.parquet",
        mlflow_app_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-app/mlflow-app-id",
        s3_output_path="s3://my-bucket/output/",
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        accept_eula=True,
    )

    trainer.hyperparameters.max_epochs = 1
    trainer.hyperparameters.global_batch_size = 32
    trainer.hyperparameters.max_steps = 12

    job = trainer.train(wait=True)
    print(f"Job: {job.job_name}")
    print(f"Status: {job.job_status}")
    print(f"Output Model Package: {job.output_model_package_arn}")

Submit a training job with Restricted Model package group for Nova

See the earlier step, Create Restricted Model Package Group for Nova, for how to create the two Restricted Model Package Groups (output_mpg and intermediate_mpg) passed to the trainer here.

    from sagemaker.train.multi_turn_rl_trainer import MultiTurnRLTrainer

    trainer = MultiTurnRLTrainer(
        model="nova-textgeneration-lite-v2",
        agent_env="arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my-agent-runtime",
        training_dataset="s3://my-bucket/prompts/prompts.parquet",
        mlflow_app_arn="arn:aws:sagemaker:us-west-2:123456789012:mlflow-app/mlflow-app-id",
        s3_output_path="s3://my-bucket/output/",
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        accept_eula=True,
        output_model_package_group=output_mpg,
        intermediate_checkpoint_model_package_group=intermediate_mpg
    )

    # View and adjust hyperparameters
    trainer.hyperparameters.get_info()
    trainer.hyperparameters.max_epochs = 1
    trainer.hyperparameters.global_batch_size = 32
    trainer.hyperparameters.max_steps = 12

    job = trainer.train(wait=True)
    print(f"Job: {job.job_name}")
    print(f"Status: {job.job_status}")
    print(f"Output Model Package: {job.output_model_package_arn}")

AWS CLI

Create a training job using the CreateJob API. You specify the agent configuration, training data location, base model, and output settings in the JobConfigDocument.

To retrieve the full JobConfigDocument schema:

    aws sagemaker list-job-schema-versions --job-category AgentRFT
    aws sagemaker describe-job-schema-version --job-category AgentRFT --version "1.0.0"

Create job with Bedrock AgentCore

    aws sagemaker create-job \
      --job-category AgentRFT \
      --job-name "my-agent-rft-job" \
      --role-arn "arn:aws:iam::123456789012:role/SageMakerFineTuningJobRole" \
      --job-config-schema-version "1.0.0" \
      --job-config-document '{
        "AgentConfig": {
          "BedrockAgentCoreConfig": {
            "AgentRuntimeArn": "arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my-agent"
          }
        },
        "InputDataConfig": [...],
        "OutputDataConfig": {...},
        "ModelPackageConfig": {...},
        "TrainingConfig": {...}
      }' \
      --region us-west-2

Create job with custom Lambda agent

    aws sagemaker create-job \
      --job-category AgentRFT \
      --job-name "my-custom-agent-rft-job" \
      --role-arn "arn:aws:iam::account-id:role/SageMakerFineTuningJobRole" \
      --job-config-schema-version "1.0.0" \
      --job-config-document '{
        "AgentConfig": {
          "CustomAgentLambdaConfig": {
            "LambdaArn": "arn:aws:lambda:us-west-2:account-id:function:rft-agent-forwarder"
          }
        },
        "InputDataConfig": [...],
        "OutputDataConfig": {...},
        "ModelPackageConfig": {...},
        "TrainingConfig": {...}
      }' \
      --region us-west-2

boto3

Create job with Bedrock AgentCore

    import json
    import boto3

    sm = boto3.client("sagemaker")

    response = sm.create_job(
        JobName="my-agent-rft-job",
        RoleArn="arn:aws:iam::123456789012:role/SageMakerFineTuningJobRole",
        JobCategory="AgentRFT",
        JobConfigSchemaVersion="1.0.0",
        JobConfigDocument=json.dumps({
            "AgentConfig": {
                "BedrockAgentCoreConfig": {
                    "AgentRuntimeArn": "arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my-agent"
                }
            },
            "InputDataConfig": [...],
            "OutputDataConfig": {...},
            "ModelPackageConfig": {...},
            "TrainingConfig": {...}
        })
    )
    print(f"Job ARN: {response['JobArn']}")

Create job with custom Lambda agent

    import json
    import boto3

    sm = boto3.client("sagemaker")

    response = sm.create_job(
        JobName="my-custom-agent-rft-job",
        RoleArn="arn:aws:iam::account-id:role/SageMakerFineTuningJobRole",
        JobCategory="AgentRFT",
        JobConfigSchemaVersion="1.0.0",
        JobConfigDocument=json.dumps({
            "AgentConfig": {
                "CustomAgentLambdaConfig": {
                    "LambdaArn": "arn:aws:lambda:us-west-2:account-id:function:rft-agent-forwarder"
                }
            },
            "InputDataConfig": [...],
            "OutputDataConfig": {...},
            "ModelPackageConfig": {...},
            "TrainingConfig": {...}
        })
    )
    print(f"Job ARN: {response['JobArn']}")
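The two request shapes above differ only in the AgentConfig block, so a small helper can pick the right one from the ARN. The following is a minimal sketch, not part of the SageMaker AI SDK: the function names build_agent_config and build_job_config_document are hypothetical, the ARN prefixes assume the standard aws partition, and the remaining sections (InputDataConfig, OutputDataConfig, and so on) are passed in by the caller because their schema is not shown on this page.

```python
import json


def build_agent_config(agent_arn: str) -> dict:
    """Return the AgentConfig block matching the kind of ARN supplied.

    Hypothetical helper: the two shapes mirror the examples in this page.
    """
    if agent_arn.startswith("arn:aws:bedrock-agentcore:"):
        return {"BedrockAgentCoreConfig": {"AgentRuntimeArn": agent_arn}}
    if agent_arn.startswith("arn:aws:lambda:"):
        return {"CustomAgentLambdaConfig": {"LambdaArn": agent_arn}}
    raise ValueError(f"Unrecognized agent ARN: {agent_arn}")


def build_job_config_document(agent_arn: str, other_sections: dict) -> str:
    """Serialize a JobConfigDocument from an agent ARN plus caller-supplied sections."""
    document = {"AgentConfig": build_agent_config(agent_arn)}
    # The caller provides InputDataConfig, OutputDataConfig, etc. unchanged.
    document.update(other_sections)
    return json.dumps(document)
```

The returned string can be passed as JobConfigDocument in the create_job call shown above; retrieve the real schema with describe-job-schema-version before relying on any section names beyond those on this page.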

Monitoring Training

Monitor your Training Job

Use the DescribeJob API to check your job's current status at any time. The job status starts at InProgress and then moves to one of Completed, Failed, or Stopped.

    aws sagemaker describe-job \
      --job-name "my-agent-rft-job" \
      --job-category AgentRFT \
      --region us-west-2

Use the SDK:

    # Run without blocking
    job = trainer.train(wait=False)
    job.wait(poll=5, timeout=3000, max_log_lines=10)

    # Check status
    job.refresh()
    print(f"Status: {job.job_status}")
    print(f"Secondary Status: {job.secondary_status}")
    print(f"Output Model Package: {job.output_model_package_arn}")
    print(f"MLflow Details: {job.mlflow_details}")
    print(f"Billable Tokens: {job.billable_token_usage}")

    # Open MLflow tracking URL
    job.get_mlflow_url()

    # Stop a running job
    job.stop()

    # Attach to an existing job from a different session
    existing_job = MultiTurnRLTrainer.attach(job_name="my-existing-job-name")
    print(f"Status: {existing_job.job_status}")
    print(f"Output Model: {existing_job.output_model_package_arn}")

    # List all completed jobs
    from sagemaker.train.agent_rft_job import AgentRFTJob

    for j in AgentRFTJob.get_all(status_equals="Completed"):
        print(f"{j.job_name}: {j.job_status}")
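If you call DescribeJob directly rather than through the SDK, the status transitions described above lend themselves to a simple polling loop. The sketch below is an illustration under stated assumptions: wait_for_job is a hypothetical helper, the describe callable is supplied by you, and the response key "JobStatus" is an assumption that should be checked against the actual DescribeJob response.

```python
import time
from typing import Callable

# Terminal states named in this section.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(
    describe: Callable[[], dict],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    status_key: str = "JobStatus",  # assumed field name; verify against DescribeJob
) -> str:
    """Poll describe() until the job reaches a terminal status, then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()[status_key]
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
```

With boto3, describe could be a lambda wrapping the client's describe call for your job name and the AgentRFT job category; passing it in as a callable keeps the loop easy to test with canned responses.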

Monitor Training in MLflow

SageMaker AI integrates with managed MLflow to track your training job's progress, metrics, and artifacts. Once you include an MlflowConfig in your job's OutputDataConfig, metrics and traces are logged automatically:

    "OutputDataConfig": {
      "S3OutputPath": "s3://your-bucket/output/",
      "MlflowConfig": {
        "MlflowResourceArn": "arn:aws:sagemaker:us-west-2:123456789012:mlflow-app/my-rft-mlflow-app"
      }
    }

Prerequisites

  • Create a managed MLflow App in your account. For setup instructions, see MLflow App Setup.

  • Ensure your SageMaker AI execution role has permissions to write to the MLflow App (sagemaker-mlflow:* actions).

  • Include the MlflowResourceArn in your job configuration.

What gets logged

| # | Category | What's logged | Where in MLflow UI |
| --- | --- | --- | --- |
| 1 | Training metrics | Per-step counters, throughput, datum and token accounting, wall-clock duration of each phase of a step, trajectory-reward summary per rollout batch, and turn-count distribution per trajectory | Metrics tab (time-series charts) |
| 2 | Trajectory traces | Full multi-turn conversations with tool calls and rewards | Traces tab |

Detailed training metrics reference

The following metrics are logged at each training step.

Step counters and throughput (training/)

| Metric | Description |
| --- | --- |
| training/epoch | Current epoch number |
| training/global_step | Global training step counter |
| training/num_groups | Trajectory groups in this step |
| training/num_trajectories | Total trajectories processed in this step |
| training/total_tokens | Tokens summed across all micro-batches in this step |
| training/num_datums | Training datums formed from trajectories |
| training/datums_per_trajectory | Average datums emitted per trajectory |
| training/action_tokens_mean | Mean action (response) tokens per trajectory |
| training/obs_tokens_mean | Mean observation (prompt) tokens per trajectory |
| training/trainable_token_positions | Total trainable target positions in this step |
| training/nontrainable_token_positions | Total non-trainable target positions in this step |
| training/trainable_token_ratio | Ratio: trainable / (trainable + nontrainable) token positions |

Phase durations (timing_s/)

| Metric | Description |
| --- | --- |
| timing_s/step | Total time for the full step |
| timing_s/training | Time for forward/backward passes and optimizer step |
| timing_s/policy_update | Time saving updated weights for the sampler |
| timing_s/save_checkpoint | Time saving a checkpoint (only on checkpoint steps) |
| timing_s/eval | Time running evaluation (only on eval steps) |

Reward distribution (rollout/reward/)

| Metric | Description |
| --- | --- |
| rollout/reward/mean | Mean trajectory reward across all groups |
| rollout/reward/valid_mean | Mean reward over only the valid (non-zero-advantage) groups; equals mean when no filtering occurred |
| rollout/reward/std | Standard deviation of trajectory rewards |
| rollout/reward/min | Minimum trajectory reward |
| rollout/reward/max | Maximum trajectory reward |
| rollout/reward/zero_frac | Fraction of trajectories with total reward exactly 0.0 |

Turn counts (rollout/turns/)

| Metric | Description |
| --- | --- |
| rollout/turns/mean | Mean turns (transitions) per trajectory |
| rollout/turns/min | Minimum turns across trajectories |
| rollout/turns/max | Maximum turns across trajectories |

Token lengths (rollout/tokens/)

| Metric | Description |
| --- | --- |
| rollout/tokens/prompt_mean | Mean prompt token count per transition |
| rollout/tokens/response_mean | Mean response token count per transition |
| rollout/tokens/response_std | Standard deviation of response token counts |
| rollout/tokens/response_min | Minimum response tokens |
| rollout/tokens/response_max | Maximum response tokens (watch for clustering at sampling_max_tokens) |

Log-probability health (rollout/logprob/)

| Metric | Description |
| --- | --- |
| rollout/logprob/zero_count | Total zero-logprob tokens |
| rollout/logprob/zero_frac | Fraction of all logprobs that are exactly 0.0 |
| rollout/logprob/zero_per_group | Average zero logprobs per trajectory group |
| rollout/logprob/nz_mean | Mean of non-zero logprobs |
| rollout/logprob/nz_std | Standard deviation of non-zero logprobs |
| rollout/logprob/nz_min | Minimum non-zero logprob |
| rollout/logprob/nz_max | Maximum non-zero logprob |

Advantage distribution (rollout/advantage/)

| Metric | Description |
| --- | --- |
| rollout/advantage/mean | Mean advantage value across all transitions |
| rollout/advantage/std | Standard deviation of advantages |
| rollout/advantage/min | Minimum advantage |
| rollout/advantage/max | Maximum advantage |
| rollout/advantage/n_positive | Transitions with positive advantage |
| rollout/advantage/n_negative | Transitions with negative advantage |

Batch-quality classification (analysis/)

| Metric | Description |
| --- | --- |
| analysis/batch_completion_ratio | total_completed / batch_size: fraction of expected groups that arrived |
| analysis/batch_valid_ratio | valid_count / batch_size: non-zero-advantage groups relative to the full batch |
| analysis/zero_adv_groups | Groups where all transitions have near-zero advantage |
| analysis/zero_adv_nonzero_reward | Zero-advantage groups where at least one transition has reward not equal to 0 (all-correct case for binary rewards) |
| analysis/zero_adv_zero_reward | Zero-advantage groups where all rewards are 0 (all-wrong case) |
| analysis/reward_variance_across_groups | Variance of per-group mean rewards (high = diverse batch) |
| analysis/mean_group_reward_spread | Average within-group reward spread (max - min) |
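To make the batch-quality definitions above concrete, here is a minimal sketch that recomputes them from per-group trajectory rewards. It is an illustration, not the service's implementation: batch_quality is a hypothetical function, it assumes a group-relative advantage (so a group has zero advantage exactly when all of its rewards are equal within a tolerance), it assumes at least one group arrived, and it uses population variance, which may differ from what is logged.

```python
from statistics import mean, pvariance
from typing import Dict, List


def batch_quality(group_rewards: List[List[float]], batch_size: int, tol: float = 1e-8) -> Dict[str, float]:
    """Recompute analysis/* style metrics from rewards grouped by prompt."""
    # Assumption: zero advantage means every reward in the group is (nearly) equal.
    zero_adv = [g for g in group_rewards if max(g) - min(g) <= tol]
    zero_adv_nonzero = [g for g in zero_adv if any(abs(r) > tol for r in g)]
    valid_count = len(group_rewards) - len(zero_adv)
    return {
        "batch_completion_ratio": len(group_rewards) / batch_size,
        "batch_valid_ratio": valid_count / batch_size,
        "zero_adv_groups": len(zero_adv),
        "zero_adv_nonzero_reward": len(zero_adv_nonzero),
        "zero_adv_zero_reward": len(zero_adv) - len(zero_adv_nonzero),
        # Population variance of per-group means (assumed; the logged value may use sample variance).
        "reward_variance_across_groups": pvariance([mean(g) for g in group_rewards]),
        "mean_group_reward_spread": mean(max(g) - min(g) for g in group_rewards),
    }
```

For example, with binary rewards, a batch in which most groups are all-correct or all-wrong yields a low batch_valid_ratio, which suggests the prompts are too easy or too hard for the current policy to produce a learning signal.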

Evaluation reward and pass@k (val/reward/)

Emitted at baseline (step 0), at every val_every interval, and at the final step. Includes the same distribution metrics as rollout/reward plus group-reward metrics aggregated by prompt.

Distribution:

| Metric | Description |
| --- | --- |
| val/reward/mean | Mean reward over the eval set |
| val/reward/std | Reward std dev |
| val/reward/min | Minimum reward |
| val/reward/max | Maximum reward |
| val/reward/zero_frac | Fraction of zero-reward trajectories |

Group-reward (per-prompt aggregation):

| Metric | Description |
| --- | --- |
| val/reward/min_within_groups | Average per-prompt minimum reward |
| val/reward/mean_within_groups | Average per-prompt mean reward |
| val/reward/max_within_groups | Average per-prompt maximum reward |
| val/reward/std_within_groups | Average per-prompt reward std (consistency) |
| val/reward/rollouts_per_prompt | Mean rollouts (n) across prompts |
| val/reward/num_prompts | Distinct prompts evaluated |

Pass@k and success accounting:

| Metric | Description |
| --- | --- |
| val/reward/succeeded_rollouts | Total rollouts with reward ≥ success_threshold |
| val/reward/failed_rollouts | Total rollouts with reward < success_threshold |
| val/reward/success_threshold | Threshold used (echoed for clarity) |
| val/reward/pass_at_{k} | Probability that at least 1 of k samples passes |
| val/reward/pass_power_{k} | Probability that all k samples pass (reliability) |
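The two probabilities above can be estimated from n rollouts of a single prompt, c of which meet the success threshold. The sketch below uses the common combinatorial estimators (drawing k of the n rollouts without replacement); this page does not state which estimator the service uses, so treat the formulas and the function names as assumptions for illustration.

```python
from math import comb
from typing import List


def count_successes(rewards: List[float], success_threshold: float) -> int:
    """Count rollouts with reward >= success_threshold, as defined in the table above."""
    return sum(1 for r in rewards if r >= success_threshold)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k rollouts drawn from n (c successful) passes."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    # 1 minus the probability that all k draws come from the n - c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_power_k(n: int, c: int, k: int) -> float:
    """Probability that all k rollouts drawn from n (c successful) pass."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)
```

Averaging these per-prompt values across the eval set gives a single number per k. A large gap between the pass@k and pass-power-k values indicates a policy that can solve a prompt but does not do so reliably.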

Evaluation turn counts (val/turns/)

| Metric | Description |
| --- | --- |
| val/turns/mean | Mean turns per eval trajectory |
| val/turns/min | Minimum turns |
| val/turns/max | Maximum turns |

Evaluation token lengths (val/tokens/)

| Metric | Description |
| --- | --- |
| val/tokens/prompt_mean | Mean prompt tokens per transition |
| val/tokens/response_mean | Mean response tokens per transition |
| val/tokens/response_std | Standard deviation of response tokens |
| val/tokens/response_min | Minimum response tokens |
| val/tokens/response_max | Maximum response tokens |

Evaluation log-probability health (val/logprob/)

| Metric | Description |
| --- | --- |
| val/logprob/zero_count | Total zero-logprob tokens |
| val/logprob/zero_frac | Fraction of zero logprobs |
| val/logprob/zero_per_group | Zero logprobs per group |
| val/logprob/nz_mean | Mean of non-zero logprobs |
| val/logprob/nz_std | Standard deviation of non-zero logprobs |
| val/logprob/nz_min | Minimum non-zero logprob |
| val/logprob/nz_max | Maximum non-zero logprob |

Accessing the MLflow UI

Access the MLflow UI via a presigned URL:

    aws sagemaker create-presigned-mlflow-app-url \
      --arn arn:aws:sagemaker:us-west-2:123456789012:mlflow-app/mlflow-app-id \
      --region us-west-2

Copy the AuthorizedUrl from the output into your browser.

Agent trajectories and traces

During training, SageMaker AI records every interaction between your agent and the policy model as a trajectory — the complete record of one rollout. Each trajectory captures every prompt sent to the model, every response generated, every tool call made, and the final reward. Trajectories are published to your MLflow experiment as structured traces.

Trace contents

  • The input prompt from your training dataset

  • Every model inference turn (prompt, response, and token-level data)

  • Tool calls and their results, if your agent uses tools

  • The final reward score

  • Timing information for each turn

Viewing trajectories in the MLflow UI

Open the MLflow UI using the presigned URL above. Navigate to your experiment's run and select the Traces tab. Each trace represents one completed rollout and shows:

  • The system prompt and user prompt

  • Each assistant response (with thinking/reasoning if applicable)

  • Tool use spans showing which tools were called and their outputs

  • The reward score assigned to the trajectory

Use trajectories to debug low reward scores

| Symptom | What to look for |
| --- | --- |
| Low reward across most rollouts | Are model responses coherent? Is the prompt format correct? |
| Tool-related failures | Are tool calls succeeding? Are inputs and outputs well-formed? |
| Agent looping | Is the agent repeating the same actions without making progress? |
| Truncated responses | Are responses being cut off by the maxTokens limit? |
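The agent-looping symptom above can also be checked programmatically once you have the tool calls from a trace. The following sketch assumes you have already extracted each tool call into a (tool name, serialized arguments) pair; that shape and the function name longest_repeat_run are hypothetical and are not the MLflow trace schema.

```python
from itertools import groupby
from typing import List, Optional, Tuple

# Hypothetical representation of one tool call: (tool name, serialized arguments).
Action = Tuple[str, str]


def longest_repeat_run(actions: List[Action]) -> Tuple[int, Optional[Action]]:
    """Return the length of the longest run of identical consecutive actions and the action itself."""
    best_len, best_action = 0, None
    for action, run in groupby(actions):
        run_len = sum(1 for _ in run)
        if run_len > best_len:
            best_len, best_action = run_len, action
    return best_len, best_action
```

A trajectory whose longest run is a large share of its total turns is a candidate for closer inspection in the Traces tab; what threshold counts as looping depends on your agent.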

Get Training Results

When a training job completes, your trained model weights are stored as a SageMaker AI Model Package. This section explains how to find your results, understand the checkpoint types produced during training, and use them for deployment or continued training.

How results are stored

SageMaker AI stores training output as versioned, immutable Model Packages inside Model Package Groups. Multi-turn RL uses two separate groups, which you specify when creating a job:

| Group | Purpose | Contents |
| --- | --- | --- |
| Output Model Package Group | Final trained model | HuggingFace-compatible LoRA adapter weights (adapter_config.json + adapter_model.safetensors) |
| Intermediate Checkpoint Model Package Group | Resumable training state | LoRA adapter weights + optimizer states + training step metadata |

Configure both groups in your ModelPackageConfig:

    "ModelPackageConfig": {
      "OutputModelPackageGroupArn": "arn:aws:sagemaker:us-west-2:123456789012:model-package-group/my-final-models",
      "IntermediateCheckpointModelPackageGroupArn": "arn:aws:sagemaker:us-west-2:123456789012:model-package-group/my-intermediate-checkpoints"
    }

Checkpoint types

Training produces two types of checkpoints, saved at every training step:

Model checkpoint (weights only)

  • Stored in the Output Model Package Group

  • Contains HuggingFace-compatible LoRA adapter weights in SafeTensors format

  • Use for inference, deployment, or as the starting point for a new training job

  • Created at every step, at job completion, and when a job is stopped

Resumable checkpoint (full state)

  • Stored in the Intermediate Checkpoint Model Package Group

  • Contains LoRA adapter weights, optimizer states, and per-GPU training step metadata

  • Use to resume an interrupted job from the exact step it stopped

  • Internal format — not directly usable for inference

Checkpoint lifecycle

    Step 1         → Intermediate Checkpoint (resumable)
    Step 1         → Intermediate Checkpoint (HF-compatible)
    ...
    Step N-1       → Intermediate Checkpoint (resumable)
    Step N-1       → Intermediate Checkpoint (HF-compatible)
    ...
    Step N (final) → Model Checkpoint (HuggingFace LoRA) → Output Model Package Group

Retrieve your trained model

When a job completes successfully, the final model is saved as a Model Package in the Output Model Package Group. The OutputModelPackageArn field on the job record contains the ARN.

Check job completion and retrieve the output model ARN:

    aws sagemaker describe-job \
      --job-name "my-agent-rft-job" \
      --job-category AgentRFT \
      --region us-west-2

Look for OutputModelPackageArn in the response. Use it to describe the Model Package and get the S3 location of the weights:

    aws sagemaker describe-model-package \
      --model-package-name "arn:aws:sagemaker:us-west-2:123456789012:model-package/my-final-models/5"

If a job fails or is stopped before completion, the last intermediate checkpoint is promoted to the Output Model Package Group on a best-effort basis. Check OutputModelPackageArn the same way.

To monitor checkpoint creation during training, watch the ResumableCheckpoint and ModelCheckpoint fields in DescribeJob output.

Resume an interrupted job

If a job fails or is stopped mid-training, you can start a new job that picks up from the exact step where it left off. The platform restores the full training state — weights, optimizer momentum, and step counter — from the resumable checkpoint.

Specify a resumable checkpoint from the Intermediate Checkpoint Model Package Group as InputModelPackageArn:

    "ModelPackageConfig": {
      "OutputModelPackageGroupArn": "arn:aws:sagemaker:us-west-2:123456789012:model-package-group/my-final-models",
      "IntermediateCheckpointModelPackageGroupArn": "arn:aws:sagemaker:us-west-2:123456789012:model-package-group/my-intermediate-checkpoints",
      "InputModelPackageArn": "arn:aws:sagemaker:us-west-2:123456789012:model-package/my-intermediate-checkpoints/5"
    }

The InputModelPackageArn must point to a resumable checkpoint (one with IsCheckpoint=true in its Model Package metadata). Training resumes from the step after the checkpoint — for example, if the checkpoint was saved at step 4, training continues from step 5.

The following must stay the same between the original job and the resumed job:

  • Base model

  • LoRA configuration (rank and alpha)

  • Hyperparameters (learning rate, batch size, etc.)

  • Dataset
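Because a resumed job must keep these four settings unchanged, it can help to compare the two job definitions before submitting. The sketch below is a hypothetical pre-flight check: the dictionary keys (base_model, lora_config, hyperparameters, dataset) are illustrative names for your own bookkeeping, not fields of the JobConfigDocument schema.

```python
from typing import Dict, List

# Settings that must match between the original and the resumed job (illustrative key names).
MUST_MATCH_ON_RESUME = ("base_model", "lora_config", "hyperparameters", "dataset")


def resume_conflicts(original: Dict[str, object], resumed: Dict[str, object]) -> List[str]:
    """Return the names of settings that differ and would therefore break a resume."""
    return [key for key in MUST_MATCH_ON_RESUME if original.get(key) != resumed.get(key)]
```

An empty list means the resumed job matches on every tracked setting; any returned name points at a change that belongs in a new iterative training job rather than a resume.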

Continue training on a new job (iterative training)

Iterative training lets you build on a previously trained model with a different dataset, different hyperparameters, or a refined reward function. Unlike resuming, this starts a fresh training run — the optimizer resets, the step counter resets to 0, and only the trained LoRA weights carry over.

Specify a model checkpoint from the Output Model Package Group as InputModelPackageArn:

    "ModelPackageConfig": {
      "OutputModelPackageGroupArn": "arn:aws:sagemaker:us-west-2:123456789012:model-package-group/my-final-models",
      "IntermediateCheckpointModelPackageGroupArn": "arn:aws:sagemaker:us-west-2:123456789012:model-package-group/my-intermediate-checkpoints",
      "InputModelPackageArn": "arn:aws:sagemaker:us-west-2:123456789012:model-package/my-final-models/3"
    }

What you can change between iterations:

  • Hyperparameters (learning rate, batch size, max_steps, group_size, etc.)

  • Dataset (different prompts or data distribution)

  • Reward function

  • Agent configuration

What must stay the same:

  • Base model — the LoRA adapter is tied to the base model architecture

Common patterns for iterative training:

  • Curriculum learning — train on easier problems first, then continue on harder ones

  • Reward refinement — start with a simple reward function, then iterate with a more nuanced one

  • Hyperparameter adjustment — increase batch size or tune learning rate after observing initial training dynamics
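Resuming and iterative training use the same InputModelPackageArn field and differ only in which group the package comes from, so it is easy to pass the wrong one. The sketch below infers the mode by comparing group names embedded in the ARNs. It is an assumption-laden illustration: training_mode is a hypothetical helper, it relies on the ARN layouts shown in the examples on this page, and it does not check the IsCheckpoint metadata that the service requires for a resume.

```python
def _resource(arn: str) -> str:
    """Return the resource part of an ARN (everything after the fifth colon)."""
    return arn.split(":", 5)[5]


def _group_name_from_group_arn(group_arn: str) -> str:
    kind, _, name = _resource(group_arn).partition("/")
    if kind != "model-package-group" or not name:
        raise ValueError(f"Not a Model Package Group ARN: {group_arn}")
    return name.lower()


def _group_name_from_package_arn(package_arn: str) -> str:
    parts = _resource(package_arn).split("/")
    if len(parts) != 3 or parts[0] != "model-package":
        raise ValueError(f"Not a versioned Model Package ARN: {package_arn}")
    return parts[1].lower()


def training_mode(input_package_arn: str, output_group_arn: str, intermediate_group_arn: str) -> str:
    """Return 'resume' or 'iterative' depending on which group the input package belongs to."""
    group = _group_name_from_package_arn(input_package_arn)
    if group == _group_name_from_group_arn(intermediate_group_arn):
        return "resume"  # full state restored; settings must match the original job
    if group == _group_name_from_group_arn(output_group_arn):
        return "iterative"  # only LoRA weights carry over; optimizer and step counter reset
    raise ValueError("Input package does not belong to either configured group")
```

Note that an input package from a previous job's groups, which may differ from the groups configured for the new job, would raise here; adapt the comparison if you write outputs to new groups on each iteration.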

Checkpoint best practices

  • Monitor checkpoint creation. Use DescribeJob to track ResumableCheckpoint and ModelCheckpoint fields during training so you know what's available if you need to resume.

  • Plan for failures on long jobs. If a job has many steps, design your workflow to resume from checkpoints rather than restart from scratch.