

# RFT on Nova 2.0


RFT training data follows the OpenAI conversational format. Each training example is a JSON object containing messages, reference answers, and optional tool definitions. This section provides guidance on preparing effective training data for RFT on Nova 2.0.

**Topics**
+ [Data format and structure](#nova-hp-rft-data-format)
+ [Field descriptions](#nova-hp-rft-field-descriptions)
+ [Hyperparameter guidance](#nova-hp-rft-monitoring-hyperparams)
+ [Additional properties](#nova-hp-rft-additional-properties)
+ [Dataset size recommendations](#nova-hp-rft-dataset-size)
+ [Characteristics of effective training data](#nova-hp-rft-effective-data)
+ [Monitoring RFT training](nova-hp-rft-monitoring.md)

## Data format and structure


Each training example is a JSON object containing the following:
+ **messages**: An array of conversational turns using system, user, and optionally assistant roles
+ **reference_answer**: Expected output or evaluation criteria for reward calculation
+ **tools** (optional): Array of function definitions available to the model
+ **id** (optional): Unique identifier for tracking and deduplication

Each example must occupy a single line of your JSONL file (one JSON object per line).
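Before submitting a dataset, it can help to check each line against the structure above. The following sketch validates the core fields and writes examples in JSONL form; the specific checks are illustrative assumptions, not the service's official validator:

```python
import json

def validate_example(example: dict) -> list[str]:
    """Return a list of problems found in one RFT training example."""
    problems = []
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("'messages' must be a non-empty array")
    else:
        for i, msg in enumerate(messages):
            if msg.get("role") not in {"system", "user", "assistant"}:
                problems.append(f"messages[{i}] has unexpected role: {msg.get('role')!r}")
            if not isinstance(msg.get("content"), str):
                problems.append(f"messages[{i}] is missing string 'content'")
    if "reference_answer" not in example:
        problems.append("'reference_answer' is missing")
    return problems

def write_jsonl(examples: list[dict], path: str) -> None:
    """Write one JSON object per line, as the JSONL format requires."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

Running `validate_example` over every line before upload catches missing fields early, when they are cheap to fix.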

### Example 1: Chemistry problem


The following example shows a chemistry problem with a reference answer containing ground truth values:

```json
{
  "id": "chem-001",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful chemistry assistant"
    },
    {
      "role": "user",
      "content": "Predict hydrogen bond donors and acceptors for this SMILES: CCN(CC)CCC(=O)c1sc(N)nc1C"
    }
  ],
  "reference_answer": {
    "donor_bond_counts": 2,
    "acceptor_bond_counts": 4,
    "explanation": "Calculated using Lipinski's rule of five: N-H groups (2 donors), N and O atoms with lone pairs (4 acceptors)"
  }
}
```

**Note**  
The `reference_answer` contains ground truth values calculated using domain-specific rules. Your reward function compares the model's predicted values against these reference values to calculate a reward score.
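A reward function for this example might look like the following sketch. It assumes the model was prompted to answer with a JSON object carrying the two counts; the signature and scoring weights are illustrative assumptions, not a required interface:

```python
import json

def reward_fn(model_output: str, reference_answer: dict) -> float:
    """Score a model response against ground-truth donor/acceptor counts.

    Assumes the model responds with JSON such as
    {"donor_bond_counts": 2, "acceptor_bond_counts": 4}.
    """
    try:
        predicted = json.loads(model_output)
    except (json.JSONDecodeError, TypeError):
        return 0.0  # unparseable responses earn no reward
    score = 0.0
    for key in ("donor_bond_counts", "acceptor_bond_counts"):
        if predicted.get(key) == reference_answer[key]:
            score += 0.5  # half credit for each correct count
    return score
```

Partial credit (0.5 per correct count) gives the model a smoother learning signal than all-or-nothing scoring.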

### Example 2: Math problem


The following example shows a math problem with solution steps:

```json
{
  "id": "math-001",
  "messages": [
    {
      "role": "system",
      "content": "You are a math tutor"
    },
    {
      "role": "user",
      "content": "Solve: 2x + 5 = 13"
    }
  ],
  "reference_answer": {
    "solution": "x = 4",
    "steps": ["2x = 13 - 5", "2x = 8", "x = 4"]
  }
}
```

### Example 3: Tool usage


The following example shows tool usage with expected behavior:

```json
{
  "id": "tool-001",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful game master assistant"
    },
    {
      "role": "user",
      "content": "Generate a strength stat for a warrior character. Apply a +2 racial bonus modifier."
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "StatRollAPI",
        "description": "Generates character stats by rolling 4d6, dropping the lowest die result, and applying a modifier.",
        "parameters": {
          "type": "object",
          "properties": {
            "modifier": {
              "description": "An integer representing the modifier to apply to the total of the stat roll.",
              "type": "integer"
            }
          },
          "required": ["modifier"]
        }
      }
    }
  ],
  "reference_answer": {
    "tool_called": "StatRollAPI",
    "tool_parameters": {
      "modifier": 2
    },
    "expected_behavior": "Call StatRollAPI with modifier=2 and return the calculated stat value"
  }
}
```
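Rewarding tool usage typically means checking the call the model made against `tool_called` and `tool_parameters` in the reference answer. The sketch below assumes the tool name and arguments have already been parsed out of the model's response; the partial-credit scheme is an illustrative assumption:

```python
def tool_call_reward(tool_name: str, tool_arguments: dict, reference_answer: dict) -> float:
    """Score a model's tool call against the expected call in reference_answer."""
    if tool_name != reference_answer["tool_called"]:
        return 0.0  # wrong tool: no credit
    if tool_arguments == reference_answer["tool_parameters"]:
        return 1.0  # right tool with exactly the right arguments
    return 0.5  # right tool, wrong arguments: partial credit
```

For the example above, calling `StatRollAPI` with `{"modifier": 2}` would score 1.0, while calling it with a different modifier would score 0.5.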

## Field descriptions



| Field | Description | Additional notes | Required |
| --- | --- | --- | --- |
| id | Unique identifier for this RFT example | String (for example, "sample-001"). Useful for tracking and deduplication. | No |
| messages | Ordered list of chat messages that define the prompt and context | Array of objects. The model sees them in order. Typically starts with a system message, then user. | Yes |
| messages[].role | Who is speaking in the message | Common values: "system", "user", and optionally "assistant" | No |
| messages[].content | The text content of the message | Plain string. For system messages it's instructions; for user messages it's the task or input. | No |
| tools | Tool specifications available to the model during this example | Array. Each item defines a tool's interface and metadata. Types may include "function" or "internal". | No |
| reference_answer | The expected model output for this example | String or object depending on the task. Used as the target for evaluation or training. | No |

**Note**  
Any additional custom fields (for example, `task_id`, `difficulty_level`, `context_data`) are not validated and will be passed to your reward function as metadata.

## Hyperparameter guidance


Use the following recommended hyperparameters based on your training approach:

**General:**
+ Epochs: 1
+ Learning rate (lr): 1e-7
+ Number of generations: 8
+ Max new tokens: 8192
+ Batch size: 256

**LoRA (Low-Rank Adaptation):**
+ LoRA Rank: 32

**Note**  
Adjust these values based on your dataset size and validation performance. Monitor training metrics to prevent overfitting.
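Expressed as a training-recipe fragment, the recommended values above might look like the following. The field names are illustrative assumptions; match them to the schema your recipe actually uses:

```yaml
# Illustrative recipe fragment; field names are assumptions, not a fixed schema.
training_config:
  max_epochs: 1
  lr: 1e-7
  num_generations: 8
  max_new_tokens: 8192
  global_batch_size: 256
  peft:
    peft_scheme: lora
    lora_rank: 32
```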

## Additional properties


The "additionalProperties": true setting allows you to include custom fields beyond the core schema requirements, providing flexibility to add any data your reward function needs for proper evaluation.

### Common additional fields


You can include the following types of additional fields:

**Metadata:**
+ `task_id`: Unique identifier for tracking
+ `difficulty_level`: Problem complexity indicator
+ `domain`: Subject area or category
+ `expected_reasoning_steps`: Number of steps in the solution

**Evaluation criteria:**
+ `evaluation_criteria`: Specific grading rubrics
+ `custom_scoring_weights`: Relative importance of different aspects
+ `context_data`: Background information for the problem
+ `external_references`: Links to relevant documentation or resources

### Example with additional properties


The following example includes custom metadata fields:

```json
{
  "id": "algebra_001",
  "messages": [
    {
      "role": "system",
      "content": "You are a math tutor"
    },
    {
      "role": "user",
      "content": "Solve: 2x + 5 = 13"
    }
  ],
  "reference_answer": {
    "solution": "x = 4",
    "steps": ["2x = 13 - 5", "2x = 8", "x = 4"]
  },
  "task_id": "algebra_001",
  "difficulty_level": "easy",
  "domain": "algebra",
  "expected_reasoning_steps": 3
}
```
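Because custom fields are passed through to your reward function as metadata, you can use them to shape the reward. This hypothetical sketch weights the reward by `difficulty_level`; the weight values and signature are illustrative assumptions:

```python
# Hypothetical sketch: weight the reward by the example's difficulty_level
# metadata. The weights below are illustrative assumptions.
DIFFICULTY_WEIGHTS = {"easy": 0.5, "medium": 1.0, "hard": 2.0}

def weighted_reward(base_reward: float, metadata: dict) -> float:
    """Scale a base reward so harder examples contribute a stronger signal."""
    weight = DIFFICULTY_WEIGHTS.get(metadata.get("difficulty_level"), 1.0)
    return base_reward * weight
```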

## Dataset size recommendations


### Starting point


Begin with the following minimum dataset sizes:
+ Minimum 100 training examples
+ Minimum 100 evaluation examples

Prioritize high-quality input data and a reliable reward function that executes consistently on model responses.

### Evaluation-first approach


Before investing in large-scale RFT training, evaluate your model's baseline performance:
+ **High performance (greater than 95% reward)**: RFT may be unnecessary—your model already performs well
+ **Very poor performance (0% reward)**: Switch to SFT first to establish basic capabilities
+ **Moderate performance**: RFT is likely appropriate

This evaluation-first approach ensures your reward function is bug-free and determines if RFT is the right method for your use case. Starting small allows you to get comfortable with the RFT workflow, identify and fix issues early, validate your approach before scaling up, and test reward function reliability. Once validated, you can expand to larger datasets to further improve performance.
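The decision rule above is simple enough to encode directly. This sketch maps a baseline average reward to a recommended next step, using the thresholds from the list:

```python
def recommend_next_step(baseline_reward: float) -> str:
    """Map a baseline average reward (0.0-1.0) to a recommended approach,
    following the evaluation-first thresholds described above."""
    if baseline_reward > 0.95:
        return "skip RFT: the base model already performs well"
    if baseline_reward == 0.0:
        return "run SFT first to establish basic capabilities"
    return "proceed with RFT"
```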

## Characteristics of effective training data


### Clarity and consistency


Good RFT examples require clear, unambiguous input data that enables accurate reward calculation across different model outputs. Avoid noise in your data, including:
+ Inconsistent formatting
+ Contradictory labels or instructions
+ Ambiguous prompts
+ Conflicting reference answers

Any ambiguity will mislead the training process and cause the model to learn unintended behaviors.

### Diversity


Your dataset should capture the full diversity of production use cases to ensure robust real-world performance. Include:
+ Various problem types and difficulty levels
+ Different input formats and edge cases
+ Representative samples from all expected scenarios

This diversity helps prevent overfitting and ensures the model handles unfamiliar inputs gracefully.

### Reward function considerations


Design your reward function for efficient training:
+ Execute within seconds (not minutes)
+ Parallelize effectively with Lambda
+ Return consistent, reliable scores
+ Handle different types of model outputs gracefully

Fast, scalable reward functions enable rapid iteration and cost-effective experimentation at scale.
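The considerations above translate into a Lambda handler that fails soft and returns quickly. The event and return shapes below are illustrative assumptions; match them to the interface your RFT job actually requires:

```python
def lambda_handler(event, context):
    """Hypothetical reward-function Lambda skeleton. The event and return
    shapes are illustrative assumptions, not a documented interface."""
    try:
        model_output = event["model_output"]
        reference = event["reference_answer"]
        reward = score(model_output, reference)
    except Exception as exc:
        # Fail soft: return a zero reward rather than crashing the rollout.
        return {"reward": 0.0, "error": str(exc)}
    return {"reward": reward}

def score(model_output: str, reference: dict) -> float:
    """Toy scorer: exact match against the reference solution string."""
    return 1.0 if model_output.strip() == reference.get("solution") else 0.0
```

Catching all exceptions and returning a score keeps one malformed response from stalling an entire rollout batch.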

# Monitoring RFT training


Monitor key metrics during training to ensure effective learning and identify potential issues early.

**Topics**
+ [

## Key metrics to track
](#nova-hp-rft-monitoring-metrics)
+ [

## Evaluation after RFT
](#nova-hp-rft-monitoring-evaluation)
+ [

## Using fine-tuned models
](#nova-hp-rft-monitoring-checkpoints)
+ [

## Limitations and best practices
](#nova-hp-rft-monitoring-limitations)
+ [

## Troubleshooting
](#nova-hp-rft-monitoring-troubleshooting)

## Key metrics to track


Monitor the following metrics using MLflow during training:

**Reward metrics:**
+ **Average reward score**: Overall quality of model responses (should increase over time)
+ **Reward distribution**: Percentage of responses receiving high, medium, and low rewards
+ **Training vs. validation rewards**: Compare to detect overfitting

**Training metrics:**
+ **Policy updates**: Number of successful weight updates
+ **Rollout completion rate**: Percentage of samples successfully evaluated

**Concerning patterns:**
+ Rewards plateauing (the model has stopped improving)
+ Validation rewards dropping while training rewards increase (overfitting)
+ Reward variance increasing significantly over time (instability)
+ High percentage of reward function errors (implementation issues)

**When to stop training:**
+ Target performance metrics are achieved
+ Rewards plateau and no longer improve
+ Validation performance degrades (overfitting detected)
+ Maximum training budget is reached
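The plateau and overfitting checks above can be automated over your logged reward series. The window size and threshold below are illustrative assumptions to tune for your own runs:

```python
def detect_issues(train_rewards: list[float], val_rewards: list[float],
                  window: int = 5, plateau_eps: float = 0.005) -> list[str]:
    """Flag concerning patterns from per-step reward series.
    Thresholds are illustrative assumptions, not official defaults."""
    issues = []
    if len(train_rewards) >= 2 * window:
        recent = sum(train_rewards[-window:]) / window
        previous = sum(train_rewards[-2 * window:-window]) / window
        if recent - previous < plateau_eps:
            issues.append("rewards plateauing")
    if len(val_rewards) >= 2 and len(train_rewards) >= 2:
        if val_rewards[-1] < val_rewards[0] and train_rewards[-1] > train_rewards[0]:
            issues.append("possible overfitting")
    return issues
```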

## Evaluation after RFT


After training completes, evaluate your fine-tuned model to assess performance improvements:
+ **Run RFT evaluation job**: Use the checkpoint from your RFT training as the model
+ **Compare to baseline**: Evaluate both base model and fine-tuned model on the same test set
+ **Analyze metrics**: Review task-specific metrics (accuracy, reward scores, etc.)
+ **Conduct qualitative review**: Manually inspect sample outputs for quality

For detailed evaluation procedures, see the Evaluation section.

## Using fine-tuned models


**Accessing checkpoints:**

After training completes, locate your checkpoint:

1. Navigate to your `output_path` in S3.
2. Download and extract `output.tar.gz`.
3. Open `manifest.json`.
4. Copy the `checkpoint_s3_bucket` value.
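The extract-and-read steps above can be scripted. This hypothetical helper assumes `manifest.json` sits at the root of the downloaded `output.tar.gz`:

```python
import json
import tarfile

def get_checkpoint_path(archive_path: str, extract_dir: str = ".") -> str:
    """Extract output.tar.gz and return the checkpoint_s3_bucket value
    from the manifest.json it contains (assumed to be at the archive root)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(extract_dir)
    with open(f"{extract_dir}/manifest.json") as f:
        manifest = json.load(f)
    return manifest["checkpoint_s3_bucket"]
```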

**Deploying for inference:**

Use the checkpoint S3 path for inference or further training:

```yaml
run:
    model_type: amazon.nova-2-lite-v1:0:256k
    model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>"
```

For deployment and inference instructions, refer to the Inference section.

## Limitations and best practices


**Current limitations:**

**Beta restrictions:**
+ You must create a new RIG group for RFT. This limitation will be resolved by GA.
+ Instance type requirements: Only P5 instances are supported (minimum 8x P5.48xlarge). Support for smaller instance types is coming soon (ETA: mid-January 2025).

**Functional limitations:**
+ 15-minute Lambda timeout: Reward functions must complete within 15 minutes
+ Single-turn only: Multi-turn conversations not supported
+ Validation datasets: Not supported during training. Use separate evaluation jobs to assess training progress.

**Training considerations:**
+ Low reward scenarios: RFT may struggle when fewer than 5% of examples receive positive rewards - consider SFT first
+ Data requirements: Needs sufficient diversity to learn effectively
+ Computational cost: More expensive than supervised fine-tuning

**Nova Forge removes some of these limitations:**
+ Supports multi-turn conversations
+ Allows reward functions exceeding 15-minute timeouts
+ Provides advanced algorithms and tuning options
+ Designed for complex enterprise use cases, specifically tuned to build frontier models

**Best practices:**

**Start small and scale:**
+ Begin with minimal datasets (100-200 examples) and few training epochs
+ Validate your approach before scaling up
+ Gradually increase dataset size and training steps based on results

**Baseline with SFT first:**
+ If reward scores are consistently low (e.g., always 0), perform SFT before RFT
+ RFT requires reasonable baseline performance to improve effectively

**Design efficient reward functions:**
+ Execute in seconds, not minutes
+ Minimize external API calls
+ Use efficient algorithms and data structures
+ Implement proper error handling
+ Test thoroughly before training
+ Leverage Lambda's parallel scaling capabilities

**Monitor training actively:**
+ Track average reward scores over time
+ Watch reward distribution across samples
+ Compare training vs. validation rewards
+ Look for concerning patterns (plateaus, overfitting, instability)

**Iterate based on results:**
+ If rewards don't improve after several iterations, adjust reward function design
+ Increase dataset diversity to provide clearer learning signals
+ Consider switching to SFT if rewards remain near zero
+ Experiment with different hyperparameters (learning rate, batch size)

**Optimize data quality:**
+ Ensure diverse, representative examples
+ Include edge cases and difficult samples
+ Verify reward function correctly scores all example types
+ Remove or fix samples that confuse the reward function

## Troubleshooting


**Reward function errors:**

Symptoms: High error rate in reward function calls during training


| Issue | Symptoms | Resolution | 
| --- |--- |--- |
| Lambda timeout | Frequent timeouts after 15 minutes | Optimize function performance; consider Nova Forge for complex evaluations | 
| Insufficient concurrency | Lambda throttling errors | Increase `lambda_concurrency_limit` or request a quota increase | 
| Invalid return format | Training fails with format errors | Verify return structure matches required interface format | 
| Unhandled exceptions | Intermittent errors | Add comprehensive error handling and logging | 
| External API failures | Inconsistent scoring | Implement retry logic and fallback strategies | 
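For the external API failures in the last row, retry logic with a fallback score keeps transient outages from derailing training. This hypothetical helper uses exponential backoff and returns a neutral score when retries are exhausted; the defaults are illustrative assumptions:

```python
import time

def call_with_retries(fn, *args, attempts: int = 3, base_delay: float = 0.5,
                      fallback=0.0):
    """Hypothetical retry helper for flaky external calls inside a reward
    function: exponential backoff, then a fallback score instead of an error."""
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == attempts - 1:
                return fallback  # give up gracefully with a neutral score
            time.sleep(base_delay * (2 ** attempt))
```

Returning a neutral fallback score is a design choice: it trades a slightly noisier reward signal for uninterrupted training.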

**Poor training performance:**

Symptoms: Rewards not improving or plateauing at low values

Resolutions:
+ **Verify reward function correctness**: Test with known good/bad examples
+ **Check baseline performance**: Evaluate base model; if near-zero accuracy, do SFT first
+ **Increase data diversity**: Add more varied examples covering different scenarios
+ **Adjust hyperparameters**: Try different learning rates or batch sizes
+ **Review reward signal quality**: Ensure rewards differentiate between good and bad responses

**Overfitting:**

Symptoms: Training rewards increase while validation rewards decrease

Resolutions:
+ **Reduce training steps**: Stop training earlier
+ **Increase dataset size**: Add more training examples
+ **Add regularization**: Adjust `weight_decay` or `entropy_coeff`
+ **Increase data diversity**: Ensure training set represents full distribution