

# Model quality
<a name="model-monitor-model-quality"></a>

Model quality monitoring jobs monitor the performance of a model by comparing the predictions that the model makes with the actual Ground Truth labels that the model attempts to predict. To do this, model quality monitoring merges data that is captured from real-time or batch inference with actual labels that you store in an Amazon S3 bucket, and then compares the predictions with the actual labels.

To measure model quality, Model Monitor uses metrics that depend on the ML problem type. For example, if your model is for a regression problem, one of the metrics evaluated is mean squared error (MSE). For information about all of the metrics used for the different ML problem types, see [Model quality metrics and Amazon CloudWatch monitoring](model-monitor-model-quality-metrics.md). 

Model quality monitoring follows the same steps as data quality monitoring, but adds the additional step of merging the actual labels from Amazon S3 with the predictions captured from the real-time inference endpoint or batch transform job. To monitor model quality, follow these steps:
+ Enable data capture. This captures inference input and output from a real-time inference endpoint or batch transform job and stores the data in Amazon S3. For more information, see [Data capture](model-monitor-data-capture.md).
+ Create a baseline. In this step, you run a baseline job that compares predictions from the model with Ground Truth labels in a baseline dataset. The baseline job automatically creates baseline statistical rules and constraints that define thresholds against which the model performance is evaluated. For more information, see [Create a model quality baseline](model-monitor-model-quality-baseline.md).
+ Define and schedule model quality monitoring jobs. For specific information and code samples of model quality monitoring jobs, see [Schedule model quality monitoring jobs](model-monitor-model-quality-schedule.md). For general information about monitoring jobs, see [Schedule monitoring jobs](model-monitor-scheduling.md).
+ Ingest Ground Truth labels that model monitor merges with captured prediction data from a real-time inference endpoint or batch transform job. For more information, see [Ingest Ground Truth labels and merge them with predictions](model-monitor-model-quality-merge.md).
+ Integrate model quality monitoring with Amazon CloudWatch. For more information, see [Monitoring model quality metrics with CloudWatch](model-monitor-model-quality-metrics.md#model-monitor-model-quality-cw).
+ Interpret the results of a monitoring job. For more information, see [Interpret results](model-monitor-interpreting-results.md).
+ Use SageMaker Studio to enable model quality monitoring and visualize results. For more information, see [Visualize results for real-time endpoints in Amazon SageMaker Studio](model-monitor-interpreting-visualize-results.md).

**Topics**
+ [Create a model quality baseline](model-monitor-model-quality-baseline.md)
+ [Schedule model quality monitoring jobs](model-monitor-model-quality-schedule.md)
+ [Ingest Ground Truth labels and merge them with predictions](model-monitor-model-quality-merge.md)
+ [Model quality metrics and Amazon CloudWatch monitoring](model-monitor-model-quality-metrics.md)

# Create a model quality baseline
<a name="model-monitor-model-quality-baseline"></a>

Create a baseline job that compares your model predictions with ground truth labels in a baseline dataset that you have stored in Amazon S3. Typically, you use a training dataset as the baseline dataset. The baseline job calculates metrics for the model and suggests constraints to use to monitor model quality drift.

To create a baseline job, you need to have a dataset that contains predictions from your model along with labels that represent the Ground Truth for your data.

To create a baseline job, use the `ModelQualityMonitor` class provided by the SageMaker Python SDK, and complete the following steps.

**To create a model quality baseline job**

1. First, create an instance of the `ModelQualityMonitor` class. The following code snippet shows how to do this.

   ```
   from sagemaker import get_execution_role, Session
   from sagemaker.model_monitor import ModelQualityMonitor
                   
   role = get_execution_role()
   session = Session()
   
   model_quality_monitor = ModelQualityMonitor(
       role=role,
       instance_count=1,
       instance_type='ml.m5.xlarge',
       volume_size_in_gb=20,
       max_runtime_in_seconds=1800,
       sagemaker_session=session
   )
   ```

1. Now call the `suggest_baseline` method of the `ModelQualityMonitor` object to run a baseline job. The following code snippet assumes that you have a baseline dataset that contains both predictions and labels stored in Amazon S3.

   ```
   from sagemaker.model_monitor.dataset_format import DatasetFormat

   baseline_job_name = "MyBaseLineJob"
   job = model_quality_monitor.suggest_baseline(
       job_name=baseline_job_name,
       baseline_dataset=baseline_dataset_uri, # The S3 location of the validation dataset.
       dataset_format=DatasetFormat.csv(header=True),
       output_s3_uri=baseline_results_uri, # The S3 location to store the results.
       problem_type='BinaryClassification',
       inference_attribute="prediction", # The column in the dataset that contains predictions.
       probability_attribute="probability", # The column in the dataset that contains probabilities.
       ground_truth_attribute="label" # The column in the dataset that contains ground truth labels.
   )
   job.wait(logs=False)
   ```

1. After the baseline job finishes, you can see the constraints that the job generated. First, get the results of the baseline job by accessing the `latest_baselining_job` attribute of the `ModelQualityMonitor` object.

   ```
   baseline_job = model_quality_monitor.latest_baselining_job
   ```

1. The baseline job suggests constraints, which are thresholds for metrics that Model Monitor measures. If a metric goes beyond the suggested threshold, Model Monitor reports a violation. To view the constraints that the baseline job generated, call the `suggested_constraints` method of the baseline job. The following code snippet loads the constraints for a binary classification model into a Pandas dataframe.

   ```
   import pandas as pd
   pd.DataFrame(baseline_job.suggested_constraints().body_dict["binary_classification_constraints"]).T
   ```

   We recommend that you view the generated constraints and modify them as necessary before using them for monitoring. For example, if a constraint is too aggressive, you might get more alerts for violations than you want.

   If your constraints file contains numbers expressed in scientific notation, you need to convert them to floats. The following Python [preprocessing script](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-pre-and-post-processing.html#model-monitor-pre-processing-script) example shows how to convert numbers in scientific notation to floats. 

   ```
   import csv
   
   def fix_scientific_notation(col):
       try:
           return format(float(col), "f")
       except (ValueError, TypeError):
           return col
   
   def preprocess_handler(csv_line):
       reader = csv.reader([csv_line])
       csv_record = next(reader)
       # Skip the baseline header. Change HEADER_NAME to the first column's name.
       if csv_record[0] == "HEADER_NAME":
           return []
       return {str(i).zfill(20): fix_scientific_notation(d) for i, d in enumerate(csv_record)}
   ```

   You can add your pre-processing script to a baseline or monitoring schedule as a `record_preprocessor_script`, as defined in the [Model Monitor](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html) documentation.

1. When you are satisfied with the constraints, pass them as the `constraints` parameter when you create a monitoring schedule. For more information, see [Schedule model quality monitoring jobs](model-monitor-model-quality-schedule.md).

The suggested baseline constraints are contained in the constraints.json file in the location you specify with `output_s3_uri`. For information about the schema for this file, see [Schema for Constraints (constraints.json file)](model-monitor-byoc-constraints.md).
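
Before using the suggested constraints for monitoring, you can edit the thresholds directly in constraints.json. The following sketch illustrates one way to do this; the field names shown are an illustrative abbreviation of the binary classification constraints format, so verify them against your own generated file.

```python
import json

# Illustrative shape of a generated constraints.json for binary classification.
constraints = {
    "binary_classification_constraints": {
        "recall": {"threshold": 0.9, "comparison_operator": "LessThanThreshold"},
        "precision": {"threshold": 0.8, "comparison_operator": "LessThanThreshold"},
    }
}

# Loosen the recall threshold so that small dips don't report violations.
constraints["binary_classification_constraints"]["recall"]["threshold"] = 0.8

edited = json.dumps(constraints, indent=2)  # write this back to constraints.json
```

Upload the edited file back to Amazon S3 and pass its location as the `constraints` parameter when you create a monitoring schedule.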

# Schedule model quality monitoring jobs
<a name="model-monitor-model-quality-schedule"></a>

After you create your baseline, you can call the `create_monitoring_schedule()` method of your `ModelQualityMonitor` class instance to schedule an hourly model quality monitor. The following sections show you how to create a model quality monitor for a model deployed to a real-time endpoint as well as for a batch transform job.

**Important**  
You can specify either a batch transform input or an endpoint input, but not both, when you create your monitoring schedule.

Unlike with data quality monitoring, you must supply Ground Truth labels to monitor model quality. However, Ground Truth labels can be delayed. To address this, specify offsets when you create your monitoring schedule. 

## Model monitor offsets
<a name="model-monitor-model-quality-schedule-offsets"></a>

Model quality jobs include `StartTimeOffset` and `EndTimeOffset`, which are fields of the `ModelQualityJobInput` parameter of the `create_model_quality_job_definition` method that work as follows:
+ `StartTimeOffset` - If specified, jobs subtract this time from the start time.
+ `EndTimeOffset` - If specified, jobs subtract this time from the end time.

Offsets use [ISO 8601 duration format](https://en.wikipedia.org/wiki/ISO_8601#Durations). For example, `-PT7H` is an offset of 7 hours, `-P1D` is 1 day, and `-PT30M` is 30 minutes (H=hours, D=days, M=minutes).

For example, if your Ground Truth starts coming in after 1 day, but is not complete for a week, set `StartTimeOffset` to `-P8D` and `EndTimeOffset` to `-P1D`. Then, if you schedule a job to run at `2020-01-09T13:00`, it analyzes data from between `2020-01-01T13:00` and `2020-01-08T13:00`.
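
The window arithmetic in this example can be checked with a few lines of Python. This is only a sketch using the standard `datetime` module; Model Monitor performs this computation for you.

```python
from datetime import datetime, timedelta

def analysis_window(scheduled_time, start_offset_days, end_offset_days):
    """Return the (start, end) of the analyzed data window for day-based offsets."""
    start = scheduled_time - timedelta(days=start_offset_days)  # StartTimeOffset
    end = scheduled_time - timedelta(days=end_offset_days)      # EndTimeOffset
    return start, end

# A job scheduled at 2020-01-09T13:00 with offsets -P8D and -P1D:
start, end = analysis_window(datetime(2020, 1, 9, 13, 0), 8, 1)
# start is 2020-01-01T13:00, end is 2020-01-08T13:00
```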

**Important**  
The schedule cadence should allow one execution to finish before the next execution starts, so that the Ground Truth merge job and monitoring job from each execution can complete. The maximum runtime of an execution is divided between the two jobs, so for an hourly model quality monitoring job, the value of `MaxRuntimeInSeconds` specified as part of `StoppingCondition` should be no more than 1800.

## Model quality monitoring for models deployed to real-time endpoints
<a name="model-monitor-data-quality-schedule-rt"></a>

To schedule a model quality monitor for a real-time endpoint, pass your `EndpointInput` instance to the `endpoint_input` argument of your `ModelQualityMonitor` instance, as shown in the following code sample:

```
from sagemaker.model_monitor import CronExpressionGenerator, EndpointInput

model_quality_model_monitor = ModelQualityMonitor(
   role=sagemaker.get_execution_role(),
   ...
)

schedule = model_quality_model_monitor.create_monitoring_schedule(
   monitor_schedule_name=schedule_name,
   post_analytics_processor_script=s3_code_postprocessor_uri,
   output_s3_uri=s3_report_path,
   statistics=model_quality_model_monitor.baseline_statistics(),
   constraints=model_quality_model_monitor.suggested_constraints(),
   schedule_cron_expression=CronExpressionGenerator.hourly(),
   enable_cloudwatch_metrics=True,
   endpoint_input=EndpointInput(
        endpoint_name=endpoint_name,
        destination="/opt/ml/processing/input/endpoint",
        start_time_offset="-P2D",
        end_time_offset="-P1D",
    )
)
```

## Model quality monitoring for batch transform jobs
<a name="model-monitor-data-quality-schedule-tt"></a>

To schedule a model quality monitor for a batch transform job, pass your `BatchTransformInput` instance to the `batch_transform_input` argument of your `ModelQualityMonitor` instance, as shown in the following code sample:

```
from sagemaker.model_monitor import BatchTransformInput, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import MonitoringDatasetFormat

model_quality_model_monitor = ModelQualityMonitor(
   role=sagemaker.get_execution_role(),
   ...
)

schedule = model_quality_model_monitor.create_monitoring_schedule(
    monitor_schedule_name=mon_schedule_name,
    batch_transform_input=BatchTransformInput(
        data_captured_destination_s3_uri=s3_capture_upload_path,
        destination="/opt/ml/processing/input",
        dataset_format=MonitoringDatasetFormat.csv(header=False),
        # The column index of the output that contains the inference probability.
        probability_attribute="0",
        # The threshold used to classify the inference probability as class 0 or 1
        # in a binary classification problem.
        probability_threshold_attribute=0.5,
        # Look back 6 hours for transform job outputs.
        start_time_offset="-PT6H",
        end_time_offset="-PT0H"
    ),
    ground_truth_input=gt_s3_uri,
    output_s3_uri=s3_report_path,
    problem_type="BinaryClassification",
    constraints = constraints_path,
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
```

# Ingest Ground Truth labels and merge them with predictions
<a name="model-monitor-model-quality-merge"></a>

Model quality monitoring compares the predictions your model makes with ground truth labels to measure the quality of the model. For this to work, you periodically label data captured by your endpoint or batch transform job and upload it to Amazon S3.

To match Ground Truth labels with captured prediction data, there must be a unique identifier for each record in the dataset. The structure of each record for ground truth data is as follows:

```
{
  "groundTruthData": {
    "data": "1",
    "encoding": "CSV"
  },
  "eventMetadata": {
    "eventId": "aaaa-bbbb-cccc"
  },
  "eventVersion": "0"
}
```

The `eventId` in the `eventMetadata` structure can be one of the following:
+ `eventId` – SageMaker automatically generates this ID when a user invokes the endpoint.
+ `inferenceId` – The caller supplies this ID when they invoke the endpoint.

If `inferenceId` is present in the captured data records, Model Monitor uses it to merge the captured data with the Ground Truth records. You are responsible for making sure that the `inferenceId` in the Ground Truth records matches the `inferenceId` in the captured records. If `inferenceId` is not present in the captured data, Model Monitor uses `eventId` from the captured data records to match them with a Ground Truth record.
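
The matching behavior can be sketched as follows. This is an illustrative reimplementation of the described logic, not Model Monitor's actual merge code: prefer `inferenceId` when a record carries one, and fall back to `eventId`.

```python
def merge_key(record):
    """Return the ID used to match a captured record with a Ground Truth record."""
    meta = record["eventMetadata"]
    return meta.get("inferenceId") or meta["eventId"]

def merge(captured_records, ground_truth_records):
    """Pair each captured record with its Ground Truth record, if one exists."""
    labels = {merge_key(gt): gt for gt in ground_truth_records}
    return [
        (rec, labels[merge_key(rec)])
        for rec in captured_records
        if merge_key(rec) in labels
    ]

# One record invoked with an inference ID, one without:
captured = [{"eventMetadata": {"eventId": "e-1", "inferenceId": "inf-1"}},
            {"eventMetadata": {"eventId": "e-2"}}]
ground_truth = [{"eventMetadata": {"inferenceId": "inf-1"}},
                {"eventMetadata": {"eventId": "e-2"}}]
pairs = merge(captured, ground_truth)  # both records find their labels
```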

You must upload Ground Truth data to an Amazon S3 bucket that has the same path format as captured data. 

**Data format requirements**  
When you save your data to Amazon S3, it must use the JSON Lines format (.jsonl) and be saved using the following naming structure. To learn more about JSON Lines requirements, see [Use input and output data](sms-data.md). 

```
s3://amzn-s3-demo-bucket1/prefix/yyyy/mm/dd/hh
```

The date in this path is the date when the Ground Truth label is collected, and does not have to match the date when the inference was generated.
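
For example, the following sketch builds one Ground Truth record in the format shown above, along with the hourly S3 key prefix under which to upload it. The bucket prefix shown is a placeholder.

```python
import json
from datetime import datetime, timezone

def ground_truth_record(label, event_id):
    """One JSON Lines record in the Ground Truth data format."""
    return json.dumps({
        "groundTruthData": {"data": str(label), "encoding": "CSV"},
        "eventMetadata": {"eventId": event_id},
        "eventVersion": "0",
    })

def upload_key(prefix, collected_at):
    """S3 key prefix for the hour in which the label was collected."""
    return f"{prefix}/{collected_at:%Y/%m/%d/%H}"

line = ground_truth_record(1, "aaaa-bbbb-cccc")
key = upload_key("model-quality/ground-truth",
                 datetime(2020, 1, 9, 13, tzinfo=timezone.utc))
# key is "model-quality/ground-truth/2020/01/09/13"
```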

After you create and upload the Ground Truth labels, include the location of the labels as a parameter when you create the monitoring job. If you are using AWS SDK for Python (Boto3), do this by specifying the location of Ground Truth labels as the `S3Uri` field of the `GroundTruthS3Input` parameter in a call to the `create_model_quality_job_definition` method. If you are using the SageMaker Python SDK, specify the location of the Ground Truth labels as the `ground_truth_input` parameter in the call to the `create_monitoring_schedule` method of the `ModelQualityMonitor` object.

# Model quality metrics and Amazon CloudWatch monitoring
<a name="model-monitor-model-quality-metrics"></a>

Model quality monitoring jobs compute different metrics to evaluate the quality and performance of your machine learning models. The specific metrics calculated depend on the type of ML problem: regression, binary classification, or multiclass classification. Monitoring these metrics is crucial for detecting model drift over time. The following sections cover the key model quality metrics for each problem type, as well as how to set up automated monitoring and alerting using CloudWatch to continuously track your model's performance.

**Note**  
The standard deviation for a metric is provided only when at least 200 samples are available. Model Monitor computes the standard deviation by randomly sampling 80% of the data five times, computing the metric for each sample, and taking the standard deviation of those results.
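
The sampling scheme in the note can be sketched as follows. This is an illustration of the described procedure, not Model Monitor's implementation, and the accuracy metric here is just an example.

```python
import random
import statistics

def metric_standard_deviation(records, metric_fn, rounds=5, fraction=0.8, seed=0):
    """Estimate a metric's standard deviation by repeatedly sampling 80% of the data."""
    rng = random.Random(seed)
    k = int(len(records) * fraction)
    values = [metric_fn(rng.sample(records, k)) for _ in range(rounds)]
    return statistics.stdev(values)

# Example: standard deviation of accuracy over 240 (prediction, label) pairs.
data = [(p, p) for p in range(180)] + [(0, 1)] * 60  # 75% of predictions correct
accuracy = lambda recs: sum(p == y for p, y in recs) / len(recs)
sd = metric_standard_deviation(data, accuracy)
```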

## Regression metrics
<a name="model-monitor-model-quality-metrics-regression"></a>

The following shows an example of the metrics that a model quality monitoring job computes for a regression problem.

```
"regression_metrics" : {
    "mae" : {
      "value" : 0.3711832061068702,
      "standard_deviation" : 0.0037566388129940394
    },
    "mse" : {
      "value" : 0.3711832061068702,
      "standard_deviation" : 0.0037566388129940524
    },
    "rmse" : {
      "value" : 0.609248066149471,
      "standard_deviation" : 0.003079253267651125
    },
    "r2" : {
      "value" : -1.3766111872212665,
      "standard_deviation" : 0.022653980022771227
    }
  }
```
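
These regression metrics are standard. A minimal computation over (prediction, label) pairs looks like the following; this is illustrative, not the monitoring container's code.

```python
import math

def regression_metrics(preds, labels):
    """Compute MAE, MSE, RMSE, and R^2 for paired predictions and labels."""
    n = len(preds)
    errors = [p - y for p, y in zip(preds, labels)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(labels) / n
    ss_tot = sum((y - mean_y) ** 2 for y in labels)  # total sum of squares
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

m = regression_metrics([2.5, 0.0, 2.0, 8.0], [3.0, -0.5, 2.0, 7.0])
# m["mae"] is 0.5 and m["mse"] is 0.375
```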

## Binary classification metrics
<a name="model-monitor-model-quality-metrics-binary"></a>

The following shows an example of the metrics that a model quality monitoring job computes for a binary classification problem.

```
"binary_classification_metrics" : {
    "confusion_matrix" : {
      "0" : {
        "0" : 1,
        "1" : 2
      },
      "1" : {
        "0" : 0,
        "1" : 1
      }
    },
    "recall" : {
      "value" : 1.0,
      "standard_deviation" : "NaN"
    },
    "precision" : {
      "value" : 0.3333333333333333,
      "standard_deviation" : "NaN"
    },
    "accuracy" : {
      "value" : 0.5,
      "standard_deviation" : "NaN"
    },
    "recall_best_constant_classifier" : {
      "value" : 1.0,
      "standard_deviation" : "NaN"
    },
    "precision_best_constant_classifier" : {
      "value" : 0.25,
      "standard_deviation" : "NaN"
    },
    "accuracy_best_constant_classifier" : {
      "value" : 0.25,
      "standard_deviation" : "NaN"
    },
    "true_positive_rate" : {
      "value" : 1.0,
      "standard_deviation" : "NaN"
    },
    "true_negative_rate" : {
      "value" : 0.33333333333333337,
      "standard_deviation" : "NaN"
    },
    "false_positive_rate" : {
      "value" : 0.6666666666666666,
      "standard_deviation" : "NaN"
    },
    "false_negative_rate" : {
      "value" : 0.0,
      "standard_deviation" : "NaN"
    },
    "receiver_operating_characteristic_curve" : {
      "false_positive_rates" : [ 0.0, 0.0, 0.0, 0.0, 0.0, 1.0 ],
      "true_positive_rates" : [ 0.0, 0.25, 0.5, 0.75, 1.0, 1.0 ]
    },
    "precision_recall_curve" : {
      "precisions" : [ 1.0, 1.0, 1.0, 1.0, 1.0 ],
      "recalls" : [ 0.0, 0.25, 0.5, 0.75, 1.0 ]
    },
    "auc" : {
      "value" : 1.0,
      "standard_deviation" : "NaN"
    },
    "f0_5" : {
      "value" : 0.3846153846153846,
      "standard_deviation" : "NaN"
    },
    "f1" : {
      "value" : 0.5,
      "standard_deviation" : "NaN"
    },
    "f2" : {
      "value" : 0.7142857142857143,
      "standard_deviation" : "NaN"
    },
    "f0_5_best_constant_classifier" : {
      "value" : 0.29411764705882354,
      "standard_deviation" : "NaN"
    },
    "f1_best_constant_classifier" : {
      "value" : 0.4,
      "standard_deviation" : "NaN"
    },
    "f2_best_constant_classifier" : {
      "value" : 0.625,
      "standard_deviation" : "NaN"
    }
  }
```
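
In this example, the confusion matrix is keyed first by actual label and then by predicted label, and the headline metrics follow from it. The following sketch reproduces the arithmetic (an illustration, not Model Monitor's code):

```python
# Confusion matrix from the example above: cm[actual][predicted].
cm = {"0": {"0": 1, "1": 2}, "1": {"0": 0, "1": 1}}

tn, fp = cm["0"]["0"], cm["0"]["1"]  # actual 0: true negatives, false positives
fn, tp = cm["1"]["0"], cm["1"]["1"]  # actual 1: false negatives, true positives

recall = tp / (tp + fn)                             # 1.0
precision = tp / (tp + fp)                          # 0.333...
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.5
f1 = 2 * precision * recall / (precision + recall)  # 0.5
```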

## Multiclass metrics
<a name="model-monitor-model-quality-metrics-multi"></a>

The following shows an example of the metrics that a model quality monitoring job computes for a multiclass classification problem.

```
"multiclass_classification_metrics" : {
    "confusion_matrix" : {
      "0" : {
        "0" : 1180,
        "1" : 510
      },
      "1" : {
        "0" : 268,
        "1" : 138
      }
    },
    "accuracy" : {
      "value" : 0.6288167938931297,
      "standard_deviation" : 0.00375663881299405
    },
    "weighted_recall" : {
      "value" : 0.6288167938931297,
      "standard_deviation" : 0.003756638812994008
    },
    "weighted_precision" : {
      "value" : 0.6983172269629505,
      "standard_deviation" : 0.006195912915307507
    },
    "weighted_f0_5" : {
      "value" : 0.6803947317178771,
      "standard_deviation" : 0.005328406973561699
    },
    "weighted_f1" : {
      "value" : 0.6571162346664904,
      "standard_deviation" : 0.004385008075019733
    },
    "weighted_f2" : {
      "value" : 0.6384024354394601,
      "standard_deviation" : 0.003867109755267757
    },
    "accuracy_best_constant_classifier" : {
      "value" : 0.19370229007633588,
      "standard_deviation" : 0.0032049848450732355
    },
    "weighted_recall_best_constant_classifier" : {
      "value" : 0.19370229007633588,
      "standard_deviation" : 0.0032049848450732355
    },
    "weighted_precision_best_constant_classifier" : {
      "value" : 0.03752057718081697,
      "standard_deviation" : 0.001241536088657851
    },
    "weighted_f0_5_best_constant_classifier" : {
      "value" : 0.04473443104152011,
      "standard_deviation" : 0.0014460485504284792
    },
    "weighted_f1_best_constant_classifier" : {
      "value" : 0.06286421244683643,
      "standard_deviation" : 0.0019113576884608862
    },
    "weighted_f2_best_constant_classifier" : {
      "value" : 0.10570313141262414,
      "standard_deviation" : 0.002734216826748117
    }
  }
```

## Monitoring model quality metrics with CloudWatch
<a name="model-monitor-model-quality-cw"></a>

If you set the `enable_cloudwatch_metrics` parameter to `True` when you create the monitoring schedule, model quality monitoring jobs send all metrics to CloudWatch.

Model quality metrics appear in the following namespace:
+ For real-time endpoints: `aws/sagemaker/Endpoints/model-metrics`
+ For batch transform jobs: `aws/sagemaker/ModelMonitoring/model-metrics`

For a list of the metrics that are emitted, see the previous sections on this page.

You can use CloudWatch metrics to create an alarm when a specific metric doesn't meet the threshold you specify. For instructions about how to create CloudWatch alarms, see [Create a CloudWatch alarm based on a static threshold](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ConsoleAlarms.html) in the *CloudWatch User Guide*.
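
As a sketch, the following assembles the parameters for such an alarm, which you could pass to CloudWatch's `PutMetricAlarm` API (for example, `boto3.client("cloudwatch").put_metric_alarm(**alarm_params)`). The metric name and dimensions are assumptions; check the metrics your monitoring job actually emits in the CloudWatch console before creating the alarm.

```python
# Alarm on an assumed accuracy metric emitted by an hourly monitoring job.
alarm_params = {
    "AlarmName": "model-quality-accuracy-drop",
    "Namespace": "aws/sagemaker/Endpoints/model-metrics",  # real-time endpoints
    "MetricName": "accuracy",  # assumed; verify the emitted metric name
    "Dimensions": [
        {"Name": "Endpoint", "Value": "my-endpoint"},            # placeholder
        {"Name": "MonitoringSchedule", "Value": "my-schedule"},  # placeholder
    ],
    "Statistic": "Average",
    "Period": 3600,  # one hour, matching the hourly schedule
    "EvaluationPeriods": 1,
    "Threshold": 0.9,
    "ComparisonOperator": "LessThanThreshold",  # alarm when accuracy drops below 0.9
}
```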