

# Data quality
<a name="model-monitor-data-quality"></a>

Data quality monitoring automatically monitors machine learning (ML) models in production and notifies you when data quality issues arise. ML models in production have to make predictions on real-life data that is not carefully curated like most training datasets. If the statistical nature of the data that your model receives while in production drifts away from the nature of the baseline data it was trained on, the model begins to lose accuracy in its predictions. Amazon SageMaker Model Monitor uses rules to detect data drift and alerts you when it happens. To monitor data quality, follow these steps:
+ Enable data capture. This captures inference input and output from a real-time inference endpoint or batch transform job and stores the data in Amazon S3. For more information, see [Data capture](model-monitor-data-capture.md).
+ Create a baseline. In this step, you run a baseline job that analyzes an input dataset that you provide. The baseline computes baseline schema constraints and statistics for each feature using [Deequ](https://github.com/awslabs/deequ), an open source library built on Apache Spark, which is used to measure data quality in large datasets. For more information, see [Create a Baseline](model-monitor-create-baseline.md).
+ Define and schedule data quality monitoring jobs. For specific information and code samples of data quality monitoring jobs, see [Schedule data quality monitoring jobs](model-monitor-schedule-data-monitor.md). For general information about monitoring jobs, see [Schedule monitoring jobs](model-monitor-scheduling.md).
  + Optionally use preprocessing and postprocessing scripts to transform the data coming out of your data quality analysis. For more information, see [Preprocessing and Postprocessing](model-monitor-pre-and-post-processing.md).
+ View data quality metrics. For more information, see [Schema for Statistics (statistics.json file)](model-monitor-interpreting-statistics.md).
+ Integrate data quality monitoring with Amazon CloudWatch. For more information, see [CloudWatch Metrics](model-monitor-interpreting-cloudwatch.md).
+ Interpret the results of a monitoring job. For more information, see [Interpret results](model-monitor-interpreting-results.md).
+ Use SageMaker Studio to enable data quality monitoring and visualize results if you are using a real-time endpoint. For more information, see [Visualize results for real-time endpoints in Amazon SageMaker Studio](model-monitor-interpreting-visualize-results.md).

**Note**  
Model Monitor computes model metrics and statistics on tabular data only. For example, consider an image classification model that takes images as input and outputs a label. Model Monitor can calculate metrics and statistics for the model's output, but not for its image input.

**Topics**
+ [Create a Baseline](model-monitor-create-baseline.md)
+ [Schedule data quality monitoring jobs](model-monitor-schedule-data-monitor.md)
+ [Schema for Statistics (statistics.json file)](model-monitor-interpreting-statistics.md)
+ [CloudWatch Metrics](model-monitor-interpreting-cloudwatch.md)
+ [Schema for Violations (constraint_violations.json file)](model-monitor-interpreting-violations.md)

# Create a Baseline
<a name="model-monitor-create-baseline"></a>

The baseline calculations of statistics and constraints are needed as a standard against which data drift and other data quality issues can be detected. Model Monitor provides a built-in container that can automatically suggest constraints for CSV and flat JSON input. This *sagemaker-model-monitor-analyzer* container also provides you with a range of model monitoring capabilities, including constraint validation against a baseline and emitting Amazon CloudWatch metrics. The container is based on Spark version 3.3.0 and is built with [Deequ](https://github.com/awslabs/deequ) version 2.0.2. All column names in your baseline dataset must be Spark-compliant: use only lowercase characters, with `_` as the only special character.
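You can enforce this column-name rule before uploading a dataset. The following is a minimal sketch; the `spark_safe` helper is illustrative and not part of the SageMaker SDK:

```python
import re

def spark_safe(name: str) -> str:
    """Rewrite a column name to lowercase with `_` as the only special character."""
    name = re.sub(r"[^a-z0-9]+", "_", name.strip().lower())
    return name.strip("_")

# Example: the header row of a CSV before uploading it as a baseline dataset.
header = ["Pass Rate (%)", "Churn?", "account_length"]
print([spark_safe(col) for col in header])
# → ['pass_rate', 'churn', 'account_length']
```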

The training dataset that you used to train the model is usually a good baseline dataset. The schema of the training dataset and the schema of the inference dataset should match exactly (the same number and order of features). Note that the prediction/output columns are assumed to be the first columns in the training dataset. From the training dataset, you can ask SageMaker AI to suggest a set of baseline constraints and generate descriptive statistics to explore the data. For this example, upload the training dataset that was used to train the included pretrained model. If you already stored the training dataset in Amazon S3, you can point to it directly.

**To create a baseline from a training dataset** 

When you have your training data ready and stored in Amazon S3, start a baseline processing job with `DefaultModelMonitor.suggest_baseline(..)` using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). This uses an [Amazon SageMaker Model Monitor prebuilt container](model-monitor-pre-built-container.md) that generates baseline statistics and suggests baseline constraints for the dataset and writes them to the `output_s3_uri` location that you specify.

```
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

my_default_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri+'/training-dataset-with-header.csv',
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_results_uri,
    wait=True
)
```

**Note**  
If you provide the feature/column names in the training dataset as the first row and set the `header=True` option as shown in the previous code sample, SageMaker AI uses the feature name in the constraints and statistics file.

The baseline statistics for the dataset are contained in the statistics.json file and the suggested baseline constraints are contained in the constraints.json file in the location you specify with `output_s3_uri`.

Output Files for Tabular Dataset Statistics and Constraints


| File Name | Description | 
| --- | --- | 
| statistics.json |  This file is expected to have columnar statistics for each feature in the dataset that is analyzed. For more information about the schema for this file, see [Schema for Statistics (statistics.json file)](model-monitor-byoc-statistics.md).  | 
| constraints.json |  This file is expected to have the constraints on the features observed. For more information about the schema for this file, see [Schema for Constraints (constraints.json file)](model-monitor-byoc-constraints.md).  | 
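After the baseline job finishes, you can download and inspect these files like any other JSON document. The following sketch parses a minimal stand-in for statistics.json; the feature name and values are made up for illustration:

```python
import json

# A minimal stand-in for a statistics.json file downloaded from output_s3_uri.
statistics = json.loads("""
{
    "version": 0,
    "dataset": {"item_count": 1000},
    "features": [
        {
            "name": "account_length",
            "inferred_type": "Integral",
            "numerical_statistics": {
                "common": {"num_present": 990, "num_missing": 10},
                "mean": 101.3, "sum": 100287.0, "std_dev": 39.5,
                "min": 1.0, "max": 243.0
            }
        }
    ]
}
""")

# Report each feature's inferred type, mean, and completeness (fraction non-null).
for feature in statistics["features"]:
    stats = feature.get("numerical_statistics", {})
    common = stats.get("common", {})
    total = common.get("num_present", 0) + common.get("num_missing", 0)
    completeness = common.get("num_present", 0) / total if total else 0.0
    print(feature["name"], feature["inferred_type"],
          "mean:", stats.get("mean"), "completeness:", round(completeness, 3))
```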

The [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) provides convenience functions for generating the baseline statistics and constraints. However, if you want to run the processing job directly for this purpose instead, you need to set the `Environment` map as shown in the following example:

```
"Environment": {
    "dataset_format": "{\"csv\": {\"header\": true}}",
    "dataset_source": "/opt/ml/processing/sm_input",
    "output_path": "/opt/ml/processing/sm_output",
    "publish_cloudwatch_metrics": "Disabled"
}
```

# Schedule data quality monitoring jobs
<a name="model-monitor-schedule-data-monitor"></a>

After you create your baseline, you can call the `create_monitoring_schedule()` method of your `DefaultModelMonitor` class instance to schedule an hourly data quality monitor. The following sections show you how to create a data quality monitor for a model deployed to a real-time endpoint as well as for a batch transform job.

**Important**  
You can specify either a batch transform input or an endpoint input, but not both, when you create your monitoring schedule.

## Data quality monitoring for models deployed to real-time endpoints
<a name="model-monitor-data-quality-rt"></a>

To schedule a data quality monitor for a real-time endpoint, pass your `EndpointInput` instance to the `endpoint_input` argument of the `create_monitoring_schedule()` method of your `DefaultModelMonitor` instance, as shown in the following code sample:

```
import sagemaker
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    DefaultModelMonitor,
    EndpointInput,
)

data_quality_model_monitor = DefaultModelMonitor(
   role=sagemaker.get_execution_role(),
   ...
)

schedule = data_quality_model_monitor.create_monitoring_schedule(
   monitor_schedule_name=schedule_name,
   post_analytics_processor_script=s3_code_postprocessor_uri,
   output_s3_uri=s3_report_path,
   statistics=data_quality_model_monitor.baseline_statistics(),
   constraints=data_quality_model_monitor.suggested_constraints(),
   schedule_cron_expression=CronExpressionGenerator.hourly(),
   enable_cloudwatch_metrics=True,
   endpoint_input=EndpointInput(
        endpoint_name=endpoint_name,
        destination="/opt/ml/processing/input/endpoint",
   ),
)
```

## Data quality monitoring for batch transform jobs
<a name="model-monitor-data-quality-bt"></a>

To schedule a data quality monitor for a batch transform job, pass your `BatchTransformInput` instance to the `batch_transform_input` argument of the `create_monitoring_schedule()` method of your `DefaultModelMonitor` instance, as shown in the following code sample:

```
import sagemaker
from sagemaker.model_monitor import (
    BatchTransformInput,
    CronExpressionGenerator,
    DefaultModelMonitor,
)
from sagemaker.model_monitor.dataset_format import MonitoringDatasetFormat

data_quality_model_monitor = DefaultModelMonitor(
   role=sagemaker.get_execution_role(),
   ...
)

schedule = data_quality_model_monitor.create_monitoring_schedule(
    monitor_schedule_name=mon_schedule_name,
    batch_transform_input=BatchTransformInput(
        data_captured_destination_s3_uri=s3_capture_upload_path,
        destination="/opt/ml/processing/input",
        dataset_format=MonitoringDatasetFormat.csv(header=False),
    ),
    output_s3_uri=s3_report_path,
    statistics=statistics_path,
    constraints=constraints_path,
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
```

# Schema for Statistics (statistics.json file)
<a name="model-monitor-interpreting-statistics"></a>

The Amazon SageMaker Model Monitor prebuilt container computes per-column/per-feature statistics. The statistics are calculated for the baseline dataset and also for the current dataset that is being analyzed.

```
{
    "version": 0,
    # dataset level stats
    "dataset": {
        "item_count": number
    },
    # feature level stats
    "features": [
        {
            "name": "feature-name",
            "inferred_type": "Fractional" | "Integral",
            "numerical_statistics": {
                "common": {
                    "num_present": number,
                    "num_missing": number
                },
                "mean": number,
                "sum": number,
                "std_dev": number,
                "min": number,
                "max": number,
                "distribution": {
                    "kll": {
                        "buckets": [
                            {
                                "lower_bound": number,
                                "upper_bound": number,
                                "count": number
                            }
                        ],
                        "sketch": {
                            "parameters": {
                                "c": number,
                                "k": number
                            },
                            "data": [
                                [
                                    num,
                                    num,
                                    num,
                                    num
                                ],
                                [
                                    num,
                                    num
                                ],
                                [
                                    num,
                                    num
                                ]
                            ]
                        }#sketch
                    }#KLL
                }#distribution
            }#num_stats
        },
        {
            "name": "feature-name",
            "inferred_type": "String",
            "string_statistics": {
                "common": {
                    "num_present": number,
                    "num_missing": number
                },
                "distinct_count": number,
                "distribution": {
                    "categorical": {
                         "buckets": [
                                {
                                    "value": "string",
                                    "count": number
                                }
                          ]
                     }
                }
            },
            #provision for custom stats
        }
    ]
}
```

Note the following:
+ The prebuilt containers compute [KLL sketch](https://datasketches.apache.org/docs/KLL/KLLSketch.html), which is a compact quantiles sketch.
+ By default, we materialize the distribution in 10 buckets. This is not currently configurable.
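The bucketed distribution lets you compare a feature across a baseline and a current statistics file even when the two datasets differ in size. A minimal illustration of normalizing the raw counts (the bucket values below are made up; the container's actual drift metric is computed differently):

```python
# Illustrative bucket list, in the shape found under
# features[i].numerical_statistics.distribution.kll.buckets.
baseline_buckets = [
    {"lower_bound": 0.0, "upper_bound": 50.0, "count": 800},
    {"lower_bound": 50.0, "upper_bound": 100.0, "count": 200},
]

def normalize(buckets):
    """Turn raw bucket counts into fractions so datasets of different sizes compare."""
    total = sum(b["count"] for b in buckets)
    return [b["count"] / total for b in buckets]

print(normalize(baseline_buckets))  # → [0.8, 0.2]
```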

# CloudWatch Metrics
<a name="model-monitor-interpreting-cloudwatch"></a>

You can use the built-in Amazon SageMaker Model Monitor container for CloudWatch metrics. When the `emit_metrics` option is `Enabled` in the baseline constraints file, SageMaker AI emits these metrics for each feature/column observed in the dataset in the following namespace:
+ For real-time endpoints: the `/aws/sagemaker/Endpoints/data-metric` namespace with the `EndpointName` and `ScheduleName` dimensions.
+ For batch transform jobs: the `/aws/sagemaker/ModelMonitoring/data-metric` namespace with the `MonitoringSchedule` dimension.

For numerical fields, the built-in container emits the following CloudWatch metrics:
+ Metric: Max → query for `MetricName: feature_data_{feature_name}, Stat: Max`
+ Metric: Min → query for `MetricName: feature_data_{feature_name}, Stat: Min`
+ Metric: Sum → query for `MetricName: feature_data_{feature_name}, Stat: Sum`
+ Metric: SampleCount → query for `MetricName: feature_data_{feature_name}, Stat: SampleCount`
+ Metric: Average → query for `MetricName: feature_data_{feature_name}, Stat: Average`

For both numerical and string fields, the built-in container emits the following CloudWatch metrics:
+ Metric: Completeness → query for `MetricName: feature_non_null_{feature_name}, Stat: Sum`
+ Metric: Baseline Drift → query for `MetricName: feature_baseline_drift_{feature_name}, Stat: Sum`
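The metric names and dimensions above can be assembled into a CloudWatch query. The following sketch builds the parameters for a `GetMetricStatistics` call on one feature of a real-time endpoint; the endpoint, schedule, and feature names are illustrative, and the `boto3` call is commented out:

```python
from datetime import datetime, timedelta, timezone

def feature_metric_query(endpoint_name, schedule_name, feature_name):
    """Build GetMetricStatistics parameters for one monitored feature."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "/aws/sagemaker/Endpoints/data-metric",
        "MetricName": f"feature_data_{feature_name}",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "ScheduleName", "Value": schedule_name},
        ],
        "StartTime": end - timedelta(hours=1),
        "EndTime": end,
        "Period": 3600,
        "Statistics": ["Average", "Max", "Min"],
    }

params = feature_metric_query("my-endpoint", "my-hourly-schedule", "account_length")
# cloudwatch = boto3.client("cloudwatch")
# response = cloudwatch.get_metric_statistics(**params)
print(params["MetricName"])  # → feature_data_account_length
```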

# Schema for Violations (constraint_violations.json file)
<a name="model-monitor-interpreting-violations"></a>

The violations file is generated as the output of a `MonitoringExecution`, which lists the results of evaluating the constraints (specified in the constraints.json file) against the current dataset that was analyzed. The Amazon SageMaker Model Monitor prebuilt container provides the following violation checks.

```
{
    "violations": [{
      "feature_name" : "string",
      "constraint_check_type" :
              "data_type_check"
            | "completeness_check"
            | "baseline_drift_check"
            | "missing_column_check"
            | "extra_column_check"
            | "categorical_values_check",
      "description" : "string"
    }]
}
```

Types of Violations Monitored 


| Violation Check Type | Description  | 
| --- | --- | 
| data\_type\_check | If the data types in the current execution are not the same as in the baseline dataset, this violation is flagged. During the baseline step, the generated constraints suggest the inferred data type for each column. The `monitoring_config.datatype_check_threshold` parameter can be tuned to adjust the threshold on when it is flagged as a violation.  | 
| completeness\_check | If the completeness (% of non-null items) observed in the current execution is below the threshold specified per feature, this violation is flagged. During the baseline step, the generated constraints suggest a completeness value.   | 
| baseline\_drift\_check | If the calculated distribution distance between the current and the baseline datasets is more than the threshold specified in `monitoring_config.comparison_threshold`, this violation is flagged.  | 
| missing\_column\_check | If the number of columns in the current dataset is less than the number in the baseline dataset, this violation is flagged.  | 
| extra\_column\_check | If the number of columns in the current dataset is more than the number in the baseline, this violation is flagged.  | 
| categorical\_values\_check | If there are more unknown values in the current dataset than in the baseline dataset, this violation is flagged. This value is dictated by the threshold in `monitoring_config.domain_content_threshold`.  | 
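Two of these checks are simple enough to sketch locally. The following helpers are illustrative only, not the container's actual implementation, and show the logic behind `missing_column_check` and `completeness_check` in the violations-file shape above:

```python
def check_missing_columns(baseline_columns, current_columns):
    """Flag a missing_column_check violation for each baseline column absent now."""
    missing = set(baseline_columns) - set(current_columns)
    return [
        {
            "feature_name": name,
            "constraint_check_type": "missing_column_check",
            "description": f"Column {name} is in the baseline but not the current dataset.",
        }
        for name in sorted(missing)
    ]

def check_completeness(feature_name, num_present, num_missing, threshold):
    """Flag a completeness_check violation if the non-null fraction drops below threshold."""
    completeness = num_present / (num_present + num_missing)
    if completeness < threshold:
        return {
            "feature_name": feature_name,
            "constraint_check_type": "completeness_check",
            "description": f"Completeness {completeness:.2f} is below the required {threshold}.",
        }
    return None

print(check_missing_columns(["account_length", "churn"], ["churn"]))
print(check_completeness("account_length", num_present=900, num_missing=100, threshold=0.95))
```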