

# Support for Your Own Containers With Amazon SageMaker Model Monitor
<a name="model-monitor-byoc-containers"></a>

Amazon SageMaker Model Monitor provides a prebuilt container that can analyze the data captured from endpoints or batch transform jobs for tabular datasets. If you want to bring your own container, Model Monitor provides extension points that you can use.

Under the hood, when you create a `MonitoringSchedule`, Model Monitor ultimately starts processing jobs. Therefore, the container needs to be aware of the processing job contract documented in the [How to Build Your Own Processing Container (Advanced Scenario)](build-your-own-processing-container.md) topic. Note that Model Monitor starts the processing job on your behalf according to the schedule. When invoking the job, Model Monitor sets up additional environment variables for you so that your container has enough context to process the data for that particular execution of the scheduled monitoring. For additional information on container inputs, see [Container Contract Inputs](model-monitor-byoc-contract-inputs.md).

In the container, using the preceding environment variables and context, your custom code can analyze the dataset for the current period. After this analysis is complete, you can choose to emit reports to be uploaded to an S3 bucket. The reports that the prebuilt container generates are documented in [Container Contract Outputs](model-monitor-byoc-contract-outputs.md). If you want the visualization of the reports to work in SageMaker Studio, you should follow the same format. You can also choose to emit completely custom reports.

You can also emit CloudWatch metrics from the container by following the instructions in [CloudWatch Metrics for Bring Your Own Containers](model-monitor-byoc-cloudwatch.md).

**Topics**
+ [Container Contract Inputs](model-monitor-byoc-contract-inputs.md)
+ [Container Contract Outputs](model-monitor-byoc-contract-outputs.md)
+ [CloudWatch Metrics for Bring Your Own Containers](model-monitor-byoc-cloudwatch.md)

# Container Contract Inputs
<a name="model-monitor-byoc-contract-inputs"></a>

The Amazon SageMaker Model Monitor platform invokes your container code according to a specified schedule. If you choose to write your own container code, the following environment variables are available. In this context, you can analyze the current dataset, evaluate the constraints if you choose, and emit metrics, if applicable.

The available environment variables are the same for real-time endpoints and batch transform jobs, except for the `dataset_format` variable. If you are using a real-time endpoint, the `dataset_format` variable supports the following options:

```
{\"sagemakerCaptureJson\": {\"captureIndexNames\": [\"endpointInput\",\"endpointOutput\"]}}
```

If you are using a batch transform job, the `dataset_format` supports the following options:

```
{\"csv\": {\"header\": [\"true\",\"false\"]}}
```

```
{\"json\": {\"line\": [\"true\",\"false\"]}}
```

```
{\"parquet\": {}}
```

The following code sample shows the complete set of environment variables available for your container code (and uses the `dataset_format` format for a real-time endpoint).

```
"Environment": {
 "dataset_format": "{\"sagemakerCaptureJson\": {\"captureIndexNames\": [\"endpointInput\",\"endpointOutput\"]}}",
 "dataset_source": "/opt/ml/processing/endpointdata",
 "end_time": "2019-12-01T16:20:00Z",
 "output_path": "/opt/ml/processing/resultdata",
 "publish_cloudwatch_metrics": "Disabled",
 "sagemaker_endpoint_name": "endpoint-name",
 "sagemaker_monitoring_schedule_name": "schedule-name",
 "start_time": "2019-12-01T15:20:00Z"
}
```
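In Python, a custom container's entrypoint might pick up this context as follows. This is a minimal sketch; the `.get()` defaults mirror only the sample values from the `Environment` map above so that the snippet runs standalone.

```python
import json
import os

# Read the context that Model Monitor injects as environment variables.
# Inside a real monitoring job the variables are always set; the defaults
# here are just the sample values from the Environment map above.
env = os.environ
dataset_format = json.loads(env.get(
    "dataset_format",
    '{"sagemakerCaptureJson": {"captureIndexNames": '
    '["endpointInput", "endpointOutput"]}}'))
dataset_source = env.get("dataset_source", "/opt/ml/processing/endpointdata")
output_path = env.get("output_path", "/opt/ml/processing/resultdata")
publish_metrics = env.get("publish_cloudwatch_metrics", "Disabled") == "Enabled"
start_time = env.get("start_time", "2019-12-01T15:20:00Z")
end_time = env.get("end_time", "2019-12-01T16:20:00Z")

# For a real-time endpoint, dataset_format tells you which parts of each
# capture record (input, output, or both) are present.
capture_indices = dataset_format.get("sagemakerCaptureJson", {}) \
                                .get("captureIndexNames", [])
```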

Parameters 


| Parameter Name | Description | 
| --- | --- | 
| dataset\_format |  For a job started from a `MonitoringSchedule` backed by an `Endpoint`, this is `sagemakerCaptureJson` with the capture indices `endpointInput`, `endpointOutput`, or both. For a batch transform job, this specifies the data format: CSV, JSON, or Parquet.  | 
| dataset\_source |  If you are using a real-time endpoint, the local path at which the data corresponding to the monitoring period, as specified by `start_time` and `end_time`, is available. At this path, the data is available in `/{endpoint-name}/{variant-name}/yyyy/mm/dd/hh`. SageMaker AI sometimes downloads more data than what is specified by the start and end times. It is up to the container code to parse the data as required.  | 
| output\_path |  The local path to write output reports and other files. You specify this parameter in the `CreateMonitoringSchedule` request as `MonitoringOutputConfig.MonitoringOutput[0].LocalPath`. It is uploaded to the S3 path specified in `MonitoringOutputConfig.MonitoringOutput[0].S3Uri`.  | 
| publish\_cloudwatch\_metrics |  For a job launched by `CreateMonitoringSchedule`, this parameter is set to `Enabled`. The container can choose to write the Amazon CloudWatch output file at `/opt/ml/output/metrics/cloudwatch`.  | 
| sagemaker\_endpoint\_name |  If you are using a real-time endpoint, the name of the `Endpoint` that this scheduled job was launched for.  | 
| sagemaker\_monitoring\_schedule\_name |  The name of the `MonitoringSchedule` that launched this job.  | 
| sagemaker\_endpoint\_datacapture\_prefix |  If you are using a real-time endpoint, the prefix specified in the `DataCaptureConfig` parameter of the `Endpoint`. The container can use this if it needs to directly access more data than SageMaker AI has already downloaded to the `dataset_source` path.  | 
| start\_time, end\_time |  The time window for this analysis run. For example, for a job scheduled to run at 05:00 UTC on 2020-02-20, `start_time` is `2020-02-19T06:00:00Z` and `end_time` is `2020-02-20T05:00:00Z`.  | 
| baseline\_constraints |  The local path of the baseline constraints file specified in `BaselineConfig.ConstraintsResource.S3Uri`. This is available only if this parameter was specified in the `CreateMonitoringSchedule` request.  | 
| baseline\_statistics |  The local path of the baseline statistics file specified in `BaselineConfig.StatisticsResource.S3Uri`. This is available only if this parameter was specified in the `CreateMonitoringSchedule` request.  | 
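As the `dataset_source` entry notes, captured data lands under a nested `/{endpoint-name}/{variant-name}/yyyy/mm/dd/hh` layout and may cover more than the requested window. The following sketch collects the downloaded capture files; the helper name is illustrative, and the `.jsonl` extension is an assumption about the capture file naming.

```python
import os

def list_capture_files(dataset_source):
    """Walk the downloaded data-capture tree under
    {dataset_source}/{endpoint-name}/{variant-name}/yyyy/mm/dd/hh and
    collect the capture files. Because SageMaker AI may download more
    data than the start_time..end_time window, the container should
    still filter individual records by their timestamps."""
    files = []
    for root, _dirs, names in os.walk(dataset_source):
        files.extend(os.path.join(root, name)
                     for name in names if name.endswith(".jsonl"))
    return sorted(files)
```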

# Container Contract Outputs
<a name="model-monitor-byoc-contract-outputs"></a>

The container can analyze the data available in the `dataset_source` path and write reports to the path specified in `output_path`. The container code can write any reports that suit your needs.

If you use the following structure and contract, certain output files are treated specially by SageMaker AI in the visualization and in the API. This applies only to tabular datasets.

Output Files for Tabular Datasets


| File Name | Description | 
| --- | --- | 
| statistics.json |  This file is expected to have columnar statistics for each feature in the dataset that is analyzed. The schema for this file is available in the next section.  | 
| constraints.json |  This file is expected to have the constraints on the features observed. The schema for this file is available in the next section.  | 
| constraint\_violations.json |  This file is expected to have the list of violations found in the current set of data, as compared to the baseline statistics and constraints files specified in the `baseline_constraints` and `baseline_statistics` paths.  | 

In addition, if the `publish_cloudwatch_metrics` value is `Enabled`, the container code can emit Amazon CloudWatch metrics in this location: `/opt/ml/output/metrics/cloudwatch`. The schema for these files is described in the following sections.
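A minimal sketch of emitting these reports from container code, assuming the statistics, constraints, and violations objects have already been computed (the function name is illustrative):

```python
import json
import os

def write_reports(output_path, statistics, constraints, violations):
    """Write the specially treated report files to the local output
    path; SageMaker AI uploads this directory to the S3Uri configured
    in MonitoringOutputConfig."""
    os.makedirs(output_path, exist_ok=True)
    reports = {
        "statistics.json": statistics,
        "constraints.json": constraints,
        "constraint_violations.json": violations,
    }
    for filename, body in reports.items():
        with open(os.path.join(output_path, filename), "w") as f:
            json.dump(body, f, indent=2)
```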

**Topics**
+ [Schema for Statistics (statistics.json file)](model-monitor-byoc-statistics.md)
+ [Schema for Constraints (constraints.json file)](model-monitor-byoc-constraints.md)

# Schema for Statistics (statistics.json file)
<a name="model-monitor-byoc-statistics"></a>

The schema defined in the `statistics.json` file specifies the statistical parameters to be calculated for the baseline and for the data that is captured. It also configures the buckets used by [KLL](https://datasketches.apache.org/docs/KLL/KLLSketch.html), a very compact quantile sketch with a lazy compaction scheme.

```
{
    "version": 0,
    # dataset level stats
    "dataset": {
        "item_count": number
    },
    # feature level stats
    "features": [
        {
            "name": "feature-name",
            "inferred_type": "Fractional" | "Integral",
            "numerical_statistics": {
                "common": {
                    "num_present": number,
                    "num_missing": number
                },
                "mean": number,
                "sum": number,
                "std_dev": number,
                "min": number,
                "max": number,
                "distribution": {
                    "kll": {
                        "buckets": [
                            {
                                "lower_bound": number,
                                "upper_bound": number,
                                "count": number
                            }
                        ],
                        "sketch": {
                            "parameters": {
                                "c": number,
                                "k": number
                            },
                            "data": [
                                [
                                    num,
                                    num,
                                    num,
                                    num
                                ],
                                [
                                    num,
                                    num
                                ],
                                [
                                    num,
                                    num
                                ]
                            ]
                        }#sketch
                    }#KLL
                }#distribution
            }#num_stats
        },
        {
            "name": "feature-name",
            "inferred_type": "String",
            "string_statistics": {
                "common": {
                    "num_present": number,
                    "num_missing": number
                },
                "distinct_count": number,
                "distribution": {
                    "categorical": {
                         "buckets": [
                                {
                                    "value": "string",
                                    "count": number
                                }
                          ]
                     }
                }
            },
            #provision for custom stats
        }
    ]
}
```

**Notes**  
The metrics specified here are recognized by SageMaker AI and used in visualizations. The container can emit additional metrics if required.  
The [KLL sketch](https://datasketches.apache.org/docs/KLL/KLLSketch.html) is the recognized sketch. Custom containers can write their own representation, but it won't be recognized by SageMaker AI in visualizations.  
By default, the distribution is materialized in 10 buckets. You can't change this.
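For a numeric feature, the `numerical_statistics` fields from the schema above can be computed with plain Python. This is a sketch; the KLL `distribution` block is omitted for brevity.

```python
import math

def numerical_statistics(values):
    """Compute the common and numeric fields from the statistics schema
    for one feature column; None entries count as missing values."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / n
    variance = sum((v - mean) ** 2 for v in present) / n
    return {
        "common": {"num_present": n, "num_missing": len(values) - n},
        "mean": mean,
        "sum": sum(present),
        "std_dev": math.sqrt(variance),
        "min": min(present),
        "max": max(present),
    }
```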

# Schema for Constraints (constraints.json file)
<a name="model-monitor-byoc-constraints"></a>

A `constraints.json` file is used to express the constraints that a dataset must satisfy. Amazon SageMaker Model Monitor containers can use the `constraints.json` file to evaluate datasets against. Prebuilt containers provide the ability to generate the `constraints.json` file automatically for a baseline dataset. If you bring your own container, you can provide it with similar abilities, or you can create the `constraints.json` file in some other way. The following is the schema for the constraints file that the prebuilt container uses. Containers that you bring can adopt the same format or enhance it as required.

```
{
    "version": 0,
    "features":
    [
        {
            "name": "string",
            "inferred_type": "Integral" | "Fractional" |
                    "String" | "Unknown",
            "completeness": number,
            "num_constraints":
            {
                "is_non_negative": boolean
            },
            "string_constraints":
            {
                "domains":
                [
                    "list of",
                    "observed values",
                    "for small cardinality"
                ]
            },
            "monitoringConfigOverrides":
            {}
        }
    ],
    "monitoring_config":
    {
        "evaluate_constraints": "Enabled",
        "emit_metrics": "Enabled",
        "datatype_check_threshold": 0.1,
        "domain_content_threshold": 0.1,
        "distribution_constraints":
        {
            "perform_comparison": "Enabled",
            "comparison_threshold": 0.1,
            "comparison_method": "Simple" | "Robust",
            "categorical_comparison_threshold": 0.1,
            "categorical_drift_method": "LInfinity" | "ChiSquared"
        }
    }
}
```
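A hedged sketch of how container code might evaluate two of these constraints, `is_non_negative` and the string `domains` list, against in-memory rows. The violation field names mimic the report style but are assumptions here, as is the function name.

```python
def evaluate_constraints(rows, constraints):
    """Check each feature's constraints against rows (a list of dicts)
    and collect violations in a simple illustrative form."""
    violations = []
    for feature in constraints["features"]:
        name = feature["name"]
        values = [row[name] for row in rows if row.get(name) is not None]
        # Numeric constraint: flag any negative observation.
        num = feature.get("num_constraints", {})
        if num.get("is_non_negative") and any(v < 0 for v in values):
            violations.append({"feature_name": name,
                               "constraint_check_type": "is_non_negative"})
        # String constraint: flag values outside the observed domain.
        domains = feature.get("string_constraints", {}).get("domains")
        if domains is not None and any(v not in domains for v in values):
            violations.append({"feature_name": name,
                               "constraint_check_type": "domain_content_check"})
    return violations
```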

The `monitoring_config` object contains options for the monitoring job. The following table describes each option.

Monitoring Constraints

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-byoc-constraints.html)

# CloudWatch Metrics for Bring Your Own Containers
<a name="model-monitor-byoc-cloudwatch"></a>

If the `publish_cloudwatch_metrics` value is `Enabled` in the `Environment` map in the `/opt/ml/processing/processingjobconfig.json` file, the container code emits Amazon CloudWatch metrics in this location: `/opt/ml/output/metrics/cloudwatch`. 

The schema for this file is closely based on the CloudWatch `PutMetricData` API. The namespace is not specified in the file. It defaults to the following:
+ For real-time endpoints: `/aws/sagemaker/Endpoint/data-metrics`
+ For batch transform jobs: `/aws/sagemaker/ModelMonitoring/data-metrics`

However, you can specify dimensions. We recommend you add the following dimensions at minimum:
+ `Endpoint` and `MonitoringSchedule` for real-time endpoints
+ `MonitoringSchedule` for batch transform jobs

The following JSON snippets show how to set your dimensions.

For a real-time endpoint, see the following JSON snippet which includes the `Endpoint` and `MonitoringSchedule` dimensions:

```
{ 
    "MetricName": "", # Required
    "Timestamp": "2019-11-26T03:00:00Z", # Required
    "Dimensions": [{"Name":"Endpoint","Value":"endpoint_0"},{"Name":"MonitoringSchedule","Value":"schedule_0"}],
    "Value": Float,
    # Either the Value or the StatisticValues field can be populated and not both.
    "StatisticValues": {
        "SampleCount": Float,
        "Sum": Float,
        "Minimum": Float,
        "Maximum": Float
    },
    "Unit": "Count" # Optional
}
```

For a batch transform job, see the following JSON snippet which includes the `MonitoringSchedule` dimension:

```
{ 
    "MetricName": "", # Required
    "Timestamp": "2019-11-26T03:00:00Z", # Required
    "Dimensions": [{"Name":"MonitoringSchedule","Value":"schedule_0"}],
    "Value": Float,
    # Either the Value or the StatisticValues field can be populated and not both.
    "StatisticValues": {
        "SampleCount": Float,
        "Sum": Float,
        "Minimum": Float,
        "Maximum": Float
    },
    "Unit": "Count" # Optional
}
```
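A sketch of appending one such record from container code. The contract specifies only the output directory; the `metrics.jsonl` file name and the one-JSON-object-per-line layout are assumptions here.

```python
import datetime
import json
import os

def emit_metric(name, value, dimensions,
                out_dir="/opt/ml/output/metrics/cloudwatch"):
    """Append one metric record to the directory that SageMaker AI
    scans for CloudWatch metrics. Populates "Value" only; "Value" and
    "StatisticValues" are mutually exclusive."""
    os.makedirs(out_dir, exist_ok=True)
    record = {
        "MetricName": name,
        "Timestamp": datetime.datetime.now(datetime.timezone.utc)
                                      .strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Dimensions": [{"Name": k, "Value": v}
                       for k, v in dimensions.items()],
        "Value": float(value),
        "Unit": "Count",
    }
    with open(os.path.join(out_dir, "metrics.jsonl"), "a") as f:
        f.write(json.dumps(record) + "\n")
```

For example, a batch transform monitor might call `emit_metric("feature_drift_count", 2, {"MonitoringSchedule": "schedule_0"})` once per analysis run.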