This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.

# Governance and control
<a name="governance-and-control"></a>

## Guardrails
<a name="guardrails"></a>

Large enterprises with strict security and compliance requirements need to set up guardrails for operating their ML environments. IAM policies can enforce guardrails for different users and roles, such as requiring proper resource tagging or limiting the types of resources that can be used. For enterprise-scale guardrail management, consider [AWS Organizations](https://aws.amazon.com/organizations/). Its service control policies (SCPs) feature lets you attach an SCP to an AWS Organizations entity: the root, an organizational unit (OU), or an account. SCPs do not grant permissions by themselves. You still need to attach [identity-based or resource-based policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_identity-vs-resource.html) to IAM users or roles, or to the resources in your organization's accounts, to actually grant permissions. When an IAM user or role belongs to an account that is a member of an organization, the SCPs limit the user's or role's [effective permissions](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html#scp-effects-on-permissions).

![\[A diagram that shows managing guardrails with AWS Organizations and Service Control Policies.\]](http://docs.aws.amazon.com/whitepapers/latest/build-secure-enterprise-ml-platform/images/build-ml-17.png)


*Managing guardrails with AWS Organizations and Service Control Policies*

### Enforcing encryption
<a name="enforcing-encryption"></a>
+ **Enforcing notebook encryption** — SageMaker AI Notebook Instance EBS volume encryption can be enforced using the `sagemaker:VolumeKmsKey` condition key.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerNoteBookEnforceEncryption",
        "Effect": "Deny",
        "Action": [
          "sagemaker:CreateNotebookInstance",
          "sagemaker:UpdateNotebookInstance"
        ],
        "Resource": "*",
        "Condition": {
          "Null": {
            "sagemaker:VolumeKmsKey": "true"
          }
        }
      }
    ]
  }
  ```

+ **Enforcing Studio Notebook EFS encryption** — The EFS storage encryption can be enforced using the `sagemaker:VolumeKmsKey` condition key.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerStudioEnforceEncryption",
        "Effect": "Deny",
        "Action": [
          "sagemaker:CreateDomain"
        ],
        "Resource": "*",
        "Condition": {
          "Null": {
            "sagemaker:VolumeKmsKey": "true"
          }
        }
      }
    ]
  }
  ```

+ **Enforcing job encryption** — Similarly, encryption for SageMaker AI training jobs, processing jobs, transform jobs, and hyperparameter tuning jobs can be enforced using the `sagemaker:VolumeKmsKey` condition key.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerJobEnforceEncryption",
        "Effect": "Deny",
        "Action": [
          "sagemaker:CreateHyperParameterTuningJob",
          "sagemaker:CreateProcessingJob",
          "sagemaker:CreateTrainingJob",
          "sagemaker:CreateTransformJob"
        ],
        "Resource": "*",
        "Condition": {
          "Null": {
            "sagemaker:VolumeKmsKey": "true"
          }
        }
      }
    ]
  }
  ```

+ **Enforcing inter-container traffic encryption** — For highly sensitive distributed training jobs and tuning jobs, the `sagemaker:InterContainerTrafficEncryption` condition key can be used to require encryption of inter-container traffic.
**Note**  
 Enabling inter-container traffic encryption can slow down training.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerEnforceInterContainerTrafficEncryption",
        "Effect": "Deny",
        "Action": [
          "sagemaker:CreateHyperParameterTuningJob",
          "sagemaker:CreateTrainingJob"
        ],
        "Resource": "*",
        "Condition": {
          "Bool": {
            "sagemaker:InterContainerTrafficEncryption": "false"
          }
        }
      }
    ]
  }
  ```


### Controlling data egress
<a name="controlling-data-egress"></a>
+ **Enforcing deployment in a VPC** — To make sure SageMaker AI resources are created with a VPC configuration, so that their traffic is routed through your VPC and governed by its security groups, use the `sagemaker:VpcSubnets` and `sagemaker:VpcSecurityGroupIds` condition keys. The following policy denies the listed create actions when both keys are absent from the request.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerEnforceVPCDeployment",
        "Effect": "Deny",
        "Action": [
          "sagemaker:CreateHyperParameterTuningJob",
          "sagemaker:CreateModel",
          "sagemaker:CreateNotebookInstance",
          "sagemaker:CreateProcessingJob",
          "sagemaker:CreateTrainingJob"
        ],
        "Resource": "*",
        "Condition": {
          "Null": {
            "sagemaker:VpcSubnets": "true",
            "sagemaker:VpcSecurityGroupIds": "true"
          }
        }
      }
    ]
  }
  ```

+ **Enforcing network isolation** — Network traffic for the algorithm container can be blocked by requiring network isolation with the `sagemaker:NetworkIsolation` condition key.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "NetworkIsolation",
        "Effect": "Deny",
        "Action": [
          "sagemaker:CreateHyperParameterTuningJob",
          "sagemaker:CreateTrainingJob"
        ],
        "Resource": "*",
        "Condition": {
          "Bool": {
            "sagemaker:NetworkIsolation": "false"
          }
        }
      }
    ]
  }
  ```

+ **Restricting access to SageMaker AI API and runtime by IP address** — You can restrict the IP address ranges that are allowed to invoke SageMaker AI APIs by using the `aws:SourceIp` condition key.
+ **Restricting Studio and notebook pre-signed URLs to IPs** — Launching SageMaker AI Studio or a SageMaker AI Notebook instance through a pre-signed URL can be restricted to specific IP address ranges by using the `aws:SourceIp` condition key.
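
As an illustration of the two IP-based restrictions above, the following is a minimal sketch rather than a policy taken from this whitepaper. It assumes the allowed ranges are `192.0.2.0/24` and `203.0.113.0/24` (placeholder documentation ranges) and uses hypothetical `Sid` values prefixed with `Example`. Requests that arrive through a VPC endpoint do not carry `aws:SourceIp`, and a negated operator such as `NotIpAddress` matches when the key is absent, so these Deny statements would also block such requests unless the policy is adapted.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExampleRestrictInvokeBySourceIp",
      "Effect": "Deny",
      "Action": "sagemaker:InvokeEndpoint",
      "Resource": "*",
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": [
            "192.0.2.0/24",
            "203.0.113.0/24"
          ]
        }
      }
    },
    {
      "Sid": "ExampleRestrictPresignedUrlsBySourceIp",
      "Effect": "Deny",
      "Action": [
        "sagemaker:CreatePresignedDomainUrl",
        "sagemaker:CreatePresignedNotebookInstanceUrl"
      ],
      "Resource": "*",
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": [
            "192.0.2.0/24",
            "203.0.113.0/24"
          ]
        }
      }
    }
  ]
}
```

The first statement covers the runtime API and the second covers the pre-signed URL calls used to open Studio and notebook instances. The action lists can be widened to other SageMaker AI APIs as needed.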

### Disabling internet access
<a name="disabling-internet-access"></a>
+ **Disabling SageMaker AI Notebook internet access** — To prevent notebook instances from being created with direct internet access, use the `sagemaker:DirectInternetAccess` condition key.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerPreventDirectInternet",
        "Effect": "Deny",
        "Action": "sagemaker:CreateNotebookInstance",
        "Resource": "*",
        "Condition": {
          "StringEquals": {
            "sagemaker:DirectInternetAccess": [
              "Enabled"
            ]
          }
        }
      }
    ]
  }
  ```

+ **Disabling Studio Domain internet access** — For SageMaker AI Studio, the `sagemaker:AppNetworkAccessType` condition key can be used to prevent creating a Studio domain with direct internet access:

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerPreventDirectInternetforStudio",
        "Effect": "Deny",
        "Action": "sagemaker:CreateDomain",
        "Resource": "*",
        "Condition": {
          "StringEquals": {
            "sagemaker:AppNetworkAccessType": [
              "PublicInternetOnly"
            ]
          }
        }
      }
    ]
  }
  ```


### Preventing privilege escalation
<a name="preventing-privilege-escalation"></a>
+ **Disabling SageMaker AI Notebook root access** — AWS recommends disabling the root access to SageMaker AI Notebooks for the data scientists and ML engineers. The following policy prevents a user from launching a SageMaker AI Notebook if `RootAccess` is not disabled.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerDenyRootAccess",
        "Effect": "Deny",
        "Action": [
          "sagemaker:CreateNotebookInstance",
          "sagemaker:UpdateNotebookInstance"
        ],
        "Resource": "*",
        "Condition": {
          "StringEquals": {
            "sagemaker:RootAccess": [
              "Enabled"
            ]
          }
        }
      }
    ]
  }
  ```


### Enforcing tags
<a name="enforcing-tags"></a>
+ **Requiring tag for API call in dev environment** - The following policy denies creating a SageMaker AI endpoint unless the request tags it with `environment` set to `dev`.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerRequireEnvTag",
        "Effect": "Deny",
        "Action": "sagemaker:CreateEndpoint",
        "Resource": "arn:aws:sagemaker:*:*:endpoint/*",
        "Condition": {
          "StringNotEquals": {
            "aws:RequestTag/environment": "dev"
          }
        }
      }
    ]
  }
  ```

+ **Requiring tag for Studio domains in data science accounts** - To ensure that administrators appropriately tag Studio domains, kernels, and notebooks on creation, you can use policies such as the following. The example covers Studio domains: for data science accounts inside an OU, it denies creating a domain unless the request tags it with `Project` set to `data_science`.

  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Sid": "RequireAppTag",
              "Effect": "Deny",
              "Action": [
                  "sagemaker:CreateDomain"
              ],
              "Resource": "*",
              "Condition": {
                  "StringNotLike": {
                      "aws:RequestTag/Project": "data_science"
                  }
              }
          }
      ]
  }
  ```


### Controlling cost
<a name="controlling-cost"></a>
+ **Enforcing instance type for a SageMaker AI Notebook instance** — The following policy ensures that only the listed instance types can be used to create a notebook instance.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerLimitInstanceTypes",
        "Effect": "Deny",
        "Action": "sagemaker:CreateNotebookInstance",
        "Resource": "*",
        "Condition": {
          "ForAnyValue:StringNotLike": {
            "sagemaker:InstanceTypes": [
              "ml.c5.xlarge",
              "ml.m5.xlarge",
              "ml.t3.medium"
            ]
          }
        }
      }
    ]
  }
  ```

+ **Enforcing instance type for Studio Notebook instance** — The following policy helps restrict the instance types that can be used for SageMaker AI Studio notebooks.

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Sid": "SageMakerAllowedInstanceTypes",
        "Effect": "Deny",
        "Action": [
          "sagemaker:CreateApp"
        ],
        "Resource": "*",
        "Condition": {
          "ForAnyValue:StringNotLike": {
            "sagemaker:InstanceTypes": [
              "ml.c5.large",
              "ml.m5.large",
              "ml.t3.medium"
            ]
          }
        }
      }
    ]
  }
  ```


# Model inventory management
<a name="model-inventory-management"></a>

Model inventory management is an important component of model risk management (MRM). All models deployed in production need to be accurately registered and versioned to enable model lineage tracking and auditing. SageMaker AI provides a model registry feature for cataloging models for production and managing different model versions. With SageMaker AI model registry, you can also associate metadata with a model, such as training metrics, model owner name, and approval status.

There are several approaches for managing the model inventory across different accounts and for different environments. Following are two approaches within the context of building an ML platform.
+ **Distributed model management approach** — With this approach, the model files are stored in the account / environment in which they are generated, and each model is registered in the SageMaker AI model registry belonging to that account. For example, each business unit can have its own ML UAT / Test account, and the models generated by the automation pipelines are stored and registered in the business unit’s own UAT / Test account.
+ **Central model management approach** — With this approach, all models generated by the automated pipelines are stored in the Shared Services account along with the associated inference Docker container images, and a model package group is created to track different versions of a model. When model deployment is required in the production account, create a model in the production account using a versioned model Amazon Resource Name (ARN) from the central model package repository and then deploy the model in the production environment. 

![\[A diagram that shows central model inventory management.\]](http://docs.aws.amazon.com/whitepapers/latest/build-secure-enterprise-ml-platform/images/build-ml-18.png)


*Central model inventory management*
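
For the central approach, the production account needs cross-account access to the model package group held in the Shared Services account. One way to grant it is a resource policy on the model package group. The following is a sketch under assumptions that do not come from this whitepaper: `111122223333` stands for the Shared Services account, `444455556666` for the production account, `us-east-1` for the Region, and `example-model-group` for a hypothetical model package group name. The production account would also need read access to the model files in S3 (and any KMS key that protects them) and to the inference image in Amazon ECR, which this sketch does not cover.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExampleAllowProdDescribeGroup",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::444455556666:root"
      },
      "Action": "sagemaker:DescribeModelPackageGroup",
      "Resource": "arn:aws:sagemaker:us-east-1:111122223333:model-package-group/example-model-group"
    },
    {
      "Sid": "ExampleAllowProdUseVersions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::444455556666:root"
      },
      "Action": [
        "sagemaker:DescribeModelPackage",
        "sagemaker:ListModelPackages",
        "sagemaker:CreateModel"
      ],
      "Resource": "arn:aws:sagemaker:us-east-1:111122223333:model-package/example-model-group/*"
    }
  ]
}
```

With access of this kind in place, the production account can reference a versioned model package ARN from the central repository when it creates a model.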

# Audit trail management
<a name="audit-trail-management"></a>

Operations against AWS services are logged by AWS CloudTrail, and log files are stored in S3. Access details such as Event Name, User Identity, Event Time, Event Source, and Source IP are all captured in CloudTrail.

![\[A diagram that shows a sample audit trail architecture.\]](http://docs.aws.amazon.com/whitepapers/latest/build-secure-enterprise-ml-platform/images/build-ml-19.png)


*Sample audit trail architecture*

CloudTrail provides features for accessing and viewing CloudTrail events directly in the console. CloudTrail can also integrate with log analysis tools such as Splunk for further processing and reporting.

SageMaker AI resources such as notebooks, processing jobs, and training jobs assume IAM roles, and CloudTrail records those roles as the identity for the API events they generate. To associate these activities with an individual user, consider creating a separate IAM role for each user for the different SageMaker AI services to assume.
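
A per-user role only helps attribution if each user can pass their own role and no one else's. The following identity-based policy is a minimal sketch of that idea, not a policy from this whitepaper. It assumes the users are IAM users (so the `${aws:username}` policy variable resolves), that the account ID is the placeholder `111122223333`, and that roles follow a hypothetical naming convention of `SageMakerExecutionRole-` followed by the user name.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExamplePassOnlyOwnExecutionRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::111122223333:role/SageMakerExecutionRole-${aws:username}",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "sagemaker.amazonaws.com"
        }
      }
    }
  ]
}
```

Because the role name embeds the user name, the role recorded in a CloudTrail event for a job or notebook can be traced back to one user. Federated users would need a different mapping, such as session tags or role session names.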

# Data and artifacts lineage tracking
<a name="data-and-artifacts-lineage-tracking"></a>

For regulated customers, tracking all the artifacts used for a production model is an essential requirement for reproducing the model to meet regulatory and control requirements. The following diagram shows the various artifacts that need to be tracked and versioned to recreate the data processing, model training, and model deployment process.

![\[A diagram showing artifacts needed for tracking.\]](http://docs.aws.amazon.com/whitepapers/latest/build-secure-enterprise-ml-platform/images/build-ml-20.png)


*Artifacts needed for tracking*
+ **Code versioning** — Code repositories such as GitLab, Bitbucket, and CodeCommit support versioning of the code artifacts. You can check code in and out by commit ID, which uniquely identifies a version of the source code in a repository.
+ **Dataset versioning** — Training data can be version controlled by using a proper S3 data partition scheme. Every time there is a new training dataset, a unique S3 bucket / prefix can be created to uniquely identify the training dataset. 

  [DVC](https://dvc.org/), an open-source data versioning system for machine learning, can track different versions of a dataset. The DVC repository can be created with a code repository such as GitHub and CodeCommit, and S3 can be used as the back-end store for data.

  Metadata associated with the datasets can also be tracked along with the dataset using services such as [SageMaker AI Lineage Tracking](https://docs.aws.amazon.com/sagemaker/latest/dg/lineage-tracking.html).
+ **Container versioning** — Amazon ECR uniquely identifies each image with an Image URI (repository URI + image digest ID). Images can also be pulled with repository URI + tag (if a unique tag is properly enforced for each image push). Additional tags can be added to track other metadata, such as the source code repository URI and commit ID, for traceability.
+ **Training job versioning** — Each SageMaker AI training job is uniquely identified with an ARN, and other metadata such as hyperparameters, container URI, data set URI, and model output URI are automatically tracked with the training job.
+ **Model versioning** — A model package can be registered in SageMaker AI using the URL of the model files in S3 and URI to the container image in ECR. An ARN is assigned to the model package to identify the model package uniquely. Additional tags can be added to the SageMaker AI model package with the model name and version ID to track different versions of the model. 
+ **Endpoint versioning** — Each SageMaker AI endpoint has a unique ARN, and other metadata, such as the model used, are tracked as part of the endpoint configuration.

## Infrastructure configuration change management
<a name="infrastructure-configuration-change-management"></a>

As part of the ML platform operation, the Cloud Engineering / MLOps team needs to monitor and track resource configuration changes in core architecture components such as automation pipelines and the model hosting environment.

[AWS Config](https://aws.amazon.com/config/) is a service that enables you to assess, audit, and evaluate the configuration of AWS resources. It tracks changes in CodePipeline pipeline definitions, CodeBuild projects, and CloudFormation stacks, and it can report configuration changes across timelines.

AWS CloudTrail can track and log SageMaker AI API events for the create, delete, and update operations for model registry, model hosting endpoint configuration, and model hosting endpoint to detect production environment changes.
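
One possible way to act on these CloudTrail events is an Amazon EventBridge rule, which this whitepaper does not prescribe. The event pattern below is a sketch under that assumption. It matches create, update, and delete calls for model packages, endpoint configurations, and endpoints. It relies on a trail that records management events, and the list of event names is illustrative rather than complete.

```json
{
  "source": ["aws.sagemaker"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["sagemaker.amazonaws.com"],
    "eventName": [
      "CreateModelPackage",
      "UpdateModelPackage",
      "DeleteModelPackage",
      "CreateEndpointConfig",
      "DeleteEndpointConfig",
      "CreateEndpoint",
      "UpdateEndpoint",
      "DeleteEndpoint"
    ]
  }
}
```

A rule with a pattern like this could forward matching events to a notification or ticketing target so that production changes are reviewed soon after they occur.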