

# 3 – Designing data platforms for governance and compliance
<a name="design-principle-3"></a>

 **How do you protect data in your organization’s analytics workload?** Privacy by Design (PbD) is an approach in system engineering that takes privacy into account throughout the whole engineering process. PbD especially focuses on systems or applications that capture and process personal data. Many countries and political unions enforce data protection regulations. The main data protection regulations are: GDPR (General Data Protection Regulation, EU), CCPA (California Consumer Privacy Act), LGPD (Lei Geral de Proteção de Dados Pessoais, Brazil), POPIA (Protection of Personal Information Act, South Africa), the Australian Privacy Act, and the DPA (UK Data Protection Act). 

 As an organization, you must understand which data protection regulations you must adhere to and implement them in your solution accordingly. If your organization operates across territories, then you must adhere to multiple data protection regulations. 

 This whitepaper covers the common themes shared among these regulations; however, this is not an exhaustive list. Therefore, you must consult your organization’s Data Protection Officer to determine what additional regional and company-wide data protection and data governance requirements must be implemented. 

 For more details regarding the different types of data protection regulations, refer to the following: 
+  GDPR - [General Data Protection Regulation Center](https://aws.amazon.com/compliance/gdpr-center/) 
+  CCPA - [California Consumer Privacy Act](https://aws.amazon.com/compliance/california-consumer-privacy-act/) 
+  LGPD - [The General Data Protection Law](https://aws.amazon.com/blogs/security/lgpd-workbook-for-aws-customers-managing-personally-identifiable-information-in-brazil/) 
+  POPIA - [South Africa Data Privacy](https://aws.amazon.com/compliance/south-africa-data-privacy/) 


|   **ID**   |   **Priority**   |  **Best practice**  | 
| --- | --- | --- | 
|  ☐ BP 3.1   |   Required   |  Privacy by Design  | 
|  ☐ BP 3.2   |   Required   |  Classify and protect data  | 
|  ☐ BP 3.3   |   Required   |  Understand data classifications and their protection policies  | 
|  ☐ BP 3.4   |   Required   |  Identify the source data owners and have them set the data classifications  | 
|  ☐ BP 3.5   |   Required   |  Record data classifications into the Data Catalog so that analytics workloads can understand  | 
|  ☐ BP 3.6   |   Required   |  Implement encryption policies  | 
|  ☐ BP 3.7   |   Required   |  Implement data retention policies for each class of data in the analytics workload  | 
|  ☐ BP 3.8   |   Recommended   |  Enforce downstream systems to honor the data classifications  | 

 For more details, refer to the following information: 
+  AWS GDPR Center: [Introducing the New GDPR Center and “Navigating GDPR Compliance on AWS” Whitepaper](https://aws.amazon.com/blogs/security/introducing-the-new-gdpr-center-and-navigating-gdpr-compliance-on-aws-whitepaper/) 
+  AWS Database Blog: [Best practices for securing sensitive data in AWS data stores](https://aws.amazon.com/blogs/database/best-practices-for-securing-sensitive-data-in-aws-data-stores/) 
+  AWS Security Blog: [Discover sensitive data by using custom data identifiers with Amazon Macie](https://aws.amazon.com/blogs/security/discover-sensitive-data-by-using-custom-data-identifiers-with-amazon-macie/) 
+  Amazon Macie User Guide: [What is Amazon Macie?](https://docs.aws.amazon.com/macie/latest/user/what-is-macie.html) 
+  AWS Key Management Service Developer Guide: [What is AWS Key Management Service?](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html) 
+  AWS Whitepaper: [Data Classification: Secure Cloud Adoption](https://docs.aws.amazon.com/whitepapers/latest/data-classification/welcome.html) 
+  AWS Clean Rooms: [What is AWS Clean Rooms](https://docs.aws.amazon.com/clean-rooms/latest/userguide/what-is.html) 

# Best practice 3.1 – Privacy by Design
<a name="best-practice-3.1-privacy-by-design."></a>

 Privacy by Design is an approach in system engineering that takes privacy into account throughout the whole engineering process. It especially focuses on systems or applications that capture and process personal data. 

 There is an increased focus on ensuring that personal data is processed lawfully, fairly, and in a transparent manner in relation to the data subject. Another concern is that the data processing is adequate, relevant, and limited in relation to the purpose for which the information is used. 

## Suggestion 3.1.1 – Data minimization
<a name="suggestion-3.1.1-data-minimization."></a>

Organizations should only receive, process, and store information that is relevant for the task rather than processing all information when only a portion of the file is required. For example, if a client provided a full extract of all information from their source system containing sensitive personal information, and if a portion of the file is deemed irrelevant in meeting the overall project requirements, the remainder of the file should not be stored or processed. 

 Data minimization coincides with data access controls, in that data minimization rules can be implemented using data access controls. A suggestion is to create and maintain a data access matrix aligned with your data classification catalogs. This helps ensure that the correct groups of people have access to the right data. As most compliance frameworks require evidence that rules have been applied, a data access matrix can demonstrate to auditors that your organization has gone through the proper thought process to determine who can access what information. 

 Data minimization can be applied at the point of capture. It can also be applied at the point of access by presenting a restricted data model or implementing role-based access controls (RBAC). For more information on controlling data access, see [4 – Implement data access control](design-principle-4.md).

 Test and user acceptance test (UAT) environments, as well as training model datasets, must have a restricted dataset and not contain any personal information. If the structure of the data model must remain the same as production, then consider anonymizing or masking information to meet your data minimization requirements. 

 It is common practice to create test and development environments using a backup of production and restore to the respective development or test environment. If this is the case, anonymization of personally identifiable information (PII) and other sensitive information must occur using inbuilt logic or services such as AWS Glue DataBrew to obfuscate the information. 

 For more details, refer to the following documentation: 
+  Amazon Redshift RBAC - [Amazon Redshift role-based access control](https://docs.aws.amazon.com/redshift/latest/dg/t_Roles.html) 
+  AWS Lake Formation RBAC - [Lake Formation role-based access controls](https://docs.aws.amazon.com/lake-formation/latest/dg/access-control-overview.html) 
+  Amazon Athena RBAC - [Amazon Athena fine-grained access controls](https://docs.aws.amazon.com/athena/latest/ug/fine-grained-access-to-glue-resources.html) 
+  AWS Glue DataBrew - [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/) visual data preparation 

## Suggestion 3.1.2 – Anonymization, pseudonymization, and tokenization
<a name="suggestion-3.1.2-anonymization-pseudonymization-and-tokenization."></a>

 Anonymization, pseudonymization, and tokenization refer to methods of either rendering data anonymous or encoding data in such a manner that the data is no longer directly identifiable. 

### Suggestion 3.1.2.1 – Anonymization
<a name="suggestion-3.1.2.1"></a>

**Anonymization is defined as the process of turning data into a form that does not identify individuals and where identification is not likely to take place.**

 This results in changing personal data into data that is no longer personal. An important factor in this process is that the anonymization must be irreversible. The anonymized value should be supported by the current field data type, have similar length, and retain some characteristics of the original value. For example, if a Vehicle Registration Number such as `OU51 SMR` was being anonymized, the result would look similar to `BB88 9AA`.
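The format-preserving idea can be sketched in a few lines of Python. This is an illustration of the requirement, not a production anonymizer: each character is replaced with a random one of the same class, so the output keeps the original length and layout while the mapping is random and therefore irreversible.

```python
import random
import string

def anonymize_preserving_format(value: str) -> str:
    """Replace each letter with a random letter and each digit with a
    random digit, leaving other characters (such as spaces) intact.
    The replacement is random per call, so the result is irreversible."""
    out = []
    for ch in value:
        if ch.isupper():
            out.append(random.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(random.choice(string.ascii_lowercase))
        elif ch.isdigit():
            out.append(random.choice(string.digits))
        else:
            out.append(ch)  # keep separators so length and shape survive
    return "".join(out)

# A vehicle registration keeps its length and layout after anonymization.
print(anonymize_preserving_format("OU51 SMR"))  # e.g. 'BB88 9AA'
```

Because the replacement preserves character classes, the anonymized value remains valid for the column's existing data type and length constraints.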

Organizations need the ability to anonymize full datasets as well as single records. Single-record anonymization helps deliver the right to erasure and meet data retention requirements, whereas full batch anonymization is typically used when obfuscating development and UAT environments. 

 The function to anonymize information should support the flexibility to anonymize certain fields, but not all.

Operational databases, reporting databases, and analytical data marts should all be considered for anonymization, although reports and analytical cubes should typically never contain PII in the first place. 

 Audit the reason why information was anonymized, for example, data portability or data retention removal. Record in an audit table the time and date the anonymization process ran, the user ID that ran it, and the records it affected.

 For more details, see AWS Big Data Blog: [Anonymize and manage data in your data lake with Amazon Athena and AWS Lake Formation](https://aws.amazon.com/blogs/big-data/anonymize-and-manage-data-in-your-data-lake-with-amazon-athena-and-aws-lake-formation/) 

### Suggestion 3.1.2.2 – Pseudonymization
<a name="suggestion-3.1.2.2"></a>

**Pseudonymized data is not the same as anonymized data.**

When data has been pseudonymized, it still retains a level of detail that allows tracking the data back to its original state. With anonymized data, the level of detail is reduced, rendering reversal impossible. Pseudonymization is the processing of personal data in such a way that the data can only be attributed to a specific data subject by using additional information. To pseudonymize a dataset, the additional information must be kept separately and be subject to technical and organizational measures that ensure non-attribution to an identified or identifiable person.

In summary, pseudonymized data is a privacy-enhancing technique where directly identifying data, such as IP addresses and contact information, are held separately and securely from processed data to ensure non-attribution. Similar to anonymization, referential integrity must not be affected. Therefore, both of the following are required: an audit trail of the pseudonymization process, and a pseudonymization function that supports both single item and batch processing.
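One common way to sketch pseudonymization (an illustration, not a prescribed implementation) is a keyed hash, where the secret key plays the role of the separately held "additional information". The same input always yields the same pseudonym, which preserves referential integrity across tables.

```python
import hmac
import hashlib

# The secret key stands in for the "additional information" that must be
# held separately (for example, in a key management system) from the
# processed data and kept under strict access control.
SECRET_KEY = b"stored-separately-under-strict-access-control"

def pseudonymize(value: str) -> str:
    """Derive a stable pseudonym with a keyed hash (HMAC-SHA256).
    Identical inputs map to identical pseudonyms, so joins still work,
    but attribution requires access to the separately held key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# Identical inputs produce identical pseudonyms across datasets.
a = pseudonymize("jane.doe@example.com")
b = pseudonymize("jane.doe@example.com")
assert a == b
```

Because the pseudonym is deterministic for a given key, the function supports both single-item and batch processing, and rotating or destroying the key severs the link to the data subject.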

 For more detail, see [Amazon Redshift Data Masking](https://docs.aws.amazon.com/redshift/latest/dg/t_ddm.html). 

### Suggestion 3.1.2.3 – Tokenization
<a name="suggestion-3.1.2.3"></a>

*Tokenization*, when applied to data security, is the process of substituting a sensitive data element with a non-sensitive equivalent. This is referred to as a token, which has no extrinsic or exploitable meaning or value. The token is a reference that maps back to the sensitive data through a tokenization system. Tokenization is typically used in finance to tokenize the primary account number (PAN). 
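A minimal token vault can illustrate the mapping. The class below is a sketch, not a production tokenization system (which would need durable, access-controlled storage for the vault): tokens are random, so they carry no exploitable meaning, and the vault is the only path back to the PAN.

```python
import secrets

class TokenVault:
    """Illustrative tokenization system: random tokens with a private
    mapping back to the sensitive value."""

    def __init__(self) -> None:
        self._vault: dict[str, str] = {}    # token -> sensitive value
        self._reverse: dict[str, str] = {}  # sensitive value -> token

    def tokenize(self, pan: str) -> str:
        """Return the existing token for a PAN, or mint a new random one."""
        if pan in self._reverse:
            return self._reverse[pan]
        token = secrets.token_hex(8)
        self._vault[token] = pan
        self._reverse[pan] = token
        return token

    def detokenize(self, token: str) -> str:
        """Resolve a token back to the PAN; only the vault can do this."""
        return self._vault[token]

vault = TokenVault()
token = vault.tokenize("4111111111111111")
assert vault.detokenize(token) == "4111111111111111"
```

Downstream systems can process and join on the token freely; only the system holding the vault can ever recover the original PAN.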

 For more details, refer to the following information: 
+  AWS Blog – [AWS Glue DataBrew detection data masking transformations](https://aws.amazon.com/about-aws/whats-new/2021/11/aws-glue-databrew-detection-data-masking-transformations/) 
+ AWS Blog - [Data Tokenization with Amazon Redshift and Protegrity](https://aws.amazon.com/blogs/apn/data-tokenization-with-amazon-redshift-and-protegrity/)

## Suggestion 3.1.3 – Rights of the individual, citizen, or subject
<a name="suggestion-3.1.3-rights-of-the-individual-citizen-or-subject."></a>

 Your organization should consider the process to address the rights of the individual, citizen, or subject for their respective regional regulation. 

### Suggestion 3.1.3.1 – Subject Access Request (SAR)
<a name="suggestion-3.1.3.a-subject-access-request-sar."></a>

 This right allows an individual to request information from the data controller about how their personal data is being processed. If an individual’s information is being processed, the personal data and associated metadata must be provided to that individual. 

If the individual’s information is stored in a database, then an automated process, such as a stored procedure or user-defined function (UDF), should be developed to answer the Subject Access Request (SAR). There will, however, be situations when the individual’s information is stored in Amazon S3. In that case, the proposed solution for identifying which S3 object contains the respective information is to build a lookup table in a database containing the reference number, the individual’s contact details, and the S3 object location. This approach allows your organization to ingest only the relevant objects into Amazon EMR, infer the schema using Apache Spark, and extract the information required to fulfill the request. Without such a lookup table, your organization must process all S3 objects to identify the information needed to fulfill the request. 
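The lookup-table approach can be sketched as follows. SQLite stands in for the lookup database, and the table and column names are illustrative assumptions, not a prescribed schema:

```python
import sqlite3

# In-memory SQLite stands in for the lookup database; the table and
# column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sar_lookup (
        reference_number TEXT PRIMARY KEY,
        contact_email    TEXT,
        s3_object_key    TEXT
    )""")
conn.execute(
    "INSERT INTO sar_lookup VALUES (?, ?, ?)",
    ("REF-0042", "jane.doe@example.com",
     "s3://data-lake/customers/part-0007.parquet"),
)

def locate_subject_data(email: str) -> list[str]:
    """Return only the S3 objects that need scanning for a Subject
    Access Request, instead of processing every object in the bucket."""
    rows = conn.execute(
        "SELECT s3_object_key FROM sar_lookup WHERE contact_email = ?",
        (email,),
    ).fetchall()
    return [r[0] for r in rows]

print(locate_subject_data("jane.doe@example.com"))
```

The returned object keys are then the only inputs that need to be read into Amazon EMR (or any Spark job) to extract the subject's data.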

 If your regional regulations require that your organization handle a right to data portability request, then the SAR logic can double up to support that as well.

 For more details, see Apache Spark Documentation - [Inferring the Schema Using Reflection](https://spark.apache.org/docs/2.3.0/sql-programming-guide.html) 

### Suggestion 3.1.3.2 – Right to be forgotten or erasure
<a name="suggestion-3.1.3.b-right-to-be-forgotten-or-erasure"></a>

Individuals have the right to erasure (the right to be forgotten), where an individual can request that all of their personal data is erased by the data controller organization. In some countries, there are instances where the data controller can refuse to comply with a right to erasure request, such as where the data is used for financial governance. 

 The right to erasure does not strictly mean that the individual’s information must be deleted. Instead, it can be permanently masked so that the personal data is no longer in the clear and the update is irreversible. 

 The organization must consider all data repositories when responding to an erasure request, as an individual’s information can reside in backup and source system databases. All of these records must have the individual’s information removed or anonymized. 

 If there are concerns about database referential integrity being affected by removing the individual’s information, then consider anonymizing the specific data attributes for the given individual. There are benefits to anonymization, such as being able to maintain an audit history of what actions have been performed against the individual by referencing a system ID. The same steps that are performed in production environments must also be run in UAT, development, OLTP, and backup repositories. 
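A minimal sketch of erasure-as-masking, using SQLite and hypothetical table names, shows the two properties described above: the system ID survives so foreign keys and history remain intact, and the action is recorded in an audit table.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema for illustration; real table names will differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute(
    "CREATE TABLE erasure_audit (customer_id INTEGER, reason TEXT,"
    " actor TEXT, at TEXT)")
conn.execute(
    "INSERT INTO customer VALUES (7, 'Jane Doe', 'jane.doe@example.com')")

def erase_subject(customer_id: int, reason: str, actor: str) -> None:
    """Permanently mask personal attributes while keeping the system ID,
    so referential integrity holds, then record who performed the
    erasure, why, and when."""
    conn.execute(
        "UPDATE customer SET name = 'ERASED', email = 'ERASED' WHERE id = ?",
        (customer_id,),
    )
    conn.execute(
        "INSERT INTO erasure_audit VALUES (?, ?, ?, ?)",
        (customer_id, reason, actor,
         datetime.now(timezone.utc).isoformat()),
    )

erase_subject(7, "right-to-erasure request", "dpo-team")
```

The same procedure would then be scheduled against UAT, development, OLTP, and backup copies on their respective refresh cycles.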

 The schedule of running the procedure in the other environments depends on the refresh schedules of those other environments.

# Best practice 3.2 – Classify and protect data
<a name="best-practice-3.2---classify-and-protect-data."></a>

 **How do you classify and protect data in an analytics workload?** Because analytics workloads ingest data from source systems, the owner of the source data should define the data classifications. As the analytics workload owner, you should honor the source data classifications and implement the corresponding data protection policies of your organization. Share the data classifications with the downstream data consumers so that they can honor the data classifications in their own organizations and policies as well. 

 Data classification helps to categorize organizational data based on sensitivity and criticality, which then helps determine appropriate protection and retention controls on that data. 

# Best practice 3.3 – Understand data classifications and their protection policies
<a name="best-practice-3.3---understand-data-classifications-and-their-protection-policies."></a>

 Data classification in your organization is key to determining how data must be protected at rest and in transit. For example, because an analytics workload necessarily copies and shares data between operations and systems, we recommend that access to certain data classifications be controlled. Such a data protection strategy helps prevent data loss, theft, and corruption, and helps minimize the impact caused by malicious activities or unintended access. 

## Suggestion 3.3.1 – Identify classification levels
<a name="suggestion-3.3.1---identify-classification-levels."></a>

 Use the [Data Classification whitepaper](https://docs.aws.amazon.com/whitepapers/latest/data-classification/data-classification.html) to help you identify different classification levels. Four common levels are restricted, confidential, internal, and public; however, these levels can vary based on the industry and compliance requirements of your organization. 

## Suggestion 3.3.2 – Define access rules
<a name="suggestion-3.3.2---define-access-rules."></a>

 The data owners should define the data access rules based on the sensitivity and criticality of the data. For example, with AWS Lake Formation, you can define and enforce access controls that operate at the table, column, row, and cell level for all the users that access your data lake. 

 For more details, refer to the following information: 
+  AWS Security Blog: [How to scale your authorization needs by using attribute-based access control with S3](https://aws.amazon.com/blogs/security/how-to-scale-authorization-needs-using-attribute-based-access-control-with-s3/) 
+  AWS Big Data Blog: [Create a secure data lake by masking, encrypting data, and enabling fine-grained access with AWS Lake Formation](https://aws.amazon.com/blogs/big-data/create-a-secure-data-lake-by-masking-encrypting-data-and-enabling-fine-grained-access-with-aws-lake-formation/) 
+  AWS Big Data Blog: [Control data access and permissions with AWS Lake Formation and Amazon EMR](https://aws.amazon.com/blogs/big-data/control-data-access-and-permissions-with-aws-lake-formation-and-amazon-emr/) 
+  AWS Big Data Blog: [Enforce column-level authorization with Amazon QuickSight and AWS Lake Formation](https://aws.amazon.com/blogs/big-data/enforce-column-level-authorization-with-amazon-quicksight-and-aws-lake-formation/) 

## Suggestion 3.3.3 – Identify security zone models to isolate data based on classification
<a name="suggestion-3.3.3---identify-security-zone-models-to-isolate-data-based-on-classification."></a>

 Design security zone models from the AWS account level down to the AWS resource level. For example, consider building a multi-account model to isolate different classes of data at the account level. You can also consider separating development and test resources from production ones at the account level or at the resource level. 

 For more details, refer to the following information: 
+  AWS Whitepaper: [An Overview of the AWS Cloud Adoption Framework](https://docs.aws.amazon.com/whitepapers/latest/overview-aws-cloud-adoption-framework/welcome.html). 
+  AWS Whitepaper: [Organizing Your AWS Environment Using Multiple Accounts](https://docs.aws.amazon.com/whitepapers/latest/organizing-your-aws-environment/organizing-your-aws-environment.html). 
+  AWS Whitepaper: [Security Pillar – AWS Well-Architected Framework](https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/welcome.html). 

## Suggestion 3.3.4 – Identify sensitive information and define protection policies
<a name="suggestion-3.3.4---identify-sensitive-information-and-define-protection-policies."></a>

 Discover sensitive data by using custom data identifiers in Amazon Macie or using AWS Glue sensitive data detection. Based on the sensitivity and criticality of the data, implement data protection policies to prevent unauthorized access. Due to compliance requirements, data might be masked or deleted after processing in some cases.

 For more details, refer to the following information: 
+  AWS Blog: [Introducing PII data identification and handling using AWS Glue DataBrew](https://aws.amazon.com/blogs/big-data/introducing-pii-data-identification-and-handling-using-aws-glue-databrew/) 
+  AWS Blog: [Create a secure data lake by masking, encrypting data, and enabling fine-grained access with AWS Lake Formation](https://aws.amazon.com/blogs/big-data/create-a-secure-data-lake-by-masking-encrypting-data-and-enabling-fine-grained-access-with-aws-lake-formation/) 
+  AWS Info: [AWS Glue detect and process sensitive data ](https://docs.aws.amazon.com/glue/latest/dg/detect-PII.html) 

# Best practice 3.4 – Identify the source data owners and have them set the data classifications
<a name="best-practice-3.4---identify-the-source-data-owners-and-have-them-set-the-data-classifications."></a>

 Identify the owners of the source data, such as business data owners, and agree on what level of protection is required for the data within the analytics platform. 

 Data classifications follow the data as it moves throughout the analytics workflow to ensure that the data is protected, and to determine who and what systems are allowed to access the data. By following the organization’s classification policies, the analytics workload should be able to differentiate the data protection implementations for each class of data. Because each organization has different kinds of classification, the analytics workload should provide a strong logical boundary between processing data of different sensitivity levels. These classifications include *restricted*, *confidential*, and *sensitive*. 

## Suggestion 3.4.1 – Assign owners per each dataset
<a name="suggestion-3.4.1---assign-owners-per-each-dataset."></a>

 A dataset, or a table in a relational database, is a collection of data. A Data Catalog is a collection of metadata that helps you centralize, share, and search information about the data within your platform. In addition to holding assigned classifications, this capability allows teams to search for data assets and decide whether a data asset is valuable for their analytics or data science workload. 

 The administrator of the analytics workload should know who the owners are for each dataset, and should assign dataset ownership in the Data Catalog. 

## Suggestion 3.4.2 – Define attestation scope and reviewer as additional scope for sensitive data
<a name="suggestion-3.4.2---define-attestation-scope-and-reviewer-as-additional-scope-for-sensitive-data."></a>

 As the owner of the analytics workload, you should know the data owner for each dataset. For example, when a dataset classified as highly sensitive has permission issues within the organization, you might have to talk to the dataset owners and have them resolve the issues. 

## Suggestion 3.4.3 – Set expiry for data ownership and attestation, and have owners reconfirm periodically
<a name="suggestion-3.4.3---set-expiry-for-data-ownership-and-attestation-and-have-owners-reconfirm-periodically."></a>

 As businesses change, the data owners and the data classifications might change as well. Run campaigns periodically, such as quarterly or yearly, to request each of the dataset owners to reconfirm that they are still the right owners, and that the data classifications are still accurate. 

# Best practice 3.5 – Record data classifications into the Data Catalog so that analytics workloads can understand
<a name="best-practice-3.5---record-data-classifications-into-the-data-catalog-so-that-analytics-workloads-can-understand."></a>

 Allow processes to update the Data Catalog so it can provide a reliable record of where the data is located and its precise classification. To protect the data effectively, analytics systems should know the classifications of the source data so that the systems can govern the data according to business needs. For example, if the business requires that confidential data be encrypted using team-owned private keys, such as from AWS Key Management Service (AWS KMS), then the analytics workload should be able to determine which data is classified as confidential by referencing its data catalog. 

## Suggestion 3.5.1 – Use tags to indicate the data classifications
<a name="suggestion-3.5.1---use-tags-to-indicate-the-data-classifications."></a>

 Use a tagging ontology to designate the classification of sensitive data in data stores with a data catalog. A tagging ontology allows discoverability of data sensitivity without directly exposing the underlying data. Tags can also be used to authorize access in [tag-based access control (TBAC)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html) schemes. 
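The decision logic behind a TBAC scheme can be sketched in plain Python. The tag keys, dataset names, and entitlement model below are illustrative assumptions, not the Lake Formation API: access is granted only when a principal is entitled to every tag value attached to a dataset.

```python
# Classification tags attached to datasets in the catalog (assumed names).
DATASET_TAGS = {
    "sales.customers": {"classification": "confidential"},
    "sales.orders":    {"classification": "internal"},
}

# Tag values each principal is entitled to access (assumed roles).
PRINCIPAL_ENTITLEMENTS = {
    "analyst-role": {"classification": {"internal", "public"}},
    "finance-role": {"classification": {"confidential", "internal", "public"}},
}

def can_access(principal: str, dataset: str) -> bool:
    """Grant access only when the principal is entitled to every tag
    value attached to the dataset."""
    tags = DATASET_TAGS[dataset]
    entitlements = PRINCIPAL_ENTITLEMENTS.get(principal, {})
    return all(value in entitlements.get(key, set())
               for key, value in tags.items())

assert can_access("finance-role", "sales.customers")
assert not can_access("analyst-role", "sales.customers")
```

Note that the policy references only tags, never the underlying data, which is what makes the ontology safe to share with downstream teams for discovery.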

 For more details, refer to the following information: 
+  AWS Lake Formation Developer Guide: [What Is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html) 
+ AWS Whitepaper: [Tagging Best Practices](https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/tagging-best-practices.html)
+  AWS Lake Formation: [Easily manage your data lake at scale using AWS Lake Formation Tag-based access control](https://aws.amazon.com/blogs/big-data/easily-manage-your-data-lake-at-scale-using-tag-based-access-control-in-aws-lake-formation/) 

## Suggestion 3.5.2 – Record lineage of data to track changes in the Data Catalog
<a name="suggestion-3.5.2---record-lineage-of-data-to-track-changes-in-the-data-catalog."></a>

 Data lineage describes the relationship between data and the systems that process it. For example, data lineage tells you which source system the data came from, what changes occurred to the data, and which downstream systems have access to it. Your organization should be able to discover, record, and visualize the data lineage from source to target systems. 

 For more details, refer to the following information: 
+  AWS Big Data Blog: [Metadata classification, lineage, and discovery using Apache Atlas on Amazon EMR](https://aws.amazon.com/blogs/big-data/metadata-classification-lineage-and-discovery-using-apache-atlas-on-amazon-emr/) 

# Best practice 3.6 – Implement encryption policies
<a name="best-practice-3.6---implement-encryption-policies"></a>

 Data encryption is a way of translating data from plaintext (unencrypted) to ciphertext (encrypted). Encryption is a critical component of a *defense in depth* strategy. Therefore, it is highly recommended that your organization implement a well-designed encryption and key management system by separating access to the decryption key from access to your data to provide data security. 

## Suggestion 3.6.1 – Implement encryption policies for data at rest and in transit
<a name="suggestion-3.6.1---implement-encryption-policies-for-data-at-rest-and-in-transit"></a>

 Each analytics service provides different types of encryption methods. Review the viable encryption methods for your solutions and implement them as necessary. 

 For more details, refer to the following information: 
+  [AWS Key Management Service (AWS KMS) encryption best practices](https://docs.aws.amazon.com/prescriptive-guidance/latest/encryption-best-practices/kms.html) 
+  AWS Big Data Blog: [Best Practices for Securing Amazon EMR](https://aws.amazon.com/blogs/big-data/best-practices-for-securing-amazon-emr/) 
+  AWS Big Data Blog: [Encrypt Your Amazon Redshift Loads with Amazon S3 and AWS KMS](https://aws.amazon.com/blogs/big-data/encrypt-your-amazon-redshift-loads-with-amazon-s3-and-aws-kms/) 
+  AWS Big Data Blog: [Encrypt and Decrypt Amazon Kinesis Records Using AWS KMS](https://aws.amazon.com/blogs/big-data/encrypt-and-decrypt-amazon-kinesis-records-using-aws-kms/) 
+  AWS Partner Network (APN) Blog: [Data Tokenization with Amazon Redshift and Protegrity](https://aws.amazon.com/blogs/apn/data-tokenization-with-amazon-redshift-and-protegrity/) 

# Best practice 3.7 – Implement data retention policies for each class of data in the analytics workload
<a name="best-practice-3.7---implement-data-retention-policies-for-each-class-of-data-in-the-analytics-workload."></a>

 The business’s data classification policies determine how long the analytics workload should retain the data and how long backups should be kept. These policies help ensure that every system follows the data security rules and compliance requirements. The analytics workload should implement data retention and backup policies according to these data classification policies. For example, if the policy requires every system to retain operational data for five years, the analytics systems should implement rules to keep the in-scope data for five years. More information on data retention can be found in [Sustainability](sustainability.md). 
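For example, a five-year retention policy for one class of data could be expressed as an Amazon S3 Lifecycle rule. The helper below is a sketch that builds a rule dict in the shape accepted by the S3 `put_bucket_lifecycle_configuration` API; the `classification` tag key and the rule ID format are assumptions for illustration.

```python
def retention_rule(classification: str, years: int) -> dict:
    """Build an S3 Lifecycle rule that expires objects carrying a given
    classification tag after the retention period. The dict follows the
    Rules entry shape of put_bucket_lifecycle_configuration; the
    'classification' tag key is an assumed convention."""
    return {
        "ID": f"retain-{classification}-{years}y",
        "Status": "Enabled",
        "Filter": {"Tag": {"Key": "classification", "Value": classification}},
        "Expiration": {"Days": years * 365},
    }

rule = retention_rule("operational", 5)
print(rule["Expiration"])  # {'Days': 1825}
```

One rule per data class keeps retention tailored to each classification rather than applying a single blanket policy to the whole bucket.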

## Suggestion 3.7.1 – Create backup requirements and policies based on data classifications
<a name="suggestion-3.7.1---create-backup-requirements-and-policies-based-on-data-classifications."></a>

 Data backup should be based on business requirements, such as recovery point objective (RPO), recovery time objective (RTO), data classifications, and the compliance and audit requirements. 

## Suggestion 3.7.2 – Create data retention requirement policies based on the data classifications
<a name="suggestion-3.7.2-create-data-retention-requirement-policies-based-on-the-data-classifications."></a>

 Avoid creating blanket retention policies. Instead, policies should be tailored to individual data assets based on their retention requirements. 

 For more details, refer to the following information: 
+  AWS Big Data Blog: [Building a cost efficient, petabyte-scale lake house with Amazon S3 Lifecycle rules and Amazon Redshift Spectrum: Part 1](https://aws.amazon.com/blogs/big-data/part-1-building-a-cost-efficient-petabyte-scale-lake-house-with-amazon-s3-lifecycle-rules-and-amazon-redshift-spectrum/) 
+  AWS Big Data Blog: [Retaining data streams up to one year with Amazon Kinesis Data Streams](https://aws.amazon.com/blogs/big-data/retaining-data-streams-up-to-one-year-with-amazon-kinesis-data-streams/) 
+  AWS Big Data Blog: [Retain more for less with UltraWarm for Amazon OpenSearch Service](https://aws.amazon.com/blogs/big-data/retain-more-for-less-with-ultrawarm-for-amazon-opensearch-service/) 

## Suggestion 3.7.3 – Create data version requirements and policies
<a name="suggestion-3.7.3-create-data-version-requirements-and-policies."></a>

 Implement a process that captures data versions to address compliance, security, and operational requirements. 

 For more details, refer to the following information: 
+  AWS Storage Blog: [Reduce storage costs with fewer noncurrent versions using Amazon S3 Lifecycle](https://aws.amazon.com/blogs/storage/reduce-storage-costs-with-fewer-noncurrent-versions-using-amazon-s3-lifecycle/) 
+  AWS Storage Blog: [Simplify your data lifecycle by using object tags with Amazon S3 Lifecycle](https://aws.amazon.com/blogs/storage/simplify-your-data-lifecycle-by-using-object-tags-with-amazon-s3-lifecycle/) 
+  AWS Database Blog: [Implementing version control using Amazon DynamoDB](https://aws.amazon.com/blogs/database/implementing-version-control-using-amazon-dynamodb/) 

# Best practice 3.8 – Enforce downstream systems to honor the data classifications
<a name="best-practice-3.8---enforce-downstream-systems-to-honor-the-data-classifications."></a>

 Since other data-consuming systems will access the data that the analytics workload shares, the workload should require the downstream systems to implement the required data classification policies. For example, if the analytics workload shares the data that is required to be encrypted using customer managed private keys in AWS Key Management Service (AWS KMS), then the downstream systems should also acknowledge and implement such a data protection policy. 

 This helps to ensure that the data is protected throughout the data pipelines. 

## Suggestion 3.8.1 – Have a centralized, shareable catalog with cross-account access to ensure that data owners manage permissions for downstream systems
<a name="suggestion-3.8.1---have-a-centralized-shareable-catalog-with-cross-account-access-to-ensure-that-data-owners-manage-permissions-for-downstream-systems."></a>

 Downstream systems can run on independent AWS accounts, different from the AWS account running the majority of the analytics workload. Downstream systems should be able to discover the data, acknowledge the required data protection policies, and enforce those policies across the analytics platform. 

 To allow downstream systems to use data from the analytics workload, the analytics workload should provide cross-account access based on least privilege for each dataset. 

 For more details, refer to the following information: 
+  AWS Big Data Blog: [Cross-account AWS Glue Data Catalog access with Amazon Athena](https://aws.amazon.com/blogs/big-data/cross-account-aws-glue-data-catalog-access-with-amazon-athena/) 
+  AWS Big Data Blog: [How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform](https://aws.amazon.com/blogs/big-data/how-jpmorgan-chase-built-a-data-mesh-architecture-to-drive-significant-value-to-enhance-their-enterprise-data-platform/) 

## Suggestion 3.8.2 – Monitor the downstream systems’ eligibility to access classified data from the analytics workload
<a name="suggestion-3.8.2-monitor-the-downstream-systems-eligibility-to-access-classified-data-from-the-analytics-workload."></a>

 Monitor the downstream systems’ eligibility to handle sensitive data. For example, you do not want development or test Amazon Redshift clusters to read sensitive data from the analytics workload. If your organization runs a program that certifies which systems are eligible to process various classes of data, periodically verify that each downstream system’s data processing eligibility levels are correct and that the list of data it accesses is appropriate. 