

# 7 – Govern data and metadata changes

 **How do you govern data and metadata changes?** Controlled changes are necessary not only for infrastructure, but also for data quality assurance. If data changes are uncontrolled, it becomes difficult to anticipate their impact, and it becomes harder for downstream systems to manage data quality issues of their own. 


|  **ID**  |  **Priority**  |  **Best practice**  | 
| --- | --- | --- | 
|  ☐ BP 7.1   |  Required  |  Build a central Data Catalog to store, share, and track metadata changes.  | 
|  ☐ BP 7.2   |  Required  |  Monitor for data quality anomalies.  | 
|  ☐ BP 7.3   |  Required  |  Trace data lineage.  | 

# Best practice 7.1 – Build a central Data Catalog to store, share, and track metadata changes

 Building a central Data Catalog to store, share, and manage metadata across the organization is an integral part of data governance, and promotes standardization and reuse. Tracing metadata change history in the central Data Catalog helps you manage and control version changes in the metadata. A Data Catalog is often required for auditing and compliance, and by incorporating business context into it, users across the organization can discover data assets using business terms rather than technical naming conventions. 

## Suggestion 7.1.1 – Changes on the metadata in the Data Catalog should be controlled and versioned

 Use the Data Catalog change tracking features. For example, when a schema changes, the AWS Glue Data Catalog tracks the version change, and you can use AWS Glue to compare schema versions if needed. In addition, we recommend a change control process that allows only authorized users to make schema changes in your Data Catalog. The AWS Glue Schema Registry allows you to centrally discover and control data schemas. You can create a schema contract between producers and consumers to improve data consumers' awareness of data format changes.
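The kind of diff that schema version comparison surfaces can be sketched in plain Python. The schemas below are hand-written stand-ins for two Data Catalog table versions, not data fetched from the AWS Glue API:

```python
# Sketch of a schema-version comparison between two catalog table versions.
# The column lists are illustrative stand-ins, not real Glue API responses.

def diff_schemas(old, new):
    """Return added, removed, and retyped columns between two schema versions."""
    old_cols = {c["name"]: c["type"] for c in old}
    new_cols = {c["name"]: c["type"] for c in new}
    return {
        "added": sorted(set(new_cols) - set(old_cols)),
        "removed": sorted(set(old_cols) - set(new_cols)),
        "retyped": sorted(
            name for name in old_cols.keys() & new_cols.keys()
            if old_cols[name] != new_cols[name]
        ),
    }

v1 = [{"name": "customer_id", "type": "string"},
      {"name": "amount", "type": "double"}]
v2 = [{"name": "customer_id", "type": "string"},
      {"name": "amount", "type": "decimal(10,2)"},  # type changed
      {"name": "currency", "type": "string"}]       # column added

print(diff_schemas(v1, v2))
# → {'added': ['currency'], 'removed': [], 'retyped': ['amount']}
```

A change control process would review exactly this kind of diff before an authorized user promotes the new schema version.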

## Suggestion 7.1.2 – Capture and publish business metadata of your data assets

 Capturing business metadata and publishing it with metadata assets is essential for data consumers and data stewards alike. Metadata such as regulatory compliance status, data classification, and other important data governance characteristics guides consumers on how to best process the data and informs data governance processes conducted by data stewards. Establishing a business glossary across the organization creates a collection of business terms that can be associated with the data assets. This ensures that business definitions are common across the organization. 
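A minimal sketch of what a business glossary enables: terms carry definitions and classifications and point at technical assets, so consumers can search by business language rather than table names. All term, asset, and classification values here are illustrative assumptions:

```python
# Sketch of a business glossary: business terms mapped to technical data
# assets. All names and classifications below are illustrative only.

glossary = {
    "Customer Lifetime Value": {
        "definition": "Projected net revenue attributed to a customer relationship.",
        "assets": ["analytics_db.clv_scores"],
        "classification": "Internal",
    },
    "Churn Rate": {
        "definition": "Share of customers lost over a period.",
        "assets": ["analytics_db.churn_monthly"],
        "classification": "Internal",
    },
}

def find_assets(term_fragment):
    """Discover data assets by (case-insensitive) business term, not table name."""
    frag = term_fragment.lower()
    return sorted(
        asset
        for term, entry in glossary.items()
        if frag in term.lower()
        for asset in entry["assets"]
    )

print(find_assets("churn"))  # → ['analytics_db.churn_monthly']
```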

 For more details, see Amazon DataZone: [Governed Analytics](https://aws.amazon.com/datazone/). 

# Best practice 7.2 – Monitor for data quality anomalies

 Data quality is critical for organizations to accurately measure important business metrics; bad data can impact the accuracy of analytics insights and ML predictions. Monitor data quality and detect data anomalies as early as possible. 

 For more details, see AWS Glue: [Getting started with AWS Glue Data Quality](https://aws.amazon.com/blogs/big-data/getting-started-with-aws-glue-data-quality-from-the-aws-glue-data-catalog/). 

## Suggestion 7.2.1 – Include a data quality check stage in the ETL pipeline as early as possible

 A data quality check helps ensure that bad data is identified and fixed as early as possible, preventing it from propagating downstream. 
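An early quality gate can be sketched as a stage that splits incoming records into clean and quarantined sets before any downstream transformation runs. The rules and field names below are illustrative assumptions, not a prescribed rule set:

```python
# Sketch of an early data quality gate in an ETL pipeline: records failing
# the checks are quarantined before downstream stages ever see them.
# Required fields are an illustrative assumption.

def quality_gate(records, required_fields=("customer_id", "amount")):
    """Split a batch into (clean, quarantined) before further processing."""
    clean, quarantined = [], []
    for rec in records:
        missing = [f for f in required_fields if rec.get(f) in (None, "")]
        if missing:
            quarantined.append({"record": rec, "missing": missing})
        else:
            clean.append(rec)
    return clean, quarantined

batch = [
    {"customer_id": "c-1", "amount": 10.0},
    {"customer_id": "", "amount": 5.0},   # bad: empty customer_id
    {"customer_id": "c-3"},               # bad: missing amount
]
clean, bad = quality_gate(batch)
print(len(clean), len(bad))  # → 1 2
```

Quarantined records can then be routed to a separate location for inspection and repair instead of silently corrupting downstream tables.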

## Suggestion 7.2.2 – Understand the nature of your data and determine the types of data anomalies that must be monitored and fixed based on the business requirements

 The analytics workload can process various types of data, such as structured, unstructured, picture, audio, and video formats. Some data might arrive at the workload periodically, and some might arrive continuously in real time. It is pragmatic to assume that data does not always arrive at the analytics workload in perfect shape, and that only a portion – not the whole set – of the data matters to your workload. 

 Understand the characteristics of your data, and determine what forms of data anomalies you want to remediate. For example, if you expect the data to always contain an important attribute like customer ID, you can define a record as anomalous if it doesn't contain the `customer_id` attribute. Common data anomalies include duplicate data, missing data, incomplete data, incorrect data formats, and different measurement units. 
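Rule-based tagging of the anomaly types named above (missing attributes, bad formats, duplicates) can be sketched as follows; the `customer_id` pattern is an illustrative assumption:

```python
# Sketch of rule-based anomaly tagging for common anomaly types: missing
# attributes, incorrect formats, and duplicates. The ID pattern is an
# illustrative assumption.

import re

CUSTOMER_ID = re.compile(r"^c-\d+$")

def tag_anomalies(records):
    """Label each record with the anomaly types it exhibits, if any."""
    seen, tagged = set(), []
    for rec in records:
        anomalies = []
        cid = rec.get("customer_id")
        if cid is None:
            anomalies.append("missing_customer_id")
        elif not CUSTOMER_ID.match(cid):
            anomalies.append("bad_format")
        elif cid in seen:
            anomalies.append("duplicate")
        else:
            seen.add(cid)
        tagged.append((rec, anomalies))
    return tagged

data = [{"customer_id": "c-1"}, {"customer_id": "c-1"}, {"customer_id": "X9"}, {}]
print([a for _, a in tag_anomalies(data)])
# → [[], ['duplicate'], ['bad_format'], ['missing_customer_id']]
```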

## Suggestion 7.2.3 – Select an existing data quality solution or develop your own based on the requirements

 Some data quality solutions can only detect single-field data quality issues; others can handle complex, stateful data quality issues spanning multiple fields. 
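The distinction can be illustrated with two rules: one that inspects a field in isolation, and one that relates fields to each other. Field names and the rules themselves are illustrative assumptions:

```python
# Contrast between a single-field check and a multi-field check, as a
# sketch of what to evaluate in a data quality solution. Field names and
# rules are illustrative.

from datetime import date

def single_field_ok(rec):
    """Single-field rule: amount must be non-negative."""
    return rec["amount"] >= 0

def multi_field_ok(rec):
    """Multi-field rule: shipping cannot precede the order date."""
    return rec["ship_date"] >= rec["order_date"]

rec = {"amount": 25.0,
       "order_date": date(2024, 5, 2),
       "ship_date": date(2024, 5, 1)}  # ships before it was ordered

print(single_field_ok(rec), multi_field_ok(rec))  # → True False
```

A solution limited to single-field checks would pass this record; only the cross-field rule catches the inconsistency.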

# Best practice 7.3 – Trace data lineage

 Having a clear understanding of where your organization's data comes from, how it is transformed, who and what systems have access to it, and how it is used is critical to increasing the business value of data. To achieve this goal, data lineage should be tracked, managed, and visualized. 

## Suggestion 7.3.1 – Track and control data lineage information

 Data lineage information should include where data has come from, where the data is going, and who has access to the data. Data changes and the business logic used should also be tracked in the data lineage. 
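A lineage record covering the elements listed above might look like the sketch below; every path, table, and team name is an illustrative assumption:

```python
# Sketch of a lineage record: source, destination, the transformation
# applied, and who has access to the output. All names are illustrative.

lineage_events = [
    {
        "source": "s3://raw/orders/",          # where the data came from
        "destination": "analytics_db.orders",  # where the data is going
        "transformation": "dedupe + currency normalization",
        "accessible_to": ["analytics_team", "finance_team"],
        "run_id": "2024-05-01T02:00:00Z",
    },
]

def upstream_of(destination, events):
    """Answer 'where did this dataset come from?' from tracked lineage."""
    return sorted(e["source"] for e in events if e["destination"] == destination)

print(upstream_of("analytics_db.orders", lineage_events))
# → ['s3://raw/orders/']
```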

## Suggestion 7.3.2 – Use visualization tools to investigate data lineage

 Data lineage can become complicated when multiple systems are interacting with each other. Building a data lineage tool to visualize data lineage can reduce troubleshooting time and help identify downstream dependencies. 
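The core query such a tool answers, "what is affected downstream if this dataset changes?", is a graph traversal. A sketch over an illustrative producer-to-consumers edge list:

```python
# Sketch of finding downstream dependencies in a lineage graph: the query
# a lineage visualization answers at a glance. Edges are illustrative.

from collections import deque

edges = {
    "s3://raw/orders/": ["analytics_db.orders"],
    "analytics_db.orders": ["analytics_db.daily_revenue", "ml.churn_features"],
    "ml.churn_features": ["ml.churn_model"],
}

def downstream(node):
    """Breadth-first walk of everything affected by a change to `node`."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)

print(downstream("analytics_db.orders"))
# → ['analytics_db.daily_revenue', 'ml.churn_features', 'ml.churn_model']
```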

## Suggestion 7.3.3 – Build a data lineage report to satisfy compliance and audit requirements

 If data lineage reporting is required for compliance or audit purposes, your organization should either build a data lineage process using AWS services or evaluate third-party applications. 

 For more details, refer to the following information: 
+  AWS data lineage blog: [Build data lineage for data lakes using AWS Glue, Amazon Neptune, and Spline](https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/) 