

# 4 – Implement data access control

 **How do you manage access to data within your organization’s source, analytics, and downstream systems?** 

 An analytics workload is a centralized repository of data from different source systems. As the analytics workload owner, you should honor the source systems’ access management policies when connecting to, and ingesting from, the source systems. 


|   **ID**   |   **Priority**   |   **Best practice**   | 
| --- | --- | --- | 
|  ☐ BP 4.1   |  Required  |  Allow data owners to determine which people or systems can access data in analytics and downstream workloads.  | 
|  ☐ BP 4.2   |  Required  |  Build user identity solutions that uniquely identify people and systems.  | 
|  ☐ BP 4.3   |  Required  |  Implement the required data access authorization models.  | 
|  ☐ BP 4.4   |  Recommended  |  Establish an emergency access process to ensure that admin access is managed and used when required.  | 
|  ☐ BP 4.5   |  Recommended  |  Track data and database changes.  | 

 For more details, refer to the following documentation: 
+  AWS Lake Formation Developer Guide: [Lake Formation Access Control Overview](https://docs.aws.amazon.com/lake-formation/latest/dg/access-control-overview.html) 
+  Amazon Athena User Guide: [Identity and Access Management in Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/security-iam-athena.html) 
+  Amazon Athena User Guide: [Enabling Federated Access to the Amazon Athena API](https://docs.aws.amazon.com/athena/latest/ug/access-federation-saml.html) 
+  Amazon Redshift Database Developer Guide: [Managing database security](https://docs.aws.amazon.com/redshift/latest/dg/r_Database_objects.html) 
+  Amazon EMR Management Guide: [AWS Identity and Access Management for Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-access-iam.html) 
+  Amazon EMR Management Guide: [Use Kerberos authentication](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos.html) 
+  Amazon EMR Management Guide: [Use an Amazon EC2 key pair for SSH credentials](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-access-ssh.html) 

# Best practice 4.1 – Allow data owners to determine which people or systems can access data in analytics and downstream workloads

 Data owners are the people who have direct responsibility for data protection. For instance, data owners determine which data is publicly accessible, and which data is restricted and to whom or to which systems. Data owners should be able to provide data access rules so that the analytics workload can implement them. 

## Suggestion 4.1.1 – Identify data owners and assign roles

 Data ownership is the management and oversight of an organization's data assets to help provide business users with high-quality data that is easily accessible in a consistent manner. Because the analytics workload consolidates multiple datasets into a central place, different datasets are owned by different teams or people. It is therefore important for the analytics workload to identify who owns each dataset, so that the owners can control the data access permissions. 

## Suggestion 4.1.2 – Identify permissions using a permission matrix for users and roles based on actions performed on the data by users and downstream systems

 To aid in identifying and communicating data-access permissions, an Access Control Matrix is a helpful way to document which users, roles, or systems have access to which datasets, and what actions they can perform. The following is a sample matrix covering two users and two roles across two schemas, each containing a table: 

 Table 1: Example Access Control Matrix for Users and Roles 


|   **Permissions**   |   **Read**   |   **Write**   | 
| --- | --- | --- | 
|  Schema 1  |  User1, User2, Role1, Role2  |  Role1  | 
|  Schema 1 / Table 1  |  User1, User2, Role1, Role2  |  Role2  | 
|  Schema 2  |  User1, User2, Role1, Role2  |  User1, Role1  | 
|  Schema 2 / Table 2  |  User1, User2, Role1, Role2  |  User2, Role2  | 

 The matrix format can help identify the minimum permissions required by each principal and avoid overlaps. An Access Control Matrix should be thought of as an abstract model of permissions at a given point in time. Periodically review the actual access permissions against the permission matrix document to ensure accuracy. 
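The matrix above can be represented programmatically so that access decisions and periodic reviews are checked against the same source of truth. The following is a minimal sketch; the resource keys, user, and role names mirror the sample table and are illustrative only.

```python
# The Access Control Matrix from Table 1 as a Python structure.
# Resource keys and principal names are illustrative, matching the sample table.
ACCESS_MATRIX = {
    "schema1":        {"read": {"User1", "User2", "Role1", "Role2"}, "write": {"Role1"}},
    "schema1.table1": {"read": {"User1", "User2", "Role1", "Role2"}, "write": {"Role2"}},
    "schema2":        {"read": {"User1", "User2", "Role1", "Role2"}, "write": {"User1", "Role1"}},
    "schema2.table2": {"read": {"User1", "User2", "Role1", "Role2"}, "write": {"User2", "Role2"}},
}

def is_allowed(principal: str, resource: str, action: str) -> bool:
    """Return True if the matrix grants `action` on `resource` to `principal`."""
    entry = ACCESS_MATRIX.get(resource)
    return entry is not None and principal in entry.get(action, set())
```

A periodic review job could iterate over the matrix and compare each entry against the permissions actually configured in the warehouse or data lake.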

# Best practice 4.2 – Build user identity solutions that uniquely identify people and systems

 To control data access effectively, the analytics workload should be able to uniquely identify people and systems. For example, the workload should be able to tell who accessed the data by looking at user identifiers (such as user names, tags, or IAM role names), with confidence that each identifier represents only one person or system. 

 For more details, refer to the following information: 
+  AWS Big Data Blog: [Amazon Redshift identity federation with multi-factor authentication](https://aws.amazon.com/blogs/big-data/amazon-redshift-identity-federation-with-multi-factor-authentication/) 
+  AWS Big Data Blog: [Federating single sign-on access to your Amazon Redshift cluster with PingIdentity](https://aws.amazon.com/blogs/big-data/federating-single-sign-on-access-to-your-amazon-redshift-cluster-with-pingidentity/) 
+  AWS Database Blog: [Get started with Amazon OpenSearch Service: Use Amazon Cognito for Kibana access control](https://aws.amazon.com/blogs/database/get-started-with-amazon-elasticsearch-service-use-amazon-cognito-for-kibana-access-control/) 
+  AWS Partner Network (APN) Blog: [Implementing SAML AuthN for Amazon EMR Using Okta and Column-Level AuthZ with AWS Lake Formation](https://aws.amazon.com/blogs/apn/implementing-saml-authn-for-amazon-emr-using-okta-and-column-level-authz-with-aws-lake-formation/) 
+  AWS CloudTrail User Guide: [How AWS CloudTrail works with IAM](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/security_iam_service-with-iam.html) 

## Suggestion 4.2.1 – Centralize workforce identities

 It’s a best practice to centralize your workforce identities, which allows you to federate with AWS Identity and Access Management (IAM) using AWS IAM Identity Center or another federation provider. In Amazon Redshift, IAM roles can be mapped to Amazon Redshift database groups. In Amazon EMR, IAM roles can be mapped to an Amazon EMR security configuration or an Apache Ranger Microsoft Active Directory group-based policy. In AWS Glue, IAM roles can be mapped to AWS Glue Data Catalog resource policies. 

 AWS analytics services – such as Amazon OpenSearch Service and Amazon DynamoDB – allow integration with Amazon Cognito for authentication. Amazon Cognito lets you add user sign-up, sign-in, and access control to your web and mobile apps. Amazon Cognito scales to millions of users and supports sign-in with social identity providers, such as Apple, Facebook, Google, and Amazon, and enterprise identity providers via SAML 2.0 and OpenID Connect. 

 For more details, refer to the following information: 
+  AWS Big Data Blog: [Federate Database User Authentication Easily with IAM and Amazon Redshift](https://aws.amazon.com/blogs/big-data/federate-database-user-authentication-easily-with-iam-and-amazon-redshift/) 
+  AWS Big Data Blog: [Federating single sign-on access to your Amazon Redshift cluster with PingIdentity](https://aws.amazon.com/blogs/big-data/federating-single-sign-on-access-to-your-amazon-redshift-cluster-with-pingidentity/) 
+  Amazon EMR Management Guide: [Allow AWS IAM Identity Center for Amazon EMR Studio](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-enable-sso.html) 

# Best practice 4.3 – Implement the required data access authorization models

 User authorization determines what actions a user is permitted to take on data or resources. Data owners should be able to use the available authorization methods to protect their data as needed. For example, if data owners must control which users are allowed to view certain columns of data, the analytics workload should provide column-level data access authorization along with user group management for effective control. 

## Suggestion 4.3.1 – Implement IAM policy-based data access controls

 Limit access to sensitive data stores with IAM policies where possible. Provide systems and people with rotating short-term credentials via role-based access control (RBAC). 
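As an illustration, a resource-level IAM policy can limit a role to read-only calls against a single AWS Glue Data Catalog database. The following sketch builds such a policy document as a Python dictionary; the account ID, Region, and database name are placeholders, so adapt them to your environment.

```python
import json

# A sketch of a resource-level IAM policy restricting a principal to
# read-only Glue Data Catalog calls on one database. The account ID
# (111122223333), Region, and database name "sales" are hypothetical.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartitions",
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:111122223333:catalog",
                "arn:aws:glue:us-east-1:111122223333:database/sales",
                "arn:aws:glue:us-east-1:111122223333:table/sales/*",
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Note that table access in the Glue Data Catalog requires permission on the catalog, the database, and the tables, which is why all three ARN levels appear in the `Resource` list.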

 For more details, see AWS Big Data Blog: [Restrict access to your AWS Glue Data Catalog with resource-level IAM permissions and resource-based policies](https://aws.amazon.com/blogs/big-data/restrict-access-to-your-aws-glue-data-catalog-with-resource-level-iam-permissions-and-resource-based-policies/) 

## Suggestion 4.3.2 – Implement dataset-level data access controls

 Because dataset owners require independent rules for granting data access, you should build the analytics workload so that owners control access at the level of each dataset. For example, if the analytics workload hosts a shared Amazon Redshift cluster, the owner of an individual table should be able to authorize reads and writes on that table independently. 
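One way to keep per-table grants consistent with owner-supplied access lists is to generate the grant statements from those lists. The following is a minimal sketch that emits Redshift-style `GRANT` statements; the table and group names are illustrative, and real deployments would also need the corresponding `REVOKE` handling.

```python
def grant_statements(table: str, readers: list[str], writers: list[str]) -> list[str]:
    """Build Redshift-style GRANT statements for one table from an
    owner-supplied access list. Identifiers are assumed to be pre-validated."""
    stmts = [f"GRANT SELECT ON {table} TO GROUP {g};" for g in readers]
    stmts += [f"GRANT INSERT, UPDATE, DELETE ON {table} TO GROUP {g};" for g in writers]
    return stmts
```

Generating grants this way keeps the owner's access list, rather than ad hoc SQL, as the source of truth for each dataset.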

 For more details, refer to the following information: 
+  AWS Big Data Blog: [Validate, evolve, and control schemas in Amazon MSK and Amazon Kinesis Data Streams with AWS Glue Schema Registry](https://aws.amazon.com/blogs/big-data/validate-evolve-and-control-schemas-in-amazon-msk-and-amazon-kinesis-data-streams-with-aws-glue-schema-registry/) 
+  Amazon Redshift: [Amazon Redshift announces support for Row-Level Security (RLS)](https://aws.amazon.com/about-aws/whats-new/2022/07/amazon-redshift-row-level-security/) 

## Suggestion 4.3.3 – Implement column-level data access controls

 Take care that end users of analytics applications are not exposed to sensitive data. Downstream consumers should access only the limited view of data necessary for their analytics purpose. Enforce this with column-level restrictions; for example, mask sensitive columns from downstream systems so that accidental exposure is avoided. 
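The masking idea can be sketched as a small transformation applied before data reaches a downstream consumer. The column names and mask value below are illustrative; in practice, services such as Lake Formation or Redshift dynamic data masking enforce this natively.

```python
# Assumed set of sensitive columns for this sketch.
SENSITIVE_COLUMNS = {"ssn", "email"}

def mask_row(row: dict, allowed: set[str]) -> dict:
    """Return a copy of `row` with sensitive columns masked, unless the
    consumer is explicitly allowed to read that column."""
    return {
        col: val if (col not in SENSITIVE_COLUMNS or col in allowed) else "****"
        for col, val in row.items()
    }
```

Applying the mask at the serving layer means a downstream system that is accidentally over-queried still never receives the raw sensitive values.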

 For more details, refer to the following information: 
+  AWS Big Data Blog: [Enable fine-grained permissions for Amazon QuickSight authors in AWS Lake Formation](https://aws.amazon.com/blogs/big-data/enable-fine-grained-permissions-for-amazon-quicksight-authors-in-aws-lake-formation/) 
+  Amazon Redshift: [Role-based access controls](https://docs.aws.amazon.com/redshift/latest/dg/t_Roles.html) 
+  AWS Partner Network (APN) Blog: [Implementing SAML AuthN for Amazon EMR Using Okta and](https://aws.amazon.com/blogs/apn/implementing-saml-authn-for-amazon-emr-using-okta-and-column-level-authz-with-aws-lake-formation/) [Column-Level AuthZ with AWS Lake Formation](https://aws.amazon.com/blogs/apn/implementing-saml-authn-for-amazon-emr-using-okta-and-column-level-authz-with-aws-lake-formation/) 
+  AWS Big Data Blog: [Implementing Authorization and Auditing using Apache Ranger on Amazon EMR](https://aws.amazon.com/blogs/big-data/implementing-authorization-and-auditing-using-apache-ranger-on-amazon-emr/) 

# Best practice 4.4 – Establish an emergency access process to ensure that admin access is managed and used when required

 Emergency access allows expedited access to your workload in the unlikely event of an automated process or pipeline issue. This helps you rely on least-privilege access while still providing users the right level of access when they require it. 

## Suggestion 4.4.1 – Ensure that risk analysis is performed on your analytics workload by identifying emergency situations and a procedure to allow emergency access

 Identify the potential events that can originate from source systems, the analytics workload, and downstream systems. Quantify the risk of each event by its likelihood (low, medium, or high) and the size of its business impact (small, medium, or large). 
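A simple way to turn the likelihood and impact ratings into a priority order is to score each event and sort. The numeric weights below are an illustrative convention, not a prescribed standard.

```python
# Illustrative weights: likelihood x impact yields a risk score of 1-9.
LIKELIHOOD = {"low": 1, "medium": 2, "high": 3}
IMPACT = {"small": 1, "medium": 2, "large": 3}

def risk_score(likelihood: str, impact: str) -> int:
    """Combine the two ratings into a single comparable score."""
    return LIKELIHOOD[likelihood] * IMPACT[impact]

def prioritize(events: dict) -> list:
    """Return event names ordered from highest to lowest risk.
    `events` maps an event name to a (likelihood, impact) pair."""
    return sorted(events, key=lambda e: risk_score(*events[e]), reverse=True)
```

The resulting ordering identifies which emergency-access procedures to design and rehearse first.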

 For example, after you have identified the priority risks, discuss with the source and downstream system owners how to allow the analytics workload emergency access to those systems so that data processing can continue. 

# Best practice 4.5 – Track data and database changes

 Data auditing involves monitoring a database to track the actions of a user or process, and to audit the changes that have occurred to the data. 

## Suggestion 4.5.1 – Database triggering for data auditing

 A database trigger is procedural code that is automatically run in response to certain events on a particular table or view in a database. Database triggers can then be used to update an audit table with the changes that have occurred. The types of information that should be included in the auditing process include: the original and updated value of what has been updated, the process or stored procedure that made the update, and the time and date the update occurred. 
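The trigger-based approach can be demonstrated end to end with SQLite, which ships with Python. The table and column names below are hypothetical; the same pattern (an `AFTER UPDATE` trigger writing old value, new value, and timestamp to an audit table) applies to most relational engines.

```python
import sqlite3

# In-memory database: a base table, an audit table, and a trigger that
# records each balance change with its old value, new value, and timestamp.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE accounts_audit (
    account_id  INTEGER,
    old_balance REAL,
    new_balance REAL,
    changed_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER audit_balance AFTER UPDATE OF balance ON accounts
BEGIN
    INSERT INTO accounts_audit (account_id, old_balance, new_balance)
    VALUES (OLD.id, OLD.balance, NEW.balance);
END;
""")
con.execute("INSERT INTO accounts VALUES (1, 100.0)")
con.execute("UPDATE accounts SET balance = 250.0 WHERE id = 1")

# The trigger fired automatically on the UPDATE.
rows = con.execute(
    "SELECT account_id, old_balance, new_balance FROM accounts_audit"
).fetchall()
```

In a production system the audit table would typically also capture the process or stored procedure that made the change, as noted above.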

## Suggestion 4.5.2 – Enable advanced auditing

 If your database engine supports auditing as a native feature, you should enable the feature to record and audit database events such as connections, disconnections, tables queried, or types of queries issued. 

## Suggestion 4.5.3 – AWS Lake Formation time travel queries

 Apache Iceberg and Apache Hudi provide high-performance data lake table formats that work like SQL tables. Iceberg and Hudi make it simple to manage your data lake information and support SQL-style analytics. Data managed by Iceberg or Hudi is version-controlled, so there is a complete history of all data updates. For example, if you need to know the status of a record at a certain time, a time travel query lets you specify a point in time and returns the value that existed then, rather than the current value. 
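The time travel semantics described above can be illustrated with a plain Python sketch: given a record's version history, an "as of" lookup returns the version that was current at the requested date. The membership-tier values are made up for illustration; real Iceberg or Hudi queries express this in SQL (for example, Iceberg's `FOR TIMESTAMP AS OF`).

```python
from datetime import date

# Versioned values for one record: (valid_from, value).
# The tiers below are illustrative data, not a real dataset.
history = [
    (date(2022, 1, 1), "bronze"),
    (date(2022, 6, 1), "silver"),
    (date(2023, 1, 1), "gold"),
]

def as_of(versions, when):
    """Return the value in effect on `when`, or None if the record
    did not yet exist. Assumes `versions` is sorted by valid_from."""
    current = None
    for valid_from, value in versions:
        if valid_from <= when:
            current = value
    return current
```

A time travel query answers "what did this row say on that date" exactly as `as_of` does, without requiring the application to maintain its own history tables.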

 For more details, see [Use the AWS Glue connector to read and write Apache Iceberg tables with ACID transactions and perform time travel](https://aws.amazon.com/blogs/big-data/use-aws-glue-to-read-and-write-apache-iceberg-tables-with-acid-transactions-and-perform-time-travel/). 

## Suggestion 4.5.4 – Change Data Capture (CDC)

 CDC records `INSERT`s, `UPDATE`s, and `DELETE`s applied to relational database tables, and makes a log available of which relational database objects changed, where, and when. These change tables contain columns that reflect the column structure of the source table you have chosen to track, along with the metadata required to understand the changes that have been made. 
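The shape of a CDC change table can be sketched with an in-memory example: each applied operation is recorded with its before-image, after-image, and timestamp. The table and key names are illustrative; real CDC systems (such as AWS DMS) read these changes from the database's transaction log rather than intercepting writes.

```python
from datetime import datetime, timezone

change_log = []  # the "change table": one entry per INSERT, UPDATE, or DELETE

def apply_change(table: dict, op: str, key, row=None):
    """Apply an operation to an in-memory table and record it, CDC-style,
    with the metadata needed to understand what changed, where, and when."""
    before = table.get(key)
    if op == "DELETE":
        table.pop(key, None)
    else:  # INSERT or UPDATE
        table[key] = row
    change_log.append({
        "op": op,
        "key": key,
        "before": before,
        "after": table.get(key),
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })
```

Downstream consumers can replay `change_log` in order to reconstruct the table at any point, which is the core property CDC provides.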

 For more details, refer to the following information: 
+  AWS CloudTrail - [Secure Standardized Logging](https://aws.amazon.com/cloudtrail/) 
+  Amazon Aurora - [Advanced Auditing with an Amazon Aurora MySQL DB cluster](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Auditing.html) 
+  Amazon RDS Aurora - [Configuring an audit log to capture database activities for Amazon RDS](https://aws.amazon.com/blogs/database/configuring-an-audit-log-to-capture-database-activities-for-amazon-rds-for-mysql-and-amazon-aurora-with-mysql-compatibility/) 
+  AWS Database Migration Service - [AWS Database Migration Service (AWS DMS)](https://aws.amazon.com/dms/) 