Get started with SageMaker Lakehouse integrated access controls for Athena federated queries in Amazon SageMaker Unified Studio
Scaling data infrastructure creates challenges with data silos, fragmented access controls, and complex connectivity requirements. Data analysts need to access information across multiple storage systems but are frequently hindered by:
-
Complex connectivity setup - Configuring connections to various data sources requires technical expertise and access to configuration details that analysts may not have.
-
Fragmented governance - Different data sources have their own access control mechanisms, making consistent security policies difficult to implement.
-
Data duplication - Copying data between systems for analysis increases costs and creates data consistency risks.
To address the challenges of data silos and fragmented access, SageMaker Lakehouse with integrated access controls for Amazon Athena (Athena) federated queries offers:
-
Streamlining the creation of connections to diverse data sources through a unified interface
-
Centralizing access control management through AWS Lake Formation
-
Enabling in-place querying through federated catalogs without data movement
-
Providing fine-grained permissions at the catalog, database, table, and column levels
-
Exploring data for ad hoc reporting and proof of concept before setting up new zero-ETL pipelines
SageMaker Lakehouse provides a unified environment for accessing, discovering, preparing,
and analyzing data from various sources for machine learning (ML) and analytics workloads.
Athena complements this as a serverless query service that analyzes data lake and federated data
sources such as Amazon DynamoDB
What you'll learn
This guide shows you how to use SageMaker Lakehouse with integrated access controls for Athena federated queries. In this guide, you create an environment where data analysts can discover and query data across sources while administrators maintain consistent governance and appropriate security controls. This guide includes the following steps:
-
Set up federated connections between SageMaker Lakehouse and DynamoDB.
-
Create connections that serve as bridges between your SageMaker Lakehouse and external data sources.
-
Enable seamless data access while maintaining security boundaries.
-
Learn how connections eliminate the need for data movement or duplication.
-
-
Create federated catalogs for data discovery.
-
Establish catalogs that contains metadata and views about tables from your connected data sources.
-
Access data from the connected data source within your SageMaker Lakehouse environment.
-
Make external tables queryable through the Lakehouse interface.
-
Use catalogs as directories of available data assets to simplify discovery and access.
-
-
Implement column-level security using AWS Lake Formation
-
Configure fine-grained permissions for sensitive data.
-
Apply data access controls based on user roles and responsibilities.
-
Ensure consistent security policies across all data sources.
-
-
Validate security controls through Athena queries
-
Test access permissions with different user personas.
-
Verify that you properly protect sensitive data.
-
Confirm that authorized users can access appropriate data.
-
Prerequisites
Before you begin, make sure you have the following:
-
An AWS account with permission to create IAM roles and IAM policies.
-
An AWS Identity and Access Management
(IAM) user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI). -
Your administrator role added as a data lake administrator in AWS Lake Formation. For more information about how to create and add a data lake administrator, see Create a data lake administrator in AWS Lake Formation Developer Guide.
-
Administrator access to Amazon SageMaker Unified Studio. For more information about permissions of the administrator role, see Lake Formation personas and IAM permissions reference in the AWS Lake Formation Developer Guide. For more information about using the IAM Identity Center directory as your identity source, see Configure user access with the default IAM Identity Center directory in the AWS IAM Identity Center User Guide. For more information about how to access SageMaker, see Accessing Amazon SageMaker Unified Studio in the Amazon SageMaker Unified Studio Administrator Guide.
-
A SageMaker Unified Studio domain with the SQL Analytics profile enabled. For more information about creating an Amazon SageMaker Unified Studio domain and a project, see Setting up Amazon SageMaker in the Amazon SageMaker User Guide. For more information about SQL analytics project profile, see SAQL analytics project profile in the Amazon SageMaker Unified Studio Administrator Guide.
Note
Add your administrator as an SSO user to your domain. For more information about how to add an SSO user as a root domain owner, see Step 1 - Create an Amazon SageMaker unified domain in the Amazon SageMaker User Guide and Managing users in Amazon SageMaker Unified Studio in the Amazon SageMaker Unified Studio Administrator Guide.
-
Two SageMaker Unified Studio projects set up for this guide:
-
An Admin project for creating connections. This project has a SQL analytics project profile.
-
A Data Analyst project for analyzing data, which includes both administrator and analysts as members. This project has a SQL analytics project profile.
For more information about how to create a project in SageMaker Unified Studio, see Setting up Amazon SageMaker in the Amazon SageMaker User Guide.
-
Note
-
To find the project role ARN for each project, in the SageMaker Unified Studio, choose the name of the project, choose Project overview, and find Project role ARN under Project details. For more information, see Get project details in the Amazon SageMaker Unified Studio User Guide.
-
For more information about how to add members to your projects, see Add project members in the Amazon SageMaker Unified Studio User Guide.
-
Administrator access to a data source. SageMaker Lakehouse connections support several popular data sources, such as Amazon DynamoDB, PostgreSQL, and Amazon DocumentDB
. In this guide, we use DynamoDB as the data source. -
To set up data sources in DynamoDB:
-
You can create a new table in DynamoDB with the partition key
cust_idand the sort keyzipcodeand another columnmobilethrough AWS CloudShellby using the following command: aws dynamodb create-table \ --table-name customer_ddb \ --attribute-definitions \ AttributeName=cust_id,AttributeType=N \ AttributeName=zipcode,AttributeType=N \ --key-schema \ AttributeName=cust_id,KeyType=HASH \ AttributeName=zipcode,KeyType=RANGE \ --provisioned-throughput \ ReadCapacityUnits=5,WriteCapacityUnits=5 \ --table-class STANDARD -
You can populate the DynamoDB table with sample data by using the following commands:
# First item aws dynamodb put-item \ --table-name customer_ddb \ --item '{"cust_id": {"N": "11"}, "zipcode": {"N": "2000"}, "mobile": {"N": "11113333"}}' # Second item aws dynamodb put-item \ --table-name customer_ddb \ --item '{"cust_id": {"N": "12"}, "zipcode": {"N": "2000"}, "mobile": {"N": "22224444"}}' # Third item aws dynamodb put-item \ --table-name customer_ddb \ --item '{"cust_id": {"N": "13"}, "zipcode": {"N": "3000"}, "mobile": {"N": "33335555"}}' # Fourth item aws dynamodb put-item \ --table-name customer_ddb \ --item '{"cust_id": {"N": "14"}, "zipcode": {"N": "4000"}, "mobile": {"N": "55556666"}}'For more information about setting up a DynamoDB data source by using AWS CloudShell, see Amazon DynamoDB tutorial for AWS Cloud9 in the AWS Cloud9 User Guide.
-
To allow the appropriate actions for the SageMaker Unified Studio projects to take on your DynamoDB data source, add a resource-based policy to your DynamoDB data source. Attach the following policy for the table
customer_ddb.This example policy allows connecting to DynamoDB tables as a federated source. Replace
us-west-2with your AWS Region,111122223333with the AWS account ID where DynamoDB is deployed,customer_ddbwith the DynamoDB table that you intend to query from SageMaker Unified Studio,datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyywith the admin project role, anddatazone_usr_role_zzzzzzzzzzzzzz_aaaaaaaaaaaaaawith the data analyst project role in SageMaker Unified Studio. For more information about how to attach a policy to a DynamoDB data source, see Attach a policy to a DynamoDB existing table in the Amazon DynamoDB Developer Guide.
-
-
Step 1: Set up federated catalogs
The first step is to set up federated catalogs for our data sources using an administrator account.
To set up federated catalogs
-
On the SageMaker Unified Studio console, for the domain you created in the prerequisite, choose Open unified studio.
-
Choose your admin project name under Your projects.
-
Choose Data in the navigation pane.
-
In the Data explorer, choose the plus icon to add a data source.
-
Under Add data, choose Add connection, choose Next.
-
Choose Amazon DynamoDB, and choose Next.
-
For Name, enter the name for your data source of DynamoDB.
-
Choose Add data.
SageMaker Unified Studio connects to the DynamoDB data source that you created in the prerequisites, registers the data source as a federated catalog with SageMaker Lakehouse, and displays it in your data explorer. The catalog references your DynamoDB data source.
To explore and query your data
-
Choose your admin project from SageMaker Unified Studio.
-
Choose Data in the navigation pane.
-
Choose the SageMaker Lakehouse catalog that you just created to view its contents. Use the data explorer to drill down to a table and choose Query with Athena.
-
In the query editor, run a sample SQL query to understand your data.
For example, run the following query. Replace
your_federated_catalog_namewith the name of the federated catalog that you just created,defaultwith the name of your database, andyour_table_namewith the name of your DynamoDB table. To learn more, see SQL analytics in the Amazon SageMaker Unified Studio User Guide.select * fromyour_federated_catalog_name.default.your_table_namelimit 10;
Note
Access to the data source in the SageMaker Unified Studio project is governed by the policies for the project role. Users whoever become the member of this admin project use the same project role ARN and have the same full access level permissions to the data source. For more information about how to add members to your projects, see Add project members in the Amazon SageMaker Unified Studio User Guide. To grant fine-grained access permissions to different user personas, such as data analysts, create a separate data analyst project and add the data analyst users as project members of the data analyst project. Step 2 shows how to set up the fine-grained permissions.
For more information about creating connections in SageMaker Lakehouse, see Creating a connection in SageMaker Lakehouse in the Amazon SageMaker Unified Studio User Guide. For more information about creating catalogs, see Creating a catalog in the Amazon SageMaker Unified Studio User Guide.
Step 2: Set up fine-grained access permissions on federated catalogs
Security is a critical aspect of data access. SageMaker Lakehouse provides integrated access controls that work with federated queries in Athena to ensure proper governance. You can manage permissions at the catalog, database, and table levels. Administrators can apply access controls at different levels of granularity to ensure sensitive data remains protected while expanding data access.
This step is to delegate access permissions on your DynamoDB federated catalogs to other users. You grant permissions to the data analyst persona. To set up the fine-grained access permissions to the data analyst persona, you need to add permissions on your DynamoDB federated catalogs to the SageMaker Unified Studio data analyst project role that you created in the prerequisites section. This will ensure that access controls that you specify are enforced when the data is queried. For more information about the Lake Formation personas and IAM permissions, see Lake Formation personas and IAM permissions reference in the AWS Lake Formation Developer Guide.
To set up fine-grained access permissions on federated catalog and database
-
Navigate to Lake Formation in the AWS Management Console
as an administrator. -
In the Lake Formation console, under Data Catalog in the navigation pane, choose Catalogs.
-
Choose the federated catalog name that you set up in Step 1: Set up federated catalogs. You'll see the databases.
-
Choose the database name in the catalog. You can see details for the database and manage permissions.
-
To set up permissions for the federated catalog and database to your SageMaker Unified Studio data analyst project (the data analyst project that you set up in prerequisites), from the Actions menu, choose Grant.
-
For Principal type, choose Principals.
-
For Principals, choose IAM users and roles.
-
For IAM users and roles, choose the project role ARN that you got from your data analyst project in the prerequisites section.
-
For LF-Tags or catalog resources, choose Named Data Catalog resources.
-
For Catalogs, choose the federated catalog name for the source (the federated catalog that you set up in Step 1) to grant permissions on.
-
For Databases, the console populates the databases for your DynamoDB data source.
-
For Database permissions - Database permissions, select Describe.
-
Choose Grant.
To set up fine-grained access permissions on the tables
For example, if you wish to restrict access to a sensitive column containing the mobile phone number for each customer, the steps are as follows.
-
Navigate to Lake Formation in the AWS Management Console
as an administrator. -
In the Lake Formation console, under Data Catalog in the navigation pane, choose Tables.
-
Under Choose catalog, choose the federated catalog name that you set up in Step 1.
-
Choose the table name in the catalog. You can see details for the table and manage permissions.
-
From the Actions menu, choose Grant.
-
For Principal type, choose Principals.
-
For Principals, choose IAM users and roles.
-
For IAM users and roles, choose the project role ARN that you got from your data analyst project in the prerequisites section.
-
For LF-Tags or catalog resources, choose Named Data Catalog resources.
-
For Catalogs, choose the federated catalog name for the source (the federated catalog that you set up in Step 1) to grant permissions on.
-
For Databases, the console populates the databases for our DynamoDB data source.
-
For Tables, the console populates the tables for your DynamoDB data source.
-
For Table permissions - Table permissions, select Select.
-
For Data permissions, choose Column-based access.
-
For Choose permission filter, choose Include columns.
-
For Select columns, choose columns
zipcodeandcust_id. -
Choose Grant.
In this example, we demonstrate how to set up a basic column-level filter to restrict access to sensitive data. However, SageMaker Lakehouse supports a broad range of fine-grained access control scenarios beyond column filters that allow you to meet complex security and compliance requirements across diverse data sources. For more information about managing permissions on catalogs, see Adding existing databases and catalogs using AWS Lake Formation permissions in the Amazon SageMaker Unified Studio User Guide and Managing Lake Formation Permissions in the AWS Lake Formation Developer Guide.
By implementing these fine-grained access controls, you can ensure that users only access data they're authorized to see, maintaining compliance with your organization's security policies. This creates a consistent security model across your data sources. Now, you have successfully set up fine-grained access permissions on your DynamoDB federated catalog.
Step 3: Validate fine-grained access permissions on federated catalogs
After you set up federated catalogs with fine-grained access permissions in Step 2, run queries to confirm access permissions are working as expected.
To validate fine-grained access permissions on federated catalogs
-
On the SageMaker Unified Studio console, for the domain you created in the prerequisite, choose Open unified studio.
-
Choose your data analyst project name under Your projects.
-
From the Build menu, choose Query Editor.
-
In the Data explorer, expand Lakehouse, choose the DynamoDB catalog that you created in Step 1.
-
Drill down to the table that you set up fine-grained access permissions in Step 2, and choose Query with Athena to run a sample query.
For example, run the following query. Replace
your_federated_catalog_namewith the name of your catalog,defaultwith the name of your database, andyour_table_namewith the name of your DynamoDB table. To learn more, see SQL analytics in the Amazon SageMaker Unified Studio User Guide.select * fromyour_federated_catalog_name.default.your_table_namelimit 10;Note how permissions are working as expected because the query result doesn't include the mobile phone number column that was visible in the admin project view.
To have other users under the data analyst persona get the fine-grained access permissions
-
Create data analyst SSO users or groups. For more information about how to add an SSO user to your domain, see Managing users in Amazon SageMaker Unified Studio in the Amazon SageMaker Unified Studio Administrator Guide.
-
Add these SSO users to your SageMaker Unified Studio domain. For more information about how to add an SSO user to your domain, see Managing users in Amazon SageMaker Unified Studio in the Amazon SageMaker Unified Studio Administrator Guide.
-
Add these users as members ("Contributor") to your SageMaker Unified Studio data analyst project. The data analyst users can have access to this data analyst project and will only have access to a subset of data that's defined by the data lake administrator in Step 2. For more information about how to add members to your projects, see Add project members in the Amazon SageMaker Unified Studio User Guide.
Step 4: Clean up
Make sure you remove the SageMaker Lakehouse resources to mitigate any unexpected costs. Delete the following resources:
-
The connections and catalogs that you created in Step 1.
Specifically, choose your project from SageMaker Unified Studio. Choose Data in the navigation pane. Choose the SageMaker Lakehouse catalog that you created in Step 1. Choose the Actions menu and choose Remove. Type "Confirm" and choose Remove connection.
-
The underlying DynamoDB data sources that you created in the prerequisites. For more information about deleting a DynamoDB table, see Delete your DynamoDB table to clean up resources in the Amazon DynamoDB Developer Guide.
-
The SageMaker Unified Studio admin and data analyst projects that you created in the prerequisites. For more information about deleting projects, see Delete a project in the Amazon SageMaker Unified Studio User Guide.
-
The SageMaker Unified Studio domain that you created in the prerequisites.
Next steps
Now that you've successfully set up SageMaker Lakehouse integrated access controls for Athena federated queries, consider these next steps to further enhance your data governance and analytics capabilities:
-
Expand your data sources - Connect supported data sources such as PostgreSQL, MySQL, or Amazon DocumentDB, to create a unified data ecosystem with consistent access controls.
-
Implement advanced security patterns - Explore row-level security, cell-level filtering, and attribute-based access control
to meet complex compliance requirements across your organization. For more information, see Managing Lake Formation Permissions in the AWS Lake Formation Developer Guide. -
Build analytics workflows - Create end-to-end analytics pipelines that leverage federated queries for data preparation and ML model training.
-
Integrate with visualization tools - Connect Amazon QuickSight
to your federated catalogs to create dashboards and visualizations with the same security controls. -
Automate governance processes - Use the Amazon Athena REST API (CreateDataCatalog), AWS CloudFormation (
AWS::Athena::DataCatalog) or the AWS CDK (CfnDataCatalog) to automate the creation and management of federated connections and access controls. After creating a data catalog, you need to create a data source connection and register your connection as a Glue Data Catalog.
This integration between SageMaker Lakehouse and Athena federated queries provides
significant benefits for organizations with diverse data ecosystems. Data scientists can now
analyze customer behavior by combining transaction data from PostgreSQL with clickstream data
in Amazon S3
For more information about federated queries in Athena and the data sources that support
fine-grained access controls, see Register
your connection as a Glue Data Catalog in the Athena User
Guide. For more information about extending your SageMaker Lakehouse environment,
see Add Data to SageMaker Lakehouse and Publishing Data in the Amazon SageMaker Unified Studio User
Guide. For more information about specific use cases and implementation examples,
see Simplify data access for your enterprise using SageMaker Lakehouse