

# Getting started with the lakehouse architecture of Amazon SageMaker
<a name="lakehouse-get-started"></a>

This guide helps you accomplish common tasks like finding relevant datasets, running SQL queries against your data warehouse and data lake simultaneously, collaborating with team members through data publishing, and maintaining data governance standards. Your administrator will provide the necessary access permissions and project roles to get started.

**Topics**
+ [Prerequisites](#lakehouse-get-started-prerequisites)
+ [Create a project](#lakehouse-create-project)
+ [Browse data](#lakehouse-get-started-browse)
+ [Upload data](#lakehouse-getting-started-upload-data)
+ [Query data](#lakehouse-get-started-query)
+ [Adding data sources in lakehouse architecture](lakehouse-add-data.md)
+ [Publishing data in lakehouse architecture](lakehouse-publish.md)

## Prerequisites
<a name="lakehouse-get-started-prerequisites"></a>
+ Your administrator must grant you access to the [lakehouse architecture](https://docs.aws.amazon.com//sagemaker-unified-studio/latest/userguide/getting-started-access-the-portal.html).

  If you don't have access to it, contact your administrator. For more information, see [https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/getting-started-access-the-portal.html](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/getting-started-access-the-portal.html).
+ You must have an Amazon SageMaker Unified Studio project with the proper project membership role.

  If you don't have proper access to a project, contact your administrator. To view your project membership role, choose **Actions** in the top right corner of the project overview page, then choose **Manage members**. Your membership role appears in the **Role** column.

## Create a project
<a name="lakehouse-create-project"></a>

You can create a project from a project profile, which defines a template for projects in your domain. To use the lakehouse architecture, your project must be created using either the [Data analytics and AI-ML model development](https://docs.aws.amazon.com//sagemaker-unified-studio/latest/adminguide/data-analytics-ai-model-development.html) or the [SQL analytics](https://docs.aws.amazon.com//sagemaker-unified-studio/latest/adminguide/project-profiles.html) project profile. For more information about creating a project, see [Create a project](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/getting-started-create-a-project.html) in the Amazon SageMaker Unified Studio User Guide.

When using lakehouse architecture, you can create the following resources in the lakehouse:

1. **Databases in AWS Glue Data Catalog**

   The lakehouse architecture is implemented on AWS Glue and AWS Lake Formation in your AWS account.

1. **A catalog to store data in Redshift Managed Storage (RMS) format**

   You will create a catalog in RMS format. To view the catalog, navigate to the AWS Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/). You should see the catalog in the **Catalogs** list.

1. **Provisioning permissions**

   You will create an IAM role when you create a project. Each project has a dedicated IAM role, which has permissions on the resources that are created from the project. The Amazon Resource Name (ARN) of this IAM role is visible in the **Project details** section of the **Project overview** page.
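
Because the project role ARN is pasted into other consoles later in this guide, it can help to confirm that a copied value has the expected shape first. The following is a minimal sketch in Python, not part of any AWS SDK. The function name `parse_role_arn` and the sample ARN are hypothetical.

```python
import re

# Shape of an IAM role ARN: arn:<partition>:iam::<12-digit account ID>:role/<path and name>.
# IAM is a global service, so the region field between "iam" and the account ID is empty.
_ROLE_ARN = re.compile(r"^arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/(.+)$")


def parse_role_arn(arn: str) -> dict:
    """Split an IAM role ARN into partition, account ID, and role name."""
    match = _ROLE_ARN.match(arn.strip())
    if match is None:
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    partition, account_id, role_path = match.groups()
    # The role name is the last path segment; earlier segments are the role path.
    return {
        "partition": partition,
        "account_id": account_id,
        "role_name": role_path.rsplit("/", 1)[-1],
    }


# Hypothetical project role ARN copied from the Project overview page.
print(parse_role_arn("arn:aws:iam::111122223333:role/datazone_user_role_projectid"))
```

A `ValueError` here usually means the value was truncated when copied or is not a role ARN.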

## Browse data
<a name="lakehouse-get-started-browse"></a>

You can browse data in lakehouse architecture by completing the following steps.

**To browse data**

1. Choose a project to view the data.

1. On the project page, from the left navigation, choose **Data**. This opens the **Data** explorer in the middle of the page.

   The **Data** explorer includes: **Lakehouse**, **Redshift**, and **S3**.

1. Expand **Lakehouse** to view catalogs, databases, and tables.

## Upload data
<a name="lakehouse-getting-started-upload-data"></a>

You can upload data in CSV or JSON format to a catalog. To upload data, follow the instructions in [Uploading data](lakehouse-upload-data.md).

After the upload completes, the table is listed within the database under **AwsDataCatalog**.

## Query data
<a name="lakehouse-get-started-query"></a>

You can query data using a supported query editor.

**To query data**

1. On **Lakehouse**, choose **AwsDataCatalog** on top. Expand the catalog to view the list of databases. Choose a database.

1. From the selected database, choose a table. Then choose the three-dot menu to the right of the table to view the supported query tools.

1. Do one of the following:
   + Choose **Query with Athena**. This opens the **Data explorer** page where you can run SQL queries. For more information, see [SQL reference for Athena](https://docs.aws.amazon.com//athena/latest/ug/ddl-sql-reference.html).
   + Choose **Query with Amazon Redshift**. This opens the **Data explorer** page where you can run SQL queries. For more information, see [Querying a database using the query editor v2](https://docs.aws.amazon.com//redshift/latest/mgmt/query-editor-v2.html).

To subscribe to an asset, see [Request subscription to assets in Amazon SageMaker Unified Studio](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/subscribe-to-data-assets-managed.html).

To publish data to the catalog from the lakehouse inventory, see [Publishing data in lakehouse architecture](lakehouse-publish.md).
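
The query steps in this section use the console. If you want to script a similar Athena query, a minimal sketch with the AWS SDK for Python (boto3) might look like the following. The table, database, and workgroup names are hypothetical, and the sketch assumes that your credentials can use Athena and that the workgroup has a query result location configured.

```python
def build_athena_request(sql: str, database: str, workgroup: str,
                         catalog: str = "AwsDataCatalog") -> dict:
    """Build keyword arguments for the Athena StartQueryExecution API."""
    if not sql.strip():
        raise ValueError("sql must not be empty")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "WorkGroup": workgroup,
    }


def start_query(request: dict) -> str:
    """Submit the request and return the query execution ID."""
    import boto3  # third-party package, imported lazily so the builder works without it

    response = boto3.client("athena").start_query_execution(**request)
    return response["QueryExecutionId"]


request = build_athena_request(
    sql="SELECT * FROM sales_orders LIMIT 10",  # hypothetical table name
    database="example_db",                      # hypothetical database name
    workgroup="example-workgroup",              # hypothetical workgroup name
)
print(request)
```

Building the request separately from submitting it lets you inspect the catalog and database context before any query runs.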

# Adding data sources in lakehouse architecture
<a name="lakehouse-add-data"></a>

The lakehouse architecture supports several data sources. If you are new to data connections in the lakehouse architecture, see [Data connections in the lakehouse architecture of Amazon SageMaker](lakehouse-data-connection.md).

**Topics**
+ [Creating connections in lakehouse architecture](lakehouse-create-connection.md)
+ [Uploading data](lakehouse-upload-data.md)
+ [Creating a catalog](lakehouse-create-catalog.md)
+ [Adding existing databases and catalogs using AWS Lake Formation permissions](lakehouse-add-catalog.md)
+ [Creating an AWS Glue database via Amazon SageMaker Unified Studio](lakehouse-add-new-database.md)
+ [Deleting an AWS Glue database via Amazon SageMaker Unified Studio](lakehouse-delete-new-database.md)
+ [Amazon S3 tables integration](lakehouse-s3-tables-integration.md)

# Creating connections in lakehouse architecture
<a name="lakehouse-create-connection"></a>

Amazon SageMaker Unified Studio provides an interface for managing and utilizing data connections across various AWS services and external data sources. With Amazon SageMaker Unified Studio, you create, configure, and manage connections to databases, data warehouses, and applications all from a single platform. Amazon SageMaker Unified Studio allows you to explore your connected data sources, preview sample data, and seamlessly use these connections in SQL queries and Spark notebooks without having to switch between different interfaces or manage complex connection details manually.

## Access the data explorer in a project
<a name="access-data-explorer"></a>

1. Open your web browser and navigate to Amazon SageMaker Unified Studio.

1. Enter your corporate credentials (usually integrated with AWS IAM Identity Center).

1. After successful authentication, you'll be directed to the Amazon SageMaker Unified Studio home page. On the home page, you'll see a list of projects you have access to. Select the project you want to work with by clicking on its name.

1. From the dropdown menu, select the **Data** or **Data Management** option. This will open the Data section of the project overview page. In this data explorer, you can see a tree-like structure representing your data sources.

## Create a new connection to add data sources
<a name="create-new-connection"></a>

**To add a new data source**

1. In the data explorer, choose the **+** button to start adding a new data source.

1. In the modal, select **Add connection**. You'll be presented with a gallery of connector options. Select the connector you need. For supported data sources, see [Data connections in the lakehouse architecture of Amazon SageMaker](lakehouse-data-connection.md).
**Note**  
The lakehouse architecture currently supports only lowercase table, column, and database names. For the best experience in the lakehouse architecture, make sure that all database identifiers are in lowercase.

1. You must configure your connector details. For example, if you choose to use a DynamoDB connection (preview), fill in the required fields, which can include:
   + Name: A unique identifier for this connection in Amazon SageMaker Unified Studio.
   + Description (optional): A description of the connection.
**Note**  
Each supported data source can have different parameters for the connection. Contact your administrator if you need them.
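
To check identifiers against the lowercase requirement in the earlier note before you create a connection, you could run a small script such as the following. This is an illustrative sketch only. The function name and the sample identifiers are hypothetical.

```python
def find_non_lowercase(identifiers):
    """Return the identifiers that contain at least one uppercase character."""
    return [name for name in identifiers if name != name.lower()]


# Hypothetical names collected from a source database before connecting it.
names = ["sales_db", "Orders", "customer_id", "ShipDate"]
print(find_non_lowercase(names))  # ['Orders', 'ShipDate']
```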

To see your DynamoDB tables displayed in lakehouse architecture after you add the connection, your administrator must grant you access through resource policies in the Amazon DynamoDB console.

**To grant access to a DynamoDB table, your administrator can complete the following steps.**

1. Sign in to the AWS Management Console and open the Amazon DynamoDB console at [https://console.aws.amazon.com/dynamodb/](https://console.aws.amazon.com/dynamodb/).

1. On the left navigation of the DynamoDB console, choose **Tables**.

1. From the **Tables** page, choose the table to add access to.

1. On the details page of the selected table, choose **Permissions**.

1. In the **Resource-based policy for table** section, update the policy so that the `Condition` element contains the project role ARN.
**Note**  
You can find the project role ARN in the **Project details** section of the **Project overview** page.

   The following is an example policy statement. It allows the IAM role named `datazone_user_role_projectid` to perform the listed actions (`Query`, `Scan`, `DescribeTable`, `PartiQLSelect`) on the specified DynamoDB table. Your administrator decides which actions to allow or deny.

   ```
   {
       "Sid": "Statement1",
       "Effect": "Allow",
       "Principal": "*",
       "Action": [
           "dynamodb:Query",
           "dynamodb:Scan",
           "dynamodb:DescribeTable",
           "dynamodb:PartiQLSelect"
       ],
       "Resource": "arn:aws:dynamodb:region:account:table/table_name",
       "Condition": {
           "ArnEquals": {
               "aws:PrincipalArn": "arn:aws:iam::account:role/datazone_user_role_projectid"
           }
       }
   }
   ```
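
If your administrator maintains policies for several tables, generating the document from the two ARNs can reduce copy-and-paste mistakes. The following Python sketch wraps the example statement in a complete policy document. The function name and the sample ARNs are hypothetical, and the action list simply mirrors the example above.

```python
import json


def build_table_policy(table_arn: str, role_arn: str) -> str:
    """Return a resource-based policy document (JSON) for one DynamoDB table."""
    statement = {
        "Sid": "Statement1",
        "Effect": "Allow",
        "Principal": "*",
        "Action": [
            "dynamodb:Query",
            "dynamodb:Scan",
            "dynamodb:DescribeTable",
            "dynamodb:PartiQLSelect",
        ],
        "Resource": table_arn,
        # Only requests made by the project role satisfy this condition.
        "Condition": {"ArnEquals": {"aws:PrincipalArn": role_arn}},
    }
    # Version and Statement are the standard top-level elements of a policy document.
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=4)


# Placeholder ARNs; replace them with your table ARN and project role ARN.
print(build_table_policy(
    "arn:aws:dynamodb:us-east-1:111122223333:table/example_table",
    "arn:aws:iam::111122223333:role/datazone_user_role_projectid",
))
```

The printed JSON can be pasted into the resource-based policy editor for the table.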

## Explore a connected data source
<a name="explore-connected-data-source"></a>

After you have connected your data source, you can explore the data source in the data explorer.

1. After your connection is created, return to the data explorer.

1. You should now see your new connection listed in **Lakehouse**.

1. Expand the new connection to view available databases.

1. Expand a database to explore its schema.

1. Select a schema to view more details, such as the schema details and a list of its tables. You can then examine an individual table by selecting it.

1. You will be able to see tabs for **Columns** and **Sample data**. In the **Columns** view, you can view a list of columns in the table, as well as the data types for each column. In the **Sample data** view, you can see the rows of data from the table and use built-in sorting and filtering options to explore the data.

## Authentication and tagging for creating connections
<a name="data-connection-authentication"></a>

Your administrator must create credentials and configure the secret tags for you before you create a connection.

**Credentials**

When creating a connection, if you choose a data source that requires credentials for **Authentication**, contact your administrator because they must create and provide these credentials. There are two types of credentials:
+ User name and password
+ AWS Secrets Manager

**Secret tags**
+ To ensure that the secret can be used only for a particular project, your administrator must tag the secret with the `AmazonDataZoneProject` tag key and set the value to the project ID (`projectId`).
+ To use the secret across multiple projects, your administrator must tag the secret with `for-use-with-all-datazone-projects = true`.
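
The two tagging options above can be expressed as tag lists in the key-value format that AWS Secrets Manager accepts. The following Python sketch builds them. The function name and the sample project ID are hypothetical.

```python
def project_secret_tags(project_id=None):
    """Build secret tags for one project, or for all projects if no ID is given."""
    if project_id:
        # Restrict the secret to a single project.
        return [{"Key": "AmazonDataZoneProject", "Value": project_id}]
    # Allow the secret to be used across all projects.
    return [{"Key": "for-use-with-all-datazone-projects", "Value": "true"}]


print(project_secret_tags("abc123example"))  # hypothetical project ID
print(project_secret_tags())
```

An administrator could pass either list as the `Tags` argument of the Secrets Manager `tag_resource` call in boto3, or enter the same key and value pairs in the console.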

# Uploading data
<a name="lakehouse-upload-data"></a>

You can upload data to the lakehouse architecture.

**To upload data**

1. In the **Data** section in the middle of the project page, choose **+** at the top. This opens **Add data source** on the right.

1. On **Add data source**, choose **Upload data**. 

1. Choose **Click to upload** or drag and drop a CSV or JSON file. Complete the information in the form.

1. Choose **Upload data**.

# Creating a catalog
<a name="lakehouse-create-catalog"></a>

You can create a catalog for your Redshift Managed Storage (RMS) objects.

**To create a catalog**

1. In the **Data** explorer in the middle of the project page, choose **+** at the top. This opens **Add data source** on the right.

1. On **Add data source**, choose **Create catalog**. Enter a name for your catalog.

1. Choose **Create**. 

# Adding existing databases and catalogs using AWS Lake Formation permissions
<a name="lakehouse-add-catalog"></a>

You can add existing databases and catalogs to the lakehouse architecture.

**To add existing databases and catalogs using AWS Lake Formation permissions**

1. Sign in to the lakehouse architecture by using the link your administrator gave you. If you don't have access to it, contact your administrator.

1. Choose a project to open the project page.

1. On the left navigation, choose **Project overview**. On **Project details**, copy the project role ARN.

1. Open the AWS Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. On the left navigation, from **Data catalog**, choose **Catalogs**.

1. On the **Catalogs** list view, choose a catalog you want to add to lakehouse architecture. From **Actions** on the right, choose **Grant**.

1. On the **Grant data lake permissions** page, choose **IAM users and roles** from **Principals**. Paste the project role ARN that you copied in step 3.

1. On **Catalog permissions**, choose **Super user**. Choose **Grant**.

After you complete all the steps successfully, go back to the project page in the lakehouse architecture. You should see the Lake Formation catalog added to your lakehouse.
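
The console grant in the preceding steps corresponds to a Lake Formation `GrantPermissions` request. The following Python sketch only builds the request parameters so that you can review them. It is an assumption, not something this guide confirms, that the console's **Super user** option corresponds to the `SUPER_USER` permission value and that the catalog is addressed through `Resource.Catalog.Id`. Verify both against the Lake Formation API reference before use. The function name and the sample values are hypothetical.

```python
def build_catalog_grant(role_arn: str, catalog_id: str,
                        permissions=("SUPER_USER",)) -> dict:
    """Build keyword arguments for a Lake Formation GrantPermissions request."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        # Assumption: the catalog is identified by its ID in the Catalog resource.
        "Resource": {"Catalog": {"Id": catalog_id}},
        # Assumption: SUPER_USER is the API value behind the console's "Super user".
        "Permissions": list(permissions),
    }


grant = build_catalog_grant(
    role_arn="arn:aws:iam::111122223333:role/datazone_user_role_projectid",  # placeholder
    catalog_id="<catalog-id>",  # placeholder; use the ID of the catalog you selected
)
print(grant)
```

If the assumptions hold, the dictionary could be passed to `boto3.client("lakeformation").grant_permissions(**grant)` by a principal that is allowed to grant permissions on the catalog.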

# Creating an AWS Glue database via Amazon SageMaker Unified Studio
<a name="lakehouse-add-new-database"></a>

You can use this procedure to create a new AWS Glue database in your catalog in the lakehouse. 

**Note**  
To complete this task, the project where you're creating the database must have been created with a project profile that contains the LakeHouseDatabase blueprint configured with the **On-Demand** mode. For more information, see [Supported blueprints](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/supported-blueprints.html).

1. On the **Data** explorer in the middle of the project page, choose the ellipsis icon to the right of your catalog, and then choose **Create database**.

1. In the **Create database** pop-up window, specify the name for the new database and then choose **Create database**.

# Deleting an AWS Glue database via Amazon SageMaker Unified Studio
<a name="lakehouse-delete-new-database"></a>

You can use this procedure to delete an AWS Glue database in your catalog in the lakehouse. 

**Note**  
To complete this task, the project where you're deleting the database must have been created with a project profile that contains the LakeHouseDatabase blueprint configured with the **On-Demand** mode. For more information, see [Supported blueprints](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/supported-blueprints.html).

1. On the **Data** explorer in the middle of the project page, choose the ellipsis icon to the right of the database within your catalog that you want to delete, and then choose **Remove**. 

1. In the **Drop database** pop-up window, confirm the deletion and then choose **Drop database**.

# Amazon S3 tables integration
<a name="lakehouse-s3-tables-integration"></a>

The lakehouse architecture unifies all your data across Amazon S3 data lakes, Amazon Redshift data warehouses, and third-party data sources without having to copy data. Amazon S3 Tables delivers the first cloud object store with built-in Apache Iceberg support. The lakehouse architecture integrates with Amazon S3 Tables so you can access S3 Tables from AWS analytics services, such as Amazon Redshift, Athena, Amazon EMR, AWS Glue, or Apache Iceberg-compatible engines (Apache Spark or PyIceberg).

The lakehouse architecture integration with Amazon S3 Tables helps you secure analytic workflows by joining data from Amazon S3 Tables with sources, such as Amazon Redshift data warehouses, third-party, and federated data sources (Amazon DynamoDB or PostgreSQL). The lakehouse architecture also enables centralized management of fine-grained data access permissions for S3 Tables and other data, and consistently applies them across all engines. To get started, complete the steps in the following sections.

**Prerequisites** - Complete all the steps in [Getting started with the lakehouse architecture of Amazon SageMaker](lakehouse-get-started.md).

**Enable Amazon S3 integration**

1. Navigate to the [Amazon S3 console](https://console.aws.amazon.com/s3). In the left navigation pane, choose **Table buckets**. 

1. Choose **Create table bucket**.

1. On the **Create table bucket** page, enter a **Table bucket name** and select **Enable integration**.

1. Choose **Create table bucket**. 

1. You will see confirmation when Amazon S3 completes integration of your table buckets with the lakehouse architecture.

**Onboard S3 Tables in the lakehouse architecture**

To provide access to S3 tables, complete the following steps:

1. Navigate to the [AWS Lake Formation](http://console.aws.amazon.com/lakeformation) console.

1. In the left navigation pane, choose **Catalogs** and choose **S3tablescatalog**.

1. From **S3tablescatalog**, under **Objects**, choose the name of your newly created **table bucket**.

1. From the **Actions** menu, select **Grant**.

1. In **Grant permissions**, under **IAM users and roles**, select your Amazon SageMaker Unified Studio project role. To grant full access, under **Catalog Permissions > Grant**, select **Super user**. 

**Create S3 Table and add data in the lakehouse architecture**

1. Navigate to Amazon SageMaker Unified Studio, and select the project.

1. From the **Build** menu, select **Query Editor**, and ensure you have **Athena** selected in **Connections**.

1. Create a database using SQL.

   ```
   CREATE DATABASE "s3tablescatalog/<Your Bucket Name>".<YourDBName>;
   ```

1. Create an S3 table using SQL.

   ```
   CREATE TABLE "s3tablescatalog/<Your Bucket Name>".<YourDBName>.<YourTableName> 
   ( c_salutation string, 
     c_login string, 
     c_first_name string, 
     c_last_name string, 
     c_email_address string)
     TBLPROPERTIES ( 
     'table_type'='ICEBERG'  );
   ```

1. Add data using SQL.

   ```
   INSERT INTO "s3tablescatalog/<Your Bucket Name>".<YourDBName>.<YourTableName>
    VALUES('Dr.','1381546','Joyce','Deaton','Joyce.Deaton@qhtrwert.edu');
   ```
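
To read back the row you inserted, you can run a `SELECT` against the same three-part table name. The following Python sketch builds that statement so the catalog, database, and table names stay consistent with the statements above. The helper names and the sample bucket, database, and table names are hypothetical.

```python
def s3_table_name(bucket: str, database: str, table: str) -> str:
    """Return the three-part name used by the SQL statements in this procedure."""
    return f'"s3tablescatalog/{bucket}".{database}.{table}'


def preview_query(bucket: str, database: str, table: str, limit: int = 10) -> str:
    """Build a SELECT statement that reads back a few rows from the S3 table."""
    return f"SELECT * FROM {s3_table_name(bucket, database, table)} LIMIT {int(limit)};"


# Hypothetical bucket, database, and table names.
print(preview_query("example-bucket", "example_db", "customers"))
# SELECT * FROM "s3tablescatalog/example-bucket".example_db.customers LIMIT 10;
```

You can paste the printed statement into the query editor with **Athena** selected in **Connections**.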

You can now use the following integrated analytics services:
+ [Amazon Athena](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-athena.html) - create databases, tables, query and add data in S3 Tables.
+ [Amazon Redshift](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-redshift.html) - query data from S3 Tables.
+ [Amazon EMR](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-emr.html) - create table, namespace, query and add data in S3 Tables.
+ [AWS Glue](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-glue.html) - create table, namespace, query and add data in S3 Tables.
+ [AWS Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/create-s3-tables-catalog.html) - grant fine-grained permissions for S3 table catalogs, databases, tables, columns, and cells.

**Note**  
Access to S3 Tables with the lakehouse architecture is available in the [AWS Regions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-regions-quotas.html) where S3 Tables are available. 

# Publishing data in lakehouse architecture
<a name="lakehouse-publish"></a>

After you have added data in the lakehouse architecture, you can publish the data to share it with other users in the lakehouse architecture. Data that is published is viewable as an asset in the project catalog and the Amazon SageMaker Catalog, and other users can create subscription requests in the Amazon SageMaker Catalog to include that data in their projects.

To publish data in the lakehouse architecture, complete the following steps:

1. Navigate to the lakehouse architecture using the URL from your administrator, and log in using your SSO or AWS credentials.

1. Navigate to the project that contains the data that you want to publish in the lakehouse architecture. To do this, use the center menu at the top of the landing page and choose **Browse all projects**, then choose the name of the project that you want to navigate to.

1. In the center menu, choose **Data**. This takes you to the Data page.

1. Do either of the following:
   + If you want to publish a regular AWS Glue table, expand the catalog in the data navigation to view the list of databases in lakehouse architecture, then choose a database that contains the asset that you want to publish. Choose this table from the selected database and then proceed to the rest of the steps in this procedure to publish this table to the catalog. 
   + If you want to publish an Amazon S3 table to the catalog, you must first complete the following steps to create a data source for the S3 Tables catalog and schedule its run job. Then you can proceed to the rest of the steps in this procedure to publish the S3 table to the catalog.
     + Navigate to **Data sources** and then choose **Create data source**. 
     + On the **Step 1: Define source** page, specify the name for this data source. Under **Data source type**, choose **AWS Glue (Lakehouse)**. Under **Data Selection**, choose **Enter the catalog name** and then specify the name of your S3 tables catalog (`s3tablescatalog/<catalog name>`). Choose your database from that catalog by using the drop-down menu, and then choose **Next**. 
     + On the **Step 2: Add details** page, leave all the default settings and choose **Next**.
     + On the **Step 3: Set up schedule** page, choose a run preference and then choose **Next**. 
     + On the **Step 4: review** page, review your selections and then choose **Create**.

       Once the data source for the S3 tables catalog is created and run, you can proceed with the rest of the steps below to locate your S3 table and publish it to the catalog.

1. Expand the **Actions** menu, then choose **Publish to catalog**.

1. Confirm the action in the pop-up window by choosing **Publish to catalog**.

   The lakehouse architecture then fetches metadata for the asset. After a few minutes, the metadata is fetched and a success message appears.

1. (Optional) Choose **View details** to view the asset in the project catalog.

After the asset is successfully published, you can view it in the **Assets** section of the project catalog, and users in other projects can subscribe to it from the Amazon SageMaker Catalog.

You can use the project catalog to re-publish the data if you make changes, or to unpublish the data from Amazon SageMaker Catalog. For more information, see [Data inventory and publishing](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/data-publishing.html).