

# Adding data sources in lakehouse architecture
<a name="lakehouse-add-data"></a>

The lakehouse architecture supports several data sources. If you are new to data connections in the lakehouse architecture, see [Data connections in the lakehouse architecture of Amazon SageMaker](lakehouse-data-connection.md).

**Topics**
+ [Creating connections in lakehouse architecture](lakehouse-create-connection.md)
+ [Uploading data](lakehouse-upload-data.md)
+ [Creating a catalog](lakehouse-create-catalog.md)
+ [Adding existing databases and catalogs using AWS Lake Formation permissions](lakehouse-add-catalog.md)
+ [Creating an AWS Glue database via Amazon SageMaker Unified Studio](lakehouse-add-new-database.md)
+ [Deleting an AWS Glue database via Amazon SageMaker Unified Studio](lakehouse-delete-new-database.md)
+ [Amazon S3 tables integration](lakehouse-s3-tables-integration.md)

# Creating connections in lakehouse architecture
<a name="lakehouse-create-connection"></a>

Amazon SageMaker Unified Studio provides an interface for managing and utilizing data connections across various AWS services and external data sources. With Amazon SageMaker Unified Studio, you create, configure, and manage connections to databases, data warehouses, and applications all from a single platform. Amazon SageMaker Unified Studio allows you to explore your connected data sources, preview sample data, and seamlessly use these connections in SQL queries and Spark notebooks without having to switch between different interfaces or manage complex connection details manually.

## Access the data explorer in a project
<a name="access-data-explorer"></a>

1. Open your web browser and navigate to Amazon SageMaker Unified Studio.

1. Enter your corporate credentials (usually integrated with AWS IAM Identity Center).

1. After successful authentication, you'll be directed to the Amazon SageMaker Unified Studio home page. On the home page, you'll see a list of projects you have access to. Select the project you want to work with by clicking on its name.

1. From the dropdown menu, select the **Data** or **Data Management** option. This will open the Data section of the project overview page. In this data explorer, you can see a tree-like structure representing your data sources.

## Create a new connection to add data sources
<a name="create-new-connection"></a>

**To add a new data source**

1. In the data explorer, choose the **+** button to start adding a new data source.

1. In the modal, select **Add connection**. You'll be presented with a gallery of connector options. Select the connector you need. For supported data sources, see []().
**Note**  
The lakehouse architecture currently supports lowercase table, column, and database names. For the best experience, ensure that all database identifiers are in lowercase.

1. Configure your connector details. For example, if you choose a DynamoDB connection (preview), fill in the required fields, which can include:
   + Name: A unique identifier for this connection in Amazon SageMaker Unified Studio.
   + Description (optional): A description of the connection.
**Note**  
Each supported data source can have different parameters for the connection. Contact your administrator if you need them.

To see your DynamoDB tables displayed in lakehouse architecture after you add the connection, your administrator must grant you access through resource policies in the Amazon DynamoDB console.

**To grant access to a DynamoDB table, your administrator can complete the following steps.**

1. Sign in to the AWS Management Console and open the Amazon DynamoDB console at [https://console.aws.amazon.com/dynamodb/](https://console.aws.amazon.com/dynamodb/).

1. On the left navigation of the DynamoDB console, choose **Tables**.

1. From the **Tables** page, choose the table to add access to.

1. On the details page of the selected table, choose **Permissions**.

1. In the **Resource-based policy for table** section, update the policy so that the `Condition` element contains the project role ARN.
**Note**  
You can find the project role ARN under **Project details** on the **Project overview** page in the lakehouse architecture.

   The following is an example policy statement. It allows the IAM role named `datazone_user_role_projectid` to perform the listed actions (`Query`, `Scan`, `DescribeTable`, `PartiQLSelect`) on the specified DynamoDB table. Administrators should decide which actions to allow or deny.

   ```
   {
       "Sid": "Statement1",
       "Effect": "Allow",
       "Principal": "*",
       "Action": [
           "dynamodb:Query",
           "dynamodb:Scan",
           "dynamodb:DescribeTable",
           "dynamodb:PartiQLSelect"
       ],
       "Resource": "arn:aws:dynamodb:region:account:table/table_name",
       "Condition": {
           "ArnEquals": {
               "aws:PrincipalArn": "arn:aws:iam::account:role/datazone_user_role_projectid"
           }
       }
   }
   ```
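The example statement can also be generated programmatically, which helps avoid typos in the ARNs. The following is a minimal Python sketch, not an official tool. The function name `build_table_policy` and the account ID and Region in the placeholder ARNs are hypothetical. The sketch wraps the statement in a complete policy document (`Version` plus a `Statement` list), which is an assumption about how your administrator wants to store it.

```python
import json


def build_table_policy(table_arn: str, project_role_arn: str) -> str:
    """Return a resource-based policy document that grants read access to one role."""
    statement = {
        "Sid": "Statement1",
        "Effect": "Allow",
        "Principal": "*",
        "Action": [
            "dynamodb:Query",
            "dynamodb:Scan",
            "dynamodb:DescribeTable",
            "dynamodb:PartiQLSelect",
        ],
        "Resource": table_arn,
        # The wildcard principal is narrowed to the project role by this condition.
        "Condition": {"ArnEquals": {"aws:PrincipalArn": project_role_arn}},
    }
    policy = {"Version": "2012-10-17", "Statement": [statement]}
    return json.dumps(policy, indent=4)


# Placeholder ARNs (hypothetical values for illustration only).
policy_json = build_table_policy(
    "arn:aws:dynamodb:us-east-1:111122223333:table/table_name",
    "arn:aws:iam::111122223333:role/datazone_user_role_projectid",
)
print(policy_json)
```

The printed JSON can be pasted into the resource-based policy editor after you replace the placeholder ARNs with your own values.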

## Explore a connected data source
<a name="explore-connected-data-source"></a>

After you have connected your data source, you can explore the data source in the data explorer.

1. After your connection is created, return to the data explorer.

1. You should now see your new connection listed in **Lakehouse**.

1. Expand the new connection to view available databases.

1. Expand a database to explore its schema.

1. Select a schema name to view more details about it, such as schema details and a list of tables. You can then examine a table by selecting it.

1. The table view shows tabs for **Columns** and **Sample data**. In the **Columns** view, you can see the list of columns in the table and the data type of each column. In the **Sample data** view, you can see rows of data from the table and use the built-in sorting and filtering options to explore the data.

## Authentication and tagging for creating connections
<a name="data-connection-authentication"></a>

Your administrator must create credentials and configure the secret tags for you before you create a connection.

**Credentials**

When you create a connection to a data source that requires credentials for **Authentication**, contact your administrator, who must create and provide these credentials. There are two types of credentials:
+ User name and password
+ AWS Secrets Manager

**Secret tags**
+ To ensure that the secret can be used only for a particular project, your administrator must tag the secret with the `AmazonDataZoneProject` tag key and set the value to the project ID.
+ To use the secret across multiple projects, your administrator must tag the secret with `for-use-with-all-datazone-projects = true`.
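The two tagging options above can be expressed as a small helper. The following is a minimal Python sketch. The function name `secret_tags` and the sample project ID are hypothetical, and the output uses the `Key`/`Value` list shape that AWS Secrets Manager tagging operations accept. How your administrator applies the tags (console, CLI, or SDK) is outside what this sketch covers.

```python
from typing import Dict, List, Optional


def secret_tags(project_id: Optional[str] = None) -> List[Dict[str, str]]:
    """Build the tag list for a connection secret.

    Pass a project ID to scope the secret to one project.
    Pass nothing to make the secret usable across all projects.
    """
    if project_id is None:
        return [{"Key": "for-use-with-all-datazone-projects", "Value": "true"}]
    return [{"Key": "AmazonDataZoneProject", "Value": project_id}]


# "abc123" is a hypothetical project ID used only for illustration.
print(secret_tags("abc123"))
print(secret_tags())
```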

# Uploading data
<a name="lakehouse-upload-data"></a>

You can upload data to the lakehouse architecture.

**To upload data**

1. In the **Data** section in the middle of the project page, choose **+** at the top. This opens **Add data source** on the right.

1. On **Add data source**, choose **Upload data**. 

1. Choose **Click to upload** or drag and drop a CSV or JSON file. Complete the information in the form.

1. Choose **Upload data**.
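Because the lakehouse architecture currently supports lowercase column names, it can help to normalize the headers of a CSV file before you upload it. The following is a minimal Python sketch that uses only the standard library. The function name `to_csv` and the sample row are hypothetical, and the sketch assumes every row has the same keys in the same order.

```python
import csv
import io
from typing import Dict, List


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize rows to CSV text with lowercase column headers."""
    # Header names come from the first row and are lowercased.
    fieldnames = [name.lower() for name in rows[0]]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fieldnames)
    for row in rows:
        # Assumes each row lists its values in the same order as the first row.
        writer.writerow(list(row.values()))
    return buffer.getvalue()


sample = [{"C_First_Name": "Joyce", "C_Last_Name": "Deaton"}]
csv_text = to_csv(sample)
print(csv_text)
```

You can write `csv_text` to a `.csv` file and then select that file in the upload form.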

# Creating a catalog
<a name="lakehouse-create-catalog"></a>

You can create a catalog for your Redshift Managed Storage (RMS) objects.

**To create a catalog**

1. In the **Data** explorer in the middle of the project page, choose **+** at the top. This opens **Add data source** on the right.

1. On **Add data source**, choose **Create catalog**. Enter a name for your catalog.

1. Choose **Create**. 

# Adding existing databases and catalogs using AWS Lake Formation permissions
<a name="lakehouse-add-catalog"></a>

You can add existing databases and catalogs to the lakehouse architecture.

**To add existing databases and catalogs using AWS Lake Formation permissions**

1. Sign in to the lakehouse architecture by using the link your administrator gave you. If you don't have access to it, contact your administrator.

1. Choose a project to open the project page.

1. On the left navigation, choose **Project overview**. On **Project details**, copy the project role ARN.

1. Open the AWS Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. On the left navigation, from **Data catalog**, choose **Catalogs**.

1. On the **Catalogs** list view, choose a catalog you want to add to lakehouse architecture. From **Actions** on the right, choose **Grant**.

1. On the **Grant data lake permissions** page, choose **IAM users and roles** from **Principals**. Paste the project role ARN that you copied in step 3.

1. On **Catalog permissions**, choose **Super user**. Choose **Grant**.

After you complete all the steps successfully, go back to the project page in the lakehouse architecture. You should see the Lake Formation catalog added to your lakehouse.
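If your administrator prefers to script the grant instead of using the console, the request might look like the following minimal Python sketch. It only builds and validates a request dictionary and does not call AWS. The field names are assumed to mirror the Lake Formation `GrantPermissions` API, and the `SUPER_USER` permission value and the catalog ID format are assumptions to verify against the Lake Formation API reference. The function name `build_grant_request` and the sample values are hypothetical.

```python
from typing import Any, Dict


def build_grant_request(project_role_arn: str, catalog_id: str) -> Dict[str, Any]:
    """Build a request body that grants catalog-level access to the project role."""
    # Basic sanity check so that a user ARN or a truncated value is caught early.
    if not project_role_arn.startswith("arn:aws:iam::") or ":role/" not in project_role_arn:
        raise ValueError(f"not an IAM role ARN: {project_role_arn!r}")
    return {
        "Principal": {"DataLakePrincipalIdentifier": project_role_arn},
        # Assumed resource shape for a catalog; confirm in the API reference.
        "Resource": {"Catalog": {"Id": catalog_id}},
        # Assumed API equivalent of the console's "Super user" option.
        "Permissions": ["SUPER_USER"],
    }


# Hypothetical role ARN and catalog ID for illustration only.
request = build_grant_request(
    "arn:aws:iam::111122223333:role/datazone_user_role_projectid",
    "111122223333:examplecatalog",
)
print(request)
```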

# Creating an AWS Glue database via Amazon SageMaker Unified Studio
<a name="lakehouse-add-new-database"></a>

You can use this procedure to create a new AWS Glue database in your catalog in the lakehouse.

**Note**  
To complete this task, the project where you're creating the database must have been created with a project profile that contains the LakeHouseDatabase blueprint configured with the **On-Demand** mode. For more information, see [Supported blueprints](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/supported-blueprints.html).

1. On the **Data** explorer in the middle of the project page, choose the ellipsis icon to the right of your catalog, and then choose **Create database**.

1. In the **Create database** pop-up window, specify the name for the new database and then choose **Create database**.

# Deleting an AWS Glue database via Amazon SageMaker Unified Studio
<a name="lakehouse-delete-new-database"></a>

You can use this procedure to delete an AWS Glue database in your catalog in the lakehouse.

**Note**  
To complete this task, the project where you're deleting the database must have been created with a project profile that contains the LakeHouseDatabase blueprint configured with the **On-Demand** mode. For more information, see [Supported blueprints](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/supported-blueprints.html).

1. On the **Data** explorer in the middle of the project page, choose the ellipsis icon to the right of the database within your catalog that you want to delete, and then choose **Remove**. 

1. In the **Drop database** pop-up window, confirm the deletion and then choose **Drop database**.

# Amazon S3 tables integration
<a name="lakehouse-s3-tables-integration"></a>

The lakehouse architecture unifies all your data across Amazon S3 data lakes, Amazon Redshift data warehouses, and third-party data sources without having to copy data. Amazon S3 Tables delivers the first cloud object store with built-in Apache Iceberg support. The lakehouse architecture integrates with Amazon S3 Tables so you can access S3 Tables from AWS analytics services, such as Amazon Redshift, Amazon Athena, Amazon EMR, and AWS Glue, or from Apache Iceberg-compatible engines (Apache Spark or PyIceberg).

The lakehouse architecture integration with Amazon S3 Tables helps you secure analytic workflows by joining data from Amazon S3 Tables with sources, such as Amazon Redshift data warehouses, third-party, and federated data sources (Amazon DynamoDB or PostgreSQL). The lakehouse architecture also enables centralized management of fine-grained data access permissions for S3 Tables and other data, and consistently applies them across all engines. To get started, complete the steps in the following sections.

**Prerequisites**: Complete all the steps in [Getting started with the lakehouse architecture of Amazon SageMaker](lakehouse-get-started.md).

**Enable Amazon S3 integration**

1. Navigate to the [Amazon S3 console](https://console.aws.amazon.com/s3). In the left navigation pane, choose **Table buckets**. 

1. Choose **Create table bucket**.

1. On the **Create table bucket** page, enter a **Table bucket name** and select **Enable integration**.

1. Choose **Create table bucket**. 

1. You will see confirmation when Amazon S3 completes integration of your table buckets with the lakehouse architecture.

**Onboard S3 Tables in the lakehouse architecture**

To provide access to S3 tables, complete the following steps:

1. Navigate to the [AWS Lake Formation](http://console.aws.amazon.com/lakeformation) console.

1. In the left navigation pane, choose **Catalogs** and choose **S3tablescatalog**.

1. From **S3tablescatalog**, under **Objects**, choose the name of your newly created **table bucket**.

1. From the **Actions** menu, select **Grant**.

1. On the **Grant permissions** page, under **IAM users and roles**, select your Amazon SageMaker Unified Studio project role. To grant full access, under **Catalog permissions**, select **Super user**, and then choose **Grant**.

**Create an S3 table and add data in the lakehouse architecture**

1. Navigate to Amazon SageMaker Unified Studio, and select the project.

1. From the **Build** menu, select **Query Editor**, and ensure you have **Athena** selected in **Connections**.

1. Create a database using SQL.

   ```
   CREATE DATABASE "s3tablescatalog/<Your Bucket Name>".<YourDBName>;
   ```

1. Create an S3 table using SQL.

   ```
   CREATE TABLE "s3tablescatalog/<Your Bucket Name>".<YourDBName>.<YourTableName> (
     c_salutation string,
     c_login string,
     c_first_name string,
     c_last_name string,
     c_email_address string
   )
   TBLPROPERTIES ('table_type' = 'ICEBERG');
   ```

1. Add data using SQL.

   ```
   INSERT INTO "s3tablescatalog/<Your Bucket Name>".<YourDBName>.<YourTableName>
   VALUES ('Dr.', '1381546', 'Joyce', 'Deaton', 'Joyce.Deaton@qhtrwert.edu');
   ```
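The three-part names in the SQL statements above are easy to mistype, so you might generate them with a small helper. The following is a minimal Python sketch and not part of any AWS SDK. The function name `qualified_name`, the sample bucket, database, and table names, and the exact validation patterns are hypothetical. The lowercase check reflects the guidance that identifiers in the lakehouse architecture should be lowercase.

```python
import re
from typing import Optional

# Illustrative patterns only: lowercase identifiers, and lowercase bucket names with hyphens.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")
_BUCKET = re.compile(r"[a-z0-9][a-z0-9-]*")


def qualified_name(bucket: str, database: str, table: Optional[str] = None) -> str:
    """Return "s3tablescatalog/<bucket>".<database>[.<table>] after validating each part."""
    if not _BUCKET.fullmatch(bucket):
        raise ValueError(f"unexpected table bucket name: {bucket!r}")
    for name in (database, table):
        if name is not None and not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"identifier must be lowercase: {name!r}")
    parts = [f'"s3tablescatalog/{bucket}"', database]
    if table is not None:
        parts.append(table)
    return ".".join(parts)


# Hypothetical names for illustration only.
create_db = f"CREATE DATABASE {qualified_name('my-bucket', 'salesdb')};"
table_ref = qualified_name("my-bucket", "salesdb", "customers")
print(create_db)
print(table_ref)
```

You can paste the generated statements into the query editor with **Athena** selected, as in the steps above.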

You can now use the following integrated analytics services:
+ [Amazon Athena](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-athena.html) - create databases, tables, query and add data in S3 Tables.
+ [Amazon Redshift](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-redshift.html) - query data from S3 Tables.
+ [Amazon EMR](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-emr.html) - create table, namespace, query and add data in S3 Tables.
+ [AWS Glue](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-glue.html) - create table, namespace, query and add data in S3 Tables.
+ [AWS Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/create-s3-tables-catalog.html) - grant fine-grained permissions for S3 table catalogs, databases, tables, columns, and cells.

**Note**  
Access to S3 Tables with the lakehouse architecture is available in the [AWS Regions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-regions-quotas.html) where S3 Tables are available. 