

# What is the lakehouse architecture of Amazon SageMaker?
<a name="what-is-smlh"></a>

The lakehouse architecture of Amazon SageMaker unifies data across Amazon S3 data lakes and Amazon Redshift data warehouses so you can work with your data in one place. You can bring data from operational databases and business applications into your lakehouse in near real time through zero-ETL integrations. You can also run federated queries on data stored across multiple external data sources to access and query your data in place. The lakehouse architecture is compatible with the Apache Iceberg open standard, giving you the flexibility to use your preferred analytics engine. Secure your data in the lakehouse architecture by defining fine-grained permissions that are enforced across all analytics and machine learning (ML) tools and engines.

The lakehouse architecture works by creating a single catalog where you can discover and query all your data. When you run a query, AWS Lake Formation checks your permissions while the query engine processes data directly from its original storage location, whether that's Amazon S3 or Amazon Redshift.
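As a concrete illustration of that flow, the following sketch builds an Amazon Athena `StartQueryExecution` request with boto3. The catalog, database, table, and bucket names here are hypothetical placeholders; when the query runs, Lake Formation evaluates your permissions server-side before the engine reads data from its original storage location.

```python
import json

# Hypothetical names -- replace with your own catalog, database, and table.
request = {
    "QueryString": 'SELECT * FROM "my_lakehouse_catalog"."sales_db"."orders" LIMIT 10',
    "QueryExecutionContext": {
        "Catalog": "my_lakehouse_catalog",  # the lakehouse catalog to query
        "Database": "sales_db",
    },
    "ResultConfiguration": {
        "OutputLocation": "s3://my-athena-results-bucket/"  # hypothetical bucket
    },
}

# With AWS credentials configured, you would submit the query like this:
# import boto3
# athena = boto3.client("athena")
# response = athena.start_query_execution(**request)
# Lake Formation checks your fine-grained permissions before the engine
# processes data from Amazon S3 or Redshift Managed Storage.
print(json.dumps(request["QueryExecutionContext"]))
```

The key point is that the query addresses the table through the unified catalog; where the bytes actually live is resolved by the engine, not by the caller.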

The lakehouse architecture uses the [Apache Iceberg](https://iceberg.apache.org/) table format for enhanced big data storage and analysis across multiple analytics engines. The lakehouse architecture introduces an Apache Iceberg REST API interface as part of the AWS Glue Data Catalog to support Iceberg-compatible analytics query engines, both AWS and non-AWS. You can access both Amazon S3 data lakes (including Amazon S3 Tables) and Amazon Redshift warehouse tables as Iceberg tables using the supported integrated engines, such as Amazon Athena and Amazon Redshift Spectrum.

## What is a data lakehouse?
<a name="lakehouse-intro"></a>

A data lakehouse is an architecture that unifies the scalability and cost-effectiveness of data lakes with the performance and reliability characteristics of data warehouses. This approach eliminates the traditional trade-offs between storing diverse data types and maintaining query performance for analytical workloads.

The lakehouse architecture provides the following key benefits:
+ **Transactional consistency** – ACID compliance ensures reliable concurrent operations
+ **Schema management** – Flexible schema evolution without breaking existing queries
+ **Compute-storage separation** – Independent scaling of processing and storage resources
+ **Open standards** – Compatibility with Apache Iceberg open standard
+ **Single source of truth** – Eliminates data silos and redundant storage costs
+ **Real-time and batch processing** – Supports both streaming and historical analytics
+ **Direct file access** – Enables both SQL queries and programmatic data access
+ **Unified governance** – Consistent security and compliance across all data types

# Key components of the lakehouse architecture of Amazon SageMaker
<a name="lakehouse-components"></a>

The lakehouse architecture has the following key components, in addition to AWS Glue Data Catalog and AWS Lake Formation.

**Storage**  
You can read and write data in Amazon S3 or Redshift Managed Storage (RMS), depending on the storage type you choose for your lakehouse data.

**Catalog**  
A catalog is a logical container that organizes objects from a data store, such as schemas, tables, views, and materialized views (for example, from Amazon Redshift). You can create nested catalogs to mirror the hierarchical structure of your data sources within the lakehouse architecture.  
There are two types of catalogs in the lakehouse architecture: federated catalogs and managed catalogs. A federated catalog mounts an existing data source that you add to the lakehouse, bringing in existing data from sources such as Amazon Redshift, Amazon DynamoDB, and Snowflake. A managed catalog is a new catalog that you create in the lakehouse; it manages data using RMS or Amazon S3, as shown in the following diagram.  

![Catalog type in the lakehouse architecture](http://docs.aws.amazon.com/sagemaker-lakehouse-architecture/latest/userguide/images/lakehouse/catalog-type.png)


**Database**  
Databases organize metadata tables in a catalog in the lakehouse architecture.

**Table/View**  
Tables and views are database objects that define how to access and represent the underlying data. They specify details such as schema, partitions, storage location, storage format, and the SQL query required to access the data.  
The following is a diagram of how catalogs, databases, and tables/views work in the lakehouse architecture.  

![How catalogs, databases, tables/views work in the lakehouse architecture](http://docs.aws.amazon.com/sagemaker-lakehouse-architecture/latest/userguide/images/lakehouse/catalog-database.png)
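To make the hierarchy concrete, here is a minimal sketch that models how a catalog contains databases, which in turn contain tables and views, and how a fully qualified identifier resolves through that hierarchy. All object names are hypothetical examples, not real lakehouse objects.

```python
# Minimal model of the catalog -> database -> table/view hierarchy.
# All names below are hypothetical examples.
lakehouse = {
    "rms_managed_catalog": {              # a managed catalog backed by RMS
        "sales_db": {                     # a database inside the catalog
            "orders": {"type": "table", "format": "iceberg"},
            "daily_totals": {"type": "view"},
        }
    },
    "redshift_federated_catalog": {       # a federated catalog mounting Redshift
        "public": {
            "customers": {"type": "table", "format": "iceberg"},
        }
    },
}

def resolve(identifier: str) -> dict:
    """Resolve a 'catalog.database.object' identifier to its metadata."""
    catalog, database, obj = identifier.split(".")
    return lakehouse[catalog][database][obj]

print(resolve("rms_managed_catalog.sales_db.orders"))
# prints {'type': 'table', 'format': 'iceberg'}
```

An engine follows the same three-level path when it resolves a query: first the catalog, then the database, then the table or view definition that tells it how to reach the underlying data.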


# How the lakehouse architecture of Amazon SageMaker works
<a name="lakehouse-how"></a>

The lakehouse architecture organizes data from various sources into catalogs. Each catalog represents data from existing sources like Amazon Redshift data warehouses, Amazon S3 data lakes, databases, or business applications. You can also create new catalogs in the lakehouse to store data in S3 or Redshift Managed Storage (RMS). Additionally, these catalogs are mounted as databases in Amazon Redshift, so you can connect and analyze your lakehouse data using SQL tools.
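As an illustration of the last point, once a lakehouse catalog is mounted as a database in Amazon Redshift, SQL tools can address its objects with standard three-part `database.schema.table` notation. The following sketch simply builds such a query string; the mounted database, schema, and table names are hypothetical.

```python
def three_part_query(mounted_db: str, schema: str, table: str, limit: int = 10) -> str:
    """Build a three-part-notation SQL query for a catalog mounted in Redshift."""
    return f'SELECT * FROM "{mounted_db}"."{schema}"."{table}" LIMIT {limit};'

# Hypothetical mounted catalog database, schema, and table names:
sql = three_part_query("my_lakehouse_catalog", "sales_db", "orders")
print(sql)
# prints SELECT * FROM "my_lakehouse_catalog"."sales_db"."orders" LIMIT 10;
```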

You can access the data as Apache Iceberg tables and query it using analytics engines that are integrated with the lakehouse architecture. These include Amazon Athena, Amazon Redshift Spectrum, Spark in Amazon EMR and AWS Glue 5.0 ETL, Amazon SageMaker Unified Studio, and other Iceberg-compatible engines.

The lakehouse architecture is built on AWS Glue Data Catalog and AWS Lake Formation in your AWS account. With the lakehouse architecture, you can access and query your existing data in Amazon Redshift data warehouses and store new data in RMS from any Apache Iceberg compatible engine.

The following diagram shows how the lakehouse architecture works.

![The lakehouse architecture](http://docs.aws.amazon.com/sagemaker-lakehouse-architecture/latest/userguide/images/lakehouse/lakehouse-architecture.png)


# Data connections in the lakehouse architecture of Amazon SageMaker
<a name="lakehouse-data-connection"></a>

The lakehouse architecture provides a unified approach to managing data connections across AWS services and enterprise applications. These connections provide a consistent experience for creating, testing, and exploring data sources, regardless of the underlying data platform.

## Capabilities
<a name="lakehouse-data-connection-capabilities"></a>

With the lakehouse architecture connections, you can do the following:
+ Create connections to a variety of data sources, including business applications and external data sources
+ Manage data connections in a single place
+ Test the connectivity of your data sources to ensure they are working as expected
+ Browse the metadata and preview the data from your connected sources
+ Reuse the same connection across different AWS services such as AWS Glue, Amazon Athena, and Amazon SageMaker AI
+ Manage credentials using AWS Secrets Manager
+ Authenticate using methods such as basic authentication, OAuth2, and IAM

## Supported data sources
<a name="lakehouse-data-connection-supported"></a>

The lakehouse architecture connections support several popular data sources, including the following:



| Data Source | Type | 
| --- | --- | 
| Google BigQuery | Database | 
| Amazon DocumentDB | Database | 
| Amazon DynamoDB | Database | 
| Amazon Redshift | Database | 
| MySQL | Database | 
| PostgreSQL | Database | 
| SQL Server | Database | 
| Snowflake | Database | 
| Oracle | Database | 
| Amazon Aurora MySQL | Database | 
| Amazon Aurora PostgreSQL | Database | 
| Microsoft Azure SQL | Database | 

**Note**  
The lakehouse architecture currently supports lowercase table, column, and database names. For an optimal experience in the lakehouse architecture, ensure that all database identifiers are lowercase.
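Because of this constraint, a simple precaution when generating table or database names programmatically is to normalize them before use, as in this minimal sketch:

```python
def normalize_identifier(name: str) -> str:
    """Lowercase an identifier so it conforms to the lakehouse's
    current lowercase-only requirement for database, table, and
    column names."""
    return name.lower()

print(normalize_identifier("SalesDB"))  # prints "salesdb"
```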

## Use the lakehouse architecture connections
<a name="using-data-connections"></a>

After you've added data sources to the lakehouse architecture, you can use them in various AWS services:
+ Amazon SageMaker Unified Studio – Browse metadata, preview sample data, and run SQL queries against the connected data.
+ AWS Glue – Use the connection for ETL jobs and crawlers.
+ Amazon Athena – Query data directly using Athena's federated query capabilities. For more information, see [Register federated catalogs in Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/gdc-register-federated.html).
+ Amazon SageMaker AI – Access data for building machine learning models.

## Understanding resources for the lakehouse architecture of Amazon SageMaker
<a name="understanding-created-resources"></a>

When you create a connection in Amazon SageMaker Unified Studio, several resources are created in your AWS account(s) behind the scenes. These resources can include:
+ AWS Glue connection – A connection object is created in AWS Glue. This stores the core connection information and is used by various AWS services, including crawlers and ETL jobs.
+ Athena data catalog – For connections that will be used with Athena, an Athena data catalog is created. This allows Athena to query the external data source.
+ AWS Glue Data Catalog entries – Databases, tables, and schemas from your external data source are registered in the Data Catalog. This enables AWS services to understand the structure of your external data.
+ Lambda function (for Athena Federated Query) – For some data sources, a Lambda function is created to facilitate federated queries. This function acts as a bridge between Athena and the external data source.

To view these resources, access the respective AWS service consoles (AWS Glue, Athena, IAM, etc.) in the AWS account associated with your Amazon SageMaker Unified Studio project.

In these consoles, look for resources with names that include your Amazon SageMaker Unified Studio project ID or connection name.

For more information about how to create a data connection and explore a connected data source, see [Adding data sources in lakehouse architecture](lakehouse-add-data.md).

# Support for the Apache Iceberg open standard in the lakehouse architecture of Amazon SageMaker
<a name="lakehouse-iceberg"></a>

The lakehouse architecture provides support for [Apache Iceberg](https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/introduction.html), enabling organizations to unify data across Amazon S3 data lakes and Amazon Redshift data warehouses while building powerful analytics and AI/ML applications on a unified data layer. 

With the lakehouse architecture, you gain the flexibility to access and query your data in place using all Apache Iceberg compatible tools and engines, including open-source Apache Spark. This integration uses the AWS Glue Iceberg REST Catalog, which provides a standardized REST API interface for managing Iceberg table metadata and enables seamless connectivity with third-party engines. For more information, see [how to use the AWS Glue Iceberg REST Catalog for accessing Iceberg tables in Amazon S3](https://docs.aws.amazon.com/glue/latest/dg/connect-glu-iceberg-rest.html).
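As a sketch of what connecting an Iceberg-compatible client to the Glue Iceberg REST Catalog can look like, the following builds a REST catalog configuration in the style used by pyiceberg. The account ID and region are placeholders, and the exact endpoint and property names should be verified against your client's documentation.

```python
# Hypothetical account ID and region -- replace with your own.
region = "us-east-1"
account_id = "111122223333"

# REST catalog configuration (pyiceberg-style property names, assumed here):
catalog_properties = {
    "type": "rest",
    "uri": f"https://glue.{region}.amazonaws.com/iceberg",  # Glue Iceberg REST endpoint
    "warehouse": account_id,
    "rest.sigv4-enabled": "true",       # sign requests with AWS SigV4
    "rest.signing-name": "glue",
    "rest.signing-region": region,
}

# With pyiceberg installed and AWS credentials configured, you could load
# the catalog and a table like this (table name is hypothetical):
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("glue_rest", **catalog_properties)
# table = catalog.load_table("sales_db.orders")
```

Because the interface is the standard Iceberg REST specification, the same endpoint works for both AWS and non-AWS engines that speak that protocol.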

Through fine-grained permissions enforced across all analytics and ML tools, the lakehouse architecture ensures secure data access while supporting advanced Iceberg features like ACID transactions, schema evolution, time travel queries, and efficient row-level operations—all essential capabilities for modern data-driven organizations seeking to process and analyze vast amounts of information efficiently. 

The lakehouse architecture also supports multiple table optimization options with the AWS Glue Data Catalog to enhance the management and performance of the Apache Iceberg tables that AWS analytical engines and ETL jobs use. These optimizers provide efficient storage utilization, improved query performance, and effective data management. For more information, see [Optimizing Iceberg tables](https://docs.aws.amazon.com/glue/latest/dg/table-optimizers.html).
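For example, enabling automatic compaction on an Iceberg table is a single API call. The following sketch builds a request for the AWS Glue `CreateTableOptimizer` API; the account ID, database, table, and role names are hypothetical placeholders.

```python
# Hypothetical request to enable compaction on an Iceberg table.
optimizer_request = {
    "CatalogId": "111122223333",   # hypothetical AWS account ID
    "DatabaseName": "sales_db",
    "TableName": "orders",
    "Type": "compaction",          # Glue also offers retention and
                                   # orphan-file-deletion optimizers
    "TableOptimizerConfiguration": {
        "roleArn": "arn:aws:iam::111122223333:role/GlueOptimizerRole",
        "enabled": True,
    },
}

# With AWS credentials configured:
# import boto3
# glue = boto3.client("glue")
# glue.create_table_optimizer(**optimizer_request)
```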

With the lakehouse architecture, you can calculate and update the number of distinct values (NDVs) for each column in Iceberg tables with the AWS Glue Data Catalog. These statistics facilitate better query optimization, data management, and performance efficiency for data engineers and scientists working with large-scale datasets. For more information, see [Optimizing query performance for Iceberg tables](https://docs.aws.amazon.com/glue/latest/dg/iceberg-column-statistics.html).
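As a sketch, column statistics (including NDVs) can be computed by starting a statistics task via the AWS Glue `StartColumnStatisticsTaskRun` API. The database, table, role, and column names below are hypothetical placeholders.

```python
# Hypothetical request to compute column statistics for an Iceberg table.
stats_request = {
    "DatabaseName": "sales_db",
    "TableName": "orders",
    "Role": "arn:aws:iam::111122223333:role/GlueStatsRole",
    "ColumnNameList": ["customer_id", "order_date"],  # omit to cover all columns
}

# With AWS credentials configured:
# import boto3
# glue = boto3.client("glue")
# glue.start_column_statistics_task_run(**stats_request)
# Query planners that read the Data Catalog can then use the NDV
# estimates to choose better join orders and execution plans.
```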