

# Modern data architecture
<a name="modern-data-architecture"></a>

 Organizations have been building data lakes to analyze massive amounts of data for deeper insights. To do this, they bring data from multiple silos into their data lake and run analytics and AI/ML directly on it. These organizations also commonly keep data in specialized data stores, such as a NoSQL database, a search service, or a data warehouse, to support different use cases. To analyze all of the data spread across the data lake and these other data stores efficiently, businesses often move data in and out of the data lake and between the data stores. This data movement can become complex and messy as the data in these stores grows. 

 To address this, businesses need a data architecture that supports building scalable, cost-effective data lakes along with simplified governance and data movement between various data stores. We refer to this as a *modern data architecture*: one that integrates a data lake, a data warehouse, and other purpose-built data stores while enabling unified governance and seamless data movement. 

 As shown in the following diagram, with a modern data architecture, organizations can store their data in a data lake and use purpose-built data stores that work with the data lake. This approach allows access to all of the data to make better decisions with agility. 

![Diagram showing a modern data architecture](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/modern-data-architecture.png)

 There are three patterns for data movement, which can be described as follows: 

 **Inside-out data movement:** A subset of data in a data lake is sometimes moved to a purpose-built data store, such as an Amazon OpenSearch Service cluster or an Amazon Neptune cluster. This pattern supports specialized analytics, such as search analytics, building knowledge graphs, or both. For example, enterprises send information from structured sources (such as relational databases), unstructured sources (such as metadata, media, or spreadsheets), and other assets to a data lake. From there, it is moved to Amazon Neptune to build a knowledge graph. We refer to this kind of data movement as inside-out. 

 **Outside-in data movement:** Organizations use data stores that best fit their applications and later move that data into a data lake for analytics. For example, to maintain game state, player data, session history, and leaderboards, a gaming company might choose Amazon DynamoDB as the data store. This data can later be exported to a data lake for additional analytics to improve the gaming experience for their players. We refer to this kind of data movement as outside-in. 

 **Around the perimeter:** In addition to the two preceding patterns, there are scenarios where the data is moved from one specialized data store to another. For example, enterprises might copy customer profile data from their relational database to a NoSQL database to support their reporting dashboards. We refer to this kind of data movement as around the perimeter.
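As a concrete illustration of the outside-in pattern, a DynamoDB export to a data lake stores items in DynamoDB-JSON, where every attribute is wrapped in a type descriptor such as `{"S": ...}` or `{"N": ...}`. The following is a minimal sketch of flattening such items into plain records before loading them into the lake; the leaderboard item layout is a hypothetical example, not a prescribed schema:

```python
import json

def unwrap(attr):
    """Unwrap one DynamoDB-JSON attribute value into a plain Python value."""
    (kind, value), = attr.items()
    if kind == "S":                     # string
        return value
    if kind == "N":                     # numbers are exported as strings
        return float(value) if "." in value else int(value)
    if kind == "BOOL":
        return value
    if kind == "L":                     # list of wrapped values
        return [unwrap(v) for v in value]
    if kind == "M":                     # nested map
        return {k: unwrap(v) for k, v in value.items()}
    raise ValueError(f"unsupported type descriptor: {kind}")

def flatten_item(item):
    """Convert a DynamoDB-JSON item into a flat dict suitable for the lake."""
    return {key: unwrap(attr) for key, attr in item.items()}

# Hypothetical exported item from a gaming leaderboard table.
exported = {
    "player_id": {"S": "p-123"},
    "high_score": {"N": "9450"},
    "sessions": {"L": [{"N": "3"}, {"N": "7"}]},
}
record = flatten_item(exported)
print(json.dumps(record))
```

In practice this transformation is typically handled by a managed service such as AWS Glue rather than hand-written code; the sketch only shows the shape of the work involved.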

# Characteristics
<a name="characteristics-1"></a>

 **Scalable data lake:** A data lake should be able to scale easily to petabytes and exabytes as data grows. Use a scalable, durable data store that provides the fastest performance at the lowest cost, supports multiple ways to bring data in, and has a good partner ecosystem. 

 **Data diversity:** Applications generate data in many formats. A data lake should support diverse data types—structured, semi-structured, or unstructured. 

 **Schema management:** A modern data architecture should support schema-on-read for a data lake, with no strict requirements imposed on source data. The choice of storage structure, schema, ingestion frequency, and data quality should be left to the data producer. A data lake should also be able to incorporate changes to the structure of incoming data, which is referred to as schema evolution. In addition, schema enforcement helps businesses ensure data quality by preventing writes that do not match the schema. 
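The two schema behaviors can be sketched with a toy table writer (the class and its names are illustrative, not an AWS or table-format API): schema enforcement rejects a write whose values conflict with the known column types, while schema evolution admits previously unseen columns.

```python
class LakeTable:
    """Toy table illustrating schema enforcement and schema evolution."""

    def __init__(self):
        self.schema = {}   # column name -> Python type
        self.rows = []

    def write(self, record, allow_evolution=False):
        for col, value in record.items():
            if col not in self.schema:
                if not allow_evolution:
                    raise ValueError(f"unknown column: {col}")
                self.schema[col] = type(value)   # schema evolution: add column
            elif not isinstance(value, self.schema[col]):
                # Schema enforcement: reject writes that do not match.
                raise TypeError(f"{col} expects {self.schema[col].__name__}")
        self.rows.append(record)

table = LakeTable()
table.write({"user_id": 1, "event": "login"}, allow_evolution=True)
table.write({"user_id": 2, "event": "purchase", "amount": 9.99},
            allow_evolution=True)          # new 'amount' column is admitted
try:
    table.write({"user_id": "oops", "event": "login"})  # wrong type: rejected
except TypeError as e:
    print("rejected:", e)
```

Real table formats apply the same idea at file and metadata level, so a bad write never corrupts previously committed data.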

 **Metadata management:** Data should be self-discoverable with the ability to track lineage as data flows through tiers within the data lake. A comprehensive Data Catalog that captures the metadata and provides a queryable interface for all data assets is recommended. 

 **Unified governance:** A modern data architecture should have a robust mechanism for centralized authorization and auditing. Configuring access policies in the data lake and across all the data stores can be overly complex and error-prone. Having a centralized location to define policies and enforce them is critical to a secure modern data architecture. 

 **Transactional semantics:** In a data lake, data is often ingested nearly continuously from multiple sources and is queried concurrently by multiple analytic engines. Having atomic, consistent, isolated, and durable (ACID) transactions is pivotal to keeping data consistent. 

 **Transactional data lake:** Data lakes offer one of the best options for cost, scalability, and flexibility: they store data at low cost and make it available to many types of analytics workloads. However, data lakes are not databases, and object storage does not provide ACID processing semantics, which you may require to effectively optimize and manage data at scale across hundreds or thousands of users and a multitude of different technologies. Open table formats provide additional database-like functionality that simplifies the optimization and management overhead of data lakes, while still supporting storage on cost-effective systems. These features include: 
+  **ACID transactions:** Allowing a write to completely succeed or be rolled back in its entirety 
+  **Record-level operations:** Allowing for single rows to be inserted, updated, or deleted 
+  **Indexes:** Improving performance in addition to data lake techniques like partitioning 
+  **Concurrency control:** Allowing for multiple processes to read and write the same data at the same time 
+  **Schema evolution:** Allowing for columns of a table to be added or modified over the life of a table 
+  **Time travel:** Query data as of a point in time in the past 

 The three most common and prevalent open table formats are Apache Hudi, Apache Iceberg, and Delta Lake. 
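The snapshot mechanism behind the time-travel and atomicity features listed above can be sketched in a few lines. This is a conceptual toy under simplifying assumptions (whole-table copy-on-write, integer timestamps), not how Hudi, Iceberg, or Delta Lake are implemented internally:

```python
class SnapshotTable:
    """Toy copy-on-write table: each commit produces an immutable snapshot."""

    def __init__(self):
        self.snapshots = []          # list of (timestamp, rows) pairs

    def commit(self, rows, ts):
        # A commit is either fully recorded or never visible (atomicity):
        # readers only ever see complete snapshots.
        self.snapshots.append((ts, list(rows)))

    def read(self, as_of=None):
        """Read the latest snapshot, or the state as of a past timestamp."""
        visible = [s for s in self.snapshots if as_of is None or s[0] <= as_of]
        return visible[-1][1] if visible else []

table = SnapshotTable()
table.commit([{"id": 1, "score": 10}], ts=100)
table.commit([{"id": 1, "score": 10}, {"id": 2, "score": 20}], ts=200)

print(table.read())            # latest state: two rows
print(table.read(as_of=150))   # time travel: state as of ts=150, one row
```

Production table formats keep per-file metadata and manifests instead of copying whole tables, but the reader-visible contract is the same: every query sees one consistent snapshot.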

# Reference architecture
<a name="reference-architecture"></a>

![Reference architecture diagram for a modern data architecture](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/modern-data-architecture-reference-architecture.png)


# Configuration notes
<a name="configuration-notes"></a>
+  To organize data for efficient access and easy management: 
  +  The storage layer can store data in different states of consumption readiness, including raw, trusted, conformed, enriched, and modeled. It’s important to segment your data lake into landing, raw, trusted, and curated zones to store data depending on its consumption readiness. Typically, data is ingested and stored as is in the data lake (without having to first define schema) to accelerate ingestion and reduce time needed for preparation before data can be explored. 
  +  Partition data with keys that align to common query criteria. 
  +  Convert data to an open columnar file format and apply compression. This lowers storage usage and increases query performance. 
+  Choose the proper storage tier based on data temperature. Establish a data lifecycle policy to delete old data automatically to meet your data retention requirements. 
+  Decide on a location for data lake ingestion, for example, an S3 bucket. Select a frequency and isolation mechanism that meet your business needs. 
+  Depending on your ingestion frequency and data mutation rate, schedule file compaction to maintain optimal performance. 
+  Use AWS Glue crawlers to discover new datasets, track lineage, and avoid a data swamp. 
+  Manage access control and security using AWS Lake Formation, IAM role setting, AWS KMS, and AWS CloudTrail. 
+  There is no need to move data from the data lake into the data warehouse for the data warehouse to access it. Amazon Redshift Spectrum can directly query the dataset in the data lake. 
+  For more details, refer to the [Derive Insights from AWS Modern Data](https://docs.aws.amazon.com/whitepapers/latest/derive-insights-from-aws-modern-data/derive-insights-from-aws-modern-data.html) whitepaper. 
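The partitioning and zoning guidance in the notes above can be illustrated with a small helper that builds Hive-style object key prefixes, where partition keys align to common query criteria such as event date. The zone and dataset names are hypothetical examples:

```python
from datetime import date

def partition_key(zone, dataset, event_date, fmt="parquet"):
    """Build a Hive-style object key: zone/dataset/year=YYYY/month=MM/day=DD/."""
    return (f"{zone}/{dataset}/"
            f"year={event_date.year}/month={event_date.month:02d}/"
            f"day={event_date.day:02d}/part-0000.{fmt}")

key = partition_key("curated", "sales", date(2024, 5, 3))
print(key)  # curated/sales/year=2024/month=05/day=03/part-0000.parquet
```

With this layout, a query filtered on `year` and `month` lets engines such as Athena or Redshift Spectrum prune entire prefixes instead of scanning the whole dataset.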

   

## User personas
<a name="user-personas"></a>

 To get the full value from your modern data architecture, there are various personas who will access the data and perform data analytics. For example, the chief data officer (CDO) of an organization is responsible for driving digital innovation and transformation across lines of business. This CDO should set a data-driven vision for the organization and be a champion of using data, analytics, and AI/ML to inform business decisions. 

 Table 4: Key personas for a modern data architecture 


|  Personas  |  Responsibility  |  Areas of interest  |  Modern data architecture purpose-built AWS services  | 
| --- |--- |--- |--- |
| Chief data officer (CDO) |  Build a culture of using data to solve problems and accelerate innovation.  |  Data quality, data governance, data and AI strategy, evangelizing the value of data to the business.  |  AWS Lake Formation, Amazon OpenSearch Service  | 
| Data architect |  Architect technical solutions to meet business needs. Focuses on solving complex data challenges to help the CDO deliver on their vision.  |  Data pipelines, data processing, data integration, data governance, and data catalogs.  |  AWS Glue, Amazon EMR, Amazon Redshift, Amazon Athena, Amazon OpenSearch Service  | 
| Data engineer |  Deliver usable, accurate datasets to the organization in a secure and performant manner.  |  Variety of tools to build data pipelines; ease of use, configuration, and maintenance.  |  AWS Glue, Amazon EMR, Amazon Kinesis, Amazon Redshift, Amazon Athena, Amazon OpenSearch Service  | 
| Data security officer |  Ensure that data security, privacy, and governance are strictly defined and adhered to.  |  Keeping information secure; complying with data privacy regulations; protecting personally identifiable information (PII); applying fine-grained access controls and data masking.  |  AWS Lake Formation, AWS Identity and Access Management (IAM)  | 
| Data scientist |  Construct the means for extracting business-focused insights from data quickly so the business can make better decisions.  |  Tools that simplify data manipulation and provide deeper insight than visualization tools; tools that help build the ML pipeline.  |  Amazon SageMaker AI, Amazon Athena, Amazon QuickSight, AWS Glue Studio, AWS Glue DataBrew  | 
| Data analyst |  React to market conditions in real time; must be able to find data and perform analytics quickly and easily.  |  Querying data and performing analysis to create new business insights; producing reports and visualizations that explain the insights.  |  Amazon Athena, Amazon QuickSight, AWS Glue Studio, Amazon Redshift  | 