

# 10 – Choose the best-performing file format and partitioning
<a name="design-principle-10"></a>

 **How do you select the best-performing file formats and partitioning?** Selecting the best-performing file format and data partitioning for data-at-rest can have a large impact on the overall analytics workload efficiency. 


|  **ID**  |  **Priority**  |  **Best practice**  | 
| --- | --- | --- | 
|  ☐ BP 10.1   |  Recommended  |  Select format based on data write frequency and patterns for append-only compared to in-place update.  | 
|  ☐ BP 10.2   |  Recommended  |  Choose data formatting based on your data access pattern  | 
|  ☐ BP 10.3   |  Recommended  |  Utilize compression techniques to both decrease storage requirements and enhance I/O efficiency.  | 
|  ☐ BP 10.4   |  Recommended  |  Partition your data to enable efficient data pruning and reduce unnecessary file reads.  | 

 For more details, refer to the following information: 
+  Amazon Redshift Database Developer Guide: [Creating data files for queries in Amazon Redshift Spectrum](https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html) 
+  Amazon EMR Release Guide: [Hudi](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi.html) 
+  AWS Big Data Blog: [Apply record level changes from relational databases to Amazon S3 data lake using Apache Hudi on Amazon EMR and AWS Database Migration Service](https://aws.amazon.com/blogs/big-data/apply-record-level-changes-from-relational-databases-to-amazon-s3-data-lake-using-apache-hudi-on-amazon-emr-and-aws-database-migration-service/) 

# Best practice 10.1 – Select format based on data write frequency and patterns for append-only compared to in-place update
<a name="best-practice-10.1-select-format-based-on-data-write-frequency-and-patterns-for-append-only-vs.-in-place-update."></a>

 Review your data storage write patterns and performance requirements for streaming and batch workloads. Streaming workloads may require you to write smaller files at a higher frequency compared to batch workloads. This enables your streaming applications to reduce latency but can impact read and write performance of the data. 

## Suggestion 10.1.1 – Understand your analytics workload data’s write characteristics
<a name="suggestion-10.1.1---understand-your-analytics-workload-datas-write-characteristics."></a>

 If you store data in Amazon S3, evaluate whether an append-only write pattern meets your needs, or whether you require record-level, in-place updates, which a table format such as Apache Hudi supports. 

 There are also table formats available, such as Apache Hudi, Apache Iceberg and Delta Lake that can, amongst other capabilities, provide transactional semantics over data tables in Amazon S3. These formats can also provide improved query times through the use of additional metadata. For more detail on getting started with these formats, see [Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started](https://aws.amazon.com/blogs/big-data/part-1-getting-started-introducing-native-support-for-apache-hudi-delta-lake-and-apache-iceberg-on-aws-glue-for-apache-spark/). 

## Suggestion 10.1.2 – Avoid querying data stored in many small files
<a name="suggestion-10.1.2"></a>

 Rather than running queries over many small data files, periodically combine the small files into fewer, larger compressed files for analytics. This approach improves data retrieval performance when using analytics services. Keep in mind that in streaming use cases there is a tradeoff between latency and throughput, as time is required to batch records. Producing larger files can be done as a post-processing job rather than at the point of ingestion. 
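As a minimal illustration of the compaction step, the following pure-Python sketch merges many small newline-delimited files into a single gzip-compressed file. The directory layout and the `.jsonl` suffix are illustrative assumptions; at scale, a framework such as Apache Spark or AWS Glue would typically perform this job.

```python
import gzip
from pathlib import Path

def compact_small_files(input_dir: str, output_path: str) -> int:
    """Combine many small newline-delimited files into one larger
    gzip-compressed file, returning the number of source files merged.
    A post-processing sketch, not a production compaction job."""
    small_files = sorted(Path(input_dir).glob("*.jsonl"))
    with gzip.open(output_path, "wt", encoding="utf-8") as out:
        for f in small_files:
            text = f.read_text(encoding="utf-8")
            # Preserve record boundaries across file joins.
            out.write(text if text.endswith("\n") else text + "\n")
    return len(small_files)
```

Running such a job on a schedule (rather than per record) is what preserves ingestion latency while still producing larger files for readers.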

# Best practice 10.2 – Choose data formatting based on your data access pattern
<a name="best-practice-10.2-choose-data-formatting-based-on-your-data-access-pattern."></a>

 Choosing the right data format for your workload is important. Many formats are available, each suited to different access patterns, and selecting among them is a key step in optimizing the performance of your analytics workloads. 

## Suggestion 10.2.1 – Decide the correct data format for your analytics workload
<a name="suggestion-10.2.1-decide-the-correct-data-format-for-your-analytics-workload."></a>

 You can work on unstructured, semi-structured, and structured data formats (CSV, JSON, or columnar formats such as Apache Parquet and Apache ORC) with your data stored in Amazon S3 by using Amazon Athena, which lends itself to querying data as-is without the need for data preparation or ETL processes. 

 You should also consider compression when choosing data formats. Efficient compression can help queries run faster and reduce cost. It can also lead to reductions in the amount of data stored in a storage layer, alongside improved network and I/O throughput. For more information on when to use compression, see 10.3.2. 

 Using splittable formats is also an option. These formats allow individual files to be broken up so that they can be processed in parallel by multiple workers, which, like compression, can reduce query time. Note that some compression codecs, such as gzip, are not splittable when applied to a whole file, while columnar formats such as Apache Parquet and Apache ORC compress data internally in blocks and remain splittable, so verify that your chosen combination of format and codec supports parallel reads. 
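To illustrate why splittability matters, this sketch computes newline-aligned byte ranges over an uncompressed file so that independent workers could each read one range in parallel. A whole-file gzip stream offers no such entry points, because decompression must start at byte 0. The function name and split logic are illustrative, not a real engine's implementation.

```python
import os

def newline_aligned_splits(path: str, num_splits: int) -> list[tuple[int, int]]:
    """Return (start, end) byte ranges that each begin on a record
    boundary, so independent workers can process an uncompressed
    newline-delimited file in parallel."""
    size = os.path.getsize(path)
    targets = [size * i // num_splits for i in range(1, num_splits)]
    boundaries = [0]
    with open(path, "rb") as f:
        for t in targets:
            f.seek(t)
            f.readline()                 # advance to the next record boundary
            boundaries.append(f.tell())
    boundaries.append(size)
    uniq = sorted(set(boundaries))       # drop duplicates from tiny files
    return list(zip(uniq[:-1], uniq[1:]))
```

Each returned range can then be handed to a separate worker, which is exactly what splittable formats enable at the engine level.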

## Suggestion 10.2.2 – Be aware of API-driven data access pattern constraints, such as the amount of data retrieved per API call, which can impact overall performance
<a name="suggestion-10.2.2---be-aware-of-api-driven-data-access-pattern-constraints-such-as-the-amount-of-data-retrieved-per-api-request-which-could-impact-overall-performance-due-to-api-throttling."></a>

 If you call APIs to ingest, transform, or access data, be aware that many APIs cap the amount of data or number of records returned per call, so your solution may need to page through results with subsequent API calls. When a large amount of data is returned, retrieving it this way can take a long time. Most APIs also enforce limits and constraints, such as a maximum number of calls within a given time window, so it is important to plan for throttling and design retry and backoff strategies for these conditions. 
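A generic token-based pagination loop might look like the following sketch. Here `fetch_page` and the response shape (`items`, `next_token`) are hypothetical stand-ins for whatever the real API returns.

```python
from typing import Callable, Iterator, Optional

def paginate(fetch_page: Callable[[Optional[str]], dict]) -> Iterator:
    """Yield all records from a token-paginated API.

    `fetch_page` is a hypothetical callable that takes a continuation
    token (None for the first call) and returns a response shaped like
    {"items": [...], "next_token": str | None}."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page["items"]
        token = page.get("next_token")
        if not token:
            break                      # no further pages
```

In practice, you would wrap `fetch_page` with retries and exponential backoff to handle the throttling limits noted above.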

 Result caching on API sources can help speed up reads if the same or similar data is frequently queried. Using asynchronous methods can help avoid blocking calls in your processing that would otherwise have to wait for synchronous operations to complete. 

## Suggestion 10.2.3 – Use data, results, and query cache to improve performance and reduce reads from the storage tier
<a name="suggestion-10.2.3-use-data-results-and-query-cache-to-improve-performance-and-reduce-reads-from-the-storage-tier."></a>

 Caching services can speed up the responses to common queries and reduce the load on the storage tier. Use Amazon ElastiCache, Amazon DynamoDB Accelerator (DAX), Amazon API Gateway caching, Athena query result reuse, Amazon Redshift Advanced Query Accelerator (AQUA), or other relevant caching services. 
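As one concrete example, Athena's query result reuse can be enabled through the `ResultReuseConfiguration` parameter of the `start_query_execution` API. The sketch below only builds the request parameters (the workgroup, output bucket, and age threshold are illustrative assumptions), leaving the actual boto3 call to the caller.

```python
def athena_query_params(sql: str, workgroup: str, output_s3: str,
                        max_age_minutes: int = 60) -> dict:
    """Build keyword arguments for the boto3 Athena
    `start_query_execution` call with result reuse enabled, so repeated
    identical queries within `max_age_minutes` are served from cached
    results instead of rescanning the storage tier."""
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultConfiguration": {"OutputLocation": output_s3},
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }

# Usage (not executed here):
#   boto3.client("athena").start_query_execution(**athena_query_params(...))
```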

# Best practice 10.3 – Utilize compression techniques to both decrease storage requirements and enhance I/O efficiency
<a name="best-practice-10.3---use-file-compression-to-reduce-number-of-files-and-to-improve-file-io-efficiency."></a>

 Store data in a compressed format to reduce the burden on the underlying storage host and network. For example, for columnar data stored in Amazon S3, use a compatible compression algorithm that supports parallel reads. 

 We recommend that your organization test the performance and storage overhead of both uncompressed and compressed datasets to determine best fit prior to implementing this approach. 

## Suggestion 10.3.1 – Compress data to reduce the transfer time
<a name="suggestion-10.3.1"></a>

 When storage read/write performance becomes a bottleneck, use compression to reduce data transfer time. Consider the tradeoffs between compute time needed to perform compression and decompression versus the storage I/O bottleneck in your estimates of overall improvements in performance efficiency. 
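The tradeoff can be estimated empirically. This sketch times gzip compression of a payload and compares the total time (compress plus transfer of the smaller payload) against sending the data uncompressed at an assumed link bandwidth; the bandwidth figure is an assumption to adjust for your environment.

```python
import gzip
import time

def compression_tradeoff(data: bytes, bandwidth_mbps: float = 100.0) -> dict:
    """Estimate whether gzip compression pays off for a transfer by
    comparing (compress time + compressed transfer time) against the
    uncompressed transfer time at an assumed link bandwidth."""
    start = time.perf_counter()
    compressed = gzip.compress(data)
    compress_s = time.perf_counter() - start
    bytes_per_s = bandwidth_mbps * 1_000_000 / 8
    return {
        "raw_bytes": len(data),
        "compressed_bytes": len(compressed),
        "ratio": len(data) / len(compressed),
        "raw_transfer_s": len(data) / bytes_per_s,
        "compressed_total_s": compress_s + len(compressed) / bytes_per_s,
    }
```

Running this against a representative sample of your own data, rather than synthetic input, gives the most realistic estimate.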

## Suggestion 10.3.2 – Evaluate the available compression options for each resource of the workload
<a name="suggestion-10.3.2-evaluate-the-available-compression-options-for-each-resource-of-the-workload."></a>

 Compressing data can improve performance because fewer bytes are transferred between the disk and compute layers. The trade-off is that compression and decompression require additional compute. You can, however, obtain a net efficiency improvement when the time saved in data transfer equals or outweighs the added compute time. Depending on the data type in use, compressed data also requires considerably less storage, reducing storage costs. 

 

# Best practice 10.4 – Partition your data to enable efficient data pruning and reduce unnecessary file reads
<a name="best-practice-10.4---partition-your-data-to-avoid-unnecessary-file-reads."></a>

 Storing your data in structured partitions allows the compute layer to locate only the portion of the data relevant to the query. Determine the most frequent query parameters and store this data in the appropriate location suited to your data retrieval needs. For example, if an analytics workload regularly generates daily, weekly, and monthly reports, then store your data using partitions with a year/month/day format. 
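For example, a year/month/day layout can follow the Hive-style `key=value` path convention that Athena, AWS Glue, and Spark all understand. The helper below is an illustrative sketch; the base path and table name are assumptions.

```python
from datetime import date
from pathlib import Path

def partition_path(base: str, table: str, d: date) -> Path:
    """Build a Hive-style year/month/day partition path, e.g.
    base/table/year=2024/month=01/day=05. Query engines that understand
    this layout can skip partitions outside the requested date range."""
    return (Path(base) / table
            / f"year={d.year}"
            / f"month={d.month:02d}"
            / f"day={d.day:02d}")
```

Zero-padding the month and day keeps lexicographic ordering of paths consistent with chronological ordering.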

## Suggestion 10.4.1 – Partition data to support the most common query predicates
<a name="suggestion-10.4.1---store-data-thats-partitioned-based-on-columns-with-incremental-processing-constraints."></a>

 When your query uses a particular predicate in a WHERE clause and your data is partitioned on that field, the query engine can prune the data it needs to look at and go directly to the relevant data partitions. This avoids a full table scan, resulting in faster performance and lower query cost. 
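The pruning a query engine performs can be mimicked in a few lines: given Hive-style partition paths and the predicate values, only matching partitions are selected for reading. The path shapes are illustrative.

```python
def prune_partitions(paths: list[str], **predicate: str) -> list[str]:
    """Keep only partition paths whose key=value segments match every
    item of `predicate` -- mimicking how a query engine prunes
    partitions for a predicate like WHERE year = '2024' AND
    month = '01' instead of scanning the full table."""
    def matches(path: str) -> bool:
        parts = dict(seg.split("=", 1)
                     for seg in path.strip("/").split("/") if "=" in seg)
        return all(parts.get(k) == v for k, v in predicate.items())
    return [p for p in paths if matches(p)]
```

The key point is that matching happens on path metadata alone, before any file contents are read.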

## Suggestion 10.4.2 – Store data partitioned based on time attributes with earlier data stored in tiers that are accessed infrequently
<a name="suggestion-10.4.2"></a>

 Use the tiering capabilities of the storage service to put infrequently-accessed data into the tier that is most appropriate for the workload. For example, in an Amazon Redshift data warehouse, data that is accessed infrequently can be stored in Amazon S3. Then you can query it with Amazon Redshift Spectrum, while more frequently-accessed data can be stored in local Amazon Redshift storage. 