# 9 – Choose the best-performing storage solution

**How do you select the best-performing storage options for your workload?** An analytics workload's optimal storage solution is influenced by several factors, such as:

+ Compute engine (Amazon EMR, Amazon Redshift, Amazon RDS, and so on)
+ Access patterns (random or sequential)
+ Required throughput
+ Access frequency (online, offline, archival)
+ CRUD (create, read, update, delete) operation requirements
+ Data durability requirements
+ Archival requirements

Choose the storage solution that performs best for your analytics workload's own characteristics.

| **ID** | **Priority** | **Best practice** |
| --- | --- | --- |
| ☐ BP 9.1 | Highly recommended | Identify critical performance criteria for your storage workload. |
| ☐ BP 9.2 | Highly recommended | Identify and evaluate the available storage options for your compute solution. |
| ☐ BP 9.3 | Recommended | Choose the optimal storage based on access patterns, data growth, and the performance requirements. |

For more details, refer to the following information:

+ Amazon Elastic Compute Cloud User Guide for Linux Instances: [Amazon EBS volume types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html)
+ Amazon Redshift Database Developer Guide: [Amazon Redshift best practices for loading data](https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html)
+ Amazon EMR Management Guide: [Instance storage](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html)
+ Amazon Simple Storage Service User Guide: [Best practices design patterns: Optimizing Amazon S3 performance](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html)

# Best practice 9.1 – Identify critical performance criteria for your storage workload

In data analytics, throughput is often the constraining factor that determines whether your workloads run effectively. Throughput is the amount of data that successfully moves through the network, compute, or storage layer in a given period of time. Improving throughput in each of these layers generally results in better query performance.

## Suggestion 9.1.1 – Use performance monitoring tools to determine if the analytics system performance is limited by compute, storage, or networking

Use a metric collection and reporting system, such as Amazon CloudWatch, to analyze the performance characteristics of the analytics system. Compare the measured metrics against the limits given in the system reference documentation, and express each one as a percentage of maximum performance to identify which resource constrains the workload.

# Best practice 9.2 – Identify and evaluate the available storage options for your compute solution

Many AWS data analytics services allow you to use more than one type of storage. For example, Amazon Redshift allows access to data stored in the compute nodes in addition to data stored in Amazon S3.
When researching each data analytics service, evaluate the relevant storage options to determine the most performance-efficient solution that meets your business requirements.

## Suggestion 9.2.1 – Review the available storage options for the analytics services being considered

Each service often has multiple storage options, each with different characteristics and potential performance benefits. Review the available options and determine which best fits your requirements. For example, Amazon EMR provides local storage through HDFS and external storage in Amazon S3 through EMRFS.

For more information, refer to the AWS documentation for your compute solution:

+ Amazon EMR Management Guide: [Work with storage and file systems](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html)
+ Amazon Redshift Cluster Management Guide: [Overview of Amazon Redshift clusters](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#working-with-clusters-overview)
+ Amazon OpenSearch Service Developer Guide: [Managing indices in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managing-indices.html)
+ Amazon Aurora User Guide: [Overview of Aurora storage](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.StorageReliability.html#Aurora.Overview.Storage)

## Suggestion 9.2.2 – Evaluate the performance of the selected storage option

To ensure that the overall analytics system design meets your non-functional requirements, evaluate its performance by running tests that simulate real-world usage in a test environment.
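
As one way to approach such a test, the following minimal Python sketch times a read pass over a storage option and compares the measured throughput with a target. Everything in it is an assumption for illustration: the function names, the 8 MiB stand-in reader, and the 100 MiB/s target are hypothetical, and a real test would replace `fake_reader` with reads against the storage option under evaluation.

```python
import time
from typing import Callable, Iterable

MIB = 1024 * 1024


def throughput_mib_s(total_bytes: int, elapsed_s: float) -> float:
    """Convert a byte count and a duration into MiB per second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return total_bytes / MIB / elapsed_s


def run_read_test(read_chunks: Callable[[], Iterable[bytes]]) -> float:
    """Time one full pass over the chunks and return throughput in MiB/s."""
    start = time.perf_counter()
    total_bytes = sum(len(chunk) for chunk in read_chunks())
    elapsed_s = time.perf_counter() - start
    # Guard against a zero reading on very fast in-memory readers.
    return throughput_mib_s(total_bytes, max(elapsed_s, 1e-9))


def meets_requirement(measured_mib_s: float, required_mib_s: float) -> bool:
    """Check the measured throughput against the non-functional target."""
    return measured_mib_s >= required_mib_s


def fake_reader() -> Iterable[bytes]:
    """Hypothetical stand-in for a real storage read: yields 8 chunks of 1 MiB."""
    for _ in range(8):
        yield bytes(MIB)


if __name__ == "__main__":
    measured = run_read_test(fake_reader)
    required = 100.0  # hypothetical target in MiB/s
    print(f"measured: {measured:.1f} MiB/s, meets target: "
          f"{meets_requirement(measured, required)}")
```

Running the same harness against each candidate storage option, with data volumes and access patterns close to production, gives comparable numbers to set against the requirement.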
# Best practice 9.3 – Choose the optimal storage based on access patterns, data growth, and the performance requirements

Storage options for data analytics can have performance tradeoffs based on access patterns and data size. For example, in Amazon S3, it can be much more efficient to retrieve a smaller number of larger objects than a larger number of smaller objects. Evaluate your workload needs and usage patterns to determine whether changing the method or location of storing your data can improve the overall efficiency of your solution.

## Suggestion 9.3.1 – Identify available solution options for the performance improvement

When data I/O is limiting performance and business requirements are not being met, improve I/O through the options available within that service. For example, for Amazon EBS gp3 volumes, increase the provisioned IOPS or throughput; for Amazon Redshift, increase the number of nodes.
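
For the gp3 case, a change of this kind could be scripted roughly as follows. This is a hedged sketch, not a recommended configuration: the volume ID, the IOPS value, and the throughput value are hypothetical, and the helper names are invented for illustration. The request keys (`VolumeId`, `VolumeType`, `Iops`, `Throughput`) are parameters of the EC2 `ModifyVolume` operation, which boto3 exposes as `modify_volume`. Check the current gp3 limits in the Amazon EBS documentation before choosing values.

```python
from typing import Any, Dict


def build_gp3_modify_request(
    volume_id: str, iops: int, throughput_mib_s: int
) -> Dict[str, Any]:
    """Build keyword arguments for an EC2 ModifyVolume call on a gp3 volume.

    Only basic sanity checks are done here; service-side limits on IOPS and
    throughput are not validated and must be checked against the EBS docs.
    """
    if iops <= 0 or throughput_mib_s <= 0:
        raise ValueError("iops and throughput_mib_s must be positive")
    return {
        "VolumeId": volume_id,
        "VolumeType": "gp3",
        "Iops": iops,                    # provisioned IOPS
        "Throughput": throughput_mib_s,  # provisioned throughput in MiB/s
    }


def apply_request(request: Dict[str, Any]) -> Dict[str, Any]:
    """Send the request with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    ec2 = boto3.client("ec2")
    return ec2.modify_volume(**request)


if __name__ == "__main__":
    # Hypothetical volume ID and illustrative values only.
    request = build_gp3_modify_request("vol-0123456789abcdef0", 6000, 250)
    print(request)
    # apply_request(request)  # uncomment to modify a real volume
```

Keeping the request builder separate from the API call lets you review or log the intended change before applying it, and rerun the storage performance tests afterward to confirm that the change removed the I/O constraint.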