

# 8 – Choose the best-performing compute solution
<a name="design-principle-8"></a>

## How do you select the best-performing options for your analytics workload?
<a name="how-do-you-select-the-best-performing-options-for-your-analytics-workload"></a>

 Best-performing means different things to different stakeholders, so gathering input from all stakeholders during the decision process is key. Define performance and cost goals by balancing business and application requirements. Then evaluate the overall efficiency of the compute solution against those goals using metrics emitted from the solution. 


|  **ID**  |  **Priority**  |  **Best practice**  | 
| --- | --- | --- | 
|  ☐ BP 8.1   | Recommended  | Identify analytics solutions that best suit your technical challenges.  | 
|  ☐ BP 8.2   | Recommended  | Provision compute resources to the location of the data storage.  | 
|  ☐ BP 8.3   | Recommended  | Define and measure the computing performance metrics.  | 
| ☐ BP 8.4  | Recommended  | Continually identify under-performing components and fine-tune the infrastructure or application logic.  | 

 For more details, refer to the following information: 
+  AWS Whitepaper – Overview of Amazon Web Services: [Analytics](https://docs.aws.amazon.com/whitepapers/latest/aws-overview/analytics.html) 
+  AWS Big Data Blog: [Building high-quality benchmark tests for Amazon Redshift using Apache JMeter](https://aws.amazon.com/blogs/big-data/building-high-quality-benchmark-tests-for-amazon-redshift-using-apache-jmeter/) 
+  AWS Big Data Blog: [Top 10 performance tuning techniques for Amazon Redshift](https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/) 

 

# Best practice 8.1 – Identify analytics solutions that best suit your technical challenges
<a name="best-practice-8.1-identify-analytics-solutions-that-best-suit-your-technical-challenges."></a>

 AWS has multiple analytics processing services that are built for specific purposes. These include Amazon Redshift for data warehousing, Amazon Kinesis for streaming data, and Amazon QuickSight for data visualization. Your organization should consider each step of the data analytics process as an opportunity to identify the right tool for the job. 

## Suggestion 8.1.1 – Identify the requirements based on the collected business metrics
<a name="suggestion-8.1.1-identify-the-requirements-based-on-the-collected-business-metrics."></a>

 Applications and services are designed to overcome specific challenges. It’s essential that your organization identifies the right tool for the right job to meet your business and technical requirements. Choosing inappropriate technology can introduce performance issues, especially when processing data at scale. 

 For more details, refer to the following information: 
+  AWS Right Tool for the Job: [Databases on AWS: The Right Tool for the Right Job](https://www.youtube.com/watch?v=WE8N5BU5MeI) 
+  AWS Right Tool for the Job: [How to Choose the Right Database](https://aws.amazon.com/startups/start-building/how-to-choose-a-database/) 

# Best practice 8.2 – Provision the compute resources to the location of the data storage
<a name="best-practice-8.2---provision-the-compute-resources-to-the-location-of-the-data-storage."></a>

 Data analytics workloads require moving data through a pipeline, whether ingesting data, processing intermediate results, or producing curated datasets. It is often more efficient to place data processing services near where the data is stored than to copy or stream large amounts of data to the processing location. For example, if an Amazon Redshift cluster frequently ingests data from a data lake, ensure that the Amazon Redshift cluster is in the same Region as your data lake S3 buckets. 
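As a minimal sketch of this check, the logic below decides whether a Redshift cluster and an S3 bucket share a Region. In practice the inputs would come from boto3 (`s3.get_bucket_location(...)["LocationConstraint"]` and the Redshift client's configured Region); the bucket and cluster names are hypothetical, and the code keeps the comparison logic self-contained.

```python
# Sketch: decide whether a Redshift cluster and an S3 data lake bucket are
# co-located. The Region values would come from boto3 in a real setup.

def normalize_bucket_region(location_constraint):
    """S3 GetBucketLocation reports us-east-1 as an empty/None constraint
    and eu-west-1 by its legacy name 'EU'; normalize both to Region codes."""
    if not location_constraint:
        return "us-east-1"
    if location_constraint == "EU":
        return "eu-west-1"
    return location_constraint

def is_co_located(cluster_region, bucket_location_constraint):
    return cluster_region == normalize_bucket_region(bucket_location_constraint)

print(is_co_located("us-east-1", None))        # bucket in us-east-1
print(is_co_located("us-west-2", "eu-west-1"))  # cross-Region access
```

Running a check like this before wiring pipeline stages together makes cross-Region data movement an explicit decision rather than an accident.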

 This extends to considering where your compute and storage are located at the Availability Zone level. Co-locating in the same Availability Zone allows faster, lower-latency access. It is still important, however, to replicate data across zones when required. 

## Suggestion 8.2.1 – Migrate or copy primary data stores from on-premises environments to AWS so that cloud compute and storage are closely located
<a name="suggestion-8.2.1---migrate-or-copy-primary-data-stores-from-on-premises-environments-to-aws-so-that-cloud-compute-and-storage-are-closely-located."></a>

 Avoid repeatedly transferring the same datasets from on-premises storage to the cloud. Instead, maintain a copy of your data near the analytics platform to avoid data transfer latency and improve the overall performance of the analytics solution. For optimal performance, keep your data and analytics systems in the same AWS Region. If they are in separate Regions, relocate one of them. 

## Suggestion 8.2.2 – Consider where your analytics resources are placed
<a name="suggestion-8.2.2-consider-where-your-analytics-resources-are-placed."></a>

 For optimal performance, your organization should align the location of the data with the location of the resources that process it. Where possible, consider consolidating all data analytics processing in a single primary Region, as this reduces data transfer overhead. 

## Suggestion 8.2.3 – Consider the use of provisioned compared to serverless offerings to match your workload pattern
<a name="suggestion-8.2.3"></a>

 When considering services for ingesting, transforming, and analyzing your data, there is often a choice between provisioned and serverless solutions. Each has trade-offs and potential advantages, but from a performance perspective, serverless offerings can be beneficial when your workloads are spiky or unpredictable, whereas provisioned deployments may offer advantages when your workloads are stable and predictable. 

# Best practice 8.3 – Define and measure the computing performance metrics
<a name="best-practice-8.3---define-and-measure-the-computing-performance-metrics."></a>

 Define how you will measure the performance of the analytics solution at each step in the process. For example, if the computing solution is a transient Amazon EMR cluster, you can define performance as the total job runtime from cluster launch, through job processing, to cluster shutdown. As another example, if the computing solution is an Amazon Redshift cluster shared by a business unit, you can define performance as the runtime duration of each SQL query. 
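For the transient EMR example, the measurement reduces to the gap between the cluster's creation and termination timestamps. A minimal sketch, assuming the timestamps come from the cluster status timeline that `emr.describe_cluster(...)["Cluster"]["Status"]["Timeline"]` returns (hard-coded here for illustration):

```python
# Sketch: end-to-end runtime of a transient EMR cluster, measured from
# launch to shutdown, per the performance definition above.
from datetime import datetime, timezone

def cluster_runtime_seconds(timeline):
    """Runtime from cluster creation to termination, in seconds."""
    return (timeline["EndDateTime"] - timeline["CreationDateTime"]).total_seconds()

# Hypothetical timeline for a job cluster that ran for 42 minutes:
timeline = {
    "CreationDateTime": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
    "EndDateTime": datetime(2024, 5, 1, 12, 42, tzinfo=timezone.utc),
}
print(cluster_runtime_seconds(timeline))  # 2520.0
```

Emitting this value per job run gives a single trendable number for each pipeline stage.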

## Suggestion 8.3.1 – Define performance efficiency metrics
<a name="suggestion-8.3.1---define-performance-efficiency-metrics."></a>

 Collect and use metrics to scale resources to meet business requirements. These metrics also let your team track unexpected spikes and target future improvements. 
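As a minimal sketch of spike tracking, the rule below flags metric samples (for example, query runtimes in seconds) that sit well above the sample mean. The two-standard-deviation threshold is an illustrative assumption; in practice you would tune the rule, or use a managed alarm, against your own metric history.

```python
# Sketch: flag unexpected spikes in a collected performance metric using
# a simple mean + 2*stddev rule (threshold choice is an assumption).
from statistics import mean, stdev

def find_spikes(samples, n_sigma=2):
    mu, sigma = mean(samples), stdev(samples)
    threshold = mu + n_sigma * sigma
    return [s for s in samples if s > threshold]

runtimes = [12, 11, 13, 12, 14, 11, 95]  # one obvious outlier
print(find_spikes(runtimes))  # [95]
```

Flagged samples become candidates for the fine-tuning work described in the next suggestion.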

## Suggestion 8.3.2 – Continually identify under-performing components and fine-tune the infrastructure or application logic
<a name="suggestion-8.3.2"></a>

 After you have defined the performance measurement, you should identify which infrastructure components or jobs are running below the performance criteria. Performance fine-tuning varies for each AWS service, but generally, optimizing queries or workloads can enhance performance without necessitating infrastructure modifications. For example, if the component is an Amazon EMR cluster running a Spark application, you could explore tuning your Spark configuration. If after fine-tuning you still need more performance, you can change to a larger cluster instance type, or increase the number of cluster nodes. 
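A common starting point when tuning Spark on EMR is executor sizing: reserve a core and some memory per node for the OS and YARN, cap cores per executor, and leave headroom for memory overhead. The arithmetic below is a sketch of that widely used heuristic; the specific numbers (5 cores per executor, 10% overhead) are assumptions to tune for your workload, not EMR defaults.

```python
# Sketch: rough executor sizing for a Spark-on-EMR node. Heuristic values
# (1 core/1 GB reserved per node, 5 cores per executor, 10% overhead)
# are illustrative assumptions.

def executor_plan(node_cores, node_mem_gb, cores_per_executor=5,
                  overhead_fraction=0.10):
    executors_per_node = max(1, (node_cores - 1) // cores_per_executor)
    mem_per_executor = (node_mem_gb - 1) / executors_per_node
    heap_gb = int(mem_per_executor * (1 - overhead_fraction))
    return executors_per_node, heap_gb

# For a 16-core, 64 GB node: 3 executors with ~18 GB heap each.
print(executor_plan(16, 64))  # (3, 18)
```

The resulting values would feed settings such as `spark.executor.cores` and `spark.executor.memory`; measure the job against your performance criteria before and after each change.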

 For an Amazon Redshift cluster, you can fine-tune the SQL queries that are running below the performance criteria and, if required, increase the number of cluster nodes to increase parallel computing capacity. 
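Finding those queries is a matter of filtering recorded runtimes against the agreed criteria. In practice the rows would come from Redshift system tables such as STL_QUERY (query text and start/end times); the sketch below hard-codes a few hypothetical rows and a hypothetical 30-second criterion to keep the logic self-contained.

```python
# Sketch: surface Redshift queries running below the performance criteria,
# worst offender first. Rows and the 30 s SLA are hypothetical examples.

SLA_SECONDS = 30

queries = [
    {"query_id": 101, "elapsed_s": 4.2},
    {"query_id": 102, "elapsed_s": 87.5},
    {"query_id": 103, "elapsed_s": 12.0},
]

def slow_queries(rows, sla_seconds=SLA_SECONDS):
    """Return queries exceeding the runtime criteria, slowest first."""
    offenders = [r for r in rows if r["elapsed_s"] > sla_seconds]
    return sorted(offenders, key=lambda r: r["elapsed_s"], reverse=True)

print([q["query_id"] for q in slow_queries(queries)])  # [102]
```

The resulting list is the fine-tuning backlog: rewrite or re-distribute those queries first, and only scale the cluster if query-level tuning falls short.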