Guidance for Scale-Out Computing on AWS

Overview

This Guidance demonstrates how teams of engineers, scientists, and researchers can use a cloud environment to host the licensed engineering tools required for comprehensive product development workloads. It shows how, in a matter of hours, engineering application teams can deploy scalable engineering collaboration chambers customized to meet organizational security requirements for joint development with trusted suppliers. With this Guidance, engineers can interact with a familiar catalog of tools, integrated into a web portal.

How it works

This architecture diagram shows how to accelerate the product development process.

Architecture diagram

Step 1
Elastic Load Balancing helps ensure accessibility across Availability Zones. It can be deployed in public subnets (by default) or private subnets.
Step 2
An Amazon Elastic Compute Cloud (Amazon EC2) controller instance runs a high-performance computing (HPC) workload manager (OpenPBS), which dynamically provisions the AWS resources required for jobs submitted by users. Amazon EC2 Auto Scaling automatically provisions the resources necessary to run cluster user tasks, such as scale-out compute jobs.
Step 3
The controller instance hosts a web interface that allows users and administrators to interact with the environment.
Step 4
Amazon ElastiCache helps optimize overall performance for the web user interface (UI) and orchestration tools by setting up in-memory cache.
Step 5
The entire configuration is stored in Parameter Store, a capability of AWS Systems Manager, and can be retrieved through APIs.
Step 6
Launch a Linux or Windows virtual desktop that uses Amazon DCV to submit batch jobs and run graphical user interface (GUI) tools.
Step 7
This Guidance lets you deploy Amazon Elastic File System (Amazon EFS), Amazon Simple Storage Service (Amazon S3), or Amazon FSx as storage providers.
Step 8
AWS Budgets and AWS Cost Explorer give you insight into the AWS spend generated by your cluster and let you set up cost guardrails to prevent exceeding an allocated budget.
Step 9
HPC and virtual desktop information is automatically indexed in an optional Amazon OpenSearch Service cluster.
Step 10
Use security services and resources, such as AWS Secrets Manager, AWS Certificate Manager (ACM), and AWS Identity and Access Management (IAM).
Step 11
Use Amazon Cognito or AWS Directory Service as identity providers. Additionally, you can deploy a stand-alone OpenLDAP server if needed. An OpenSearch Service cluster stores job and host information in addition to metadata.
Step 12
Native integration with AWS Backup automatically snapshots your environment resources, such as key EC2 instances and file systems.
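
Step 5 notes that the configuration is retrievable through APIs. The following is a minimal sketch of what such a lookup could look like, assuming the configuration lives under a single Parameter Store path. The function name `load_cluster_config` and the example prefix are hypothetical and are not taken from this Guidance; the only AWS call used is the Systems Manager `get_parameters_by_path` operation, invoked on whatever client object is passed in.

```python
from typing import Any, Dict


def load_cluster_config(ssm_client: Any, prefix: str) -> Dict[str, str]:
    """Return every parameter stored under `prefix` as a flat dict.

    `ssm_client` is expected to behave like boto3.client("ssm").
    `prefix` is a hypothetical path; use the one your deployment actually created.
    """
    config: Dict[str, str] = {}
    kwargs = {"Path": prefix, "Recursive": True, "WithDecryption": True}
    while True:
        page = ssm_client.get_parameters_by_path(**kwargs)
        for param in page.get("Parameters", []):
            # Drop the shared prefix so keys are relative to the cluster path.
            key = param["Name"][len(prefix):].lstrip("/")
            config[key] = param["Value"]
        token = page.get("NextToken")
        if not token:
            return config
        # Results are paginated; keep requesting pages until no token is returned.
        kwargs["NextToken"] = token
```

With boto3 installed and credentials configured, a call might look like `load_cluster_config(boto3.client("ssm"), "/example/cluster")`, where the path is a placeholder.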

Deploy with confidence

Everything you need to launch this Guidance in your account is right here.

We'll walk you through it

Dive deep into the implementation guide for additional customization options and service configurations to tailor to your specific needs.

Let's make it happen

Ready to deploy? Review the sample code on GitHub for detailed instructions on deploying as-is or customizing to fit your needs.

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Operational Excellence

OpenSearch Service automatically ingests and retains critical cluster and job metadata, enabling long-term data analysis and business recommendations. Amazon CloudWatch monitors HPC and visualization node metrics in near real time, so you can detect anomalies and optimize system performance. Visualizing job information, including runtime, license utilization, pricing, and resource allocation, helps you optimize your compute infrastructure.

Read the Operational Excellence whitepaper

Security

Scoped IAM policies help ensure minimum required permissions for a secure environment. Multiple Amazon EC2 security groups limit network traffic and enhance protection. Sensitive information, such as HTTPS certificates and directory service credentials, is securely stored in ACM and Secrets Manager, respectively. If single sign-on (SSO) is enabled, SAML authentication is offloaded to Amazon Cognito, providing a secure and scalable authentication solution.
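
To illustrate what "scoped" can mean in practice, here is a small sketch that builds an IAM policy document granting read access to a single Parameter Store path and nothing else. It is an example under stated assumptions, not the policy this Guidance ships: the function name, the chosen actions, and the path are all hypothetical, and the standard `aws` partition is assumed.

```python
import json
from typing import Any, Dict


def scoped_parameter_policy(region: str, account_id: str, prefix: str) -> Dict[str, Any]:
    """Build a policy document limited to parameters under one path (illustrative only)."""
    # Parameter ARNs omit the leading slash of the parameter name.
    resource = f"arn:aws:ssm:{region}:{account_id}:parameter/{prefix.strip('/')}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ssm:GetParameter", "ssm:GetParameters"],
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and path, shown only to print the resulting JSON.
    print(json.dumps(scoped_parameter_policy("us-east-1", "123456789012", "/example/cluster"), indent=2))
```

Listing specific actions and a single resource pattern, rather than wildcards, is what keeps the grant close to the minimum required.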

Read the Security whitepaper

Reliability

Elastic Load Balancing (ELB) distributes traffic across multiple Availability Zones, enhancing the reliability of HPC and virtual desktop infrastructure (VDI) workloads. Deploying the virtual private cloud (VPC) with multiple subnets provides high availability and broader access to Amazon EC2 capacity, mitigating the risk of capacity constraints that could impact tightly coupled jobs.

Read the Reliability whitepaper

Performance Efficiency

AWS infrastructure options for compute, storage, and networking accommodate the unique performance requirements of computer-aided engineering (CAE) simulations. Elastic Fabric Adapter (EFA) reduces inter-node communication latency for large-scale HPC workloads. High-performance or parallel file systems, such as Amazon FSx for Lustre, handle I/O-intensive workloads. The high-performance remote display protocol of Amazon DCV helps you maintain a responsive experience with graphically intensive workloads, such as CAD.

Read the Performance Efficiency whitepaper

Cost Optimization

AWS Budgets provides guardrails to prevent over-provisioning of compute and storage resources beyond the allocated budget threshold. This service is tightly integrated with HPC job submission queues, so that spend per queue or project cannot exceed customer-defined thresholds. AWS cost allocation tags provide administrators with visibility into current spend at the project, team, user, or service level to help ensure accurate accounting across AWS resources.
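
As a rough illustration of a queue-level budget guardrail, the sketch below checks whether a job's estimated cost fits within what remains of a budget. It assumes the input dictionary has the shape returned by the AWS Budgets `describe_budget` operation (`Budget.BudgetLimit.Amount` and `Budget.CalculatedSpend.ActualSpend.Amount`, both strings). The function names and the idea of an `estimated_job_cost` input are hypothetical; the source does not describe how the actual integration is implemented.

```python
from decimal import Decimal
from typing import Any, Dict


def remaining_budget(describe_budget_response: Dict[str, Any]) -> Decimal:
    """Return budget limit minus actual spend from a describe_budget-shaped dict."""
    budget = describe_budget_response["Budget"]
    limit = Decimal(budget["BudgetLimit"]["Amount"])
    spent = Decimal(budget["CalculatedSpend"]["ActualSpend"]["Amount"])
    return limit - spent


def job_fits_budget(describe_budget_response: Dict[str, Any], estimated_job_cost: Decimal) -> bool:
    """Hypothetical guardrail: accept the job only if its estimate fits the remaining budget."""
    return estimated_job_cost <= remaining_budget(describe_budget_response)
```

Amounts are parsed as `Decimal` rather than `float` so that currency comparisons are not affected by binary rounding.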

Read the Cost Optimization whitepaper

Sustainability

Amazon EFS automatically transitions infrequently accessed data to a lower-cost storage tier, reducing your system footprint and associated costs. Amazon EC2 Auto Scaling groups replace persistent EC2 instances, minimizing wasted compute. Additionally, the breadth of Amazon EC2 compute options allows you to optimize per application, further reducing your carbon footprint.
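
The EFS tiering described above is controlled by a lifecycle configuration on the file system. The following sketch shows one way it could be set with the Amazon EFS `put_lifecycle_configuration` operation. The helper name, the 30-day default, and the file system ID in the usage note are assumptions for illustration; this Guidance may configure lifecycle management differently.

```python
from typing import Any, Dict, List


def enable_infrequent_access_tiering(
    efs_client: Any, file_system_id: str, after: str = "AFTER_30_DAYS"
) -> List[Dict[str, str]]:
    """Ask EFS to move files not accessed for the given period to Infrequent Access.

    `efs_client` is expected to behave like boto3.client("efs").
    The 30-day default is an illustrative choice, not a value from this Guidance.
    """
    policies = [{"TransitionToIA": after}]
    efs_client.put_lifecycle_configuration(
        FileSystemId=file_system_id,
        LifecyclePolicies=policies,
    )
    return policies
```

With boto3 available, this could be called as `enable_infrequent_access_tiering(boto3.client("efs"), "fs-0123456789abcdef0")`, where the ID is a placeholder.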

Read the Sustainability whitepaper

Amazon Lab126 Creates HPC Solution to Help Teams Speed Development and Innovation

This case study shows how Amazon Lab126 accelerated hardware product development by using AWS HPC solutions to run large-scale thermal and mechanical simulations.

Rivian Pushes the Pace of Automotive Innovation with AWS

AWS re:Invent 2020: Rivian pushes the pace of automotive innovation with AWS