

# How AWS ParallelCluster works
<a name="functional-v3"></a>

AWS ParallelCluster was built not only as a way to manage clusters, but also as a reference for how to use AWS services to build your HPC environment. The following topics describe the AWS ParallelCluster processes, the AWS services that AWS ParallelCluster uses and how it uses them, and the internal directories.

**Topics**
+ [AWS ParallelCluster processes](processes-v3.md)
+ [AWS services used by AWS ParallelCluster](aws-services-v3.md)
+ [AWS ParallelCluster internal directories](directories-v3.md)

# AWS ParallelCluster processes
<a name="processes-v3"></a>

This section applies to clusters that are deployed with Slurm. When used with this scheduler, AWS ParallelCluster interacts with the underlying job scheduler to manage compute node provisioning and removal.

For HPC clusters that are based on AWS Batch, AWS ParallelCluster relies on the capabilities provided by AWS Batch to manage compute nodes.

## `clustermgtd`
<a name="clustermgtd-v3"></a>

The cluster management daemon (`clustermgtd`) performs these tasks:
+ Clean up inactive partitions
+ Manage Slurm reservations and nodes associated with Capacity Blocks (see the following section)
+ Manage static capacity to make sure it is always up and healthy
+ Sync scheduler with Amazon EC2
+ Clean up orphaned instances
+ Restore scheduler node status upon an Amazon EC2 termination that happens outside of the suspend workflow
+ Manage unhealthy Amazon EC2 instances (those that fail Amazon EC2 health checks)
+ Manage scheduled maintenance events
+ Manage unhealthy scheduler nodes (those that fail scheduler health checks)

### Management of Slurm reservations and nodes associated with Capacity Blocks
<a name="mgmtofSlurmReservationNodesForCB-v3"></a>

AWS ParallelCluster supports On-Demand Capacity Reservations (ODCR) and Capacity Blocks for Machine Learning (CB). Unlike an ODCR, a CB can have a future start time and is time-bound.

`clustermgtd` searches for unhealthy nodes in a loop, terminates any Amazon EC2 instances that are down, and replaces them with new instances if they are static nodes.
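One pass of the loop described above can be pictured with a minimal Python sketch. This is not the `clustermgtd` source: the `Node` type, its fields, and the `plan_actions` function are hypothetical names introduced only for illustration, and the sketch only plans actions rather than calling Amazon EC2 or Slurm.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical view of a scheduler node and the instance behind it."""
    name: str
    is_static: bool
    instance_id: Optional[str]  # None when no instance backs the node
    healthy: bool


def plan_actions(nodes: List[Node]) -> List[Tuple[str, str]]:
    """Return (action, target) pairs for one pass over the cluster nodes."""
    actions = []
    for node in nodes:
        if node.healthy:
            continue
        # An unhealthy node that still has an instance gets it terminated.
        if node.instance_id is not None:
            actions.append(("terminate", node.instance_id))
        # Only static nodes are replaced with a new instance.
        if node.is_static:
            actions.append(("replace", node.name))
    return actions
```

In this sketch, an unhealthy dynamic node only loses its instance, while an unhealthy static node is also scheduled for replacement.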

AWS ParallelCluster manages static nodes associated with Capacity Blocks differently: it creates the cluster even if the CB is not yet active, and automatically launches instances once the CB is active.

The Slurm nodes that correspond to compute resources associated with CBs that are not yet active are kept in the maintenance state until the CB start time is reached. These Slurm nodes remain in a reservation/maintenance state associated with the Slurm admin user, which means they can accept jobs, but the jobs remain pending until the Slurm reservation is removed.

`clustermgtd` creates or deletes Slurm reservations automatically, putting the related CB nodes in a maintenance state based on the CB state. When the CB becomes active, the Slurm reservation is removed, and the nodes start and become available for the pending jobs or for new job submissions.

When the CB end time is reached, the nodes are moved back to a reservation/maintenance state. When the CB is no longer active and its instances are terminated, users are responsible for resubmitting or requeueing their jobs to a different queue or compute resource.
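The reservation lifecycle described in this section can be summarized in a short Python sketch. It is an illustration under stated assumptions, not the actual implementation: it approximates the CB state with a simple start/end time window, and the function names and returned action strings are hypothetical.

```python
from datetime import datetime, timezone


def cb_is_active(start: datetime, end: datetime, now: datetime) -> bool:
    """Assumption: treat the CB as active from its start time up to its end time."""
    return start <= now < end


def reconcile_reservation(start: datetime, end: datetime, now: datetime,
                          reservation_exists: bool) -> str:
    """Decide what to do with the Slurm reservation for one Capacity Block."""
    active = cb_is_active(start, end, now)
    if active and reservation_exists:
        # CB is active: remove the reservation so pending jobs can run.
        return "delete_reservation"
    if not active and not reservation_exists:
        # CB not yet active, or already ended: hold the nodes in maintenance.
        return "create_reservation"
    return "no_change"
```

For example, with a CB that runs from 2024-01-01 to 2024-01-02 (UTC), a call made on 2023-12-31 with no existing reservation returns `"create_reservation"`, and a call made at noon on 2024-01-01 with the reservation in place returns `"delete_reservation"`.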

## `clusterstatusmgtd`
<a name="clusterstatusmgtd-v3"></a>

The cluster status management daemon (`clusterstatusmgtd`) manages compute fleet status updates. Every minute, it fetches the fleet status stored in an Amazon DynamoDB table and processes any STOP or START request.
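The polling behavior can be sketched as follows. This is a simplified illustration rather than the daemon's code: the status strings `START_REQUESTED` and `STOP_REQUESTED`, the `fetch_status` callable (standing in for the DynamoDB read), and the returned action names are all assumptions made for the example.

```python
import time
from typing import Callable, List

POLL_INTERVAL_SECONDS = 60  # "every minute", as described above


def handle_fleet_status(status: str) -> str:
    """Map a stored fleet status to the action the daemon would take."""
    if status == "START_REQUESTED":  # hypothetical status value
        return "start_fleet"
    if status == "STOP_REQUESTED":   # hypothetical status value
        return "stop_fleet"
    return "nothing"


def run(fetch_status: Callable[[], str], iterations: int,
        sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll the status store a fixed number of times and collect the actions."""
    actions = []
    for _ in range(iterations):
        actions.append(handle_fleet_status(fetch_status()))
        sleep(POLL_INTERVAL_SECONDS)
    return actions
```

Passing `sleep` as a parameter keeps the loop testable without waiting a full minute between iterations.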

## `computemgtd`
<a name="computemgtd-v3"></a>

A compute management daemon (`computemgtd`) process runs on each of the cluster compute nodes. Every five (5) minutes, the compute management daemon confirms that the head node can be reached and is healthy. If five (5) minutes pass during which the head node cannot be reached or is not healthy, the compute node is shut down.
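The timeout rule can be expressed as a small watchdog. The following Python sketch assumes a hypothetical `HeadNodeWatchdog` class and takes timestamps in seconds as plain numbers; how the real daemon checks head node health is not shown here.

```python
CHECK_INTERVAL_SECONDS = 300        # a check runs every five minutes
UNREACHABLE_TIMEOUT_SECONDS = 300   # shut down after five minutes without a healthy head node


class HeadNodeWatchdog:
    """Tracks when the head node was last seen healthy (illustrative only)."""

    def __init__(self, now: float):
        self.last_healthy = now

    def observe(self, head_node_healthy: bool, now: float) -> bool:
        """Record one check; return True when the compute node should shut down."""
        if head_node_healthy:
            self.last_healthy = now
            return False
        return now - self.last_healthy >= UNREACHABLE_TIMEOUT_SECONDS
```

A caller would invoke `observe` once per check interval and start the shutdown when it returns `True`.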

# AWS services used by AWS ParallelCluster
<a name="aws-services-v3"></a>

The following Amazon Web Services (AWS) services are used by AWS ParallelCluster.

**Topics**
+ [Amazon API Gateway](#aws-api-gateway-v3)
+ [AWS Batch](#aws-batch-v3)
+ [CloudFormation](#aws-services-cloudformation-v3)
+ [Amazon CloudWatch](#amazon-cloudwatch-v3)
+ [Amazon CloudWatch Events](#amazon-cloudwatch-events-v3)
+ [Amazon CloudWatch Logs](#amazon-cloudwatch-logs-v3)
+ [AWS CodeBuild](#aws-codebuild-v3)
+ [Amazon DynamoDB](#amazon-dynamodb-v3)
+ [Amazon Elastic Block Store](#amazon-elastic-block-store-ebs-v3)
+ [Amazon Elastic Compute Cloud](#amazon-ec2-v3)
+ [Amazon Elastic Container Registry](#amazon-elastic-container-registry-ecr-v3)
+ [Amazon EFS](#amazon-efs-v3)
+ [Amazon FSx for Lustre](#amazon-fsx-for-lustre-v3)
+ [Amazon FSx for NetApp ONTAP](#amazon-fsx-ontap-v3)
+ [Amazon FSx for OpenZFS](#amazon-fsx-openzfs-v3)
+ [AWS Identity and Access Management](#aws-identity-and-access-management-iam-v3)
+ [AWS Lambda](#aws-lambda-v3)
+ [Amazon RDS](#aws-rds-v3)
+ [Amazon Route 53](#amazon-route-53-v3)
+ [Amazon Simple Notification Service](#aws-sns-v3)
+ [Amazon Simple Storage Service](#amazon-s3-v3)
+ [Amazon VPC](#amazon-vpc-v3)
+ [Elastic Fabric Adapter](#aws-efa-v3)
+ [EC2 Image Builder](#aws-image-builder-v3)
+ [Amazon DCV](#nice-dcv-v3)

## Amazon API Gateway
<a name="aws-api-gateway-v3"></a>

Amazon API Gateway is an AWS service that makes it possible to create, publish, maintain, monitor, and secure REST, HTTP, and WebSocket APIs at any scale.

AWS ParallelCluster uses API Gateway to host the AWS ParallelCluster API.

For more information about Amazon API Gateway, see [https://aws.amazon.com/api-gateway/](https://aws.amazon.com/api-gateway/) and [https://docs.aws.amazon.com/apigateway/](https://docs.aws.amazon.com/apigateway/).

## AWS Batch
<a name="aws-batch-v3"></a>

AWS Batch is an AWS managed job scheduler service. It dynamically provisions the optimal quantity and type of compute resources (for example, CPU or memory-optimized instances) in AWS Batch clusters. These resources are provisioned based on the specific requirements of your batch jobs, including volume requirements. With AWS Batch, you don't need to install or manage additional batch computing software or server clusters to run your jobs effectively.

AWS Batch is used only with AWS Batch clusters.

For more information about AWS Batch, see [https://aws.amazon.com/batch/](https://aws.amazon.com/batch/) and [https://docs.aws.amazon.com/batch/](https://docs.aws.amazon.com/batch/).

## CloudFormation
<a name="aws-services-cloudformation-v3"></a>

CloudFormation is an infrastructure-as-code service that provides a common language to model and provision AWS and third-party application resources in your cloud environment. It is the main service used by AWS ParallelCluster. Each cluster in AWS ParallelCluster is represented as a stack, and all resources required by each cluster are defined within the AWS ParallelCluster CloudFormation template. In most cases, AWS ParallelCluster CLI commands directly correspond to CloudFormation stack commands, such as create, update, and delete. Instances that are launched within a cluster make HTTPS calls to the CloudFormation endpoint in the AWS Region where the cluster is launched.

For more information about CloudFormation, see [https://aws.amazon.com/cloudformation/](https://aws.amazon.com/cloudformation/) and [https://docs.aws.amazon.com/cloudformation/](https://docs.aws.amazon.com/cloudformation/).

## Amazon CloudWatch
<a name="amazon-cloudwatch-v3"></a>

Amazon CloudWatch (CloudWatch) is a monitoring and observability service that provides you with data and actionable insights. These insights can be used to monitor your applications, respond to performance changes and service exceptions, and optimize resource utilization. In AWS ParallelCluster, CloudWatch is used for a dashboard and to monitor and log the Docker image build steps and the output of AWS Batch jobs.

Before AWS ParallelCluster version 2.10.0, CloudWatch was used only with AWS Batch clusters.

For more information about CloudWatch, see [https://aws.amazon.com/cloudwatch/](https://aws.amazon.com/cloudwatch/) and [https://docs.aws.amazon.com/cloudwatch/](https://docs.aws.amazon.com/cloudwatch/).

## Amazon CloudWatch Events
<a name="amazon-cloudwatch-events-v3"></a>

Amazon CloudWatch Events (CloudWatch Events) delivers a near real-time stream of system events that describe changes in Amazon Web Services (AWS) resources. Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams. In AWS ParallelCluster, CloudWatch Events is used for AWS Batch jobs.

For more information about CloudWatch Events, see [https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-cwe-now-eb.html](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-cwe-now-eb.html).

## Amazon CloudWatch Logs
<a name="amazon-cloudwatch-logs-v3"></a>

Amazon CloudWatch Logs (CloudWatch Logs) is one of the core features of Amazon CloudWatch. You can use it to monitor, store, view, and search the log files for many of the components used by AWS ParallelCluster. 

Before AWS ParallelCluster version 2.6.0, CloudWatch Logs was only used with AWS Batch clusters.

For more information, see [Integration with Amazon CloudWatch Logs](cloudwatch-logs-v3.md).

## AWS CodeBuild
<a name="aws-codebuild-v3"></a>

AWS CodeBuild (CodeBuild) is an AWS managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy. In AWS ParallelCluster, CodeBuild is used to automatically and transparently build Docker images when clusters are created.

CodeBuild is used only with AWS Batch clusters.

For more information about CodeBuild, see [https://aws.amazon.com/codebuild/](https://aws.amazon.com/codebuild/) and [https://docs.aws.amazon.com/codebuild/](https://docs.aws.amazon.com/codebuild/).

## Amazon DynamoDB
<a name="amazon-dynamodb-v3"></a>

Amazon DynamoDB (DynamoDB) is a fast and flexible NoSQL database service. It is used to store the minimal state information of the cluster. The head node tracks provisioned instances in a DynamoDB table.

DynamoDB is not used with AWS Batch clusters.

For more information about DynamoDB, see [https://aws.amazon.com/dynamodb/](https://aws.amazon.com/dynamodb/) and [https://docs.aws.amazon.com/dynamodb/](https://docs.aws.amazon.com/dynamodb/).

## Amazon Elastic Block Store
<a name="amazon-elastic-block-store-ebs-v3"></a>

Amazon Elastic Block Store (Amazon EBS) is a high-performance block storage service that provides persistent storage for shared volumes. All Amazon EBS settings can be passed through the configuration. Amazon EBS volumes can either be initialized empty or from an existing Amazon EBS snapshot.

For more information about Amazon EBS, see [https://aws.amazon.com/ebs/](https://aws.amazon.com/ebs/) and [https://docs.aws.amazon.com/ebs/](https://docs.aws.amazon.com/ebs/).

## Amazon Elastic Compute Cloud
<a name="amazon-ec2-v3"></a>

Amazon Elastic Compute Cloud (Amazon EC2) provides the computing capacity for AWS ParallelCluster. The head and compute nodes are Amazon EC2 instances. Any instance type that supports hardware virtual machine (HVM) can be selected. The head and compute nodes can be different instance types. Moreover, if multiple queues are used, some or all of the compute nodes can also be launched as Spot Instances. Instance store volumes found on the instances are mounted as striped Logical Volume Manager (LVM) volumes.

For more information about Amazon EC2, see [https://aws.amazon.com/ec2/](https://aws.amazon.com/ec2/) and [https://docs.aws.amazon.com/ec2/](https://docs.aws.amazon.com/ec2/).

## Amazon Elastic Container Registry
<a name="amazon-elastic-container-registry-ecr-v3"></a>

Amazon Elastic Container Registry (Amazon ECR) is a fully managed Docker container registry that makes it easy to store, manage, and deploy Docker container images. In AWS ParallelCluster, Amazon ECR stores the Docker images that are built when clusters are created. The Docker images are then used by AWS Batch to run the containers for the submitted jobs.

Amazon ECR is used only with AWS Batch clusters.

For more information, see [https://aws.amazon.com/ecr/](https://aws.amazon.com/ecr/) and [https://docs.aws.amazon.com/ecr/](https://docs.aws.amazon.com/ecr/).

## Amazon EFS
<a name="amazon-efs-v3"></a>

Amazon Elastic File System (Amazon EFS) provides a simple, scalable, and fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources. Amazon EFS is used when the [`EfsSettings`](SharedStorage-v3.md#SharedStorage-v3-EfsSettings) are specified. Support for Amazon EFS was added in AWS ParallelCluster version 2.1.0.

For more information about Amazon EFS, see [https://aws.amazon.com/efs/](https://aws.amazon.com/efs/) and [https://docs.aws.amazon.com/efs/](https://docs.aws.amazon.com/efs/).

## Amazon FSx for Lustre
<a name="amazon-fsx-for-lustre-v3"></a>

FSx for Lustre provides a high-performance file system that uses the open-source Lustre file system. FSx for Lustre is used when the [`FsxLustreSettings` properties](SharedStorage-v3.md#SharedStorage-v3-FsxLustreSettings.properties) are specified. Support for FSx for Lustre was added in AWS ParallelCluster version 2.2.1.

For more information about FSx for Lustre, see [https://aws.amazon.com/fsx/lustre/](https://aws.amazon.com/fsx/lustre/) and [https://docs.aws.amazon.com/fsx/](https://docs.aws.amazon.com/fsx/).

## Amazon FSx for NetApp ONTAP
<a name="amazon-fsx-ontap-v3"></a>

FSx for ONTAP provides a fully managed shared storage system built on NetApp's popular ONTAP file system. FSx for ONTAP is used when [`FsxOntapSettings` properties](SharedStorage-v3.md#SharedStorage-v3-FsxOntapSettings.properties) are specified. Support for FSx for ONTAP was added in AWS ParallelCluster version 3.2.0.

For more information about FSx for ONTAP, see [https://aws.amazon.com/fsx/netapp-ontap/](https://aws.amazon.com/fsx/netapp-ontap/) and [https://docs.aws.amazon.com/fsx/](https://docs.aws.amazon.com/fsx/).

## Amazon FSx for OpenZFS
<a name="amazon-fsx-openzfs-v3"></a>

FSx for OpenZFS provides a fully managed shared storage system built on the popular OpenZFS file system. FSx for OpenZFS is used when the [`FsxOpenZfsSettings` properties](SharedStorage-v3.md#SharedStorage-v3-FsxOpenZfsSettings.properties) are specified. Support for FSx for OpenZFS was added in AWS ParallelCluster version 3.2.0.

For more information about FSx for OpenZFS, see [https://aws.amazon.com/fsx/openzfs/](https://aws.amazon.com/fsx/openzfs/) and [https://docs.aws.amazon.com/fsx/](https://docs.aws.amazon.com/fsx/).

## AWS Identity and Access Management
<a name="aws-identity-and-access-management-iam-v3"></a>

AWS Identity and Access Management (IAM) is used within AWS ParallelCluster to provide the Amazon EC2 instances with a least-privilege IAM role that is specific to each individual cluster. AWS ParallelCluster instances are given access only to the specific API calls that are required to deploy and manage the cluster.

With AWS Batch clusters, IAM roles are also created for the components that are involved with the Docker image building process when clusters are created. These components include the Lambda functions that are allowed to add and delete Docker images to and from the Amazon ECR repository. They also include the functions allowed to delete the Amazon S3 bucket that is created for the cluster and CodeBuild project. There are also roles for AWS Batch resources, instances, and jobs.

For more information about IAM, see [https://aws.amazon.com/iam/](https://aws.amazon.com/iam/) and [https://docs.aws.amazon.com/iam/](https://docs.aws.amazon.com/iam/).

## AWS Lambda
<a name="aws-lambda-v3"></a>

AWS Lambda (Lambda) runs the functions that orchestrate the creation of Docker images. Lambda also manages the cleanup of custom cluster resources, such as Docker images stored in the Amazon ECR repository and on Amazon S3.

For more information about Lambda, see [https://aws.amazon.com/lambda/](https://aws.amazon.com/lambda/) and [https://docs.aws.amazon.com/lambda/](https://docs.aws.amazon.com/lambda/).

## Amazon RDS
<a name="aws-rds-v3"></a>

Amazon Relational Database Service (Amazon RDS) is a web service that makes it easier to set up, operate, and scale a relational database in the AWS Cloud.

AWS ParallelCluster uses Amazon RDS for AWS Batch and Slurm.

For more information about Amazon RDS, see [https://aws.amazon.com/rds/](https://aws.amazon.com/rds/) and [https://docs.aws.amazon.com/rds/](https://docs.aws.amazon.com/rds/).

## Amazon Route 53
<a name="amazon-route-53-v3"></a>

Amazon Route 53 (Route 53) is used to create hosted zones with hostnames and fully qualified domain names for each of the compute nodes.

For more information about Route 53, see [https://aws.amazon.com/route53/](https://aws.amazon.com/route53/) and [https://docs.aws.amazon.com/route53/](https://docs.aws.amazon.com/route53/).

## Amazon Simple Notification Service
<a name="aws-sns-v3"></a>

Amazon Simple Notification Service (Amazon SNS) is a managed service that provides message delivery from publishers to subscribers (also known as producers and consumers).

AWS ParallelCluster uses Amazon SNS for API hosting.

For more information about Amazon SNS, see [https://aws.amazon.com/sns/](https://aws.amazon.com/sns/) and [https://docs.aws.amazon.com/sns/](https://docs.aws.amazon.com/sns/).

## Amazon Simple Storage Service
<a name="amazon-s3-v3"></a>

Amazon Simple Storage Service (Amazon S3) stores AWS ParallelCluster templates located in each AWS Region. AWS ParallelCluster can be configured to allow CLI/SDK tools to use Amazon S3.

AWS ParallelCluster also creates an Amazon S3 bucket in your AWS account to store resources that are used by your clusters, such as the cluster configuration file. AWS ParallelCluster maintains one Amazon S3 bucket in each AWS Region that you create clusters in.

When you use an AWS Batch cluster, an Amazon S3 bucket in your account is used for storing related data. For example, the bucket stores artifacts created when a Docker image and scripts are created from submitted jobs.

For more information, see [https://aws.amazon.com/s3/](https://aws.amazon.com/s3/) and [https://docs.aws.amazon.com/s3/](https://docs.aws.amazon.com/s3/).

## Amazon VPC
<a name="amazon-vpc-v3"></a>

An Amazon Virtual Private Cloud (Amazon VPC) defines the network used by the nodes in your cluster.

For more information about Amazon VPC, see [https://aws.amazon.com/vpc/](https://aws.amazon.com/vpc/) and [https://docs.aws.amazon.com/vpc/](https://docs.aws.amazon.com/vpc/).

## Elastic Fabric Adapter
<a name="aws-efa-v3"></a>

Elastic Fabric Adapter (EFA) is a network interface for instances that you can use to run applications requiring high levels of inter-node communications at scale on AWS.

For more information about Elastic Fabric Adapter, see [https://aws.amazon.com/hpc/efa/](https://aws.amazon.com/hpc/efa/).

## EC2 Image Builder
<a name="aws-image-builder-v3"></a>

EC2 Image Builder is a fully managed AWS service that helps you to automate the creation, management, and deployment of customized, secure, and up-to-date server images.

AWS ParallelCluster uses Image Builder to create and manage AWS ParallelCluster images.

For more information about EC2 Image Builder, see [https://aws.amazon.com/image-builder/](https://aws.amazon.com/image-builder/) and [https://docs.aws.amazon.com/imagebuilder/](https://docs.aws.amazon.com/imagebuilder/).

## Amazon DCV
<a name="nice-dcv-v3"></a>

Amazon DCV is a high-performance remote display protocol that provides a secure way to deliver remote desktops and application streaming to any device over varying network conditions. Amazon DCV is used when the [`HeadNode` section](HeadNode-v3.md) / [`Dcv`](HeadNode-v3.md#HeadNode-v3-Dcv) settings are specified. Support for Amazon DCV was added in AWS ParallelCluster version 2.5.0.

For more information about Amazon DCV, see [https://aws.amazon.com/hpc/dcv/](https://aws.amazon.com/hpc/dcv/) and [https://docs.aws.amazon.com/dcv/](https://docs.aws.amazon.com/dcv/).

# AWS ParallelCluster internal directories
<a name="directories-v3"></a>

There are several internal directories that AWS ParallelCluster uses to share data within the cluster. The following directories are shared between the head node, compute nodes, and login nodes: 
+ `/opt/slurm`
+ `/opt/intel`
+ `/opt/parallelcluster/shared` (only with compute nodes)
+ `/opt/parallelcluster/shared_login_nodes` (only with login nodes)
+ `/home` (unless specified in `SharedStorage`)

**Note**  
By default, these directories are created on the head node's Amazon EBS volume and shared as NFS exports to the compute and login nodes. Starting with AWS ParallelCluster 3.8, you can have AWS ParallelCluster create and manage an Amazon EFS file system to host and share these directories by setting the [SharedStorageType](HeadNode-v3.md#yaml-HeadNode-SharedStorageType) parameter to `efs`.  
When the cluster scales out, NFS exports from the Amazon EBS volume can become a performance bottleneck. With Amazon EFS, you can avoid NFS exports, and the bottlenecks associated with them, as your cluster scales out.