# Set up a Grafana monitoring dashboard for AWS ParallelCluster
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster"></a>

*Dario La Porta and William Lu, Amazon Web Services*

## Summary
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-summary"></a>

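AWS ParallelCluster helps you deploy and manage high performance computing (HPC) clusters. It supports the AWS Batch and Slurm open source job schedulers. Although AWS ParallelCluster is integrated with Amazon CloudWatch for logging and metrics, it doesn't provide a monitoring dashboard for the workload.

The [Grafana dashboard for AWS ParallelCluster](https://github.com/aws-samples/aws-parallelcluster-monitoring) (GitHub) is a monitoring dashboard for AWS ParallelCluster. It provides job scheduler insights and detailed monitoring metrics at the operating system level. For more information about the dashboards included in this solution, see [Example Dashboards](https://github.com/aws-samples/aws-parallelcluster-monitoring#example-dashboards) in the GitHub repository. These metrics help you better understand your HPC workloads and their performance. However, the dashboard code is not updated for the latest versions of AWS ParallelCluster or of the open source packages that the solution uses. This pattern enhances the solution to provide the following advantages:
+ Supports AWS ParallelCluster v3
+ Uses the latest versions of the open source packages, including Prometheus, Grafana, Prometheus Slurm Exporter, and NVIDIA DCGM-Exporter
+ Adds the number of CPU cores and GPUs that Slurm jobs use
+ Adds a job monitoring dashboard
+ Enhances the GPU node monitoring dashboard for nodes with 4 or 8 graphics processing units (GPUs)

This version of the enhanced solution has been implemented and verified in an AWS customer's HPC production environment.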

## Prerequisites and limitations
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-prereqs"></a>

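**Prerequisites**
+ [AWS ParallelCluster CLI](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster-v3.html), installed and configured.
+ A supported [network configuration](https://docs.aws.amazon.com/parallelcluster/latest/ug/iam-roles-in-parallelcluster-v3.html) for AWS ParallelCluster. This pattern uses the [AWS ParallelCluster using two subnets](https://docs.aws.amazon.com/parallelcluster/latest/ug/network-configuration-v3.html#network-configuration-v3-two-subnets) configuration, which requires a public subnet, a private subnet, an internet gateway, and a NAT gateway.
+ All AWS ParallelCluster cluster nodes must have internet access. This is required so that the installation scripts can download the open source software and Docker images.
+ A [key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) in Amazon Elastic Compute Cloud (Amazon EC2). Resources that have this key pair have Secure Shell (SSH) access to the head node.

**Limitations**
+ This pattern is designed to support Ubuntu 20.04 LTS. If you use a different version of Ubuntu, or if you use Amazon Linux or CentOS, you need to modify the scripts provided with this solution. These modifications are not included in this pattern.

**Product versions**
+ Ubuntu 20.04 LTS
+ ParallelCluster 3.X

**Billing and cost considerations**
+ The solution deployed in this pattern is not covered by the free tier. Charges apply for Amazon EC2, Amazon FSx for Lustre, the NAT gateway in Amazon VPC, and Amazon Route 53.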

## Architecture
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-architecture"></a>

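**Target architecture**

The following diagram shows how users access the monitoring dashboard for AWS ParallelCluster on the head node. The head node runs NICE DCV, Prometheus, Grafana, Prometheus Slurm Exporter, Prometheus Node Exporter, and NGINX Open Source. The compute nodes run Prometheus Node Exporter, and they also run NVIDIA DCGM-Exporter if the node contains GPUs. The head node retrieves information from the compute nodes and displays that data in the Grafana dashboard.

![\[Accessing the monitoring dashboard for AWS ParallelCluster on the head node.\]](http://docs.aws.amazon.com/zh_cn/prescriptive-guidance/latest/patterns/images/pattern-img/a2132c94-98e0-4b90-8be0-99ebfa546442/images/d2255792-f66a-4ef2-8f04-cc3d5482db5f.png)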


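In most cases, the head node is not heavily loaded because the job scheduler doesn't require a significant amount of CPU or memory. Users access the dashboard on the head node over SSL on port 443.

All authorized viewers can view the monitoring dashboards anonymously. Only the Grafana administrator can modify dashboards. You configure the password for the Grafana administrator in the `aws-parallelcluster-monitoring/docker-compose/docker-compose.head.yml` file.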
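
Because the Grafana administrator is the only account that can modify dashboards, its password should be hard to guess. The following is a minimal sketch, assuming you want a random URL-safe string to place in `docker-compose.head.yml`. The function name `generate_admin_password` and the 24-byte default are illustrative choices and are not part of the solution's scripts.

```python
import secrets


def generate_admin_password(num_bytes: int = 24) -> str:
    """Return a random URL-safe password.

    Hypothetical helper: it is not part of the repository's scripts.
    """
    if num_bytes < 16:
        raise ValueError("use at least 16 random bytes")
    # token_urlsafe draws from the OS random source and emits only
    # letters, digits, '-' and '_'.
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(generate_admin_password())
```

Store the generated value in a password manager before you write it into the configuration file, because it cannot be derived again.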

## Tools
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-tools"></a>

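**AWS services**
+ [NICE DCV](https://docs.aws.amazon.com/dcv/#nice-dcv) is a high-performance remote display protocol that helps you deliver remote desktops and application streaming from any cloud or data center to any device, over varying network conditions.
+ [AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html) helps you deploy and manage high performance computing (HPC) clusters. It supports the AWS Batch and Slurm open source job schedulers.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
+ [Amazon Virtual Private Cloud (Amazon VPC)](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) helps you launch AWS resources into a virtual network that you've defined.

**Other tools**
+ [Docker](https://www.docker.com/) is a set of platform as a service (PaaS) products that use virtualization at the operating-system level to deliver software in containers.
+ [Grafana](https://grafana.com/docs/grafana/latest/introduction/) is open source software that helps you query, visualize, alert on, and explore metrics, logs, and traces.
+ [NGINX Open Source](https://nginx.org/en/docs/?_ga=2.187509224.1322712425.1699399865-405102969.1699399865) is an open source web server and reverse proxy.
+ [NVIDIA Data Center GPU Manager (DCGM)](https://docs.nvidia.com/data-center-gpu-manager-dcgm/index.html) is a suite of tools for managing and monitoring NVIDIA data center graphics processing units (GPUs) in cluster environments. In this pattern, you use [DCGM-Exporter](https://github.com/NVIDIA/dcgm-exporter), which helps you export GPU metrics to Prometheus.
+ [Prometheus](https://prometheus.io/docs/introduction/overview/) is an open source system-monitoring toolkit that collects and stores its metrics as time-series data with associated key-value pairs, which are called *labels*. In this pattern, you also use [Prometheus Slurm Exporter](https://github.com/vpenso/prometheus-slurm-exporter) to collect and export metrics, and you use [Prometheus Node Exporter](https://github.com/prometheus/node_exporter) to export metrics from the compute nodes.
+ [Ubuntu](https://help.ubuntu.com/) is an open source, Linux-based operating system that is designed for enterprise servers, desktops, cloud environments, and IoT.

**Code repository**

The code for this pattern is available in the GitHub [pcluster-monitoring-dashboard](https://github.com/aws-samples/parallelcluster-monitoring-dashboard) repository.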

## Epics
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-epics"></a>

### Create the required resources
<a name="create-the-required-resources"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create an S3 bucket. | Create an Amazon S3 bucket. You use this bucket to store the configuration scripts. For instructions, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) in the Amazon S3 documentation. | General AWS | 
| Clone the repository. | Clone the GitHub [pcluster-monitoring-dashboard](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/tree/main/aws-parallelcluster-monitoring) repository by running the following command.<pre>git clone https://github.com/aws-samples/parallelcluster-monitoring-dashboard.git</pre> | DevOps engineer | 
| Create an admin password. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | Linux shell scripting | 
| Copy the required files into the S3 bucket. | Copy the [post\_install.sh](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/post_install.sh) script and the [aws-parallelcluster-monitoring](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/tree/main/aws-parallelcluster-monitoring) folder into the S3 bucket that you created. For instructions, see [Uploading objects](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html) in the Amazon S3 documentation. | General AWS | 
| Configure an additional security group for the head node. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 
| Configure an IAM policy for the head node. | Create an identity-based policy for the head node. This policy allows the node to retrieve metric data from Amazon CloudWatch. The GitHub repository contains an example [policy](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/policies/head_node.json). For instructions, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html) in the AWS Identity and Access Management (IAM) documentation. | AWS administrator | 
| Configure an IAM policy for the compute nodes. | Create an identity-based policy for the compute nodes. This policy allows the node to create the tags that contain the job ID and the job owner. The GitHub repository contains an example [policy](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/policies/compute_node.json). For instructions, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html) in the IAM documentation. If you use the provided example file, replace the following values:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 
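
The security group step is not spelled out on this page, but the troubleshooting section states that inbound port 443 must be open on the head node. The following is a minimal sketch, assuming you script that rule with the AWS SDK for Python (Boto3). It only builds the `IpPermissions` structure. The function name `https_ingress_rule`, the description text, and the CIDR range are illustrative assumptions.

```python
import ipaddress


def https_ingress_rule(cidr: str, description: str = "Grafana dashboard over HTTPS") -> list:
    """Build an IpPermissions list that allows inbound TCP 443 from one IPv4 range.

    Hypothetical helper: it is not part of the repository's scripts.
    """
    # strict=True rejects ranges with host bits set, such as 203.0.113.5/24.
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4:
        raise ValueError("this sketch only handles IPv4 ranges")
    return [{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": str(network), "Description": description}],
    }]


if __name__ == "__main__":
    # 203.0.113.0/24 is a documentation-only range; replace it with your own network.
    print(https_ingress_rule("203.0.113.0/24"))
```

You could pass the returned list as the `IpPermissions` argument of the Boto3 EC2 client's `authorize_security_group_ingress` call, together with the `GroupId` of the additional security group. Restricting the range to your own network is safer than opening port 443 to `0.0.0.0/0`.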

### Create the cluster
<a name="create-the-cluster"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Modify the provided cluster template file. | Create the AWS ParallelCluster cluster. Use the provided [cluster.yaml](https://github.com/aws-samples/parallelcluster-monitoring-dashboard/blob/main/cluster.yaml) AWS CloudFormation template file as a starting point to create the cluster. Replace the following values in the provided template:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 
| Create the cluster. | In the AWS ParallelCluster CLI, enter the following command. This deploys the CloudFormation template and creates the cluster. For more information about this command, see [pcluster create-cluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.create-cluster-v3.html) in the AWS ParallelCluster documentation.<pre>pcluster create-cluster -n <cluster_name> -c cluster.yaml</pre> | AWS administrator | 
| Monitor the cluster creation. | Enter the following command to monitor the cluster creation. For more information about this command, see [pcluster describe-cluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.describe-cluster-v3.html) in the AWS ParallelCluster documentation.<pre>pcluster describe-cluster -n <cluster_name></pre> | AWS administrator | 
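
Running `pcluster describe-cluster` by hand gets tedious while the cluster is being created. The following is a minimal sketch of a polling loop. It assumes that the command prints a JSON document with a `clusterStatus` field whose value ends in `IN_PROGRESS` while work is ongoing (for example, `CREATE_IN_PROGRESS`); check the output of your CLI version before relying on it. The function names and the 30-second interval are illustrative.

```python
import json
import subprocess
import time


def cluster_status(describe_output: str) -> str:
    """Extract the clusterStatus field from `pcluster describe-cluster` JSON output."""
    return json.loads(describe_output)["clusterStatus"]


def wait_for_cluster(cluster_name: str, poll_seconds: int = 30) -> str:
    """Poll until the status no longer ends with IN_PROGRESS, then return it.

    Assumes the pcluster CLI is installed, configured, and on PATH.
    """
    while True:
        result = subprocess.run(
            ["pcluster", "describe-cluster", "-n", cluster_name],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError if the CLI exits with an error
        )
        status = cluster_status(result.stdout)
        if not status.endswith("IN_PROGRESS"):
            return status
        time.sleep(poll_seconds)
```

For example, `wait_for_cluster("my-cluster")` returns the final status string once creation finishes or fails, where `my-cluster` stands in for the name that you passed to `pcluster create-cluster`.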

### Using the Grafana dashboards
<a name="using-the-grafana-dashboards"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Access the Grafana portal. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/prescriptive-guidance/latest/patterns/set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.html) | AWS administrator | 

### Clean up the solution to stop incurring associated costs
<a name="clean-up-the-solution-to-stop-incurring-associated-costs"></a>


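| Task | Description | Skills required | 
| --- | --- | --- | 
| Delete the cluster. | Enter the following command to delete the cluster. For more information about this command, see [pcluster delete-cluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.delete-cluster-v3.html) in the AWS ParallelCluster documentation.<pre>pcluster delete-cluster -n <cluster_name></pre> | AWS administrator | 
| Delete the IAM policies. | Delete the policies that you created for the head node and the compute nodes. For more information about deleting policies, see [Deleting IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-delete.html) in the IAM documentation. | AWS administrator | 
| Delete the security group and rules. | Delete the security group that you created for the head node. For more information, see [Delete security group rules](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-security-groups.html#deleting-security-group-rules) and [Delete a security group](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-security-groups.html#deleting-security-groups) in the Amazon VPC documentation. | AWS administrator | 
| Delete the S3 bucket. | Delete the S3 bucket that you created to store the configuration scripts. For more information, see [Deleting a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) in the Amazon S3 documentation. | General AWS | 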

## Troubleshooting
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-troubleshooting"></a>


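| Issue | Solution | 
| --- | --- | 
| The head node is not accessible in the browser. | Check the security group and confirm that inbound port 443 is open. | 
| Grafana doesn't open. | On the head node, check the container log by running `docker logs Grafana`. | 
| Some metrics have no data. | On the head node, check the container logs of all containers. | 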
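
When the head node is not reachable in the browser, it helps to separate a network problem from an application problem. The following is a minimal sketch that tests whether a TCP connection to port 443 succeeds from your workstation. The function name `port_is_open` and the 3-second timeout are illustrative assumptions.

```python
import socket


def port_is_open(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, call `port_is_open("<head_node_address>")` with the public address of your head node. A result of `False` points toward the security group or the network path. A result of `True` only shows that something accepts connections on the port; it doesn't confirm that NGINX or Grafana is healthy, so check the container logs next.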

## Related resources
<a name="set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster-resources"></a>

**AWS documentation**
+ [IAM policies for Amazon EC2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-policies-for-amazon-ec2.html)

**Other AWS resources**
+ [AWS ParallelCluster](https://aws.amazon.com/hpc/parallelcluster/)
+ [Monitoring dashboard for AWS ParallelCluster](https://aws.amazon.com/blogs/compute/monitoring-dashboard-for-aws-parallelcluster/) (AWS blog post)

**Other resources**
+ [Prometheus monitoring system](https://prometheus.io/)
+ [Grafana](https://grafana.com/)