# REL11-BP01 监控工作负载的所有组件以检测故障
<a name="rel_withstand_component_failures_monitoring_health"></a>

 持续监控工作负载的运行状况，以便您和您的自动化系统立即发现任何故障或性能下降情况。监控基于商业价值的关键性能指标（KPI）。

 所有恢复和修复机制必须从快速检测问题的能力入手。首先，应该检测技术故障并加以解决。不过，可用性基于工作负载创造商业价值的能力，因此衡量它的关键性能指标（KPI）需要成为检测和补救策略的一部分。

 **期望结果：**独立监控工作负载的重要组成部分，对故障发生的时间和位置进行检测，再根据检测结果发出警报。

 **常见反模式：**
+  未配置警报，因此不会在发生中断时进行通知。
+  虽然配置了警报，但只有在达到阈值时才会发出警报，导致没有足够的响应时间。
+  收集指标的频率不够高，无法满足恢复时间目标（RTO）。
+  仅主动监控工作负载中面向客户的接口。
+  只收集技术指标，不收集业务功能指标。
+  没有衡量工作负载用户体验的指标。
+  创建的监控太多。

 **建立此最佳实践的好处：**如果在所有层都设置了适当的监控，则可以通过减少检测时间来缩短恢复时间。

 **在未建立这种最佳实践的情况下暴露的风险等级：**高 

## 实施指导
<a name="implementation-guidance"></a>

 确定将接受审查以决定是否监控的所有工作负载。确定工作负载中所有需要监控的组件后，就需要确定监控间隔。根据检测故障所花费的时间，监控间隔将直接影响启动恢复的速度。平均检测时间（MTTD）是从故障发生到修复操作开始之间的时间。服务列表应广泛而完整。

 监控必须覆盖应用程序堆栈的所有层，包括应用程序、平台、基础设施和网络。

 监控策略应考虑*灰色故障*的影响。有关灰色故障的更多详细信息，请参阅《Advanced Multi-AZ Resilience Patterns》白皮书中的 [Gray failures](https://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/gray-failures.html)。

### 实施步骤
<a name="implementation-steps"></a>
+  监控间隔取决于必须以多快的速度恢复。恢复时间取决于恢复所需的时间，因此在确定收集频率时，必须考虑此时间和恢复时间目标（RTO）。
+  为组件和托管服务配置详细监控。
  +  确定[详细监控对于 EC2 实例](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html)和[自动扩缩](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-monitoring.html)来说是否有必要。详细监控以 1 分钟为间隔提供指标，默认监控以 5 分钟为间隔提供指标。
  +  确定是否需要为 RDS 设置[增强监控](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Monitoring.html)。增强监控使用 RDS 实例上的代理，获取关于不同进程或线程的有用信息。
  +  确定以下各项的关键无服务器组件的监控要求：[Lambda](https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html)、[API Gateway](https://docs.aws.amazon.com/apigateway/latest/developerguide/monitoring_automated_manual.html)、[Amazon EKS](https://docs.aws.amazon.com/eks/latest/userguide/eks-observe.html)、[Amazon ECS](https://catalog.workshops.aws/observability/en-US/aws-managed-oss/amp/ecs)，以及所有类型的[负载均衡器](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-monitoring.html)。
  +  确定以下各项的存储组件的监控要求：[Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/monitoring-overview.html)、[Amazon FSx](https://docs.aws.amazon.com/fsx/latest/WindowsGuide/monitoring_overview.html)、[Amazon EFS](https://docs.aws.amazon.com/efs/latest/ug/monitoring_overview.html) 和 [Amazon EBS](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-volume-status.html)。
+  创建[自定义指标](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html)来测量业务关键性能指标（KPI）。工作负载会实现关键业务功能，这些功能应用作 KPI 来协助在发生间接问题时予以识别。
+  使用用户金丝雀来监控用户的故障体验。可运行并模拟客户行为的[综合事务测试](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html)（又称为“金丝雀测试”，但与金丝雀部署不同）是一项重要的测试流程。从不同的远程位置针对工作负载端点持续地运行此类测试。
+  创建跟踪用户体验的[自定义指标](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html)。如果您可以衡量客户体验，就可以确定发生了客户体验下降。
+  [设置警报](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html)，在检测到工作负载的任何部分未正常运行时发出警报，并指示什么时候自动扩展资源。警报可以直观地显示在控制面板上，通过 Amazon SNS 或电子邮件发送警报，并与自动扩缩功能结合使用来纵向扩展或缩减工作负载资源。
+  创建[控制面板](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html)，以可视化形式呈现指标。可以使用控制面板直观地查看趋势、离群值和表示其他潜在问题的指标，或者提供您可能需要调查的问题的指示。
+  为服务创建[分布式跟踪监控](https://aws.amazon.com/xray/faqs/)。使用分布式监控，您可以了解应用程序及其底层服务的运行情况，以便确定和诊断性能问题及错误的根本原因。
+  在单独的区域和账户中创建监控系统（使用 [CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_xaxr_dashboard.html) 或 [X-Ray](https://aws.amazon.com/xray/faqs/)）控制面板和数据收集。
+  随时了解 [AWS Health](https://aws.amazon.com/premiumsupport/technology/aws-health/) 的服务降级情况。通过 [AWS 用户通知服务](https://docs.aws.amazon.com/notifications/latest/userguide/what-is-service.html) [创建要发送到电子邮件和聊天渠道且契合目标的 AWS Health 事件通知](https://docs.aws.amazon.com/health/latest/ug/user-notifications.html)，并以编程方式[通过 Amazon EventBridge 与监控和警报工具](https://docs.aws.amazon.com/health/latest/ug/cloudwatch-events-health.html)集成。

## 资源
<a name="resources"></a>

 **相关最佳实践：**
+  [可用性定义](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html) 
+  [REL11-BP06 当事件影响可用性时发送通知](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_notifications_sent_system.html) 

 **相关文档：**
+  [使用 Amazon CloudWatch Synthetics 创建用户金丝雀](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 
+  [为实例启用或禁用详细监控](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html) 
+  [增强监控](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.html) 
+  [Monitoring Your Auto Scaling Groups and Instances Using Amazon CloudWatch](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-monitoring.html) 
+  [发布自定义指标](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
+  [使用 Amazon CloudWatch 警报](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [使用 CloudWatch 控制面板](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
+  [使用跨区域跨账户的 CloudWatch 控制面板](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_xaxr_dashboard.html) 
+  [使用跨区域跨账户的 X-Ray 跟踪](https://aws.amazon.com/xray/faqs/) 
+  [Understanding availability](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/understanding-availability.html) 

 **相关视频：**
+  [Mitigating gray failures](https://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/gray-failures.html) 

 **相关示例：**
+  [One Observability Workshop: Explore X-Ray](https://catalog.workshops.aws/observability/en-US/aws-native/xray/explore-xray) 

 **相关工具：**
+  [CloudWatch](https://aws.amazon.com/cloudwatch/) 
+  [CloudWatch X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/security-logging-monitoring.html)