# REL11-BP01 Monitor all components of the workload to detect failures
<a name="rel_withstand_component_failures_monitoring_health"></a>

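 Continuously monitor the health of your workload so that you and your automated systems are aware of failures or degradations as soon as they occur. Monitor for key performance indicators (KPIs) based on business value.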

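 All recovery and healing mechanisms must start with the ability to detect problems quickly. Technical failures should be detected first so that they can be resolved. However, availability is based on the ability of your workload to deliver business value, so the key performance indicators (KPIs) that measure this need to be part of your detection and remediation strategy.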

 **Desired outcome:** Essential components of a workload are monitored independently to detect and alert on when and where failures happen.

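 **Common anti-patterns:**
+  No alarms have been configured, so outages occur without notification.
+  Alarms exist, but at thresholds that don't provide adequate time to react.
+  Metrics are not collected often enough to meet the recovery time objective (RTO).
+  Only the customer-facing interfaces of the workload are actively monitored.
+  Only technical metrics are collected, with no business function metrics.
+  No metrics measure the user experience of the workload.
+  Too many monitors are created.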

 **Benefits of establishing this best practice:** Appropriate monitoring at all layers lets you reduce recovery time by reducing time to detection.

 **Level of risk exposed if this best practice is not established:** High

## Implementation guidance
<a name="implementation-guidance"></a>

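 Identify all workloads that will be reviewed for monitoring. Once you have identified all components of the workload that need to be monitored, determine the monitoring interval. The monitoring interval directly affects how quickly recovery can begin, because recovery cannot start until the failure has been detected. The mean time to detection (MTTD) is the amount of time between a failure occurring and when repair operations begin. The list of services should be extensive and complete.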
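 As a rough illustration of how the monitoring interval bounds detection time, the sketch below estimates a worst-case detection delay for an alarm that fires after a number of consecutive breaching collection periods. The model is a simplifying assumption, not an AWS formula: it allows one extra period for a failure that begins partway through a collection window, and it ignores metric delivery and alarm evaluation latency. The function names and the 300-second detection budget are hypothetical.

```python
def worst_case_detection_seconds(period_seconds: int, evaluation_periods: int) -> int:
    """Estimate the longest time a failure can go undetected.

    Assumed model: the alarm needs `evaluation_periods` consecutive breaching
    datapoints, and the failure may start partway through a collection window
    whose datapoint does not yet breach, so up to one extra period can pass.
    """
    if period_seconds <= 0 or evaluation_periods <= 0:
        raise ValueError("period_seconds and evaluation_periods must be positive")
    return period_seconds * (evaluation_periods + 1)


def fits_detection_budget(period_seconds: int, evaluation_periods: int,
                          budget_seconds: int) -> bool:
    """Return True when the estimated delay stays within the detection budget."""
    estimate = worst_case_detection_seconds(period_seconds, evaluation_periods)
    return estimate <= budget_seconds


# Hypothetical budget: the share of the RTO reserved for detection.
DETECTION_BUDGET_SECONDS = 300

# Compare five-minute and one-minute collection, three evaluation periods each.
five_minute = worst_case_detection_seconds(300, 3)
one_minute = worst_case_detection_seconds(60, 3)
print(five_minute, fits_detection_budget(300, 3, DETECTION_BUDGET_SECONDS))
print(one_minute, fits_detection_budget(60, 3, DETECTION_BUDGET_SECONDS))
```

 Under these assumptions, five-minute collection with three evaluation periods gives an estimate of 1,200 seconds, which exceeds the budget, while one-minute collection gives 240 seconds, which fits.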

 Monitoring must cover all layers of the application stack, including application, platform, infrastructure, and network.

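 Your monitoring strategy should consider the impact of *gray failures*. For more detail on gray failures, see [Gray failures](https://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/gray-failures.html) in the Advanced Multi-AZ Resilience Patterns whitepaper.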

### Implementation steps
<a name="implementation-steps"></a>
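+  Your monitoring interval depends on how quickly you must recover. Your recovery time is driven by the time it takes to detect and repair a failure, so determine the collection frequency by accounting for this time and your recovery time objective (RTO).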
+  Configure detailed monitoring for components and managed services.
  +  Determine whether detailed monitoring is necessary for [EC2 instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html) and [Auto Scaling](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-monitoring.html). Detailed monitoring provides metrics at one-minute intervals, and default monitoring provides metrics at five-minute intervals.
  +  Determine whether [enhanced monitoring](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Monitoring.html) is necessary for RDS. Enhanced monitoring uses an agent on RDS instances to get useful information about different processes or threads.
  +  Determine the monitoring requirements of critical serverless components for [Lambda](https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html), [API Gateway](https://docs.aws.amazon.com/apigateway/latest/developerguide/monitoring_automated_manual.html), [Amazon EKS](https://docs.aws.amazon.com/eks/latest/userguide/eks-observe.html), [Amazon ECS](https://catalog.workshops.aws/observability/en-US/aws-managed-oss/amp/ecs), and all types of [load balancers](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-monitoring.html).
  +  Determine the monitoring requirements of storage components for [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/monitoring-overview.html), [Amazon FSx](https://docs.aws.amazon.com/fsx/latest/WindowsGuide/monitoring_overview.html), [Amazon EFS](https://docs.aws.amazon.com/efs/latest/ug/monitoring_overview.html), and [Amazon EBS](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-volume-status.html).
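+  Create [custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) to measure business key performance indicators (KPIs). Workloads implement key business functions, which should be used as KPIs that help identify when an indirect problem happens.
+  Monitor the user experience for failures using user canaries. [Synthetic transaction testing](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) (also known as canary testing, but not to be confused with canary deployments) that can run and simulate customer behavior is among the most important testing processes. Run these tests constantly against your workload endpoints from diverse remote locations.
+  Create [custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) that track the user's experience. If you can instrument the experience of the customer, you can determine when the consumer experience degrades.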
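+  [Set alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) to detect when any part of your workload is not working properly and to indicate when to automatically scale resources. Alarms can be displayed on dashboards, send alerts through Amazon SNS or email, and work with Auto Scaling to scale workload resources in or out.
+  Create [dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) to visualize your metrics. Dashboards let you see trends, outliers, and other indicators of potential problems, or point to problems you may want to investigate.
+  Create [distributed tracing monitoring](https://aws.amazon.com/xray/faqs/) for your services. With distributed monitoring, you can understand how your application and its underlying services are performing, and identify and troubleshoot the root cause of performance issues and errors.
+  Create monitoring systems (using [CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_xaxr_dashboard.html) or [X-Ray](https://aws.amazon.com/xray/faqs/)) dashboards and data collection in a separate Region and account.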
+  Stay informed about service degradations with [AWS Health](https://aws.amazon.com/premiumsupport/technology/aws-health/). [Create purpose-fit AWS Health event notifications](https://docs.aws.amazon.com/health/latest/ug/user-notifications.html) to email and chat channels through [AWS User Notifications](https://docs.aws.amazon.com/notifications/latest/userguide/what-is-service.html), and integrate programmatically with [your monitoring and alerting tools through Amazon EventBridge](https://docs.aws.amazon.com/health/latest/ug/cloudwatch-events-health.html).
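
 The custom metric and alarm steps above can be sketched as request parameters for the CloudWatch `PutMetricData` and `PutMetricAlarm` operations. This is a minimal sketch, not a recommended configuration: it only builds plain dictionaries so that it runs without AWS credentials, and sending them with boto3 is shown in a comment. The namespace, metric name, threshold, account ID, and SNS topic ARN are hypothetical, and treating missing data as breaching is an assumption that suits a KPI expected to report continuously.

```python
from datetime import datetime, timezone


def build_kpi_datum(namespace, metric_name, value, timestamp=None):
    """Build keyword arguments for CloudWatch PutMetricData (one datapoint)."""
    return {
        "Namespace": namespace,
        "MetricData": [{
            "MetricName": metric_name,
            "Timestamp": timestamp or datetime.now(timezone.utc),
            "Value": float(value),
            "Unit": "Count",
        }],
    }


def build_low_kpi_alarm(namespace, metric_name, threshold, topic_arn,
                        period_seconds=60, evaluation_periods=3):
    """Build keyword arguments for CloudWatch PutMetricAlarm.

    The alarm fires when the per-period sum of the KPI stays below
    `threshold` for `evaluation_periods` consecutive periods.
    """
    return {
        "AlarmName": f"{metric_name}-below-{threshold}",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": "LessThanThreshold",
        # Assumption: a KPI that stops reporting is treated as a failure.
        "TreatMissingData": "breaching",
        "AlarmActions": [topic_arn],
    }


# Hypothetical business KPI: orders placed per minute.
datum = build_kpi_datum("ExampleShop/Business", "OrdersPlaced", 42)
alarm = build_low_kpi_alarm(
    "ExampleShop/Business",
    "OrdersPlaced",
    threshold=10,
    topic_arn="arn:aws:sns:us-east-1:123456789012:example-oncall",
)

# With boto3 installed and credentials configured, these could be sent as:
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   cloudwatch.put_metric_data(**datum)
#   cloudwatch.put_metric_alarm(**alarm)
```

 Alarming on a business KPI that drops, and not only on technical metrics that rise, is one way to surface the indirect problems described in the steps above.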

## Resources
<a name="resources"></a>

 **Related best practices:**
+  [Availability Definition](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html) 
+  [REL11-BP06 Send notifications when events impact availability](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_notifications_sent_system.html) 

 **Related documents:**
+  [Amazon CloudWatch Synthetics enables you to create user canaries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 
+  [Enable or Disable Detailed Monitoring for Your Instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html) 
+  [Enhanced Monitoring](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.html) 
+  [Monitoring Your Auto Scaling Groups and Instances Using Amazon CloudWatch](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-monitoring.html) 
+  [Publishing Custom Metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) 
+  [Using Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [Using CloudWatch Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html) 
+  [Using Cross Region Cross Account CloudWatch Dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_xaxr_dashboard.html) 
+  [Using Cross Region Cross Account X-Ray Tracing](https://aws.amazon.com/xray/faqs/) 
+  [Understanding availability](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/understanding-availability.html) 

 **Related videos:**
+  [Mitigating gray failures](https://docs.aws.amazon.com/whitepapers/latest/advanced-multi-az-resilience-patterns/gray-failures.html) 

 **Related examples:**
+  [One Observability Workshop: Explore X-Ray](https://catalog.workshops.aws/observability/en-US/aws-native/xray/explore-xray) 

 **Related tools:**
+  [CloudWatch](https://aws.amazon.com/cloudwatch/)
+  [CloudWatch X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/security-logging-monitoring.html)