# OPS04-BP04 实施依赖项遥测
<a name="ops_observability_dependency_telemetry"></a>

 要想监控工作负载所依赖的外部服务和组件的运行状况及性能，依赖项遥测必不可少。依赖项遥测提供有关与 DNS、数据库或第三方 API 等依赖项相关的可访问性、超时及其他关键事件的宝贵洞察。对应用程序进行检测，发布有关这些依赖项的指标、日志和跟踪数据时，能更清楚地了解可能影响工作负载的潜在瓶颈、性能问题或故障。

 **期望结果：**确保工作负载所依赖的依赖项按预期执行，让您能够主动解决问题并确保最佳的工作负载性能。

 **常见反模式：**
+  **忽略外部依赖项：**仅关注内部应用程序指标，而忽略与外部依赖项相关的指标。
+  **缺乏主动监控：**等待问题出现，而不是持续监控依赖项运行状况和性能。
+  **孤立监控：**使用多种不同的监控工具，这可能会导致依赖项运行状况视图支离破碎且不一致。

 **建立此最佳实践的好处：**
+  **提高工作负载可靠性：**通过确保外部依赖项始终可用且性能出色来实现。
+  **更快地检测和解决问题：**在依赖项问题影响工作负载之前，主动发现和解决这些问题。
+  **全面视图：**全面了解影响工作负载运行状况的内部和外部组件。
+  **增强工作负载可扩展性：**通过了解外部依赖项的可扩展性限制和性能特征来实现。

 **在未建立这种最佳实践的情况下暴露的风险等级：**高 

## 实施指导
<a name="implementation-guidance"></a>

 从确定工作负载所依赖的服务、基础设施和流程开始，实施依赖项遥测。量化这些依赖项按预期运行时的良好状况，然后确定将需要哪些数据来衡量这些状况。利用这些信息，可以创建控制面板和警报，为运营团队提供有关这些依赖项状态的洞察。当依赖项无法按需交付时，使用 AWS 工具来发现和量化影响。不断重新审视策略，考虑优先事项、目标和所获洞察的变化。

### 实施步骤
<a name="implementation-steps"></a>

 要有效地实现依赖项遥测，请执行以下操作：

1.  **确定外部依赖项：**与利益相关方合作，查明工作负载所依赖的外部依赖项。外部依赖项可包括外部数据库、第三方 API、通往其他环境的网络连接路由以及 DNS 服务等内容。实现有效的依赖项遥测的第一步是全面了解这些依赖项是什么。

1.  **制定监控策略：**一旦清楚地了解了外部依赖项，就可以针对它们构建一个量身定制的监控策略。这包括了解每个依赖项的重要程度、其预期行为以及任何相关的服务水平协议或目标（SLA 或 SLT）。设置主动警报，在出现状态变化或性能偏差时发出通知。

1.  **使用[网络监控](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Network-Monitoring-Sections.html)：**使用[网络检测仪](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-InternetMonitor.html)和[网络监视器](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/what-is-network-monitor.html)，全面了解全球互联网和网络状况。这些工具有助于了解并应对影响外部依赖项的中断、破坏或性能下降。

1.  **随时了解 [AWS Health](https://aws.amazon.com/premiumsupport/technology/aws-health/) 的最新信息：**AWS Health 是有关 AWS 云资源运行状况的权威信息来源。使用 AWS Health 可视化并接收有关任何当前服务事件和即将发生的更改（例如计划的生命周期事件）的通知，以便您可以采取措施来减轻影响。

   1.  通过 [AWS 用户通知服务](https://docs.aws.amazon.com/notifications/latest/userguide/what-is-service.html) 创建要发送到电子邮件和聊天渠道且[契合目标的 AWS Health 事件通知](https://docs.aws.amazon.com/health/latest/ug/user-notifications.html)，并通过 Amazon EventBridge 或 [AWS Health API](https://docs.aws.amazon.com/health/latest/APIReference/Welcome.html) 以编程方式与[监控和警报工具](https://docs.aws.amazon.com/health/latest/ug/cloudwatch-events-health.html)集成。

   1.  通过与您可能已经通过 Amazon EventBridge 或 AWS Health API 使用的变更管理或 ITSM 工具（如 [Jira](https://docs.aws.amazon.com/smc/latest/ag/cloud-sys-health.html) 或 [ServiceNow](https://docs.aws.amazon.com/smc/latest/ag/sn-aws-health.html)）集成，规划和跟踪需要采取行动的运行状况事件的进度。

   1.  如果您使用 AWS Organizations，请启用 [organization view for AWS Health](https://docs.aws.amazon.com/health/latest/ug/aggregate-events.html) 以跨账户聚合 AWS Health 事件。

1.  **使用 [AWS X-Ray](https://aws.amazon.com/xray/) 检测应用程序：**AWS X-Ray 让您能够深入了解应用程序及其底层依赖项的运行情况。通过从头到尾跟踪请求，可以找出应用程序所依赖的外部服务或组件中的瓶颈或故障。

1.  **使用 [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/)：**这项服务由机器学习驱动，可发现操作问题，预测何时可能出现严重问题，并建议可采取的具体行动。其可贵之处在于，可以让您深入了解依赖项，并确保这些依赖项不会成为操作问题的根源。

1.  **定期监控：**持续监控与外部依赖项相关的指标和日志。针对意外行为或性能下降设置警报。

1.  **更改后进行验证：**每当任何外部依赖项有更新或更改时，都应验证其性能，并检查它们是否符合应用程序的要求。

 **实施计划的工作量级别：**中 

## 资源
<a name="resources"></a>

 **相关最佳实践：**
+  [OPS04-BP01 定义工作负载 KPI](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_identify_kpis.html) 
+  [OPS04-BP02 实施应用程序遥测](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_application_telemetry.html) 
+  [OPS04-BP03 实施用户活动遥测](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_customer_telemetry.html) 
+  [OPS04-BP05 实施事务可追溯性](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_dist_trace.html) 
+  [OP08-BP04 创建可操作的警报](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_workload_observability_create_alerts.html) 

 **相关文档：**
+  《[Amazon Personal Health Dashboard User Guide](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html)》 
+  《[AWS Internet Monitor User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-InternetMonitor.html)》 
+  [AWS X-Ray 开发人员指南](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) 
+  《[AWS DevOps Guru User Guide](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html)》 

 **相关视频：**
+  [Visibility into how internet issues impact app performance](https://www.youtube.com/watch?v=Kuc_SG_aBgQ) 
+  [Introduction to Amazon DevOps Guru](https://www.youtube.com/watch?v=2uA8q-8mTZY) 
+  [Manage resource lifecycle events at scale with AWS Health](https://www.youtube.com/watch?v=VoLLNL5j9NA) 

 **相关示例：**
+  [AWS Health Aware](https://github.com/aws-samples/aws-health-aware/) 
+  [Using Tag-Based Filtering to Manage AWS Health Monitoring and Alerting at Scale](https://aws.amazon.com/blogs/mt/using-tag-based-filtering-to-manage-health-monitoring-and-alerting-at-scale/)