# REL11-BP03 自动修复所有层
<a name="rel_withstand_component_failures_auto_healing_system"></a>

 在检测到故障时，使用自动化功能执行修复操作。降级可能会通过内部服务机制自动修复，也可能需要通过补救措施重启或移除资源。

 对于自我管理型应用程序和跨区域修复，可以从[现有最佳实践](https://aws.amazon.com/blogs/architecture/understand-resiliency-patterns-and-trade-offs-to-architect-efficiently-in-the-cloud/)中获取恢复设计和自动修复流程。

 重启或移除资源是修复故障的重要方法。最佳实践是尽可能使服务为无状态。这可以防止重启资源时数据丢失或可用性受损。在云中，作为重启的一部分，您可以（而且在一般情况下也应该）替换完整的资源（例如计算实例或无服务器函数）。重启本身是从故障恢复的简单而可靠的方法。工作负载中会发生很多不同类型的故障。故障可能发生在硬件、软件、通信和操作上。

 重启或重试也适用于网络请求。向网络超时以及依赖项返回错误的依赖性故障应用相同的恢复方法。这两个事件对系统具有类似的影响，可应用类似的采用指数回退和抖动的有限重试策略，而不是尝试将各个事件当作特例进行处理。重启功能是面向恢复的计算和高可用性集群架构的特色恢复机制。

 **期望结果：**执行自动操作来修复检测到的故障。

 **常见反模式：**
+  预置资源，而不进行自动扩缩。
+  在实例或容器中单独部署应用程序。
+  在不使用自动恢复的情况下，部署无法部署到多个位置的应用程序。
+  手动修复自动扩缩和自动恢复无法修复的应用程序。
+  缺乏对数据库进行失效转移的自动化功能。
+  缺乏将流量重新路由到新端点的自动化方法。
+  缺乏存储复制功能。

 **建立此最佳实践的好处：**自动修复可以缩短平均恢复时间，并提高可用性。

 **在未建立这种最佳实践的情况下暴露的风险等级：**高 

## 实施指导
<a name="implementation-guidance"></a>

 Amazon EKS 或其他 Kubernetes 服务的设计应包括最小和最大副本集或有状态集，以及最小集群和节点组大小。这些机制提供了最低数量的持续可用处理资源，同时使用 Kubernetes 控制面板自动修复任何故障。

 使用计算集群通过负载均衡器访问的设计模式应利用自动扩缩组。弹性负载均衡（ELB）自动在一个或多个可用区（AZ）中的多个目标和虚拟设备之间分配传入的应用程序流量。

 对于不使用负载平衡的基于集群计算的设计，其规模至少应能承受一个节点的中断。这将允许服务在恢复新节点时，使自身以可能降低的容量维持运行状态。示例服务有 Mongo、DynamoDB Accelerator、Amazon Redshift、Amazon EMR、Cassandra、Kafka、MSK-EC2、Couchbase、ELK 和 Amazon OpenSearch Service。其中许多服务都可以设计带有额外的自动修复功能。某些集群技术必须在节点丢失时生成警报，从而触发自动或手动工作流程来重新创建新节点。该工作流程可以使用 AWS Systems Manager 自动执行，从而快速修复问题。

 Amazon EventBridge 可用于监控和筛选事件，例如 CloudWatch 警报或其他 AWS 服务中的状态更改。然后，其可根据事件信息调用 AWS Lambda、Systems Manager Automation 或其他目标，对工作负载运行自定义修复逻辑。可以配置 Amazon EC2 Auto Scaling 来检查 EC2 实例的运行状况。若实例处于正在运行以外的任何状态，或系统状态受损，Amazon EC2 Auto Scaling 会认为实例的运行状况不佳，继而启动替换实例。针对大规模替换（例如整个可用区丢失），静态稳定性更适合用来实现高可用性。

### 实施步骤
<a name="implementation-steps"></a>
+  使用自动扩缩组在工作负载中部署层。[自动扩缩](https://docs.aws.amazon.com/autoscaling/plans/userguide/how-it-works.html)可以对无状态应用程序执行自我修复，以及添加或移除容量。
+  对于前面提到的计算实例，可使用[负载均衡](https://docs.aws.amazon.com/autoscaling/ec2/userguide/autoscaling-load-balancer.html)，然后选择合适的负载均衡器类型。
+  考虑 Amazon RDS 修复。对于备用实例，请配置[自动失效转移](https://repost.aws/questions/QU4DYhqh2yQGGmjE_x0ylBYg/what-happens-after-failover-in-rds)到备用实例。针对 Amazon RDS 只读副本，需要自动化工作流程，使只读副本成为主副本。
+  [在 EC2 实例上实施自动恢复](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html)，这些实例中部署的应用程序无法部署到多个位置，但在故障后允许重新启动。当无法将应用程序部署到多个位置时，可以使用自动恢复来替换发生故障的硬件并重新启动实例。实例元数据和关联的 IP 地址，以及 [EBS 卷](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html)和 [Amazon Elastic File System](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEFS.html) 或 [适用于 Lustre 的文件系统](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html)和[适用于 Windows 的文件系统](https://docs.aws.amazon.com/fsx/latest/WindowsGuide/what-is.html)的挂载点，都会得到保留。使用 [AWS OpsWorks](https://docs.aws.amazon.com/opsworks/latest/userguide/workinginstances-autohealing.html) 时，您可以在层级别配置 EC2 实例的自动修复。
+  当无法使用自动扩缩或自动恢复，或者自动恢复出故障时，可以使用 [AWS Step Functions](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html) 和 [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) 实施自动恢复。当无法使用自动扩缩，并且无法使用自动恢复或自动恢复失败时，可以使用 AWS Step Functions 和 AWS Lambda 进行自动修复。
+  [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html) 可用于监控和筛选事件，例如 [CloudWatch 警报](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html)或其他 AWS 服务中的状态更改。根据事件信息，该服务可以调用 AWS Lambda（或其他目标），在工作负载上运行自定义修复逻辑。

## 资源
<a name="resources"></a>

 **相关最佳实践：**
+  [可用性定义](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html) 
+  [REL11-BP01 监控工作负载的所有组件以检测故障](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_notifications_sent_system.html) 

 **相关文档：**
+  [AWS Auto Scaling 的工作原理](https://docs.aws.amazon.com/autoscaling/plans/userguide/how-it-works.html) 
+  [Amazon EC2 Automatic Recovery](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html) 
+  [Amazon Elastic Block Store （Amazon EBS)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html) 
+  [Amazon Elastic File System (Amazon EFS)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEFS.html) 
+  [What is Amazon FSx for Lustre?](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html)
+  [What is Amazon FSx for Windows File Server?](https://docs.aws.amazon.com/fsx/latest/WindowsGuide/what-is.html)
+  [AWS OpsWorks: Using Auto Healing to Replace Failed Instances](https://docs.aws.amazon.com/opsworks/latest/userguide/workinginstances-autohealing.html) 
+  [什么是 AWS Step Functions？](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html) 
+  [什么是 AWS Lambda？](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) 
+  [什么是 Amazon EventBridge？](https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html) 
+  [使用 Amazon CloudWatch 警报](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) 
+  [Amazon RDS 失效转移](https://d1.awsstatic.com/rdsImages/IG1_RDS1_AvailabilityDurability_Final.pdf) 
+  [SSM – Systems Manager Automation](https://docs.aws.amazon.com/resilience-hub/latest/userguide/integrate-ssm.html) 
+  [韧性架构最佳实践](https://aws.amazon.com/blogs/architecture/understand-resiliency-patterns-and-trade-offs-to-architect-efficiently-in-the-cloud/) 

 **相关视频：**
+  [Automatically Provision and Scale OpenSearch Service](https://www.youtube.com/watch?v=GPQKetORzmE) 
+  [Amazon RDS Failover Automatically](https://www.youtube.com/watch?v=Mu7fgHOzOn0) 

 **相关示例：**
+  [Amazon RDS 失效转移讲习会](https://catalog.workshops.aws/resilient-apps/en-US/rds-multi-availability-zone/failover-db-instance) 

 **相关工具：**
+  [CloudWatch](https://aws.amazon.com/cloudwatch/) 
+  [CloudWatch X-Ray](https://docs.aws.amazon.com/xray/latest/devguide/security-logging-monitoring.html)