

# How zonal autoshift and practice runs work
<a name="arc-zonal-autoshift.how-it-works"></a>

The zonal autoshift capability in Amazon Application Recovery Controller (ARC) allows AWS to shift traffic for a resource away from an Availability Zone, on your behalf, when AWS determines that there's an impairment that could potentially affect customers in the Availability Zone. Zonal autoshift is designed for a resource that is pre-scaled in all Availability Zones in an AWS Region, so that an application can operate normally with the loss of one Availability Zone.

With zonal autoshift, you are required to configure practice runs, where ARC regularly shifts traffic for the resource away from one Availability Zone. ARC schedules practice runs about weekly for each resource that has a practice run configuration associated with it. Practice runs for each resource are scheduled independently.

For each practice run, ARC records an outcome. If a practice run is interrupted by a blocking condition, the practice run outcome is not marked as successful. For more information about practice run outcomes, see [Outcomes for practice runs](arc-zonal-autoshift.considerations.md#ZAConsiderationsPracticeRunOutcomes). 

You can configure Amazon EventBridge notifications to send you information about autoshifts and practice runs. For more information, see [Using zonal autoshift with Amazon EventBridge](eventbridge-zonal-autoshift.md).

**Topics**
+ [About zonal autoshift](arc-zonal-autoshift.how-it-works.about.md)
+ [When AWS starts and stops autoshifts](arc-zonal-autoshift.how-it-works.start-stop-auto.md)
+ [When ARC schedules, starts, and ends practice runs](arc-zonal-autoshift.how-it-works.scheduled-practice-runs.md)
+ [Capacity checks for practice runs](arc-zonal-autoshift.how-it-works.capacity-check.md)
+ [Notification for practice runs and autoshifts](arc-zonal-autoshift.how-it-works.notifications.md)
+ [Precedence for zonal shifts](arc-zonal-autoshift.how-it-works.precedence.md)
+ [Stopping an active autoshift or practice run](arc-zonal-autoshift.how-it-works.stop-shift.md)
+ [How traffic is shifted away](arc-zonal-autoshift.how-it-works.how-traffic-shifted.md)
+ [Alarms for practice runs](arc-zonal-autoshift.how-it-works.alarms.md)
+ [Blocked windows and allowed windows (in UTC)](arc-zonal-autoshift.how-it-works.blocked-windows.md)

# About zonal autoshift
<a name="arc-zonal-autoshift.how-it-works.about"></a>

Zonal autoshift is a capability where AWS shifts application resource traffic away from an Availability Zone, on your behalf. AWS starts an autoshift when internal telemetry indicates that there is an Availability Zone impairment that could potentially impact customers. The internal telemetry incorporates metrics from several sources, including the AWS network, and the Amazon EC2 and Elastic Load Balancing services. 

You must manually enable zonal autoshift for supported AWS resources. 

When you deploy and run AWS applications on load balancers in multiple (typically three) AZs in a Region, and you pre-scale to support static stability, AWS can quickly recover customer applications in an AZ by shifting traffic away with an autoshift. By shifting away resource traffic to other AZs in the Region, AWS can reduce the duration and severity of potential impact caused by power outages, hardware or software issues in an AZ, or other impairments.

The resources supported by ARC provide integrations that mark the specified AZ as unhealthy, which results in traffic being shifted away from the impaired AZ. 

When you enable zonal autoshift for a resource, you must also configure a practice run for the resource. AWS performs practice runs about weekly, for 30 minutes, to help you make sure that you have enough capacity to run your application without one of the Availability Zones in the Region.

As with zonal shift, there are a few specific scenarios where zonal autoshift does not shift traffic away from the AZ. For example, if the load balancer target groups in the AZs don't have any instances, or if all of the instances are unhealthy, then the load balancer is in a fail open state and you can't shift away one of the AZs.

To learn more about zonal autoshift, see [Zonal autoshift in ARC](arc-zonal-autoshift.md).

# When AWS starts and stops autoshifts
<a name="arc-zonal-autoshift.how-it-works.start-stop-auto"></a>

When you enable zonal autoshift for a resource, you authorize AWS to shift away resource traffic for an application from an Availability Zone during events, on your behalf, to help reduce time to recovery.

To achieve this, zonal autoshift uses AWS telemetry to detect, as early as possible, that there is an Availability Zone impairment that could potentially impact customers. When AWS starts an autoshift, traffic to configured resources immediately starts shifting away from the impaired Availability Zone.

Zonal autoshift is a capability designed for customers who have pre-scaled their application resources for all Availability Zones in an AWS Region. You should not rely on scaling on demand when an autoshift or practice run starts.

AWS ends an autoshift when it determines that the Availability Zone has recovered.

# When ARC schedules, starts, and ends practice runs
<a name="arc-zonal-autoshift.how-it-works.scheduled-practice-runs"></a>

ARC schedules a practice run for a resource weekly, for about 30 minutes. ARC schedules, starts, and manages practice runs for each resource independently. ARC does not batch together practice runs for resources in the same account. You can also start on-demand practice runs yourself, to help verify that your setup is safe for a zonal autoshift event.

When a practice run continues for the expected duration, without interruption, it is marked with an outcome of `SUCCEEDED`. There are several other possible outcomes: `FAILED`, `INTERRUPTED`, `CAPACITY_CHECK_FAILED`, and `PENDING`. Outcome values and descriptions are included in the [Outcomes for practice runs](arc-zonal-autoshift.considerations.md#ZAConsiderationsPracticeRunOutcomes) section.

There are some scenarios when ARC interrupts a practice run and ends it. For example, if an autoshift starts during a practice run, ARC interrupts the practice run and ends it. As another example, say that the resource has an adverse response to a practice run and causes an alarm that you've specified to monitor the practice run to go into an `ALARM` state. In this scenario, ARC also interrupts the practice run and ends it.

In addition, there are several scenarios when ARC does not start a scheduled practice run for a resource.

In response to interrupted and blocked practice runs for a resource, ARC does the following:
+ If a practice run for a resource is interrupted while it's in progress, ARC considers the weekly practice run to be over, and schedules a new practice run for the resource for the next week. The weekly practice run outcome is `INTERRUPTED` in this scenario, not `FAILED`. The practice run outcome is set to `FAILED` only when the outcome alarm that monitors the practice run goes into an `ALARM` state during the practice run. 
+ If there is a blocking constraint when a practice run for a resource is scheduled to be started, ARC does not start the practice run. ARC continues regular monitoring, to determine if there are still one or more blocking constraints. When there aren't any blocking constraints, ARC starts the practice run for the resource.

The following are examples of blocking constraints that stop ARC from starting, or continuing, a practice run for a resource:
+ ARC does not start or continue practice runs when there is an AWS Fault Injection Service experiment in progress. If an AWS FIS event is active when ARC has scheduled a practice run to start, ARC does not start the practice run. ARC monitors throughout practice runs for blocking constraints, including an AWS FIS event. If an AWS FIS event starts while a practice run is active, ARC ends the practice run and doesn't attempt to start another one until the next regularly scheduled practice run for the resource.
+ If there is a current AWS event in a Region, ARC does not start practice runs for resources, and ends active practice runs, in the Region.

When the practice run finishes without being interrupted, ARC schedules the next practice run in a week, as usual. If a practice run isn't started because of a blocking constraint, such as an AWS FIS experiment or a blocked time window that you've specified, ARC continues to attempt to start a practice run until the practice run can be started.

# Capacity checks for practice runs
<a name="arc-zonal-autoshift.how-it-works.capacity-check"></a>

When a practice run starts, to temporarily move traffic away from an Availability Zone, ARC runs a check to verify that you have enough capacity in other Availability Zones to safely move traffic away from the AZ. If there isn't sufficient capacity available, the traffic shift for the practice run is not started and the practice run ends. 

In addition, ARC runs a capacity check for load balancer resources when a zonal autoshift completes, before ARC ends the traffic shift started by the autoshift. If the capacity check fails when the autoshift ends, traffic is not shifted back to the Availability Zone that it was moved away from.

Checks for balanced capacity are only completed for load balancers and Auto Scaling groups.

For a load balancer resource, capacity checks validate that healthy hosts associated with the load balancer are distributed across Availability Zones. Specifically, capacity checks make sure that the number of healthy hosts across all Availability Zones where the resource is registered are balanced. For capacity checks, balanced means that the healthy capacity for each Availability Zone is in parity with the other zones, within a small variance.

Note that capacity checks are not applied to load balancers with target groups of type Lambda or Application Load Balancer, because those targets are not configured zonally.

Capacity checks are also completed for Auto Scaling groups. For an Auto Scaling group, capacity checks validate that the total healthy zonal capacity of the group (that is, the total number of healthy hosts across all the Availability Zones) meets the desired capacity set for that Auto Scaling group. 
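The two kinds of capacity check described above can be sketched as follows. This is only an illustration of the idea, not how ARC implements it: the function names, the input shape (a mapping of Availability Zone to healthy host count), and especially the `tolerance` value are assumptions, because the size of the "small variance" that ARC allows isn't stated here.

```python
def is_balanced(healthy_by_az, tolerance=0.2):
    """Load balancer style check: healthy hosts are in parity across AZs.

    `tolerance` is a hypothetical value standing in for the "small variance"
    mentioned in the text; the real threshold is not documented here.
    """
    counts = list(healthy_by_az.values())
    if not counts or max(counts) == 0:
        return False  # no healthy capacity anywhere
    largest = max(counts)
    # Every AZ must be within `tolerance` (as a fraction) of the largest AZ.
    return all(largest - count <= tolerance * largest for count in counts)


def asg_meets_desired(healthy_by_az, desired_capacity):
    """Auto Scaling group style check: total healthy hosts meet desired capacity."""
    return sum(healthy_by_az.values()) >= desired_capacity


# Made-up host counts for three AZs.
print(is_balanced({"use1-az1": 10, "use1-az2": 9, "use1-az3": 10}))
print(is_balanced({"use1-az1": 10, "use1-az2": 4, "use1-az3": 10}))
print(asg_meets_desired({"use1-az1": 3, "use1-az2": 3, "use1-az3": 2}, 9))
```

A check like this can be useful in your own pre-deployment tests, so that you notice an imbalance before a practice run reports it.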

**When a capacity check fails**

When a capacity check finds that available capacity isn't balanced for a resource, the outcome for the practice run is `CAPACITY_CHECK_FAILED`. To learn more about why a capacity check has failed, see the comment field for the `ZonalShiftSummary`. To find the comment field for your practice run zonal shift, do the following:

1. Using the AWS CLI, list the zonal shifts for the resource that you specified in the practice run using the [ListZonalShifts](https://docs.aws.amazon.com/arc-zonal-shift/latest/api/API_ListZonalShifts.html) API operation. 

   For example, to return the zonal shifts, you can run a command similar to the following:

   ```
   aws arc-zonal-shift list-zonal-shifts \
       --resource-identifier="arn:aws:elasticloadbalancing:Region:111122223333:ExampleALB123456890"
   ```

1. Review the array of `ZonalShiftSummary` objects returned to find the zonal shift for the practice run that failed due to capacity checks.

1. For the applicable zonal shift, review the information in the `Comment` field.
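If you retrieve the list programmatically instead, the same review can be sketched in Python. The following example assumes that the response is a dictionary with an `items` list whose entries carry `zonalShiftId`, `practiceRunOutcome`, and `comment` keys; treat those key names as assumptions and confirm them against the `ListZonalShifts` API reference. The sample data is made up so that the sketch runs without AWS credentials.

```python
def capacity_check_comments(response):
    """Return (zonal shift ID, comment) pairs for practice runs that failed capacity checks."""
    failed = []
    for shift in response.get("items", []):
        if shift.get("practiceRunOutcome") == "CAPACITY_CHECK_FAILED":
            failed.append((shift.get("zonalShiftId"), shift.get("comment", "")))
    return failed


# Hypothetical sample shaped like a ListZonalShifts response; all values are invented.
sample_response = {
    "items": [
        {
            "zonalShiftId": "example-shift-1",
            "shiftType": "PRACTICE_RUN",
            "practiceRunOutcome": "CAPACITY_CHECK_FAILED",
            "comment": "example comment text",
        },
        {
            "zonalShiftId": "example-shift-2",
            "shiftType": "PRACTICE_RUN",
            "practiceRunOutcome": "SUCCEEDED",
            "comment": "another example comment",
        },
    ]
}

for shift_id, comment in capacity_check_comments(sample_response):
    print(shift_id, comment)
```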

# Notification for practice runs and autoshifts
<a name="arc-zonal-autoshift.how-it-works.notifications"></a>

You can choose to be notified about practice runs and autoshifts for your resource by setting up Amazon EventBridge notifications. You can set up EventBridge notifications even when you haven't enabled zonal autoshift for any resources, an option known as *autoshift observer notification*. With autoshift observer notification, you are notified about all autoshifts that ARC starts when an Availability Zone is potentially impaired. Note that you must configure this option in each AWS Region that you want to receive notifications about. 

To see the steps for enabling autoshift observer notification, see [Enabling or disabling autoshift observer notification](arc-zonal-autoshift.enable-autoshift-observer.md). To learn more about notification options and how to configure them in EventBridge, see [Using zonal autoshift with Amazon EventBridge](eventbridge-zonal-autoshift.md).

# Precedence for zonal shifts
<a name="arc-zonal-autoshift.how-it-works.precedence"></a>

There can be no more than one applied zonal shift for a resource at a given time. That is, only one practice run zonal shift, customer-initiated zonal shift, autoshift, or AWS FIS experiment can be in effect for the resource. When a second zonal shift is started, ARC follows a precedence to determine which zonal shift type is in effect for a resource. 

The general principle for precedence is that zonal shifts that you start as a customer take precedence over other shift types. However, be aware that a currently-running AWS-initiated practice run prevents you from starting an on-demand practice run.

To illustrate precedence in ARC, the following is how precedence works for example scenarios:


| Zonal shift type applied | Zonal shift type initiated | Result | 
| --- | --- | --- | 
| AWS FIS experiment | Practice run | The practice run will fail to start, as the AWS FIS experiment takes precedence.  | 
| AWS FIS experiment | Manual zonal shift | The AWS FIS experiment will be canceled, and the manual zonal shift will be applied.  | 
| AWS FIS experiment | Zonal autoshift | The AWS FIS experiment will be canceled, and the zonal autoshift will be applied.  | 
| AWS FIS experiment | AWS FIS experiment | The initiated AWS FIS experiment will fail to start because there is an existing experiment running that triggered the AWS FIS autoshift action. | 
| Practice run | Manual zonal shift | The practice run will be canceled and the outcome set to INTERRUPTED, and the zonal shift will be applied. | 
| Practice run | AWS FIS experiment | The practice run will be canceled and the outcome set to INTERRUPTED, and the AWS FIS experiment will be applied. | 
| Practice run | Zonal autoshift | The practice run will be canceled and the outcome set to INTERRUPTED, and the zonal autoshift will be applied. | 
| Manual zonal shift | Practice run | The practice run will fail to start. | 
| Manual zonal shift | AWS FIS experiment | The AWS FIS experiment will fail to start, or fail if it's already in progress. | 
| Manual zonal shift | Zonal autoshift | The zonal autoshift will be ACTIVE but not APPLIED on the resource. The manual zonal shift takes precedence. | 
| Zonal autoshift  | AWS FIS experiment | The AWS FIS experiment will fail to start, or will fail if it's in progress. | 
| Zonal autoshift  | Manual zonal shift | The zonal autoshift will be ACTIVE but not APPLIED on the resource. The manual zonal shift takes precedence. | 
| Zonal autoshift  | Practice run | The practice run will fail to start, as the zonal autoshift takes precedence. | 

The traffic shift that is currently in effect for the resource has an applied zonal shift status set to `APPLIED`. Only one shift is set to `APPLIED` at any time. Other shifts that are in progress are set to `NOT_APPLIED`, but remain with `ACTIVE` status.
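The scenarios in the preceding table can be encoded as a lookup, which might be handy in runbooks or tests that need to predict which shift ends up applied. This is a sketch that mirrors only the rows listed above; the constant and function names are illustrative, and combinations that the table doesn't list return `None` rather than a guess.

```python
FIS = "AWS FIS experiment"
PRACTICE = "Practice run"
MANUAL = "Manual zonal shift"
AUTOSHIFT = "Zonal autoshift"

# (shift type applied, shift type initiated) -> shift type applied afterward,
# transcribed row by row from the table above.
APPLIED_AFTER = {
    (FIS, PRACTICE): FIS,
    (FIS, MANUAL): MANUAL,
    (FIS, AUTOSHIFT): AUTOSHIFT,
    (FIS, FIS): FIS,
    (PRACTICE, MANUAL): MANUAL,
    (PRACTICE, FIS): FIS,
    (PRACTICE, AUTOSHIFT): AUTOSHIFT,
    (MANUAL, PRACTICE): MANUAL,
    (MANUAL, FIS): MANUAL,
    (MANUAL, AUTOSHIFT): MANUAL,
    (AUTOSHIFT, FIS): AUTOSHIFT,
    (AUTOSHIFT, MANUAL): MANUAL,
    (AUTOSHIFT, PRACTICE): AUTOSHIFT,
}


def applied_after(applied, initiated):
    """Return the shift type in effect after `initiated` starts, or None if unlisted."""
    return APPLIED_AFTER.get((applied, initiated))


print(applied_after(PRACTICE, AUTOSHIFT))
print(applied_after(AUTOSHIFT, MANUAL))
```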

# Stopping an active autoshift or practice run for a resource
<a name="arc-zonal-autoshift.how-it-works.stop-shift"></a>

To stop an in-progress autoshift for a resource, you must cancel the zonal shift.

Regular practice runs still take place for the resource, on the same schedule. If you want to stop practice runs in addition to disabling autoshifts, you must delete the practice run configuration associated with the resource.

When you delete a practice run configuration, AWS stops performing practice runs that shift traffic for the resource away from an Availability Zone each week. In addition, because zonal autoshift requires practice runs, when you delete a practice run configuration using the ARC console, this action also disables zonal autoshift for the resource. However, note that if you use the zonal autoshift API to delete a practice run configuration, you must first disable zonal autoshift for the resource.
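The API ordering described above (disable zonal autoshift first, then delete the practice run configuration) can be sketched as follows. The sketch assumes a boto3-style `arc-zonal-shift` client whose `update_zonal_autoshift_configuration` and `delete_practice_run_configuration` methods take the parameters shown; verify the method and parameter names against the SDK documentation before relying on them. `RecordingClient` is a hypothetical stand-in that only records calls, so the example runs without AWS credentials.

```python
class RecordingClient:
    """Hypothetical stand-in for an SDK client; records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return record


def stop_autoshift_and_practice_runs(client, resource_identifier):
    # Step 1: disable zonal autoshift, because it requires a practice run configuration.
    client.update_zonal_autoshift_configuration(
        resourceIdentifier=resource_identifier,
        zonalAutoshiftStatus="DISABLED",
    )
    # Step 2: only then delete the practice run configuration.
    client.delete_practice_run_configuration(resourceIdentifier=resource_identifier)


client = RecordingClient()
stop_autoshift_and_practice_runs(
    client,
    "arn:aws:elasticloadbalancing:Region:111122223333:ExampleALB123456890",
)
for name, kwargs in client.calls:
    print(name, kwargs)
```

With a real client, you would replace `RecordingClient()` with the SDK client for the Region where the resource is deployed.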

For more information, see [Canceling a zonal autoshift](arc-zonal-autoshift.canceling-an-autoshift.md) and [Enabling and working with zonal autoshift](arc-zonal-autoshift.start-cancel.md).

# How traffic is shifted away
<a name="arc-zonal-autoshift.how-it-works.how-traffic-shifted"></a>

For autoshifts and for practice run zonal shifts, traffic is shifted away from an Availability Zone using the same mechanism that ARC uses for customer-initiated zonal shifts. An unhealthy health check results in Amazon Route 53 withdrawing the corresponding IP addresses for the resource from DNS, so that traffic is redirected from the Availability Zone. New connections are now routed to other Availability Zones in the AWS Region instead.

With an autoshift, when an Availability Zone recovers and AWS decides to end the autoshift, ARC reverses the health check process, requesting the Route 53 health checks to be reverted. Then, the original zonal IP addresses are restored and, if the health checks continue to be healthy, the Availability Zone is included in the application's routing again.

It's important to be aware that autoshifts are not based on health checks that monitor the underlying health of load balancers or applications. ARC uses health checks to move traffic away from Availability Zones, by requesting health checks to be set to unhealthy, and then restores health checks to normal again when it ends an autoshift or zonal shift. 

# Alarms for practice runs
<a name="arc-zonal-autoshift.how-it-works.alarms"></a>

You can specify two types of CloudWatch alarms for practice runs in zonal autoshift: outcome alarms and blocking alarms. 

**Outcome alarms (required)**  
 For the first type of alarm, the *outcome alarm*, at least one alarm is required to be specified. You should configure outcome alarms to monitor the health of your application when traffic is shifted away from an Availability Zone during each 30-minute practice run.  
For a practice run to be effective, specify as outcome alarms at least one CloudWatch alarm that meets both of the following criteria:  
The alarm monitors metrics for the resource, or for your application  
AND  
The alarm responds with an `ALARM` state when your application is adversely affected by the loss of one Availability Zone.  
For more information, see the **Alarms that you specify for practice runs** section in [Best practices when you configure zonal autoshift](arc-zonal-autoshift.considerations.md).  
Outcome alarms also provide information for the *practice run outcome* that ARC reports for each practice run. If an outcome alarm enters an `ALARM` state, ARC ends the practice run and returns a practice run outcome of `FAILED`. If the practice run completes the 30-minute test period and none of the outcome alarms that you've specified enters an `ALARM` state, the outcome returned is `SUCCEEDED`. A list of all outcome values, with descriptions, is provided in the [Outcomes for practice runs](arc-zonal-autoshift.considerations.md#ZAConsiderationsPracticeRunOutcomes) section.

**Blocking alarms (optional)**  
Optionally, you can specify a second type of alarm, the *blocking alarm*. When at least one blocking alarm is in an `ALARM` state, ARC does not start practice run traffic shifts and stops any practice runs that are in progress.  
For example, in a large architecture with multiple microservices, when one microservice is experiencing a problem, you typically want to stop all other changes in the application environment, which would include blocking practice runs. You can add a blocking alarm in ARC to accomplish this.
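To make the relationship between the two alarm types and the reported outcome concrete, here is a minimal sketch. It is not ARC's implementation: the function name and inputs are hypothetical, it ignores the `PENDING` and `CAPACITY_CHECK_FAILED` outcomes, and both the order in which the alarm types are evaluated and the mapping of a blocking alarm to `INTERRUPTED` are assumptions.

```python
def practice_run_outcome(outcome_alarm_states, blocking_alarm_states):
    """Map the alarm states observed during a practice run to an outcome value."""
    # An outcome alarm in ALARM ends the practice run with FAILED.
    if "ALARM" in outcome_alarm_states:
        return "FAILED"
    # Assumption: a blocking alarm in ALARM stops the run and is reported as INTERRUPTED.
    if "ALARM" in blocking_alarm_states:
        return "INTERRUPTED"
    # No alarm fired for the full 30-minute test period.
    return "SUCCEEDED"


print(practice_run_outcome(["OK", "OK"], []))
print(practice_run_outcome(["OK", "ALARM"], ["OK"]))
print(practice_run_outcome(["OK"], ["ALARM"]))
```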

# Blocked windows and allowed windows (in UTC)
<a name="arc-zonal-autoshift.how-it-works.blocked-windows"></a>

You have the option to *block* or *allow* practice runs for specific calendar dates, or for specific time windows, that is, days and times, specified in UTC. 

For example, if you have an application update scheduled to launch on May 1, 2024, and you don't want practice runs to shift traffic away at that time, you could set a blocked date for `2024-05-01`.

Or, say you run business report summaries three days a week. For this scenario, you could set the following recurring days and times as blocked windows, for example, in UTC: `MON-20:30-21:30 WED-20:30-21:30 FRI-20:30-21:30`.

Alternatively, you might decide that Wednesdays and Fridays from noon to 5:00 PM UTC are the best times for ARC to start practice runs, to test your setup. For this scenario, you could set the following recurring days and times as allowed windows, for example, in UTC: `WED-12:00-17:00 FRI-12:00-17:00`.
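To check how a set of windows applies to a particular moment, you can evaluate the windows yourself. The following sketch parses the `DAY-HH:MM-HH:MM` notation used in the examples above and tests a UTC timestamp against it. The helper names are illustrative, and the exact string format that the ARC API accepts might differ from this notation, so treat the parser as an assumption tied to the examples in this section.

```python
from datetime import datetime, timezone

DAYS = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]  # matches datetime.weekday()


def _minutes(hhmm):
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def parse_window(text):
    """Parse 'WED-12:00-17:00' into (weekday index, start minute, end minute)."""
    day, start, end = text.split("-")
    return DAYS.index(day.upper()), _minutes(start), _minutes(end)


def in_any_window(moment, windows):
    """True if the timezone-aware `moment` falls inside any space-separated window (UTC)."""
    moment = moment.astimezone(timezone.utc)
    minute_of_day = moment.hour * 60 + moment.minute
    for text in windows.split():
        day, start, end = parse_window(text)
        if moment.weekday() == day and start <= minute_of_day < end:
            return True
    return False


def is_blocked_date(moment, blocked_dates):
    """True if the UTC calendar date of `moment` is in `blocked_dates` (YYYY-MM-DD strings)."""
    return moment.astimezone(timezone.utc).date().isoformat() in blocked_dates


launch_day = datetime(2024, 5, 1, 13, 0, tzinfo=timezone.utc)  # a Wednesday
print(in_any_window(launch_day, "WED-12:00-17:00 FRI-12:00-17:00"))
print(is_blocked_date(launch_day, ["2024-05-01"]))
```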