

# Best practices
<a name="bestpractices"></a>

The following are recommended best practices for MemoryDB. Following them improves your cluster's performance and reliability. 

**Topics**
+ [Resilience in MemoryDB](disaster-recovery-resiliency.md)
+ [Best practices: Pub/Sub and Enhanced I/O Multiplexing](best-practices-pubsub.md)
+ [Best practices: Online cluster resizing](best-practices-online-resharding.md)

# Resilience in MemoryDB
<a name="disaster-recovery-resiliency"></a>

The AWS global infrastructure is built around AWS Regions and Availability Zones. AWS Regions provide multiple physically separated and isolated Availability Zones, which are connected with low-latency, high-throughput, and highly redundant networking. With Availability Zones, you can design and operate applications and databases that automatically fail over between Availability Zones without interruption. Availability Zones are more highly available, fault tolerant, and scalable than traditional single or multiple data center infrastructures. 

For more information about AWS Regions and Availability Zones, see [AWS Global Infrastructure](https://aws.amazon.com/about-aws/global-infrastructure/).

In addition to the AWS global infrastructure, MemoryDB offers several features to help support your data resiliency and snapshot needs.

**Topics**
+ [Mitigating Failures](faulttolerance.md)

# Mitigating Failures
<a name="faulttolerance"></a>

When planning your MemoryDB implementation, plan so that failures have minimal impact on your application and data. The topics in this section cover approaches you can take to protect your application and data from failures.

## Mitigating Failures: MemoryDB clusters
<a name="faulttolerance.cluster.replication"></a>

A MemoryDB cluster consists of a single primary node, which your application can both read from and write to, and from zero to five read-only replica nodes. We strongly recommend using at least one replica for high availability. Whenever data is written to the primary node, it is persisted to the transaction log and asynchronously propagated to the replica nodes. 

**When a read replica fails**

1. MemoryDB detects the failed replica.

1. MemoryDB takes the failed node offline.

1. MemoryDB launches and provisions a replacement node in the same AZ.

1. The new node synchronizes with the transaction log.

During this time your application can continue reading and writing using the other nodes.
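While the replacement replica is provisioning, a client can keep serving reads by routing around the failed node. The following is a minimal sketch of that fallback pattern; the `readers` callables are hypothetical stand-ins for per-node read functions (for example, replica reads first, then the primary), not a MemoryDB client API:

```python
def read_with_fallback(readers, key):
    """Try each node's read function in turn and return the first
    successful result. `readers` is an ordered list of callables,
    for example replica reads first, then the primary."""
    last_error = None
    for read_fn in readers:
        try:
            return read_fn(key)
        except ConnectionError as err:  # node offline or being replaced
            last_error = err
    raise last_error  # every node failed
```

In practice, most Valkey and Redis OSS cluster client libraries implement this kind of retry and node rediscovery for you; the sketch only illustrates why reads continue during replica replacement.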

**MemoryDB Multi-AZ**  
If Multi-AZ is activated on your MemoryDB clusters, a failed primary will be detected and replaced automatically. 


1. MemoryDB detects the primary node failure.

1. MemoryDB fails over to a replica after ensuring it is consistent with the failed primary.

1. MemoryDB spins up a replica in the failed primary's AZ.

1. The new node syncs with the transaction log.

Failing over to a replica node is generally faster than creating and provisioning a new primary node. This means your application can resume writing to your primary node sooner.

For more information, see [Minimizing downtime in MemoryDB with Multi-AZ](autofailover.md).

# Best practices: Pub/Sub and Enhanced I/O Multiplexing
<a name="best-practices-pubsub"></a>

When using Valkey or Redis OSS version 7 or later, we recommend using [sharded Pub/Sub](https://valkey.io/topics/pubsub/). You also improve throughput and latency using [enhanced I/O multiplexing](https://aws.amazon.com/memorydb/features/#Ultra-fast_performance), which is automatically available when using Valkey or Redis OSS version 7 or later and requires no client changes. It is ideal for pub/sub workloads, which often are throughput-bound with multiple client connections.
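Sharded Pub/Sub scales better than global Pub/Sub because each sharded channel is assigned to a cluster slot using the same CRC16 hashing applied to keys, so a message is propagated only within the shard that owns that slot rather than to every node. The following sketch computes a channel's slot; the CRC16-XMODEM algorithm, the 16384-slot count, and the hash-tag rule come from the cluster specification:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the checksum used for cluster slot hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def channel_slot(channel: str) -> int:
    """Return the cluster slot (0-16383) that owns a sharded channel."""
    # Honor hash tags: only the substring inside the first non-empty
    # {...} is hashed, so related channels can share a slot.
    start = channel.find("{")
    if start != -1:
        end = channel.find("}", start + 1)
        if end != -1 and end != start + 1:
            channel = channel[start + 1:end]
    return crc16_xmodem(channel.encode()) % 16384
```

Clients publish to sharded channels with `SPUBLISH` and subscribe with `SSUBSCRIBE`; the slot mapping above determines which shard handles each channel.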

# Best practices: Online cluster resizing
<a name="best-practices-online-resharding"></a>

*Resharding* involves adding or removing shards or nodes in your cluster and redistributing key spaces. As a result, multiple factors affect the resharding operation, such as the load on the cluster, memory utilization, and overall size of data. For the best experience, we recommend that you follow overall cluster best practices for uniform workload pattern distribution. In addition, we recommend taking the following steps.

Before initiating resharding, we recommend the following:
+ **Test your application** – Test your application behavior during resharding in a staging environment if possible.
+ **Get early notification of scaling issues** – Resharding is a compute-intensive operation. Because of this, we recommend keeping CPU utilization under 80 percent on multicore instances and under 50 percent on single-core instances during resharding. Monitor MemoryDB metrics and initiate resharding before your application starts observing scaling issues. Useful metrics to track are `CPUUtilization`, `NetworkBytesIn`, `NetworkBytesOut`, `CurrConnections`, `NewConnections`, `FreeableMemory`, `SwapUsage`, and `BytesUsedForMemoryDB`.
+ **Ensure sufficient free memory is available before scaling in** – If you're scaling in, ensure that free memory available on the shards to be retained is at least 1.5 times the memory used on the shards you plan to remove.
+ **Initiate resharding during off-peak hours** – This practice helps to reduce the latency and throughput impact on the client during the resharding operation. It also helps to complete resharding faster as more resources can be used for slot redistribution.
+ **Review client timeout behavior** – Some clients might observe higher latency during online cluster resizing. Configuring your client library with a higher timeout can help by giving the system time to connect, even under higher load conditions on the server. In some cases, your application might open a large number of connections to the server. If so, consider adding exponential backoff to your reconnect logic. Doing this can help prevent a burst of new connections from hitting the server at the same time.
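The reconnect advice above can be sketched as a delay schedule that combines exponential backoff with full jitter, so that clients reconnecting after a disruption don't all retry at the same instant. The `base` and `cap` values here are illustrative, not MemoryDB recommendations:

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: wait a random amount of
    time between 0 and min(cap, base * 2**attempt) seconds before the
    next reconnect attempt."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

For example, a client would sleep `backoff_delay(n)` seconds before reconnect attempt `n`. The randomization spreads retries out, and the cap keeps the wait bounded even after many failures.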

During resharding, we recommend the following:
+ **Avoid expensive commands** – Avoid running any computationally and I/O intensive operations, such as the `KEYS` and `SMEMBERS` commands. We suggest this approach because these operations increase the load on the cluster and have an impact on the performance of the cluster. Instead, use the `SCAN` and `SSCAN` commands.
+ **Follow Lua best practices** – Avoid long-running Lua scripts, and always declare the keys used in Lua scripts up front. We recommend this approach to ensure that the Lua script doesn't use cross-slot commands. Ensure that the keys used in Lua scripts belong to the same slot.
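As an illustration of the `SCAN` advice above, the following sketch iterates the keyspace incrementally instead of issuing a single blocking `KEYS` call. It assumes a client object with a `scan(cursor, match=..., count=...)` method returning `(next_cursor, keys)`, as redis-py and most Valkey/Redis OSS clients provide; a full iteration ends when the cursor returns to 0:

```python
def scan_all_keys(client, match="*", count=100):
    """Collect matching keys incrementally with SCAN. Each call inspects
    roughly `count` entries, so the server stays responsive instead of
    blocking while it enumerates the entire keyspace."""
    keys, cursor = [], 0
    while True:
        cursor, batch = client.scan(cursor, match=match, count=count)
        keys.extend(batch)
        if cursor == 0:  # the iteration has wrapped around; we're done
            return keys
```

The same cursor pattern applies to `SSCAN` for large sets in place of `SMEMBERS`.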

After resharding, note the following:
+ Scale-in might be partially successful if insufficient memory is available on target shards. If such a result occurs, review available memory and retry the operation, if necessary.
+ Slots with large items are not migrated. In particular, slots with items larger than 256 MB post-serialization are not migrated.
+ The `FLUSHALL` and `FLUSHDB` commands are not supported inside Lua scripts during a resharding operation.