Most NoSQL deployments use three replicas to ensure high availability (HA). At a high level, the first replica stores your dataset, the second is used for failover, and the third serves as a tiebreaker in the event of a network split. Because DRAM is expensive, maintaining three in-memory replicas is costly. Redis Enterprise, on the other hand, provides a fully HA system with only two replicas; the tiebreaker is determined at the node level by using an odd number of nodes in the cluster. The example below compares the infrastructure cost of running a 90GB HA OSS Redis dataset on Amazon Web Services with three replicas against a Redis Enterprise cluster that uses two replicas and a quorum node:
The cost of 90GB dataset HA deployment on AWS (reserved instances)
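As a rough illustration of where the difference comes from, the sketch below compares only the DRAM footprint of the two layouts. It assumes that every replica holds a full copy of the 90GB dataset and that the quorum node holds no data; it ignores instance pricing and memory overhead, and the function name is hypothetical.

```python
def total_dram_gb(dataset_gb: float, replicas: int) -> float:
    """DRAM needed to hold `replicas` full copies of the dataset."""
    return dataset_gb * replicas


DATASET_GB = 90  # dataset size used in the comparison above

# Assumption: each replica is a full in-memory copy of the dataset.
three_replica_gb = total_dram_gb(DATASET_GB, 3)
# Assumption: the quorum node stores no data, so it adds no dataset DRAM.
two_replica_gb = total_dram_gb(DATASET_GB, 2)

saving = 1 - two_replica_gb / three_replica_gb

print(f"three replicas:             {three_replica_gb} GB")
print(f"two replicas + quorum node: {two_replica_gb} GB")
print(f"DRAM saved:                 {saving:.0%}")
```

Under these assumptions, the two-replica layout needs 180GB of DRAM for data instead of 270GB, about one third less.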
Pure in-memory replication
Redis Enterprise replication is based on diskless replication at both the master and slave, as shown in the figure below:
An internal replication benchmark with 52GB DRAM (19GB RDB) showed the following results on 2x r3.2xlarge instances:
In addition, Redis Enterprise uses PSYNC2 for its core operations, so the active replication link is maintained afterwards for planned failover or shard migration operations.
A Redis Enterprise cluster uses two watchdog processes to detect failures:
- Node watchdog: Monitors all processes running on a given node. For example, the node watchdog triggers a shard failover event if a specific shard is not responsive.
- Cluster watchdog: Responsible for the health of the cluster nodes; uses a gossip protocol to manage the membership of the nodes in the cluster. For example, the cluster watchdog triggers a node failure event or detects a network split incident.
These watchdog processes are part of the distributed cluster manager entity and reside on each node of the cluster. It is extremely important for failure detection to be managed by entities that run inside the cluster, in order to avoid situations like the one shown on the left side of the figure below. In this example, the watchdog entity is located on the wrong side of the network split and cannot trigger the failover process:
Once a failure event is detected, the Redis Enterprise cluster automatically and transparently runs a set of internal distributed processes that fail over the relevant shard(s) and endpoint(s) (if needed) to healthy cluster nodes. If necessary, they also reroute user traffic through a different proxy or proxies.
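The role that an odd number of nodes plays during a network split can be illustrated with a simple majority check. This is a minimal sketch of a generic strict-majority rule, not the cluster manager's actual algorithm; the function names are hypothetical.

```python
from typing import List, Optional


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this side of a split sees a strict majority of the cluster."""
    return reachable_nodes * 2 > cluster_size


def majority_partition(partition_sizes: List[int]) -> Optional[int]:
    """Return the index of the partition that holds a majority, or None."""
    cluster_size = sum(partition_sizes)
    for index, size in enumerate(partition_sizes):
        if has_quorum(size, cluster_size):
            return index
    return None


# Three nodes split 2/1: the first side keeps a majority and may fail over.
odd_split = majority_partition([2, 1])

# Four nodes split 2/2: neither side has a majority, so there is no tiebreaker.
even_split = majority_partition([2, 2])

print(odd_split, even_split)
```

With an odd node count, a split into two sides always leaves exactly one side with a majority, which is why an even split such as 2/2 is the case to avoid.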
The Redis Enterprise cluster has out-of-the-box HA profiles for noisy (public cloud) and quiet (virtual private cloud, on-premises) environments. We have found that triggering failovers too aggressively can create stability issues, particularly in noisy environments. In a quiet network environment, on the other hand, a Redis Enterprise cluster can easily be tuned to support a consistent single-digit (<10 sec) failover time in all failure scenarios.
Multi availability zone (AZ)/rack deployment
Redis Enterprise supports multi-AZ/rack cluster configurations. In this mode, the cluster nodes are tagged with the zone/rack they have been deployed in, and Redis Enterprise ensures that master and slave Redis processes of the same shard are never hosted on nodes that are located in the same AZ/rack. Running Redis Enterprise in a multi-AZ/rack environment requires the following conditions:
- Three or more cluster nodes, with the total number being odd
- Three or more AZs/racks, with the total number being odd
- The number of nodes in a given AZ/rack should always be a minority, i.e., smaller than the combined number of nodes in all other AZs/racks. This ensures that the majority of the nodes in the cluster will remain up and running if there is an AZ/rack failure.
- Network latency between AZs/racks should be <10ms, so that the cluster failure detection mechanisms operate correctly. At this latency, failover decisions are in most cases triggered by real failure events rather than by network glitches. For cases with more than 10ms latency between AZs/racks, Active-Passive or Active-Active deployments should be considered.
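The placement rule described above, that the master and slave of the same shard never share an AZ/rack, can be sketched as a simple zone-aware selection. This is an illustrative sketch only: the node names, zone tags, and the `pick_replica_node` helper are hypothetical, and the real placement logic also has to weigh factors such as free memory and load.

```python
from typing import Dict, Optional


def pick_replica_node(master_node: str, node_zones: Dict[str, str]) -> Optional[str]:
    """Pick a node for the slave that sits in a different zone than the master.

    Returns None when every node shares the master's zone.
    """
    master_zone = node_zones[master_node]
    # Iterate in sorted order so the choice is deterministic.
    for node, zone in sorted(node_zones.items()):
        if zone != master_zone:
            return node
    return None


# Hypothetical zone tags for a three-node cluster.
node_zones = {"node1": "zone-a", "node2": "zone-a", "node3": "zone-b"}

replica_node = pick_replica_node("node1", node_zones)
print(replica_node)
```

Here the slave for a master on `node1` lands on `node3`, the only node tagged with a different zone.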
An example of Redis Enterprise multi-AZ configuration in the cloud is shown here:
As you can see, this example meets all the conditions discussed above:
- There is an odd number of nodes (7).
- There is an odd number of availability zones (3).
- The number of nodes in each zone is a minority.
- In all the major public clouds, network latency between zones in the same region is typically less than 10ms.
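The four conditions can also be checked programmatically. The sketch below is a hypothetical validator, not a Redis Enterprise API. The 3/2/2 split of the seven nodes and the 2ms latency figure are assumptions made for illustration, since the text gives only the totals.

```python
from collections import Counter
from typing import Dict, List

MAX_LATENCY_MS = 10.0  # latency threshold between AZs/racks stated above


def check_topology(node_zones: Dict[str, str], inter_zone_latency_ms: float) -> List[str]:
    """Return a list of violated multi-AZ/rack conditions (empty if all hold)."""
    problems = []
    nodes = len(node_zones)
    per_zone = Counter(node_zones.values())
    zones = len(per_zone)

    if nodes < 3 or nodes % 2 == 0:
        problems.append(f"node count {nodes} must be odd and at least 3")
    if zones < 3 or zones % 2 == 0:
        problems.append(f"zone count {zones} must be odd and at least 3")
    for zone, count in sorted(per_zone.items()):
        # A zone is a minority if it holds fewer nodes than all other zones combined.
        if count * 2 >= nodes:
            problems.append(f"zone {zone} holds {count} of {nodes} nodes")
    if inter_zone_latency_ms >= MAX_LATENCY_MS:
        problems.append(f"latency {inter_zone_latency_ms}ms is not below {MAX_LATENCY_MS}ms")
    return problems


# Assumed layout: 7 nodes across 3 zones, split 3/2/2.
example_zones = {
    "node1": "zone-a", "node2": "zone-a", "node3": "zone-a",
    "node4": "zone-b", "node5": "zone-b",
    "node6": "zone-c", "node7": "zone-c",
}

problems = check_topology(example_zones, inter_zone_latency_ms=2.0)
print(problems)  # prints []
```

An empty list means every condition holds; a four-node, two-zone layout would instead report violations of the node-count, zone-count, and minority rules.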