Highly Available Redis
(with failover time in single-digit seconds)
Most NoSQL deployments use three replicas to ensure high availability (HA). At a high level, the first replica stores your dataset, the second is used for failover, and the third serves as a tiebreaker in case of a network split event. Because DRAM is expensive, maintaining three replicas can be extremely costly. Redis Enterprise, on the other hand, allows you to have a fully HA system with only two replicas, where the tiebreaker is determined at the node level by using an odd number of nodes in a cluster. The example below compares the infrastructure cost of running a 90GB HA OSS Redis dataset on AWS with three replicas against a Redis Enterprise cluster that uses two replicas and a quorum node:
The Cost of 90GB Dataset HA Deployment on AWS (reserved instances)
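The DRAM footprint behind this comparison can be sketched with simple arithmetic. The snippet below is only an illustration: it assumes each replica holds a full copy of the 90GB dataset, that the quorum node holds no data, and it ignores per-instance overhead and pricing. The function name `dram_footprint_gb` is made up for this example.

```python
def dram_footprint_gb(dataset_gb: float, replicas: int) -> float:
    """Total DRAM needed to hold `replicas` full copies of the dataset."""
    return dataset_gb * replicas


dataset_gb = 90

# Three full copies of the dataset (three-replica deployment).
three_replica_gb = dram_footprint_gb(dataset_gb, 3)

# Two full copies; the quorum node is assumed to store no data.
two_replica_gb = dram_footprint_gb(dataset_gb, 2)

# Fraction of DRAM saved by dropping the third copy.
saving = 1 - two_replica_gb / three_replica_gb

print(f"three replicas: {three_replica_gb} GB")
print(f"two replicas + quorum node: {two_replica_gb} GB")
print(f"DRAM saved: {saving:.0%}")
```

Under these assumptions the two-replica layout needs one third less DRAM for data; the actual cost difference depends on the instance types chosen.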
Pure In-Memory Replication
Redis Enterprise replication is based on diskless replication at both the master and slave, as shown in the figure below:
An internal replication benchmark with 52GB RAM (19GB RDB) showed the following results on 2x r3.2xlarge instances:
In addition, Redis Enterprise uses PSYNC2 for its core operations, so the active replication link is maintained afterwards for planned failover or shard migration operations.
A Redis Enterprise cluster uses two watchdog processes to detect failures:
- Node watchdog – Monitors all processes running on a given node. For example, the node watchdog triggers a shard failover event if a specific shard is not responsive.
- Cluster watchdog – Responsible for the health of the cluster nodes and uses a gossip protocol to manage the membership of the nodes in the cluster. For example, the cluster watchdog triggers a node failure event or detects a network split incident.
These watchdog processes are part of the distributed cluster manager entity and reside on each node of the cluster. It is extremely important for failure detection to be managed by entities that run inside the cluster in order to avoid situations like the one shown on the left side of the figure below. In this example, the watchdog entity is located on the wrong side of the network split and cannot trigger the failover process:
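The reasoning about network splits can be illustrated with a toy majority check. This is a minimal sketch and not Redis Enterprise's actual cluster manager logic: the functions `has_quorum` and `side_allowed_to_fail_over` and the node names are hypothetical. It shows why an odd number of nodes gives exactly one side of a two-way split the right to act.

```python
from typing import Iterable, Optional, Set


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if `reachable` nodes form a strict majority of the cluster."""
    return reachable > cluster_size // 2


def side_allowed_to_fail_over(
    sides: Iterable[Set[str]], cluster_size: int
) -> Optional[Set[str]]:
    """Return the side of a network split that holds a majority, if any."""
    for side in sides:
        if has_quorum(len(side), cluster_size):
            return side
    return None


# Hypothetical three-node cluster split into a minority and a majority side.
nodes = {"n1", "n2", "n3"}
split = [{"n1"}, {"n2", "n3"}]

majority_side = side_allowed_to_fail_over(split, len(nodes))
print("side that may trigger failover:", sorted(majority_side))

# With an even node count, a 2/2 split leaves no side with a majority.
even_split = [{"n1", "n2"}, {"n3", "n4"}]
print("even split winner:", side_allowed_to_fail_over(even_split, 4))
```

A monitor that sits outside the cluster can end up on the minority side with no way to know it, which is the situation the text warns about.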
Once a failure event is detected, the Redis Enterprise cluster automatically and transparently runs a set of internal distributed processes that fail over the relevant shard(s) and endpoint(s) (if needed) to healthy cluster nodes. If necessary, they also reroute user traffic through a different proxy or proxies.
The Redis Enterprise cluster has out-of-the-box HA profiles for noisy (public cloud) and quiet (VPC, on-premises) environments. We have found that triggering failovers too aggressively can create stability issues. On the other hand, in a quiet network environment, a Redis Enterprise cluster can be easily tuned to support a constant single-digit (<10 sec) failover time in all failure scenarios.
Multi-AZ (Availability-Zone) / Rack Deployment
Redis Enterprise supports multi-AZ/rack cluster configurations. In this mode, the cluster nodes are tagged with the zone/rack they have been deployed in, and Redis Enterprise ensures that master and slave Redis processes of the same shard are never hosted on nodes that are located in the same AZ/rack. To run Redis Enterprise in a multi-AZ/rack environment, the following conditions are required:
- Three or more cluster nodes, with the total number being odd
- Three or more AZs/racks, with the total number being odd
- The number of nodes in any given AZ/rack should always be a minority, i.e. smaller than the number of nodes in all other AZs/racks combined. This ensures that the majority of the nodes in the cluster will remain up and running if there is an AZ/rack failure.
- Network latency between AZs/racks should be less than 10ms so that the cluster failure detection mechanisms operate correctly. In most cases, failover decisions are then made when real failure events happen and are less influenced by network glitches. For cases with more than 10ms latency between AZs/racks, Active-Passive or Active-Active deployments should be considered.
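The four conditions above can be expressed as a small validation routine. This is a sketch under stated assumptions and not a Redis Enterprise tool or API: the function `check_multi_az_layout`, the node and zone names, and the 3/2/2 layout are hypothetical, and the latency figure is supplied by the caller rather than measured.

```python
from collections import Counter
from typing import Dict, List

MAX_LATENCY_MS = 10.0  # threshold taken from the conditions above


def check_multi_az_layout(
    node_zones: Dict[str, str], inter_zone_latency_ms: float
) -> List[str]:
    """Return a list of violated conditions; an empty list means all pass."""
    problems: List[str] = []
    total_nodes = len(node_zones)
    nodes_per_zone = Counter(node_zones.values())
    zone_count = len(nodes_per_zone)

    # Condition 1: three or more nodes, odd total.
    if total_nodes < 3 or total_nodes % 2 == 0:
        problems.append(f"node count {total_nodes} must be odd and >= 3")

    # Condition 2: three or more zones/racks, odd total.
    if zone_count < 3 or zone_count % 2 == 0:
        problems.append(f"zone count {zone_count} must be odd and >= 3")

    # Condition 3: every zone holds fewer nodes than all other zones combined.
    for zone, count in sorted(nodes_per_zone.items()):
        if count >= total_nodes - count:
            problems.append(f"zone {zone} holds {count} nodes, not a minority")

    # Condition 4: inter-zone latency below the threshold.
    if inter_zone_latency_ms >= MAX_LATENCY_MS:
        problems.append(
            f"latency {inter_zone_latency_ms}ms is not below {MAX_LATENCY_MS}ms"
        )

    return problems


# Hypothetical layout: 7 nodes spread 3/2/2 across 3 zones.
layout = {
    "node1": "az-a", "node2": "az-a", "node3": "az-a",
    "node4": "az-b", "node5": "az-b",
    "node6": "az-c", "node7": "az-c",
}
problems = check_multi_az_layout(layout, inter_zone_latency_ms=2.0)
print("violations:", problems)
```

Returning a list of violations rather than a single boolean makes it easier to see which condition a proposed layout breaks.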
An example of Redis Enterprise multi-AZ configuration in the cloud is shown in the figure below:
As you can see, all the conditions discussed above are met in this example:
- There is an odd number of nodes (7).
- There is an odd number of availability zones (3).
- The number of nodes in each zone is a minority.
- In the major public clouds, network latency between zones in the same region is typically less than 10ms.