Hi everyone,
I am running Redis in Sentinel mode with the following setup:
- 1 master
- 3 replicas
- C# application using StackExchange.Redis
- Writes go to the current master
- Reads are intended to go to replicas
The goal is to keep read traffic available during master failover and to switch to a replica that is actually able to serve reads as quickly as possible.
During failover testing, I observed that after one replica is promoted to master, other replicas may enter full resync / loading state and return errors such as:
```
LOADING Redis is loading the dataset in memory
MASTERDOWN Link with MASTER is down and replica-serve-stale-data is set to 'no'
```
Here are the relevant Redis / Sentinel settings from my environment:
Sentinel:
- monitor quorum: 2
- down-after-milliseconds: 2000 ms
- failover-timeout: 120000 ms
- parallel-syncs: 1
Redis replication:
- repl-backlog-size: 100mb in the original STG config
- repl-backlog-size: also tested with 3gb locally
- repl-backlog-ttl: 7200 seconds
- replica-priority:
- original/default master node: 1
- other replica nodes: 100
- replica-serve-stale-data: yes
- min-replicas-to-write: not explicitly set
- min-replicas-max-lag: not explicitly set
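For clarity, here are the same settings expressed as config fragments (the master name `mymaster`, host placeholder, and port are illustrative, not my real values):

```conf
# redis.conf (replication, STG values)
repl-backlog-size 100mb          # also tested with 3gb locally
repl-backlog-ttl 7200
replica-serve-stale-data yes
replica-priority 100             # 1 on the original/default master node
# min-replicas-to-write  <n>    -- not set
# min-replicas-max-lag   <sec>  -- not set

# sentinel.conf
sentinel monitor mymaster <master-host> 6379 2
sentinel down-after-milliseconds mymaster 2000
sentinel failover-timeout mymaster 120000
sentinel parallel-syncs mymaster 1
```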
Dataset size during local testing:
- around 3 million keys
- around 3 GB used memory
Even after increasing repl-backlog-size to 3gb in local testing, I still observed cases where replicas entered LOADING during failover recovery. So my current assumption is that a larger backlog can reduce the probability of full resync, but it does not guarantee that replicas will always recover via partial resync.
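To make that assumption concrete: as I understand it, a disconnected replica stays eligible for partial resync only while its missing range still fits in the backlog, so the backlog buys a time window of roughly `backlog_size / write_rate`. A quick back-of-the-envelope check (the 10 MB/s write rate is an illustrative figure, not a measurement from my environment):

```python
def backlog_window_seconds(backlog_bytes: int, write_bytes_per_sec: float) -> float:
    """How long a disconnected replica stays partial-resync-eligible,
    assuming the master keeps writing at a constant rate."""
    return backlog_bytes / write_bytes_per_sec

MB = 1024 * 1024
print(backlog_window_seconds(100 * MB, 10 * MB))       # 100mb backlog -> 10.0 s
print(backlog_window_seconds(3 * 1024 * MB, 10 * MB))  # 3gb backlog   -> 307.2 s
```

This matches what I observed: the backlog only bounds how much divergence a partial resync can cover, and after a promotion the replicas must also line up with the new master's replication ID (PSYNC2), so a full resync can still happen regardless of backlog size.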
My current understanding is:
- Sentinel can tell clients which node is the current master.
- Sentinel can expose the replica topology.
- Sentinel chooses a replica for promotion based on factors such as `replica-priority`, replication offset, run ID, and availability.
- However, choosing the best replica for promotion does not necessarily mean all remaining replicas are immediately ready to serve reads.
- A replica can still be reachable at the TCP level but not service-ready, because it may return `LOADING`, `MASTERDOWN`, or simply time out.
- StackExchange.Redis with replica reads / `PreferReplica` does not seem to give me direct control to choose only replicas that pass my own readiness criteria.
What I want to achieve is:
- Detect replicas that are reachable but not ready for reads.
- Exclude replicas that return `LOADING`, `MASTERDOWN`, a timeout, or a non-`PONG` health response.
- Route reads only to healthy replicas.
- Avoid falling back to master unless explicitly allowed, because we are concerned about overloading the master during failover.
- If no healthy replica exists, fail fast or use an application-level fallback instead of treating Redis errors as cache miss.
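The readiness filter I have in mind would classify a replica from its probe responses roughly like this (a minimal sketch of the decision logic only, written in Python for brevity; my real implementation would be C# on top of StackExchange.Redis, but the `INFO` field names used here are the ones Redis actually returns):

```python
def parse_info(text: str) -> dict:
    """Parse the 'key:value' lines of a Redis INFO response."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            out[key] = value
    return out

def replica_is_ready(ping_ok: bool, info_replication: str, info_persistence: str) -> bool:
    """Readiness criteria from above: PONG received, dataset not LOADING,
    link to the master up (i.e. not in MASTERDOWN territory), no sync running."""
    if not ping_ok:
        return False
    repl = parse_info(info_replication)
    pers = parse_info(info_persistence)
    return (
        repl.get("role") == "slave"
        and repl.get("master_link_status") == "up"
        and repl.get("master_sync_in_progress") == "0"
        and pers.get("loading") == "0"
    )

repl = "role:slave\nmaster_link_status:up\nmaster_sync_in_progress:0"
pers = "loading:0"
print(replica_is_ready(True, repl, pers))  # True
```

In the real router those strings would come from issuing `PING`, `INFO replication`, and `INFO persistence` against each replica endpoint.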
My questions are:
- In Redis Sentinel mode, is there a recommended way to make replica reads readiness-aware?
- During Sentinel failover, how exactly does Redis/Sentinel choose the replica to promote?
- How much do `replica-priority`, replication offset, run ID, and replica availability affect the promotion decision?
- Is there any way to prefer the replica with the most complete data and shortest recovery time?
- Is `LOADING`/`MASTERDOWN` during failover something Sentinel is expected to expose to clients, or should it be handled at the client/application layer?
- Does StackExchange.Redis provide any built-in mechanism to avoid replicas that are in `LOADING`, `MASTERDOWN`, or otherwise not ready for reads?
- If not, is the common approach to build a custom client-side read router that periodically probes each replica with `PING`, `INFO replication`, and `INFO persistence`?
- Which Redis / Sentinel settings are most relevant for reducing full resync / loading windows during Sentinel failover?
- Are there recommended tuning strategies for settings such as `repl-backlog-size`, `repl-backlog-ttl`, `parallel-syncs`, `replica-priority`, `replica-serve-stale-data`, `min-replicas-to-write`, `down-after-milliseconds`, and `failover-timeout`?
- Would Redis Cluster be a better long-term fit if we need topology-aware routing, failover handling, and better control over recovery behavior?
I am trying to understand whether this is a limitation of Sentinel-style replica reads, a StackExchange.Redis limitation, a Redis configuration issue, or a design issue in my approach.
Any advice from people running Redis Sentinel with read-from-replica traffic in production would be appreciated.