u/New-Technology-9941

r/redis

Hi everyone,

I am running Redis in Sentinel mode with the following setup:

  • 1 master
  • 3 replicas
  • C# application using StackExchange.Redis
  • Writes go to the current master
  • Reads are intended to go to replicas

The goal is to keep read traffic available during master failover and to switch to a replica that is actually able to serve reads as quickly as possible.

During failover testing, I observed that after one replica is promoted to master, other replicas may enter full resync / loading state and return errors such as:

LOADING Redis is loading the dataset in memory
MASTERDOWN Link with MASTER is down and replica-serve-stale-data is set to 'no'

Here are the relevant Redis / Sentinel settings from my environment:

Sentinel:
- monitor quorum: 2
- down-after-milliseconds: 2000
- failover-timeout: 120000
- parallel-syncs: 1
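
For completeness, this is what those Sentinel settings look like in `sentinel.conf` form (the master name `mymaster` and the address are placeholders, not my real values):

```
sentinel monitor mymaster 192.168.0.10 6379 2
sentinel down-after-milliseconds mymaster 2000
sentinel failover-timeout mymaster 120000
sentinel parallel-syncs mymaster 1
```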

Redis replication:
- repl-backlog-size: 100mb in the original STG config
- repl-backlog-size: also tested with 3gb locally
- repl-backlog-ttl: 7200 seconds
- replica-priority:
  - original/default master node: 1
  - other replica nodes: 100
- replica-serve-stale-data: yes
- min-replicas-to-write: not explicitly set
- min-replicas-max-lag: not explicitly set
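
In `redis.conf` terms, the replica-side settings are roughly the following (3gb shown as the locally tested value; the min-replicas-* settings are at their defaults, which I believe are 0 and 10):

```
repl-backlog-size 3gb            # 100mb in the original STG config
repl-backlog-ttl 7200
replica-priority 100             # 1 on the original/default master node
replica-serve-stale-data yes
# min-replicas-to-write not set (default 0)
# min-replicas-max-lag not set (default 10)
```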

Dataset size during local testing:
- around 3 million keys
- around 3 GB used memory

Even after increasing repl-backlog-size to 3gb in local testing, I still observed cases where replicas entered LOADING during failover recovery. My current assumption is that a larger backlog reduces the probability of a full resync (partial resync is only possible while the backlog still holds every write that happened during the disconnect), but it does not guarantee that replicas will always recover via partial resync.

My current understanding is:

  • Sentinel can tell clients which node is the current master.
  • Sentinel can expose the replica topology.
  • Sentinel chooses a replica for promotion based on factors such as replica-priority, replication offset, run ID, and availability.
  • However, choosing the best replica for promotion does not necessarily mean all remaining replicas are immediately ready to serve reads.
  • A replica can still be reachable at the TCP level but not ready to serve reads: it may return LOADING or MASTERDOWN errors, or simply time out.
  • StackExchange.Redis with replica reads / PreferReplica does not seem to give me direct control to choose only replicas that pass my own readiness criteria.
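
To make the last point concrete, this is roughly how reads are issued today (a simplified sketch, not my exact production code; the Sentinel endpoints, service name, and key are placeholders):

```csharp
using StackExchange.Redis;

// Connect through Sentinel; StackExchange.Redis resolves the current master
// from ServiceName and discovers the replica topology itself.
var options = new ConfigurationOptions
{
    ServiceName = "mymaster",   // Sentinel master name (placeholder)
    EndPoints = { "sentinel1:26379", "sentinel2:26379", "sentinel3:26379" },
};
var mux = await ConnectionMultiplexer.ConnectAsync(options);
var db = mux.GetDatabase();

// PreferReplica routes the read to a replica when one is connected,
// but gives me no hook to exclude replicas that answer LOADING/MASTERDOWN.
var value = await db.StringGetAsync("some-key", CommandFlags.PreferReplica);
```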

What I want to achieve is:

  1. Detect replicas that are reachable but not ready for reads.
  2. Exclude replicas returning LOADING, MASTERDOWN, timeout, or non-PONG health responses.
  3. Route reads only to healthy replicas.
  4. Avoid falling back to master unless explicitly allowed, because we are concerned about overloading the master during failover.
  5. If no healthy replica exists, fail fast or use an application-level fallback instead of treating Redis errors as a cache miss.
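
For items 1–3 above, what I am imagining is a periodic per-replica probe along these lines (a rough sketch only; `IsReadable` and the exact thresholds/checks are my own invention, not an existing StackExchange.Redis API):

```csharp
using System.Linq;
using StackExchange.Redis;

// Sketch: decide whether a replica is ready for reads.
// server is an IServer obtained from an existing ConnectionMultiplexer.
static bool IsReadable(IServer server)
{
    try
    {
        server.Ping();  // basic reachability; throws on timeout

        // INFO replication on a replica exposes role and master_link_status.
        var info = server.Info("replication")
                         .SelectMany(g => g)
                         .ToDictionary(kv => kv.Key, kv => kv.Value);

        return info.TryGetValue("role", out var role) && role == "slave"
            && info.TryGetValue("master_link_status", out var link) && link == "up";
    }
    catch (RedisServerException)   // LOADING / MASTERDOWN come back as server errors
    {
        return false;
    }
    catch (RedisTimeoutException)  // slow or hung replica
    {
        return false;
    }
}
```

The idea would be to run this on a timer, keep a list of currently readable replicas, and issue reads only against those endpoints, but I would much rather use something built in if it exists.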

My questions are:

  1. In Redis Sentinel mode, is there a recommended way to make replica reads readiness-aware?
  2. During Sentinel failover, how exactly does Redis/Sentinel choose the replica to promote?
  3. How much do replica-priority, replication offset, run ID, and replica availability affect the promotion decision?
  4. Is there any way to prefer the replica with the most complete data and shortest recovery time?
  5. Is LOADING / MASTERDOWN during failover something Sentinel is expected to expose to clients, or should it be handled at the client/application layer?
  6. Does StackExchange.Redis provide any built-in mechanism to avoid replicas that are in LOADING, MASTERDOWN, or otherwise not ready for reads?
  7. If not, is the common approach to build a custom client-side read router that periodically probes each replica with PING, INFO replication, and INFO persistence?
  8. Which Redis / Sentinel settings are most relevant for reducing full resync / loading windows during Sentinel failover?
  9. Are there recommended tuning strategies for settings such as repl-backlog-size, repl-backlog-ttl, parallel-syncs, replica-priority, replica-serve-stale-data, min-replicas-to-write, down-after-milliseconds, and failover-timeout?
  10. Would Redis Cluster be a better long-term fit if we need topology-aware routing, failover handling, and better control over recovery behavior?

I am trying to understand whether this is a limitation of Sentinel-style replica reads, a StackExchange.Redis limitation, a Redis configuration issue, or a design issue in my approach.

Any advice from people running Redis Sentinel with read-from-replica traffic in production would be appreciated.

u/New-Technology-9941 — 18 days ago