How to ensure fault tolerance of microservices

🟢 Junior Level

Fault tolerance is the ability of a system to continue operating when individual components fail.

Key patterns:

Circuit Breaker: stops calling a failing service once errors exceed a threshold and returns a fallback instead
Retry: repeats a failed call after a delay
Timeout: limits the wait time for a response
Fallback: alternative path on failure
Bulkhead: isolates resources (thread pools or concurrency limits) so a failure in one dependency does not exhaust the others

Service A → Circuit Breaker → Retry → Timeout → Fallback → result
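
To make the Circuit Breaker step of this flow concrete, here is a minimal sketch in plain Java. It is a simplification, not how any particular library implements it: the class `SimpleCircuitBreaker` and its parameters are hypothetical, it counts consecutive failures rather than a failure rate, and it holds a lock for the duration of the call.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal circuit breaker sketch (hypothetical class, not a library API). */
class SimpleCircuitBreaker {
    private final int failureThreshold;  // consecutive failures that open the circuit
    private final long openMillis;       // how long the circuit stays open
    private final LongSupplier clock;    // injectable clock, e.g. System::currentTimeMillis
    private int consecutiveFailures = 0;
    private long openedAt = -1;          // -1 means the circuit has not been opened

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True while calls are rejected; false when closed or when a trial call is allowed. */
    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    /** Runs the action unless the circuit is open; failures are answered by the fallback. */
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();       // fail fast without touching the backend
        }
        try {
            T result = action.get();     // closed state, or a half-open trial call
            consecutiveFailures = 0;
            openedAt = -1;               // success closes the circuit again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong();  // open, or re-open after a failed trial
            }
            return fallback.get();
        }
    }
}
```

A caller would create one breaker per downstream dependency, for example `new SimpleCircuitBreaker(5, 10_000, System::currentTimeMillis)`, and route every call to that dependency through `call(...)`.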

🟡 Middle Level

Resilience4j

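// "backend" is the instance name configured in application.yml.
// fallbackMethod is invoked when the call fails or the circuit is open; it must
// take the same parameters plus a trailing exception and return the same type.
// Resilience4j intercepts the call via a Spring AOP proxy, so the annotations
// apply only to calls coming from another bean, not to self-invocation.
// Default aspect order: Retry( CircuitBreaker( RateLimiter( TimeLimiter( Bulkhead( method ))))).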
@CircuitBreaker(name = "backend", fallbackMethod = "fallback")
@Retry(name = "backend")
@TimeLimiter(name = "backend")
@Bulkhead(name = "backend")
public CompletableFuture<String> callBackend(String input) {
    return backendService.callAsync(input);
}

public CompletableFuture<String> fallback(String input, Throwable t) {
    return CompletableFuture.completedFuture("fallback: " + input);
}

Configuration

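resilience4j:
  circuitbreaker:
    instances:
      backend:
        failureRateThreshold: 50        # percentage of failed calls that opens the circuit
        waitDurationInOpenState: 10s    # time spent open before trial calls are allowed
  retry:
    instances:
      backend:
        maxAttempts: 3                  # total attempts, including the first call
        waitDuration: 1s                # fixed delay between attempts
  timelimiter:
    instances:
      backend:
        timeoutDuration: 5s             # time limit applied to each attempt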
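
These values interact: the worst-case time a caller can wait depends on all of them together. The sketch below works that out, assuming Retry wraps the TimeLimiter (the default Resilience4j aspect order), that `maxAttempts` counts the first call, and that the wait between attempts is fixed. `RetryBudget` is a hypothetical helper, not part of Resilience4j.

```java
import java.time.Duration;

/** Hypothetical helper: estimates the worst-case latency of a retried, time-limited call. */
class RetryBudget {
    /**
     * Worst case: every attempt runs into the timeout, and a fixed wait
     * separates each pair of consecutive attempts.
     */
    static Duration worstCase(int maxAttempts, Duration timeoutPerAttempt, Duration waitBetween) {
        Duration timeouts = timeoutPerAttempt.multipliedBy(maxAttempts);
        Duration waits = waitBetween.multipliedBy(maxAttempts - 1L);
        return timeouts.plus(waits);
    }
}
```

With the configuration above, `RetryBudget.worstCase(3, Duration.ofSeconds(5), Duration.ofSeconds(1))` gives 3 × 5s + 2 × 1s = 17 seconds, which is the number to compare against the timeout of whoever calls this service.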

Common mistakes

Retry without exponential backoff:

Three attempts fired back-to-back, with no delay between them, can up to triple the load on a service that is already struggling.
Solution: exponential backoff with jitter (see [[20. What is exponential backoff]]).
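
As a rough illustration of exponential backoff with jitter, here is a plain-Java sketch. It uses the "full jitter" variant (a random delay between zero and the exponential bound), which is one common choice among several. `BackoffRetry` and its parameter names are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Hypothetical retry helper using exponential backoff with full jitter. */
class BackoffRetry {
    /** Upper bound of the delay before retry number `retry` (1-based): base * 2^(retry-1), capped. */
    static long maxDelayMillis(int retry, long baseMillis, long capMillis) {
        long exponential = baseMillis << Math.min(retry - 1, 20); // clamp the shift to avoid overflow
        return Math.min(capMillis, exponential);
    }

    /** Full jitter: a random delay in [0, maxDelay] so that clients do not retry in lockstep. */
    static long jitteredDelayMillis(int retry, long baseMillis, long capMillis) {
        long bound = maxDelayMillis(retry, baseMillis, capMillis);
        return ThreadLocalRandom.current().nextLong(bound + 1);
    }

    /** Calls the action up to maxAttempts times, sleeping a jittered delay between attempts. */
    static <T> T callWithRetry(Supplier<T> action, int maxAttempts, long baseMillis, long capMillis) {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) {
                    throw e;                       // out of attempts: surface the last error
                }
                try {
                    Thread.sleep(jitteredDelayMillis(attempt, baseMillis, capMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
            }
        }
    }
}
```

With a base of 100 ms, the delay bounds grow as 100, 200, 400 ms and so on until they reach the cap. The jitter spreads retries from many clients over that window instead of letting them hit the service at the same instant.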

🔴 Senior Level

Chaos Engineering

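Fault tolerance testing:
- Chaos Monkey (Netflix): randomly terminates service instances
- Gremlin: controlled fault injection platform
- Toxiproxy: simulates bad network conditions such as latency and dropped connections

Goal: find weak spots before they cause a real outage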
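
The idea of fault injection can be tried at small scale inside a test suite. The sketch below wraps a call and injects latency and random failures. `FaultInjector` is a hypothetical test helper and says nothing about how Chaos Monkey, Gremlin, or Toxiproxy work internally.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Hypothetical fault-injecting wrapper for resilience tests. */
class FaultInjector {
    private final double failureProbability; // 0.0 = never fail, 1.0 = always fail
    private final long addedLatencyMillis;   // extra delay added before every call
    private final Random random;             // seeded so experiments are reproducible

    FaultInjector(double failureProbability, long addedLatencyMillis, long seed) {
        this.failureProbability = failureProbability;
        this.addedLatencyMillis = addedLatencyMillis;
        this.random = new Random(seed);
    }

    /** Delays the call, then either throws an injected fault or runs the real action. */
    <T> T call(Supplier<T> action) {
        if (addedLatencyMillis > 0) {
            try {
                Thread.sleep(addedLatencyMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted during injected latency", e);
            }
        }
        if (random.nextDouble() < failureProbability) {
            throw new IllegalStateException("injected fault");
        }
        return action.get();
    }
}
```

Wrapping a dependency in such an injector during tests is one way to check that the timeouts, retries, and fallbacks around it actually kick in.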

Production Experience

Multi-region deployment:

Region 1 (active) → traffic
Region 2 (passive) → standby
Region 1 down → failover to Region 2
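
The failover rule above can be sketched on the client side as "try regions in priority order". In real deployments failover is usually handled by DNS or a global load balancer with health checks. The class `RegionFailover` and the region names used with it are hypothetical.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical client-side failover: try each region in priority order. */
class RegionFailover {
    /** Returns the first successful response; fails only if every region fails. */
    static <T> T callWithFailover(List<String> regionsByPriority, Function<String, T> call) {
        RuntimeException last = null;
        for (String region : regionsByPriority) {
            try {
                return call.apply(region);   // active region first, then standby
            } catch (RuntimeException e) {
                last = e;                    // remember the error and try the next region
            }
        }
        throw new IllegalStateException("all regions failed", last);
    }
}
```

For example, `RegionFailover.callWithFailover(List.of("region-1", "region-2"), client::fetch)` would only reach `region-2` when the call to `region-1` throws, where `client::fetch` stands for whatever function performs the request against a given region.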

Best Practices

✅ Circuit Breaker for external calls
✅ Retry with exponential backoff
✅ Timeout on all calls
✅ Fallback for graceful degradation
✅ Bulkhead for isolation
✅ Monitoring and alerting

❌ Calls without a timeout
❌ Retry without backoff
❌ No fallback
❌ No monitoring
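
The "timeout on all calls" and "fallback" practices can be combined in a few lines of plain Java (9 or later). This is a sketch under simple assumptions: `TimeoutCall` is a hypothetical helper, the action runs on the common pool, and the underlying work is not cancelled when the timeout fires.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: bounded wait with a fallback value. */
class TimeoutCall {
    /** Returns the action's result, or the fallback if it is too slow or fails. */
    static <T> T callWithTimeout(Supplier<T> action, long timeoutMillis, T fallback) {
        return CompletableFuture.supplyAsync(action)
                // if no result arrives in time, complete with the fallback instead
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                // if the action throws, degrade to the same fallback
                .exceptionally(e -> fallback)
                .join();
    }
}
```

The caller never waits longer than roughly `timeoutMillis`, which is what protects its own threads from piling up behind a hanging dependency.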

🎯 Interview Cheat Sheet

Must know:

5 key patterns: Circuit Breaker, Retry, Timeout, Fallback, Bulkhead
Circuit Breaker blocks calls to a non-working service
Retry with exponential backoff + jitter repeats failed calls with an increasing, randomized delay
Timeout limits wait time — protects against hanging
Fallback is an alternative path on failure (graceful degradation)
Bulkhead isolates resources — failure in one does not affect others
Chaos Engineering (Chaos Monkey, Gremlin) tests fault tolerance by injecting failures, so weak spots are found before a real outage exposes them

Frequent follow-up questions:

Why is retry without backoff bad? Immediate retries multiply traffic to a failing service (up to 3x with 3 attempts) and worsen the problem.
What is Chaos Engineering? Intentionally introducing failures (kill a service, add latency) to find weak spots.
Resilience4j vs Hystrix? Hystrix is in maintenance mode, Resilience4j is the modern standard.
How does multi-region deployment ensure fault tolerance? Region 1 down → failover to Region 2.

Red flags (NOT to say):

“Retry without backoff is fine” — no, service overload
“Circuit Breaker is not needed, we have a reliable network” — failure can be on the service side
“Fallback = ignore the error” — no, it’s an alternative path (cache, default value)
“Bulkhead = Circuit Breaker” — no, Bulkhead isolates resources, CB blocks calls
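
To make the Bulkhead vs Circuit Breaker distinction concrete: a bulkhead limits how many calls may run at once, regardless of whether they succeed. Below is a minimal semaphore-based sketch. `SemaphoreBulkhead` is a hypothetical class, and thread-pool bulkheads are another common variant.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical semaphore bulkhead: caps the number of concurrent calls to one dependency. */
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free; otherwise rejects immediately via the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();   // compartment is full: do not queue, do not block
        }
        try {
            return action.get();
        } finally {
            permits.release();       // always free the permit, even on failure
        }
    }
}
```

Giving each downstream dependency its own bulkhead means a slow dependency can occupy at most its own permits, not every thread in the service.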

Related topics:

[[5. What is Circuit Breaker pattern]]
[[6. How does Circuit Breaker work and what states does it have]]
[[18. What is Bulkhead pattern]]
[[19. What is Retry pattern and how to use it correctly]]
[[20. What is exponential backoff]]