Question 17 · Section 17

How to ensure fault tolerance of microservices



🟢 Junior Level

Fault tolerance is the ability of a system to continue operating when individual components fail.

Key patterns:

  1. Circuit Breaker — opens the circuit on errors, returns fallback
  2. Retry — repeats a failed call with a delay
  3. Timeout — limits the wait time for a response
  4. Fallback — alternative path on failure
  5. Bulkhead — separates thread pools so a failure in one does not affect others
Service A → Circuit Breaker → Retry → Timeout → Fallback → result
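
As a rough illustration of the Timeout and Fallback patterns from the list above, the sketch below bounds the wait for a call and substitutes a default value when the call is slow or throws. It uses only the JDK (`CompletableFuture.orTimeout`, Java 9+). The class `TimeoutWithFallback` and its method names are hypothetical and not part of any library.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: bounds the wait for a remote call and substitutes a
// fallback value when the call is too slow or fails.
final class TimeoutWithFallback {

    static <T> T call(Supplier<T> remoteCall, long timeoutMillis, T fallbackValue) {
        return CompletableFuture
                .supplyAsync(remoteCall)                          // run on the common pool
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)  // Timeout: fail once the deadline passes
                .exceptionally(error -> fallbackValue)            // Fallback: used on timeout or exception
                .join();
    }

    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Fast call: the live value arrives before the deadline.
        String fast = call(() -> "live data", 500, "cached data");
        // Slow call: the deadline passes first, so the fallback is returned.
        String slow = call(() -> { sleep(2_000); return "live data"; }, 100, "cached data");
        System.out.println(fast);
        System.out.println(slow);
    }
}
```

Note that `orTimeout` only stops waiting. The underlying task keeps running on the pool, so a real client would normally also set a timeout at the network level.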

🟡 Middle Level

Resilience4j

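import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.concurrent.CompletableFuture;

// "backend": instance name whose settings come from application.yml.
// fallbackMethod: invoked when the protected call fails with an exception,
// including the rejection raised while the circuit breaker is open.
// Resilience4j intercepts the call via a Spring AOP proxy, so the annotations
// apply only to calls made from outside the bean, not to self-invocation.
@CircuitBreaker(name = "backend", fallbackMethod = "fallback")
@Retry(name = "backend")
@TimeLimiter(name = "backend")
@Bulkhead(name = "backend")
public CompletableFuture<String> callBackend(String input) {
    return backendService.callAsync(input);
}

// Same parameters as callBackend plus a trailing Throwable, same return type.
public CompletableFuture<String> fallback(String input, Throwable t) {
    return CompletableFuture.completedFuture("fallback: " + input);
}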
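
To make the isolation idea behind `@Bulkhead` more concrete, here is a minimal semaphore-based sketch in plain Java. It is not the Resilience4j implementation. `SimpleBulkhead`, `execute`, and `onRejected` are hypothetical names chosen for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical semaphore bulkhead: at most maxConcurrent callers may be inside
// the protected call at once; extra callers are rejected immediately.
final class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> call, Supplier<T> onRejected) {
        if (!permits.tryAcquire()) {
            return onRejected.get(); // bulkhead full: fail fast instead of queueing
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving each downstream dependency its own `SimpleBulkhead` means a slow dependency can exhaust only its own permits, while calls to other dependencies keep flowing.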

Configuration

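resilience4j:
  circuitbreaker:
    instances:
      backend:
        failureRateThreshold: 50          # percent of failed calls that opens the circuit
        waitDurationInOpenState: 10s      # how long calls are rejected before a trial is allowed
  retry:
    instances:
      backend:
        maxAttempts: 3                    # total attempts, including the first call
        waitDuration: 1s                  # delay before the first retry
        enableExponentialBackoff: true    # grow the delay after each failed attempt
        exponentialBackoffMultiplier: 2
  timelimiter:
    instances:
      backend:
        timeoutDuration: 5s               # maximum time to wait for the future to complete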
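
To show what `failureRateThreshold` and `waitDurationInOpenState` mean in practice, the following plain-Java sketch tracks the outcomes of the last few calls and opens once the failure rate reaches the threshold. It is a simplified model and not Resilience4j's code. The class `FailureRateBreaker` is hypothetical, the `windowSize` parameter is an assumption that the configuration above does not set, and the half-open state is reduced to "start a fresh window after the wait".

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical count-based breaker. Not thread-safe; illustration only.
final class FailureRateBreaker {
    private final int windowSize;
    private final double failureRateThreshold; // percent, e.g. 50
    private final long waitInOpenMillis;       // e.g. 10_000
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private long openedAt = -1;                // -1 means the circuit is closed

    FailureRateBreaker(int windowSize, double failureRateThreshold, long waitInOpenMillis) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpenMillis = waitInOpenMillis;
    }

    // Returns true if a call may go through at time nowMillis.
    boolean allowCall(long nowMillis) {
        if (openedAt < 0) {
            return true; // closed
        }
        if (nowMillis - openedAt >= waitInOpenMillis) {
            openedAt = -1;    // wait elapsed: close again and
            outcomes.clear(); // start measuring from scratch
            return true;
        }
        return false; // still open: reject without calling the service
    }

    // Records the outcome of a call and opens the circuit if the rate is too high.
    void record(boolean failed, long nowMillis) {
        outcomes.addLast(failed);
        if (outcomes.size() > windowSize) {
            outcomes.removeFirst(); // keep only the most recent windowSize outcomes
        }
        if (outcomes.size() == windowSize && failureRate() >= failureRateThreshold) {
            openedAt = nowMillis;
        }
    }

    // Failure rate of the current window, in percent.
    double failureRate() {
        if (outcomes.isEmpty()) {
            return 0.0;
        }
        long failures = outcomes.stream().filter(failed -> failed).count();
        return 100.0 * failures / outcomes.size();
    }
}
```

Time is passed in explicitly so the behaviour can be checked in a unit test without sleeping.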

Common mistakes

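  1. Retry without exponential backoff:
    Retrying immediately, with no delay, multiplies the load on a service that is already struggling: with 3 attempts per request, a failing service can receive up to 3x its normal traffic.
    Solution: exponential backoff + jitter (see [[20. What is exponential backoff]]).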
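
A minimal sketch of retry with exponential backoff and jitter, using only the JDK. The delay cap doubles with every retry and the actual delay is drawn at random below that cap (the "full jitter" variant, chosen here as an assumption). `BackoffRetry` and its method names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

final class BackoffRetry {

    // Upper bound of the delay before retry number `retry` (1-based):
    // baseMillis * 2^(retry - 1), capped at maxMillis.
    static long backoffCapMillis(int retry, long baseMillis, long maxMillis) {
        long exponential = baseMillis << Math.min(retry - 1, 20); // bounded shift avoids overflow for small bases
        return Math.min(exponential, maxMillis);
    }

    // Full jitter: a random delay in [0, cap], so clients do not retry in lockstep.
    static long jitteredDelayMillis(int retry, long baseMillis, long maxMillis) {
        long cap = backoffCapMillis(retry, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }

    // Calls `call` up to maxAttempts times in total, sleeping between attempts.
    static <T> T callWithRetry(Callable<T> call, int maxAttempts,
                               long baseMillis, long maxMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(jitteredDelayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last; // every attempt failed: surface the last error
    }

    public static void main(String[] args) {
        // Print the growing delay caps for the first four retries.
        for (int retry = 1; retry <= 4; retry++) {
            System.out.println("retry " + retry + ": up to " + backoffCapMillis(retry, 1_000, 30_000) + " ms");
        }
    }
}
```

The cap (`maxMillis`) keeps the delay from growing without bound, and the randomness spreads retries from many clients over time instead of letting them hit the recovering service in synchronized waves.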
    

🔴 Senior Level

Chaos Engineering

Fault tolerance testing:
- Chaos Monkey (Netflix) — randomly kills services
- Gremlin — fault injection
- Toxiproxy — network latency

Goal: find weak spots before they cause outages in production
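
The same idea can be tried on a small scale inside a test suite. The sketch below wraps a dependency so that it fails with a configurable probability and adds latency. It is not the API of Chaos Monkey, Gremlin, or Toxiproxy. `FaultInjector` is a hypothetical helper written for this example.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical in-process fault injector for tests.
final class FaultInjector {
    private final Random random;
    private final double failureProbability; // 0.0 (never fail) .. 1.0 (always fail)
    private final long addedLatencyMillis;   // extra delay before every call

    FaultInjector(long seed, double failureProbability, long addedLatencyMillis) {
        this.random = new Random(seed); // fixed seed keeps test runs reproducible
        this.failureProbability = failureProbability;
        this.addedLatencyMillis = addedLatencyMillis;
    }

    // Returns a supplier that behaves like `target` but with injected faults.
    <T> Supplier<T> wrap(Supplier<T> target) {
        return () -> {
            pause(addedLatencyMillis);
            if (random.nextDouble() < failureProbability) {
                throw new IllegalStateException("injected fault");
            }
            return target.get();
        };
    }

    private static void pause(long millis) {
        if (millis <= 0) {
            return;
        }
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A test can wrap a client with `new FaultInjector(42L, 1.0, 0).wrap(...)` and then assert that the calling code returns its fallback instead of propagating the error.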

Production Experience

Multi-region deployment:

Region 1 (active) → traffic
Region 2 (passive) → standby
Region 1 down → failover to Region 2
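
The active/passive flow above can be sketched as client-side logic that tries regions in priority order. In real deployments failover is often handled by DNS or a global load balancer with health checks, so treat this only as an illustration of the control flow. `RegionFailover` and the region names are hypothetical.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical active/passive failover: regions are tried in priority order
// and the first successful response wins.
final class RegionFailover {

    static <T> T callWithFailover(List<String> regions, Function<String, T> callRegion) {
        RuntimeException last = null;
        for (String region : regions) {
            try {
                return callRegion.apply(region);
            } catch (RuntimeException e) {
                last = e; // this region is unavailable, try the next one
            }
        }
        throw new IllegalStateException("all regions failed", last);
    }

    public static void main(String[] args) {
        // Simulate Region 1 being down: the call is served by Region 2.
        String response = callWithFailover(List.of("region-1", "region-2"), region -> {
            if (region.equals("region-1")) {
                throw new IllegalStateException("region-1 is down");
            }
            return "served by " + region;
        });
        System.out.println(response);
    }
}
```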

Best Practices

✅ Circuit Breaker for external calls
✅ Retry with exponential backoff
✅ Timeout on all calls
✅ Fallback for graceful degradation
✅ Bulkhead for isolation
✅ Monitoring and alerting

❌ Calls without a timeout
❌ Retry without backoff
❌ No fallback
❌ No monitoring

🎯 Interview Cheat Sheet

Must know:

  • 5 key patterns: Circuit Breaker, Retry, Timeout, Fallback, Bulkhead
  • Circuit Breaker blocks calls to a failing service and fails fast until it recovers
  • Retry with exponential backoff + jitter retries calls with increasing delay
  • Timeout limits wait time — protects against hanging
  • Fallback is an alternative path on failure (graceful degradation)
  • Bulkhead isolates resources — failure in one does not affect others
  • Chaos Engineering (Chaos Monkey, Gremlin) injects failures deliberately to expose weak spots before they cause production incidents

Frequent follow-up questions:

  • Why is retry without backoff bad? Immediate retries multiply traffic to an already failing service (up to 3x with 3 attempts per request) and make the overload worse.
  • What is Chaos Engineering? Intentionally introducing failures (kill a service, add latency) to find weak spots.
  • Resilience4j vs Hystrix? Hystrix is in maintenance mode, Resilience4j is the modern standard.
  • How does multi-region deployment ensure fault tolerance? Region 1 down → failover to Region 2.

Red flags (NOT to say):

  • “Retry without backoff is fine” — no, service overload
  • “Circuit Breaker is not needed, we have a reliable network” — failure can be on the service side
  • “Fallback = ignore the error” — no, it’s an alternative path (cache, default value)
  • “Bulkhead = Circuit Breaker” — no, Bulkhead isolates resources, CB blocks calls

Related topics:

  • [[5. What is Circuit Breaker pattern]]
  • [[6. How does Circuit Breaker work and what states does it have]]
  • [[18. What is Bulkhead pattern]]
  • [[19. What is Retry pattern and how to use it correctly]]
  • [[20. What is exponential backoff]]