How to ensure fault tolerance of microservices
🟢 Junior Level
Fault tolerance is the ability of a system to continue operating when individual components fail.
Key patterns:
- Circuit Breaker — opens the circuit on errors, returns fallback
- Retry — repeats a failed call with a delay
- Timeout — limits the wait time for a response
- Fallback — alternative path on failure
- Bulkhead — separates thread pools so a failure in one does not affect others
Service A → Circuit Breaker → Retry → Timeout → Fallback → result
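As a rough illustration of how two links of this chain compose, here is a minimal sketch in plain Java (9+) that puts a timeout on an asynchronous call and substitutes a fallback value when the call fails or times out. The class `ResilientCall` and its method are hypothetical names made up for this example; it is not how any particular library implements the chain.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper for illustration only.
final class ResilientCall {
    private ResilientCall() {}

    /**
     * Starts the call, fails it with a TimeoutException if it does not
     * complete within the given timeout, and maps any failure
     * (including the timeout) to the fallback value.
     */
    static <T> CompletableFuture<T> withTimeoutAndFallback(
            Supplier<CompletableFuture<T>> call, Duration timeout, T fallback) {
        CompletableFuture<T> future;
        try {
            future = call.get();
        } catch (RuntimeException e) {
            // The call failed before it even returned a future.
            return CompletableFuture.completedFuture(fallback);
        }
        return future
                .orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
                .exceptionally(error -> fallback);
    }
}
```

A caller would wrap the remote call in a supplier, for example `ResilientCall.withTimeoutAndFallback(() -> client.fetchAsync(id), Duration.ofSeconds(2), cachedValue)`, where `client`, `fetchAsync` and `cachedValue` are placeholders.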
🟡 Middle Level
Resilience4j
// "backend" — configuration name from application.yml.
// fallbackMethod — method called when circuit breaker triggers.
// Resilience4j intercepts the call via AOP proxy.
// Default aspect order (outermost first), regardless of annotation order:
// Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( call ) ) ) ) )
@CircuitBreaker(name = "backend", fallbackMethod = "fallback")
@Retry(name = "backend")
@TimeLimiter(name = "backend")
@Bulkhead(name = "backend")
public CompletableFuture<String> callBackend(String input) {
    return backendService.callAsync(input);
}

// Same parameters as callBackend plus the exception that triggered the fallback.
public CompletableFuture<String> fallback(String input, Throwable t) {
    return CompletableFuture.completedFuture("fallback: " + input);
}
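To make the isolation idea behind `@Bulkhead` more concrete, here is a minimal sketch of a semaphore-based bulkhead in plain Java. `SimpleBulkhead` is a hypothetical class written for this note, not Resilience4j's implementation; a thread-pool variant would give each dependency its own executor instead of a permit counter.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical class for illustration only.
final class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /**
     * Runs the call if a permit is free. When all permits are taken,
     * rejects immediately instead of queueing, so a slow dependency
     * cannot absorb every caller thread.
     */
    <T> T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving each downstream dependency its own `SimpleBulkhead` instance is what provides the isolation: exhausting the permits for one dependency leaves the others untouched.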
Configuration
resilience4j:
  circuitbreaker:
    instances:
      backend:
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
  retry:
    instances:
      backend:
        maxAttempts: 3
        waitDuration: 1s
  timelimiter:
    instances:
      backend:
        timeoutDuration: 5s
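To show roughly what `failureRateThreshold` and `waitDurationInOpenState` control, here is a simplified count-based circuit breaker in plain Java. It is a sketch under stated assumptions, not Resilience4j's implementation: `SimpleCircuitBreaker`, the fixed window size, the single trial call in the half-open state and the injectable clock are all choices made for this example.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker for illustration only.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final double failureRateThreshold; // percent, e.g. 50
    private final long waitInOpenNanos;
    private final LongSupplier nanoClock;      // injectable so tests can control time
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure
    private State state = State.CLOSED;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         Duration waitInOpen, LongSupplier nanoClock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpenNanos = waitInOpen.toNanos();
        this.nanoClock = nanoClock;
    }

    /** Current state; OPEN becomes HALF_OPEN once the wait duration has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && nanoClock.getAsLong() - openedAt >= waitInOpenNanos) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the action unless the circuit is open; failures are recorded and answered with the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // fail fast, the backend is not called
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private synchronized void record(boolean failure) {
        if (state == State.OPEN) {
            return; // opened concurrently; ignore late results
        }
        if (state == State.HALF_OPEN) {
            // A single trial call decides: close on success, reopen on failure.
            if (failure) open(); else state = State.CLOSED;
            return;
        }
        outcomes.addLast(failure);
        if (outcomes.size() > windowSize) outcomes.removeFirst();
        if (outcomes.size() == windowSize) {
            long failures = outcomes.stream().filter(f -> f).count();
            if (failures * 100.0 / windowSize >= failureRateThreshold) open();
        }
    }

    private void open() {
        state = State.OPEN;
        openedAt = nanoClock.getAsLong();
        outcomes.clear();
    }
}
```

With a window of 4 calls and a threshold of 50, two failures out of the last four calls open the circuit; after the wait duration one trial call is let through to decide whether to close it again.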
Common mistakes
- Retry without exponential backoff:
Three immediate retries with no delay multiply the load on a service that is already struggling (up to 3× the traffic). Solution: exponential backoff with jitter, see [[20. What is exponential backoff]].
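A minimal sketch of retry with exponential backoff and jitter in plain Java. It uses the "full jitter" variant (a random delay between 0 and the capped exponential value), which is one common choice among several; `BackoffRetry`, its method names and its parameters are hypothetical and exist only for this example.

```java
import java.time.Duration;
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical helper for illustration only.
final class BackoffRetry {
    private BackoffRetry() {}

    /** Full jitter: random delay in [0, min(cap, base * 2^attempt)), attempt is zero-based. */
    static Duration delayFor(int attempt, Duration base, Duration cap, Random random) {
        long exponential = base.toMillis() << Math.min(attempt, 20); // cap the shift to avoid overflow
        long upperBound = Math.min(cap.toMillis(), exponential);
        return Duration.ofMillis((long) (random.nextDouble() * upperBound));
    }

    /** Calls the supplier up to maxAttempts times, sleeping with backoff between failed attempts. */
    static <T> T retry(Supplier<T> call, int maxAttempts, Duration base, Duration cap, Random random) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts - 1) {
                    sleep(delayFor(attempt, base, cap, random));
                }
            }
        }
        throw last; // all attempts failed
    }

    private static void sleep(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

The jitter matters as much as the growth: without it, many clients that failed at the same moment would all retry at the same moment again.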
🔴 Senior Level
Chaos Engineering
Fault tolerance testing:
- Chaos Monkey (Netflix) — randomly terminates service instances
- Gremlin — fault injection
- Toxiproxy — network latency
Goal: find weak spots before they cause a real outage
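The tools above inject faults at the infrastructure or network level. The same idea can be tried in-process with a small wrapper, sketched below in plain Java. `FaultInjector` is a hypothetical class written for this note and does not reflect the API of Chaos Monkey, Gremlin or Toxiproxy.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical in-process fault injector for illustration only.
final class FaultInjector {
    private final Random random;
    private final double failureProbability; // 0.0 = never fail, 1.0 = always fail
    private final long addedLatencyMillis;

    FaultInjector(Random random, double failureProbability, long addedLatencyMillis) {
        this.random = random;
        this.failureProbability = failureProbability;
        this.addedLatencyMillis = addedLatencyMillis;
    }

    /** Adds the configured latency, then either throws an injected fault or runs the real call. */
    <T> T call(Supplier<T> action) {
        if (addedLatencyMillis > 0) {
            try {
                Thread.sleep(addedLatencyMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted during injected latency", e);
            }
        }
        if (random.nextDouble() < failureProbability) {
            throw new IllegalStateException("injected fault");
        }
        return action.get();
    }
}
```

Wrapping a dependency call in such an injector in a test environment makes it possible to check that timeouts, retries and fallbacks actually engage when the dependency misbehaves.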
Production Experience
Multi-region deployment:
Region 1 (active) → traffic
Region 2 (passive) → standby
Region 1 down → failover to Region 2
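In practice, region failover is usually handled by DNS or a global load balancer rather than by application code. Purely to illustrate the active/passive logic above, here is a client-side sketch in plain Java; `RegionFailover`, the region names and the call function are assumptions made for this example.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical client-side failover helper for illustration only.
final class RegionFailover {
    private RegionFailover() {}

    /**
     * Tries the regions in order (active first, standby after) and returns
     * the first successful result. Rethrows the last failure if every region fails.
     */
    static <T> T callWithFailover(List<String> regions, Function<String, T> callRegion) {
        RuntimeException last = null;
        for (String region : regions) {
            try {
                return callRegion.apply(region);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next region
            }
        }
        throw last != null ? last : new IllegalArgumentException("no regions configured");
    }
}
```

Calling it with `List.of("region-1", "region-2")` sends traffic to region 1 and only touches region 2 when the call to region 1 throws.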
Best Practices
✅ Circuit Breaker for external calls
✅ Retry with exponential backoff
✅ Timeout on all calls
✅ Fallback for graceful degradation
✅ Bulkhead for isolation
✅ Monitoring and alerting
❌ Calls without a timeout
❌ Retry without backoff
❌ No fallback
❌ No monitoring
🎯 Interview Cheat Sheet
Must know:
- 5 key patterns: Circuit Breaker, Retry, Timeout, Fallback, Bulkhead
- Circuit Breaker blocks calls to a non-working service
- Retry with exponential backoff + jitter retries calls with increasing delay
- Timeout limits wait time — protects against hanging
- Fallback is an alternative path on failure (graceful degradation)
- Bulkhead isolates resources — failure in one does not affect others
- Chaos Engineering (Chaos Monkey, Gremlin) exposes weak spots before they cause real outages
Frequent follow-up questions:
- Why is retry without backoff bad? Triples traffic to a failing service — worsens the problem.
- What is Chaos Engineering? Intentionally introducing failures (kill a service, add latency) to find weak spots.
- Resilience4j vs Hystrix? Hystrix is in maintenance mode, Resilience4j is the modern standard.
- How does multi-region deployment ensure fault tolerance? Region 1 down → failover to Region 2.
Red flags (NOT to say):
- “Retry without backoff is fine” — no, service overload
- “Circuit Breaker is not needed, we have a reliable network” — failure can be on the service side
- “Fallback = ignore the error” — no, it’s an alternative path (cache, default value)
- “Bulkhead = Circuit Breaker” — no, Bulkhead isolates resources, CB blocks calls
Related topics:
- [[5. What is Circuit Breaker pattern]]
- [[6. How does Circuit Breaker work and what states does it have]]
- [[18. What is Bulkhead pattern]]
- [[19. What is Retry pattern and how to use it correctly]]
- [[20. What is exponential backoff]]