How to implement horizontal scaling of microservices
Junior Level
Horizontal scaling means adding more instances of a service to handle load.
Vertical scaling (more CPU/RAM) hits the limit of a single server and requires downtime. Horizontal scaling is theoretically infinite and without downtime.
One instance:

```
Client -> Service
```

Horizontal scaling:

```
Client -> Load Balancer -> Service #1
                        -> Service #2
                        -> Service #3
```
Methods:
- Kubernetes — automatic (HPA, Horizontal Pod Autoscaler: K8s automatically adds Pods when load increases)
- Docker Compose — `docker-compose up --scale service=3`
- Cloud — auto-scaling groups
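For the Docker Compose route, a minimal sketch of what the Compose file has to look like for `--scale` to work (service and image names here are placeholders): the scaled service must not bind a fixed host port, or the replicas will collide.

```yaml
# docker-compose.yml — hypothetical stateless service behind a load balancer
services:
  service:
    image: my-org/user-service:latest   # placeholder image name
    expose:
      - "8080"          # container port only — no fixed host port, so replicas don't conflict
  lb:
    image: nginx:alpine # fronts the replicas; needs an upstream config pointing at service:8080
    ports:
      - "80:80"
```

With this layout, `docker-compose up --scale service=3` starts three interchangeable replicas of `service` behind the single published port of `lb`.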
Middle Level
When NOT to use horizontal scaling
- Stateful services (WebSocket connections, in-memory caches)
- Licensed software with per-instance pricing
- Services with expensive initialization (minutes to start)
Kubernetes HPA
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
# 70% — headroom for load spikes. At 90% a new instance won't have time to start up
# before the spike; at 50% you'll overpay for extra instances.
```
Statelessness
For horizontal scaling, services must be stateless:
(Stateless — the service doesn't keep state in memory; any instance is interchangeable.)
✅ No local state
✅ Session in Redis
✅ Data in DB
✅ Configuration from outside
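The "session in Redis" point can be sketched as follows. This is a minimal illustration, not a real implementation: `DictSessionStore` and `UserService` are hypothetical names, and the dict-backed store stands in for what would be a Redis client (e.g. redis-py with GET/SET and a TTL) in production.

```python
# Sketch: keep session state outside the instance so any replica can serve any request.

class DictSessionStore:
    """Dict-backed stand-in for Redis. The point is the interface:
    all instances read and write the same shared store."""
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def put(self, session_id, session):
        self._data[session_id] = session


class UserService:
    """Stateless: no fields besides injected dependencies, so any
    instance behind the load balancer can handle any request."""
    def __init__(self, sessions):
        self.sessions = sessions

    def handle_request(self, session_id):
        session = self.sessions.get(session_id) or {"visits": 0}
        session["visits"] += 1
        self.sessions.put(session_id, session)
        return session["visits"]


store = DictSessionStore()        # shared store, as Redis would be
instance_a = UserService(store)   # two "replicas" sharing one store
instance_b = UserService(store)
instance_a.handle_request("s1")
instance_b.handle_request("s1")   # instance B sees the session A created
```

Because neither instance holds the session itself, the load balancer is free to route each request to any replica.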
Common mistakes
- Stateful services:
Session stored in memory -> after scaling, the next request lands on a different instance -> the session is lost. Solution: external session storage (Redis).
Senior Level
Custom metrics
```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 1000
```
(Pods-type metrics are not built in: the cluster needs a metrics adapter, such as the Prometheus Adapter, to expose them to the HPA.)
Production Experience
Blue-Green Deployment:
(Blue-Green Deployment — deployment strategy without downtime: two environments, traffic switch.)
v1 (Blue) -> production traffic
v2 (Green) -> deployed, being tested
Switch traffic to v2 -> roll back if problems
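In Kubernetes, one common way to do the traffic switch is flipping a label selector on the Service. A sketch (all names are illustrative): Blue and Green Deployments carry different `version` labels, and changing the selector moves all traffic at once.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  selector:
    app: user-service
    version: blue      # flip to "green" to send all traffic to v2; flip back to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Rolling back is the same one-line change in the opposite direction, which is what makes the strategy attractive.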
Best Practices
✅ Stateless services
✅ Health checks
✅ Graceful shutdown
✅ Resource limits
✅ Monitoring + alerting
❌ Stateful without external storage
❌ Without resource limits
❌ Without graceful shutdown
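The "graceful shutdown" item above can be sketched like this. It is a minimal illustration with no real server: `GracefulWorker` is a hypothetical name, and a production service would additionally deregister from service discovery and close DB connections.

```python
# Sketch of graceful shutdown: on SIGTERM, stop accepting new work,
# finish what's in flight, then exit.
import signal
import queue

class GracefulWorker:
    def __init__(self):
        self.inbox = queue.Queue()
        self.shutting_down = False
        self.processed = []
        # Kubernetes sends SIGTERM before killing the Pod
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.shutting_down = True   # stop accepting; drain and exit

    def submit(self, task):
        if self.shutting_down:
            raise RuntimeError("shutting down: rejecting new work")
        self.inbox.put(task)

    def drain(self):
        # finish all in-flight tasks before the process exits
        while not self.inbox.empty():
            self.processed.append(self.inbox.get())


worker = GracefulWorker()
worker.submit("req-1")
worker._on_sigterm(signal.SIGTERM, None)  # simulate Kubernetes sending SIGTERM
worker.drain()                            # req-1 completes; new work is now rejected
```

The window between SIGTERM and SIGKILL is bounded by the Pod's `terminationGracePeriodSeconds`, so the drain must finish within it.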
Interview Cheat Sheet
Must know:
- Horizontal scaling = more instances behind a load balancer
- Vertical scaling = more CPU/RAM, hits the limit of a single server
- Services MUST be stateless for horizontal scaling
- Kubernetes HPA — automatic scaling based on CPU/metrics (70% CPU target)
- Session in Redis, data in DB, configuration from outside
- Blue-Green deployment — deployment without downtime
- NOT suitable for stateful services (WebSocket, in-memory cache)
Common follow-up questions:
- Why a 70% CPU target? Headroom for load spikes — at 90% a new instance won’t have time to start up before the spike.
- How to make a service stateless? Session in Redis, data in DB, configuration from outside, no local state.
- What is graceful shutdown? Finishing in-flight requests before stopping, and deregistering from the service registry.
- Custom metrics for HPA? http_requests_per_second, queue length, business metrics.
Red flags (DO NOT say):
- “Stateful services are easy to scale” — no, external state management is needed
- “Vertical scaling is always simpler” — it is simpler at first, but it hits the limit of a single server
- “HPA at 90% CPU — efficient” — no, won’t scale fast enough during a spike
- “Session in memory is fine” — no, requests will go to a different instance
Related topics:
- [[10. What is sharding]]
- [[11. What is the difference between sharding and partitioning]]
- [[26. What tools are used for microservice orchestration]]
- [[7. What is Service Discovery and why is it needed]]
- [[13. What is Database per Service pattern]]