How to implement horizontal scaling of microservices
Junior Level
Horizontal scaling means adding more instances of a service to handle load.
Vertical scaling (more CPU/RAM) hits the limit of a single server and requires downtime. Horizontal scaling is theoretically infinite and without downtime.
One instance:

```
Client -> Service
```

Horizontal scaling:

```
Client -> Load Balancer -> Service #1
                        -> Service #2
                        -> Service #3
```
Methods:
- Kubernetes — automatic (HPA, Horizontal Pod Autoscaler: K8s automatically adds Pods when load increases)
- Docker Compose — `docker-compose up --scale service=3`
- Cloud — auto-scaling groups
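For the Docker Compose route, a minimal sketch of what the Compose file has to look like for `--scale` to work (service and image names here are placeholders): the scaled service must not bind a fixed host port, or the replicas will collide.

```yaml
# docker-compose.yml — hypothetical stateless service behind a load balancer
services:
  service:
    image: my-org/user-service:latest   # placeholder image name
    expose:
      - "8080"          # container port only — no fixed host port, so replicas don't conflict
  lb:
    image: nginx:alpine # fronts the replicas; needs an upstream config pointing at service:8080
    ports:
      - "80:80"
```

With this layout, `docker-compose up --scale service=3` starts three interchangeable replicas of `service` behind the single published port of `lb`.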
Middle Level
When NOT to use horizontal scaling
- Stateful services (WebSocket connections, in-memory caches)
- Licensed software with per-instance pricing
- Services with expensive initialization (minutes to start)
Kubernetes HPA
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
# 70% — headroom for load spikes. At 90% a new instance won't have time to start up
# before the spike; at 50% you'll overpay for extra instances.
```
Statelessness
For horizontal scaling, services must be stateless:
(Stateless — the service doesn't keep state in memory; any instance is interchangeable.)
✅ No local state
✅ Session in Redis
✅ Data in DB
✅ Configuration from outside
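The "session in Redis" point can be sketched as follows. This is a minimal illustration, not a real implementation: `DictSessionStore` and `UserService` are hypothetical names, and the dict-backed store stands in for what would be a Redis client (e.g. redis-py with GET/SET and a TTL) in production.

```python
# Sketch: keep session state outside the instance so any replica can serve any request.

class DictSessionStore:
    """Dict-backed stand-in for Redis. The point is the interface:
    all instances read and write the same shared store."""
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def put(self, session_id, session):
        self._data[session_id] = session


class UserService:
    """Stateless: no fields besides injected dependencies, so any
    instance behind the load balancer can handle any request."""
    def __init__(self, sessions):
        self.sessions = sessions

    def handle_request(self, session_id):
        session = self.sessions.get(session_id) or {"visits": 0}
        session["visits"] += 1
        self.sessions.put(session_id, session)
        return session["visits"]


store = DictSessionStore()        # shared store, as Redis would be
instance_a = UserService(store)   # two "replicas" sharing one store
instance_b = UserService(store)
instance_a.handle_request("s1")
instance_b.handle_request("s1")   # instance B sees the session A created
```

Because neither instance holds the session itself, the load balancer is free to route each request to any replica.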
Common mistakes
- Stateful services:
Session stored in memory -> after scaling, the next request lands on a different instance -> the session is lost. Solution: external session storage (Redis).
Senior Level
Custom metrics
```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 1000
```
(Pods-type metrics are not built in: the cluster needs a metrics adapter, such as the Prometheus Adapter, to expose them to the HPA.)
Production Experience
Blue-Green Deployment:
(Blue-Green Deployment — deployment strategy without downtime: two environments, traffic switch.)
v1 (Blue) -> production traffic
v2 (Green) -> deployed, being tested
Switch traffic to v2 -> roll back if problems
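In Kubernetes, one common way to do the traffic switch is flipping a label selector on the Service. A sketch (all names are illustrative): Blue and Green Deployments carry different `version` labels, and changing the selector moves all traffic at once.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  selector:
    app: user-service
    version: blue      # flip to "green" to send all traffic to v2; flip back to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Rolling back is the same one-line change in the opposite direction, which is what makes the strategy attractive.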
Best Practices
✅ Stateless services
✅ Health checks
✅ Graceful shutdown
✅ Resource limits
✅ Monitoring + alerting
❌ Stateful without external storage
❌ Without resource limits
❌ Without graceful shutdown
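The "graceful shutdown" item above can be sketched like this. It is a minimal illustration with no real server: `GracefulWorker` is a hypothetical name, and a production service would additionally deregister from service discovery and close DB connections.

```python
# Sketch of graceful shutdown: on SIGTERM, stop accepting new work,
# finish what's in flight, then exit.
import signal
import queue

class GracefulWorker:
    def __init__(self):
        self.inbox = queue.Queue()
        self.shutting_down = False
        self.processed = []
        # Kubernetes sends SIGTERM before killing the Pod
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.shutting_down = True   # stop accepting; drain and exit

    def submit(self, task):
        if self.shutting_down:
            raise RuntimeError("shutting down: rejecting new work")
        self.inbox.put(task)

    def drain(self):
        # finish all in-flight tasks before the process exits
        while not self.inbox.empty():
            self.processed.append(self.inbox.get())


worker = GracefulWorker()
worker.submit("req-1")
worker._on_sigterm(signal.SIGTERM, None)  # simulate Kubernetes sending SIGTERM
worker.drain()                            # req-1 completes; new work is now rejected
```

The window between SIGTERM and SIGKILL is bounded by the Pod's `terminationGracePeriodSeconds`, so the drain must finish within it.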
Interview Cheat Sheet
Must know:
- Horizontal scaling = more instances behind a load balancer
- Vertical scaling = more CPU/RAM, hits the limit of a single server
- Services MUST be stateless for horizontal scaling
- Kubernetes HPA — automatic scaling based on CPU/metrics (70% CPU target)
- Session in Redis, data in DB, configuration from outside
- Blue-Green deployment — deployment without downtime
- NOT suitable for stateful services (WebSocket, in-memory cache)
Common follow-up questions:
- Why a 70% CPU target? Headroom for load spikes — at 90% a new instance won’t have time to start up before the spike.
- How to make a service stateless? Session in Redis, data in DB, configuration from outside, no local state.
- What is graceful shutdown? Finishing in-flight requests before stopping, and deregistering from the service registry.
- Custom metrics for HPA? http_requests_per_second, queue length, business metrics.
Red flags (DO NOT say):
- “Stateful services are easy to scale” — no, external state management is needed
- “Vertical scaling is always simpler” — it is simpler at first, but it hits the limit of a single server
- “HPA at 90% CPU — efficient” — no, won’t scale fast enough during a spike
- “Session in memory is fine” — no, requests will go to a different instance
Related topics:
- [[10. What is sharding]]
- [[11. What is the difference between sharding and partitioning]]
- [[26. What tools are used for microservice orchestration]]
- [[7. What is Service Discovery and why is it needed]]
- [[13. What is Database per Service pattern]]