How does scaling work in Kubernetes?
Junior Level
Simple Explanation
Scaling is the process of changing the amount of resources allocated to an application. In Kubernetes, you can scale at two levels:
- More copies of the application (horizontal) — launch additional Pods
- More resources for a Pod (vertical) — give more CPU/RAM
Horizontal Scaling (more copies)
1 copy: [App] ← 100 req/s → slows down
3 copies: [App] [App] [App] ← 100 req/s → works fast
This is the most common approach. More copies = more requests handled.
Vertical Scaling (more resources)
Low RAM: [App: 256MB] ← OutOfMemoryError
High RAM: [App: 1GB] ← works normally
Less common approach. Not all applications can efficiently use more resources.
Manual Scaling
# Increase replica count to 5
kubectl scale deployment myapp --replicas=5
# Check
kubectl get deployment myapp
Automatic Scaling
Kubernetes can automatically change the replica count:
- HPA (Horizontal Pod Autoscaler) — based on CPU or other metrics
- Cluster Autoscaler — adds servers (Nodes) when resources are insufficient
What a Junior Developer Should Remember
- Horizontal = more copies (most common approach)
- Vertical = more resources per Pod
- Manual: kubectl scale deployment --replicas=N
- Automatic: HPA (by CPU or other metrics), Cluster Autoscaler (by Nodes)
- HPA requires Requests (minimum resources) to be specified
Middle Level
Types of Scaling
HPA — Horizontal Pod Autoscaler
Changes the number of Pod replicas:
# HPA: maintain ~50% CPU utilization
kubectl autoscale deployment myapp --cpu-percent=50 --min=2 --max=10
HPA does not work well with stateful workloads (databases) — new Pods have no data. For stateful applications, use StatefulSet + manual scaling.
Metric sources:
- Resource metrics: CPU, Memory (from Metrics Server)
- Custom metrics: business metrics from Prometheus
- External metrics: external queues (AWS SQS, RabbitMQ)
VPA — Vertical Pod Autoscaler
Automatically adjusts the CPU/RAM requests and limits of a Pod:
- Requires Pod restart to apply
- Useful for finding optimal resource requests
- Not recommended to use together with HPA on CPU
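A minimal VPA manifest in recommendation-only mode might look like the sketch below (it assumes the VPA components are installed in the cluster; the names `myapp-vpa` and `myapp` are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa          # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp            # the workload to analyze
  updatePolicy:
    updateMode: "Off"      # recommendations only, no automatic Pod restarts
```

With `updateMode: "Off"`, VPA only publishes recommendations (visible via `kubectl describe vpa myapp-vpa`), which is the safest way to start.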
Cluster Autoscaler
Adds/removes Nodes in the cluster:
- If Pods are in Pending state (insufficient resources) → adds Node
- If Nodes are underutilized → removes them to save costs
How Does HPA Make Decisions?
Formula:
desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]
Example:
- Desired CPU: 50%
- Current CPU: 100%
- Current replicas: 2
- Decision: ceil[2 × (100/50)] = 4 replicas
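The calculation above can be reproduced with a quick shell sketch (values hardcoded to match the example):

```shell
# HPA formula: desiredReplicas = ceil(currentReplicas * currentMetric / desiredMetric)
current=2; cpu_now=100; cpu_target=50
desired=$(awk -v r="$current" -v c="$cpu_now" -v t="$cpu_target" \
  'BEGIN { d = r * c / t; if (d > int(d)) d = int(d) + 1; print d }')
echo "$desired"   # prints 4
```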
Requirements for HPA
- Metrics Server must be installed
- Requests must be specified in the Pod (HPA calculates percentage from Requests)
resources:
  requests:
    cpu: "500m"      # Required for HPA
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
The Oscillation (“Flapping”) Problem
If HPA reacts too quickly to spikes, the system starts to oscillate:
- CPU spike → HPA adds Pods → load drops → HPA removes Pods → spike again
Solution: configure stabilization windows (the behavior field in the HPA spec).
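The cooldown is a fragment of the autoscaling/v2 HPA spec; a minimal sketch (window values are illustrative, tune them to your traffic pattern):

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60    # consider the last 60s before scaling up
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 minutes of low load before scaling down
```

A longer scale-down window is the usual choice: adding capacity late costs latency, while removing it late only costs a little money.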
What a Middle Developer Should Remember
- HPA — primary scaling method
- VPA — for resource tuning (not together with HPA)
- Cluster Autoscaler — for cluster capacity management
- HPA requires Metrics Server and Resource Requests
- Configure cooldown to prevent oscillation
Senior Level
Scaling as an Architectural Strategy
Scaling in Kubernetes is not just “add more Pods” — it is a multi-level strategy that affects application architecture, infrastructure costs, and SLA.
Complete Scaling Picture
Application level:
├── HPA (horizontal): more replicas
└── VPA (vertical): more resources
Cluster level:
├── Cluster Autoscaler: more Nodes
└── Karpenter: instance type optimization
Load level:
├── Resource-based: CPU, Memory
├── Custom metrics: RPS, queue length, latency
└── External: business metrics
HPA: Advanced Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 512Mi
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
Custom Metrics for HPA
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
Requires Prometheus Adapter to expose metrics to the K8s API.
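A sketch of a Prometheus Adapter rule that would expose such a metric (the series name and label set are illustrative and depend on what your application actually exports):

```yaml
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # hypothetical app metric
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The adapter turns the counter `http_requests_total` into a rate and serves it under the custom.metrics.k8s.io API, where HPA can read it.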
KEDA: Event-Driven Autoscaling
KEDA — event-based scaling, configured via a ScaledObject:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp-scaler
spec:
  scaleTargetRef:
    name: myapp
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: mygroup
      topic: mytopic
      lagThreshold: "100"
Supports: Kafka, RabbitMQ, Redis, AWS SQS, Azure Service Bus, and others.
Scale to Zero
Knative + Kourier:
- No traffic → 0 replicas
- On request → fast startup (cold start ~100ms-2s)
- Ideal for event-driven, serverless workloads
Requires:
- Fast application startup (e.g. GraalVM Native Image)
- Queue Proxy to buffer requests during scale up
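A minimal Knative Service that allows scale-to-zero might look like this sketch (the service name and image are placeholders; it assumes Knative Serving is installed):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: myapp               # illustrative name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"    # allow scaling to zero replicas
        autoscaling.knative.dev/max-scale: "10"
    spec:
      containers:
      - image: registry.example.com/myapp:latest  # placeholder image
```

With min-scale "0", Knative removes all replicas after an idle period and buffers the first incoming request while a new Pod cold-starts.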
VPA: Limitations and Conflicts
Do not use with HPA on CPU:
- HPA wants more Pods when CPU is high
- VPA wants more CPU per Pod
- Conflict → unpredictable behavior
Recommendation:
- HPA for stateless workloads (horizontal)
- VPA for stateful/monolithic workloads (vertical)
- VPA in "Off" mode for recommendations (what to set in requests)
Cluster Autoscaler vs Karpenter
| Characteristic | Cluster Autoscaler | Karpenter |
|---|---|---|
| Speed | Slow (minutes) | Fast (seconds) |
| Instance selection | Limited | Optimized |
| Spot instances | Yes | Yes (better) |
| Provider | Cloud-specific | AWS (multi-cloud in development) |
Scaling Economics
Without autoscaling:
- Peak load: 100 replicas (2 hours per day)
- Rest of the time: 10 replicas idle
- Cost: 100 × 24h
With autoscaling:
- Peak: 100 replicas (2 hours)
- Off-peak: 10 replicas (22 hours)
- Cost: 100×2h + 10×22h = 420 replica-hours vs 2,400 → ~80% savings
Troubleshooting
HPA not scaling:
kubectl describe hpa myapp
# Conditions:
# AbleToScale False SucceededRescale (last transition: ...)
# ScalingActive False FailedGetResourceMetric
Check:
- Is Metrics Server running?
- Are Requests specified?
- Are metrics available?
Pods in Pending during scaling:
- Insufficient resources on Nodes → Cluster Autoscaler should trigger
- Check quota limits:
kubectl describe resourcequota
Summary for Senior
- HPA — for handling load, VPA — for resource tuning, CA — for capacity.
- Do not use HPA and VPA simultaneously on the same metric.
- Custom metrics (Prometheus Adapter) for business-oriented scaling.
- KEDA for event-driven scaling (Kafka, RabbitMQ, SQS).
- Scale-to-zero (Knative) for serverless workloads.
- Behavior policies control scale up/down speed, prevent oscillation.
- Karpenter is faster and smarter than Cluster Autoscaler (AWS).
- Always configure Resource Requests & Limits, otherwise autoscaling won’t work.
Interview Cheat Sheet
Must know:
- HPA — horizontal (more Pods), VPA — vertical (more CPU/RAM), CA — more Nodes
- HPA formula: desiredReplicas = ceil[current × (currentMetric / desiredMetric)]
- HPA requires Metrics Server and Resource Requests in Pod spec
- Do not use HPA and VPA simultaneously on the same metric (conflict)
- Custom metrics (Prometheus Adapter) for business-oriented scaling
- Behavior policies (stabilization window) prevent oscillation
- KEDA — event-driven scaling (Kafka, RabbitMQ, SQS); Knative — scale-to-zero
Common follow-up questions:
- “Why doesn’t HPA work without requests?” — HPA calculates percentage from requests; without them there’s no baseline
- “Is memory-based scaling a good idea?” — No for Java (JVM doesn’t return memory to OS immediately)
- “What is KEDA?” — Kubernetes Event-driven Autoscaling; scaling by events (queues, streams)
- “Karpenter vs Cluster Autoscaler?” — Karpenter is faster, smarter at selecting instance types
Red flags (DO NOT say):
- “HPA and VPA together for CPU” (they conflict, unpredictable behavior)
- “Scaling Java application by memory” (JVM memory management breaks the metric)
- “Cluster Autoscaler replaces HPA” (CA adds Nodes, HPA adds Pods — different levels)
- “Setting maxReplicas = 1000 without control” (risk of huge costs)
Related topics:
- [[What is HorizontalPodAutoscaler (HPA)]] — HPA in detail
- [[What is ReplicaSet]] — replication mechanism
- [[What is Node in Kubernetes]] — Cluster Autoscaler