What is HorizontalPodAutoscaler (HPA)?
Junior Level
Simple Explanation
HorizontalPodAutoscaler (HPA) is an automatic Kubernetes controller that changes the number of Pods based on load. Many requests → more Pods. Few requests → fewer Pods.
Simple Analogy
HPA is like a restaurant manager:
- Many customers → opens additional cash registers
- Few customers → closes extra registers, saves resources
How Does HPA Work?
- HPA checks Pod CPU utilization every 15 seconds
- If CPU is above the target → adds Pods
- If CPU is below → removes Pods
- Pod count stays within min/max bounds
Simple Example
# Automatically maintain ~50% CPU, from 2 to 10 Pods
kubectl autoscale deployment myapp --cpu-percent=50 --min=2 --max=10
Or via YAML:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
What is Needed for HPA to Work?
- Metrics Server — collects CPU/Memory metrics
- Requests in the Pod — HPA calculates percentage from Request, not from Limit
resources:
  requests:
    cpu: "500m"  # Required!
What a Junior Developer Should Remember
- HPA automatically changes the number of Pods
- Works based on CPU (or other metrics)
- Requires Metrics Server and Resource Requests
- Set min/max replicas for control
- HPA will not scale below minReplicas
Middle Level
HPA Formula
desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]
Example: ceil(2 × (300m / 200m)) = ceil(2 × 1.5) = ceil(3.0) = 3 Pods
Example:
- Desired CPU: 50%
- Current CPU: 100%
- Current replicas: 2
- Result: ceil[2 × (100/50)] = 4 replicas
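The calculation above can be sketched in Python (a minimal illustration of the formula, not the controller's actual code; it ignores readiness filtering and the tolerance band):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """HPA core formula: ceil(current * current/desired), clamped to bounds."""
    desired = math.ceil(current_replicas * (current_metric / desired_metric))
    return max(min_replicas, min(max_replicas, desired))

# Example from the text: 2 replicas at 100% CPU with a 50% target -> 4
print(desired_replicas(2, 100, 50, min_replicas=2, max_replicas=10))  # 4
```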
Metric Sources
| Type | Source | Example |
|---|---|---|
| Resource | Metrics Server | CPU, Memory |
| Pods | Prometheus Adapter | RPS, latency |
| Object | Prometheus Adapter | Requests per queue |
| External | External systems | AWS SQS queue length |
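For illustration, an External metric entry might look like this (a hedged sketch: the metric name `sqs_queue_length` and the `queue` label are hypothetical and depend on which metrics adapter you run):

```yaml
metrics:
- type: External
  external:
    metric:
      name: sqs_queue_length        # hypothetical metric exposed by an adapter
      selector:
        matchLabels:
          queue: orders
    target:
      type: AverageValue
      averageValue: "30"            # target ~30 messages per replica
```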
Memory-based scaling: why it’s a bad idea
Java applications do not always return memory to the OS immediately after GC. This leads to:
- HPA sees high memory usage
- Adds Pods endlessly
- Resources wasted
Recommendation: Scale by CPU or business metrics, not by memory.
Behavior: Configuring Scaling Speed
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 1
      periodSeconds: 120
Scale Down is slower than Scale Up — this protects against premature Pod removal.
HPA and VPA Conflict
Do not use HPA and VPA simultaneously for CPU:
- HPA adds Pods when CPU is high
- VPA increases CPU per Pod
- They “fight” → instability
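If you still want VPA recommendations alongside HPA, run VPA in recommendation-only mode (a sketch; assumes the VPA CRD is installed in the cluster):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"   # only produce recommendations, never evict Pods
```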
What a Middle Developer Should Remember
- HPA formula: desiredReplicas = ceil(current × (currentMetric / desiredMetric))
- Memory-based scaling is an anti-pattern for Java
- Behavior controls scale up/down speed
- Do not use HPA + VPA for the same metric
- maxReplicas protects budget from unexpected spikes
Senior Level
HPA as a Capacity Management System
HPA is not just “auto scaling” — it is a tool for balancing between SLA and cost optimization.
HPA v2 Algorithm: Detailed Analysis
Decision cycle:
Every 15 seconds (syncPeriod):
1. Fetch metrics for all targeted pods
2. Calculate desired replicas per metric
3. Take MAX across all metrics
4. Apply behavior policies (scale up/down limits)
5. Enforce min/max bounds
6. Patch Deployment spec.replicas
Stabilization Window:
- On Scale Up: 0s by default (immediate reaction)
- On Scale Down: 300s by default (conservative)
The stabilization window is a “cooldown” period after scaling. It prevents thrashing (flapping): rapid adding and removing of Pods during load fluctuations.
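The decision cycle above can be sketched in Python (a simplified illustration, not the real controller: readiness checks, tolerance, and the stabilization window history are omitted; `max_scale_up_step` stands in for a `Pods`-type behavior policy):

```python
import math

def hpa_decide(current_replicas, metrics, min_replicas, max_replicas,
               max_scale_up_step=4):
    """One simplified HPA reconciliation step.

    metrics: list of (current_value, target_value) pairs.
    """
    # 2. Calculate desired replicas per metric
    per_metric = [math.ceil(current_replicas * cur / tgt) for cur, tgt in metrics]
    # 3. Take MAX across all metrics
    desired = max(per_metric)
    # 4. Apply behavior policy: add at most N Pods per period
    if desired > current_replicas:
        desired = min(desired, current_replicas + max_scale_up_step)
    # 5. Enforce min/max bounds
    return max(min_replicas, min(max_replicas, desired))

# CPU wants 6 replicas (210% vs 70% target), RPS wants 3 -> MAX is 6
print(hpa_decide(2, [(210, 70), (120, 100)], 2, 10))  # 6
```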
Multi-metric HPA
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
HPA selects the MAX from calculations across all metrics.
Prometheus Adapter for Custom Metrics
# Prometheus rule
- record: http_requests_per_second
  expr: rate(http_requests_total[5m])

# Prometheus Adapter config
rules:
- seriesQuery: 'http_requests_per_second{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)"
    as: "${1}"
  metricsQuery: sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
HPA and Java Applications: Specifics
JIT Warmup:
- New Pods have “cold” JIT
- First requests are processed slower
- CPU may temporarily spike → HPA adds even more Pods (over-shooting)
Solution:
- Readiness Probe with warmup check
- Initial delay for HPA (stabilization window)
- Startup Probe for long warmup
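A startupProbe for a slow-starting JVM might look like this (a sketch; the endpoint path, port, and timings are assumptions to adapt to your application):

```yaml
startupProbe:
  httpGet:
    path: /actuator/health   # hypothetical warmup/health endpoint
    port: 8080
  failureThreshold: 30       # up to 30 × 10s = 5 minutes to warm up
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  periodSeconds: 5
```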
G1 GC and CPU:
- G1 GC uses CPU for concurrent marking
- HPA may interpret this as load
- Configure `-XX:ConcGCThreads` and `-XX:ParallelGCThreads`
Queue-based Scaling (KEDA)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaling
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: rabbitmq
    metadata:
      queueName: tasks
      queueLength: "10"
      activationQueueLength: "5"
KEDA creates an HPA with custom metrics.
Troubleshooting
HPA shows <unknown>:
kubectl get hpa
# NAME REFERENCE TARGETS MINPODS MAXPODS
# myapp-hpa Deployment/myapp <unknown> 2 10
Causes:
- Metrics Server not installed
- No requests in Pod spec
- Metrics not yet collected (wait 1-2 minutes)
HPA not scaling:
kubectl describe hpa myapp-hpa
# Warning: FailedGetResourceMetric unable to fetch metrics
Cost Optimization via HPA
# Budget limit
maxReplicas: 20  # no more than 20 Pods = cost control

# Slow scale down
scaleDown:
  stabilizationWindowSeconds: 600  # 10 minutes
Summary for Senior
- HPA — primary horizontal scaling tool.
- Algorithm: MAX(desired replicas per metric), with behavior policies.
- Always set a reasonable maxReplicas for budget control.
- Behavior (stabilization window) prevents thrashing.
- Base scaling on CPU or custom business metrics, not memory.
- KEDA for event-driven scaling (queues, streams).
- Java specifics: JIT warmup, GC overhead — account for these in configuration.
- HPA + VPA conflict on CPU — use separately.
Interview Cheat Sheet
Must know:
- HPA automatically scales Pod count by CPU, memory, custom, or external metrics
- Formula: desiredReplicas = ceil[current × (currentMetric / desiredMetric)]
- Requires Metrics Server and Resource Requests (calculates % from requests, not limits)
- Behavior: stabilization window (scale up 0s, scale down 300s by default)
- HPA selects MAX from calculations across all metrics
- Memory-based scaling is an anti-pattern for Java (JVM GC, lazy memory return)
- KEDA — event-driven HPA (Kafka, RabbitMQ, SQS); scale-to-zero with Knative
Common follow-up questions:
- “Why does HPA show <unknown>?” — Metrics Server not installed, no requests, or metrics not yet collected
- “Java JIT warmup and HPA?” — New Pods are “cold”, CPU spikes → HPA over-shooting; use startupProbe
- “HPA + VPA together?” — No for the same metric; VPA in `Off` mode for recommendations
- “Why maxReplicas?” — Budget protection from unexpected load spikes
Red flags (DO NOT say):
- “Scaling Java by memory usage” (JVM doesn’t return memory immediately after GC)
- “HPA without requests” (doesn’t work, shows <unknown>)
- “Setting periodSeconds=1 for fast reaction” (thrashing, extra load)
- “HPA replaces Cluster Autoscaler” (HPA → Pods, CA → Nodes; different levels)
Related topics:
- [[How does scaling work in Kubernetes]] — all types of scaling
- [[Why are health checks needed]] — startupProbe for Java warmup
- [[How to monitor applications in Kubernetes]] — custom metrics for HPA