What is HorizontalPodAutoscaler (HPA)?
Junior Level
Simple Explanation
HorizontalPodAutoscaler (HPA) is an automatic Kubernetes controller that changes the number of Pods based on load. Many requests → more Pods. Few requests → fewer Pods.
Simple Analogy
HPA is like a restaurant manager:
- Many customers → opens additional cash registers
- Few customers → closes extra registers, saves resources
How Does HPA Work?
- HPA checks Pod CPU utilization every 15 seconds
- If CPU is above the target → adds Pods
- If CPU is below → removes Pods
- Pod count stays within min/max bounds
Simple Example
# Automatically maintain ~50% CPU, from 2 to 10 Pods
kubectl autoscale deployment myapp --cpu-percent=50 --min=2 --max=10
Or via YAML:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
What is Needed for HPA to Work?
- Metrics Server — collects CPU/Memory metrics
- Requests in the Pod — HPA calculates percentage from Request, not from Limit
resources:
  requests:
    cpu: "500m"  # Required!
What a Junior Developer Should Remember
- HPA automatically changes the number of Pods
- Works based on CPU (or other metrics)
- Requires Metrics Server and Resource Requests
- Set min/max replicas for control
- HPA will not scale below minReplicas
Middle Level
HPA Formula
desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]
Example: ceil(2 × (300m / 200m)) = ceil(2 × 1.5) = ceil(3.0) = 3 Pods
Example:
- Desired CPU: 50%
- Current CPU: 100%
- Current replicas: 2
- Result: ceil[2 × (100/50)] = 4 replicas
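The calculation above can be sketched in Python (a minimal illustration of the formula, not the controller's actual code; it ignores readiness filtering and the tolerance band):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """HPA core formula: ceil(current * current/desired), clamped to bounds."""
    desired = math.ceil(current_replicas * (current_metric / desired_metric))
    return max(min_replicas, min(max_replicas, desired))

# Example from the text: 2 replicas at 100% CPU with a 50% target -> 4
print(desired_replicas(2, 100, 50, min_replicas=2, max_replicas=10))  # 4
```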
Metric Sources
| Type | Source | Example |
|---|---|---|
| Resource | Metrics Server | CPU, Memory |
| Pods | Prometheus Adapter | RPS, latency |
| Object | Prometheus Adapter | Requests per queue |
| External | External systems | AWS SQS queue length |
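For illustration, an External metric entry might look like this (a hedged sketch: the metric name `sqs_queue_length` and the `queue` label are hypothetical and depend on which metrics adapter you run):

```yaml
metrics:
- type: External
  external:
    metric:
      name: sqs_queue_length        # hypothetical metric exposed by an adapter
      selector:
        matchLabels:
          queue: orders
    target:
      type: AverageValue
      averageValue: "30"            # target ~30 messages per replica
```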
Memory-based scaling: why it’s a bad idea
Java applications do not always return memory to the OS immediately after GC. This leads to:
- HPA sees high memory usage
- Adds Pods endlessly
- Resources wasted
Recommendation: Scale by CPU or business metrics, not by memory.
Behavior: Configuring Scaling Speed
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 1
      periodSeconds: 120
Scale Down is slower than Scale Up — this protects against premature Pod removal.
HPA and VPA Conflict
Do not use HPA and VPA simultaneously for CPU:
- HPA adds Pods when CPU is high
- VPA increases CPU per Pod
- They “fight” → instability
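If you still want VPA recommendations alongside HPA, run VPA in recommendation-only mode (a sketch; assumes the VPA CRD is installed in the cluster):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"   # only produce recommendations, never evict Pods
```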
What a Middle Developer Should Remember
- HPA formula: desiredReplicas = ceil(current × (currentMetric / desiredMetric))
- Memory-based scaling is an anti-pattern for Java
- Behavior controls scale up/down speed
- Do not use HPA + VPA for the same metric
- maxReplicas protects budget from unexpected spikes
Senior Level
HPA as a Capacity Management System
HPA is not just “auto scaling” — it is a tool for balancing between SLA and cost optimization.
HPA v2 Algorithm: Detailed Analysis
Decision cycle:
Every 15 seconds (syncPeriod):
1. Fetch metrics for all targeted pods
2. Calculate desired replicas per metric
3. Take MAX across all metrics
4. Apply behavior policies (scale up/down limits)
5. Enforce min/max bounds
6. Patch Deployment spec.replicas
Stabilization Window:
- On Scale Up: 0s by default (immediate reaction)
- On Scale Down: 300s by default (conservative)
The stabilization window is a “cooldown” period after scaling. It prevents thrashing (flapping): rapid adding and removing of Pods during load fluctuations.
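The decision cycle above can be sketched in Python (a simplified illustration, not the real controller: readiness checks, tolerance, and the stabilization window history are omitted; `max_scale_up_step` stands in for a `Pods`-type behavior policy):

```python
import math

def hpa_decide(current_replicas, metrics, min_replicas, max_replicas,
               max_scale_up_step=4):
    """One simplified HPA reconciliation step.

    metrics: list of (current_value, target_value) pairs.
    """
    # 2. Calculate desired replicas per metric
    per_metric = [math.ceil(current_replicas * cur / tgt) for cur, tgt in metrics]
    # 3. Take MAX across all metrics
    desired = max(per_metric)
    # 4. Apply behavior policy: add at most N Pods per period
    if desired > current_replicas:
        desired = min(desired, current_replicas + max_scale_up_step)
    # 5. Enforce min/max bounds
    return max(min_replicas, min(max_replicas, desired))

# CPU wants 6 replicas (210% vs 70% target), RPS wants 3 -> MAX is 6
print(hpa_decide(2, [(210, 70), (120, 100)], 2, 10))  # 6
```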
Multi-metric HPA
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
HPA selects the MAX from calculations across all metrics.
Prometheus Adapter for Custom Metrics
# Prometheus rule
- record: http_requests_per_second
  expr: rate(http_requests_total[5m])

# Prometheus Adapter config
rules:
- seriesQuery: 'http_requests_per_second{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)"
    as: "${1}"
  metricsQuery: sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
HPA and Java Applications: Specifics
JIT Warmup:
- New Pods have “cold” JIT
- First requests are processed slower
- CPU may temporarily spike → HPA adds even more Pods (over-shooting)
Solution:
- Readiness Probe with warmup check
- Initial delay for HPA (stabilization window)
- Startup Probe for long warmup
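A startupProbe for a slow-starting JVM might look like this (a sketch; the endpoint path, port, and timings are assumptions to adapt to your application):

```yaml
startupProbe:
  httpGet:
    path: /actuator/health   # hypothetical warmup/health endpoint
    port: 8080
  failureThreshold: 30       # up to 30 × 10s = 5 minutes to warm up
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  periodSeconds: 5
```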
G1 GC and CPU:
- G1 GC uses CPU for concurrent marking
- HPA may interpret this as load
- Configure `-XX:ConcGCThreads` and `-XX:ParallelGCThreads`
Queue-based Scaling (KEDA)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaling
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: rabbitmq
    metadata:
      queueName: tasks
      queueLength: "10"
      activationQueueLength: "5"
KEDA creates an HPA with custom metrics.
Troubleshooting
HPA shows <unknown>:
kubectl get hpa
# NAME REFERENCE TARGETS MINPODS MAXPODS
# myapp-hpa Deployment/myapp <unknown> 2 10
Causes:
- Metrics Server not installed
- No requests in Pod spec
- Metrics not yet collected (wait 1-2 minutes)
HPA not scaling:
kubectl describe hpa myapp-hpa
# Warning: FailedGetResourceMetric unable to fetch metrics
Cost Optimization via HPA
# Budget limit
maxReplicas: 20  # no more than 20 Pods = cost control

# Slow scale down
scaleDown:
  stabilizationWindowSeconds: 600  # 10 minutes
Summary for Senior
- HPA — primary horizontal scaling tool.
- Algorithm: MAX(desired replicas per metric), with behavior policies.
- Always set a reasonable maxReplicas for budget control.
- Behavior (stabilization window) prevents thrashing.
- Base scaling on CPU or custom business metrics, not memory.
- KEDA for event-driven scaling (queues, streams).
- Java specifics: JIT warmup, GC overhead — account for these in configuration.
- HPA + VPA conflict on CPU — use separately.
Interview Cheat Sheet
Must know:
- HPA automatically scales Pod count by CPU, memory, custom, or external metrics
- Formula: desiredReplicas = ceil[current × (currentMetric / desiredMetric)]
- Requires Metrics Server and Resource Requests (calculates % from requests, not limits)
- Behavior: stabilization window (scale up 0s, scale down 300s by default)
- HPA selects MAX from calculations across all metrics
- Memory-based scaling is an anti-pattern for Java (JVM GC, lazy memory return)
- KEDA — event-driven HPA (Kafka, RabbitMQ, SQS); scale-to-zero with Knative
Common follow-up questions:
- “Why does HPA show <unknown>?” — Metrics Server not installed, no requests, or metrics not yet collected
- “Java JIT warmup and HPA?” — New Pods are “cold”, CPU spikes → HPA over-shooting; use startupProbe
- “HPA + VPA together?” — No for the same metric; VPA in `Off` mode for recommendations
- “Why maxReplicas?” — Budget protection from unexpected load spikes
Red flags (DO NOT say):
- “Scaling Java by memory usage” (JVM doesn’t return memory immediately after GC)
- “HPA without requests” (doesn’t work, shows <unknown>)
- “Setting periodSeconds=1 for fast reaction” (thrashing, extra load)
- “HPA replaces Cluster Autoscaler” (HPA → Pods, CA → Nodes; different levels)
Related topics:
- [[How does scaling work in Kubernetes]] — all types of scaling
- [[Why are health checks needed]] — startupProbe for Java warmup
- [[How to monitor applications in Kubernetes]] — custom metrics for HPA