Question 15 · Section 14

What is HorizontalPodAutoscaler (HPA)?

Junior Level

Simple Explanation

HorizontalPodAutoscaler (HPA) is an automatic Kubernetes controller that changes the number of Pods based on load. Many requests → more Pods. Few requests → fewer Pods.

Simple Analogy

HPA is like a restaurant manager:

  • Many customers → opens additional cash registers
  • Few customers → closes extra registers, saves resources

How Does HPA Work?

  1. HPA checks Pod CPU utilization every 15 seconds
  2. If CPU is above the target → adds Pods
  3. If CPU is below the target → removes Pods
  4. Pod count stays within min/max bounds

Simple Example

# Automatically maintain ~50% CPU, from 2 to 10 Pods
kubectl autoscale deployment myapp --cpu-percent=50 --min=2 --max=10

Or via YAML:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

What is Needed for HPA to Work?

  1. Metrics Server — collects CPU/Memory metrics
  2. Requests in the Pod — HPA calculates the percentage from the request, not from the limit

resources:
  requests:
    cpu: "500m"    # Required!

What a Junior Developer Should Remember

  • HPA automatically changes the number of Pods
  • Works based on CPU (or other metrics)
  • Requires Metrics Server and Resource Requests
  • Set min/max replicas for control
  • HPA will not scale below minReplicas

Middle Level

HPA Formula

desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]


Example:

  • Desired CPU: 50%
  • Current CPU: 100%
  • Current replicas: 2
  • Result: ceil[2 × (100/50)] = 4 replicas
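The calculation can be sketched in a few lines of Python (function name is illustrative; the 10% tolerance mirrors the controller's default --horizontal-pod-autoscaler-tolerance flag):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified sketch of the HPA v2 replica calculation."""
    ratio = current_metric / target_metric
    # Within the default 10% tolerance the controller makes no change
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Enforce min/max bounds
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(2, 100, 50))  # 4, matching the example above
```

Note the tolerance check: it is why a cluster running at 52% CPU with a 50% target does not scale at all.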

Metric Sources

Type     | Source             | Example
---------|--------------------|---------------------
Resource | Metrics Server     | CPU, Memory
Pods     | Prometheus Adapter | RPS, latency
Object   | Prometheus Adapter | Requests per queue
External | External systems   | AWS SQS queue length

Memory-based scaling: why it’s a bad idea

Java applications do not always return memory to the OS immediately after GC. As a result:

  • HPA sees persistently high memory usage
  • It keeps adding Pods that are not actually needed
  • Resources are wasted

Recommendation: Scale by CPU or business metrics, not by memory.

Behavior: Configuring Scaling Speed

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 1
      periodSeconds: 120

Scale Down is slower than Scale Up — this protects against premature Pod removal.

HPA and VPA Conflict

Do not use HPA and VPA simultaneously for CPU:

  • HPA adds Pods when CPU is high
  • VPA increases CPU per Pod
  • They “fight” → instability
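If you still want VPA's sizing insight alongside HPA, run VPA in recommendation-only mode. A sketch, assuming the VPA CRD is installed (names are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"   # recommendations only, no Pod restarts; safe next to HPA
```

With updateMode "Off", VPA only writes recommendations into its status, so it never competes with HPA for control of the workload.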

What a Middle Developer Should Remember

  • HPA formula: desired replicas = current × (current metric / desired metric)
  • Memory-based scaling is an anti-pattern for Java
  • Behavior controls scale up/down speed
  • Do not use HPA + VPA for the same metric
  • maxReplicas protects budget from unexpected spikes

Senior Level

HPA as a Capacity Management System

HPA is not just “auto scaling” — it is a tool for balancing between SLA and cost optimization.

HPA v2 Algorithm: Detailed Analysis

Decision cycle:

Every 15 seconds (syncPeriod):
1. Fetch metrics for all targeted pods
2. Calculate desired replicas per metric
3. Take MAX across all metrics
4. Apply behavior policies (scale up/down limits)
5. Enforce min/max bounds
6. Patch Deployment spec.replicas

Stabilization Window:

  • On Scale Up: 60s by default (fast reaction)
  • On Scale Down: 300s by default (conservative)

The stabilization window is a “cooldown” period after a scaling event. It prevents flapping (“thrashing”): the constant adding and removing of Pods during short load fluctuations.
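The decision cycle above can be sketched as a single sync iteration (hypothetical names; behavior handling is heavily simplified):

```python
import math

def hpa_sync(current_replicas, metrics, min_replicas, max_replicas,
             max_scale_up_pods=4):
    """One sync iteration. metrics is a list of (current, target) pairs."""
    # Steps 2-3: desired replicas per metric, then take the MAX
    proposals = [math.ceil(current_replicas * cur / tgt) for cur, tgt in metrics]
    desired = max(proposals)
    # Step 4: apply a scale-up behavior policy (at most N Pods per period)
    desired = min(desired, current_replicas + max_scale_up_pods)
    # Step 5: enforce min/max bounds
    return max(min_replicas, min(max_replicas, desired))

# CPU proposes 4 replicas, RPS proposes 6 -> MAX wins
print(hpa_sync(2, [(100, 50), (300, 100)], min_replicas=2, max_replicas=10))  # 6
```

A real controller also applies tolerance, the stabilization window, and per-Pod readiness handling, but the MAX-across-metrics step is the core of step 3.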

Multi-metric HPA

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"

HPA selects the MAX from calculations across all metrics.

Prometheus Adapter for Custom Metrics

# Prometheus rule
- record: http_requests_per_second
  expr: rate(http_requests_total[5m])

# Prometheus Adapter config
rules:
- seriesQuery: 'http_requests_per_second{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)"
    as: "${1}"
  metricsQuery: sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
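Once the adapter is deployed, you can verify the metric is actually exposed through the custom metrics API before wiring it into an HPA (namespace and metric name here are illustrative):

```shell
# List pod metrics visible to HPA via the custom metrics API
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second"
```

If this call returns an error, the HPA will show <unknown> for the custom metric no matter how the HPA object itself is configured.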

HPA and Java Applications: Specifics

JIT Warmup:

  • New Pods have “cold” JIT
  • First requests are processed slower
  • CPU may temporarily spike → HPA adds even more Pods (overshooting)

Solution:

  1. Readiness Probe with warmup check
  2. Initial delay for HPA (stabilization window)
  3. Startup Probe for long warmup
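A startup probe for a slow-starting JVM might look like this (sketch; the endpoint path, port, and thresholds are illustrative):

```yaml
containers:
- name: myapp
  startupProbe:
    httpGet:
      path: /actuator/health   # illustrative warmup/health endpoint
      port: 8080
    failureThreshold: 30       # up to 30 × 10s = 5 min for JVM/JIT warmup
    periodSeconds: 10
  readinessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    periodSeconds: 5
```

Until the startup probe succeeds, the Pod receives no traffic, so its cold-start CPU spike is less likely to cascade into further HPA scale-ups.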

G1 GC and CPU:

  • G1 GC uses CPU for concurrent marking
  • HPA may interpret this as load
  • Configure -XX:ConcGCThreads and -XX:ParallelGCThreads

Queue-based Scaling (KEDA)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaling
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: rabbitmq
    metadata:
      queueName: tasks
      queueLength: "10"
      activationQueueLength: "5"

KEDA creates an HPA with custom metrics.

Troubleshooting

HPA shows <unknown>:

kubectl get hpa
# NAME        REFERENCE          TARGETS         MINPODS   MAXPODS
# myapp-hpa   Deployment/myapp   <unknown>/50%   2         10

Causes:

  1. Metrics Server not installed
  2. No requests in Pod spec
  3. Metrics not yet collected (wait 1-2 minutes)
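Commands that help pinpoint which cause applies (output will vary per cluster; the jsonpath query assumes the standard Deployment spec layout):

```shell
# Cause 1: is Metrics Server installed and serving the API?
kubectl get deployment metrics-server -n kube-system
kubectl top pods   # fails if the metrics API is unavailable

# Cause 2: do the target Pods declare resource requests?
kubectl get deployment myapp \
  -o jsonpath='{.spec.template.spec.containers[*].resources.requests}'
```

An empty result from the jsonpath query means HPA has no request value to compute a utilization percentage from.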

HPA not scaling:

kubectl describe hpa myapp-hpa
# Warning: FailedGetResourceMetric  unable to fetch metrics

Cost Optimization via HPA

# Budget limit
maxReplicas: 20  # no more than 20 Pods = cost control

# Slow scale down
scaleDown:
  stabilizationWindowSeconds: 600  # 10 minutes

Summary for Senior

  • HPA — primary horizontal scaling tool.
  • Algorithm: MAX(desired replicas per metric), with behavior policies.
  • Always set a reasonable maxReplicas for budget control.
  • Behavior (stabilization window) prevents thrashing.
  • Base scaling on CPU or custom business metrics, not memory.
  • KEDA for event-driven scaling (queues, streams).
  • Java specifics: JIT warmup, GC overhead — account for these in configuration.
  • HPA + VPA conflict on CPU — use separately.

Interview Cheat Sheet

Must know:

  • HPA automatically scales Pod count by CPU, memory, custom, or external metrics
  • Formula: desiredReplicas = ceil[current × (currentMetric / desiredMetric)]
  • Requires Metrics Server and Resource Requests (calculates % from requests, not limits)
  • Behavior: stabilization window (scale up 60s, scale down 300s by default)
  • HPA selects MAX from calculations across all metrics
  • Memory-based scaling is an anti-pattern for Java (JVM GC, lazy memory return)
  • KEDA — event-driven HPA (Kafka, RabbitMQ, SQS); scale-to-zero with Knative

Common follow-up questions:

  • “Why does HPA show <unknown>?” — Metrics Server not installed, no requests, or metrics not yet collected
  • “Java JIT warmup and HPA?” — New Pods are “cold”, CPU spikes → HPA overshooting; use startupProbe
  • “HPA + VPA together?” — No for the same metric; VPA in Off mode for recommendations
  • “Why maxReplicas?” — Budget protection from unexpected load spikes

Red flags (DO NOT say):

  • “Scaling Java by memory usage” (JVM doesn’t return memory immediately after GC)
  • “HPA without requests” (doesn’t work, shows <unknown>)
  • “Setting periodSeconds=1 for fast reaction” (thrashing, extra load)
  • “HPA replaces Cluster Autoscaler” (HPA → Pods, CA → Nodes; different levels)

Related topics:

  • [[How does scaling work in Kubernetes]] — all types of scaling
  • [[Why are health checks needed]] — startupProbe for Java warmup
  • [[How to monitor applications in Kubernetes]] — custom metrics for HPA