Question 14 · Section 14

How does scaling work in Kubernetes?

Junior Level

Simple Explanation

Scaling is the process of changing the amount of resources allocated to an application. In Kubernetes, you can scale at two levels:

  1. More copies of the application (horizontal) — launch additional Pods
  2. More resources for a Pod (vertical) — give more CPU/RAM

Horizontal Scaling (more copies)

1 copy:     [App]          ← 100 req/s → slows down
3 copies:   [App] [App] [App]  ← 100 req/s → works fast

This is the most common approach. More copies = more requests handled.

Vertical Scaling (more resources)

Low RAM:    [App: 256MB]   ← OutOfMemoryError
High RAM:   [App: 1GB]     ← works normally

Less common approach. Not all applications can efficiently use more resources.

Manual Scaling

# Increase replica count to 5
kubectl scale deployment myapp --replicas=5

# Check
kubectl get deployment myapp
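
The same change can be made declaratively by setting the replica count in the Deployment manifest and applying it. A minimal sketch — the name `myapp` and the image tag are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 5          # desired number of Pod copies
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:1.0   # placeholder image
```

The declarative approach is preferred in GitOps workflows: `kubectl scale` changes are lost the next time the manifest is re-applied.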

Automatic Scaling

Kubernetes can automatically change the replica count:

  • HPA (Horizontal Pod Autoscaler) — based on CPU or other metrics
  • Cluster Autoscaler — adds servers (Nodes) when resources are insufficient

What a Junior Developer Should Remember

  • Horizontal = more copies (most common approach)
  • Vertical = more resources per Pod
  • Manual: kubectl scale deployment --replicas=N
  • Automatic: HPA by CPU, Cluster Autoscaler by Node
  • HPA requires Requests (minimum resources) to be specified

Middle Level

Types of Scaling

HPA — Horizontal Pod Autoscaler

Changes the number of Pod replicas based on load:

# HPA: maintain ~50% average CPU utilization
kubectl autoscale deployment myapp --cpu-percent=50 --min=2 --max=10

Note: HPA does not work well with stateful workloads (databases) — new Pods start without data. For stateful applications, use StatefulSet + manual scaling.

Metric sources:

  • Resource metrics: CPU, Memory (from Metrics Server)
  • Custom metrics: business metrics from Prometheus
  • External metrics: external queues (AWS SQS, RabbitMQ)

VPA — Vertical Pod Autoscaler

Automatically adjusts the CPU/RAM requests and limits of a Pod:

  • Requires Pod restart to apply
  • Useful for finding optimal resource requests
  • Not recommended to use together with HPA on CPU
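
A minimal VPA manifest sketch in recommendation-only mode (`updateMode: "Off"`), which is the safest way to start; the target name `myapp` is a placeholder:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"   # only compute recommendations, do not restart Pods
```

Recommendations then appear in `kubectl describe vpa myapp-vpa` and can be copied into the Deployment's resource requests manually.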

Cluster Autoscaler

Adds/removes Nodes in the cluster:

  • If Pods are in Pending state (insufficient resources) → adds Node
  • If Nodes are underutilized → removes them to save costs

How Does HPA Make Decisions?

Formula:

desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]

Example:

  • Desired CPU: 50%
  • Current CPU: 100%
  • Current replicas: 2
  • Decision: ceil[2 × (100/50)] = 4 replicas
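
The calculation above can be sketched in a few lines of Python (illustrative only — the real HPA controller also applies a tolerance band and stabilization windows):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float) -> int:
    """HPA core formula: ceil[current x (currentMetric / desiredMetric)]."""
    return math.ceil(current_replicas * (current_metric / desired_metric))

# The example from above: 2 replicas at 100% CPU, target 50%
print(desired_replicas(2, 100, 50))  # -> 4
```

Note that the same formula also drives scale-down: 4 replicas at 30% CPU against a 50% target gives ceil[4 × 0.6] = 3 replicas.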

Requirements for HPA

  1. Metrics Server must be installed
  2. Requests must be specified in the Pod (HPA calculates percentage from Requests)

resources:
  requests:
    cpu: "500m"     # Required for HPA
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"

The Oscillation (“Flapping”) Problem

If HPA reacts too quickly to short spikes, the replica count starts to oscillate:

  • CPU spike → HPA adds Pods → load drops → HPA removes Pods → spike again

Solution: configure stabilization windows (the behavior field in the HPA spec).
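
A minimal sketch of such a cooldown using the behavior field of an autoscaling/v2 HPA — scale down slowly, scale up quickly:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60    # react to sustained load within 1 min
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 min of low load before removing Pods
```

The asymmetry is deliberate: under-provisioning hurts users immediately, while over-provisioning only costs money for a few extra minutes.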

What a Middle Developer Should Remember

  • HPA — primary scaling method
  • VPA — for resource tuning (not together with HPA)
  • Cluster Autoscaler — for cluster capacity management
  • HPA requires Metrics Server and Resource Requests
  • Configure cooldown to prevent oscillation

Senior Level

Scaling as an Architectural Strategy

Scaling in Kubernetes is not just “add more Pods” — it is a multi-level strategy that affects application architecture, infrastructure costs, and SLA.

Complete Scaling Picture

Application level:
├── HPA (horizontal): more replicas
└── VPA (vertical): more resources

Cluster level:
├── Cluster Autoscaler: more Nodes
└── Karpenter: instance type optimization

Load level:
├── Resource-based: CPU, Memory
├── Custom metrics: RPS, queue length, latency
└── External: business metrics

HPA: Advanced Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 512Mi
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120

Custom Metrics for HPA

metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"

Requires Prometheus Adapter to expose metrics to the K8s API.
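
A sketch of a Prometheus Adapter rule that could derive such a per-second metric; the metric and label names here are assumptions about your instrumentation:

```yaml
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # assumed app metric
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The adapter turns the counter into a rate and exposes it via the custom metrics API, where the HPA above can find it as http_requests_per_second.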

KEDA: Event-Driven Autoscaling

KEDA — event-based scaling. A trigger is declared inside a ScaledObject resource:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp-scaler
spec:
  scaleTargetRef:
    name: myapp
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: mygroup
      topic: mytopic
      lagThreshold: "100"

Supports: Kafka, RabbitMQ, Redis, AWS SQS, Azure Service Bus, and others.

Scale to Zero

Knative + Kourier:

  • No traffic → 0 replicas
  • On request → fast startup (cold start ~100ms-2s)
  • Ideal for event-driven, serverless workloads

Requires:

  • Fast application startup (e.g. GraalVM Native Image for JVM apps)
  • A queue proxy to buffer requests during scale-up
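
A minimal Knative Service sketch; the annotation shown is my assumption of how to pin the behavior explicitly — Knative scales to zero by default — and the name and image are placeholders:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: myapp
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # allow scaling to zero replicas
    spec:
      containers:
      - image: myapp:1.0   # placeholder image
```

Setting min-scale to "1" instead trades idle cost for the elimination of cold starts — a common compromise for latency-sensitive services.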

VPA: Limitations and Conflicts

Do not use with HPA on CPU:

  • HPA wants more Pods when CPU is high
  • VPA wants more CPU per Pod
  • Conflict → unpredictable behavior

Recommendation:

  • HPA for stateless workloads (horizontal)
  • VPA for stateful/monolithic workloads (vertical)
  • VPA in Off mode for recommendations (what to set in requests)

Cluster Autoscaler vs Karpenter

| Characteristic     | Cluster Autoscaler | Karpenter                        |
|--------------------|--------------------|----------------------------------|
| Speed              | Slow (minutes)     | Fast (seconds)                   |
| Instance selection | Limited            | Optimized                        |
| Spot instances     | Yes                | Yes (better)                     |
| Provider           | Cloud-specific     | AWS (multi-cloud in development) |

Scaling Economics

Without autoscaling (provisioned for peak):
- Peak load: 100 replicas (2 hours per day)
- Rest of the time: ~90 of those replicas sit idle
- Cost: 100 × 24h = 2400 replica-hours

With autoscaling:
- Peak: 100 replicas (2 hours)
- Off-peak: 10 replicas (22 hours)
- Cost: 100×2h + 10×22h = 420 replica-hours → ~80% savings

Troubleshooting

HPA not scaling:

kubectl describe hpa myapp
# Conditions:
#   AbleToScale    True    SucceededGetScale
#   ScalingActive  False   FailedGetResourceMetric

Check:

  1. Is Metrics Server running?
  2. Are Requests specified?
  3. Are metrics available?

Pods in Pending during scaling:

  • Insufficient resources on Nodes → Cluster Autoscaler should trigger
  • Check quota limits: kubectl describe resourcequota

Summary for Senior

  • HPA — for handling load, VPA — for resource tuning, CA — for capacity.
  • Do not use HPA and VPA simultaneously on the same metric.
  • Custom metrics (Prometheus Adapter) for business-oriented scaling.
  • KEDA for event-driven scaling (Kafka, RabbitMQ, SQS).
  • Scale-to-zero (Knative) for serverless workloads.
  • Behavior policies control scale up/down speed, prevent oscillation.
  • Karpenter is faster and smarter than Cluster Autoscaler (AWS).
  • Always configure Resource Requests & Limits, otherwise autoscaling won’t work.

Interview Cheat Sheet

Must know:

  • HPA — horizontal (more Pods), VPA — vertical (more CPU/RAM), CA — more Nodes
  • HPA formula: desiredReplicas = ceil[current × (currentMetric / desiredMetric)]
  • HPA requires Metrics Server and Resource Requests in Pod spec
  • Do not use HPA and VPA simultaneously on the same metric (conflict)
  • Custom metrics (Prometheus Adapter) for business-oriented scaling
  • Behavior policies (stabilization window) prevent oscillation
  • KEDA — event-driven scaling (Kafka, RabbitMQ, SQS); Knative — scale-to-zero

Common follow-up questions:

  • “Why doesn’t HPA work without requests?” — HPA calculates percentage from requests; without them there’s no baseline
  • “Is memory-based scaling a good idea?” — No for Java (JVM doesn’t return memory to OS immediately)
  • “What is KEDA?” — Kubernetes Event-driven Autoscaling; scaling by events (queues, streams)
  • “Karpenter vs Cluster Autoscaler?” — Karpenter is faster, smarter at selecting instance types

Red flags (DO NOT say):

  • “HPA and VPA together for CPU” (they conflict, unpredictable behavior)
  • “Scaling Java application by memory” (JVM memory management breaks the metric)
  • “Cluster Autoscaler replaces HPA” (CA adds Nodes, HPA adds Pods — different levels)
  • “Setting maxReplicas = 1000 without control” (risk of huge costs)

Related topics:

  • [[What is HorizontalPodAutoscaler (HPA)]] — HPA in detail
  • [[What is ReplicaSet]] — replication mechanism
  • [[What is Node in Kubernetes]] — Cluster Autoscaler