Question 14 · Section 14

How does scaling work in Kubernetes?

Junior Level

Simple Explanation

Scaling is the process of changing the amount of resources allocated to an application. In Kubernetes, you can scale at two levels:

  1. More copies of the application (horizontal) — launch additional Pods
  2. More resources for a Pod (vertical) — give more CPU/RAM

Horizontal Scaling (more copies)

1 copy:     [App]          ← 100 req/s → slows down
3 copies:   [App] [App] [App]  ← 100 req/s → works fast

This is the most common approach. More copies = more requests handled.

Vertical Scaling (more resources)

Low RAM:    [App: 256MB]   ← OutOfMemoryError
High RAM:   [App: 1GB]     ← works normally

Less common approach. Not all applications can efficiently use more resources.

Manual Scaling

# Increase replica count to 5
kubectl scale deployment myapp --replicas=5

# Check
kubectl get deployment myapp
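
The same change can be made declaratively by setting the replica count in the Deployment manifest and applying it. A minimal sketch — the name `myapp` and the image tag are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 5          # desired number of Pod copies
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:1.0   # placeholder image
```

The declarative approach is preferred in GitOps workflows: `kubectl scale` changes are lost the next time the manifest is re-applied.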

Automatic Scaling

Kubernetes can automatically change the replica count:

  • HPA (Horizontal Pod Autoscaler) — based on CPU or other metrics
  • Cluster Autoscaler — adds servers (Nodes) when resources are insufficient

What a Junior Developer Should Remember

  • Horizontal = more copies (most common approach)
  • Vertical = more resources per Pod
  • Manual: kubectl scale deployment --replicas=N
  • Automatic: HPA by CPU, Cluster Autoscaler by Node
  • HPA requires Requests (minimum resources) to be specified

Middle Level

Types of Scaling

HPA — Horizontal Pod Autoscaler

Changes the number of Pod replicas based on load:

# HPA: maintain ~50% average CPU utilization
kubectl autoscale deployment myapp --cpu-percent=50 --min=2 --max=10

Note: HPA does not work well with stateful workloads (databases) — new Pods start without data. For stateful applications, use StatefulSet + manual scaling.

Metric sources:

  • Resource metrics: CPU, Memory (from Metrics Server)
  • Custom metrics: business metrics from Prometheus
  • External metrics: external queues (AWS SQS, RabbitMQ)

VPA — Vertical Pod Autoscaler

Automatically adjusts the CPU/RAM requests and limits of a Pod:

  • Requires Pod restart to apply
  • Useful for finding optimal resource requests
  • Not recommended to use together with HPA on CPU
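
A minimal VPA manifest sketch in recommendation-only mode (`updateMode: "Off"`), which is the safest way to start; the target name `myapp` is a placeholder:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"   # only compute recommendations, do not restart Pods
```

Recommendations then appear in `kubectl describe vpa myapp-vpa` and can be copied into the Deployment's resource requests manually.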

Cluster Autoscaler

Adds/removes Nodes in the cluster:

  • If Pods are in Pending state (insufficient resources) → adds Node
  • If Nodes are underutilized → removes them to save costs

How Does HPA Make Decisions?

Formula:

desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]

Example:

  • Desired CPU: 50%
  • Current CPU: 100%
  • Current replicas: 2
  • Decision: ceil[2 × (100/50)] = 4 replicas
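
The calculation above can be sketched in a few lines of Python (illustrative only — the real HPA controller also applies a tolerance band and stabilization windows):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float) -> int:
    """HPA core formula: ceil[current x (currentMetric / desiredMetric)]."""
    return math.ceil(current_replicas * (current_metric / desired_metric))

# The example from above: 2 replicas at 100% CPU, target 50%
print(desired_replicas(2, 100, 50))  # -> 4
```

Note that the same formula also drives scale-down: 4 replicas at 30% CPU against a 50% target gives ceil[4 × 0.6] = 3 replicas.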

Requirements for HPA

  1. Metrics Server must be installed
  2. Requests must be specified in the Pod (HPA calculates percentage from Requests)

resources:
  requests:
    cpu: "500m"     # Required for HPA
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"

The Oscillation (“Flapping”) Problem

If HPA reacts too quickly to short spikes, the replica count starts to oscillate:

  • CPU spike → HPA adds Pods → load drops → HPA removes Pods → spike again

Solution: configure stabilization windows (the behavior field in the HPA spec).
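
A minimal sketch of such a cooldown using the behavior field of an autoscaling/v2 HPA — scale down slowly, scale up quickly:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60    # react to sustained load within 1 min
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 min of low load before removing Pods
```

The asymmetry is deliberate: under-provisioning hurts users immediately, while over-provisioning only costs money for a few extra minutes.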

What a Middle Developer Should Remember

  • HPA — primary scaling method
  • VPA — for resource tuning (not together with HPA)
  • Cluster Autoscaler — for cluster capacity management
  • HPA requires Metrics Server and Resource Requests
  • Configure cooldown to prevent oscillation

Senior Level

Scaling as an Architectural Strategy

Scaling in Kubernetes is not just “add more Pods” — it is a multi-level strategy that affects application architecture, infrastructure costs, and SLA.

Complete Scaling Picture

Application level:
├── HPA (horizontal): more replicas
└── VPA (vertical): more resources

Cluster level:
├── Cluster Autoscaler: more Nodes
└── Karpenter: instance type optimization

Load level:
├── Resource-based: CPU, Memory
├── Custom metrics: RPS, queue length, latency
└── External: business metrics

HPA: Advanced Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 512Mi
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120

Custom Metrics for HPA

metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"

Requires Prometheus Adapter to expose metrics to the K8s API.
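
A sketch of a Prometheus Adapter rule that could derive such a per-second metric; the metric and label names here are assumptions about your instrumentation:

```yaml
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # assumed app metric
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

The adapter turns the counter into a rate and exposes it via the custom metrics API, where the HPA above can find it as http_requests_per_second.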

KEDA: Event-Driven Autoscaling

KEDA — event-based scaling. A trigger is declared inside a ScaledObject resource:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp-scaler
spec:
  scaleTargetRef:
    name: myapp
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: mygroup
      topic: mytopic
      lagThreshold: "100"

Supports: Kafka, RabbitMQ, Redis, AWS SQS, Azure Service Bus, and others.

Scale to Zero

Knative + Kourier:

  • No traffic → 0 replicas
  • On request → fast startup (cold start ~100ms-2s)
  • Ideal for event-driven, serverless workloads

Requires:

  • Fast application startup (e.g. GraalVM Native Image for JVM apps)
  • A queue proxy to buffer requests during scale-up
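
A minimal Knative Service sketch; the annotation shown is my assumption of how to pin the behavior explicitly — Knative scales to zero by default — and the name and image are placeholders:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: myapp
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # allow scaling to zero replicas
    spec:
      containers:
      - image: myapp:1.0   # placeholder image
```

Setting min-scale to "1" instead trades idle cost for the elimination of cold starts — a common compromise for latency-sensitive services.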

VPA: Limitations and Conflicts

Do not use with HPA on CPU:

  • HPA wants more Pods when CPU is high
  • VPA wants more CPU per Pod
  • Conflict → unpredictable behavior

Recommendation:

  • HPA for stateless workloads (horizontal)
  • VPA for stateful/monolithic workloads (vertical)
  • VPA in Off mode for recommendations (what to set in requests)

Cluster Autoscaler vs Karpenter

| Characteristic     | Cluster Autoscaler | Karpenter                        |
|--------------------|--------------------|----------------------------------|
| Speed              | Slow (minutes)     | Fast (seconds)                   |
| Instance selection | Limited            | Optimized                        |
| Spot instances     | Yes                | Yes (better)                     |
| Provider           | Cloud-specific     | AWS (multi-cloud in development) |

Scaling Economics

Without autoscaling (provisioned for peak):
- Peak load: 100 replicas (2 hours per day)
- Rest of the time: ~90 of those replicas sit idle
- Cost: 100 × 24h = 2400 replica-hours

With autoscaling:
- Peak: 100 replicas (2 hours)
- Off-peak: 10 replicas (22 hours)
- Cost: 100×2h + 10×22h = 420 replica-hours → ~80% savings

Troubleshooting

HPA not scaling:

kubectl describe hpa myapp
# Conditions:
#   AbleToScale    True    SucceededGetScale
#   ScalingActive  False   FailedGetResourceMetric

Check:

  1. Is Metrics Server running?
  2. Are Requests specified?
  3. Are metrics available?

Pods in Pending during scaling:

  • Insufficient resources on Nodes → Cluster Autoscaler should trigger
  • Check quota limits: kubectl describe resourcequota

Summary for Senior

  • HPA — for handling load, VPA — for resource tuning, CA — for capacity.
  • Do not use HPA and VPA simultaneously on the same metric.
  • Custom metrics (Prometheus Adapter) for business-oriented scaling.
  • KEDA for event-driven scaling (Kafka, RabbitMQ, SQS).
  • Scale-to-zero (Knative) for serverless workloads.
  • Behavior policies control scale up/down speed, prevent oscillation.
  • Karpenter is faster and smarter than Cluster Autoscaler (AWS).
  • Always configure Resource Requests & Limits, otherwise autoscaling won’t work.

Interview Cheat Sheet

Must know:

  • HPA — horizontal (more Pods), VPA — vertical (more CPU/RAM), CA — more Nodes
  • HPA formula: desiredReplicas = ceil[current × (currentMetric / desiredMetric)]
  • HPA requires Metrics Server and Resource Requests in Pod spec
  • Do not use HPA and VPA simultaneously on the same metric (conflict)
  • Custom metrics (Prometheus Adapter) for business-oriented scaling
  • Behavior policies (stabilization window) prevent oscillation
  • KEDA — event-driven scaling (Kafka, RabbitMQ, SQS); Knative — scale-to-zero

Common follow-up questions:

  • “Why doesn’t HPA work without requests?” — HPA calculates percentage from requests; without them there’s no baseline
  • “Is memory-based scaling a good idea?” — No for Java (JVM doesn’t return memory to OS immediately)
  • “What is KEDA?” — Kubernetes Event-driven Autoscaling; scaling by events (queues, streams)
  • “Karpenter vs Cluster Autoscaler?” — Karpenter is faster, smarter at selecting instance types

Red flags (DO NOT say):

  • “HPA and VPA together for CPU” (they conflict, unpredictable behavior)
  • “Scaling Java application by memory” (JVM memory management breaks the metric)
  • “Cluster Autoscaler replaces HPA” (CA adds Nodes, HPA adds Pods — different levels)
  • “Setting maxReplicas = 1000 without control” (risk of huge costs)

Related topics:

  • [[What is HorizontalPodAutoscaler (HPA)]] — HPA in detail
  • [[What is ReplicaSet]] — replication mechanism
  • [[What is Node in Kubernetes]] — Cluster Autoscaler