Question 23 · Section 14

What is StatefulSet and When to Use It?

Junior Level

Simple Definition

StatefulSet is a controller in Kubernetes for managing applications that need to preserve their identity and data across restarts. Unlike a Deployment (where all Pods are identical and interchangeable), each Pod in a StatefulSet has a unique name, stable DNS, and its own disk.

StatefulSet – like Deployment, but each Pod gets a stable name (web-0, web-1, web-2) and stable storage (its own PersistentVolume). Pods are created and deleted in order.

Analogy

Deployment is like a taxi: any car can pick you up, all are interchangeable. StatefulSet is like a train: each car has its number, sits in a specific place, and if car #3 is removed, the train can’t just put any other in its place — it needs car #3 specifically.

YAML Example

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database
spec:
  serviceName: database-headless
  replicas: 3
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
        - name: db
          image: postgres:15
          ports:
            - name: postgres
              containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
---
# Headless Service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: database-headless
spec:
  clusterIP: None  # Headless!
  selector:
    app: database
  ports:
    - port: 5432

Headless Service (clusterIP: None) – K8s doesn’t create a virtual IP. Instead, DNS query returns each Pod’s IP directly. Required for StatefulSet – each Pod must be accessible by its name.

kubectl Example

# List StatefulSets
kubectl get statefulset database

# List Pods with their unique names
kubectl get pods -l app=database
# database-0  Running
# database-1  Running
# database-2  Running

# DNS to a specific Pod
# database-0.database-headless.default.svc.cluster.local

# Scale (in order!)
kubectl scale statefulset database --replicas=5

When to Use

  • Databases: PostgreSQL, MySQL, MongoDB
  • Distributed systems: Kafka, Zookeeper, Cassandra, Elasticsearch
  • Any application where each instance has a unique role (master/replica)
  • When data must be bound to a specific Pod

StatefulSet is needed when: (1) Pod identity matters (web-0 != web-1), (2) stable storage is needed, (3) launch order matters (web-0 -> web-1 -> web-2).


Middle Level

How it Works

StatefulSet guarantees four key properties:

  1. Stable network identity: Pods are named <name>-<ordinal> (database-0, database-1, database-2). Each Pod gets a DNS name: database-0.database-headless.default.svc.cluster.local. On restart, Pod keeps its name.

  2. Stable storage: Through volumeClaimTemplates, Kubernetes creates a separate PVC for each Pod (data-database-0, data-database-1). PVC is not deleted when Pod or StatefulSet is deleted. On Pod recreation, the same PVC is mounted.

  3. Ordered deployment: Pods are created strictly in order (0 -> 1 -> 2) and deleted in reverse (2 -> 1 -> 0). Pod N won’t start creating until Pod N-1 is Running and Ready.

  4. Headless Service: StatefulSet requires a Service with clusterIP: None. This Service doesn’t create a virtual IP, but allows DNS to resolve each Pod individually.

Practical Scenarios

Scenario 1: PostgreSQL Master-Replica cluster

database-0 -> Master (read/write)
database-1 -> Replica (read-only)
database-2 -> Replica (read-only)

Each Pod knows its role via ordinal. Init container determines: if ordinal=0, become master; if >0, connect to master as replica.
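A minimal sketch of such an init script, assuming the pod hostname follows the StatefulSet's `<name>-<ordinal>` convention (the role names and master address are illustrative, matching this article's example):

```shell
# role_for_pod: derive a database role from a StatefulSet pod name.
# The ordinal is the suffix after the last dash (database-2 -> 2).
role_for_pod() {
  ordinal="${1##*-}"
  if [ "$ordinal" -eq 0 ]; then
    echo "master"
  else
    # replicas reach ordinal 0 through the headless Service
    echo "replica of database-0.database-headless"
  fi
}

role_for_pod "database-0"
role_for_pod "database-2"
```

In a real init container the function would run against $HOSTNAME and write the resulting configuration before the main container starts.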

Scenario 2: Kafka cluster

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name  # yields the full pod name, e.g. kafka-0
          # broker.id = pod ordinal, extracted from POD_NAME at startup

Each Kafka broker has a unique broker.id equal to its ordinal. This is critical for Kafka internals.
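Since the Downward API exposes the full pod name (kafka-0), the numeric broker.id is typically extracted at startup; a hedged sketch (the POD_NAME variable matches the YAML above):

```shell
# broker_id_from_pod: strip everything up to the last dash (kafka-2 -> 2).
broker_id_from_pod() {
  echo "${1##*-}"
}

# e.g. in the container command:
#   export KAFKA_BROKER_ID="$(broker_id_from_pod "$POD_NAME")"
broker_id_from_pod "kafka-1"
```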

Scenario 3: Parallel scaling

spec:
  podManagementPolicy: Parallel  # instead of OrderedReady (default)

For applications that don’t need strict ordering (e.g., Elasticsearch data nodes), Parallel speeds up scaling.

Common Mistakes Table

| Mistake | Consequence | Solution |
| --- | --- | --- |
| Missing Headless Service | DNS doesn’t resolve individual Pods; StatefulSet doesn’t work correctly | Always create a Service with clusterIP: None |
| PVC not deleted on kubectl delete statefulset | “Orphaned” PVCs consume storage | Delete PVCs manually: kubectl delete pvc -l app=database |
| Using the :latest tag | Rolling Update doesn’t work; Pods may get a different image after recreation | Use specific tags |
| Expecting StatefulSet to configure the cluster itself | StatefulSet only guarantees identity; it doesn’t configure the DB/cluster | Use init containers or Operators |
| StorageClass without dynamic provisioning | PVCs stuck in Pending; Pods don’t start | Configure a StorageClass with a provisioner |
| StatefulSet for a stateless application | Unnecessary complexity; ordered startup slows deployment | Use a Deployment |

Comparison: StatefulSet vs Deployment vs DaemonSet

| Characteristic | StatefulSet | Deployment | DaemonSet |
| --- | --- | --- | --- |
| Pod identity | Unique (name, DNS, PVC) | All identical | All identical |
| Creation order | Strict (0->1->2) | Parallel | Parallel |
| Storage | Stable PVC per Pod | Shared PVC or ephemeral | HostPath or local volume |
| DNS | Individual (pod-N.service) | One Service IP | One Service IP |
| Scaling | Ordered (slow) | Parallel (fast) | One replica per node |
| When to use | Databases, queues, clusters | Stateless APIs, web apps | Monitoring, log agents |
| Rolling Update | Reverse order (N->0), partition for canary | Parallel (maxSurge/maxUnavailable) | Node by node |

When NOT to Use

  • Stateless applications — use Deployment. StatefulSet adds complexity without benefit
  • When all Pods are identical — if identity doesn’t matter, Deployment is simpler and faster
  • Need fast scaling — ordered startup of StatefulSet is slow. For 100 replicas, this takes 100 * startup_time
  • State stored in external DB — if the application is stateless and state is in PostgreSQL, use Deployment

Senior Level

Deep Mechanics: StatefulSet Controller, PVC, and Reconciliation

StatefulSet Controller Architecture: StatefulSet Controller (in kube-controller-manager) works via reconciliation loop, but is significantly more complex than Deployment Controller:

  1. Watch: Subscribes to StatefulSet, Pod, PVC events
  2. Ordinal Management: Maintains set of ordinals (0..N-1). Each ordinal = Pod + PVC
  3. Ordered Creation:
    For i = 0 to replicas-1:
      If Pod-<i> doesn't exist → create
      If Pod-<i> not Ready → wait
      If Pod-<i> Ready → move to i+1
    
  4. PVC Binding: For each Pod, a PVC is created from volumeClaimTemplates. PVC is named <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>. PVC is created before the Pod so kubelet can mount the volume before container startup.

  5. Pod Identity: When a Pod is recreated (possibly on a different node), the controller finds the existing PVC by its deterministic name (<template>-<statefulset>-<ordinal>) and attaches it to the new Pod. The PVC is never deleted automatically.
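The deterministic PVC name can be reproduced with a one-liner (template name, StatefulSet name, and ordinal as in this article's example):

```shell
# pvc_name: <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>
pvc_name() {
  echo "$1-$2-$3"
}

pvc_name data database 0   # data-database-0
```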

Headless Service DNS: Headless Service (clusterIP: None) doesn’t create iptables/IPVS rules. Instead, CoreDNS creates DNS records for each Pod:

database-0.database-headless.default.svc.cluster.local → 10.244.1.5
database-1.database-headless.default.svc.cluster.local → 10.244.2.7
database-2.database-headless.default.svc.cluster.local → 10.244.3.9

DNS updates on Pod IP change (via EndpointSlice).

PersistentVolume Binding: PV is bound to PVC via claimRef. On Pod deletion, the PVC is preserved. On StatefulSet deletion, the PVC is preserved (orphaned). The PV reclaimPolicy (Retain or Delete; Recycle is deprecated) determines the fate of the data.

Update Strategy:

spec:
  updateStrategy:
    type: RollingUpdate  # or OnDelete
    rollingUpdate:
      partition: 0  # only Pods with ordinal >= partition are updated

  • RollingUpdate: Updates Pods in reverse order (N->0), one at a time
  • OnDelete: Updates Pod only after manual deletion
  • partition: Allows updating only part of Pods (for canary in StatefulSet)
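A sketch of a canary rollout for the 3-replica example above: with partition: 2 only database-2 receives the new Pod template, and lowering the value rolls the change onward (values illustrative):

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only database-2 is updated; lower to 0 to finish the rollout
```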

Trade-offs

| Aspect | Trade-off |
| --- | --- |
| OrderedReady vs Parallel | OrderedReady: safer for clusters with leader election. Parallel: faster for data nodes |
| Retain vs Delete PVC | Retain: data is safe but needs manual management. Delete: automatic but risks data loss |
| StatefulSet vs Operator | StatefulSet: basic identity + storage. Operator: full automation (backup, failover, scaling); more complex but more powerful |
| Replica count | Small (3): fewer resources but less fault tolerant. Large (7+): more replica lag, slower writes |
| StorageClass local vs network | Local SSD: faster (IOPS) but no migration between nodes. Network (EBS, Ceph): portable but higher latency |

Edge Cases (7+)

Edge Case 1: Pod moves to another node, PVC doesn’t follow PVC is bound to an availability zone. If Pod moves to a different zone, PVC can’t be mounted (EBS/Ceph zone-locked). Pod stuck in ContainerCreating. Solution: use WaitForFirstConsumer volumeBindingMode in StorageClass — PVC is created in the zone where Pod is scheduled.
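A sketch of such a StorageClass for the AWS EBS CSI driver (the name and reclaimPolicy are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait-for-consumer
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # bind the PVC only after the Pod is scheduled
reclaimPolicy: Retain
```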

Edge Case 2: Split-brain on network partition StatefulSet with 3 PostgreSQL replicas. Network partition: database-0 (master) on one side, database-1 and database-2 on the other. database-1 and database-2 elect a new master. Now two masters write data. On network recovery — data corruption. Solution: Patroni or another HA framework with consensus (etcd/Zookeeper).

Edge Case 3: StatefulSet Rolling Update with incompatible versions Upgrading Kafka from version 3.0 -> 3.5. StatefulSet updates Pods in reverse order: broker-2, broker-1, broker-0. During update, the cluster has mixed versions. If the replication protocol is incompatible, data loss or cluster failure. Solution: Operator with versioning and pre-flight checks.

Edge Case 4: Orphaned PVC after kubectl delete statefulset StatefulSet deleted, but 3 PVCs (10Gi each) remain. They consume storage but aren’t used. After a week, StorageClass quota exceeded. Solution: automate cleanup via finalizer or CronJob.

Edge Case 5: Readiness Probe + StatefulSet ordinal awareness All Pods have the same Readiness Probe. But database-0 (master) and database-1 (replica) have different readiness requirements. Replica may be “ready” only after full sync with master (which takes minutes). Standard readinessProbe doesn’t account for this. Solution: custom readiness script checking replication lag.

Edge Case 6: HPA with StatefulSet HPA can target a StatefulSet (it exposes the scale subresource), but autoscaling ignores StatefulSet specifics: ordered startup makes scale-up slow, and scale-down leaves orphaned PVCs that a later scale-up will silently reuse. Solution: KEDA (Kubernetes Event-driven Autoscaling) or deliberate manual scaling.

Edge Case 7: Pod ordinal reuse after deletion Delete database-2. StatefulSet with replicas=3 creates a new database-2. New Pod gets the same ordinal and the same PVC (data-database-2). But if PVC was manually deleted, a new PVC is created empty. Pod starts without data, and the cluster thinks it has a replica with data. Solution: never delete PVC without full cluster re-sync.

Edge Case 8: StatefulSet with podManagementPolicy: Parallel and failure On parallel creation, all 3 Pods start simultaneously. If database-0 (master) isn’t ready yet, database-1 and database-2 can’t connect to master. They enter crash loop. Solution: OrderedReady (default) for leader-follower architecture clusters.

Performance Numbers

| Metric | Value |
| --- | --- |
| Pod creation latency (ordered) | N * (scheduling + container startup + readiness) |
| StatefulSet startup, 3 replicas | 2-5 minutes (depends on container startup) |
| StatefulSet startup, 10 replicas (ordered) | 5-15 minutes |
| StatefulSet startup, 10 replicas (parallel) | 1-3 minutes |
| PVC creation + binding | 5-30 seconds (depends on provisioner) |
| DNS update latency (CoreDNS) | 1-5 seconds |
| Rolling Update (reverse order) | N * startup_time (one Pod at a time) |
| Local SSD IOPS | 100K-1M IOPS, <1 ms latency |
| Network storage (EBS) IOPS | 3K-64K IOPS, 1-10 ms latency |

Security

  • PVC data not encrypted by default — use StorageClass with encryption (AWS EBS encryption, Ceph encryption)
  • StatefulSet Pods have stable DNS names — this simplifies targeting a specific Pod for attack. NetworkPolicy should restrict access to each ordinal
  • Headless Service DNS enumeration — attacker can enumerate all Pods via DNS: pod-0.service, pod-1.service, … Don’t expose sensitive services via headless Service
  • PVC access modes — ReadWriteOnce allows mounting on only one node; ReadWriteMany allows several, increasing the attack surface
  • Pod Security Admission — StatefulSet Pods often require privileges (e.g., for databases). Use baseline or privileged only if necessary
  • Secrets for DB credentials — store in Secrets, mount as volumes (not env vars). Rotating secrets requires Rolling Update of StatefulSet
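As one mitigation, a NetworkPolicy can limit which workloads reach the database Pods at all; a hedged sketch assuming clients carry an app: backend label (label names illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-ingress
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend   # illustrative client label
      ports:
        - port: 5432
```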

Production War Story

Situation: Fintech company, PostgreSQL cluster on StatefulSet (3 replicas), EBS storage, AWS. StatefulSet configured with volumeClaimTemplates, StorageClass with volumeBindingMode: Immediate.

Incident:

  1. Node with database-0 (master) crashed (hardware failure)
  2. Kubernetes recreated database-0 on a new node
  3. EBS volume was in zone us-east-1a. New node was in us-east-1b
  4. PVC couldn’t mount: Multi-Attach error — EBS zone-locked
  5. database-0 stuck in ContainerCreating for 2 hours
  6. database-1 and database-2 (replicas) continued running in read-only mode
  7. All writes failed — master unavailable
  8. Manual intervention: snapshot EBS in us-east-1a -> restore in us-east-1b -> attach to PVC -> restart Pod
  9. Downtime: 2.5 hours, loss of ~$200K in transactions

Post-mortem and fix:

  1. StorageClass with volumeBindingMode: WaitForFirstConsumer — PVC created in Pod’s zone, not pre-zone-locked
  2. Pod Topology Spread Constraints — spread Pods across zones:
     topologySpreadConstraints:
       - maxSkew: 1
         topologyKey: topology.kubernetes.io/zone
         whenUnsatisfiable: DoNotSchedule
  3. Patroni for HA — automatic failover: if master is down, replica promoted to master
  4. EBS Multi-AZ storage — use AWS EBS with cross-zone replication (more expensive but more reliable)
  5. Automated backup — daily snapshot + point-in-time recovery via WAL-G
  6. Alert on Pod pending > 5 minutes — would have fired immediately on zone mismatch

Monitoring after fix:

# Alert: Pod stuck in Pending (covers ContainerCreating) > 5 minutes
sum(kube_pod_status_phase{phase="Pending"}) by (namespace) > 0

# Alert: PVC not mounted
sum(kube_persistentvolumeclaim_status_phase{phase!="Bound"}) > 0

# Alert: PostgreSQL replication lag
pg_replication_lag_seconds > 30

# Alert: Pod cross-zone reschedule
kube_pod_info{namespace="database"} unless on(node) kube_node_info

Monitoring (Prometheus/Grafana)

Key metrics:

# StatefulSet replicas status
kube_statefulset_status_replicas
kube_statefulset_status_replicas_ready
kube_statefulset_status_replicas_current
kube_statefulset_status_replicas_updated

# Pod status by ordinal (kube-state-metrics has no statefulset label; match by pod name)
kube_pod_status_phase{pod=~"database-.*"}

# PVC status
kube_persistentvolumeclaim_status_phase

# PVC storage usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

# PostgreSQL replication lag (via postgres_exporter)
pg_replication_lag_seconds

# Kafka broker status (via kafka_exporter)
kafka_brokers

# DNS resolution latency (CoreDNS exposes aggregate histograms, not per-service labels)
histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))

Grafana Dashboard panels:

  1. StatefulSet replicas: current vs ready vs desired — mismatch detection
  2. Pod status by ordinal — heatmap (0, 1, 2)
  3. PVC storage usage % — alert at > 80%
  4. Replication lag — critical for HA clusters
  5. Pod restart rate by ordinal — specific Pod crash loop detection
  6. DNS resolution latency — headless service problem detection

Highload Best Practices

  1. Use Operator, not raw StatefulSet — Zalando Postgres Operator, Strimzi Kafka Operator. They automate backup, failover, scaling, version upgrades
  2. volumeBindingMode: WaitForFirstConsumer — PVC created in Pod’s zone, not zone-locked in advance
  3. Pod Topology Spread Constraints — spread Pods across zones and nodes for fault tolerance
  4. Separate master and replica onto different StorageClasses — master: high-performance SSD, replica: cheaper storage
  5. Automated backup — daily snapshot + continuous WAL archiving (WAL-G, barman)
  6. Monitor replication lag — alert at lag > 30 seconds
  7. PodDisruptionBudget — minAvailable: 2 for a 3-replica cluster so quorum is not lost
  8. Readiness Probe with replication awareness — replica ready only after sync with master
  9. podManagementPolicy: Parallel only for stateless data nodes — for master-replica clusters always OrderedReady
  10. Don’t use StatefulSet for production DB without Operator — StatefulSet gives identity but doesn’t automate failover, backup, or recovery
  11. Storage IOPS monitoring — alert when volume IOPS approaching limit
  12. Regular failover testing — Chaos Engineering: kill master Pod, verify automatic failover
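Practice 8 above can be sketched as a readiness script; the lag value is assumed to come from the database (in PostgreSQL, e.g. via pg_last_xact_replay_timestamp()), and the threshold is illustrative:

```shell
# replication_ready: succeed only when replication lag is under the threshold.
# LAG would be obtained from the database in a real probe, e.g.:
#   psql -tAc "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int"
replication_ready() {
  lag="$1"
  threshold="${2:-30}"
  [ "$lag" -le "$threshold" ]
}

if replication_ready 5 30; then echo "ready"; else echo "not ready"; fi   # prints "ready"
```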

Interview Cheat Sheet

Must know:

  • StatefulSet — controller for stateful applications: stable names (pod-0, pod-1), DNS, PVC
  • Pods created in order (0->1->2), deleted in reverse (2->1->0)
  • Headless Service (clusterIP: None) required — DNS resolves each Pod individually
  • volumeClaimTemplates — separate PVC per Pod; PVC not deleted on StatefulSet deletion
  • For production DB, use Operator (Zalando Postgres, Strimzi Kafka), not raw StatefulSet
  • WaitForFirstConsumer in StorageClass — PVC created in Pod’s zone, not zone-locked
  • Rolling Update in reverse order (N->0); partition for canary update

Common follow-up questions:

  • “StatefulSet vs Deployment?” — Deployment: all Pods identical; StatefulSet: unique identity + stable storage
  • “Is PVC deleted on kubectl delete statefulset?” — No, PVCs are orphaned; must be deleted manually
  • “Does HPA work with StatefulSet?” — Yes, via the scale subresource, but it ignores ordering and leaves orphaned PVCs on scale-down; KEDA is often a better fit
  • “Split-brain in StatefulSet?” — On network partition, dual master is possible; need Patroni/consensus

Red flags (DO NOT say):

  • “StatefulSet for stateless applications” (excessive, use Deployment)
  • “StatefulSet configures DB cluster itself” (only identity; needs setup via Init/Operator)
  • “PVC deleted automatically” (orphaned PVC — common problem)
  • “StatefulSet = fast scaling” (ordered startup is slow)

Related topics:

  • [[What is Pod in Kubernetes]] — unit of scheduling
  • [[How to organize rolling update in Kubernetes]] — Pod updates
  • [[How does scaling work in Kubernetes]] — KEDA for StatefulSet