Question 23 · Section 14

What is StatefulSet and When to Use It?

Junior Level

Simple Definition

StatefulSet is a controller in Kubernetes for managing applications that need to preserve their identity and data across restarts. Unlike a Deployment (where all Pods are identical and interchangeable), each Pod in a StatefulSet has a unique name, stable DNS, and its own disk.

StatefulSet – like Deployment, but each Pod gets a stable name (web-0, web-1, web-2) and stable storage (its own PersistentVolume). Pods are created and deleted in order.

Analogy

Deployment is like a taxi: any car can pick you up, all are interchangeable. StatefulSet is like a train: each car has its number, sits in a specific place, and if car #3 is removed, the train can’t just put any other in its place — it needs car #3 specifically.

YAML Example

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database
spec:
  serviceName: database-headless
  replicas: 3
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
        - name: db
          image: postgres:15
          ports:
            - name: postgres
              containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
---
# Headless Service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: database-headless
spec:
  clusterIP: None  # Headless!
  selector:
    app: database
  ports:
    - port: 5432

Headless Service (clusterIP: None) – K8s doesn’t create a virtual IP. Instead, DNS query returns each Pod’s IP directly. Required for StatefulSet – each Pod must be accessible by its name.

kubectl Example

# List StatefulSets
kubectl get statefulset database

# List Pods with their unique names
kubectl get pods -l app=database
# database-0  Running
# database-1  Running
# database-2  Running

# DNS to a specific Pod
# database-0.database-headless.default.svc.cluster.local

# Scale (in order!)
kubectl scale statefulset database --replicas=5

When to Use

  • Databases: PostgreSQL, MySQL, MongoDB
  • Distributed systems: Kafka, Zookeeper, Cassandra, Elasticsearch
  • Any application where each instance has a unique role (master/replica)
  • When data must be bound to a specific Pod

StatefulSet is needed when: (1) Pod identity matters (web-0 != web-1), (2) stable storage is needed, (3) launch order matters (web-0 -> web-1 -> web-2).


Middle Level

How it Works

StatefulSet guarantees four key properties:

  1. Stable network identity: Pods are named <name>-<ordinal> (database-0, database-1, database-2). Each Pod gets a DNS name: database-0.database-headless.default.svc.cluster.local. On restart, Pod keeps its name.

  2. Stable storage: Through volumeClaimTemplates, Kubernetes creates a separate PVC for each Pod (data-database-0, data-database-1). PVC is not deleted when Pod or StatefulSet is deleted. On Pod recreation, the same PVC is mounted.

  3. Ordered deployment: Pods are created strictly in order (0 -> 1 -> 2) and deleted in reverse (2 -> 1 -> 0). Pod N won’t start creating until Pod N-1 is Running and Ready.

  4. Headless Service: StatefulSet requires a Service with clusterIP: None. This Service doesn’t create a virtual IP, but allows DNS to resolve each Pod individually.

Practical Scenarios

Scenario 1: PostgreSQL Master-Replica cluster

database-0 -> Master (read/write)
database-1 -> Replica (read-only)
database-2 -> Replica (read-only)

Each Pod knows its role via ordinal. Init container determines: if ordinal=0, become master; if >0, connect to master as replica.
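A minimal sketch of such an init script, assuming the pod hostname follows the StatefulSet's `<name>-<ordinal>` convention (the role names and master address are illustrative, matching this article's example):

```shell
# role_for_pod: derive a database role from a StatefulSet pod name.
# The ordinal is the suffix after the last dash (database-2 -> 2).
role_for_pod() {
  ordinal="${1##*-}"
  if [ "$ordinal" -eq 0 ]; then
    echo "master"
  else
    # replicas reach ordinal 0 through the headless Service
    echo "replica of database-0.database-headless"
  fi
}

role_for_pod "database-0"
role_for_pod "database-2"
```

In a real init container the function would run against $HOSTNAME and write the resulting configuration before the main container starts.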

Scenario 2: Kafka cluster

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name  # yields the full pod name, e.g. kafka-0
          # broker.id = pod ordinal, extracted from POD_NAME at startup

Each Kafka broker has a unique broker.id equal to its ordinal. This is critical for Kafka internals.
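Since the Downward API exposes the full pod name (kafka-0), the numeric broker.id is typically extracted at startup; a hedged sketch (the POD_NAME variable matches the YAML above):

```shell
# broker_id_from_pod: strip everything up to the last dash (kafka-2 -> 2).
broker_id_from_pod() {
  echo "${1##*-}"
}

# e.g. in the container command:
#   export KAFKA_BROKER_ID="$(broker_id_from_pod "$POD_NAME")"
broker_id_from_pod "kafka-1"
```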

Scenario 3: Parallel scaling

spec:
  podManagementPolicy: Parallel  # instead of OrderedReady (default)

For applications that don’t need strict ordering (e.g., Elasticsearch data nodes), Parallel speeds up scaling.

Common Mistakes Table

| Mistake | Consequence | Solution |
| --- | --- | --- |
| Missing Headless Service | DNS doesn’t resolve individual Pods; StatefulSet doesn’t work correctly | Always create a Service with clusterIP: None |
| PVC not deleted on kubectl delete statefulset | “Orphaned” PVCs consume storage | Delete PVCs manually: kubectl delete pvc -l app=database |
| Using the :latest tag | Rolling Update doesn’t work; Pods may get a different image after recreation | Use specific tags |
| Expecting StatefulSet to configure the cluster itself | StatefulSet only guarantees identity; it doesn’t configure the DB/cluster | Use init containers or Operators |
| StorageClass without dynamic provisioning | PVCs stuck in Pending; Pods don’t start | Configure a StorageClass with a provisioner |
| StatefulSet for a stateless application | Unnecessary complexity; ordered startup slows deployment | Use a Deployment |

Comparison: StatefulSet vs Deployment vs DaemonSet

| Characteristic | StatefulSet | Deployment | DaemonSet |
| --- | --- | --- | --- |
| Pod identity | Unique (name, DNS, PVC) | All identical | All identical |
| Creation order | Strict (0->1->2) | Parallel | Parallel |
| Storage | Stable PVC per Pod | Shared PVC or ephemeral | HostPath or local volume |
| DNS | Individual (pod-N.service) | One Service IP | One Service IP |
| Scaling | Ordered (slow) | Parallel (fast) | One replica per node |
| When to use | Databases, queues, clusters | Stateless APIs, web apps | Monitoring, log agents |
| Rolling Update | Reverse order (N->0), partition for canary | Parallel (maxSurge/maxUnavailable) | Node by node |

When NOT to Use

  • Stateless applications — use Deployment. StatefulSet adds complexity without benefit
  • When all Pods are identical — if identity doesn’t matter, Deployment is simpler and faster
  • Need fast scaling — ordered startup of StatefulSet is slow. For 100 replicas, this takes 100 * startup_time
  • State stored in external DB — if the application is stateless and state is in PostgreSQL, use Deployment

Senior Level

Deep Mechanics: StatefulSet Controller, PVC, and Reconciliation

StatefulSet Controller Architecture: StatefulSet Controller (in kube-controller-manager) works via reconciliation loop, but is significantly more complex than Deployment Controller:

  1. Watch: Subscribes to StatefulSet, Pod, PVC events
  2. Ordinal Management: Maintains set of ordinals (0..N-1). Each ordinal = Pod + PVC
  3. Ordered Creation:
    For i = 0 to replicas-1:
      If Pod-<i> doesn't exist → create
      If Pod-<i> not Ready → wait
      If Pod-<i> Ready → move to i+1
    
  4. PVC Binding: For each Pod, a PVC is created from volumeClaimTemplates. PVC is named <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>. PVC is created before the Pod so kubelet can mount the volume before container startup.

  5. Pod Identity: When a Pod is recreated (possibly on a different node), the controller finds the existing PVC by its deterministic name (<template>-<statefulset>-<ordinal>) and attaches it to the new Pod. The PVC is never deleted automatically.
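The deterministic PVC name can be reproduced with a one-liner (template name, StatefulSet name, and ordinal as in this article's example):

```shell
# pvc_name: <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>
pvc_name() {
  echo "$1-$2-$3"
}

pvc_name data database 0   # data-database-0
```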

Headless Service DNS: Headless Service (clusterIP: None) doesn’t create iptables/IPVS rules. Instead, CoreDNS creates DNS records for each Pod:

database-0.database-headless.default.svc.cluster.local → 10.244.1.5
database-1.database-headless.default.svc.cluster.local → 10.244.2.7
database-2.database-headless.default.svc.cluster.local → 10.244.3.9

DNS updates on Pod IP change (via EndpointSlice).

PersistentVolume Binding: PV is bound to PVC via claimRef. On Pod deletion, the PVC is preserved. On StatefulSet deletion, the PVC is preserved (orphaned). The PV reclaimPolicy (Retain or Delete; Recycle is deprecated) determines the fate of the data.

Update Strategy:

spec:
  updateStrategy:
    type: RollingUpdate  # or OnDelete
    rollingUpdate:
      partition: 0  # only Pods with ordinal >= partition are updated

  • RollingUpdate: Updates Pods in reverse order (N->0), one at a time
  • OnDelete: Updates Pod only after manual deletion
  • partition: Allows updating only part of Pods (for canary in StatefulSet)
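A sketch of a canary rollout for the 3-replica example above: with partition: 2 only database-2 receives the new Pod template, and lowering the value rolls the change onward (values illustrative):

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only database-2 is updated; lower to 0 to finish the rollout
```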

Trade-offs

| Aspect | Trade-off |
| --- | --- |
| OrderedReady vs Parallel | OrderedReady: safer for clusters with leader election. Parallel: faster for data nodes |
| Retain vs Delete PVC | Retain: data is safe but needs manual management. Delete: automatic but risks data loss |
| StatefulSet vs Operator | StatefulSet: basic identity + storage. Operator: full automation (backup, failover, scaling); more complex but more powerful |
| Replica count | Small (3): fewer resources but less fault tolerant. Large (7+): more replica lag, slower writes |
| StorageClass local vs network | Local SSD: faster (IOPS) but no migration between nodes. Network (EBS, Ceph): portable but higher latency |

Edge Cases (7+)

Edge Case 1: Pod moves to another node, PVC doesn’t follow PVC is bound to an availability zone. If Pod moves to a different zone, PVC can’t be mounted (EBS/Ceph zone-locked). Pod stuck in ContainerCreating. Solution: use WaitForFirstConsumer volumeBindingMode in StorageClass — PVC is created in the zone where Pod is scheduled.
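A sketch of such a StorageClass for the AWS EBS CSI driver (the name and reclaimPolicy are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait-for-consumer
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # bind the PVC only after the Pod is scheduled
reclaimPolicy: Retain
```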

Edge Case 2: Split-brain on network partition StatefulSet with 3 PostgreSQL replicas. Network partition: database-0 (master) on one side, database-1 and database-2 on the other. database-1 and database-2 elect a new master. Now two masters write data. On network recovery — data corruption. Solution: Patroni or another HA framework with consensus (etcd/Zookeeper).

Edge Case 3: StatefulSet Rolling Update with incompatible versions Upgrading Kafka from version 3.0 -> 3.5. StatefulSet updates Pods in reverse order: broker-2, broker-1, broker-0. During update, the cluster has mixed versions. If the replication protocol is incompatible, data loss or cluster failure. Solution: Operator with versioning and pre-flight checks.

Edge Case 4: Orphaned PVC after kubectl delete statefulset StatefulSet deleted, but 3 PVCs (10Gi each) remain. They consume storage but aren’t used. After a week, StorageClass quota exceeded. Solution: automate cleanup via finalizer or CronJob.

Edge Case 5: Readiness Probe + StatefulSet ordinal awareness All Pods have the same Readiness Probe. But database-0 (master) and database-1 (replica) have different readiness requirements. Replica may be “ready” only after full sync with master (which takes minutes). Standard readinessProbe doesn’t account for this. Solution: custom readiness script checking replication lag.

Edge Case 6: HPA with StatefulSet HPA can target a StatefulSet (it exposes the scale subresource), but autoscaling ignores StatefulSet specifics: ordered startup makes scale-up slow, and scale-down leaves orphaned PVCs that a later scale-up will silently reuse. Solution: KEDA (Kubernetes Event-driven Autoscaling) or deliberate manual scaling.

Edge Case 7: Pod ordinal reuse after deletion Delete database-2. StatefulSet with replicas=3 creates a new database-2. New Pod gets the same ordinal and the same PVC (data-database-2). But if PVC was manually deleted, a new PVC is created empty. Pod starts without data, and the cluster thinks it has a replica with data. Solution: never delete PVC without full cluster re-sync.

Edge Case 8: StatefulSet with podManagementPolicy: Parallel and failure On parallel creation, all 3 Pods start simultaneously. If database-0 (master) isn’t ready yet, database-1 and database-2 can’t connect to master. They enter crash loop. Solution: OrderedReady (default) for leader-follower architecture clusters.

Performance Numbers

| Metric | Value |
| --- | --- |
| Pod creation latency (ordered) | N * (scheduling + container startup + readiness) |
| StatefulSet startup, 3 replicas | 2-5 minutes (depends on container startup) |
| StatefulSet startup, 10 replicas (ordered) | 5-15 minutes |
| StatefulSet startup, 10 replicas (parallel) | 1-3 minutes |
| PVC creation + binding | 5-30 seconds (depends on provisioner) |
| DNS update latency (CoreDNS) | 1-5 seconds |
| Rolling Update (reverse order) | N * startup_time (one Pod at a time) |
| Local SSD IOPS | 100K-1M IOPS, <1 ms latency |
| Network storage (EBS) IOPS | 3K-64K IOPS, 1-10 ms latency |

Security

  • PVC data not encrypted by default — use StorageClass with encryption (AWS EBS encryption, Ceph encryption)
  • StatefulSet Pods have stable DNS names — this simplifies targeting a specific Pod for attack. NetworkPolicy should restrict access to each ordinal
  • Headless Service DNS enumeration — attacker can enumerate all Pods via DNS: pod-0.service, pod-1.service, … Don’t expose sensitive services via headless Service
  • PVC access modes — ReadWriteOnce allows mounting on only one node; ReadWriteMany allows several, increasing the attack surface
  • Pod Security Admission — StatefulSet Pods often require privileges (e.g., for databases). Use baseline or privileged only if necessary
  • Secrets for DB credentials — store in Secrets, mount as volumes (not env vars). Rotating secrets requires Rolling Update of StatefulSet
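As one mitigation, a NetworkPolicy can limit which workloads reach the database Pods at all; a hedged sketch assuming clients carry an app: backend label (label names illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-ingress
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend   # illustrative client label
      ports:
        - port: 5432
```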

Production War Story

Situation: Fintech company, PostgreSQL cluster on StatefulSet (3 replicas), EBS storage, AWS. StatefulSet configured with volumeClaimTemplates, StorageClass with volumeBindingMode: Immediate.

Incident:

  1. Node with database-0 (master) crashed (hardware failure)
  2. Kubernetes recreated database-0 on a new node
  3. EBS volume was in zone us-east-1a. New node was in us-east-1b
  4. PVC couldn’t mount: Multi-Attach error — EBS zone-locked
  5. database-0 stuck in ContainerCreating for 2 hours
  6. database-1 and database-2 (replicas) continued running in read-only mode
  7. All writes failed — master unavailable
  8. Manual intervention: snapshot EBS in us-east-1a -> restore in us-east-1b -> attach to PVC -> restart Pod
  9. Downtime: 2.5 hours, loss of ~$200K in transactions

Post-mortem and fix:

  1. StorageClass with volumeBindingMode: WaitForFirstConsumer — PVC created in Pod’s zone, not pre-zone-locked
  2. Pod Topology Spread Constraints — spread Pods across zones:
     topologySpreadConstraints:
       - maxSkew: 1
         topologyKey: topology.kubernetes.io/zone
         whenUnsatisfiable: DoNotSchedule
  3. Patroni for HA — automatic failover: if master is down, replica promoted to master
  4. EBS Multi-AZ storage — use AWS EBS with cross-zone replication (more expensive but more reliable)
  5. Automated backup — daily snapshot + point-in-time recovery via WAL-G
  6. Alert on Pod pending > 5 minutes — would have fired immediately on zone mismatch

Monitoring after fix:

# Alert: Pod stuck in Pending (covers ContainerCreating) > 5 minutes
sum(kube_pod_status_phase{phase="Pending"}) by (namespace) > 0

# Alert: PVC not mounted
sum(kube_persistentvolumeclaim_status_phase{phase!="Bound"}) > 0

# Alert: PostgreSQL replication lag
pg_replication_lag_seconds > 30

# Alert: Pod cross-zone reschedule
kube_pod_info{namespace="database"} unless on(node) kube_node_info

Monitoring (Prometheus/Grafana)

Key metrics:

# StatefulSet replicas status
kube_statefulset_status_replicas
kube_statefulset_status_replicas_ready
kube_statefulset_status_replicas_current
kube_statefulset_status_replicas_updated

# Pod status by ordinal (kube-state-metrics has no statefulset label; match by pod name)
kube_pod_status_phase{pod=~"database-.*"}

# PVC status
kube_persistentvolumeclaim_status_phase

# PVC storage usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

# PostgreSQL replication lag (via postgres_exporter)
pg_replication_lag_seconds

# Kafka broker status (via kafka_exporter)
kafka_brokers

# DNS resolution latency (CoreDNS exposes aggregate histograms, not per-service labels)
histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))

Grafana Dashboard panels:

  1. StatefulSet replicas: current vs ready vs desired — mismatch detection
  2. Pod status by ordinal — heatmap (0, 1, 2)
  3. PVC storage usage % — alert at > 80%
  4. Replication lag — critical for HA clusters
  5. Pod restart rate by ordinal — specific Pod crash loop detection
  6. DNS resolution latency — headless service problem detection

Highload Best Practices

  1. Use Operator, not raw StatefulSet — Zalando Postgres Operator, Strimzi Kafka Operator. They automate backup, failover, scaling, version upgrades
  2. volumeBindingMode: WaitForFirstConsumer — PVC created in Pod’s zone, not zone-locked in advance
  3. Pod Topology Spread Constraints — spread Pods across zones and nodes for fault tolerance
  4. Separate master and replica onto different StorageClasses — master: high-performance SSD, replica: cheaper storage
  5. Automated backup — daily snapshot + continuous WAL archiving (WAL-G, barman)
  6. Monitor replication lag — alert at lag > 30 seconds
  7. PodDisruptionBudget — minAvailable: 2 for a 3-replica cluster so quorum is not lost
  8. Readiness Probe with replication awareness — replica ready only after sync with master
  9. podManagementPolicy: Parallel only for stateless data nodes — for master-replica clusters always OrderedReady
  10. Don’t use StatefulSet for production DB without Operator — StatefulSet gives identity but doesn’t automate failover, backup, or recovery
  11. Storage IOPS monitoring — alert when volume IOPS approaching limit
  12. Regular failover testing — Chaos Engineering: kill master Pod, verify automatic failover
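Practice 8 above can be sketched as a readiness script; the lag value is assumed to come from the database (in PostgreSQL, e.g. via pg_last_xact_replay_timestamp()), and the threshold is illustrative:

```shell
# replication_ready: succeed only when replication lag is under the threshold.
# LAG would be obtained from the database in a real probe, e.g.:
#   psql -tAc "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int"
replication_ready() {
  lag="$1"
  threshold="${2:-30}"
  [ "$lag" -le "$threshold" ]
}

if replication_ready 5 30; then echo "ready"; else echo "not ready"; fi   # prints "ready"
```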

Interview Cheat Sheet

Must know:

  • StatefulSet — controller for stateful applications: stable names (pod-0, pod-1), DNS, PVC
  • Pods created in order (0->1->2), deleted in reverse (2->1->0)
  • Headless Service (clusterIP: None) required — DNS resolves each Pod individually
  • volumeClaimTemplates — separate PVC per Pod; PVC not deleted on StatefulSet deletion
  • For production DB, use Operator (Zalando Postgres, Strimzi Kafka), not raw StatefulSet
  • WaitForFirstConsumer in StorageClass — PVC created in Pod’s zone, not zone-locked
  • Rolling Update in reverse order (N->0); partition for canary update

Common follow-up questions:

  • “StatefulSet vs Deployment?” — Deployment: all Pods identical; StatefulSet: unique identity + stable storage
  • “Is PVC deleted on kubectl delete statefulset?” — No, PVCs are orphaned; must be deleted manually
  • “Does HPA work with StatefulSet?” — Yes, via the scale subresource, but it ignores ordering and leaves orphaned PVCs on scale-down; KEDA is often a better fit
  • “Split-brain in StatefulSet?” — On network partition, dual master is possible; need Patroni/consensus

Red flags (DO NOT say):

  • “StatefulSet for stateless applications” (excessive, use Deployment)
  • “StatefulSet configures DB cluster itself” (only identity; needs setup via Init/Operator)
  • “PVC deleted automatically” (orphaned PVC — common problem)
  • “StatefulSet = fast scaling” (ordered startup is slow)

Related topics:

  • [[What is Pod in Kubernetes]] — unit of scheduling
  • [[How to organize rolling update in Kubernetes]] — Pod updates
  • [[How does scaling work in Kubernetes]] — KEDA for StatefulSet