What is StatefulSet and When to Use It?
Junior Level
Simple Definition
StatefulSet is a controller in Kubernetes for managing applications that need to preserve their identity and data across restarts. Unlike a Deployment (where all Pods are identical and interchangeable), each Pod in a StatefulSet has a unique name, stable DNS, and its own disk.
StatefulSet – like Deployment, but each Pod gets a stable name (web-0, web-1, web-2) and stable storage (its own PersistentVolume). Pods are created and deleted in order.
Analogy
Deployment is like a taxi: any car can pick you up, all are interchangeable. StatefulSet is like a train: each car has its number, sits in a specific place, and if car #3 is removed, the train can’t just put any other in its place — it needs car #3 specifically.
YAML Example
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database
spec:
  serviceName: database-headless
  replicas: 3
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
        - name: db
          image: postgres:15
          ports:
            - name: postgres
              containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
---
# Headless Service for the StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: database-headless
spec:
  clusterIP: None  # Headless!
  selector:
    app: database
  ports:
    - port: 5432
```
Headless Service (clusterIP: None) – K8s doesn’t create a virtual IP. Instead, a DNS query returns each Pod’s IP directly. Required for a StatefulSet: each Pod must be reachable by its own name.
kubectl Example
```bash
# List StatefulSets
kubectl get statefulset database

# List Pods with their unique names
kubectl get pods -l app=database
# database-0   Running
# database-1   Running
# database-2   Running

# DNS name of a specific Pod:
# database-0.database-headless.default.svc.cluster.local

# Scale (Pods are added in order!)
kubectl scale statefulset database --replicas=5
```
When to Use
- Databases: PostgreSQL, MySQL, MongoDB
- Distributed systems: Kafka, Zookeeper, Cassandra, Elasticsearch
- Any application where each instance has a unique role (master/replica)
- When data must be bound to a specific Pod
StatefulSet is needed when: (1) Pod identity matters (web-0 != web-1), (2) stable storage is needed, (3) launch order matters (web-0 -> web-1 -> web-2).
Middle Level
How it Works
StatefulSet guarantees four key properties:
- Stable network identity: Pods are named `<name>-<ordinal>` (database-0, database-1, database-2). Each Pod gets a DNS name: `database-0.database-headless.default.svc.cluster.local`. On restart, a Pod keeps its name.
- Stable storage: through `volumeClaimTemplates`, Kubernetes creates a separate PVC for each Pod (`data-database-0`, `data-database-1`). The PVC is not deleted when the Pod or the StatefulSet is deleted; on Pod recreation, the same PVC is mounted.
- Ordered deployment: Pods are created strictly in order (0 -> 1 -> 2) and deleted in reverse (2 -> 1 -> 0). Pod N is not created until Pod N-1 is `Running` and `Ready`.
- Headless Service: a StatefulSet requires a Service with `clusterIP: None`. This Service doesn’t get a virtual IP but allows DNS to resolve each Pod individually.
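The naming scheme above is fully deterministic, which is easy to show in code. A minimal Python sketch (the helper name is hypothetical, not a Kubernetes API):

```python
def pod_dns_names(statefulset: str, service: str, namespace: str,
                  replicas: int, cluster_domain: str = "cluster.local") -> list:
    """Stable DNS name of every Pod in a StatefulSet:
    <statefulset>-<ordinal>.<service>.<namespace>.svc.<cluster-domain>"""
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]

print(pod_dns_names("database", "database-headless", "default", 3)[0])
# database-0.database-headless.default.svc.cluster.local
```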
Practical Scenarios
Scenario 1: PostgreSQL Master-Replica cluster
database-0 -> Master (read/write)
database-1 -> Replica (read-only)
database-2 -> Replica (read-only)
Each Pod knows its role via ordinal. Init container determines: if ordinal=0, become master; if >0, connect to master as replica.
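The ordinal-based role decision can be sketched as follows; in a real cluster this logic would live in an init container script, and the function name here is made up for illustration:

```python
def role_from_hostname(hostname: str) -> str:
    """Derive a Pod's role from its StatefulSet ordinal:
    ordinal 0 becomes the master, everything else a replica."""
    ordinal = int(hostname.rsplit("-", 1)[-1])  # "database-2" -> 2
    return "master" if ordinal == 0 else "replica"

print(role_from_hostname("database-0"))  # master
print(role_from_hostname("database-2"))  # replica
```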
Scenario 2: Kafka cluster
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          # broker.id = pod ordinal, extracted from POD_NAME at startup,
          # e.g. command: export KAFKA_BROKER_ID=${POD_NAME##*-}
```
Each Kafka broker needs a unique broker.id equal to its ordinal; this is critical for Kafka internals. Note that fieldRef exposes the full Pod name (kafka-0), not the number, so a startup script must strip the prefix to obtain the ordinal.
Scenario 3: Parallel scaling
```yaml
spec:
  podManagementPolicy: Parallel  # instead of OrderedReady (default)
```
For applications that don’t need strict ordering (e.g., Elasticsearch data nodes), Parallel speeds up scaling.
Common Mistakes Table
| Mistake | Consequence | Solution |
|---|---|---|
| Missing Headless Service | DNS doesn’t resolve individual Pods; StatefulSet doesn’t work correctly | Always create a Service with clusterIP: None |
| PVC not deleted on kubectl delete statefulset | “Orphaned” PVCs consume storage | Delete PVCs manually: kubectl delete pvc -l app=database |
| Using the :latest tag | Rolling Update doesn’t work; recreated Pods may get a different image | Use specific tags |
| Expecting StatefulSet to configure the cluster itself | StatefulSet only guarantees identity; it doesn’t configure the DB/cluster | Use init containers or Operators |
| StorageClass without dynamic provisioning | PVCs stuck in Pending; Pods don’t start | Configure a StorageClass with a provisioner |
| StatefulSet for a stateless application | Unnecessary complexity; ordered startup slows deployment | Use a Deployment |
Comparison: StatefulSet vs Deployment vs DaemonSet
| Characteristic | StatefulSet | Deployment | DaemonSet |
|---|---|---|---|
| Pod identity | Unique (name, DNS, PVC) | All identical | All identical |
| Creation order | Strict (0->1->2) | Parallel | Parallel |
| Storage | Stable PVC per Pod | Shared PVC or ephemeral | hostPath or local volume |
| DNS | Individual (pod-N.service) | One Service IP | One Service IP |
| Scaling | Ordered (slow) | Parallel (fast) | One replica per node |
| When to use | Databases, queues, clusters | Stateless APIs, web apps | Monitoring, log agents |
| Rolling Update | Reverse order (N->0), partition support | Parallel, surge-based | Node by node |
When NOT to Use
- Stateless applications — use Deployment. StatefulSet adds complexity without benefit
- When all Pods are identical — if identity doesn’t matter, Deployment is simpler and faster
- Need fast scaling — ordered startup of StatefulSet is slow. For 100 replicas, this takes 100 * startup_time
- State stored in external DB — if the application is stateless and state is in PostgreSQL, use Deployment
Senior Level
Deep Mechanics: StatefulSet Controller, PVC, and Reconciliation
StatefulSet Controller Architecture: StatefulSet Controller (in kube-controller-manager) works via reconciliation loop, but is significantly more complex than Deployment Controller:
- Watch: Subscribes to StatefulSet, Pod, PVC events
- Ordinal Management: Maintains set of ordinals (0..N-1). Each ordinal = Pod + PVC
- Ordered Creation:

```
for i = 0 to replicas-1:
    if Pod-<i> doesn't exist -> create it
    if Pod-<i> is not Ready  -> wait
    if Pod-<i> is Ready      -> move on to i+1
```

- PVC Binding: for each Pod, a PVC is created from `volumeClaimTemplates`, named `<volumeClaimTemplate-name>-<statefulset-name>-<ordinal>`. The PVC is created before the Pod so kubelet can mount the volume before container startup.
- Pod Identity: when a Pod is recreated (possibly on another node), the controller finds the existing PVC by its deterministic name and attaches it to the new Pod. The PVC is never deleted automatically.
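The deterministic PVC naming is simple enough to express directly; a small sketch (helper name hypothetical):

```python
def pvc_name(template: str, statefulset: str, ordinal: int) -> str:
    """PVC name generated from volumeClaimTemplates:
    <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>"""
    return f"{template}-{statefulset}-{ordinal}"

print(pvc_name("data", "database", 0))  # data-database-0
```

Because the name depends only on the template, the StatefulSet, and the ordinal, a recreated Pod always finds its old volume.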
Headless Service DNS:
Headless Service (clusterIP: None) doesn’t create iptables/IPVS rules. Instead, CoreDNS creates DNS records for each Pod:
database-0.database-headless.default.svc.cluster.local → 10.244.1.5
database-1.database-headless.default.svc.cluster.local → 10.244.2.7
database-2.database-headless.default.svc.cluster.local → 10.244.3.9
DNS updates on Pod IP change (via EndpointSlice).
PersistentVolume Binding:
PV is bound to the PVC via claimRef. On Pod deletion, the PVC is preserved. On StatefulSet deletion, the PVC is also preserved (orphaned). The PV’s reclaimPolicy (Retain or Delete; Recycle is deprecated) determines the fate of the data once the PVC itself is deleted.
Update Strategy:
```yaml
spec:
  updateStrategy:
    type: RollingUpdate  # or OnDelete
    rollingUpdate:
      partition: 0  # only Pods with ordinal >= partition are updated
```

- RollingUpdate: updates Pods in reverse order (N->0), one at a time
- OnDelete: a Pod is updated only after manual deletion
- partition: updates only a subset of Pods (canary deployments within a StatefulSet)
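The partition semantics can be illustrated with a short sketch (function name hypothetical):

```python
def update_order(replicas: int, partition: int = 0) -> list:
    """Ordinals touched by a RollingUpdate, highest first.
    Pods with ordinal < partition stay on the old revision."""
    return list(range(replicas - 1, partition - 1, -1))

print(update_order(5, partition=3))  # [4, 3]  <- canary on two Pods
print(update_order(3))               # [2, 1, 0]
```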
Trade-offs
| Aspect | Trade-off |
|---|---|
| OrderedReady vs Parallel | OrderedReady = safer for clusters with leader election. Parallel = faster for data nodes |
| Retain vs Delete PVC | Retain = data is safe, but needs manual management. Delete = automatic, but risk of data loss |
| StatefulSet vs Operator | StatefulSet = basic identity + storage. Operator = full automation (backup, failover, scaling). Operator is more complex but powerful |
| Replica count | Small (3) = fewer resources, but less fault tolerant. Large (7+) = more replica lag, slower writes |
| StorageClass local vs network | Local SSD = faster (IOPS), but no migration between nodes. Network (EBS, Ceph) = portable, but higher latency |
Edge Cases (7+)
Edge Case 1: Pod moves to another node, PVC doesn’t follow
PVC is bound to an availability zone. If Pod moves to a different zone, PVC can’t be mounted (EBS/Ceph zone-locked). Pod stuck in ContainerCreating. Solution: use WaitForFirstConsumer volumeBindingMode in StorageClass — PVC is created in the zone where Pod is scheduled.
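A StorageClass with delayed binding might look like this sketch; the provisioner shown is the AWS EBS CSI driver, substitute your own:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-wait               # name is illustrative
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # bind PVC only after the Pod is scheduled
```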
Edge Case 2: Split-brain on network partition StatefulSet with 3 PostgreSQL replicas. Network partition: database-0 (master) on one side, database-1 and database-2 on the other. database-1 and database-2 elect a new master. Now two masters write data. On network recovery — data corruption. Solution: Patroni or another HA framework with consensus (etcd/Zookeeper).
Edge Case 3: StatefulSet Rolling Update with incompatible versions Upgrading Kafka from version 3.0 -> 3.5. StatefulSet updates Pods in reverse order: broker-2, broker-1, broker-0. During update, the cluster has mixed versions. If the replication protocol is incompatible, data loss or cluster failure. Solution: Operator with versioning and pre-flight checks.
Edge Case 4: Orphaned PVC after kubectl delete statefulset
StatefulSet deleted, but 3 PVCs (10Gi each) remain. They consume storage but aren’t used. After a week, StorageClass quota exceeded. Solution: automate cleanup via finalizer or CronJob.
Edge Case 5: Readiness Probe + StatefulSet ordinal awareness All Pods have the same Readiness Probe. But database-0 (master) and database-1 (replica) have different readiness requirements. Replica may be “ready” only after full sync with master (which takes minutes). Standard readinessProbe doesn’t account for this. Solution: custom readiness script checking replication lag.
Edge Case 6: HPA with StatefulSet
HPA can target a StatefulSet because it exposes the scale subresource, but autoscaling a stateful workload is risky: scale-down always removes the highest ordinals regardless of data placement, and OrderedReady makes scale-up slow. Solution: KEDA (Kubernetes Event-driven Autoscaling) or application-aware scaling logic.
Edge Case 7: Pod ordinal reuse after deletion
Delete database-2. StatefulSet with replicas=3 creates a new database-2. New Pod gets the same ordinal and the same PVC (data-database-2). But if PVC was manually deleted, a new PVC is created empty. Pod starts without data, and the cluster thinks it has a replica with data. Solution: never delete PVC without full cluster re-sync.
Edge Case 8: StatefulSet with podManagementPolicy: Parallel and failure
On parallel creation, all 3 Pods start simultaneously. If database-0 (master) isn’t ready yet, database-1 and database-2 can’t connect to master. They enter crash loop. Solution: OrderedReady (default) for leader-follower architecture clusters.
Performance Numbers
| Metric | Value |
|---|---|
| Pod creation latency (ordered) | N * (scheduling + container startup + readiness) |
| StatefulSet 3 replicas startup | 2-5 minutes (depends on container startup) |
| StatefulSet 10 replicas startup (ordered) | 5-15 minutes |
| StatefulSet 10 replicas startup (parallel) | 1-3 minutes |
| PVC creation + binding | 5-30 seconds (depends on provisioner) |
| DNS update latency (CoreDNS) | 1-5 seconds |
| Rolling Update (reverse order) | N * startup_time (updates one at a time) |
| Local SSD IOPS | 100K-1M IOPS, <1ms latency |
| Network storage (EBS) IOPS | 3K-64K IOPS, 1-10ms latency |
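The two startup rows follow directly from the ordering model; a back-of-the-envelope sketch (numbers are illustrative):

```python
def ordered_startup(replicas: int, per_pod_s: float) -> float:
    """OrderedReady: each Pod waits for the previous one to become Ready."""
    return replicas * per_pod_s

def parallel_startup(replicas: int, per_pod_s: float) -> float:
    """Parallel: bounded by the slowest Pod (scheduler overhead ignored)."""
    return per_pod_s

print(ordered_startup(10, 60))   # 600 (seconds)
print(parallel_startup(10, 60))  # 60
```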
Security
- PVC data is not encrypted by default — use a StorageClass with encryption (AWS EBS encryption, Ceph encryption)
- StatefulSet Pods have stable DNS names — this simplifies targeting a specific Pod for attack; NetworkPolicy should restrict access to each ordinal
- Headless Service DNS enumeration — an attacker can enumerate all Pods via DNS (pod-0.service, pod-1.service, …); don’t expose sensitive services via a headless Service
- PVC access modes — ReadWriteOnce allows mounting on only one node; ReadWriteMany allows several, increasing the attack surface
- Pod Security Admission — StatefulSet Pods often require privileges (e.g., for databases); use baseline or privileged only when necessary
- Secrets for DB credentials — store them in Secrets and mount as volumes (not env vars); rotating secrets requires a Rolling Update of the StatefulSet
Production War Story
Situation: Fintech company, PostgreSQL cluster on StatefulSet (3 replicas), EBS storage, AWS. StatefulSet configured with volumeClaimTemplates, StorageClass with volumeBindingMode: Immediate.
Incident:
- The node running database-0 (master) crashed (hardware failure)
- Kubernetes recreated database-0 on a new node
- The EBS volume was in zone us-east-1a; the new node was in us-east-1b
- The PVC couldn’t mount (Multi-Attach error) — EBS volumes are zone-locked
- database-0 was stuck in ContainerCreating for 2 hours
- database-1 and database-2 (replicas) continued running in read-only mode
- All writes failed — the master was unavailable
- Manual intervention: snapshot the EBS volume in us-east-1a -> restore in us-east-1b -> attach to the PVC -> restart the Pod
- Downtime: 2.5 hours, ~$200K in lost transactions
Post-mortem and fix:
- StorageClass with volumeBindingMode: WaitForFirstConsumer — the PVC is now created in the zone where the Pod is scheduled, not pre-bound to a zone
- Pod Topology Spread Constraints — spread Pods across zones:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:        # labelSelector is required
      matchLabels:
        app: database
```

- Patroni for HA — automatic failover: if the master goes down, a replica is promoted to master
- EBS Multi-AZ storage — AWS EBS with cross-zone replication (more expensive but more reliable)
- Automated backup — daily snapshots + point-in-time recovery via WAL-G
- Alert on Pod pending > 5 minutes — would have fired immediately on the zone mismatch
Monitoring after fix:
```promql
# Alert (use for: 5m): Pods stuck in Pending (includes ContainerCreating)
sum by (namespace, pod) (kube_pod_status_phase{phase="Pending"}) > 0

# Alert: PVC not bound
sum(kube_persistentvolumeclaim_status_phase{phase!="Bound"}) > 0

# Alert: PostgreSQL replication lag
pg_replication_lag_seconds > 30

# Alert: Pod scheduled on a node missing from kube_node_info
# (catches reschedules onto unexpected nodes)
kube_pod_info{namespace="database"} unless on(node) kube_node_info
```
Monitoring (Prometheus/Grafana)
Key metrics:
```promql
# StatefulSet replica status
kube_statefulset_status_replicas
kube_statefulset_status_replicas_ready
kube_statefulset_status_replicas_current
kube_statefulset_status_replicas_updated

# Pod status by ordinal (kube_pod_status_phase has no statefulset label;
# filter by pod name instead)
kube_pod_status_phase{pod=~"database-[0-9]+"}

# PVC status
kube_persistentvolumeclaim_status_phase

# PVC storage usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

# PostgreSQL replication lag (via postgres_exporter)
pg_replication_lag_seconds

# Kafka broker count (via kafka_exporter)
kafka_brokers

# DNS request latency (CoreDNS exposes per-server/zone labels, not per-service)
coredns_dns_request_duration_seconds_bucket
```
Grafana Dashboard panels:
- StatefulSet replicas: current vs ready vs desired — mismatch detection
- Pod status by ordinal — heatmap (0, 1, 2)
- PVC storage usage % — alert at > 80%
- Replication lag — critical for HA clusters
- Pod restart rate by ordinal — specific Pod crash loop detection
- DNS resolution latency — headless service problem detection
Highload Best Practices
- Use an Operator, not a raw StatefulSet — Zalando Postgres Operator, Strimzi Kafka Operator; they automate backup, failover, scaling, and version upgrades
- volumeBindingMode: WaitForFirstConsumer — the PVC is created in the Pod’s zone, not zone-locked in advance
- Pod Topology Spread Constraints — spread Pods across zones and nodes for fault tolerance
- Separate master and replicas onto different StorageClasses — master: high-performance SSD; replicas: cheaper storage
- Automated backup — daily snapshots + continuous WAL archiving (WAL-G, barman)
- Monitor replication lag — alert at lag > 30 seconds
- PodDisruptionBudget — minAvailable: 2 for a 3-replica cluster so it never loses quorum
- Readiness Probe with replication awareness — a replica is ready only after syncing with the master
- podManagementPolicy: Parallel only for homogeneous data nodes — for master-replica clusters always OrderedReady
- Don’t run a production DB on a raw StatefulSet — it gives identity but doesn’t automate failover, backup, or recovery
- Storage IOPS monitoring — alert when volume IOPS approach the limit
- Regular failover testing — Chaos Engineering: kill the master Pod and verify automatic failover
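The PodDisruptionBudget item above could be written as the following sketch, assuming the `app: database` label from the earlier examples:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: database-pdb
spec:
  minAvailable: 2        # never evict below quorum of a 3-replica cluster
  selector:
    matchLabels:
      app: database
```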
Interview Cheat Sheet
Must know:
- StatefulSet — controller for stateful applications: stable names (pod-0, pod-1), DNS, PVC
- Pods are created in order (0->1->2) and deleted in reverse (2->1->0)
- A Headless Service (clusterIP: None) is required — DNS resolves each Pod individually
- volumeClaimTemplates — a separate PVC per Pod; PVCs are not deleted with the StatefulSet
- For a production DB, use an Operator (Zalando Postgres, Strimzi Kafka), not a raw StatefulSet
- WaitForFirstConsumer in the StorageClass — the PVC is created in the Pod’s zone, not zone-locked in advance
- Rolling Update runs in reverse order (N->0); partition enables canary updates
Common follow-up questions:
- “StatefulSet vs Deployment?” — Deployment: all Pods identical; StatefulSet: unique identity + stable storage
- “Is the PVC deleted on kubectl delete statefulset?” — No, PVCs are orphaned and must be deleted manually
- “Does HPA work with StatefulSet?” — It can target the scale subresource, but scaling ignores ordering and data constraints; KEDA is a better fit
- “Split-brain in StatefulSet?” — A network partition can produce two masters; you need Patroni or another consensus-based framework
Red flags (DO NOT say):
- “StatefulSet for stateless applications” (excessive, use Deployment)
- “StatefulSet configures DB cluster itself” (only identity; needs setup via Init/Operator)
- “PVC deleted automatically” (orphaned PVC — common problem)
- “StatefulSet = fast scaling” (ordered startup is slow)
Related topics:
- [[What is Pod in Kubernetes]] — unit of scheduling
- [[How to organize rolling update in Kubernetes]] — Pod updates
- [[How does scaling work in Kubernetes]] — KEDA for StatefulSet