What is Readiness Probe?
Readiness probe -- K8s checks whether the Pod is ready to accept traffic. If probe fails -- the Pod is removed from Service endpoints, but is NOT restarted.
Junior Level
Simple Definition
Readiness Probe is a check in Kubernetes that determines whether a Pod is ready to accept incoming traffic. If the probe fails — Kubernetes removes the Pod from the load balancer, but does not restart it.
Analogy
Imagine a store. The doors are open (Pod is running), but the cashier hasn’t been trained yet, registers aren’t connected. The manager (Readiness Probe) says: “Don’t accept customers yet.” Once everything is ready — the doors open for shoppers.
YAML Example
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:1.0
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 3
```
kubectl Example
```bash
# Check Pod readiness status
kubectl get pods
# READY 1/1 — ready; READY 0/1 — not ready (STATUS stays Running either way)
kubectl describe pod my-app | grep -A 10 Readiness
```
When to Use
- Application needs warmup time (cache loading, DB initialization)
- During deployment, so traffic only goes to ready Pods
- If the application can temporarily “opt out” under overload
Middle Level
How it Works
Readiness Probe is executed by the kubelet on each node. The kubelet polls the specified endpoint at periodSeconds intervals. On a successful response, the Pod gets Ready: True status and its IP is added to the Endpoints (or EndpointSlice) of all services that select this Pod. On failure — the IP is removed from Endpoints.
Types of checks:
- `httpGet` — HTTP GET request; a 2xx/3xx response counts as success
- `tcpSocket` — checks that the TCP port accepts connections
- `exec` — executes a command inside the container; exit code 0 means success
- `grpc` — gRPC health check (Kubernetes 1.24+)
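Each check type is a different handler under the same `readinessProbe` field. A sketch of the three non-HTTP variants (ports and the command path are illustrative — a real Pod spec uses exactly one handler per probe):

```yaml
# Variant 1 — tcpSocket: succeeds if the port accepts a TCP connection
readinessProbe:
  tcpSocket:
    port: 5432

# Variant 2 — exec: succeeds if the command exits with code 0
readinessProbe:
  exec:
    command: ["cat", "/tmp/ready"]

# Variant 3 — grpc: uses the standard gRPC Health Checking Protocol (K8s 1.24+)
readinessProbe:
  grpc:
    port: 9090
```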
Practical Scenarios
Scenario 1: Spring Boot with Actuator
```yaml
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```
Scenario 2: Temporary opt-out under overload
If the task queue exceeds the limit, the application can return 503 on /health/ready to temporarily remove itself from load balancing and reduce load.
Scenario 3: Rolling Update
Kubernetes will not start removing old Pods until the new ones pass the Readiness Probe. Combined with a suitable update strategy, this enables zero-downtime deployments.
Common Mistakes Table
| Mistake | Consequence | Solution |
|---|---|---|
| Checking DB in Readiness during cascading failure | All Pods disconnect simultaneously, 503 for everyone | Return Ready if the application can work partially (cache, fallback) |
| Too frequent periodSeconds (1 sec) | Unnecessary load on the application | Set 3-10 seconds |
| Missing readinessProbe | Traffic goes to not-yet-ready Pods, 502/503 errors | Always add for stateful applications |
| Checking only TCP port | Application listens on port but doesn’t process requests | Use httpGet with a real health endpoint |
Comparison: Liveness vs Readiness
| Characteristic | Liveness Probe | Readiness Probe |
|---|---|---|
| Action on failure | Container restart | Removal from Service Endpoints |
| Purpose | Detect hangs, deadlocks | Ensure traffic goes only to ready Pods |
| Dependency checking | Not recommended | Recommended (DB, external APIs) |
| Impact on Rolling Update | No effect | Blocks update until ready |
| startupProbe interaction | Disabled until startupProbe succeeds | Disabled until startupProbe succeeds |
Failed readiness: Pod is removed from load balancing but continues running. Failed liveness: Pod is restarted. This is a critical distinction.
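The distinction is easiest to see with both probes on one container, pointed at separate endpoints (paths and intervals are illustrative):

```yaml
containers:
- name: app
  image: my-app:1.0
  livenessProbe:          # failure => container restart
    httpGet:
      path: /health/live  # should check only "is the process alive"
      port: 8080
    periodSeconds: 10
  readinessProbe:         # failure => removal from Service endpoints
    httpGet:
      path: /health/ready # may check warmup state, degraded mode, etc.
      port: 8080
    periodSeconds: 5
```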
When NOT to Use
- Stateless applications without warmup: If the application is ready instantly after startup, Readiness Probe only adds delay
- Background workers without incoming traffic: If the Pod doesn’t serve HTTP requests, Readiness makes no sense (though it can be useful for monitoring)
- Single-replica Deployments with maxUnavailable=1: If the only Pod is removed from traffic, the service becomes completely unavailable
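For the single-replica case above, one mitigation is to make the rollout bring up the replacement before the old Pod is taken out of traffic — a sketch using the standard Deployment strategy fields:

```yaml
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # never remove the only Pod before a replacement is Ready
      maxSurge: 1        # start the new Pod first; traffic shifts once it passes readiness
```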
Startup probe — a separate probe for slow-starting applications. Until the startup probe succeeds, liveness and readiness probes are not executed. Typical use case: Java applications with long startup times.
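A sketch of the combination (thresholds are illustrative — here the application gets up to 30 × 10 = 300 seconds to start before liveness/readiness kick in):

```yaml
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  failureThreshold: 30   # tolerate up to 30 failed attempts during startup
  periodSeconds: 10
readinessProbe:          # begins only after startupProbe succeeds
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```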
Senior Level
Deep Mechanics: kubelet and EndpointSlice
Readiness probes are executed by the kubelet in separate goroutines: the kubelet's probe manager runs one worker per probe. `httpGet` and `tcpSocket` checks are performed by the kubelet itself against the Pod IP; `exec` checks run inside the container via the CRI (Container Runtime Interface). Results are aggregated into the container status.
When the kubelet detects a Readiness status change, it updates PodStatus in the API Server. Then the following chain is triggered:
- The EndpointSlice controller (in kube-controller-manager) watches Pod changes
- On a `Ready` condition change, the controller recalculates the EndpointSlice
- kube-proxy (via iptables/IPVS) updates load-balancing rules
- Traffic stops (or starts) being directed to the Pod
Latency chain: kubelet detect (~periodSeconds) → API Server write (~10-50ms) → EndpointSlice reconcile (~500ms) → kube-proxy sync (~1s) → traffic switched. Total: 2-5 seconds from probe failure to actual traffic cessation.
Trade-offs
| Aspect | Trade-off |
|---|---|
| Check frequency | Frequent probes = faster reaction, but higher load on application and API Server |
| Dependency checking | Checking DB = honest status, but risk of cascading failure |
| failureThreshold | Low = fast removal, but sensitive to network fluctuations. High = more stable, but slower reaction |
| httpGet vs tcpSocket | httpGet is more accurate but more expensive. tcpSocket is faster but doesn’t check business logic |
Edge Cases (6+)
Edge Case 1: Partial Readiness
The application has 10 endpoints. 9 work, 1 depends on a failed DB. A Readiness Probe on one endpoint either “kills” the whole Pod (if it checks the DB) or “lies” (if it doesn’t). Solution: a well-designed health endpoint strategy with gradation.
Edge Case 2: Pod Termination During Node Draining
On kubectl drain, the node is cordoned and Pods are evicted: they receive SIGTERM, and once a Pod enters Terminating it is removed from Endpoints regardless of probe results. If the application does not exit within terminationGracePeriodSeconds, the kubelet sends SIGKILL.
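A common pattern for draining in-flight requests during termination — a sketch with illustrative values, combining the grace period with a small delay so Endpoints/kube-proxy stop sending new traffic before the app shuts down:

```yaml
spec:
  terminationGracePeriodSeconds: 60  # SIGKILL deadline after SIGTERM
  containers:
  - name: app
    image: my-app:1.0
    lifecycle:
      preStop:
        exec:
          # delay SIGTERM delivery so endpoint removal propagates first
          command: ["sh", "-c", "sleep 10"]
```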
Edge Case 3: Pod in Pending with Readiness
A Pod in Pending state (resources not allocated) will never start executing Readiness Probe — the kubelet doesn’t start probes until the container transitions to Running.
Edge Case 4: Race Condition During HPA Scaling
HPA scales the Deployment, new Pods are created. If the Readiness Probe has a large initialDelaySeconds, HPA may decide there aren’t enough replicas and create more. This leads to over-provisioning. Solution: adequate timeouts + behavior.stabilizationWindowSeconds in HPA.
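The stabilization window mentioned above lives under `behavior` in the `autoscaling/v2` API — a sketch (target name and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # wait before reacting to new metrics,
                                      # giving new Pods time to become Ready
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```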
Edge Case 5: Readiness Gate
Kubernetes 1.14+ supports Readiness Gates — external conditions that must be True for PodReady. For example, a Service Mesh controller may set a readiness gate only after registering the Pod in the mesh. This adds an external dependency to the standard kubelet check.
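In the Pod spec this is the `readinessGates` field; the condition itself is set by an external controller via the Pod status (the condition type below is a hypothetical example of a mesh controller's condition):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  readinessGates:
  - conditionType: "mesh.example.com/registered"  # hypothetical; set by an external controller
  containers:
  - name: app
    image: my-app:1.0
```

The Pod is reported Ready only when the kubelet probe passes AND every readiness-gate condition is True.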
Edge Case 6: Headless Service and Readiness
For Headless Services (clusterIP: None), removing a Pod from Endpoints affects DNS resolution. Clients caching DNS may continue sending traffic to the unavailable Pod until TTL expires.
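A headless Service sketch for reference — note the `publishNotReadyAddresses` field, which inverts the readiness behavior for DNS (commonly used for StatefulSet peer discovery):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-headless
spec:
  clusterIP: None      # headless: DNS returns Pod IPs directly, no kube-proxy VIP
  selector:
    app: my-app
  ports:
  - port: 8080
  # publishNotReadyAddresses: true  # would keep not-ready Pods in DNS answers
```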
Performance Numbers
| Metric | Value |
|---|---|
| kubelet probe overhead | ~1-5ms CPU per httpGet probe |
| API Server update latency | 10-100ms on PodStatus update |
| EndpointSlice propagation | 500ms-2s to kube-proxy sync |
| Full traffic removal | 2-5 seconds from probe failure to traffic stop |
| kube-proxy iptables sync | ~1s for 1000 Services, ~5s for 5000 Services |
| kube-proxy IPVS sync | ~100ms for 1000 Services |
Security
- The Readiness Probe endpoint must not be public — it should only be accessible from within the cluster
- If the probe uses `httpGet`, make sure the endpoint doesn’t expose internal information (versions, configuration)
- Do not use `exec` probes with privileged commands — this is a potential escalation vector
- In multi-tenant clusters: NetworkPolicy should restrict access to health endpoints to the kubelet only
Production War Story
Situation: Large e-commerce cluster, Black Friday. All Pods have a Readiness Probe checking PostgreSQL. During the peak, the DB started slowing down due to write lock contention. The probe timeout exceeded failureThreshold, and all 200 Pods simultaneously disconnected from traffic.
Result: Complete 503 for 90 seconds until the DB stabilized and Pods returned to Endpoints. Loss of ~$500K in revenue.
Post-mortem and fix:
- Readiness Probe changed to check only internal state (thread pool, memory), without DB dependency
- Circuit Breaker added for DB-dependent operations
- Separate `/health/ready` without DB and `/health/degraded` with DB — when the DB has issues, the application serves data from cache
- Prometheus alert configured on `kube_endpoint_address_available < expected`
Monitoring (Prometheus/Grafana)
Key metrics:
```
# Pods not ready for traffic
sum(kube_pod_status_ready{condition="false"}) by (namespace)

# Readiness probe failure rate (kubelet prober metrics)
rate(prober_probe_total{probe_type="Readiness", result="failed"}[5m])

# Readiness probe success ratio
sum(rate(prober_probe_total{probe_type="Readiness", result="successful"}[5m]))
  / sum(rate(prober_probe_total{probe_type="Readiness"}[5m]))

# Endpoint addresses that are not ready (kube-state-metrics)
kube_endpoint_address_available{condition="not_ready"}
```
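An alert for the “all Pods disconnect at once” failure mode from the war story could be sketched as a Prometheus alerting rule (namespace label and thresholds are illustrative):

```yaml
groups:
- name: readiness
  rules:
  - alert: ServiceHasNoReadyEndpoints
    # fires when zero Pods in the namespace report Ready — every request gets 503
    expr: sum(kube_pod_status_ready{condition="true", namespace="prod"}) == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "No Ready Pods in prod — all traffic will receive 503"
```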
Grafana Dashboard:
- Panel 1: Number of Pods in Ready/Not Ready state (by namespace)
- Panel 2: Readiness Probe success rate over time
- Panel 3: Latency between probe failure and EndpointSlice update
- Panel 4: Correlate with DB latency and DB connection pool usage
Highload Best Practices
- Separate Liveness and Readiness endpoints — Liveness checks “is the application alive”, Readiness checks “can it serve traffic”
- Do not check external dependencies directly — use Circuit Breaker and return Ready if the application can function in degraded mode
- Use startupProbe for heavy JVM applications — to avoid setting a huge `initialDelaySeconds` on Readiness
- Set `failureThreshold: 3` and `periodSeconds: 5` — a balance between reaction speed and noise resistance
- Monitor EndpointSlice sizes — with thousands of Pods, kube-proxy iptables rules become a bottleneck; switch to IPVS mode
- Graceful shutdown + preStop hook — give the application time to complete active requests before removal from Endpoints:

```yaml
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
```
Interview Cheat Sheet
Must know:
- Readiness Probe checks “is the Pod ready for traffic”; on failure — removal from Service Endpoints
- Failed readiness: Pod removed from load balancing but NOT restarted (difference from liveness)
- Updates EndpointSlice → kube-proxy updates iptables/IPVS → traffic switched (2-5 sec)
- Do not check external dependencies directly — risk of cascading failure
- Readiness Gates (K8s 1.14+) — external conditions for readiness (Service Mesh)
- For Rolling Update: K8s doesn’t remove old Pods until new ones pass Readiness
- Spring Boot: `/actuator/health/readiness` — standard endpoint
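For reference, Spring Boot exposes the liveness/readiness health groups via an Actuator property — a minimal sketch (auto-enabled when the app detects it is running on Kubernetes; explicit elsewhere):

```yaml
# application.yaml — exposes /actuator/health/liveness and /actuator/health/readiness
management:
  endpoint:
    health:
      probes:
        enabled: true
```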
Common follow-up questions:
- “What happens if readiness fails?” — Pod removed from Endpoints, but continues running
- “Why not check DB during cascading failure?” — All Pods disconnect simultaneously → complete 503
- “Readiness for a stateless application?” — May not be needed if ready instantly
- “Headless Service + Readiness?” — Affects DNS resolution; cached DNS may be stale
Red flags (DO NOT say):
- “Readiness restarts the Pod on failure” (no, only removes from load balancing)
- “Readiness = Liveness” (different actions on failure)
- “I check DB in readiness for all services” (risk of cascading failure)
- “Readiness is not needed — Pod Running means ready” (Running != ready)
Related topics:
- [[What is liveness probe]] — liveness check
- [[Why are health checks needed]] — all three probes together
- [[How to organize rolling update in Kubernetes]] — readiness blocks the update