Question 18 · Section 14

What is a Readiness Probe?

Readiness probe -- K8s checks whether the Pod is ready to accept traffic. If the probe fails -- the Pod is removed from Service endpoints, but is NOT restarted.


Junior Level

Simple Definition

Readiness Probe is a check in Kubernetes that determines whether a Pod is ready to accept incoming traffic. If the probe fails — Kubernetes removes the Pod from the load balancer, but does not restart it.


Analogy

Imagine a store. The doors are open (Pod is running), but the cashier hasn’t been trained yet, registers aren’t connected. The manager (Readiness Probe) says: “Don’t accept customers yet.” Once everything is ready — the doors open for shoppers.

YAML Example

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: my-app:1.0
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3

kubectl Example

# Check Pod readiness status
kubectl get pods
# READY column: 1/1 means ready, 0/1 means not ready (STATUS stays Running either way)

kubectl describe pod my-app | grep -A 10 Readiness

When to Use

  • Application needs warmup time (cache loading, DB initialization)
  • During deployment, so traffic only goes to ready Pods
  • If the application can temporarily “opt out” under overload

Middle Level

How it Works

Readiness Probe is executed by the kubelet on each node. The kubelet polls the specified endpoint at periodSeconds intervals. On a successful response, the Pod gets Ready: True status and its IP is added to the Endpoints (or EndpointSlice) of all services that select this Pod. On failure — the IP is removed from Endpoints.

Types of checks:

  • httpGet — HTTP GET request, expects 2xx/3xx
  • tcpSocket — checks that the TCP port is open
  • exec — executes a command inside the container
  • grpc — native gRPC health check (beta in Kubernetes 1.24, stable since 1.27)
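Each check type maps to a field under readinessProbe. A minimal sketch of the three non-HTTP forms (ports and the command are illustrative):

```yaml
# tcpSocket: succeeds if the TCP port accepts a connection
readinessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 5
---
# exec: succeeds if the command exits with code 0
readinessProbe:
  exec:
    command: ["cat", "/tmp/ready"]
  periodSeconds: 5
---
# grpc: calls the standard grpc.health.v1.Health/Check service
readinessProbe:
  grpc:
    port: 9090
  periodSeconds: 5
```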

Practical Scenarios

Scenario 1: Spring Boot with Actuator

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3

Scenario 2: Temporary opt-out under overload

If the task queue exceeds the limit, the application can return 503 on /health/ready to temporarily remove itself from load balancing and reduce load.

Scenario 3: Rolling Update

Kubernetes will not start removing old Pods until new ones pass the Readiness Probe. This ensures Zero Downtime.
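The readiness gating of a rolling update is controlled by the Deployment strategy. A sketch (replica count, labels, and image tag are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 extra Pod during the update
      maxUnavailable: 0  # old Pods are removed only after new ones pass Readiness
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.1
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            periodSeconds: 5
```

With maxUnavailable: 0, capacity never drops below the desired replica count: a new Pod must turn Ready before an old one is terminated.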

Common Mistakes Table

| Mistake | Consequence | Solution |
| --- | --- | --- |
| Checking the DB in Readiness | During a cascading failure, all Pods disconnect simultaneously, 503 for everyone | Return Ready if the application can work partially (cache, fallback) |
| Too frequent periodSeconds (1 sec) | Unnecessary load on the application | Set 3-10 seconds |
| Missing readinessProbe | Traffic goes to not-yet-ready Pods, 502/503 errors | Always add it for applications that serve traffic |
| Checking only the TCP port | Application listens on the port but doesn't process requests | Use httpGet with a real health endpoint |

Comparison: Liveness vs Readiness

| Characteristic | Liveness Probe | Readiness Probe |
| --- | --- | --- |
| Action on failure | Container restart | Removal from Service Endpoints |
| Purpose | Detect hangs, deadlocks | Ensure traffic goes only to ready Pods |
| Dependency checking | Not recommended | Recommended (DB, external APIs) |
| Impact on Rolling Update | No effect | Blocks update until ready |
| startupProbe interaction | Disabled until startupProbe succeeds | Disabled until startupProbe succeeds |

Failed readiness: Pod is removed from load balancing but continues running. Failed liveness: Pod is restarted. This is a critical distinction.
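The distinction is easiest to see with both probes on one container. A sketch assuming the application exposes separate /health/live and /health/ready endpoints:

```yaml
livenessProbe:
  httpGet:
    path: /health/live    # "is the process alive?": restart on failure
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready   # "can it serve traffic?": removed from Endpoints on failure
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```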

When NOT to Use

  • Stateless applications without warmup: If the application is ready instantly after startup, Readiness Probe only adds delay
  • Background workers without incoming traffic: If the Pod doesn’t serve HTTP requests, Readiness makes no sense (though it can be useful for monitoring)
  • Single-replica Deployments: if the only Pod fails Readiness (or is replaced during a rolling update with maxUnavailable=1), the service becomes completely unavailable

Startup probe – a separate probe for slow applications. While startup probe hasn’t succeeded – liveness and readiness are not started. For Java applications with long startup.
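For a slow-starting JVM application, a startupProbe gives a generous startup budget (failureThreshold × periodSeconds) while keeping the regular probes fast afterward. Endpoint and numbers below are illustrative:

```yaml
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  failureThreshold: 30   # 30 * 10s = up to 5 minutes to start
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5       # runs only after startupProbe succeeds
```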


Senior Level

Deep Mechanics: kubelet and EndpointSlice

Readiness Probe is executed by the kubelet in a dedicated goroutine per probe. The kubelet's probeManager runs a worker for each configured probe, caches results in a results manager, and reflects them in the reported container status; exec probes are executed inside the container via the CRI (Container Runtime Interface).

When the kubelet detects a Readiness status change, it updates PodStatus in the API Server. Then the following chain is triggered:

  1. EndpointSlice Controller (in kube-controller-manager) subscribes to Pod changes
  2. On Ready condition change, the controller recalculates the EndpointSlice
  3. kube-proxy (via iptables/IPVS) updates load balancing rules
  4. Traffic stops (or starts) being directed to the Pod

Latency chain: kubelet detect (~periodSeconds) → API Server write (~10-50ms) → EndpointSlice reconcile (~500ms) → kube-proxy sync (~1s) → traffic switched. Total: 2-5 seconds from probe failure to actual traffic cessation.

Trade-offs

| Aspect | Trade-off |
| --- | --- |
| Check frequency | Frequent probes = faster reaction, but higher load on application and API Server |
| Dependency checking | Checking DB = honest status, but risk of cascading failure |
| failureThreshold | Low = fast removal, but sensitive to network fluctuations. High = more stable, but slower reaction |
| httpGet vs tcpSocket | httpGet is more accurate but more expensive. tcpSocket is faster but doesn't check business logic |

Edge Cases (6+)

Edge Case 1: Partial Readiness

The application has 10 endpoints. 9 work, 1 depends on a failed DB. A Readiness Probe on one endpoint either “kills” the whole Pod (if it checks the DB) or “lies” (if it doesn’t). Solution: a well-designed health endpoint strategy with gradation.

Edge Case 2: Pod Termination During Node Draining

On kubectl drain, the node is cordoned and Pods receive SIGTERM. Once a Pod enters Terminating, the endpoints controller marks it not ready regardless of probe results, although the kubelet keeps running probes during graceful shutdown. If shutdown takes longer than terminationGracePeriodSeconds, the kubelet sends SIGKILL.

Edge Case 3: Pod in Pending with Readiness

A Pod in Pending state (resources not allocated) will never start executing the Readiness Probe — the kubelet doesn’t start probes until the container transitions to Running.

Edge Case 4: Race Condition During HPA Scaling

HPA scales the Deployment, new Pods are created. If the Readiness Probe has a large initialDelaySeconds, HPA may decide there aren’t enough replicas and create more. This leads to over-provisioning. Solution: adequate timeouts + behavior.stabilizationWindowSeconds in HPA.
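The stabilization window lives under behavior in an autoscaling/v2 HPA. A sketch with illustrative names and numbers:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 4
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # wait before acting again, lets slow Pods turn Ready
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```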

Edge Case 5: Readiness Gate

Kubernetes 1.14+ supports Readiness Gates — external conditions that must be True for PodReady. For example, a Service Mesh controller may set a readiness gate only after registering the Pod in the mesh. This adds an external dependency to the standard kubelet check.
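A readiness gate is declared in the Pod spec. The conditionType below is hypothetical, modeled on what service-mesh or load-balancer controllers typically set:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  readinessGates:
    - conditionType: "mesh.example.com/registered"  # hypothetical; an external controller sets it to True
  containers:
    - name: app
      image: my-app:1.0
```

Until the external controller patches this condition to True, the Pod stays NotReady even if its readinessProbe passes.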

Edge Case 6: Headless Service and Readiness

For Headless Services (clusterIP: None), a not-ready Pod is dropped from DNS resolution (unless publishNotReadyAddresses: true is set). Clients caching DNS may continue sending traffic to the unavailable Pod until the TTL expires.

Performance Numbers

| Metric | Value |
| --- | --- |
| kubelet probe overhead | ~1-5ms CPU per httpGet probe |
| API Server update latency | 10-100ms on PodStatus update |
| EndpointSlice propagation | 500ms-2s to kube-proxy sync |
| Full traffic removal | 2-5 seconds from probe failure to traffic stop |
| kube-proxy iptables sync | ~1s for 1000 Services, ~5s for 5000 Services |
| kube-proxy IPVS sync | ~100ms for 1000 Services |

Security

  • Readiness Probe endpoint must not be public — it should only be accessible from within the cluster
  • If the probe uses httpGet, make sure it doesn’t expose internal information (versions, configuration)
  • Do not use exec probes with privileged commands — this is a potential escalation vector
  • In multi-tenant clusters: NetworkPolicy should restrict access to health endpoints only from kubelet
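One way to keep the health endpoint cluster-internal is a NetworkPolicy restricting ingress to the application port. Note that kubelet probe traffic originates from the node itself and with most CNIs is not subject to NetworkPolicy, so the probes keep working. Names and labels below are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-app-ingress
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx  # allow only the ingress controller namespace
      ports:
        - protocol: TCP
          port: 8080
```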

Production War Story

Situation: Large e-commerce cluster, Black Friday. All Pods had a Readiness Probe checking PostgreSQL. During the peak, the DB started slowing down due to write lock contention. Consecutive probe timeouts exceeded failureThreshold, and all 200 Pods were simultaneously removed from traffic.

Result: Complete 503 for 90 seconds until the DB stabilized and Pods returned to Endpoints. Loss of ~$500K in revenue.

Post-mortem and fix:

  1. Readiness Probe changed to check only internal state (thread pool, memory), without DB dependency
  2. Circuit Breaker added for DB-dependent operations
  3. Separate /health/ready without DB and /health/degraded with DB — when DB has issues, the application serves data from cache
  4. Prometheus alert configured on kube_endpoint_address_available < expected
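The alert from item 4 can be expressed as a PrometheusRule, assuming the Prometheus Operator and kube-state-metrics are installed; the endpoint name and threshold are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: readiness-endpoints
spec:
  groups:
    - name: readiness
      rules:
        - alert: TooFewReadyEndpoints
          # fires when fewer ready addresses than the expected replica count (here 3)
          expr: kube_endpoint_address_available{endpoint="my-app"} < 3
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "my-app has fewer ready endpoints than expected"
```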

Monitoring (Prometheus/Grafana)

Key metrics:

# Pods not ready for traffic
sum(kube_pod_status_ready{condition="false"}) by (namespace)

# Readiness Probe failure rate (kubelet metric; probe_type label is capitalized)
sum(rate(prober_probe_total{probe_type="Readiness", result="failed"}[5m])) by (namespace, pod)

# Probe latency, p95 (available in recent kubelet versions)
histogram_quantile(0.95, sum(rate(prober_probe_duration_seconds_bucket{probe_type="Readiness"}[5m])) by (le))

# EndpointSlice without addresses
kube_endpoint_address_available{condition="not_ready"}

Grafana Dashboard:

  • Panel 1: Number of Pods in Ready/Not Ready state (by namespace)
  • Panel 2: Readiness Probe success rate over time
  • Panel 3: Latency between probe failure and EndpointSlice update
  • Panel 4: Correlate with DB latency and DB connection pool usage

Highload Best Practices

  1. Separate Liveness and Readiness endpoints — Liveness checks “is the application alive”, Readiness checks “can it serve traffic”
  2. Do not check external dependencies directly — use Circuit Breaker and return Ready if the application can function in degraded mode
  3. Use startupProbe for heavy JVM applications — to avoid setting a huge initialDelaySeconds on Readiness
  4. Set failureThreshold: 3 and periodSeconds: 5 — balance between reaction speed and noise resistance
  5. Monitor EndpointSlice sizes — with thousands of Pods, kube-proxy iptables rules become a bottleneck, switch to IPVS mode
  6. Graceful shutdown + preStop hook — give the application time to complete active requests before removing from Endpoints:
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 10"]
    

Interview Cheat Sheet

Must know:

  • Readiness Probe checks “is the Pod ready for traffic”; on failure — removal from Service Endpoints
  • Failed readiness: Pod removed from load balancing but NOT restarted (difference from liveness)
  • Updates EndpointSlice → kube-proxy updates iptables/IPVS → traffic switched (2-5 sec)
  • Do not check external dependencies directly — risk of cascading failure
  • Readiness Gates (K8s 1.14+) — external conditions for readiness (Service Mesh)
  • For Rolling Update: K8s doesn’t remove old Pods until new ones pass Readiness
  • Spring Boot: /actuator/health/readiness — standard endpoint

Common follow-up questions:

  • “What happens if readiness fails?” — Pod removed from Endpoints, but continues running
  • “Why not check DB during cascading failure?” — All Pods disconnect simultaneously → complete 503
  • “Readiness for a stateless application?” — May not be needed if ready instantly
  • “Headless Service + Readiness?” — Affects DNS resolution; cached DNS may be stale

Red flags (DO NOT say):

  • “Readiness restarts the Pod on failure” (no, only removes from load balancing)
  • “Readiness = Liveness” (different actions on failure)
  • “I check DB in readiness for all services” (risk of cascading failure)
  • “Readiness is not needed — Pod Running means ready” (Running != ready)

Related topics:

  • [[What is liveness probe]] — liveness check
  • [[Why are health checks needed]] — all three probes together
  • [[How to organize rolling update in Kubernetes]] — readiness blocks the update