Question 18 · Section 14

What is a Readiness Probe?

Readiness probe -- K8s checks whether the Pod is ready to accept traffic. If the probe fails -- the Pod is removed from Service endpoints, but is NOT restarted.


Junior Level

Simple Definition

Readiness Probe is a check in Kubernetes that determines whether a Pod is ready to accept incoming traffic. If the probe fails — Kubernetes removes the Pod from the load balancer, but does not restart it.


Analogy

Imagine a store. The doors are open (Pod is running), but the cashier hasn’t been trained yet, registers aren’t connected. The manager (Readiness Probe) says: “Don’t accept customers yet.” Once everything is ready — the doors open for shoppers.

YAML Example

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: my-app:1.0
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3

kubectl Example

# Check Pod readiness status
kubectl get pods
# READY column: 1/1 means ready, 0/1 means not ready (STATUS stays Running either way)

kubectl describe pod my-app | grep -A 10 Readiness

When to Use

  • Application needs warmup time (cache loading, DB initialization)
  • During deployment, so traffic only goes to ready Pods
  • If the application can temporarily “opt out” under overload

Middle Level

How it Works

Readiness Probe is executed by the kubelet on each node. The kubelet polls the specified endpoint at periodSeconds intervals. On a successful response, the Pod gets Ready: True status and its IP is added to the Endpoints (or EndpointSlice) of all services that select this Pod. On failure — the IP is removed from Endpoints.

Types of checks:

  • httpGet — HTTP GET request, expects 2xx/3xx
  • tcpSocket — checks that the TCP port is open
  • exec — executes a command inside the container
  • grpc — native gRPC health check (beta in Kubernetes 1.24, stable since 1.27)
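Each check type maps to a field under readinessProbe. A minimal sketch of the three non-HTTP forms (ports and the command are illustrative):

```yaml
# tcpSocket: succeeds if the TCP port accepts a connection
readinessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 5
---
# exec: succeeds if the command exits with code 0
readinessProbe:
  exec:
    command: ["cat", "/tmp/ready"]
  periodSeconds: 5
---
# grpc: calls the standard grpc.health.v1.Health/Check service
readinessProbe:
  grpc:
    port: 9090
  periodSeconds: 5
```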

Practical Scenarios

Scenario 1: Spring Boot with Actuator

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3

Scenario 2: Temporary opt-out under overload

If the task queue exceeds the limit, the application can return 503 on /health/ready to temporarily remove itself from load balancing and reduce load.

Scenario 3: Rolling Update

Kubernetes will not start removing old Pods until new ones pass the Readiness Probe. This ensures Zero Downtime.
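The readiness gating of a rolling update is controlled by the Deployment strategy. A sketch (replica count, labels, and image tag are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 extra Pod during the update
      maxUnavailable: 0  # old Pods are removed only after new ones pass Readiness
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.1
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            periodSeconds: 5
```

With maxUnavailable: 0, capacity never drops below the desired replica count: a new Pod must turn Ready before an old one is terminated.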

Common Mistakes Table

| Mistake | Consequence | Solution |
| --- | --- | --- |
| Checking the DB in Readiness | During a cascading failure, all Pods disconnect simultaneously, 503 for everyone | Return Ready if the application can work partially (cache, fallback) |
| Too frequent periodSeconds (1 sec) | Unnecessary load on the application | Set 3-10 seconds |
| Missing readinessProbe | Traffic goes to not-yet-ready Pods, 502/503 errors | Always add it for applications that serve traffic |
| Checking only the TCP port | Application listens on the port but doesn't process requests | Use httpGet with a real health endpoint |

Comparison: Liveness vs Readiness

| Characteristic | Liveness Probe | Readiness Probe |
| --- | --- | --- |
| Action on failure | Container restart | Removal from Service Endpoints |
| Purpose | Detect hangs, deadlocks | Ensure traffic goes only to ready Pods |
| Dependency checking | Not recommended | Recommended (DB, external APIs) |
| Impact on Rolling Update | No effect | Blocks update until ready |
| startupProbe interaction | Disabled until startupProbe succeeds | Disabled until startupProbe succeeds |

Failed readiness: Pod is removed from load balancing but continues running. Failed liveness: Pod is restarted. This is a critical distinction.
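The distinction is easiest to see with both probes on one container. A sketch assuming the application exposes separate /health/live and /health/ready endpoints:

```yaml
livenessProbe:
  httpGet:
    path: /health/live    # "is the process alive?": restart on failure
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready   # "can it serve traffic?": removed from Endpoints on failure
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```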

When NOT to Use

  • Stateless applications without warmup: If the application is ready instantly after startup, Readiness Probe only adds delay
  • Background workers without incoming traffic: If the Pod doesn’t serve HTTP requests, Readiness makes no sense (though it can be useful for monitoring)
  • Single-replica Deployments: if the only Pod fails Readiness (or is replaced during a rolling update with maxUnavailable=1), the service becomes completely unavailable

Startup probe – a separate probe for slow applications. While startup probe hasn’t succeeded – liveness and readiness are not started. For Java applications with long startup.
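For a slow-starting JVM application, a startupProbe gives a generous startup budget (failureThreshold × periodSeconds) while keeping the regular probes fast afterward. Endpoint and numbers below are illustrative:

```yaml
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  failureThreshold: 30   # 30 * 10s = up to 5 minutes to start
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5       # runs only after startupProbe succeeds
```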


Senior Level

Deep Mechanics: kubelet and EndpointSlice

Readiness Probe is executed by the kubelet in a dedicated goroutine per probe. The kubelet's probeManager runs a worker for each configured probe, caches results in a results manager, and reflects them in the reported container status; exec probes are executed inside the container via the CRI (Container Runtime Interface).

When the kubelet detects a Readiness status change, it updates PodStatus in the API Server. Then the following chain is triggered:

  1. EndpointSlice Controller (in kube-controller-manager) subscribes to Pod changes
  2. On Ready condition change, the controller recalculates the EndpointSlice
  3. kube-proxy (via iptables/IPVS) updates load balancing rules
  4. Traffic stops (or starts) being directed to the Pod

Latency chain: kubelet detect (~periodSeconds) → API Server write (~10-50ms) → EndpointSlice reconcile (~500ms) → kube-proxy sync (~1s) → traffic switched. Total: 2-5 seconds from probe failure to actual traffic cessation.

Trade-offs

| Aspect | Trade-off |
| --- | --- |
| Check frequency | Frequent probes = faster reaction, but higher load on application and API Server |
| Dependency checking | Checking DB = honest status, but risk of cascading failure |
| failureThreshold | Low = fast removal, but sensitive to network fluctuations. High = more stable, but slower reaction |
| httpGet vs tcpSocket | httpGet is more accurate but more expensive. tcpSocket is faster but doesn't check business logic |

Edge Cases (6+)

Edge Case 1: Partial Readiness

The application has 10 endpoints. 9 work, 1 depends on a failed DB. A Readiness Probe on one endpoint either “kills” the whole Pod (if it checks the DB) or “lies” (if it doesn’t). Solution: a well-designed health endpoint strategy with gradation.

Edge Case 2: Pod Termination During Node Draining

On kubectl drain, the node is cordoned and Pods receive SIGTERM. Once a Pod enters Terminating, the endpoints controller marks it not ready regardless of probe results, although the kubelet keeps running probes during graceful shutdown. If shutdown takes longer than terminationGracePeriodSeconds, the kubelet sends SIGKILL.

Edge Case 3: Pod in Pending with Readiness

A Pod in Pending state (resources not allocated) will never start executing the Readiness Probe — the kubelet doesn’t start probes until the container transitions to Running.

Edge Case 4: Race Condition During HPA Scaling

HPA scales the Deployment, new Pods are created. If the Readiness Probe has a large initialDelaySeconds, HPA may decide there aren’t enough replicas and create more. This leads to over-provisioning. Solution: adequate timeouts + behavior.stabilizationWindowSeconds in HPA.
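The stabilization window lives under behavior in an autoscaling/v2 HPA. A sketch with illustrative names and numbers:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 4
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # wait before acting again, lets slow Pods turn Ready
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```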

Edge Case 5: Readiness Gate

Kubernetes 1.14+ supports Readiness Gates — external conditions that must be True for PodReady. For example, a Service Mesh controller may set a readiness gate only after registering the Pod in the mesh. This adds an external dependency to the standard kubelet check.
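A readiness gate is declared in the Pod spec. The conditionType below is hypothetical, modeled on what service-mesh or load-balancer controllers typically set:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  readinessGates:
    - conditionType: "mesh.example.com/registered"  # hypothetical; an external controller sets it to True
  containers:
    - name: app
      image: my-app:1.0
```

Until the external controller patches this condition to True, the Pod stays NotReady even if its readinessProbe passes.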

Edge Case 6: Headless Service and Readiness

For Headless Services (clusterIP: None), a not-ready Pod is dropped from DNS resolution (unless publishNotReadyAddresses: true is set). Clients caching DNS may continue sending traffic to the unavailable Pod until the TTL expires.

Performance Numbers

| Metric | Value |
| --- | --- |
| kubelet probe overhead | ~1-5ms CPU per httpGet probe |
| API Server update latency | 10-100ms on PodStatus update |
| EndpointSlice propagation | 500ms-2s to kube-proxy sync |
| Full traffic removal | 2-5 seconds from probe failure to traffic stop |
| kube-proxy iptables sync | ~1s for 1000 Services, ~5s for 5000 Services |
| kube-proxy IPVS sync | ~100ms for 1000 Services |

Security

  • Readiness Probe endpoint must not be public — it should only be accessible from within the cluster
  • If the probe uses httpGet, make sure it doesn’t expose internal information (versions, configuration)
  • Do not use exec probes with privileged commands — this is a potential escalation vector
  • In multi-tenant clusters: NetworkPolicy should restrict access to health endpoints only from kubelet
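One way to keep the health endpoint cluster-internal is a NetworkPolicy restricting ingress to the application port. Note that kubelet probe traffic originates from the node itself and with most CNIs is not subject to NetworkPolicy, so the probes keep working. Names and labels below are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-app-ingress
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx  # allow only the ingress controller namespace
      ports:
        - protocol: TCP
          port: 8080
```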

Production War Story

Situation: Large e-commerce cluster, Black Friday. All Pods had a Readiness Probe checking PostgreSQL. During the peak, the DB started slowing down due to write lock contention. Consecutive probe timeouts exceeded failureThreshold, and all 200 Pods were simultaneously removed from traffic.

Result: Complete 503 for 90 seconds until the DB stabilized and Pods returned to Endpoints. Loss of ~$500K in revenue.

Post-mortem and fix:

  1. Readiness Probe changed to check only internal state (thread pool, memory), without DB dependency
  2. Circuit Breaker added for DB-dependent operations
  3. Separate /health/ready without DB and /health/degraded with DB — when DB has issues, the application serves data from cache
  4. Prometheus alert configured on kube_endpoint_address_available < expected
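The alert from item 4 can be expressed as a PrometheusRule, assuming the Prometheus Operator and kube-state-metrics are installed; the endpoint name and threshold are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: readiness-endpoints
spec:
  groups:
    - name: readiness
      rules:
        - alert: TooFewReadyEndpoints
          # fires when fewer ready addresses than the expected replica count (here 3)
          expr: kube_endpoint_address_available{endpoint="my-app"} < 3
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "my-app has fewer ready endpoints than expected"
```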

Monitoring (Prometheus/Grafana)

Key metrics:

# Pods not ready for traffic
sum(kube_pod_status_ready{condition="false"}) by (namespace)

# Readiness Probe failure rate (kubelet metric; probe_type label is capitalized)
sum(rate(prober_probe_total{probe_type="Readiness", result="failed"}[5m])) by (namespace, pod)

# Probe latency, p95 (available in recent kubelet versions)
histogram_quantile(0.95, sum(rate(prober_probe_duration_seconds_bucket{probe_type="Readiness"}[5m])) by (le))

# EndpointSlice without addresses
kube_endpoint_address_available{condition="not_ready"}

Grafana Dashboard:

  • Panel 1: Number of Pods in Ready/Not Ready state (by namespace)
  • Panel 2: Readiness Probe success rate over time
  • Panel 3: Latency between probe failure and EndpointSlice update
  • Panel 4: Correlate with DB latency and DB connection pool usage

Highload Best Practices

  1. Separate Liveness and Readiness endpoints — Liveness checks “is the application alive”, Readiness checks “can it serve traffic”
  2. Do not check external dependencies directly — use Circuit Breaker and return Ready if the application can function in degraded mode
  3. Use startupProbe for heavy JVM applications — to avoid setting a huge initialDelaySeconds on Readiness
  4. Set failureThreshold: 3 and periodSeconds: 5 — balance between reaction speed and noise resistance
  5. Monitor EndpointSlice sizes — with thousands of Pods, kube-proxy iptables rules become a bottleneck, switch to IPVS mode
  6. Graceful shutdown + preStop hook — give the application time to complete active requests before removing from Endpoints:
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 10"]
    

Interview Cheat Sheet

Must know:

  • Readiness Probe checks “is the Pod ready for traffic”; on failure — removal from Service Endpoints
  • Failed readiness: Pod removed from load balancing but NOT restarted (difference from liveness)
  • Updates EndpointSlice → kube-proxy updates iptables/IPVS → traffic switched (2-5 sec)
  • Do not check external dependencies directly — risk of cascading failure
  • Readiness Gates (K8s 1.14+) — external conditions for readiness (Service Mesh)
  • For Rolling Update: K8s doesn’t remove old Pods until new ones pass Readiness
  • Spring Boot: /actuator/health/readiness — standard endpoint

Common follow-up questions:

  • “What happens if readiness fails?” — Pod removed from Endpoints, but continues running
  • “Why not check DB during cascading failure?” — All Pods disconnect simultaneously → complete 503
  • “Readiness for a stateless application?” — May not be needed if ready instantly
  • “Headless Service + Readiness?” — Affects DNS resolution; cached DNS may be stale

Red flags (DO NOT say):

  • “Readiness restarts the Pod on failure” (no, only removes from load balancing)
  • “Readiness = Liveness” (different actions on failure)
  • “I check DB in readiness for all services” (risk of cascading failure)
  • “Readiness is not needed — Pod Running means ready” (Running != ready)

Related topics:

  • [[What is liveness probe]] — liveness check
  • [[Why are health checks needed]] — all three probes together
  • [[How to organize rolling update in Kubernetes]] — readiness blocks the update