Health Checks Done Right: Liveness vs Readiness vs Deep Checks

Learn the difference between liveness, readiness, and deep health checks — and how to implement each one correctly so your monitoring actually catches real problems.

Why Most `/health` Endpoints Lie

A health endpoint that always returns 200 OK is worse than no health endpoint at all. It gives you false confidence while real failures hide behind it. The fix isn't a smarter alerting rule — it's designing your health checks with intent from the start.

There are three distinct check types worth understanding: liveness, readiness, and deep checks. Each answers a different question and belongs in a different part of your stack.

Liveness Checks

Question answered: Is this process alive and not deadlocked?

A liveness check is deliberately shallow. It should return 200 if the process is running and able to handle the HTTP request. Nothing more.

GET /healthz/live → 200 OK

Kubernetes uses liveness probes to decide whether to restart a container. If your liveness check queries the database, a database outage will cause cascading container restarts across your entire fleet — a self-inflicted outage on top of an already bad situation.

A liveness check should:

Respond within a tight timeout (50–100 ms)
Check only in-process state (e.g., a flag set during graceful shutdown)
Never touch external dependencies
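As a rough illustration of "in-process state only", here is a minimal sketch using Python's standard library. The `Heartbeat` class, the `HEARTBEAT` instance, and the 30-second staleness threshold are assumptions made for this example: the idea is that a worker loop calls `beat()` on every iteration, so a stale heartbeat suggests a deadlocked process.

```python
import time
from http.server import BaseHTTPRequestHandler
from typing import Optional


class Heartbeat:
    """In-process marker that a worker loop refreshes on every iteration."""

    def __init__(self, max_age_s: float = 30.0) -> None:
        self.max_age_s = max_age_s
        self._last_beat = time.monotonic()

    def beat(self) -> None:
        self._last_beat = time.monotonic()

    def is_fresh(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        return (now - self._last_beat) <= self.max_age_s


def liveness_status(heartbeat: Heartbeat, now: Optional[float] = None) -> int:
    """Map in-process state to an HTTP status; no external dependency is touched."""
    return 200 if heartbeat.is_fresh(now) else 503


# Hypothetical module-level instance shared with the worker loop.
HEARTBEAT = Heartbeat()


class LivenessHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status = liveness_status(HEARTBEAT) if self.path == "/healthz/live" else 404
        self.send_response(status)
        self.end_headers()
```

The handler can be served with `http.server.HTTPServer(("", 8080), LivenessHandler).serve_forever()`; the port is arbitrary. Nothing in the request path opens a socket or queries a database, which is the property that matters here.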

Readiness Checks

Question answered: Is this instance ready to serve production traffic?

Readiness sits one level deeper. A pod or instance can be alive but not yet ready — it might still be warming up a cache, waiting for a database migration to finish, or re-establishing a connection pool after a restart.

GET /healthz/ready → 200 OK | 503 Service Unavailable

Kubernetes removes unready pods from Service endpoints, so traffic stops hitting them without killing them. The same pattern applies outside Kubernetes: your load balancer's health check should point at your readiness endpoint, not your liveness endpoint.

Readiness checks typically verify:

Database connection pool has at least one healthy connection
Required caches are populated (if startup depends on them)
Any dependent services needed at request time are reachable
The instance is not in a draining/shutdown state

Keep response time under 200–300 ms. If a dependency check takes longer, time it out and return 503 rather than blocking indefinitely.
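One way to enforce that time budget is to run the dependency checks concurrently under a shared deadline. The sketch below is illustrative only: the `readiness` function, the check names, and the 250 ms default are assumptions, and each check is a hypothetical zero-argument callable that returns `True` when its dependency is usable.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple


def readiness(
    checks: Dict[str, Callable[[], bool]], timeout_s: float = 0.25
) -> Tuple[int, Dict[str, str]]:
    """Run all checks in parallel; anything slow or failing yields a 503."""
    results: Dict[str, str] = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        deadline = time.monotonic() + timeout_s
        for name, future in futures.items():
            # All checks share one deadline, so total wait stays near timeout_s.
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[name] = "ok" if future.result(timeout=remaining) else "failed"
            except FutureTimeout:
                results[name] = "timeout"
            except Exception:
                results[name] = "failed"
    finally:
        # Do not block the response on checks that are still running.
        pool.shutdown(wait=False)
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results
```

A timed-out check keeps running in its worker thread after the response is sent, so each check should also carry its own client-side timeout in a real service.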

Deep Checks

Question answered: Is the system actually working end-to-end?

Deep checks (sometimes called synthetic checks or diagnostic endpoints) go further: they exercise real code paths and real dependencies. Think of them as a lightweight smoke test running on a schedule.

GET /healthz/deep → 200 OK with JSON body

A useful response body:

{
  "status": "ok",
  "checks": {
    "postgres": {"status": "ok", "latency_ms": 4},
    "redis": {"status": "ok", "latency_ms": 1},
    "s3": {"status": "degraded", "latency_ms": 812}
  }
}

Return 200 if critical dependencies pass, even if non-critical ones are degraded. Use 503 only when the service genuinely cannot function.
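The status rule above can be sketched as a small function. The `summarize` name, the `critical` set, and the choice to treat a missing or non-"ok" critical check as failing are assumptions for illustration; the check names mirror the example body.

```python
import json
from typing import Dict, Set, Tuple


def summarize(checks: Dict[str, dict], critical: Set[str]) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for a deep check response.

    Assumption: a critical dependency passes only when its status is "ok".
    Non-critical dependencies never change the HTTP status.
    """
    critical_ok = all(
        checks.get(name, {}).get("status") == "ok" for name in critical
    )
    http_status = 200 if critical_ok else 503
    body = {"status": "ok" if critical_ok else "failed", "checks": checks}
    return http_status, json.dumps(body)
```

With `critical={"postgres", "redis"}`, the example body above (degraded `s3`) still yields 200, while a failed `postgres` check yields 503.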

What to include in a deep check

Database round-trip: run a cheap query such as SELECT 1 through the connection pool, not just a connection-level ping
Cache connectivity: SET and GET a test key with a known TTL
External API reachability: HEAD request or equivalent, with a short timeout
Queue depth: flag if a queue is backed up beyond an acceptable threshold
Disk / memory headroom: optional, but useful on stateful nodes
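Each item in the list above can be wrapped in the same timing helper so that every entry in the response carries a status and a latency. This is a sketch under assumptions: `run_check`, the `probe` callable, and the 500 ms "degraded" threshold are hypothetical, and the probe is expected to raise an exception on failure.

```python
import time
from typing import Callable, Dict


def run_check(
    probe: Callable[[], None], degraded_after_ms: float = 500.0
) -> Dict[str, object]:
    """Time one dependency probe and classify the outcome."""
    start = time.perf_counter()
    try:
        probe()
    except Exception as exc:
        # Report only the exception type to avoid leaking connection details.
        return {"status": "failed", "error": type(exc).__name__}
    latency_ms = round((time.perf_counter() - start) * 1000)
    status = "degraded" if latency_ms > degraded_after_ms else "ok"
    return {"status": status, "latency_ms": latency_ms}
```

A database probe, for instance, would be a function that executes the cheap query and returns nothing; the helper does not care which dependency it is timing.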

Do not expose this endpoint publicly without authentication. It leaks your infrastructure topology.
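A simple way to gate the endpoint is a shared bearer token compared in constant time. The header format, the `is_authorized` name, and the idea of a single static token are assumptions for this sketch; a real deployment might rely on mTLS or network policy instead.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Accept only 'Bearer <token>' where the token matches the expected one."""
    if not authorization_header or not expected_token:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(token.encode(), expected_token.encode())
```

The deep check handler would call this before running any checks and return 401 when it is false.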

Wiring It Into External Monitoring

Internal probes (Kubernetes, load balancers) catch local failures. They don't catch DNS misconfiguration, BGP route leaks, CDN issues, or region-wide cloud problems that make your service unreachable from the outside world.

That's where external uptime monitoring earns its keep. Point a monitor at your readiness endpoint, or at your deep check endpoint using authenticated requests, from multiple geographic regions. If requests from Frankfurt succeed but requests from Tokyo time out, you likely have a routing problem rather than an application problem, and you know it within a minute rather than when a customer tweets at you.
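The regional reasoning can be made explicit in a few lines. The `classify` function and its labels are hypothetical, and the mapping from failure pattern to cause is a heuristic for triage, not a diagnosis.

```python
from typing import Dict


def classify(results: Dict[str, bool]) -> str:
    """Interpret per-region probe outcomes (True means the check succeeded)."""
    if not results:
        return "unknown"
    failed = [region for region, ok in results.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) == len(results):
        # Every vantage point fails: suspect the application or a global outage.
        return "global"
    # Only some vantage points fail: suspect routing, DNS, or a regional issue.
    return "regional"
```

For example, `classify({"frankfurt": True, "tokyo": False})` returns `"regional"`.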

A few practical notes:

Monitor your deep check endpoint, not just your homepage
Set your check interval to match your SLA — 1-minute checks are appropriate for most production services
Alert on consecutive failures, not single failures, to reduce noise from transient network hiccups
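Alerting on consecutive failures amounts to a small counter that resets on success. In this sketch the `ConsecutiveFailureAlert` class and the threshold of 3 are assumptions; most monitoring tools expose an equivalent setting.

```python
class ConsecutiveFailureAlert:
    """Fire once when a run of failures reaches the threshold."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, success: bool) -> bool:
        """Record one check result; return True only when the alert should fire."""
        if success:
            self.failures = 0
            return False
        self.failures += 1
        # Fire exactly once per failure run, at the moment the threshold is hit.
        return self.failures == self.threshold
```

With 1-minute checks and a threshold of 3, a single dropped probe is ignored and a sustained failure alerts after roughly three minutes.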

Key Takeaways

Liveness = is the process alive? Keep it trivial, no external calls.
Readiness = is this instance safe to receive traffic? Check dependencies, but stay fast.
Deep checks = is the system working end-to-end? Useful for dashboards, debugging, and external monitors.
Never point a load balancer or Kubernetes liveness probe at the same endpoint as your deep check.
Protect diagnostic endpoints — they expose internal architecture.
External, multi-region monitoring catches failures that internal probes structurally cannot.
