Heartbeat & Health Checks

In a distributed system with hundreds of servers, machines fail constantly. Hard drives crash, network cables get disconnected, and processes run out of memory. The system must quickly detect these failures and route traffic away from dead nodes.

1. Heartbeat Mechanism

A Heartbeat is a periodic signal sent between nodes to indicate that they are still alive and functioning.

How it works:

Each server in the cluster periodically sends a small "I'm alive" message (heartbeat) to a central monitoring service or to its peers.
If the monitor receives no heartbeat from a node within a configured timeout (e.g., 3 consecutive missed heartbeats at a 10-second interval, which is 30 seconds of silence), the node is declared dead and removed from the active pool.
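The timeout rule above can be sketched as a small monitor class. This is a minimal illustration, not a production design: the class name HeartbeatMonitor, its method names, and the 10-second interval are assumptions made for the example, and the clock is injectable only so the logic can be exercised without sleeping.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that go silent."""

    def __init__(self, interval_s=10.0, max_missed=3, clock=time.monotonic):
        # Assumed policy: dead after max_missed intervals of silence (3 x 10 s = 30 s).
        self.timeout_s = interval_s * max_missed
        self.clock = clock
        self.last_seen = {}  # node_id -> time of the most recent heartbeat

    def record_heartbeat(self, node_id):
        """Call whenever an "I'm alive" message arrives from node_id."""
        self.last_seen[node_id] = self.clock()

    def dead_nodes(self):
        """Return the ids of nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)

    def evict_dead(self):
        """Remove dead nodes from the active pool and return their ids."""
        dead = self.dead_nodes()
        for node_id in dead:
            del self.last_seen[node_id]
        return dead
```

A monitor loop would call record_heartbeat on every incoming message and evict_dead on a timer, passing the evicted ids to whatever component routes traffic.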

Types:

Push-based: Each node proactively sends heartbeats to the monitor. Simple but requires the monitor to track all nodes.
Pull-based: The monitor periodically polls each node. Simpler for the nodes, but the monitor's polling load grows with the size of the cluster.
Gossip-based: Each node randomly contacts a few peers and exchanges health information. No central monitor is needed, and it scales well because each node only talks to a few peers per round regardless of cluster size. Used by Cassandra and Consul.
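To make the gossip variant more concrete, here is a simplified sketch in which every node keeps a table of the highest heartbeat counter it has seen for each node and merges tables with a few random peers per round. The names GossipNode and gossip_round are hypothetical, and real systems add timestamps, failure suspicion, and network transport on top of this idea.

```python
import random


class GossipNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.heartbeats = {node_id: 0}  # node_id -> highest heartbeat counter seen

    def tick(self):
        """Bump this node's own counter; a rising counter is the sign of life."""
        self.heartbeats[self.node_id] += 1

    def merge(self, remote_table):
        """Keep the highest counter seen for every node."""
        for node_id, counter in remote_table.items():
            if counter > self.heartbeats.get(node_id, -1):
                self.heartbeats[node_id] = counter

    def gossip_with(self, peer):
        """Exchange tables in both directions."""
        snapshot = dict(self.heartbeats)
        self.merge(peer.heartbeats)
        peer.merge(snapshot)


def gossip_round(nodes, fanout, rng=random):
    """Each node bumps its counter, then gossips with `fanout` random peers."""
    for node in nodes:
        node.tick()
        others = [n for n in nodes if n is not node]
        for peer in rng.sample(others, fanout):
            node.gossip_with(peer)
```

A node whose counter stops rising in its peers' tables for long enough would then be treated as suspect, with no single monitor involved.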

2. Health Checks

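A Health Check goes a step further than a heartbeat: instead of only signalling that a node exists, it probes whether the service can actually do its job. Health checks are typically used by load balancers and container orchestrators such as Kubernetes.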

Types of Health Checks:

Liveness Check

Answers the question: "Is the process running?"

It is usually a simple TCP connection check or an HTTP GET to an endpoint such as /healthz that returns 200 OK.
If the liveness check fails, Kubernetes will restart the container.

Readiness Check

Answers the question: "Is the process ready to accept traffic?"

A service might be running but still initializing (loading a large ML model, warming up caches, establishing database connections).
If the readiness check fails, the load balancer stops sending traffic to that instance but does not restart it.
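The difference between liveness and readiness can be sketched with two HTTP endpoints built on Python's standard http.server module. The /readyz path, the ready flag, and the helper probe_status are assumptions made for this example; only /healthz comes from the text above.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once initialization (model loading, cache warm-up, DB connections) is done.
ready = threading.Event()


def probe_status(path):
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        return 200  # liveness: the process is up and able to answer
    if path == "/readyz":
        return 200 if ready.is_set() else 503  # readiness: only after init
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path))
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the request log


def serve(port=8080):
    """Serve the probe endpoints until the process exits."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

An application would typically run serve in a background thread at startup and call ready.set() once initialization finishes, so that /healthz answers 200 immediately while /readyz answers 503 until the instance can take traffic.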

Startup Check

Answers the question: "Has the process finished its initial startup?"

For slow-starting applications, this prevents the liveness check from killing the container during a legitimate long startup sequence.
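The gating behaviour can be illustrated with a small decision function: until the startup check has passed once, liveness results are ignored, and only an exhausted startup budget leads to a restart. The class ProbeGate and its return values are hypothetical, and the sketch restarts on the first liveness failure after startup, whereas real orchestrators usually wait for several consecutive failures.

```python
class ProbeGate:
    """Suppresses liveness checks until a startup check has passed once."""

    def __init__(self, startup_failure_threshold=30):
        self.threshold = startup_failure_threshold
        self.startup_failures = 0
        self.started = False

    def evaluate(self, startup_ok, liveness_ok):
        """Return "wait", "ok", or "restart" for one probe cycle."""
        if not self.started:
            if startup_ok:
                self.started = True
                return "ok"
            self.startup_failures += 1
            # Liveness is ignored here; only the startup budget can trigger a restart.
            return "restart" if self.startup_failures >= self.threshold else "wait"
        return "ok" if liveness_ok else "restart"
```

With a threshold of 30 failures and a probe every 10 seconds, for example, the application would get up to 300 seconds to start before being restarted.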

3. Failure Detection Challenges

Network Partition vs Node Failure: If a monitoring node cannot reach Server B, is Server B dead, or is the network between them broken? Acting too aggressively (declaring nodes dead on a single missed heartbeat) can lead to unnecessary failovers. Acting too slowly means traffic continues to be sent to dead nodes.
The Phi Accrual Failure Detector: Instead of a binary "alive/dead" decision, this algorithm (used by Akka and Cassandra) calculates a continuous suspicion level (phi) from the inter-arrival times of past heartbeats. The longer a heartbeat is overdue relative to what the node's history predicts, the higher phi climbs, so detection adapts to actual network conditions rather than relying on one fixed timeout.
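As a rough sketch of the idea, phi can be computed as the negative base-10 logarithm of the probability that a heartbeat would still arrive later than the time already elapsed. The version below assumes normally distributed inter-arrival times, which is one common modelling choice; the class name, window size, and standard-deviation floor are assumptions for the example and do not reproduce the Akka or Cassandra implementations.

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window=100, min_std_s=0.1):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times (seconds)
        self.min_std_s = min_std_s  # floor so perfectly regular heartbeats don't give std = 0
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely dead."""
        if len(self.intervals) < 2:
            return 0.0  # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), self.min_std_s)
        elapsed = now - self.last_arrival
        # P(next heartbeat arrives later than `elapsed`) under the normal model.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        if p_later <= 0.0:
            return float("inf")  # probability underflowed: maximally suspicious
        return -math.log10(p_later)
```

A caller would compare phi against a configurable threshold and mark the node as failed once it is exceeded; a phi of 1 corresponds to roughly a 10% chance that the node is merely late, a phi of 2 to roughly 1%, and so on.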