High Availability (HA)

High Availability (HA) is the property of a system or component that stays operational, usually measured as uptime, for longer than would normally be expected. The goal is to minimize or eliminate downtime so that the system remains accessible and responsive to users and other systems, even when individual components fail.

HA is typically achieved through redundancy, automatic failover mechanisms, and robust monitoring, ensuring that if one part of the system fails, another can take over its function with minimal interruption.

The Goal: Minimizing Downtime

In many applications, especially real-time systems like stream processors or critical databases, downtime can lead to significant business impact, data loss, or violation of service level agreements (SLAs). High Availability aims to prevent this by designing systems that can automatically handle common failures (node crashes, network glitches) without requiring manual intervention and with very short recovery times.

Key metrics associated with HA include:

Uptime: The percentage of time the system is operational (e.g., 99.9%, 99.99%, 99.999% - "three nines", "four nines", "five nines").
Recovery Time Objective (RTO): The maximum acceptable time for the system to recover and become operational again after a failure event. HA systems strive for very low RTOs.
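
To make the "nines" concrete, the short sketch below converts an uptime percentage into a yearly downtime budget and checks a measured recovery time against an RTO. The function names are illustrative, and the calculation assumes a 365-day year.

```python
# Yearly downtime budget implied by an uptime target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the downtime allowed per year, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def meets_rto(recovery_seconds: float, rto_seconds: float) -> bool:
    """Return True if an observed recovery finished within the RTO."""
    return recovery_seconds <= rto_seconds


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {downtime_minutes_per_year(pct):.1f} min/year")
```

Each additional nine cuts the budget by a factor of ten: roughly 525.6 minutes per year at three nines, 52.6 at four, and 5.3 at five. At five nines, a single failover that takes several minutes can consume the whole annual budget, which is why HA systems pair high uptime targets with low RTOs.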

Techniques for Achieving High Availability

HA strategies often overlap with Fault Tolerance techniques but focus specifically on minimizing service interruption:

Redundancy: Deploying multiple instances of critical components (servers, databases, network paths, power supplies). If one fails, others are available.
Automatic Failover: Implementing mechanisms that automatically detect the failure of an active component and switch operations over to a standby or redundant component without manual intervention. This often involves:
- Heartbeating: Components periodically send "I'm alive" signals.
- Failure Detection: Monitoring systems detect missed heartbeats or other failure symptoms.
- Leader Election: Automatically selecting a new active component (leader) from the available replicas if the current leader fails.
Load Balancing: Distributing incoming requests or workload across multiple redundant components. This improves performance and also means that if one component fails, the load balancer can redirect traffic to the remaining healthy instances.
Replication: Keeping data synchronized across redundant components (e.g., database replication, state replication) so that a standby component has the necessary information to take over.
Stateless Services: Designing components to be stateless where possible makes failover easier, as there's no critical state to transfer or recover on the specific failed instance. State is managed elsewhere (e.g., in a distributed state store or database).
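
The heartbeating, failure detection, and leader election steps described under Automatic Failover can be sketched in a few lines. This is a single-process toy under stated assumptions, not a production design: the class name, the timeout value, and the "lowest node name wins" election rule are all hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow. Real systems run election through a consensus protocol so that two nodes never both believe they are the leader.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FailureDetector:
    """Tracks heartbeats and promotes a standby when the leader goes silent."""

    timeout: float  # seconds of silence before a node is considered failed
    last_seen: Dict[str, float] = field(default_factory=dict)
    leader: Optional[str] = None

    def heartbeat(self, node: str, now: float) -> None:
        """Record an "I'm alive" signal; the first node seen becomes leader."""
        self.last_seen[node] = now
        if self.leader is None:
            self.leader = node

    def healthy(self, now: float) -> List[str]:
        """Return nodes whose latest heartbeat is within the timeout, sorted by name."""
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout)

    def check(self, now: float) -> Optional[str]:
        """Fail over if the leader missed its heartbeats; return the current leader."""
        alive = self.healthy(now)
        if self.leader not in alive:
            # Illustrative election rule: the healthy node with the lowest name wins.
            self.leader = alive[0] if alive else None
        return self.leader


fd = FailureDetector(timeout=3.0)
fd.heartbeat("node-a", now=0.0)  # node-a is seen first and becomes leader
fd.heartbeat("node-b", now=0.0)
fd.heartbeat("node-b", now=4.0)  # node-a has stopped sending heartbeats
print(fd.check(now=5.0))         # node-b
```

The timeout is the main tuning knob: a short timeout lowers recovery time but risks declaring a slow node dead, while a long timeout avoids false failovers at the cost of a higher RTO.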

High Availability vs. Fault Tolerance

Fault Tolerance focuses on correctness and data integrity despite failures (the system recovers to a consistent state, possibly after some downtime).
High Availability focuses on continuous operation and minimizing downtime during failures (the system remains accessible).

HA relies heavily on underlying fault tolerance mechanisms (like checkpointing and state replication) to ensure that the failover process results in a consistent and correct system state.

High Availability in RisingWave

RisingWave is designed with High Availability in mind for its various components:

Compute Nodes: Compute nodes execute the streaming dataflows. They are designed to be relatively stateless regarding the core processing logic (relying on Hummock for durable state).
- Failover: If a compute node fails, the Meta Node detects the failure (via heartbeating) and reschedules the failed node's tasks (fragments of the dataflow graph) onto other available compute nodes. The rescheduled fragments load their state from the last successful checkpoint in Hummock, so processing resumes with minimal data loss (bounded by the checkpoint frequency). Recovery time is relatively low, and the RTO depends primarily on how long state loading takes.
Meta Node: The Meta Node is the central coordinator and metadata manager. It can be a single point of failure if not configured for HA.
- HA Configuration: RisingWave supports Meta Node HA by running multiple Meta Node instances backed by a distributed coordination service like etcd or using an embedded Raft implementation. If the active Meta Node leader fails, the coordination service facilitates the election of a new leader from the standby instances, ensuring the control plane remains available.
Frontend Nodes: These nodes handle client connections (e.g., from psql clients) and query parsing and planning. Multiple Frontend nodes can run behind a load balancer for HA and scalability. They are largely stateless with respect to query execution, which happens on Compute nodes.
Hummock State Store: Hummock relies on external cloud object storage (such as S3) for durability, and the object storage itself provides extremely high durability and availability. The Hummock components that run inside RisingWave (e.g., the Hummock Manager within the Meta Node) are made highly available through the Meta Node HA mechanism.
Compactor Nodes: These nodes perform background state compaction tasks for Hummock. Multiple Compactor nodes can be deployed; if one fails, others continue compacting, ensuring the health of the state store.
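
The compute node failover described above amounts to reassigning a failed node's fragments to the survivors. The sketch below illustrates that idea only; it is not RisingWave's scheduler. The function name, the node and fragment identifiers, and the least-loaded placement rule are all assumptions made for illustration, and loading state from the last checkpoint is not modeled.

```python
from typing import Dict, List


def reschedule(assignments: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new assignment with the failed node's fragments moved to survivors."""
    # Copy the surviving nodes' fragment lists so the input is left untouched.
    survivors = {n: list(frags) for n, frags in assignments.items() if n != failed}
    if not survivors:
        raise RuntimeError("no compute nodes left to take over")
    for fragment in assignments.get(failed, []):
        # Hypothetical placement rule: least-loaded node, ties broken by name.
        target = min(survivors, key=lambda n: (len(survivors[n]), n))
        survivors[target].append(fragment)
        # In a real system the fragment would now reload its state from the
        # last successful checkpoint before resuming processing.
    return survivors


before = {"compute-1": ["f1", "f2"], "compute-2": ["f3"], "compute-3": ["f4"]}
after = reschedule(before, failed="compute-1")
print(after)  # {'compute-2': ['f3', 'f1'], 'compute-3': ['f4', 'f2']}
```

Because the fragments keep their durable state outside the node, the reassignment itself is cheap; most of the recovery time goes to reloading state on the new node.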

By providing HA for its core components and leveraging fault-tolerant mechanisms like checkpointing and durable state storage, RisingWave aims to deliver a reliable streaming database service that minimizes downtime and ensures continuous processing.
