Feature

LavinMQ 2.0: High Availability through clustering


Written by: Lovisa Johansson


With the release of LavinMQ 2.0 (currently in pre-release v2.0.0-rc.4), we’re excited to introduce High Availability (HA). LavinMQ can now be fully clustered with multiple other LavinMQ nodes. This feature minimizes downtime by distributing data across multiple nodes, ensuring data consistency and availability, and allowing another node to take over immediately if one node fails.

One node is always the leader, and the others stream all changes in real time. When a leader node is elected, it tracks changes, such as published messages, acknowledged messages and metadata changes, by recording actions (read, append, delete) in a log. This log represents the shared state as a sequence of actions, which the leader replicates to the follower nodes. LavinMQ ensures that these actions are processed consistently across the cluster.
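
To make the idea of this log concrete, here is a rough sketch in Go of what a sequence of replicated actions could look like. The type names, file paths and layout are invented for illustration; they are not LavinMQ’s actual log or wire format.

```go
package main

import "fmt"

// ActionType mirrors the kinds of actions mentioned above (read, append,
// delete). These names are illustrative, not LavinMQ's actual types.
type ActionType int

const (
	Read   ActionType = iota // read a file's current content
	Append                   // e.g. a published message appended to a queue file
	Delete                   // e.g. a message removed after being acknowledged
)

// Action is one entry in the shared log that the leader replicates and
// the followers replay in order.
type Action struct {
	Type ActionType
	Path string // which file the action applies to (placeholder paths below)
	Data []byte // payload for Append, empty otherwise
}

func main() {
	log := []Action{
		{Type: Append, Path: "queues/orders/msgs.0001", Data: []byte("published message")},
		{Type: Delete, Path: "queues/orders/msgs.0001"},
	}
	for i, a := range log {
		fmt.Printf("entry %d: type=%d path=%s bytes=%d\n", i, a.Type, a.Path, len(a.Data))
	}
}
```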

[Image: LavinMQ message broker]

etcd is the distributed key-value store that manages leader election and cluster membership in LavinMQ. It determines which node will act as the leader and which will follow, similar to the role Zookeeper plays in Kafka. Data replication itself is handled by LavinMQ independently, using a custom protocol between the nodes.

In this blog, I’ll explore how HA operates in LavinMQ, covering essential concepts like etcd, Raft, leader election, and related terms.

etcd in LavinMQ

When setting up a LavinMQ multi-node cluster, the process starts with configuring an etcd cluster, where each member of the etcd cluster is defined on the respective etcd nodes. You can either have one etcd instance that manages all nodes or run an etcd instance on each node, as shown in the figure below.

Once the etcd cluster is operational, LavinMQ registers itself with etcd during initialization.
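
As a rough illustration of that registration step, the snippet below connects to a three-node etcd cluster with the official Go client and writes a key for the node. The endpoints and the key layout are placeholders for illustration, not LavinMQ’s actual registration code.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the etcd cluster; one endpoint per etcd member (placeholder addresses).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://10.0.0.1:2379", "http://10.0.0.2:2379", "http://10.0.0.3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Announce this node under a well-known prefix (the key layout is made up here).
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.Put(ctx, "/lavinmq/nodes/node-1", "10.0.0.1:5672"); err != nil {
		log.Fatal(err)
	}
	log.Println("registered with etcd")
}
```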

[Image: LavinMQ message broker]

LavinMQ interacts with etcd through the etcd HTTP API and requires etcd version 3.4 or higher. When a LavinMQ node connects, it asks etcd about its role in the cluster: whether it should be a leader or a follower. This process ensures that all nodes in the cluster stay in sync.

[Image: LavinMQ message broker]

Why etcd?

We explored several other options before deciding on etcd for LavinMQ, including Zookeeper and Consul. We chose etcd because it is resource-efficient, matching our focus on efficiency. etcd is also easy to set up and has a user-friendly API for leader election. Plus, its use by Kubernetes demonstrates its reliability and robustness.

Equally important, etcd is licensed under the same open-source license as LavinMQ, Apache License 2.0.

Leader election and the Raft consensus algorithm

etcd, which implements the Raft protocol, ensures that all nodes agree on the roles of leader and followers and manages leader elections as needed. When a LavinMQ node connects, it asks for its role within the cluster setup. etcd stores the leader-follower configuration for LavinMQ, but the etcd leader does not need to be the same as the LavinMQ leader. The LavinMQ leader is responsible for replicating data within the cluster.
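
To give a feel for what leader election on top of etcd can look like, here is a sketch that uses the election helper in etcd’s official Go client. The election prefix and node name are placeholders; LavinMQ’s own implementation may differ.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session holds a lease that keeps this node's candidacy alive.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// All nodes campaign on the same election prefix (placeholder name).
	election := concurrency.NewElection(sess, "/lavinmq/leader")

	// Campaign blocks until this node becomes the leader.
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("this node is now the leader and starts replicating to followers")
}
```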

[Image: LavinMQ message broker]

Two fundamental mechanisms in etcd are heartbeats and leases. etcd uses heartbeats to monitor the health of nodes and maintain cluster stability by detecting failures. The LavinMQ leader, on the other hand, obtains a lease from etcd and sends keep-alive signals so that etcd can track its activity. A lease is like a temporary contract that the LavinMQ leader obtains from etcd to prove it is still active. The lease has a time limit, and it is extended as long as the client sends keep-alive signals. If the client stops sending these signals, etcd knows that the client might have failed, and the lease expires. This ensures that only healthy and active clients are recognized.

First, we will explain how leader elections are triggered when a heartbeat is missed, followed by an explanation of the role of keep-alive signals.

etcd heartbeats

In an etcd cluster, the leader node regularly sends heartbeat messages to the other nodes to signal that it is still active. These heartbeat messages help the followers keep track of the leader’s status. When the leader fails or becomes unreachable, the follower nodes notice the missing heartbeat messages. Upon detecting the leader’s failure, the follower nodes transition to a candidate state and initiate an election process to select a new leader.

[Image: LavinMQ message broker]

During this election, nodes in the candidate state request votes from the other nodes in the cluster. For a candidate to become the new leader, it must receive votes from a majority of the nodes. This process ensures the new leader has widespread support and can effectively manage the cluster.

If a server with active connections goes down, clients will be disconnected and attempt to reconnect to one of the remaining IP addresses in the cluster (etcd maintains a list of follower nodes). Clients may end up reconnecting to either a follower or the leader. Follower nodes forward all client connections to the leader, so connecting to a follower instead of the leader may result in a slight performance impact.
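
On the client side, a common pattern is to keep a list of all node addresses and try them in turn until one accepts the connection. The sketch below uses the amqp091-go client; the URIs and the retry interval are placeholders.

```go
package main

import (
	"log"
	"time"

	amqp "github.com/rabbitmq/amqp091-go"
)

// connectAny tries each node in the cluster until one accepts the connection.
func connectAny(uris []string) *amqp.Connection {
	for {
		for _, uri := range uris {
			conn, err := amqp.Dial(uri)
			if err == nil {
				log.Printf("connected to %s", uri)
				return conn
			}
			log.Printf("could not reach %s: %v", uri, err)
		}
		time.Sleep(2 * time.Second) // back off before the next round
	}
}

func main() {
	conn := connectAny([]string{
		"amqp://guest:guest@10.0.0.1:5672/",
		"amqp://guest:guest@10.0.0.2:5672/",
		"amqp://guest:guest@10.0.0.3:5672/",
	})
	defer conn.Close()
}
```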

[Image: LavinMQ message broker]

etcd is built on the Raft protocol, which divides time into cycles called terms. Each term begins with an election to choose an etcd leader; if a leader is elected, it serves for the rest of that term. If no leader is chosen, the term ends and a new term with a new election begins, allowing the system to recover and maintain its operation with minimal disruption.

etcd lease and keep alive

While heartbeats detect if a node is down, leases track whether a client is still active within the cluster. Each lease has a specified time-to-live (TTL) and will expire if etcd does not receive a keep-alive signal from the client within this period.

LavinMQ regularly sends keep-alive updates to the etcd server to maintain its leases. As long as etcd receives these updates, the LavinMQ leader remains in place. If etcd fails to receive a keep-alive signal, it assumes the leader is no longer active and starts a new leader election.
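
The snippet below sketches these lease mechanics with etcd’s official Go client: grant a lease with a TTL, attach a key to it, and keep it alive for as long as the process is healthy. The key name and TTL are placeholders, not LavinMQ’s actual values.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Grant a lease with a 10-second TTL.
	lease, err := cli.Grant(context.Background(), 10)
	if err != nil {
		log.Fatal(err)
	}

	// Attach a key to the lease; the key disappears if the lease expires (placeholder key).
	if _, err := cli.Put(context.Background(), "/lavinmq/leader/alive", "node-1",
		clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// KeepAlive refreshes the lease for as long as this process is healthy.
	// If the channel closes, the lease could no longer be kept alive and will
	// eventually expire on the etcd side.
	ch, err := cli.KeepAlive(context.Background(), lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for resp := range ch {
		log.Printf("lease %x refreshed, TTL now %d seconds", resp.ID, resp.TTL)
	}
}
```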

ISR in LavinMQ

To keep data consistent, nodes in the LavinMQ cluster need to be in sync with the leader; such nodes are called In-Sync Replicas (ISR). Each node should have the same data as the leader, and etcd stores the list of ISRs to manage this synchronization.

max_lag

A LavinMQ follower is in sync with the leader as long as the number of actions not yet replicated to it does not exceed the configuration setting max_lag. If a follower falls behind by more than max_lag, it tells the leader that it is no longer in sync, and the leader then slows down by not confirming new messages to publishers until the lagging follower has caught up. This is called back pressure.

You can control how many actions a follower is allowed to lag behind by adjusting the max_lag setting in the configuration file (by default etc/lavinmq/lavinmq.ini). The default max_lag value is 8192, but you can modify this number to suit your needs.

Data replication

When an action is performed on the LavinMQ leader, it is written to the page cache in LavinMQ and added to an internal queue for replication. After this, the leader confirms the action to the client.

The max_lag setting defines the maximum length of this internal queue, which ensures that log entries are replicated to the followers. Once a follower receives a log entry, it returns an acknowledgment to the leader.
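
One way to picture this (a conceptual sketch, not LavinMQ’s implementation) is as a bounded queue of at most max_lag unacknowledged actions: enqueueing blocks when the queue is full, which is the back pressure described earlier.

```go
package main

import "fmt"

const maxLag = 8192 // mirrors the default max_lag setting

// replicationQueue models the leader's internal queue with a bounded channel.
// Enqueueing blocks when maxLag actions are still unacknowledged by a
// follower, which is what creates back pressure on publishers.
type replicationQueue struct {
	pending chan []byte
}

func newReplicationQueue() *replicationQueue {
	return &replicationQueue{pending: make(chan []byte, maxLag)}
}

// publish runs on the leader: queue the action for the followers, then
// confirm it to the client.
func (q *replicationQueue) publish(action []byte, confirm func()) {
	q.pending <- action // blocks if the slowest follower is maxLag actions behind
	confirm()
}

// followerAck is called when a follower acknowledges a replicated action,
// freeing a slot in the queue.
func (q *replicationQueue) followerAck() {
	<-q.pending
}

func main() {
	q := newReplicationQueue()
	q.publish([]byte("published message"), func() { fmt.Println("confirmed to client") })
	q.followerAck()
}
```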

Downtime

If a follower node goes down and later rejoins the cluster, it needs to catch up with the current state of the leader, even if it has fallen behind by a significant number of messages, such as 10,000,000. The synchronization process involves several steps:

  1. Log synchronization: Upon reconnecting, the LavinMQ follower receives a snapshot or a checksum from the LavinMQ leader and a list of logs (files). This information helps the follower understand how much it needs to catch up.
  2. Log requests: The follower then requests the missing log entries from the leader. It does this in chunks, asking for log entries until it is fully synchronized (see the sketch after this list).
  3. Update state: The leader sends the requested log entries to the follower until the follower’s log is up-to-date.
  4. Final check: Once the follower is caught up, a checksum is requested again. The leader then temporarily pauses all client operations to ensure the follower and leader are fully synchronized before resuming regular activity. This pause is generally brief and often barely noticeable, as only a small number of logs remain to be synchronized.
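
To make the chunked log requests in step 2 concrete, here is a rough sketch of a catch-up loop. The function names, chunk size and indices are invented for illustration.

```go
package main

import "fmt"

const chunkSize = 1000 // illustrative chunk size, not a LavinMQ constant

// catchUp requests the log entries a follower is missing, in chunks,
// until it has reached the leader's last index.
func catchUp(followerIndex, leaderIndex int64, fetch func(from, to int64) []string) {
	for followerIndex < leaderIndex {
		to := followerIndex + chunkSize
		if to > leaderIndex {
			to = leaderIndex
		}
		entries := fetch(followerIndex, to) // ask the leader for this range
		for range entries {
			followerIndex++ // apply each entry and advance the local log
		}
	}
	fmt.Println("follower log is up-to-date at index", followerIndex)
}

func main() {
	// Pretend the follower is 10,000,000 entries behind, as in the example above.
	fakeFetch := func(from, to int64) []string {
		return make([]string, to-from) // placeholder entries
	}
	catchUp(0, 10_000_000, fakeFetch)
}
```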

Setting up a LavinMQ cluster

When setting up a LavinMQ cluster, you can either run etcd on the same servers as LavinMQ or configure it as a standalone cluster that the LavinMQ instances connect to. If you are running LavinMQ hosted on CloudAMQP, you will be able to choose between 1, 3, or 5 nodes, depending on your requirements (coming soon!).

Once LavinMQ 2.0 is released, we will provide detailed instructions on configuring clustering in the LavinMQ Documentation. Stay tuned.

Please note that etcd does not affect LavinMQ’s single-node setups; there is no requirement for etcd when running a single-node LavinMQ.

Conclusions

LavinMQ 2.0 introduces High Availability (HA), a new feature that keeps your messaging system reliable even during outages. Leader elections are handled by etcd using the Raft protocol, while LavinMQ manages data replication.

Thank you for your continued support. I would be happy to hear your thoughts!