
10.4. Behaviour of the Group

This section first describes the behaviour of the group in its default configuration, and then describes the controls available to override that behaviour. These controls affect the durability of transactions and the data consistency between the master and replicas, allowing trade-offs to be made between performance and reliability.

10.4.1. Default Behaviour

Let's first look at the behaviour of a group in default configuration.

In the default configuration, for any messaging work to be done, at least quorum nodes must be present. For example, in a three node group there must be at least two nodes available.

When a messaging client sends a transaction, it can be assured that, before control returns to its application after the commit call, the following is true:

  • At the master, the transaction is written to disk and OS level caches are flushed, meaning the data is on the storage device.

  • At least quorum minus one replicas acknowledge receipt of the transaction. The replicas will write the data to the storage device some time later.

If there were to be a master failure immediately after the transaction was committed, the transaction would be held by at least quorum minus one replicas. For example, if we had a group of three, then we would be assured that at least one replica held the transaction.
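To make this concrete, the sketch below shows a client committing a transaction with the Qpid JMS client. The failover URL, host names and queue name are illustrative assumptions only, and older client versions use the javax.jms API shown here.

    import javax.jms.*;
    import org.apache.qpid.jms.JmsConnectionFactory;

    public class TransactedSend
    {
        public static void main(String[] args) throws JMSException
        {
            // Illustrative failover URL naming all three group nodes
            ConnectionFactory factory = new JmsConnectionFactory(
                    "failover:(amqp://node1:5672,amqp://node2:5672,amqp://node3:5672)");
            Connection connection = factory.createConnection();
            try
            {
                connection.start();
                Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
                MessageProducer producer = session.createProducer(session.createQueue("myqueue"));
                producer.send(session.createTextMessage("example payload"));

                // commit() returns only once the master has written and flushed the
                // transaction and at least quorum minus one replicas have acknowledged it
                session.commit();
            }
            finally
            {
                connection.close();
            }
        }
    }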

In the event of a master failure, if quorum nodes remain, those nodes hold an election. The nodes elect as master the node with the most recent transaction. If two or more nodes have the most recent transaction, the group makes an arbitrary choice. If quorum nodes do not remain, the nodes cannot elect a new master and will wait until nodes rejoin. You will see later that manual controls are available that allow service to be restored from fewer than quorum nodes and that influence which node gets elected in the event of a tie.

Whenever a group has fewer than quorum nodes present, the virtualhost will be unavailable and messaging connections will be refused. If quorum disappears at the very moment a messaging client sends a transaction, that transaction will fail.

You will have noticed the difference between the synchronization policies applied to the master and the replicas. The replicas send the acknowledgement back before the data is written to disk, whereas the master synchronously writes the transaction to storage. This is an example of a trade-off between durability and performance. We will see later how to control this trade-off.

10.4.2. Synchronization Policy

The synchronization policy dictates what a node must do when it receives a transaction before it acknowledges that transaction to the rest of the group.

The following options are available:

  • SYNC. The node must write the transaction to disk and flush any OS level buffers before sending the acknowledgement. SYNC offers the highest durability but the lowest performance.

  • WRITE_NO_SYNC. The node must write the transaction to disk before sending the acknowledgement. OS level buffers will be flushed at some point later. This typically provides an assurance against failure of the application but not of the operating system or hardware.

  • NO_SYNC. The node sends the acknowledgement immediately. The transaction will be written to disk and OS level buffers flushed at some point later. NO_SYNC offers the highest performance but the lowest durability. This synchronization policy is sometimes known as commit to the network.

It is possible to assign one policy to the master and a different policy to the replicas. The policies are configured as attributes on the virtualhost. By default the master uses SYNC and the replicas use NO_SYNC.
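As an illustration, the sketch below updates both policies through the Broker's HTTP management interface using Java 11's HttpClient. The attribute names (localTransactionSynchronizationPolicy for the master, remoteTransactionSynchronizationPolicy for the replicas), the URL and the node and virtualhost names are assumptions for illustration; consult the attribute reference for your Broker version. Authentication is omitted for brevity.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SetSynchronizationPolicies
    {
        public static void main(String[] args) throws Exception
        {
            // Illustrative management URL, virtualhost node name and virtualhost name
            String url = "http://localhost:8080/api/latest/virtualhost/node1/myvirtualhost";
            // Assumed attribute names; verify against your Broker's attribute reference
            String body = "{\"localTransactionSynchronizationPolicy\": \"SYNC\","
                        + " \"remoteTransactionSynchronizationPolicy\": \"WRITE_NO_SYNC\"}";

            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("HTTP status: " + response.statusCode());
        }
    }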

10.4.3. Node Priority

Node priority can be used to influence the behaviour of the election algorithm. It is useful where you want to favour some nodes over others, for instance, to favour nodes located in a particular data centre over those in a remote site.

The following options are available:

  • Highest. Nodes with this priority will be more favoured. In the event of two or more nodes having the most recent transaction, the node with this priority will be elected master. If two or more nodes have this priority the algorithm will make an arbitrary choice.

  • High. Nodes with this priority will be favoured, but not as much as those with Highest.

  • Normal. This is the default election priority.

  • Never. The node will never be elected, even if it has the most recent transaction. The node will still keep up to date with the replication stream and will still vote in elections, but it can never itself be elected.

Node priority is configured as an attribute on the virtualhost node. It can be changed at runtime and is effective immediately.
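For example, the priority could be changed at runtime with an HTTP update to the virtualhost node, along the lines of the sketch below. The URL, the node name, and the assumption that the priority attribute takes an integer (higher values more favoured, 0 meaning Never) are illustrative; verify them against your Broker version.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SetNodePriority
    {
        public static void main(String[] args) throws Exception
        {
            // Illustrative URL and node name; the priority attribute is assumed to be an
            // integer where a higher value is more favoured and 0 corresponds to Never
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://localhost:8080/api/latest/virtualhostnode/node1"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"priority\": 2}"))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("HTTP status: " + response.statusCode());
        }
    }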

Important

Use of the Never priority can lead to transaction loss. For example, consider a group of three in which Replica-2 is marked as Never and Replica-1 is running behind for some reason (perhaps a full GC). If a transaction were to arrive and be acknowledged only by the Master and Replica-2, the transaction would succeed. If a Master failure were to occur at that moment, the remaining nodes would elect Replica-1 even though Replica-2 held the most recent transaction.

Transaction loss is reported by message HA-1014.

10.4.4. Required Minimum Number Of Nodes

This controls the required minimum number of nodes to complete a transaction and to elect a new master. By default, the required number of nodes is set to Default (which signifies quorum).

It is possible to reduce the required minimum number of nodes. The rationale for doing this is normally to temporarily restore service from fewer than quorum nodes following an extraordinary failure.

For example, consider a group of three. If one node were to fail, as quorum still remained, the system would continue to work without any intervention. If the failed node were the master, a new master would be elected.

What if a further node were to fail? Quorum no longer remains, and the remaining node would just wait. It cannot elect itself master. What if we wanted to restore service from just this one node?

In this case, the Required Minimum Number Of Nodes can be reduced to 1 on the remaining node, allowing the node to elect itself master and service to be restored from the singleton. The required minimum number of nodes is configured as an attribute on the virtualhost node, can be changed at runtime and is effective immediately.
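A sketch of such an emergency override is shown below. The quorumOverride attribute name, the URL and the node name are assumptions for illustration, and the value should be reverted to the default once the failed nodes return.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class OverrideRequiredNodes
    {
        public static void main(String[] args) throws Exception
        {
            // Illustrative URL and node name; quorumOverride is the assumed attribute name
            // for the required minimum number of nodes (0 restores the Default behaviour)
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://localhost:8080/api/latest/virtualhostnode/node1"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"quorumOverride\": 1}"))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("HTTP status: " + response.statusCode());
        }
    }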

Important

The attribute must be used cautiously. Careless use will lead to lost transactions and can lead to a split-brain in the event of a network partition. If used to temporarily restore service from fewer than quorum nodes, it is imperative to revert it to the Default value as the failed nodes are restored.

Transaction loss is reported by message HA-1014.

10.4.5. Allow to Operate Solo

This attribute only applies to groups containing exactly two nodes.

In a group of two, if a node were to fail then, in the default configuration, work will cease as quorum no longer exists. A single node cannot elect itself master.

The allow to operate solo flag allows a node in a two node group to elect itself master and to operate solo. It is configured as an attribute on the virtualhost node, can be changed at runtime and is effective immediately.

For example, consider a group of two where the master fails. Service will be interrupted as the remaining node cannot elect itself master. To allow it to become master, apply the allow to operate solo flag to it. It will elect itself master and work can continue, albeit from one node.
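The sketch below sets the flag on the surviving node through the HTTP management interface. The designatedPrimary attribute name, the URL and the node name are assumptions for illustration.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class AllowOperateSolo
    {
        public static void main(String[] args) throws Exception
        {
            // Illustrative URL and node name; designatedPrimary is the assumed attribute
            // behind the "allow to operate solo" flag and must be set on one node only
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://localhost:8080/api/latest/virtualhostnode/node1"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"designatedPrimary\": true}"))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("HTTP status: " + response.statusCode());
        }
    }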

Important

It is imperative not to allow the allow to operate solo flag to be set on both nodes at once. Doing so means that, in the event of a network partition, a split-brain will occur.

Transaction loss is reported by message HA-1014.

10.4.6. Maximum message size

Internally, BDB JE HA restricts the maximum size of replication stream records passed from the master to the replica(s). This helps prevent denial-of-service attacks. If the application's expected maximum message size is greater than 5MB, the BDB JE setting je.rep.maxMessageSize and the Qpid context variable qpid.max_message_size need to be adjusted accordingly in order to avoid running into the BDB JE HA limit.
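For example, if messages of up to 10MB were expected, both settings might be raised through the virtualhost node's context, roughly as sketched below. The URL, the node name, and the assumption that the JE setting can be supplied through the same context map are illustrative; consult the Broker and BDB JE documentation for the definitive mechanism.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RaiseMaxMessageSize
    {
        public static void main(String[] args) throws Exception
        {
            // Illustrative URL and node name; 10485760 bytes = 10MB. Whether
            // je.rep.maxMessageSize may be passed via the context map is an assumption.
            String body = "{\"context\": {\"qpid.max_message_size\": \"10485760\","
                        + " \"je.rep.maxMessageSize\": \"10485760\"}}";

            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://localhost:8080/api/latest/virtualhostnode/node1"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("HTTP status: " + response.statusCode());
        }
    }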