A domain is fully available only when all of its subcomponents are available. However, transaction processing can still run over the domain even if only the mediator and the sequencer are available. The domain services handle new connections to domains, and the identity manager handles changes to the identity state; the unavailability of these two components affects only the services they handle. As all of these components can run in separate processes, we handle the HA of each component separately.
The HA properties of the Sequencer depend on the chosen implementation. When the sequencer is based on an HA ledger, such as Hyperledger Fabric, the sequencer automatically becomes HA. The domain service can return multiple sequencer endpoints, any of which can be used to interact with the underlying ledger.
For the database sequencer, we use an active-active setup over a shared database. The setup relies on the database for both HA and consistency.
Database Sequencer HA¶
The database Sequencer uses the database itself to ensure that events are sequenced in a consistent order. Multiple Sequencer nodes can be deployed, where each node has a Sequencer reader and writer component; all of these components can concurrently read and write to the same database. A load balancer can be used to evenly distribute requests between these nodes. The Canton health endpoint can be used to halt sending requests to a node that reports itself as unhealthy.
Sequencer nodes are statically configured with the total number of possible Sequencer nodes, and each node is assigned a distinct index from this range. This index is used to partition the available event timestamps, ensuring that two Sequencer nodes will never use the same event id/timestamp.
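One way to picture this partitioning: if timestamps were integer microseconds, each node could use only the values congruent to its index modulo the configured total. This is an illustrative sketch, not Canton's actual scheme; `next_timestamp` and its parameters are hypothetical names.

```python
TOTAL_NODES = 4  # statically configured total number of sequencer nodes

def next_timestamp(now_micros: int, node_index: int, total: int = TOTAL_NODES) -> int:
    """Round now_micros up to the next timestamp reserved for node_index.

    Each node only ever emits values congruent to its index modulo the
    total, so two nodes can never produce the same event timestamp.
    """
    base = now_micros - (now_micros % total) + node_index
    return base if base >= now_micros else base + total

# Two nodes asked at the same instant never collide:
assert next_timestamp(1_000_000, 0) != next_timestamp(1_000_000, 1)
```

The key property is that uniqueness is guaranteed by construction, without any coordination between nodes at write time.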
Events are written to the events table and can be read in ascending timestamp order. To provide a continuous monotonic stream of events, readers need to know the point up to which events can be read without the risk of an earlier event being inserted by a writer process. To do this, writers regularly update a watermark table in which they publish their latest event timestamp. Readers take the minimum timestamp from this table as the point up to which they can safely query events.
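The safe read point described above can be sketched as follows. The dictionary layout and field names are illustrative, not the actual Canton schema: each writer publishes its latest event timestamp, and readers serve only events at or below the minimum of these.

```python
def safe_read_point(watermarks: dict) -> int:
    """No online writer will insert an event at or before the minimum
    published watermark, so readers can stream up to this point without
    the risk of a gap appearing behind them."""
    return min(watermarks.values())

def readable_events(events: list, watermarks: dict) -> list:
    """Return the events that are safe to serve, in ascending timestamp order."""
    horizon = safe_read_point(watermarks)
    return sorted((e for e in events if e["ts"] <= horizon),
                  key=lambda e: e["ts"])

watermarks = {"seq-0": 105, "seq-1": 98, "seq-2": 112}
events = [{"ts": 110}, {"ts": 90}, {"ts": 95}]
assert [e["ts"] for e in readable_events(events, watermarks)] == [90, 95]
```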
If a Sequencer node were to fail, it would stop updating its watermark value; once its stale value becomes the minimum timestamp, all readers effectively pause at this point (as they cannot read beyond it). When updating their own watermarks, the other Sequencer writers also check that the remaining Sequencer watermarks are being updated in a timely manner. If a Sequencer node is noticed not to have updated its watermark within a configurable interval, it is marked as offline and its watermark is no longer included in the query for the minimum event timestamp. Future events from the offline Sequencer after this timestamp are then ignored. For this process to operate optimally, the clocks of the Sequencer node hosts are expected to be synchronized; this is considered reasonable where all Sequencer hosts are co-located and NTP is used.
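The exclusion of a stale node can be sketched as a filter over the watermark table before taking the minimum. Again the names (`active_watermarks`, `last_updated`) are hypothetical, and real deployments track update times in the database rather than in memory.

```python
def active_watermarks(watermarks: dict, last_updated: dict,
                      now: float, max_silence: float) -> dict:
    """Drop sequencers whose watermark has not been refreshed within
    max_silence seconds. A crashed node's stale watermark then stops
    holding back the minimum read point for everyone else."""
    return {seq: wm for seq, wm in watermarks.items()
            if now - last_updated[seq] <= max_silence}

watermarks = {"seq-0": 100, "seq-1": 40}
last_updated = {"seq-0": 10.0, "seq-1": 2.0}  # seq-1 silent for 10s
online = active_watermarks(watermarks, last_updated, now=12.0, max_silence=5.0)
assert online == {"seq-0": 100}
assert min(online.values()) == 100  # readers resume past seq-1's stale mark
```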
If the failed Sequencer has recovered and would like to resume operation, it should delete all of its events past its last known watermark, to avoid incorrectly re-inserting them into the events the readers will see, as readers may have read subsequent events by this time. This is safe to do without affecting events that have already been read, as any events written by the offline Sequencer after it was marked offline are ignored by readers. It should then replace its old watermark with a new timestamp for the events it will start inserting, ensuring that this is greater than any existing value, and resume normal operation.
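The recovery steps can be summarized in a short sketch; `recover` and the record layout are illustrative assumptions, not Canton's actual recovery code.

```python
def recover(events: list, node: str, last_watermark: int,
            other_watermarks: dict):
    """Restart procedure for a sequencer marked offline:
    1. delete this node's events past its last known watermark
       (readers ignored them anyway, so nothing visible is lost);
    2. pick a fresh watermark greater than any existing value,
       so newly inserted events land strictly in the future."""
    kept = [e for e in events
            if not (e["writer"] == node and e["ts"] > last_watermark)]
    new_watermark = max(list(other_watermarks.values()) + [last_watermark]) + 1
    return kept, new_watermark

events = [{"writer": "a", "ts": 5}, {"writer": "a", "ts": 9},
          {"writer": "b", "ts": 8}]
kept, new_wm = recover(events, "a", last_watermark=6, other_watermarks={"b": 8})
assert [e["ts"] for e in kept] == [5, 8]  # a's post-watermark event removed
assert new_wm == 9
```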
When a Sequencer fails and resumes operation, there will be a short pause in reading from other Sequencers due to updates to the watermark table. However, requests to the other Sequencer nodes should continue successfully, and any events written during this period will be available to read as soon as the pause has completed. Any send requests that were being processed by the failed Sequencer process will likely be lost, but they can be safely retried once their max-sequencing-time has been exceeded, without the risk of creating duplicate events.
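The retry rule rests on a single comparison, sketched below with hypothetical names: once the sequencer clock has passed a request's max-sequencing-time, the original send can no longer be sequenced, so resending cannot produce a duplicate event.

```python
def safe_to_retry(request: dict, sequencer_ts: int) -> bool:
    """A send lost in a crashed sequencer may be retried only after its
    max-sequencing-time has passed on the sequencer clock; before that,
    the original might still be sequenced and a retry could duplicate it."""
    return sequencer_ts > request["max_sequencing_time"]

req = {"max_sequencing_time": 500}
assert not safe_to_retry(req, sequencer_ts=450)  # original may still land
assert safe_to_retry(req, sequencer_ts=501)      # original can no longer land
```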
If the mediator service were to go offline, all transactions within the domain during this time would time out due to the absence of transaction results. To improve on this, many mediator nodes can be attached to a domain. These could all be run by a single operator, or by many operators in distinct organizations (particularly useful for domain drivers implemented on a distributed architecture).
Mediator nodes use a hot/warm approach to ensure that the vast majority of transactions receive transaction results. To start a domain, at least one mediator must be available. After a domain is started, additional mediator nodes can be added at any time.
All mediators read from the sequencer as the same member, and so receive the same events in the same order. This allows, and requires, all mediators to come to the same result when observing transaction confirmation requests and responses. Technically, all mediators could send transaction results, as duplicates are discarded by the recipients. However, to prevent most transactions from receiving duplicate results, an active mediator is assigned within the set of mediators and is primarily responsible for sending results. This active mediator holds an implicit lease that is retained while it continues to send events, and the warm mediators observe these events to track that the assigned mediator is still active. If a period of inactivity is noticed (potentially due to a crash), the other mediators will request the active mediator lease.
The active mediator lease is initially assigned to the first mediator that successfully sends a transaction result message. As these transaction result messages are sent via the domain sequencer, which places events into a global order, even if many mediators send this message at a similar time, the sequencer ensures that there is a consistent order to these events, so one of them comes first. Although the other mediators are no longer sending results, they are still "warm" in that they keep track of pending transaction requests and responses in case they become active.
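The "first sequenced result wins" rule can be sketched as a deduplication pass over the globally ordered message stream; the message layout is an illustrative assumption.

```python
def acted_on_results(sequenced_messages: list) -> list:
    """Recipients act only on the first sequenced result per request id;
    later duplicates from other mediators are ignored. This is what makes
    it safe for several mediators to send the same result concurrently."""
    seen, acted = set(), []
    for msg in sequenced_messages:  # already in global sequencer order
        if msg["request_id"] not in seen:
            seen.add(msg["request_id"])
            acted.append(msg)
    return acted

stream = [{"request_id": 1, "sender": "m1"},
          {"request_id": 1, "sender": "m2"},   # duplicate, ignored
          {"request_id": 2, "sender": "m2"}]
assert [m["sender"] for m in acted_on_results(stream)] == ["m1", "m2"]
```

Under this rule, initial lease assignment falls out for free: whichever mediator's result message is sequenced first is the one whose results take effect.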
There are two configurable domain parameters that control the behavior of HA mediators:

- Mediator inactivity timeout: the duration passive mediators will wait to see an event from the active mediator before assuming it is offline.
- Mediator heartbeat interval: how frequently the active mediator will send events to passive mediators to indicate that it is still online.
The inactivity timeout is measured using the domain sequencer clock provided in sequenced events, so that all mediators can deterministically decide that an active mediator lease has expired, regardless of differences between their local clocks.
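The deterministic expiry check reduces to a comparison on sequencer timestamps; a minimal sketch, with hypothetical names, assuming timestamps are plain integers:

```python
def lease_expired(last_active_ts: int, event_ts: int,
                  inactivity_timeout: int) -> bool:
    """Judge expiry using the sequencer timestamps carried in events, not
    local wall clocks. Every mediator sees the same event timestamps in
    the same order, so every mediator reaches the same verdict."""
    return event_ts - last_active_ts > inactivity_timeout

assert not lease_expired(last_active_ts=100, event_ts=130, inactivity_timeout=30)
assert lease_expired(last_active_ts=100, event_ts=131, inactivity_timeout=30)
```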
If an active mediator lease expires (potentially because the active mediator has gone offline), all passive mediators send a mediator leadership request via the sequencer to all domain mediators. This request contains the timestamp of the last witnessed active mediator activity.
When passive mediators receive this leadership request, they compare its last-activity timestamp with the last activity they themselves witnessed. If the timestamps match, they now consider the requesting mediator to hold the active mediator lease. These values may differ if the current active mediator sent a new event between its lease expiring and the leadership request being received, in which case it remains the active mediator. With many passive mediators, they may all send a leadership request around the same time; however, the sequencer orders these requests globally and only the first one wins.
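The decision each mediator makes on receiving a leadership request can be sketched as follows (names are illustrative, not Canton's API):

```python
def handle_leadership_request(req_last_activity: int, my_last_activity: int,
                              current_active: str, requester: str) -> str:
    """Grant the lease only if the requester witnessed the same last
    activity timestamp that we did. If our timestamp is newer, the active
    mediator produced an event after the requester's view and keeps the
    lease. Returns the mediator now considered active."""
    if req_last_activity == my_last_activity:
        return requester
    return current_active

# Requester and observer agree on last activity: lease transfers.
assert handle_leadership_request(100, 100, "m1", "m2") == "m2"
# Active mediator sent a newer event in the meantime: lease retained.
assert handle_leadership_request(100, 120, "m1", "m2") == "m1"
```

Because every mediator evaluates this same deterministic rule over the same sequenced inputs, they all agree on who holds the lease.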
Importantly, to reduce fail-over time, all mediators begin sending transaction results as soon as the lease of the currently active mediator expires. With more than one passive mediator, this will almost certainly cause the same transaction result to be delivered to participants more than once. Again, however, this is permitted, as the results are ordered and only the first received will be acted on. This behavior may be extended in the future to limit the number of duplicate results sent, and potentially to backfill results so that even transactions from before the lease expiry that are missing results may be sent by passive mediators.
The domain service uses a warm standby mechanism, the same as the one described for the SQL-based sequencers. The replicas share the same database; there is one active replica, and the replicas expose a starting/stopping mechanism as well as a health endpoint. There is a simple built-in HA tool to perform liveness/health checks in a setting with two replicas, but the mechanisms can also be integrated into a more robust HA setup.