Important

This feature is only available in Canton Enterprise

High Availability Usage

Overview

Canton nodes can be deployed in a highly available manner to ensure that domains and participants continue operating despite isolated machine failures. See High Availability for a detailed description of the HA architecture of each Canton component.

Mediator

The mediator service uses a hot-standby mechanism with an arbitrary number of replicas. The failover protocol issues a lease to the active mediator; on inactivity, the lease can be reassigned to a standby mediator. As this protocol runs over the sequencer, the failover time depends on the latency of the connection to the sequencer.

HA Configuration

HA mediator support is only available in the Enterprise version of Canton. It is enabled by default, but can be disabled in either domain nodes or separate mediator nodes with:
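
The relevant setting is shown here in HOCON path form; the full block form appears in the section Disable the highly available mediator protocol further down.

  # embedded mediator in a domain node
  canton.domains.da.mediator.high-availability.enabled = false

  # separate mediator node
  canton.mediators.mediator1.mediator.high-availability.enabled = false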

The domain parameters include two settings to configure the behavior of HA mediators; an indicative configuration sketch follows this list:

  • mediator inactivity timeout: If this duration elapses without the active mediator having successfully sequenced an event, its lease is considered expired. Setting this to a low value may increase network traffic and processing load on the participants, as passive mediators may erroneously consider the active mediator offline and request the active mediator lease; for example, a short interruption at the active mediator (a network blip or reconnect) may already exceed a small threshold. Setting this to a high value causes passive mediators to wait a long time before considering the active mediator offline, resulting in a long fail-over time before the passive mediators start responding to transactions.

  • mediator heartbeat interval: How often the active mediator sends an empty event to the other mediators to indicate that it is still actively maintaining its lease. This is particularly helpful in the absence of transaction activity.
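
As an illustration, these two settings live alongside the other domain parameters in the domain's configuration. The sketch below is indicative only: the key names mediator-inactivity-timeout and mediator-heartbeat-interval are placeholders, so consult the configuration reference of your Canton version for the exact names and default values.

  domains {
    da {
      domain-parameters {
        # placeholder key names and values, for illustration only
        mediator-inactivity-timeout = 30s
        mediator-heartbeat-interval = 5s
      }
    }
  }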

Running a mediator separately from the domain node

A domain may be statically configured with a single embedded mediator node, or it may be configured to work with external mediators. Once the domain has been initialized, further mediators can be added at runtime.

By default, a domain node runs an embedded mediator node itself. This is useful in simple deployments where all domain functionality can be co-located on a single host. In a distributed setup where domain services are operated across many machines, you can instead configure the domain node to indicate that the mediator(s) will be running externally.

  domains {
    da {
      mediator {
        type = external
      }
      # other domain configuration is skipped for this example but would likely be present
    }
  }

Mediator nodes can be defined in the same manner as Canton participants and domains.

  mediators {
    mediator1 {
      public-api.port = 7070
      admin-api.port = 7071
      mediator.high-availability.enabled = true
    }
  }

When the domain starts, it will automatically provide an embedded mediator with the information about the domain. External mediators have to be initialized using runtime administration in order to complete the domain's initialization.

      da.setup.onboard_mediator(mediator1)

Adding additional mediator nodes to a domain

Additional mediator nodes can be added to a domain after it has been initialized, even once the domain has been running for some time. New nodes are added dynamically using runtime administration, which allows adding many mediators over time, even with ongoing domain activity. To remotely administer running domain and mediator nodes, configure them as remote-domains and remote-mediators in the Canton configuration and issue commands through the Canton console.

      val mediatorKey = da.setup.onboard_mediator(mediator2)
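
A minimal remote-node configuration for issuing the command above from a separate console process might look like the sketch below. The addresses and ports are placeholders, and the exact endpoints required for a remote mediator are an assumption here; the remote-domains shape mirrors the remote administration example later in this section.

canton {
  remote-domains {
    da {
      public-api {
        address = da-domain.local
        port = 1234
      }
      admin-api {
        address = da-domain.local
        port = 1235
      }
    }
  }

  remote-mediators {
    mediator2 {
      # the admin-api is used for administration commands; depending on your
      # Canton version a public-api section may also be required (assumption)
      admin-api {
        address = mediator2.local
        port = 7071
      }
    }
  }
}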

Currently, the new mediator must read all events addressed to mediators since the domain was initialized in order to learn of all identity and topology changes. It will only fully start its mediator service once it witnesses its own key being registered as a mediator (allowing it to sign requests). If this initialization takes a significant amount of time, the mediator may respond to transactions that occurred after it was first onboarded but that have already been finalized. It may also attempt to obtain an active mediator lease for a fail-over event that is now in the past. These are not problems, but you may observe an increased number of log messages at participants and mediators indicating duplicate verdicts or failed lease requests. In future versions of Canton this initialization behavior may be updated to reduce onboarding time.

Disable the highly available mediator protocol

The HA mediator protocol should not have any significant impact on a domain with only a single mediator. However, the HA behavior can optionally be disabled entirely, either in the configuration of a mediator within a domain node or in a separate mediator node.

# This demonstrates disabling HA for both a mediator running in the domain node
# and a mediator running in a separate node. Practically, if you are disabling
# this functionality it makes sense to only have one mediator on the domain.
canton {
  domains {
    da {
      mediator {
        type = embedded
        high-availability.enabled = false
      }
    }
  }

  mediators {
    mediator1 {
      mediator {
        high-availability.enabled = false
      }
    }
  }
}

Sequencer

The database-based sequencer can be horizontally scaled and placed behind a load balancer to provide high availability and performance improvements.

Deploy multiple sequencer nodes for the Domain with the following configuration:

  • All sequencer nodes share the same database, so ensure that the storage configuration for each sequencer matches.

  • Each sequencer node must be configured with a unique index and with the total number of sequencer nodes that will potentially ever be operated in this topology (not all of these nodes need to be deployed initially, but the total is difficult to change later).

canton {
  sequencers {
    sequencer1 {
      sequencer {
        type = database
        high-availability = {
          # must be unique for every deployed sequencer instance
          node-index = 0
          # must be equal to or greater than the total number of instances _ever_ deployed;
          # it is advisable to set this larger than initially required to allow for future expansion
          total-node-count = 10
        }
      }
    }
  }
}

The domain node itself must then be configured to use these sequencer nodes by pointing it at these external services.

  domains {
    da {
      sequencer.type = external
    }
  }

Once configured, the domain must be bootstrapped with the new external sequencer using the bootstrap_domain operational process. These sequencers share a database, so use just a single instance for bootstrapping; the replicas will come online once the shared database has sufficient state for starting.
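
As a sketch of this step, the console call below assumes a domain reference da and a single external sequencer reference sequencer1; the exact name and signature of the bootstrap command can differ between Canton versions, so check the console help for your release.

      da.setup.bootstrap_domain(Seq(sequencer1))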

As these nodes are likely running in separate processes, you could run this command entirely externally using a remote administration configuration.

canton {
  remote-domains {
    da {
      # these details are provided to other nodes so they know how to connect to the sequencer if the domain node
      # has an embedded sequencer
      public-api {
        address = da-domain.local
        port = 1234
      }
      admin-api {
        address = da-domain.local
        port = 1235
      }
    }
  }

  remote-sequencers {
    sequencer1 {
      # these details are provided to other nodes so they know how to connect to the sequencer
      public-api {
        address = sequencer1.local
        port = 1234
      }
      # the server used for running administration commands
      admin-api {
        address = sequencer1.local
        port = 1235
      }
    }
  }
}

There are two methods available for exposing the horizontally scaled sequencer instances to participants.

External load balancer

Using a load balancer is recommended when you have an HTTP/2+gRPC capable load balancer available and cannot or do not want to expose details of the backend sequencers to clients. An advanced deployment could also support elastically scaling the number of sequencers available and dynamically reconfiguring the load balancer for this updated set.

An example HAProxy configuration for exposing gRPC services without TLS looks like:

frontend domain_frontend
  bind 1234 proto h2
  default_backend domain_backend

backend domain_backend
  balance roundrobin
  server sequencer1 sequencer1.local:1234 proto h2
  server sequencer2 sequencer2.local:1234 proto h2
  server sequencer3 sequencer3.local:1234 proto h2
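
Participants then connect to the domain through the load balancer endpoint rather than to an individual sequencer. For example (the hostname and port are placeholders for the frontend configured above):

  myparticipant.domains.connect("my_domain_alias", "http://da.loadbalancer.local:1234")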

Client-side load balancing

Using client-side load balancing is recommended where an external load-balancing service is unavailable (or lacks HTTP/2+gRPC support) and the set of sequencers is static and can be configured at the client.

To specify multiple sequencers, use the domains.connect_ha console command when registering/connecting to the domain:

myparticipant.domains.connect_ha(
  "my_domain_alias",
  "https://sequencer1.example.com",
  "https://sequencer2.example.com",
  "https://sequencer3.example.com"
)

See the documentation on the connect command using a domain connection config for how to add many sequencer URLs in combination with other domain connection options. The domain connection configuration can also be changed at runtime to add or replace configured sequencer connections. Note that the domain will have to be disconnected and reconnected at the participant for the updated configuration to take effect.

Participant

High availability of a participant node is achieved by running multiple participant node replicas that have access to a shared database.

Participant node replicas are configured in the Canton configuration file as individual participants, with two required changes for each participant node replica (a configuration sketch follows this list):

  • Using the same storage configuration to ensure access to the shared database. Only PostgreSQL- and Oracle-based storage is supported for HA.

  • Set replicated = true for each participant node replica.
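
A minimal sketch of one replica's configuration, following the two points above; the storage settings are placeholders and must be identical on every replica so that they all point at the same database.

canton {
  participants {
    participant1 {
      storage {
        type = postgres
        config {
          dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
          properties = {
            # placeholder connection details; identical on every replica
            serverName = "shared-db.local"
            portNumber = "5432"
            databaseName = "participant1"
            user = "canton"
            password = "supersafe"
          }
        }
      }
      replicated = true
    }
  }
}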

Permissions On Oracle

All replicas of a participant node must be configured with the same DB user name. The DB user must have the following permissions granted:

GRANT EXECUTE ON SYS.DBMS_LOCK TO $username
GRANT SELECT ON V_$MYSTAT TO $username
GRANT SELECT ON V_$LOCK TO $username

In the above commands, $username must be replaced with the configured DB user name. These permissions allow the DB user to request application-level locks on Oracle, as well as to query the state of locks and its own session information.

Manual Trigger of a Fail-over

Fail-over from the active to a passive replica is done automatically when the active replica has a failure, but one can also initiate a graceful fail-over with the following command:

        activeParticipantReplica.replication.set_passive()

The command succeeds if there is at least one other passive replica that can take over from the current active replica; otherwise the active replica remains active.