Repairing Nodes

Canton enables interoperability of distributed participants and domains. Particularly in distributed settings without trust assumptions, faults in one part of the system should ideally produce minimal irrecoverable damage to other parts. For example, if a domain is irreparably lost, the participants previously connected to that domain need to recover and be able to continue their workflows on a new domain.

This guide illustrates how to replace a lost domain with a new domain, providing business continuity to the affected participants.

Recovering from a Lost Domain

Suppose that a set of participants have been conducting workflows via a domain that runs into trouble. Suppose further that the domain has reached such a disastrous state that it is beyond repair, for example:

  • The domain has experienced data loss and is unable to be restored from backups or the backups are missing crucial recent history.
  • The domain data is found to be corrupt causing participants to lose trust in the domain as a mediator.

Next the participant operators each examine their local state, and upon coordinating conclude that their participants’ active contracts are “mostly the same”. This domain-recovery repair demo illustrates how the participants can

  • coordinate to agree on a set of contracts to use moving forward, serving as a new consistent state,
  • copy the agreed-upon set of contracts over to a brand new domain,
  • “fail over” to the new domain,
  • and finally continue running workflows on the new domain having recovered from the permanent loss of the old domain.

Repairing an Actual Topology

To follow along with this guide, ensure you have installed and unpacked the Canton release bundle and run the following commands from the “canton-X.Y.Z” directory to set up the initial topology.

export CANTON=`pwd`
export CONF="$CANTON/examples/03-advanced-configuration"
export REPAIR="$CANTON/examples/07-repair"
bin/canton -c $CONF -c $CONF/storage/h2.conf,$REPAIR/enable-repair-commands.conf \
  -c $REPAIR/participant1.conf,$REPAIR/participant2.conf,$REPAIR/domain-repair-lost.conf,$REPAIR/domain-repair-new.conf \
  --bootstrap $REPAIR/domain-repair-init.canton

To simplify the demonstration, this not only sets up the starting topology of

  • two participants, “participant1” and “participant2”, along with
  • one domain “lostDomain” that is about to become permanently unavailable leaving “participant1” and “participant2” unable to continue executing workflows,

but also already includes the ingredients needed to recover:

  • The setup includes “newDomain” that we will rely on as a replacement domain, and
  • we already enable the “enable-repair-commands” configuration needed to make available the “repair.change_domain” command.

In practice, you would add the new domain and enable the repair commands only once you actually need to recover from the loss of a domain.

We simulate “lostDomain” permanently disappearing by stopping the domain and never bringing it up again, which emphasizes that the participants no longer have access to any state from “lostDomain”. We also disconnect “participant1” and “participant2” from “lostDomain” to reflect that the participants have “given up” on the domain and recognize the need for a replacement to ensure business continuity. Disconnecting the participants “at the same time” is somewhat artificial, as in practice the participants might have lost connectivity to the domain at different times (more on reconciling contracts below).

      lostDomain.stop()
      Seq(participant1, participant2).foreach { p =>
        p.domains.disconnect(lostDomain.name)
        // Also let the participant know not to attempt to reconnect to lostDomain
        p.domains
          .config(lostDomain.name)
          .foreach(c => p.domains.modify(c.copy(manualConnect = true)))
      }
Figure: "lostDomain" has become unavailable and neither participant can connect anymore

Even though the domain is “the node that has broken”, recovering entails repairing the participants using the “newDomain” already set up. As of now, participant repairs have to be performed offline, which requires the participants being repaired to be disconnected from the new domain. However, we first connect temporarily to the new domain to let the identity state initialize, and disconnect only once the parties can be used on the new domain.

      Seq(participant1, participant2).foreach(_.domains.connect_local(newDomain))

      // Wait for identity state to appear before disconnecting again.
      utils.retry_until_true()(
        participant1.domains.active(newDomain.name) && participant2.domains.active(newDomain.name),
        "newDomain initialization timed out"
      )

      Seq(participant1, participant2).foreach(_.domains.disconnect(newDomain.name))

With the participants connected neither to “lostDomain” nor “newDomain”, each participant can

  • locally look up the contracts assigned to the lost domain using the “testing.pcs_search” command made available via the “features.enable-testing-commands” configuration,
  • and invoke “repair.change_domain” (enabled via the “features.enable-repair-commands” configuration) in order to “move” the contracts to the new domain.
      // Extract participant contracts from "lostDomain".
      val contracts1 = participant1.testing.pcs_search(lostDomain.name, filterTemplate = "^Iou")
      val contracts2 = participant2.testing.pcs_search(lostDomain.name, filterTemplate = "^Iou")

      // Ensure that shared contracts match.
      val Seq(sharedContracts1, sharedContracts2) = Seq(contracts1, contracts2).map(
        _.filter {
          case (_isActive, contract)
              if contract.metadata.stakeholders.contains(Alice.toLf) && contract.metadata.stakeholders.contains(
                Bob.toLf) =>
            true
          case _ => false
        }
      )
      utils.retry_until_true(timeout = java.time.Duration.ZERO)(
        sharedContracts1.equals(sharedContracts2),
        s"Contracts don't match: Participant1 and participant2 operators need to coordinate to agree on a common set of contracts"
      )

      // Finally change the contracts from "lostDomain" to "newDomain"
      participant1.repair.change_domain(contracts1.map(_._2.contractId), lostDomain.name, newDomain.name)
      participant2.repair.change_domain(contracts2.map(_._2.contractId),
                                        lostDomain.name,
                                        newDomain.name,
                                        skipArchived = false)

Note

The code snippet above includes a check that the contracts shared among the participants match (as determined by each participant: “sharedContracts1” by “participant1” and “sharedContracts2” by “participant2”). Should the contracts not match (as could happen if the participants had lost connectivity to the domain at different times), this check fails, prompting the participant operators to reach an agreement on the set of contracts. The agreed-upon set of active contracts may for example be

  • the intersection of the active contracts among the participants
  • or perhaps the union (for which the operators can use the “repair.add” command to create the contracts missing from one participant).
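
To make the two options above concrete, here is a minimal sketch in plain Scala, independent of the Canton console. The “ContractView” type and the “ReconcileSketch” helpers are hypothetical names introduced only for illustration and are not part of Canton; the sketch assumes that each operator can export the relevant contracts as a set of contract IDs with their stakeholders.

```scala
// Hypothetical, simplified stand-in for a contract as seen by one participant.
// This is NOT a Canton type; only the contract ID and the stakeholders are modeled.
final case class ContractView(contractId: String, stakeholders: Set[String])

object ReconcileSketch {

  // Restrict a participant's view to the contracts that have all of the
  // given parties as stakeholders (the "shared" contracts).
  def sharedBy(contracts: Set[ContractView], parties: Set[String]): Set[ContractView] =
    contracts.filter(c => parties.subsetOf(c.stakeholders))

  // Option 1: agree on the contracts that both participants know about.
  def agreedByIntersection(a: Set[ContractView], b: Set[ContractView]): Set[ContractView] =
    a.intersect(b)

  // Option 2: agree on every contract that at least one participant knows about.
  def agreedByUnion(a: Set[ContractView], b: Set[ContractView]): Set[ContractView] =
    a.union(b)

  // Contracts in the agreed-upon set that a participant does not hold locally.
  def missingFrom(local: Set[ContractView], agreed: Set[ContractView]): Set[ContractView] =
    agreed.diff(local)
}
```

If the operators settle on the union, “missingFrom” applied to each participant's local set would list the contracts that participant still lacks, which are the candidates for the “repair.add” command mentioned above. How the contract data is exported and exchanged between operators is left open here.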

Also note that both the repair commands and the “testing.pcs_search” command are currently “preview” features, and therefore their names may change.

Once each participant has associated the contracts with “newDomain”, we reconnect the participants and confirm that the new domain can execute workflows from where the lost domain left off.

    Seq(participant1, participant2).foreach(_.domains.reconnect(newDomain.name))

    // Look up a couple of contracts moved from lostDomain
    val Seq(iouAlice, iouBob) = Seq(participant1 -> Alice, participant2 -> Bob).map {
      case (participant, party) =>
        participant.ledger_api.acs.await[Iou.Iou](party, Iou.Iou, _.value.owner == party.toPrim)
    }

    // Ensure that we can create new contracts
    Seq(participant1 -> ((Alice, Bob)), participant2 -> ((Bob, Alice))).foreach {
      case (participant, (payer, owner)) =>
        participant.ledger_api.commands.submit_flat(
          payer,
          Seq(Iou.Iou(payer.toPrim, owner.toPrim, Iou.Amount(value = 200, currency = "USD"), List.empty).create.command)
        )
    }

    // Even better: Confirm that we can exercise choices on the moved contracts
    Seq(participant2 -> ((Bob, iouBob)), participant1 -> ((Alice, iouAlice))).foreach {
      case (participant, (owner, iou)) =>
        participant.ledger_api.commands
          .submit_flat(owner, Seq(iou.contractId.exerciseCall(owner.toPrim).command))
    }
Figure: "newDomain" has replaced "lostDomain"

In practice, we would now be in a position to remove the “lostDomain” from both participants and to disable the repair commands again to prevent accidental use of these “dangerously powerful” tools.

This guide has demonstrated how participants can recover from a domain that has been permanently lost or has become irreparably corrupted.