# Monitoring

## Logging

Canton uses Logback as its logging library. All Canton loggers derive from the logger com.digitalasset.canton. By default, Canton writes its log to the file log/canton.log at the INFO log-level.

Using the flag -v (or --verbose) will increase the log-level of Canton classes. Using -d (or --debug) will increase logging even further. Using -f (or --debug-full) will increase logging for the root logger instead of just for Canton classes.

The log-file can be truncated on startup using --truncate-log.

A custom logback configuration can be provided using JAVA_OPTS.

JAVA_OPTS="-Dlogback.configurationFile=./path-to-file.xml" ./bin/canton --config ...


A starting point for a custom logback.xml file would be

<?xml version="1.0" encoding="UTF-8"?>
<configuration scan="true" debug="false">

<!-- propagate logback changes to jul handlers -->
<contextListener class="ch.qos.logback.classic.jul.LevelChangePropagator">
<resetJUL>true</resetJUL>
</contextListener>

<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder>
<pattern>%highlight([%level] %logger{10} %replace(cid:%mdc{correlation-id} ){'cid: ', ''}: %msg) %n
</pattern>
</encoder>
<filter class="ch.qos.logback.classic.filter.ThresholdFilter">
<level>${STDOUT_LOG_LEVEL:-WARN}</level>
</filter>
</appender>
<!-- using rolling file appender to keep log files small and delete old logs -->
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>log/canton.log</file>
<append>true</append>
<!-- don't force a flush on each log-line (faster, but may miss logs when crashing) -->
<immediateFlush>false</immediateFlush>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<!-- hourly rollover and compress (gz), change pattern if you want different roll-overs -->
<fileNamePattern>log/canton.%d{yyyy-MM-dd-HH}.log.gz</fileNamePattern>
<!-- keep max 12 archived log files -->
<maxHistory>12</maxHistory>
</rollingPolicy>
<encoder>
<!-- include correlation-id in log-line -->
<pattern>%date [%thread] %-5level %logger{35} %replace(cid:%mdc{correlation-id} ){'cid: ', ''}- %msg%n
</pattern>
</encoder>
</appender>

<!--
Include the Canton rewrite appender to rewrite certain log levels of certain messages

The Canton rewrite appender expects two appender-ref: FILE and STDOUT and will create
a new appender named REWRITE_LOG_LEVEL.
This appender allows us to change the log-level of some log messages created by certain
libraries such that we don't get messy and unnecessary warnings or errors.
-->
<include resource="rewrite-appender.xml"/>

<root level="INFO">
<appender-ref ref="REWRITE_LOG_LEVEL"/>
</root>

</configuration>


## Tracing

For further debuggability, Canton provides a correlation-id that allows the processing of a request to be traced through the system. The correlation-id is exposed to Logback through the mapped diagnostic context (MDC) and can be included in the Logback output pattern using %mdc{correlation-id}.

The correlation-id must be enabled by setting the configuration option canton.monitoring.trace-context-propagation = enabled.
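To follow a single request through the log file, log lines can be grouped by their correlation-id. The following is a minimal Python sketch, assuming the FILE appender pattern from the sample logback.xml in the Logging section and Logback's default %date format. The function names and the sample log lines in the usage note are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines written with the FILE appender pattern of the sample logback.xml:
#   %date [%thread] %-5level %logger{35} cid:<correlation-id> - %msg
# The "cid:<id> " part is absent when no correlation-id is set.
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]*)\] "
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+) "
    r"(?:cid:(?P<cid>\S+) )?"
    r"- (?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def group_by_correlation_id(lines):
    """Map each correlation-id to the messages logged under it, in file order."""
    groups = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry and entry["cid"]:
            groups[entry["cid"]].append(entry["msg"])
    return dict(groups)
```

Calling `group_by_correlation_id(open("log/canton.log"))` would then return one list of messages per request. Continuation lines such as stack traces do not match the pattern and are skipped.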

## Health Check

The Canton process can optionally expose an HTTP endpoint indicating whether the process believes it is healthy. This is intended for use in uptime checks and liveness probes. If enabled, the /health endpoint responds to an HTTP GET request with status code 200 if healthy, or 500 if unhealthy (with a plain-text description of why it is unhealthy).

To enable this health endpoint, add a monitoring section to the Canton configuration. As this health check is for the whole process, it is added directly to the Canton configuration rather than to a specific node.

canton {
monitoring.health {
server {
port = 7000
}

check {
type = ping
participant = participant1
interval = 30s
}
}
}


This health check will have participant1 “ledger ping” itself every 30 seconds. The process will be considered healthy if the ping is successful.
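An uptime check can probe this endpoint with a plain HTTP GET. The following is a minimal Python sketch, assuming the endpoint is reachable on localhost at port 7000 as configured above. The function name and its return shape are illustrative and not part of Canton.

```python
import urllib.error
import urllib.request


def check_health(host="localhost", port=7000, timeout=5.0):
    """Probe /health and return (healthy, description)."""
    url = f"http://{host}:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return response.status == 200, body
    except urllib.error.HTTPError as err:
        # A 500 response ends up here; its body explains why the process is unhealthy.
        return False, err.read().decode("utf-8", "replace")
    except (urllib.error.URLError, OSError) as err:
        # Connection refused, timeout, DNS failure, and similar transport problems.
        return False, f"unreachable: {err}"
```

The sketch treats an unreachable endpoint as unhealthy. A liveness probe might prefer to distinguish between the two cases, for example to avoid restarting a process that is still starting up.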

## Metrics

Canton uses Dropwizard’s metrics library to report metrics. The library supports a variety of reporting backends. JMX-based reporting (for testing purposes only) can be enabled using

canton.monitoring.metrics.reporter.type = jmx


Alternatively, metrics can be written to a file

canton.monitoring.metrics.reporter = {
type = csv
file = "output.csv"
interval = 5s // default
}


or reported via Graphite (to Grafana) using

canton.monitoring.metrics.reporter = {
type = graphite
address = "localhost" // default
port = 2003
interval = 30s // default
prefix.type = hostname // default
}
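When debugging a Graphite setup, it can help to point the reporter at a small local listener and inspect what arrives. The sketch below assumes the reporter speaks Graphite's plaintext protocol, that is, one `<path> <value> <timestamp>` line per data point, which Graphite accepts on port 2003 by default. The function and class names are illustrative and not taken from Canton.

```python
import socketserver


def parse_graphite_line(line):
    """Split one plaintext-protocol line into (path, value, unix_timestamp)."""
    path, value, timestamp = line.split()
    return path, float(value), int(timestamp)


class PrintingHandler(socketserver.StreamRequestHandler):
    """Print every data point received on one connection."""

    def handle(self):
        for raw in self.rfile:
            text = raw.decode("utf-8", "replace").strip()
            if not text:
                continue
            try:
                print(parse_graphite_line(text))
            except ValueError:
                # Not a three-field plaintext line; show it unparsed.
                print("unparsed:", text)


def listen(host="localhost", port=2003):
    """Accept connections from a reporter until interrupted."""
    with socketserver.TCPServer((host, port), PrintingHandler) as server:
        server.serve_forever()
```

Running `listen()` in place of a real Graphite instance prints each reported data point as it arrives, which makes it easy to see the effective metric names including the configured prefix.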


In addition to Canton metrics, the process reports DAML metrics (of the Ledger API server). Optionally, JVM metrics can be included using

canton.monitoring.metrics.report-jvm-metrics = yes // default no


### Participant Metrics

canton.<domain>.conflict-detection.sequencer-counter-queue

(Gauge): Size of conflict detection sequencer counter queue

The task scheduler will work off tasks according to the timestamp order, scheduling the tasks whenever a new timestamp has been observed. This metric exposes the number of un-processed sequencer messages that will trigger a timestamp advancement.

(Gauge): Size of conflict detection task queue

The task scheduler will schedule tasks to run at a given timestamp. This metric exposes the number of tasks that are waiting in the task queue for the right time to pass. A huge number does not necessarily indicate a bottleneck; it could also mean that a huge number of tasks have not yet arrived at their execution time.

canton.<domain>.request-tracker.sequencer-counter-queue

(Gauge): Size of record order publisher sequencer counter queue

Same as for conflict-detection, but measuring the sequencer counter queues for the publishing to the ledger api server according to record time.

(Gauge): Size of record order publisher task queue

The task scheduler will schedule tasks to run at a given timestamp. This metric exposes the number of tasks that are waiting in the task queue for the right time to pass.

canton.<domain>.sequencer-client.application-handle

(Timer): Timer monitoring time and rate of sequentially handling the event application logic

All events are received sequentially. This handler records the rate and time it takes the application (participant or domain) to handle the events.

canton.<domain>.sequencer-client.delay

(Gauge): The delay on the event processing

Every message received from the sequencer carries a timestamp. The delay is the difference between the sequencing time and the processing time. The difference can result either from clock skew or from the system being overloaded and unable to keep up with processing events.

canton.<domain>.sequencer-client.event-handle

(Timer): Timer monitoring time and rate of entire event handling

Most of the event handling cost should come from the application-handle. This timer measures the full time (which should be just marginally more than the application-handle).

(Gauge): The load on the event subscription

The event subscription processor is a sequential process. The load is a factor between 0 and 1 describing how much of an existing interval has been spent in the event handler.
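As a worked illustration of the load gauge, the factor can be read as busy time divided by the length of the measurement interval. The sketch below is one interpretation of that description, not Canton's actual implementation. The function name and the sample numbers are hypothetical.

```python
def load_factor(busy_seconds, interval_seconds):
    """Fraction of an interval spent in the handler, clamped to [0, 1]."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return min(1.0, max(0.0, busy_seconds / interval_seconds))


# A handler that was busy for 2.5 s of a 5 s interval has a load of 0.5.
print(load_factor(2.5, 5.0))
```

Because the processor is sequential, a value that stays close to 1 suggests that the handler is saturated and events are likely to queue up.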

canton.commitments.compute

(Timer): Time spent on commitment computations.

Participant nodes compute bilateral commitments at regular intervals. This metric exposes the time spent on each computation. If the time to compute the commitments starts to exceed the commitment interval, this likely indicates a problem.

canton.db-storage.<storage>

(Timer): Timer monitoring duration and rate of accessing the given storage

Covers both reads from and writes to the storage.

(Gauge): The load on the given storage

The load is a factor between 0 and 1 describing how much of an existing interval has been spent reading from or writing to the storage.

(Counter): Number of failed writes to the multi-domain event log

Failed writes to the multi-domain event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

(Counter): Number of failed writes to the event log

Failed writes to the single dimension event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

canton.prune

(Timer): Duration of prune operations.

This timer exposes the duration of pruning requests from the Canton portion of the ledger.

(Meter): Number of updates published through the read service to the indexer

When an update is published through the read service, it has already been committed to the ledger. The indexer will subsequently store the update in a form that allows for querying the ledger efficiently.

### Domain Metrics

canton.db-storage.<storage>

(Timer): Timer monitoring duration and rate of accessing the given storage

Covers both reads from and writes to the storage.

(Gauge): The load on the given storage

The load is a factor between 0 and 1 describing how much of an existing interval has been spent reading from or writing to the storage.

(Counter): Number of failed writes to the multi-domain event log

Failed writes to the multi-domain event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

(Counter): Number of failed writes to the event log

Failed writes to the single dimension event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

canton.mediator.outstanding-requests

(Gauge): Number of currently outstanding requests

This metric provides the number of currently open requests registered with the mediator.

canton.mediator.requests

(Meter): Total number of processed requests

This metric provides the total number of requests processed since the system was started.

canton.mediator.sequencer-client.application-handle

(Timer): Timer monitoring time and rate of sequentially handling the event application logic

All events are received sequentially. This handler records the rate and time it takes the application (participant or domain) to handle the events.

canton.mediator.sequencer-client.delay

(Gauge): The delay on the event processing

Every message received from the sequencer carries a timestamp. The delay is the difference between the sequencing time and the processing time. The difference can result either from clock skew or from the system being overloaded and unable to keep up with processing events.

canton.mediator.sequencer-client.event-handle

(Timer): Timer monitoring time and rate of entire event handling

Most of the event handling cost should come from the application-handle. This timer measures the full time (which should be just marginally more than the application-handle).

(Gauge): The load on the event subscription

The event subscription processor is a sequential process. The load is a factor between 0 and 1 describing how much of an existing interval has been spent in the event handler.

canton.sequencer.db-storage.<storage>

(Timer): Timer monitoring duration and rate of accessing the given storage

Covers both reads from and writes to the storage.

(Gauge): The load on the given storage

The load is a factor between 0 and 1 describing how much of an existing interval has been spent reading from or writing to the storage.

(Counter): Number of failed writes to the multi-domain event log

Failed writes to the multi-domain event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

(Counter): Number of failed writes to the event log

Failed writes to the single dimension event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

canton.sequencer.processed

(Meter): Number of messages processed by the sequencer

This metric measures the number of successfully validated messages processed by the sequencer since the start of this process.

canton.sequencer.processed-bytes

(Meter): Number of message bytes processed by the sequencer

This metric measures the total number of message bytes processed by the sequencer.

canton.sequencer.subscriptions

(Gauge): Number of active sequencer subscriptions

This metric indicates the number of subscriptions that are currently open and actively served at the sequencer.