# Monitoring

## Logging

Canton uses Logback as its logging library. All Canton loggers derive from the root logger com.digitalasset.canton. By default, Canton writes to the file log/canton.log at log level INFO and logs WARN and ERROR messages to stdout.

How Canton produces log files can be configured extensively on the command line using the following options:

• -v (or --verbose) is a short option to set the Canton log level to DEBUG.

• --log-level-root=<level> configures the log-level of the root logger, including external libraries.

• --log-level-canton=<level> configures the log-level of the Canton logger.

• --log-level-stdout=<level> configures the log-level of stdout.

• --log-file-name=log/canton.log configures the location of the log file.

• --log-file-appender=flat|rolling|off configures if and how logging to a file should be done. The rolling appender will roll the files according to the defined date-time pattern.

• --log-file-rolling-history=12 configures the number of historical files to keep when using the rolling appender.

• --log-file-rolling-pattern=YYYY-mm-dd configures the rolling file suffix (and therefore the frequency) of how files should be rolled.

• --log-truncate configures whether the log file should be truncated on startup.

• --log-profile=container provides a default set of logging settings for a particular setup. Right now, only the container profile is supported; it logs to STDOUT and turns off flat file logging to avoid storage leaks due to log files within a container.

Please note that if you use --log-profile, the order of the command line arguments matters. The profile settings can be overridden on the command line by placing adjustments after the profile has been selected.
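The ordering rule can be pictured with a small model. The following Python sketch is not Canton code: the function `resolve_logging_options`, the `PROFILES` table, and the starting value for the appender are hypothetical, and it assumes that arguments are applied strictly left to right, so whatever comes later wins.

```python
from typing import Dict, List

# Assumption: the only effect modelled for the container profile is
# switching file logging off, as described in the option list above.
PROFILES: Dict[str, Dict[str, str]] = {
    "container": {"log-file-appender": "off"},
}


def resolve_logging_options(args: List[str]) -> Dict[str, str]:
    """Apply --key=value arguments left to right; later values win."""
    # Hypothetical starting value, chosen only for the illustration.
    settings: Dict[str, str] = {"log-file-appender": "flat"}
    for arg in args:
        key, _, value = arg.lstrip("-").partition("=")
        if key == "log-profile":
            # Selecting a profile overwrites everything the profile defines.
            settings.update(PROFILES[value])
        else:
            settings[key] = value
    return settings
```

In this model, `--log-profile=container --log-file-appender=rolling` ends with the rolling appender, while the reverse order ends with file logging turned off, because the profile is applied last.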

Canton supports the normal log4j logging levels: TRACE, DEBUG, INFO, WARN, ERROR.

For further customization, a custom Logback configuration can be provided using JAVA_OPTS:

    JAVA_OPTS="-Dlogback.configurationFile=./path-to-file.xml" ./bin/canton --config ...


If you use a custom Logback configuration file, the command line arguments for logging have no effect, except --log-level-canton and --log-level-root, which can still be used to adjust the log levels of the Canton and root loggers.

## Tracing

For further debuggability, Canton provides a trace-id that makes it possible to trace the processing of requests through the system. The trace-id is exposed to Logback through the mapped diagnostic context (MDC) and can be included in the Logback output pattern using %mdc{trace-id}.

The trace-id needs to be enabled by setting the canton.monitoring.tracing.propagation = enabled configuration option.

It is also possible to configure the service to which traces and spans are reported. Currently, Jaeger and Zipkin are supported. For example, Jaeger reporting can be configured as follows:

    monitoring.tracing.tracer.exporter {
      type = jaeger
      address = ... // default: "localhost"
      port = ... // default: 14250
    }


You can try this out locally by running Jaeger in a Docker container as follows:

    docker run --rm -it --name jaeger \
      -p 16686:16686 \
      -p 14250:14250 \
      jaegertracing/all-in-one:1.22.0


### Sampling

You can also change how often spans are sampled, that is, reported to the configured exporter. By default, every span is reported (monitoring.tracing.tracer.sampler.type = always-on). The sampler can also be set to never report (monitoring.tracing.tracer.sampler.type = always-off), although this is rarely useful. Alternatively, a fixed fraction of spans can be reported:

    monitoring.tracing.tracer.sampler = {
      type = trace-id-ratio
      ratio = 0.5
    }


One further sampling property can optionally be changed. By default, parent-based sampling is on (monitoring.tracing.tracer.sampler.parent-based = true), which means that a span is sampled if and only if its parent is sampled; the root span follows the configured sampling strategy. As a result, traces are never incomplete: a trace is either sampled in full or not at all. If this property is changed, every span follows the configured sampling strategy regardless of whether its parent is sampled.
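The interaction between the ratio sampler and the parent-based setting can be sketched with a small model. This Python snippet is not Canton code: the names `ratio_sample` and `should_sample` are hypothetical, and comparing the trace id against a threshold is an assumption modelled on how ratio samplers are commonly implemented.

```python
from typing import Optional

# Assumption: the decision is made on the trace id modulo 2**64.
MAX_TRACE_ID = 2**64


def ratio_sample(trace_id: int, ratio: float) -> bool:
    """Deterministically select roughly a fraction `ratio` of trace ids."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return (trace_id % MAX_TRACE_ID) < ratio * MAX_TRACE_ID


def should_sample(trace_id: int, ratio: float,
                  parent_sampled: Optional[bool],
                  parent_based: bool = True) -> bool:
    """Decide whether a span is reported.

    `parent_sampled` is None for a root span, otherwise the parent's decision.
    """
    if parent_based and parent_sampled is not None:
        # Child spans inherit the parent's decision, so traces stay complete.
        return parent_sampled
    # Root spans, or all spans when parent-based sampling is off,
    # follow the configured strategy.
    return ratio_sample(trace_id, ratio)
```

With `parent_based=False`, a child span can be sampled even though its parent was not, which is how incomplete traces arise.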

## Health Check

The Canton process can optionally expose an HTTP endpoint indicating whether the process believes it is healthy. This is intended for use in uptime checks and liveness probes. If enabled, the /health endpoint responds to an HTTP GET request with status code 200 if healthy, or 500 if unhealthy (with a plain-text description of why it is unhealthy).

To enable this health endpoint, add a monitoring section to the Canton configuration. As this health check is for the whole process, it is added directly to the Canton configuration rather than to a specific node.

    canton {
      monitoring.health {
        server {
          port = 7000
        }

        check {
          type = ping
          participant = participant1
          interval = 30s
        }
      }
    }


This health check will have participant1 “ledger ping” itself every 30 seconds. The process will be considered healthy if the ping is successful.
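A probe for this endpoint can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not an official client: the function name `check_health` and its `opener` parameter are hypothetical, and the parameter exists only so that the function can be exercised without a running node.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple


def check_health(base_url: str, timeout: float = 5.0,
                 opener: Callable = urllib.request.urlopen) -> Tuple[bool, str]:
    """Return (healthy, detail) for the /health endpoint below `base_url`."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return response.status == 200, body
    except urllib.error.HTTPError as err:
        # Non-2xx answers such as 500 raise HTTPError; the body carries
        # the plain-text description of why the process is unhealthy.
        return False, err.read().decode("utf-8", "replace")
    except (urllib.error.URLError, OSError) as err:
        # Connection refused, timeouts, DNS failures and similar.
        return False, f"unreachable: {err}"
```

With the configuration above, the probe could be called as `check_health("http://localhost:7000")`, assuming the process runs on the local machine.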

## Metrics

Canton uses Dropwizard’s metrics library to report metrics. The metrics library supports a variety of reporting backends. JMX-based reporting (only for testing purposes) can be enabled using:

    canton.monitoring.metrics.reporters = [{ type = jmx }]


Additionally, metrics can be written to a file:

    canton.monitoring.metrics.reporters = [{
      type = jmx
    }, {
      type = csv
      directory = "metrics"
      interval = 5s // default
      filters = [{
        contains = "canton"
      }]
    }]


or reported via Graphite (to Grafana) using:

    canton.monitoring.metrics.reporters = [{
      type = graphite
      port = 2003
      prefix.type = hostname // default
      interval = 30s // default
      filters = [{
        contains = "canton"
      }]
    }]


When using the graphite or csv reporters, Canton will periodically evaluate all metrics matching the given filters. It is therefore advisable to filter for only those metrics that are relevant to you.

In addition to Canton metrics, the process can also report DAML metrics (of the Ledger API server). Optionally, JVM metrics can be included using:

    canton.monitoring.metrics.report-jvm-metrics = yes // default no


### Participant Metrics

canton.<domain>.conflict-detection.sequencer-counter-queue

(Gauge): Size of conflict detection sequencer counter queue

The task scheduler will work off tasks according to the timestamp order, scheduling the tasks whenever a new timestamp has been observed. This metric exposes the number of un-processed sequencer messages that will trigger a timestamp advancement.

(Gauge): Size of conflict detection task queue

The task scheduler will schedule tasks to run at a given timestamp. This metric exposes the number of tasks that are waiting in the task queue for the right time to pass. A huge number does not necessarily indicate a bottleneck; it could also mean that a huge number of tasks have not yet arrived at their execution time.
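The relationship between observed timestamps and the task queue size can be pictured with a toy scheduler. This Python sketch is an illustration only, not Canton's task scheduler: the class `TimestampTaskQueue` and its methods are hypothetical, and timestamps are modelled as plain integers.

```python
import heapq
from typing import Callable, List, Tuple


class TimestampTaskQueue:
    """Toy model: tasks wait until their timestamp has been observed."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Callable[[], None]]] = []
        self._seq = 0  # tie-breaker: keeps insertion order for equal timestamps

    def schedule(self, timestamp: int, task: Callable[[], None]) -> None:
        heapq.heappush(self._heap, (timestamp, self._seq, task))
        self._seq += 1

    def size(self) -> int:
        """The value a task-queue gauge would report in this model."""
        return len(self._heap)

    def advance(self, observed: int) -> int:
        """Run all tasks due at or before `observed`, in timestamp order."""
        ran = 0
        while self._heap and self._heap[0][0] <= observed:
            _, _, task = heapq.heappop(self._heap)
            task()
            ran += 1
        return ran
```

A large `size()` in this model only says that many tasks have a timestamp that has not yet been observed, which matches the remark that a large queue is not necessarily a bottleneck.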

canton.<domain>.protocol-messages.confirmation-request-creation

(Timer): Time to create a confirmation request

The time that the transaction protocol processor needs to create a confirmation request.

canton.<domain>.protocol-messages.confirmation-request-size

(Histogram): Confirmation request size

Records the histogram of the sizes of (transaction) confirmation requests.

canton.<domain>.protocol-messages.transaction-message-receipt

(Timer): Time to parse a transaction message

The time that the transaction protocol processor needs to parse and decrypt an incoming confirmation request.

canton.<domain>.request-tracker.sequencer-counter-queue

(Gauge): Size of record order publisher sequencer counter queue

Same as for conflict-detection, but measuring the sequencer counter queues for the publishing to the ledger api server according to record time.

(Gauge): Size of record order publisher task queue

The task scheduler will schedule tasks to run at a given timestamp. This metric exposes the number of tasks that are waiting in the task queue for the right time to pass.

canton.<domain>.sequencer-client.application-handle

(Timer): Timer monitoring time and rate of sequentially handling the event application logic

All events are received sequentially. This handler records the rate and the time it takes the application (participant or domain) to handle the events.

canton.<domain>.sequencer-client.delay

(Gauge): The delay on the event processing

Every message received from the sequencer carries a timestamp. The delay provides the difference between the sequencing time and the processing time. The difference can be a result of either clock-skew or if the system is overloaded and doesn't manage to keep up with processing events.

canton.<domain>.sequencer-client.event-handle

(Timer): Timer monitoring time and rate of entire event handling

Most event handling cost should come from the application-handle. This timer measures the full time, which should be only marginally more than the application handle.

(Gauge): The load on the event subscription

The event subscription processor is a sequential process. The load is a factor between 0 and 1 describing how much of an existing interval has been spent in the event handler.
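As an illustration of what such a load factor expresses, the following Python sketch computes the fraction of a measurement interval spent inside the handler. It is not how Canton computes the gauge: the function `load_factor` and the representation of handler invocations as `(begin, end)` pairs are assumptions made for the example.

```python
from typing import Iterable, Tuple


def load_factor(busy: Iterable[Tuple[float, float]],
                start: float, end: float) -> float:
    """Fraction of the interval [start, end] covered by busy periods.

    `busy` holds non-overlapping (begin, end) periods, as produced by a
    sequential handler; periods are clipped to the interval.
    """
    if end <= start:
        raise ValueError("interval must have positive length")
    spent = 0.0
    for begin, finish in busy:
        spent += max(0.0, min(finish, end) - max(begin, start))
    return min(1.0, spent / (end - start))
```

A value close to 1 means the sequential handler was busy for almost the whole interval and has little headroom left.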

canton.<domain>.sequencer-client.submissions.dropped

(Counter): Count of send requests that did not cause an event to be sequenced

Counts send requests for which no corresponding event was sequenced by the supplied max-sequencing-time. There are several possible reasons: the request may have been lost before reaching the sequencer, the sequencer may be at capacity so that the max-sequencing-time was exceeded by the time the request was processed, or the supplied max-sequencing-time may simply be too small for the sequencer to sequence the request.

canton.<domain>.sequencer-client.submissions.in-flight

(Gauge): Number of sequencer send requests we have that are waiting for an outcome or timeout

Incremented on every successful send to the sequencer. Decremented when the event or an error is sequenced, or when the max-sequencing-time has elapsed.
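How the in-flight gauge and the dropped counter relate can be sketched with a toy tracker. This Python model is not Canton code: the class `SendTracker`, its method names, and the use of plain numbers as timestamps are all hypothetical.

```python
from typing import Dict


class SendTracker:
    """Toy model of the in-flight gauge and the dropped counter."""

    def __init__(self) -> None:
        # message id -> max-sequencing-time of the pending send
        self._pending: Dict[str, float] = {}
        self.dropped = 0

    @property
    def in_flight(self) -> int:
        return len(self._pending)

    def on_send_accepted(self, message_id: str, max_sequencing_time: float) -> None:
        # Incremented on every successful send to the sequencer.
        self._pending[message_id] = max_sequencing_time

    def on_sequenced(self, message_id: str) -> None:
        # An event or an error was sequenced for this send.
        self._pending.pop(message_id, None)

    def on_time(self, now: float) -> None:
        # Sends whose max-sequencing-time has elapsed count as dropped.
        expired = [m for m, deadline in self._pending.items() if deadline < now]
        for message_id in expired:
            del self._pending[message_id]
        self.dropped += len(expired)
```

In this model, a steadily growing `dropped` value with a low `in_flight` value would suggest that sends regularly time out rather than pile up.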

Counter that is incremented if a send request receives an overloaded response from the sequencer.

canton.<domain>.sequencer-client.submissions.sends

(Timer): Rate and timings of send requests to the sequencer

Provides a rate and time of how long it takes for send requests to be accepted by the sequencer. Note that this is just for the request to be made and not for the requested event to actually be sequenced.

canton.<domain>.sequencer-client.submissions.sequencing

(Timer): Rate and timings of sequencing requests

This timer is started when a submission is made to the sequencer and then completed when a corresponding event is witnessed from the sequencer, so will encompass the entire duration for the sequencer to sequence the request. If the request does not result in an event no timing will be recorded.

canton.commitments.compute

(Timer): Time spent on commitment computations.

Participant nodes compute bilateral commitments at regular intervals. This metric exposes the time spent on each computation. If the time to compute the commitments starts to exceed the commitment interval, this likely indicates a problem.

canton.db-storage.<storage>

(Timer): Timer monitoring duration and rate of accessing the given storage

Covers both reads from and writes to the storage.

(Gauge): The load on the given storage

The load is a factor between 0 and 1 describing how much of an existing interval has been spent reading from or writing to the storage.

(Counter): Number of failed writes to the multi-domain event log

Failed writes to the multi-domain event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

(Counter): Number of failed writes to the event log

Failed writes to the single-dimension event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

canton.db-storage.general.executor.queued

(Gauge): Number of database access tasks waiting in queue

Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won't show up in this metric.

canton.db-storage.general.executor.running

(Gauge): Number of database access tasks currently running

Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.

canton.db-storage.general.executor.waittime

(Timer): Scheduling time metric for database tasks

Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.

canton.db-storage.write.executor.queued

(Gauge): Number of database access tasks waiting in queue

Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won't show up in this metric.

canton.db-storage.write.executor.running

(Gauge): Number of database access tasks currently running

Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.

canton.db-storage.write.executor.waittime

(Timer): Scheduling time metric for database tasks

Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.

canton.prune

(Timer): Duration of prune operations.

This timer exposes the duration of pruning requests from the Canton portion of the ledger.

(Meter): Number of updates published through the read service to the indexer

When an update is published through the read service, it has already been committed to the ledger. The indexer will subsequently store the update in a form that allows for querying the ledger efficiently.

### Domain Metrics

canton.db-storage.<storage>

(Timer): Timer monitoring duration and rate of accessing the given storage

Covers both reads from and writes to the storage.

(Gauge): The load on the given storage

The load is a factor between 0 and 1 describing how much of an existing interval has been spent reading from or writing to the storage.

(Counter): Number of failed writes to the multi-domain event log

Failed writes to the multi-domain event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

(Counter): Number of failed writes to the event log

Failed writes to the single-dimension event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

canton.db-storage.general.executor.queued

(Gauge): Number of database access tasks waiting in queue

Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won't show up in this metric.

canton.db-storage.general.executor.running

(Gauge): Number of database access tasks currently running

Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.

canton.db-storage.general.executor.waittime

(Timer): Scheduling time metric for database tasks

Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.

canton.db-storage.write.executor.queued

(Gauge): Number of database access tasks waiting in queue

Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won't show up in this metric.

canton.db-storage.write.executor.running

(Gauge): Number of database access tasks currently running

Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.

canton.db-storage.write.executor.waittime

(Timer): Scheduling time metric for database tasks

Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.

canton.identity-manager.sequencer-client.application-handle

(Timer): Timer monitoring time and rate of sequentially handling the event application logic

All events are received sequentially. This handler records the rate and the time it takes the application (participant or domain) to handle the events.

canton.identity-manager.sequencer-client.delay

(Gauge): The delay on the event processing

Every message received from the sequencer carries a timestamp. The delay provides the difference between the sequencing time and the processing time. The difference can be a result of either clock-skew or if the system is overloaded and doesn't manage to keep up with processing events.

canton.identity-manager.sequencer-client.event-handle

(Timer): Timer monitoring time and rate of entire event handling

Most event handling cost should come from the application-handle. This timer measures the full time, which should be only marginally more than the application handle.

(Gauge): The load on the event subscription

The event subscription processor is a sequential process. The load is a factor between 0 and 1 describing how much of an existing interval has been spent in the event handler.

canton.identity-manager.sequencer-client.submissions.dropped

(Counter): Count of send requests that did not cause an event to be sequenced

Counts send requests for which no corresponding event was sequenced by the supplied max-sequencing-time. There are several possible reasons: the request may have been lost before reaching the sequencer, the sequencer may be at capacity so that the max-sequencing-time was exceeded by the time the request was processed, or the supplied max-sequencing-time may simply be too small for the sequencer to sequence the request.

canton.identity-manager.sequencer-client.submissions.in-flight

(Gauge): Number of sequencer send requests we have that are waiting for an outcome or timeout

Incremented on every successful send to the sequencer. Decremented when the event or an error is sequenced, or when the max-sequencing-time has elapsed.

Counter that is incremented if a send request receives an overloaded response from the sequencer.

canton.identity-manager.sequencer-client.submissions.sends

(Timer): Rate and timings of send requests to the sequencer

Provides a rate and time of how long it takes for send requests to be accepted by the sequencer. Note that this is just for the request to be made and not for the requested event to actually be sequenced.

canton.identity-manager.sequencer-client.submissions.sequencing

(Timer): Rate and timings of sequencing requests

This timer is started when a submission is made to the sequencer and then completed when a corresponding event is witnessed from the sequencer, so will encompass the entire duration for the sequencer to sequence the request. If the request does not result in an event no timing will be recorded.

canton.mediator.outstanding-requests

(Gauge): Number of currently outstanding requests

This metric provides the number of currently open requests registered with the mediator.

canton.mediator.requests

(Meter): Total number of processed requests

This metric provides the total number of requests processed since the system was started.

canton.mediator.sequencer-client.application-handle

(Timer): Timer monitoring time and rate of sequentially handling the event application logic

All events are received sequentially. This handler records the rate and the time it takes the application (participant or domain) to handle the events.

canton.mediator.sequencer-client.delay

(Gauge): The delay on the event processing

Every message received from the sequencer carries a timestamp. The delay provides the difference between the sequencing time and the processing time. The difference can be a result of either clock-skew or if the system is overloaded and doesn't manage to keep up with processing events.

canton.mediator.sequencer-client.event-handle

(Timer): Timer monitoring time and rate of entire event handling

Most event handling cost should come from the application-handle. This timer measures the full time, which should be only marginally more than the application handle.

(Gauge): The load on the event subscription

The event subscription processor is a sequential process. The load is a factor between 0 and 1 describing how much of an existing interval has been spent in the event handler.

canton.mediator.sequencer-client.submissions.dropped

(Counter): Count of send requests that did not cause an event to be sequenced

Counts send requests for which no corresponding event was sequenced by the supplied max-sequencing-time. There are several possible reasons: the request may have been lost before reaching the sequencer, the sequencer may be at capacity so that the max-sequencing-time was exceeded by the time the request was processed, or the supplied max-sequencing-time may simply be too small for the sequencer to sequence the request.

canton.mediator.sequencer-client.submissions.in-flight

(Gauge): Number of sequencer send requests we have that are waiting for an outcome or timeout

Incremented on every successful send to the sequencer. Decremented when the event or an error is sequenced, or when the max-sequencing-time has elapsed.

Counter that is incremented if a send request receives an overloaded response from the sequencer.

canton.mediator.sequencer-client.submissions.sends

(Timer): Rate and timings of send requests to the sequencer

Provides a rate and time of how long it takes for send requests to be accepted by the sequencer. Note that this is just for the request to be made and not for the requested event to actually be sequenced.

canton.mediator.sequencer-client.submissions.sequencing

(Timer): Rate and timings of sequencing requests

This timer is started when a submission is made to the sequencer and then completed when a corresponding event is witnessed from the sequencer, so will encompass the entire duration for the sequencer to sequence the request. If the request does not result in an event no timing will be recorded.

canton.sequencer.db-storage.<storage>

(Timer): Timer monitoring duration and rate of accessing the given storage

Covers both reads from and writes to the storage.

(Gauge): The load on the given storage

The load is a factor between 0 and 1 describing how much of an existing interval has been spent reading from or writing to the storage.

(Counter): Number of failed writes to the multi-domain event log

Failed writes to the multi-domain event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

(Counter): Number of failed writes to the event log

Failed writes to the single-dimension event log indicate an issue requiring user intervention. In the case of domain event logs, the corresponding domain no longer emits any subsequent events until domain recovery is initiated (e.g. by disconnecting and reconnecting the participant from the domain). In the case of the participant event log, an operation might need to be reissued. If this counter is larger than zero, check the Canton log for error details.

canton.sequencer.db-storage.general.executor.queued

(Gauge): Number of database access tasks waiting in queue

Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won't show up in this metric.

canton.sequencer.db-storage.general.executor.running

(Gauge): Number of database access tasks currently running

Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.

canton.sequencer.db-storage.general.executor.waittime

(Timer): Scheduling time metric for database tasks

Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.

canton.sequencer.db-storage.write.executor.queued

(Gauge): Number of database access tasks waiting in queue

Database access tasks get scheduled in this queue and get executed using one of the existing asynchronous sessions. A large queue indicates that the database connection is not able to deal with the large number of requests. Note that the queue has a maximum size. Tasks that do not fit into the queue will be retried, but won't show up in this metric.

canton.sequencer.db-storage.write.executor.running

(Gauge): Number of database access tasks currently running

Database access tasks run on an async executor. This metric shows the current number of tasks running in parallel.

canton.sequencer.db-storage.write.executor.waittime

(Timer): Scheduling time metric for database tasks

Every database query is scheduled using an asynchronous executor with a queue. The time a task is waiting in this queue is monitored using this metric.

canton.sequencer.processed

(Meter): Number of messages processed by the sequencer

This metric measures the number of successfully validated messages processed by the sequencer since the start of this process.

canton.sequencer.processed-bytes

(Meter): Number of message bytes processed by the sequencer

This metric measures the total number of message bytes processed by the sequencer.

canton.sequencer.sequencer-client.application-handle

(Timer): Timer monitoring time and rate of sequentially handling the event application logic

All events are received sequentially. This handler records the rate and the time it takes the application (participant or domain) to handle the events.

canton.sequencer.sequencer-client.delay

(Gauge): The delay on the event processing

Every message received from the sequencer carries a timestamp. The delay provides the difference between the sequencing time and the processing time. The difference can be a result of either clock-skew or if the system is overloaded and doesn't manage to keep up with processing events.

canton.sequencer.sequencer-client.event-handle

(Timer): Timer monitoring time and rate of entire event handling

Most event handling cost should come from the application-handle. This timer measures the full time, which should be only marginally more than the application handle.

(Gauge): The load on the event subscription

The event subscription processor is a sequential process. The load is a factor between 0 and 1 describing how much of an existing interval has been spent in the event handler.

canton.sequencer.sequencer-client.submissions.dropped

(Counter): Count of send requests that did not cause an event to be sequenced

Counts send requests for which no corresponding event was sequenced by the supplied max-sequencing-time. There are several possible reasons: the request may have been lost before reaching the sequencer, the sequencer may be at capacity so that the max-sequencing-time was exceeded by the time the request was processed, or the supplied max-sequencing-time may simply be too small for the sequencer to sequence the request.

canton.sequencer.sequencer-client.submissions.in-flight

(Gauge): Number of sequencer send requests we have that are waiting for an outcome or timeout

Incremented on every successful send to the sequencer. Decremented when the event or an error is sequenced, or when the max-sequencing-time has elapsed.

Counter that is incremented if a send request receives an overloaded response from the sequencer.

canton.sequencer.sequencer-client.submissions.sends

(Timer): Rate and timings of send requests to the sequencer

Provides a rate and time of how long it takes for send requests to be accepted by the sequencer. Note that this is just for the request to be made and not for the requested event to actually be sequenced.

canton.sequencer.sequencer-client.submissions.sequencing

(Timer): Rate and timings of sequencing requests

This timer is started when a submission is made to the sequencer and then completed when a corresponding event is witnessed from the sequencer, so will encompass the entire duration for the sequencer to sequence the request. If the request does not result in an event no timing will be recorded.

canton.sequencer.subscriptions

(Gauge): Number of active sequencer subscriptions

This metric indicates the number of subscriptions that are currently open and actively served at the sequencer.