Monitoring and Alerts¶
K10 enables centralized monitoring of all its activity by integrating with Prometheus. In particular, it exposes a Prometheus endpoint from which a central system can extract data. This section documents how to enable Prometheus usage, the metrics K10 currently exposes, and how to generate alerts based on those metrics.
Using K10's Prometheus Endpoint¶
Exporting metrics from K10 is enabled by default. However, if it was explicitly disabled at install time, it can be re-enabled by running a helm upgrade command (see Upgrading K10) against the existing installation. The upgrade option to add is:
--set prometheus.enabled=true
Note
By default, Prometheus is configured with a persistent storage size of 8Gi and a retention period of 30d. These can be changed with --set prometheus.server.persistentVolume.size=<size> and --set prometheus.server.retention=<days>. The complete list of configurable parameters can be found at Advanced Install Options.
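For example, assuming the kasten Helm repository has already been added and K10 was installed as a release named k10 in the kasten-io namespace (adjust these to match your installation), the upgrade could be run as:
helm upgrade k10 kasten/k10 --namespace=kasten-io \
    --reuse-values \
    --set prometheus.enabled=true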
Once Prometheus is enabled, these metrics can be consumed by your central Prometheus system by adding the following scrape config:
- job_name: k10
  scrape_interval: 15s
  honor_labels: true
  scheme: http
  metrics_path: '/<k10-release-name>/prometheus/federate'
  params:
    'match[]':
      - '{__name__=~"jobs.*"}'
  static_configs:
    - targets:
      - 'prometheus-server.kasten-io.svc.cluster.local'
      labels:
        app: "k10"
K10 Metrics¶
While K10 exports a number of metrics, the jobs_duration metric is the easiest one for monitoring job status because it is already aggregated. This metric captures the running time of jobs that have completed, whether they succeed or fail.
The jobs_duration metric is a Prometheus histogram, and thus comprises three sub-metrics: jobs_duration_count, jobs_duration_sum, and jobs_duration_bucket. There is a single "status" label for the metric, which can take the following values: [succeeded, failed].
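As a sketch of how the histogram sub-metrics can be combined (the quantile and time window below are arbitrary choices, not K10 defaults), the 95th percentile of job duration per status over the last day could be computed with:
histogram_quantile(0.95, sum(rate(jobs_duration_bucket[1d])) by (le, status))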
K10 also exposes the catalog_actions_count metric, which is labeled with status, type, policy, and namespace labels. Possible values for the status label are [complete, failed, running, pending], and possible values for the type label are [backup, restore, import, export]. The policy label value points to the policy that initiated the action, and the namespace label value is the application namespace involved in the action.
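For example, one possible query for counting failed actions, broken down by policy and namespace, is:
sum(catalog_actions_count{status="failed"}) by (policy, namespace)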
While extensive documentation on how to query metrics is available elsewhere, two of the most useful queries are sum(jobs_duration_count) by (status) and sum(jobs_duration_count{status="failed"}) by (status). For a point query, the first will generate two results, one for succeeded jobs and one for failed jobs. The second will produce one result for all the failed K10 jobs in the system.
Finally, if you are collecting metrics from distinct K10 instances running across different clusters, you can distinguish between their metrics through the use of labels.
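One way to do this (a sketch, not something K10 configures for you) is to attach a distinct cluster label to each K10 scrape job in the central Prometheus; the cluster name below is hypothetical:
static_configs:
  - targets:
    - 'prometheus-server.kasten-io.svc.cluster.local'
    labels:
      app: "k10"
      cluster: "prod-us-east-1"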
Cluster License Status¶
K10 exports the metering_license_compliance_status metric related to cluster license compliance. This metric contains information on when the cluster was out of license compliance.
The metering_license_compliance_status metric is a Prometheus gauge, and has a value of 1 if the cluster's license status is compliant and 0 otherwise. To see the timeline of when K10 was out of license compliance, the metering_license_compliance_status metric can be queried and graphed.
Generating Alerts¶
Prometheus supports the creation of complex alerts based on gathered metrics. To get you started, the following example will create an alert if any K10 jobs have failed for the prod-daily policy within the last 10 minutes.
- alert: JobsFailing
  expr: increase(catalog_actions_count{status="failed", policy="prod-daily"}[10m]) > 0
  for: 1m
  annotations:
    summary: "One or more K10 jobs failed for policy prod-daily within the last 10 minutes"
    description: "{{ $labels.app }} jobs had {{ $value }} errors in the last 10 minutes for the {{ $labels.policy }} policy"
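Similarly, the metering_license_compliance_status metric described earlier can drive a license alert. The rule below is a sketch; the alert name and wait period are arbitrary:
- alert: K10LicenseNotCompliant
  expr: metering_license_compliance_status == 0
  for: 15m
  annotations:
    summary: "K10 cluster license is out of compliance"
    description: "metering_license_compliance_status has reported 0 (non-compliant) for at least 15 minutes"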