Monitoring and Alerts

K10 enables centralized monitoring of all its activity by integrating with Prometheus. In particular, it exposes a Prometheus endpoint from which a central system can extract data. This section documents how to enable the Prometheus integration, the metrics K10 currently exposes, and how to generate alerts based on these metrics.

Using K10's Prometheus Endpoint

Exporting metrics from K10 is enabled by default. However, if it was explicitly disabled at install time, it can be re-enabled by running a helm upgrade command (see Installing and Upgrading K10) against the existing installation. The upgrade option to add is:

--set prometheus.enabled=true
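
Equivalently, if you manage K10's configuration through a Helm values file rather than --set flags, the same option can be expressed as in the following sketch (pass the values file to the helm upgrade command with -f):

# values.yaml fragment: re-enables K10's bundled Prometheus
prometheus:
  enabled: true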

Once Prometheus is enabled, metrics can be consumed by your central Prometheus system by adding the following scrape config:

- job_name: k10
  scrape_interval: 15s
  honor_labels: true
  scheme: http
  metrics_path: '/k10/prometheus/federate'
  params:
    'match[]':
      - '{__name__=~"jobs.*"}'
  static_configs:
    - targets:
      - 'prometheus-server.kasten-io.svc.cluster.local'
      labels:
        app: "k10"

K10 Metrics

While K10 exports a number of metrics, the jobs_duration metric is the easiest one for monitoring job status because it is already aggregated. This metric captures the running time of jobs that have completed, whether they succeed or fail.

The jobs_duration metric is a Prometheus histogram and therefore comprises three component metrics: jobs_duration_count, jobs_duration_sum, and jobs_duration_bucket. The metric has a single status label, which can take the values [succeeded, failed].

K10 also exposes the catalog_actions_count metric, which carries status and type labels. The possible values for the status label are [complete, failed, running, pending], and for the type label they are [backup, restore, import, export].

While extensive documentation on how to query metrics is available elsewhere, two of the most useful queries are sum(jobs_duration_count) by (status) and sum(jobs_duration_count{status="failed"}) by (status). For a point query, the first generates two results, one for succeeded and one for failed jobs, while the second produces a single result covering all failed K10 jobs in the system.
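
If you would like Prometheus to precompute aggregations like these, the following is a hedged sketch of a recording-rule group: the first rule derives the average duration of recently completed jobs per status from the histogram's sum and count components, and the second gives a point-in-time breakdown of K10 actions by status and type. The group and rule names are arbitrary examples and are not defined by K10:

groups:
  - name: k10.records
    rules:
      # Average duration of recently completed K10 jobs, split by status
      # (succeeded/failed), using the standard histogram sum/count pattern.
      - record: k10:jobs_duration:avg
        expr: sum(rate(jobs_duration_sum[5m])) by (status) / sum(rate(jobs_duration_count[5m])) by (status)
      # Point-in-time count of K10 actions, broken down by status and type.
      - record: k10:catalog_actions:by_status_type
        expr: sum(catalog_actions_count) by (status, type)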

Finally, if you are collecting metrics from distinct K10 instances running across different clusters, you can distinguish between metrics from the different clusters by attaching an identifying label in each cluster's scrape config.
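
For example, reusing the static_configs stanza from the scrape config above, each cluster could add its own label (the cluster name below is a hypothetical placeholder):

  static_configs:
    - targets:
      - 'prometheus-server.kasten-io.svc.cluster.local'
      labels:
        app: "k10"
        cluster: "prod-us-east-1"  # hypothetical identifier for this cluster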

Generating Alerts

Prometheus supports the creation of complex alerts based on gathered metrics. To get you started, the following example creates an alert that fires if any K10 job has failed within the last 10 minutes.

- alert: JobsFailing
  expr: increase(catalog_actions_count{status="failed"}[10m]) > 0
  for: 1m
  annotations:
    summary: "At least 1 failed K10 job in the last 10 min"
    description: "{{ $labels.app }} has had {{ $value }} failed jobs in the last 10 mins"