Monitoring

K10 enables centralized monitoring of all its activity by integrating with Prometheus. In particular, it exposes a Prometheus endpoint from which a central system can extract data.

K10 can be installed with Grafana in the same namespace. This instance of Grafana is set up to automatically query metrics from K10's Prometheus instance. It also comes with a pre-created dashboard that helps visualize the status of K10's operations such as backup, restore, export, and import of applications.

This section documents how to install and enable Grafana and Prometheus, usage of the metrics currently exposed, generation of alerts and reports based on these metrics, and integration with external tools.

Using K10's Prometheus Endpoint

Exporting metrics from K10 is enabled by default. However, if you explicitly disabled it at install time, you can re-enable it by running helm upgrade against the existing installation (see Upgrading K10 for more about upgrades). The upgrade option to add is:

--set prometheus.server.enabled=true

Note

By default, Prometheus is configured with a persistent storage size of 8Gi and a retention period of 30d. These can be changed with --set prometheus.server.persistentVolume.size=<size> and --set prometheus.server.retention=<days>. The complete list of configurable parameters can be found at Advanced Install Options.

Once Prometheus is enabled, metrics can be consumed from your Prometheus system by enabling the following scrape config:

- job_name: k10
  scrape_interval: 15s
  honor_labels: true
  scheme: http
  metrics_path: '/<k10-release-name>/prometheus/federate'
  params:
    'match[]':
      - '{__name__=~"jobs.*"}'
  static_configs:
    - targets:
      - 'prometheus-server.kasten-io.svc.cluster.local'
      labels:
        app: "k10"
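Before wiring the endpoint into a central Prometheus, it can help to reproduce by hand the request that the scrape config above implies. The following is a minimal sketch, not part of K10: the helper name federate_url is hypothetical, and the default host assumes the kasten-io namespace used in the config above. The release name must be supplied by the caller.

```python
from urllib.parse import urlencode


def federate_url(release_name: str,
                 host: str = "prometheus-server.kasten-io.svc.cluster.local",
                 match: str = '{__name__=~"jobs.*"}') -> str:
    """Build the federation URL that the scrape config above would request.

    release_name is the K10 Helm release name (an input, not a known value).
    """
    path = f"/{release_name}/prometheus/federate"
    # urlencode percent-encodes both the 'match[]' key and the selector value.
    query = urlencode({"match[]": match})
    return f"http://{host}{path}?{query}"


if __name__ == "__main__":
    # "k10" is only an illustrative release name.
    print(federate_url("k10"))
```

The resulting URL can be fetched from inside the cluster with any HTTP client to confirm that metrics whose names start with jobs are returned.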

K10 Metrics

While K10 exports a number of metrics, the jobs_duration metric is the easiest one for monitoring job status because it is already aggregated. This metric captures the running time of jobs that have completed, whether they succeed or fail.

The jobs_duration metric is a Prometheus histogram, and thus it comprises 3 "sub" metrics: jobs_duration_count, jobs_duration_sum and jobs_duration_bucket. There is a single "status" label for the metric which can take the following values: [succeeded, failed].
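Because jobs_duration is a histogram, the mean job duration over a time window can be derived by dividing the rate of jobs_duration_sum by the rate of jobs_duration_count, which is the usual Prometheus pattern for histogram averages. The sketch below only builds the PromQL string. The function name and the default 1h window are illustrative assumptions, and the unit of the result is whatever unit K10 records durations in.

```python
def avg_job_duration_query(status: str, window: str = "1h") -> str:
    """Return PromQL for the mean duration of jobs with the given status.

    status is expected to be "succeeded" or "failed", the two values of the
    "status" label described above.
    """
    selector = f'{{status="{status}"}}'
    return (
        f"sum(rate(jobs_duration_sum{selector}[{window}])) / "
        f"sum(rate(jobs_duration_count{selector}[{window}]))"
    )


if __name__ == "__main__":
    print(avg_job_duration_query("failed"))
```

If no job with the requested status finished in the window, the denominator is zero and the query returns no meaningful value, so a wider window may be needed on quiet clusters.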

K10 Action Metrics

As K10 performs various actions throughout the system, metrics are collected about those actions. Counts for both cluster and application-specific actions are collected.

These action metrics include labels describing the context of the action. For actions specific to an application, the application name is included as app. For actions initiated by a policy, the policy name is included as policy. For ended actions, the final status is included as state (i.e. succeeded, failed, or cancelled).

Tip

When using K10 Multi-Cluster Manager, a cluster label must also be included. The cluster label should match the name of a secondary cluster or be blank ("") to query metrics for the primary cluster.

Separate metrics are collected for the number of times the action was started, ended, or skipped. This is indicated by the suffix of the metric (i.e. _started_count, _ended_count, or _skipped_count).

Metrics are collected for the following actions:

backup and backup_cluster
restore and restore_cluster
export
import
report
run

For example, to query the number of successful backups for each application in the past 24 hours:

sum by (app) (round(increase(action_backup_ended_count{state="succeeded"}[24h])))

Or to query the number of failed restores across all applications for the past hour:

sum(round(increase(action_restore_ended_count{state="failed"}[1h])))
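When such queries are generated by a script, for example to add the cluster label required with Multi-Cluster Manager, a small builder keeps the label handling in one place. The following is a sketch under stated assumptions: the function name action_count_query is hypothetical, and it only covers the _ended_count form used in the two queries above.

```python
from typing import Optional


def action_count_query(action: str, state: str, window: str,
                       by: Optional[str] = None,
                       cluster: Optional[str] = None) -> str:
    """Build a PromQL query counting ended actions in a given final state.

    cluster=None omits the cluster label entirely; cluster="" selects the
    primary cluster, as described in the Tip above.
    """
    labels = [f'state="{state}"']
    if cluster is not None:
        labels.append(f'cluster="{cluster}"')
    selector = "{" + ", ".join(labels) + "}"
    inner = f"round(increase(action_{action}_ended_count{selector}[{window}]))"
    if by is not None:
        return f"sum by ({by}) ({inner})"
    return f"sum({inner})"


if __name__ == "__main__":
    print(action_count_query("backup", "succeeded", "24h", by="app"))
    print(action_count_query("restore", "failed", "1h", cluster=""))
```

Called without a cluster argument, the builder reproduces the two example queries shown above.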

Important

Because action metrics are reported as counters, the increase or rate functions must be used when querying them. See Prometheus query functions for more information.
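To see why the raw counter value is not enough, consider what happens when a counter resets to zero, for instance after a process restart. The sketch below is a simplified illustration of the idea behind increase, not Prometheus's actual implementation: it sums the growth between consecutive samples and treats any drop as a reset. The real function additionally extrapolates to the edges of the time window, which can produce fractional results and is why the example queries wrap it in round.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Approximate the total increase of a counter across its samples.

    A sample lower than its predecessor is treated as a counter reset,
    so the full new value counts as growth since the reset.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


if __name__ == "__main__":
    # The counter resets between the third and fourth samples.
    samples = [3, 5, 9, 2, 4]
    print(counter_increase(samples))   # reset-aware increase
    print(samples[-1] - samples[0])    # naive last-minus-first difference
```

With these samples the reset-aware increase is 10, whereas subtracting the first sample from the last gives 1 and would badly under-report activity.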

Additional documentation on querying data from Prometheus can be found in the Prometheus docs.

Cluster License Status

K10 exports the metering_license_compliance_status metric related to cluster license compliance. This metric contains information on when the cluster was out of license compliance.

The metering_license_compliance_status metric is a Prometheus gauge, and has a value of 1 if the cluster's license status is compliant and 0 otherwise. To see the timeline of when K10 was out of license compliance, the metering_license_compliance_status metric can be queried and graphed.
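Besides graphing, the same timeline can be reduced to a list of out-of-compliance periods once the gauge samples have been retrieved. The following is a minimal sketch that assumes the samples are already available as (timestamp, value) pairs in chronological order. The function name is hypothetical, and fetching the samples from Prometheus is left out.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp, gauge value: 1 or 0)


def non_compliant_intervals(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps of periods where the gauge was 0."""
    intervals: List[Tuple[float, float]] = []
    start: Optional[float] = None
    for ts, value in samples:
        if value == 0 and start is None:
            start = ts                      # compliance was just lost
        elif value != 0 and start is not None:
            intervals.append((start, ts))   # compliance was restored
            start = None
    if start is not None:
        # Still non-compliant at the last sample: close at that timestamp.
        intervals.append((start, samples[-1][0]))
    return intervals


if __name__ == "__main__":
    data = [(0, 1), (60, 0), (120, 0), (180, 1), (240, 0)]
    print(non_compliant_intervals(data))
```

An interval that is still open at the final sample is closed at that sample's timestamp, so its true end is unknown and may be later.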

Using K10's Grafana Endpoint

Installation

To enable or disable Grafana and Prometheus, set this helm value when installing or upgrading K10. The value is enabled by default.

--set grafana.enabled=true

Accessing Grafana from K10's dashboard

Click on the "Usage and Reports" card on K10's dashboard.

Click on "More Charts and Alerts" to access the instance of Grafana installed with K10.

Charts and Graphs

The Grafana dashboard can be used to monitor how many application-scoped or cluster-scoped actions (backup, restore, export, and import) have completed, failed, or been skipped.

It shows the number of policy runs that have completed or been skipped.

The amount of disk space consumed and the percentage of free space available in K10's stateful services (catalog, jobs, and logging) are also shown.

Grafana Alerts

Grafana can be used to create alerts based on an alert rule. Each rule uses a query that fetches data from a data source. Each query involves a metric such as the K10 metrics described in a previous section.

Before adding an alert, make sure you have performed more than one backup action. Then select Notification channels from the Alerting menu.

Click on Add Channel. Add a channel name and select Slack as the channel type. In Optional Slack settings, add the Webhook URL. In Notification settings, select Default setting.

As shown in the screenshot above, click on the Test button. If the webhook is configured properly, a test notification will be sent successfully.

Click on the Save button.

To add alerts in Grafana, click on + → Dashboard.

On the new Dashboard screen, click on Add an empty panel.

In the Edit Panel, select Prometheus as the Datasource. In the Metrics browser, select the action_backup_started_count metric.

Click on the Use query button.

In the top-right corner of the Edit Panel, change the type of visualization to Graph (old).

Change the query name from A to action backup started.

On the top-right corner, click on the Save button.

Go to the Alert tab and click on Create Alert. Configure a name for the alert rule. In the Conditions section, select the query named action backup started created in the previous step. Set IS ABOVE to a value that is greater than the number of backup actions that have been started on this cluster so far.

In the Notifications section, the Send To option will default to the notification channel created in a previous step. Click on the Save button in the top-right corner, followed by the Apply button. Once the number of backup actions started crosses the threshold set in the alert rule, notifications should be seen in the Slack channel.

Generating Reports

K10 Reporting provides regular insights into key performance and operational states of the system. It uses Prometheus to obtain information about action runs and storage consumption. For more information about K10 Reporting, see Reporting.

Integration with External Tools

Exporting Metrics to Datadog