Monitoring

K10 enables centralized monitoring of all its activity by integrating with Prometheus. In particular, it exposes a Prometheus endpoint from which a central system can extract data.

K10 can be installed with Grafana in the same namespace. This instance of Grafana is set up to automatically query metrics from K10's Prometheus instance. It also comes with a pre-created dashboard that helps visualize the status of K10's operations such as backup, restore, export, and import of applications.

This section documents how to install and enable Grafana and Prometheus, how to use the metrics currently exposed, how to generate alerts and reports based on these metrics, and how to integrate with external tools.

Using K10's Prometheus Endpoint

By default, Prometheus is configured with a persistent storage size of 8Gi and a retention period of 30d. These can be changed with --set prometheus.server.persistentVolume.size=<size> and --set prometheus.server.retention=<days>. The complete list of configurable parameters can be found at Advanced Install Options.

An external Prometheus server can be configured to scrape K10's built-in server. The following scrape config is an example of how a Prometheus server hosted in the same cluster might be configured:

- job_name: k10
  scrape_interval: 15s
  honor_labels: true
  scheme: http
  metrics_path: '/<k10-release-name>/prometheus/federate'
  params:
    'match[]':
      - '{__name__=~"jobs.*"}'
  static_configs:
    - targets:
      - 'prometheus-server.kasten-io.svc.cluster.local'
      labels:
        app: "k10"

Note

An additional NetworkPolicy may need to be applied in certain environments.
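
To make the scrape config above more concrete, the following Python sketch assembles the URL that such a job would request from the federate endpoint. It is only an illustration: the release name "k10" and the helper name federate_url are hypothetical, and the host, path, and match[] selector are copied from the example config.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Host taken from the example scrape config; adjust for your cluster.
DEFAULT_HOST = "prometheus-server.kasten-io.svc.cluster.local"


def federate_url(release_name, host=DEFAULT_HOST,
                 matchers=('{__name__=~"jobs.*"}',)):
    """Build the federate URL described by the example scrape config."""
    path = f"/{release_name}/prometheus/federate"
    # Each selector becomes one URL-encoded match[] query parameter.
    query = urlencode([("match[]", m) for m in matchers])
    return urlunsplit(("http", host, path, query, ""))


# "k10" is a hypothetical release name standing in for <k10-release-name>.
url = federate_url("k10")
parts = urlsplit(url)
print(parts.path)
print(parse_qs(parts.query)["match[]"])
```

Decoding the query string again, as the last two lines do, is a quick way to confirm that the selector survives URL encoding unchanged.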

Although it's possible to disable K10's built-in Prometheus server, it's recommended to leave it enabled. Disabling the server reduces functionality in various parts of the system such as usage data, reporting, and the multi-cluster dashboard. To disable the built-in server, set the prometheus.server.enabled value to false.

If the built-in server has previously been disabled, it can be re-enabled during a helm upgrade (see Upgrading K10) with: --set prometheus.server.enabled=true.

K10 Metrics

While K10 exports a number of metrics, the jobs_duration metric is the easiest one for monitoring job status because it is already aggregated. This metric captures the running time of jobs that have completed, whether they succeed or fail.

The jobs_duration metric is a Prometheus histogram, and thus it comprises 3 "sub" metrics: jobs_duration_count, jobs_duration_sum and jobs_duration_bucket. There is a single "status" label for the metric which can take the following values: [succeeded, failed].
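
As a rough illustration of how the sub-metrics relate, the Python sketch below derives a mean job duration from jobs_duration_sum and jobs_duration_count, and a failure ratio from counts split by the status label. The sample numbers and helper names are hypothetical and are not real K10 output.

```python
def mean_duration(duration_sum, duration_count):
    """Mean job duration: jobs_duration_sum divided by jobs_duration_count."""
    if duration_count == 0:
        return 0.0
    return duration_sum / duration_count


def failure_ratio(count_by_status):
    """Share of completed jobs whose status label is 'failed'."""
    total = sum(count_by_status.values())
    if total == 0:
        return 0.0
    return count_by_status.get("failed", 0) / total


# Hypothetical jobs_duration_count values per status label.
counts = {"succeeded": 18, "failed": 2}
print(mean_duration(duration_sum=600.0, duration_count=20))  # 30.0
print(failure_ratio(counts))  # 0.1
```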

K10 Compliance Metrics

To track the number of applications outside of compliance, use the compliance_count metric and check for the following states: [NotCompliant, Unmanaged].

K10 Action Metrics

As K10 performs various actions throughout the system, metrics are collected about those actions. Counts for both cluster and application-specific actions are collected.

These action metrics include labels describing the context of the action. For actions specific to an application, the application name is included as app. For actions initiated by a policy, the policy name is included as policy. For ended actions, the final status is included as state (i.e. succeeded, failed, or cancelled).

Tip

When using K10 Multi-Cluster Manager, a cluster label must also be included. The cluster label should match the name of a secondary cluster or be blank ("") to query metrics for the primary cluster.

Separate metrics are collected for the number of times the action was started, ended, or skipped. This is indicated by the suffix of the metric (i.e. _started_count, _ended_count, or _skipped_count).

An overall set of metrics is also collected that does not include the app or policy labels. These metrics end with _overall rather than _count. The overall metrics should be preferred unless application or policy specific information is required.

Metrics are collected for the following actions:

  • backup and backup_cluster

  • restore and restore_cluster

  • export

  • import

  • report

  • run

For example, to query the number of successful backups in the past 24 hours:

sum(round(increase(action_backup_ended_overall{state="succeeded"}[24h])))

Or to query the number of failed restores for the past hour:

sum(round(increase(action_restore_ended_overall{state="failed"}[1h])))

Important

Due to action metrics being reported as counters, the increase or rate functions must be used when querying. See Prometheus query functions for more information.

Additional documentation on querying data from Prometheus can be found in the Prometheus docs.
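
The naming scheme and example queries above can be sketched as a small Python helper that assembles metric names and query strings. This is only an illustration: the helper names are hypothetical, and the action_<action>_<event>_<suffix> pattern is inferred from the example metrics in this section rather than taken from a published specification.

```python
# Actions and events as listed in this section.
ACTIONS = {"backup", "backup_cluster", "restore", "restore_cluster",
           "export", "import", "report", "run"}
EVENTS = {"started", "ended", "skipped"}


def action_metric(action, event, overall=True):
    """Return a metric name such as action_backup_ended_overall."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    suffix = "overall" if overall else "count"
    return f"action_{action}_{event}_{suffix}"


def increase_query(metric, window, **labels):
    """Wrap a counter in increase(), as counters require, and sum the result."""
    selector = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"sum(round(increase({metric}{{{selector}}}[{window}])))"


# Reproduces the two example queries shown above.
print(increase_query(action_metric("backup", "ended"), "24h", state="succeeded"))
print(increase_query(action_metric("restore", "ended"), "1h", state="failed"))
```

Passing overall=False yields the _count variant, to which app or policy labels can then be added as keyword arguments.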

K10 Storage Metrics

To check exported storage consumption (Object, NFS or Veeam Backup & Replication) you can use export_storage_size_bytes with types [logical, physical], e.g. export_storage_size_bytes{type="logical"}. The deduplication ratio is calculated by logical / physical.

To check local backup data size you can use snapshot_storage_size_bytes, also with logical and physical types.
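
As a small worked example of the logical / physical calculation, the Python sketch below computes a deduplication ratio from two byte counts. The sample values and the function name are hypothetical; in practice the inputs would be the values of export_storage_size_bytes for type="logical" and type="physical".

```python
GIB = 1024 ** 3


def dedup_ratio(logical_bytes, physical_bytes):
    """Deduplication ratio: logical size divided by physical size."""
    if physical_bytes <= 0:
        raise ValueError("physical size must be positive")
    return logical_bytes / physical_bytes


# Hypothetical sample: 500 GiB of logical data stored in 125 GiB.
print(dedup_ratio(500 * GIB, 125 * GIB))  # 4.0
```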

Cluster License Status

K10 exports the metering_license_compliance_status metric related to cluster license compliance. This metric contains information on when the cluster was out of license compliance.

The metering_license_compliance_status metric is a Prometheus gauge, and has a value of 1 if the cluster's license status is compliant and 0 otherwise. To see the timeline of when K10 was out of license compliance, the metering_license_compliance_status metric can be queried and graphed.
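
Besides graphing, the gauge samples can be reduced to a list of out-of-compliance periods. The Python sketch below shows one way to do that, assuming the samples have already been fetched as time-ordered (timestamp, value) pairs; the function name and sample data are hypothetical.

```python
def non_compliant_intervals(samples):
    """Return (start, end) timestamp pairs during which the gauge was 0.

    An interval ends at the first later sample that is non-zero. If the
    series ends while still at 0, the interval ends at the last timestamp.
    """
    intervals = []
    start = None
    last_ts = None
    for ts, value in samples:
        if value == 0 and start is None:
            start = ts  # gauge dropped to 0: interval opens
        elif value != 0 and start is not None:
            intervals.append((start, ts))  # gauge recovered: interval closes
            start = None
        last_ts = ts
    if start is not None:
        intervals.append((start, last_ts))
    return intervals


# Hypothetical samples taken every 60 seconds.
samples = [(0, 1), (60, 0), (120, 0), (180, 1), (240, 0)]
print(non_compliant_intervals(samples))  # [(60, 180), (240, 240)]
```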

Using K10's Grafana Endpoint

Installation

To enable or disable Grafana and Prometheus, set the following Helm value when installing or upgrading K10. The value is enabled by default.

--set grafana.enabled=true

Accessing Grafana from K10's dashboard

Click on the "Usage and Reports" card on K10's dashboard.

Click on "More Charts and Alerts" to access the instance of Grafana installed with K10.

Charts and Graphs

The Grafana dashboard can be used to monitor how many application scoped or cluster scoped actions (backup, restore, export and import) have completed, failed or been skipped.

It shows the number of policy runs that have completed or been skipped.

The amount of disk space consumed and the percentage of free space available in K10's stateful services (catalog, jobs, and logging) are also shown.

Grafana Alerts

Grafana can be used to create alerts based on an alert rule. Each rule uses a query that fetches data from a data source. Each query involves a metric such as the K10 metrics described in a previous section.

Before adding an alert, make sure you have performed more than one backup action. Select Notification channels from the Alerting menu.

Click on Add Channel. Add a channel name and select Slack as the channel type. In Optional Slack settings, add the Webhook URL. In Notification settings, select Default setting.

Click on the Test button. If the webhook is configured properly, a test notification will be sent successfully.

Click on the Save button.

To add alerts in Grafana, click on + Dashboard.

On the new Dashboard screen, click on Add an empty panel.

In the Edit Panel, select Prometheus as the Datasource. In the Metrics browser, select the action_backup_started_overall metric.

Click on the Use query button.

In the top-right corner of the Edit Panel, change the type of visualization to Graph (old).

Change the query name from A to action backup started.

On the top-right corner, click on the Save button.

Go to the Alert tab and click on Create Alert. Configure a name for the alert rule. In the Conditions section, select the query named action backup started created in the previous step. Set IS ABOVE to a value that is more than the current number of backup actions that were started on this cluster.

In the Notifications section, the Send To option will default to the notification channel created in a previous step. Click on the Save button in the top-right corner, followed by the Apply button. Once the number of backup actions started crosses the threshold set in the alert rule, notifications should appear in the Slack channel.

Generating Reports

K10 Reporting provides regular insights into key performance and operational states of the system. It uses Prometheus to obtain information about action runs and storage consumption. For more information about K10 Reporting, see Reporting.

Integration with External Tools