Monitoring

Warning

Grafana will no longer be included in the Veeam Kasten installation starting in the upcoming release 7.5.0. Upon upgrading to this version, the integrated version of Grafana will be removed.

It is important to install Grafana separately and follow the procedure described in the knowledge base article to configure the Kasten dashboards and alerts before upgrading Kasten to version 7.5.0.

K10 enables centralized monitoring of all its activity by integrating with Prometheus. In particular, it exposes a Prometheus endpoint from which a central system can extract data.

K10 can be installed with Grafana in the same namespace. This instance of Grafana is set up to automatically query metrics from K10's Prometheus instance. It also comes with a pre-created dashboard that helps visualize the status of K10's operations such as backup, restore, export, and import of applications.

This section documents how to install and enable Grafana and Prometheus, use the metrics currently exposed, generate alerts and reports based on these metrics, and integrate with external tools.

Using K10's Prometheus Endpoint

By default, Prometheus is configured with a persistent storage size of 8Gi and a retention period of 30d. These defaults can be changed with --set prometheus.server.persistentVolume.size=<size> and --set prometheus.server.retention=<days>.

Prometheus requires Kubernetes API access to discover K10 pods and scrape their metrics. Thus, by default, Role and RoleBinding entries are created in the K10 namespace. However, if you set prometheus.rbac.create=true, a global ClusterRole and ClusterRoleBinding will be created instead.

The complete list of configurable parameters can be found at Advanced Install Options.

If you don't want Helm to create RBAC resources for you automatically and you have set both rbac.create=false and prometheus.rbac.create=false, you can create the Role and RoleBinding manually:

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: k10-prometheus-server
  namespace: kasten-io
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics
  - services
  - endpoints
  - pods
  - ingresses
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  - networking.k8s.io
  resources:
  - ingresses/status
  - ingresses
  verbs:
  - get
  - list
  - watch
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: k10-prometheus-server
  namespace: kasten-io
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: k10-prometheus-server
subjects:
  - kind: ServiceAccount
    name: prometheus-server
    namespace: kasten-io

An external Prometheus server can be configured to scrape K10's built-in server. The following scrape config is an example of how a Prometheus server hosted in the same cluster might be configured:

- job_name: k10
  scrape_interval: 15s
  honor_labels: true
  scheme: http
  metrics_path: '/<k10-release-name>/prometheus/federate'
  params:
    'match[]':
      - '{__name__=~"jobs.*"}'
  static_configs:
    - targets:
      - 'prometheus-server.kasten-io.svc.cluster.local'
      labels:
        app: "k10"

Note

An additional NetworkPolicy may need to be applied in certain environments.

Although it's possible to disable K10's built-in Prometheus server, it's recommended to leave it enabled. Disabling the server reduces functionality in various parts of the system such as usage data, reporting, and the multi-cluster dashboard. To disable the built-in server, set the prometheus.server.enabled value to false.

If the built-in server has previously been disabled, it can be re-enabled during a helm upgrade (see Upgrading K10) with: --set prometheus.server.enabled=true.

K10 Metrics

Tip

When using K10 Multi-Cluster Manager (i.e., a cluster set up as a primary), a cluster label with a blank value ("") is required to query metrics for the primary cluster from its Prometheus instance.
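As a minimal sketch of the tip above, the query below adds the blank cluster label to an action metric described later in this section. The 24-hour window is an arbitrary choice for illustration.

```promql
# Successful backups on the primary cluster in the past 24 hours.
# The blank cluster label selects the primary cluster's own series.
sum(round(increase(action_backup_ended_overall{state="succeeded", cluster=""}[24h])))
```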

K10 Action Metrics

When K10 performs various actions throughout the system, it collects metrics associated with these actions. It records counts for both cluster and application-specific actions.

These action metrics include labels that describe the context of the action. For actions specific to an application, the application name is included as app. For actions initiated by a policy, the policy name is included as policy. For ended actions, the final status is included as state (i.e., succeeded, failed, or cancelled).

Separate metrics are collected for the number of times the action was started, ended, or skipped. This is indicated by the suffix of the metric (i.e., _started_count, _ended_count, or _skipped_count).

An overall set of metrics is also collected that does not include the app or policy labels. These metrics end with _overall rather than _count. It is recommended to use the overall metrics unless specific application or policy information is required.

Metrics are collected for the following actions:

  • backup and backup_cluster

  • restore and restore_cluster

  • export

  • import

  • report

  • run

For example, to query the number of successful backups in the past 24 hours:

sum(round(increase(action_backup_ended_overall{state="succeeded"}[24h])))

Or, to query the number of failed restores for the past hour:

sum(round(increase(action_restore_ended_overall{state="failed"}[1h])))
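When application-level detail is needed, the same pattern can be applied to the metrics that carry the app and policy labels. The sketch below assumes the per-application series is named action_backup_ended_count, which follows from the _ended_count suffix convention described above but should be checked against the metrics actually exposed by your installation.

```promql
# Failed backups per application over the past 24 hours.
# Assumption: the metric name follows the action_<action>_ended_count convention.
sum by (app) (round(increase(action_backup_ended_count{state="failed"}[24h])))
```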

Important

When querying metrics that are reported as counters, such as action metrics, the increase or rate functions must be used. See Prometheus query functions for more information.

Examples of Action Metrics

action_export_processed_bytes: The overall bytes processed during the export. Labels: policy, app

action_export_transferred_bytes: The overall bytes transferred during the export. Labels: policy, app

See the Prometheus docs for more information on how to query data from Prometheus.

K10 Artifact Metrics

You can monitor both the rate of artifact creation and the current count within K10. Similar to the action counts mentioned above, there are also the following metrics, which track the number of artifacts backed up by K10 within a defined time frame:

  • action_artifact_count

  • action_artifact_count_by_app

  • action_artifact_count_by_policy

To see the number of artifacts currently protected by snapshots, use the following metrics:

  • artifact_sum

  • artifact_sum_by_app

  • artifact_sum_by_policy

If an artifact is protected by multiple snapshots then it will be counted multiple times.
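The current-count metrics can be queried directly, without increase or rate, assuming they are exposed as gauges (the text above describes them as current counts, but the metric type is not stated). A possible sketch:

```promql
# Artifacts currently protected by snapshots, one series per application.
artifact_sum_by_app

# The five series with the most protected artifacts.
topk(5, artifact_sum_by_app)
```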

K10 Compliance Metrics

To track the number of applications that fall outside of compliance, use the compliance_count metric, which includes the following states of interest: [NotCompliant, Unmanaged]. If the cluster contains pre-existing namespaces that are not subject to compliance concerns, the Helm value excludedApps can be used to exclude them. Doing so both removes the application(s) from the dashboard and excludes them from compliance_count. The exclusion can be set using an inline array (excludedApps: ["app1", "app2"]) or a multi-line array that lists the applications to be excluded:

excludedApps:
  - app1
  - app2

If you prefer to set Helm values inline rather than through a YAML file, you can do this with the following:

--set excludedApps[0]="app1"
--set excludedApps[1]="app2"

See the knowledge base article for more information.
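A query along the following lines could be used to chart or alert on applications that are out of compliance. It assumes the compliance state is exposed through a label named state, as it is for the mc_compliance_count metric described later in this section.

```promql
# Applications that are currently NotCompliant or Unmanaged.
# Assumption: the state is carried in a label named "state".
sum(compliance_count{state=~"NotCompliant|Unmanaged"})
```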

K10 Execution Metrics

Aggregating Job and Phase Runner Metrics

These metrics are designed especially for measuring parallelism usage:

| Name | Type | Description | Labels |
|---|---|---|---|
| exec_active_job_count | gauge | Number of active jobs at a time | action - Action name (e.g. manualSnapshot, retire) |
| exec_started_job_count_total | counter | Total number of started jobs per executor instance | action - Action name (e.g. manualSnapshot, retire) |
| exec_active_phase_count | gauge | Number of active phases for a given action and with a given name per executor instance | action - Action name (e.g. manualSnapshot, retire); phase - Phase name (e.g. copySnapshots, reportMetrics) |
| exec_started_phase_count_total | counter | Total number of started phases for a given action and with a given name per executor instance | action - Action name (e.g. manualSnapshot, retire); phase - Phase name (e.g. copySnapshots, reportMetrics) |
| exec_phase_error_count_total | counter | Total number of errors for a given action and phase per executor instance | action - Action name (e.g. manualSnapshot, retire); phase - Phase name (e.g. copySnapshots, reportMetrics) |
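The execution metrics above might be queried as follows. This is only a sketch: gauges are read directly, while counters are wrapped in increase, and the one-hour window is an arbitrary choice.

```promql
# Jobs currently running, summed across executor instances, per action.
sum by (action) (exec_active_job_count)

# Phase errors per action and phase over the past hour.
sum by (action, phase) (increase(exec_phase_error_count_total[1h]))
```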

Rate Limiter Metrics

These metrics might be useful for monitoring current pressure:

| Name | Type | Description | Labels |
|---|---|---|---|
| limiter_inflight_count | gauge | Number of in-flight operations | operation - Operation name (e.g. csiSnapshot, genericCopy) |
| limiter_pending_count | gauge | Number of pending operations | operation - Operation name (e.g. csiSnapshot, genericCopy) |
| limiter_request_seconds | histogram | Duration in seconds of how long an operation waits for the token (label stage = wait) or holds the token (label stage = hold) | operation - Operation name (e.g. csiSnapshot, genericCopy); stage - Indicates what the metric measures: wait or hold (see the description for details) |
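Because limiter_request_seconds is a histogram, Prometheus exposes it through the standard _sum, _count, and _bucket series. A sketch of how the rate limiter metrics could be used to gauge pressure, with a 5-minute window chosen only for illustration:

```promql
# Average time operations spent waiting for a token, per operation.
sum by (operation) (rate(limiter_request_seconds_sum{stage="wait"}[5m]))
/ sum by (operation) (rate(limiter_request_seconds_count{stage="wait"}[5m]))

# Operations currently queued behind the rate limiter.
sum by (operation) (limiter_pending_count)
```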

Jobs Metrics

These metrics measure the time range between the creation of the job and its completion:

| Name | Type | Description | Labels |
|---|---|---|---|
| jobs_completed | gauge | Number of finished jobs (a job is considered finished if its status is failed, skipped, or succeeded) | status - Status name (e.g. succeeded, failed) |
| jobs_duration | histogram | Duration in seconds of completed K10 jobs | status - Status name (e.g. succeeded, failed); policy_id - Policy ID (e.g. 264aae0e-07ac-4aa5-a38f-aa131c053cbe, UNKNOWN) |

The jobs_duration metric is the easiest one for monitoring job status because it is already aggregated. This metric captures the running time of jobs that have completed, whether they succeed or fail.
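Since jobs_duration is a histogram, percentiles can be estimated with histogram_quantile over its _bucket series. The following is a sketch; the percentile and the 24-hour window are arbitrary choices.

```promql
# Estimated 95th percentile duration (seconds) of succeeded jobs in the past 24 hours.
histogram_quantile(0.95, sum by (le) (rate(jobs_duration_bucket{status="succeeded"}[24h])))

# Average duration (seconds) of failed jobs in the past 24 hours.
sum(rate(jobs_duration_sum{status="failed"}[24h]))
/ sum(rate(jobs_duration_count{status="failed"}[24h]))
```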

K10 License Status

K10 exports the metering_license_compliance_status metric related to the cluster's license compliance. This metric contains information on when the cluster was out of license compliance.

The metering_license_compliance_status metric is a Prometheus gauge, and has a value of 1 if the cluster's license status is compliant and 0 otherwise. To see the timeline of when K10 was out of license compliance, the metering_license_compliance_status metric can be queried and graphed.

../_images/licensestatus_prometheus.png

Peak node usage for the last two months can be viewed by querying, for example, node_usage_history{timePeriod="202210"}. The timePeriod label uses the format YYYYMM.
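Because the license status metric is a gauge with values 1 and 0, it can be filtered or averaged directly. A sketch, assuming a 30-day window that fits within the configured Prometheus retention:

```promql
# Returns a series only while the cluster is out of license compliance.
metering_license_compliance_status == 0

# Fraction of the past 30 days spent in compliance (1 means always compliant).
avg_over_time(metering_license_compliance_status[30d])
```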

K10 Status Metrics

The state of profiles and policies can be monitored with profiles_count and policies_count respectively.

profiles_count{type="Location", status="Failed"} reporting a value greater than 0 would be grounds for further investigation as it would create issues for any related policies. type="Infra" is also available for Infrastructure profiles.

policies_count{action="backup", chained="export", status="Failed"} reports on policies involving both a backup and export that are in a failed state.
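The label selectors above can be combined with aggregation to get a quick overview. The queries below are a sketch using only the labels mentioned in this section.

```promql
# Returns a value only when at least one profile of any type is in a failed state.
sum(profiles_count{status="Failed"}) > 0

# Failed policies grouped by action.
sum by (action) (policies_count{status="Failed"})
```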

K10 Storage Metrics

To check exported storage consumption (Object, NFS, or Veeam Backup & Replication), use export_storage_size_bytes, which has the types [logical, physical], e.g. export_storage_size_bytes{type="logical"}. The deduplication ratio is calculated as logical / physical.

snapshot_storage_size_bytes, also with logical and physical types, reports the local backup space utilization.
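The deduplication ratio for exported data can be computed in a single query. In this sketch both sides are wrapped in sum so that the differing type label does not prevent the two series from matching.

```promql
# Deduplication ratio of exported data: logical size divided by physical size.
sum(export_storage_size_bytes{type="logical"})
/ sum(export_storage_size_bytes{type="physical"})
```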

Data Transfer Metrics

Metrics are collected for individual snapshot upload and download operation steps within K10 export and import actions. These metrics differ from those collected for K10 actions because they are captured on a per-volume basis, whereas K10 actions, in general, could involve multiple volume operations and other activities.

The following data operations metrics are recorded:

| Metric Name | Type | Description |
|---|---|---|
| data_operation_duration | Histogram | This metric captures the total time taken to complete an operation. |
| data_operation_normalized_duration | Histogram | This metric captures the normalized time taken by an operation. The value is expressed in time/MiB. Normalized duration values allow comparisons between different time series, which is not possible for duration metric values due to the dependency on the amount of data transferred. |
| data_operation_bytes | Counter | This metric counts the bytes transferred by an operation, and is typically used to compute the data transfer rate. Note: This metric is not collected for Download operations involving the Filesystem export mechanism. |
| data_operation_volume_count | Gauge | This metric counts the number of volumes involved in an operation. It is set to 1 at the beginning of an operation and changes to 0 upon completion. When aggregated, it displays the total number of volumes being transferred over time. |

The following labels are applied to the operation metrics:

| Label Name | Description |
|---|---|
| operation | The type of operation: one of Upload or Download |
| repo_type | The type of LocationProfile object that identifies the storage repository: one of ObjectStore, FileStore or VBR. |
| repo_name | The name of the LocationProfile object that identifies the storage repository. |
| data_format | The export mechanism used: one of Filesystem or Block. |
| namespace | The namespace of the application involved. |
| pvc_name | The name of the PVC involved. |
| storage_class | The storage class of the PVC involved. |

Upload operation metrics do not include the time taken to snapshot the volumes or the time to upload the action's metadata. However, they do include the time taken to instantiate a PersistentVolume from a snapshot when needed. Similarly, Download operation metrics do not involve the allocation of the PersistentVolume or the node affinity enforcement steps.

Some query examples:

# average duration over 2-minute intervals
sum by (data_format,operation,namespace,pvc_name) (rate(data_operation_duration_sum{}[2m]))
/ sum by (data_format,operation,namespace,pvc_name) (rate(data_operation_duration_count{}[2m]))

# average transfer rate over 2-minute intervals
avg by (data_format, operation, storage_class, repo_name) (rate(data_operation_bytes{}[2m]))

# count of data transfer operations over 2-minute intervals
sum (max_over_time(data_operation_volume_count{}[2m]))
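The normalized duration metric can be queried in the same way as the duration metric. The sketch below groups by storage class and repository so that different storage back ends can be compared; the grouping labels and the 2-minute window are illustrative choices.

```promql
# average normalized duration (time per MiB) over 2-minute intervals
sum by (storage_class, repo_name, operation) (rate(data_operation_normalized_duration_sum{}[2m]))
/ sum by (storage_class, repo_name, operation) (rate(data_operation_normalized_duration_count{}[2m]))
```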

When a Veeam Backup Repository is involved, additional metrics are recorded:

| Metric Name | Type | Description |
|---|---|---|
| data_upload_session_duration | Histogram | This metric captures the total time taken for an upload session. |
| data_upload_session_volume_count | Gauge | This metric counts the number of volumes in an upload session. When aggregated, it shows the total number of volumes across all upload sessions over time. |

The following labels are applied to the upload session metrics:

| Label Name | Description |
|---|---|
| repo_type | The type of LocationProfile object that identifies the storage repository: VBR. |
| repo_name | The name of the LocationProfile object that identifies the storage repository. |
| namespace | The namespace of the application involved. |

A query example:

# count of volumes involved in VBR upload sessions over 2-minute intervals
sum (max_over_time(data_upload_session_volume_count{repo_type="VBR"}[2m]))

K10 Multi-Cluster Metrics

The Multi-Cluster primary instance exports the following metrics collected from all clusters within the multi-cluster system.

Use the cluster label with cluster name as the value to query metrics for an individual cluster.

For example, to query the number of successful actions in the past 24 hours:

sum(round(increase(mc_action_ended_count{state="succeeded",cluster="<cluster-name>"}[24h])))

Policy Metrics

| Name | Type | Description | Labels |
|---|---|---|---|
| mc_policies_count | gauge | Number of policies in cluster | cluster - Cluster name |
| mc_compliance_count | gauge | Number of namespaces by compliance state. See K10 Compliance Metrics about exclusions | cluster - Cluster name; state - Compliance state (e.g. Compliant, NotCompliant, Unmanaged) |

Action Metrics

| Name | Type | Description | Labels |
|---|---|---|---|
| mc_action_ended_count | counter | Number of actions that have ended | cluster - Cluster name; state - Terminal state (e.g. cancelled, failed, succeeded) |
| mc_action_skipped_count | counter | Number of actions that were skipped | cluster - Cluster name |

Storage Metrics

| Name | Type | Description | Labels |
|---|---|---|---|
| mc_export_storage_physical_size_bytes | gauge | Exported storage consumption in bytes | cluster - Cluster name |
| mc_snapshot_storage_physical_size_bytes | gauge | Local backup space utilization in bytes | cluster - Cluster name |
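On the primary instance, the cluster label makes it possible to compare clusters side by side. A sketch using the multi-cluster metrics listed above:

```promql
# Exported storage consumption per cluster, largest first.
sort_desc(sum by (cluster) (mc_export_storage_physical_size_bytes))

# Clusters with at least one non-compliant namespace.
sum by (cluster) (mc_compliance_count{state="NotCompliant"}) > 0
```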

Using K10's Grafana Endpoint

Installation

To enable or disable Grafana and Prometheus, use the following Helm value when installing or upgrading K10. The value is set to true by default.

--set grafana.enabled=true

Accessing Grafana from K10's dashboard

Click on the "Data Usage" card on K10's dashboard.

../_images/data_usage.png

Click on "More Charts and Alerts" to access the instance of Grafana installed with K10.

../_images/grafana_link.png

Charts and Graphs

../_images/grafana_dashboard.png

The Grafana dashboard can be used to monitor how many application-scoped or cluster-scoped actions (backup, restore, export, and import) have completed, failed, or been skipped.

It shows the number of policy runs that have completed or been skipped.

The amount of disk space consumed and the percentage of free space available in K10's stateful services (catalog, jobs, and logging) are also shown.

The Data reduction section provides graphs that show the amount of data being transferred (e.g., when a new volume is exported it will be close to 100%, as all data needs to be transferred, but with an unchanged volume it will be close to 0% since most of the data has already been exported):

../_images/data_reduction_dashboard.png

The K10 System Resource Usage section provides CPU/Memory usage graphs specific to K10 and metrics that describe task execution performance:

../_images/resource_usage_dashboard.png

The Data transfer operations section provides graphs on the transfer of data to and from storage repositories that are captured by the data transfer metrics described above.

../_images/data_operations_panel.png

The column on the left is organized by storage class, location profile, and the export mechanism used. The upper panel displays the normalized duration of transfer operations, while the lower panel shows the data transfer rate. (The normalized duration expresses the time taken to transfer one MiB of data, and hence is comparable between the different time series displayed in the panel).

The column on the right is organized by individual PVC and data format used, with the upper panel showing the actual duration of individual operations and the lower panel showing the transfer rate.

All panels have an overlay that displays the number of volume operations in progress. In addition, if VBR is used, the number of volumes involved in VBR upload sessions will be shown in a shaded area.

Grafana Alerts

Grafana can be used to create alerts so that you are notified moments after something unexpected happens in your system. An alert is generated by specifying a condition or evaluation criteria, and these conditions are configured using alert rules. Each rule uses a query that fetches data from a data source, and each query involves a metric such as the K10 metrics described in a previous section. For more information, see the Grafana Alerting documentation.

There are three main constructs that are involved while creating alerts in Grafana:

Alert Rules

The condition on which the alerts should be fired can be configured using alert rules.

A new alert rule can be created by going to the dashboard's edit option and then clicking on the Alert tab at the bottom of the page. In this example, it's assumed that a dashboard panel named Dashboard Local is already created.

../_images/edit_dashboard.png

Once there, the Create alert rule from this panel button can be used to set the query and alert condition for this alert rule. Configure the datasource that should be used in this alert and the metric that should be queried.

In this example, datasource Prometheus and metric action_backup_ended_overall were used.

../_images/create_alert_rule.png
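As an illustration of what the rule's query might look like, the sketch below counts failed backups over the past hour using the same metric. The one-hour window and the choice of the failed state are assumptions for this example; the threshold itself (for instance, "is above 0") is set in the alert condition.

```promql
# Number of backups that ended in the failed state during the past hour.
sum(round(increase(action_backup_ended_overall{state="failed"}[1h])))
```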

After setting the query and alert condition, the labels of this alert rule can be configured by scrolling down the same page to the Notifications options.

Labels are useful to configure where these alerts are going to be sent.

../_images/alert_rule_labels.png

In this example, the labels team:operations and resource:backup have been used.

Click on Save and Exit to save the dashboard with this alert rule and exit.

Contact Points

Contact points are used to configure the communication medium for the alerts that are going to be generated. For example, in some scenarios, it might be useful to get a Slack message as soon as an alert is fired. In that case, Slack must be configured as a contact point. To see a list of all the contact point types, refer to the Grafana documentation.

A contact point can be configured by going to the Alerting dashboard and then clicking on New contact point under the Contact points tab. In the example below, Slack has been chosen as the contact point type.

../_images/new_contact_point.png

Notification Policies

Once the alert rules and contact points have been configured, the relationship between these two configurations is established by creating a notification policy.

A notification policy can be configured by going to the Alerting dashboard and then clicking on New specific policy under the Notification policies tab.

The example below uses the same labels specified while creating the alert rule in the previous step.

../_images/new_notification_policy.png

When an alert is generated based on the configured rule, notifications will be sent to the Slack channel.

Integrating External Prometheus with K10

To integrate an external Prometheus with K10, set the flags global.prometheus.external.host and global.prometheus.external.port. If the external Prometheus is set up with a base URL, also set the global.prometheus.external.baseURL flag. Make sure RBAC was enabled when setting up the external Prometheus so that target discovery works.

It's also possible to disable Kasten's built-in Prometheus by setting the flag prometheus.server.enabled to false.

Scrape Config

Update the Prometheus scrape configuration by adding two additional targets.

- job_name: httpServiceDiscovery
  http_sd_configs:
    - url: http://metering-svc.kasten-io.svc.cluster.local:8000/v0/listScrapeTargets
- job_name: k10-pods
  scheme: http
  metrics_path: /metrics
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        own_namespace: true
      selectors:
        - role: pod
          label: "component=executor"
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      action: keep
      regex: 8\d{3}

If Prometheus was installed with K10, these targets can also be obtained from the configuration of K10's Prometheus. In that case, skip the job named prometheus. (Note: the yq utility is required to run the command below.)

# Get prometheus job
kubectl get cm k10-k10-prometheus-config -n kasten-io -o "jsonpath={.data['prometheus\.yml']}" | yq '.scrape_configs'

# Update prometheus configmap with given output.

The targets will show up after adding the scrape config. Note that the targets will not be scraped until a network policy is added.

../_images/external-prom-service-down.png

Network Policy

Once the scrape config is in place, the targets will be discovered but Prometheus won't be able to scrape them as K10 has strict network policies for inter-service communication. To enable communication between external Prometheus and K10, a new network policy should be added as follows.

Add a label to the namespace where the external Prometheus is installed (kubectl label namespace/prometheus app=prometheus), then apply the following network policy to enable communication.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app: k10
    heritage: Helm
    release: k10
  name: allow-external-prometheus
spec:
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              app: prometheus
  podSelector:
    matchLabels:
      release: k10

Once the network policy enables communication, all the service targets will start coming up and the metrics will be scraped.

../_images/external-prom-service-up.png

Generating Reports

K10 Reporting provides regular insights into key performance and operational states of the system. It uses Prometheus to obtain information about action runs and storage consumption. For more information about K10 Reporting, see Reporting.

Integration with External Tools