Monitoring
Veeam Kasten enables centralized monitoring of all its activity by integrating with Prometheus. In particular, it exposes a Prometheus endpoint from which a central system can extract data. This section documents how to use the built-in Prometheus instance, and describes the metrics currently exposed by Veeam Kasten.
Metrics can also be exported to external or on-cluster monitoring platforms for unified visibility across multiple clusters:
- General (Grafana, Datadog, Thanos, etc.): See Exporting Metrics to External Monitoring Systems.
- Red Hat OpenShift: See Red Hat ACM Observability.
Using Veeam Kasten's Prometheus Endpoint
By default, Prometheus is configured with a persistent storage size of 8Gi
and a retention period of 30d. Both values can be changed with
--set prometheus.server.persistentVolume.size=<size> and
--set prometheus.server.retention=<days>.
Prometheus requires Kubernetes API access to discover Veeam Kasten pods
and scrape their metrics. Thus, by default, Role and RoleBinding
entries are created in the Veeam Kasten namespace. However, if you set
prometheus.rbac.create=true, a global ClusterRole and
ClusterRoleBinding will be created instead.
The complete list of configurable parameters can be found at Advanced Install Options.
If you don't want Helm to create RBAC resources automatically
and have set both rbac.create=false and
prometheus.rbac.create=false, you can create the Role and RoleBinding
manually:
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: k10-prometheus-server
  namespace: kasten-io
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics
  - services
  - endpoints
  - pods
  - ingresses
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  - networking.k8s.io
  resources:
  - ingresses/status
  - ingresses
  verbs:
  - get
  - list
  - watch
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: k10-prometheus-server
  namespace: kasten-io
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: k10-prometheus-server
subjects:
- kind: ServiceAccount
  name: prometheus-server
  namespace: kasten-io
Veeam Kasten Metrics
When using Veeam Kasten Multi-Cluster Manager (i.e., a
cluster set up as a primary), querying metrics for the primary cluster
from its Prometheus instance requires a cluster label with a blank
value ("").
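As a minimal sketch of the blank cluster label, the Python helper below builds a PromQL selector that always carries cluster="". The function name primary_cluster_selector is hypothetical and used only for illustration; the metric name and state label are taken from the query examples in this section.

```python
def primary_cluster_selector(metric, labels=None):
    """Build a PromQL selector for `metric` that targets the primary cluster.

    Hypothetical helper: it only formats a string and does not contact
    Prometheus. The blank `cluster` label is forced even if the caller
    passes a different value.
    """
    merged = dict(labels or {})
    merged["cluster"] = ""  # blank value selects the primary cluster
    body = ",".join(f'{key}="{value}"' for key, value in sorted(merged.items()))
    return f"{metric}{{{body}}}"


print(primary_cluster_selector("action_backup_ended_overall", {"state": "succeeded"}))
# prints: action_backup_ended_overall{cluster="",state="succeeded"}
```

The resulting selector can be wrapped in increase or rate like any other counter selector.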
Veeam Kasten Action Metrics
When Veeam Kasten performs various actions throughout the system, it collects metrics associated with these actions. It records counts for both cluster and application-specific actions.
These action metrics include labels that describe the context of the
action. For actions specific to an application, the application name is
included as app. For actions initiated by a policy, the policy name is
included as policy. For ended actions, the final status is included as
state.
Action States
Veeam Kasten actions progress through various states during their lifecycle. The following table describes each state and its corresponding metric label value:
| Action State | Metric Label | Description |
|---|---|---|
| Pending | pending | Action has been created but not yet started |
| Running | running | Action has been validated and is currently executing |
| AttemptFailed | attempt_failed | At least one action phase needs to retry |
| Failed | failed | Action has failed (at least one phase failed permanently) |
| Complete | succeeded | Action has completed successfully or with exceptions |
| Cancelled | cancelled | Action was cancelled before completion |
| Skipped | skipped | Action has been skipped |
| Deleting | deleting | Action is being deleted |
Separate metrics are collected for the number of times the action was
started, ended, or skipped. This is indicated by the suffix of the
metric (i.e., _started_count, _ended_count, or _skipped_count).
An overall set of metrics is also collected that does not include the
app or policy labels. These metrics end with _overall rather than
_count. It is recommended to use the overall metrics unless specific
application or policy information is required.
Metrics are collected for the following actions:
- backup and backup_cluster
- restore and restore_cluster
- export
- import
- report
- run
For example, to query the number of successful backups in the past 24 hours:
sum(round(increase(action_backup_ended_overall{state="succeeded"}[24h])))
Or, to query the number of failed restores for the past hour:
sum(round(increase(action_restore_ended_overall{state="failed"}[1h])))
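The naming pattern and the two queries above can be generated programmatically. The following Python sketch is illustrative only: the helper names action_metric and ended_count_query are hypothetical, and metric names for actions other than backup and restore are extrapolated from the documented pattern rather than confirmed individually.

```python
# Actions and events as listed in this section.
ACTIONS = ("backup", "backup_cluster", "restore", "restore_cluster",
           "export", "import", "report", "run")
EVENTS = ("started", "ended", "skipped")


def action_metric(action, event, overall=True):
    """Compose a metric name such as action_backup_ended_overall.

    Assumption: every action follows the action_<name>_<event>_<suffix>
    pattern seen in the documented backup and restore examples.
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    suffix = "overall" if overall else "count"
    return f"action_{action}_{event}_{suffix}"


def ended_count_query(action, state, window):
    """Build a PromQL query counting ended actions in `state` over `window`."""
    metric = action_metric(action, "ended")
    return f'sum(round(increase({metric}{{state="{state}"}}[{window}])))'


print(ended_count_query("backup", "succeeded", "24h"))
print(ended_count_query("restore", "failed", "1h"))
```

Passing overall=False to action_metric yields the per-application or per-policy variant ending in _count.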
When querying metrics that are reported as counters, such as action
metrics, the increase or rate functions must be used. See
Prometheus query functions for more information.
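To illustrate why a raw counter value is not directly useful, the Python sketch below computes a simplified increase over a series of counter samples. It is not Prometheus's actual implementation: the real increase function also extrapolates to the window boundaries, which this sketch omits. The sample values are made up for illustration.

```python
def simple_increase(samples):
    """Sum the growth between consecutive counter samples.

    A drop in value is treated as a counter reset (for example, after a
    pod restart), so the new value itself counts as growth since the reset.
    Simplified: no extrapolation to the window boundaries.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter restarted from zero
    return total


# Hypothetical samples: the counter reset between 15 and 2.
samples = [10, 12, 15, 2, 4]
print(simple_increase(samples))  # prints 9.0
```

Reading only the last sample (4) would miss the five actions recorded before the reset, which is why counters are queried through increase or rate.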
Examples of Action Metrics
- action_export_processed_bytes: The overall bytes processed during the export. Labels: policy, app
- action_export_transferred_bytes: The overall bytes transferred during the export. Labels: policy, app
See the Prometheus docs for more information on how to query data from Prometheus.
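As a hedged sketch of querying programmatically, the Python snippet below builds a request URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query). The base URL is an assumption made for illustration; substitute the address at which the Prometheus instance is reachable in your environment. The snippet only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Return the URL for a Prometheus instant query.

    `base_url` is a placeholder supplied by the caller; the query string is
    percent-encoded so that braces, quotes, and brackets survive transport.
    """
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


# Hypothetical base URL, for illustration only.
BASE_URL = "http://localhost:9090"
QUERY = 'sum(round(increase(action_backup_ended_overall{state="succeeded"}[24h])))'

print(instant_query_url(BASE_URL, QUERY))
```

The URL can then be fetched with any HTTP client; the API returns the result as JSON.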