Configuring Kasten ACM Alerts
This guide shows how to deploy Prometheus alerting for Veeam Kasten on OpenShift clusters managed by Red Hat Advanced Cluster Management (RHACM) with the MultiCluster Observability (MCO) add-on. Alerts route to ACM Observe → Alerts automatically once applied.
For more information on Prometheus alerts and on using alerts in Red Hat ACM, see the Prometheus and Red Hat ACM documentation.
What You Get
| Alert | Severity | Fires when |
|---|---|---|
| KastenBackupJobFailed | warning | A backup job fails or is cancelled (triggers once per event) |
| KastenBackupJobsWithFailures | warning | Unresolved failed backup actions exist in K10 catalog (persists until resolved) |
| KastenBackupJobSkipped | warning | Skipped backup actions exist in K10 catalog (typically indicates a mirror backup job did not run) |
| KastenBackupDurationHigh | warning | Average backup duration exceeds 130% of the 1-hour baseline |
| KastenRetireJobFailed | warning | A retire job fails (triggers once per event) |
| KastenPVNearFullWarning | warning | K10 catalog or jobs PVC exceeds 50% capacity |
| KastenPVNearFullCritical | critical | K10 catalog or jobs PVC exceeds 85% capacity |
Prerequisites
On each cluster (hub and managed):
1. Kasten installed in the kasten-io namespace.

2. User Workload Monitoring enabled. Check with:

   oc get configmap cluster-monitoring-config -n openshift-monitoring -o jsonpath='{.data.config\.yaml}'

   If enableUserWorkload: true is not present, enable it:

   oc -n openshift-monitoring patch configmap cluster-monitoring-config --type merge -p '{"data":{"config.yaml":"enableUserWorkload: true\n"}}'

   This merge patch replaces the entire config.yaml value. If the ConfigMap already carries other settings, add enableUserWorkload: true with oc edit instead.

   Wait ~60 seconds for the openshift-user-workload-monitoring pods to start.

   Info: Enabling User Workload Monitoring alone does NOT create an Alertmanager pod in openshift-user-workload-monitoring. The UWM Alertmanager must be explicitly enabled separately. No action is needed here: Prerequisite 4 below applies uwm-alertmanager-config.yaml, which handles both enabling the Alertmanager and configuring routing in a single step.

3. Logged in with cluster-admin or a role that can create PrometheusRules and ServiceMonitors in kasten-io.

4. MCO Hub Alertmanager routing configured on every cluster (hub and managed). UWM Prometheus routes alerts through UWM Alertmanager, not the main OCP Alertmanager. To reach ACM Observe → Alerts, UWM Alertmanager on each cluster must be configured to forward to the MCO hub Alertmanager.

   a. Copy the MCO auth secrets from openshift-monitoring to openshift-user-workload-monitoring on each cluster:

      oc get secret observability-alertmanager-accessor -n openshift-monitoring -o yaml | sed 's/namespace: openshift-monitoring/namespace: openshift-user-workload-monitoring/' | oc apply -f -
      oc get secret hub-alertmanager-router-ca -n openshift-monitoring -o yaml | sed 's/namespace: openshift-monitoring/namespace: openshift-user-workload-monitoring/' | oc apply -f -

   b. Download uwm-alertmanager-config.yaml and replace <HUB_ALERTMANAGER_HOSTNAME> with the MCO Alertmanager hostname. This value comes from the hub cluster: run this once on the hub, then reuse the same hostname for every managed cluster:

      oc get route alertmanager -n open-cluster-management-observability -o jsonpath='{.spec.host}'

      Warning: Do not run the route lookup on a managed cluster; open-cluster-management-observability only exists on the hub. Managed clusters use the same hub hostname value.

      Download the config file, edit it to set <HUB_ALERTMANAGER_HOSTNAME>, then apply on each cluster (switching context as needed):

      curl -sO https://docs.kasten.io/downloads/8.5.9/prometheus/alerts/acm/uwm-alertmanager-config.yaml
      # Edit the file and replace <HUB_ALERTMANAGER_HOSTNAME>
      oc apply -f uwm-alertmanager-config.yaml -n openshift-user-workload-monitoring

      Info: This file does two things in one apply: it enables the UWM Alertmanager (alertmanager: enabled: true) and configures it to forward to the MCO hub Alertmanager. Both are required, and applying it once covers both. It is safe to re-apply if already done. If user-workload-monitoring-config already exists in openshift-user-workload-monitoring, use oc edit or merge the content into the existing config.yaml key rather than overwriting it.

      Wait ~60 seconds for the UWM Alertmanager pods to start:

      oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=alertmanager
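The placeholder edit for uwm-alertmanager-config.yaml described above can be scripted so the same hub hostname is applied consistently on every cluster. The following is a minimal sketch, not part of the Kasten tooling: the function name `fill_hub_hostname` and the one-line sample text are hypothetical, and the real file contains more structure than the sample shows.

```python
PLACEHOLDER = "<HUB_ALERTMANAGER_HOSTNAME>"


def fill_hub_hostname(config_text, hostname):
    """Return config_text with the placeholder replaced by a bare hostname."""
    # The route lookup prints a bare host, so reject pasted URLs or blanks.
    if not hostname.strip() or "://" in hostname or "/" in hostname:
        raise ValueError(f"expected a bare hostname, got {hostname!r}")
    if PLACEHOLDER not in config_text:
        raise ValueError("placeholder not found; the file may already be edited")
    return config_text.replace(PLACEHOLDER, hostname.strip())


# Hypothetical one-line stand-in; the real file has more structure.
sample = "host: <HUB_ALERTMANAGER_HOSTNAME>"
filled = fill_hub_hostname(sample, "alertmanager.apps.hub.example.com")
print(filled)  # host: alertmanager.apps.hub.example.com
```

Raising an error when the placeholder is missing guards against silently re-applying a copy that was already edited for a different hub.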
Download Files
Download all files before starting. The ServiceMonitor and alert rule files are applied to every cluster running Kasten (hub and each managed cluster):
curl -sO https://docs.kasten.io/downloads/8.5.9/prometheus/alerts/acm/kasten-k10-acm-servicemonitor.yaml
curl -sO https://docs.kasten.io/downloads/8.5.9/prometheus/alerts/acm/kasten-k10-acm-alerts-k10.yaml
curl -sO https://docs.kasten.io/downloads/8.5.9/prometheus/alerts/acm/kasten-k10-acm-alerts-infra.yaml
| File | Purpose | Apply namespace |
|---|---|---|
| kasten-k10-acm-servicemonitor.yaml | Federates Kasten-native metrics into UWM Prometheus | kasten-io |
| kasten-k10-acm-alerts-k10.yaml | PrometheusRule for Kasten-native alerts (backup, retire) | kasten-io |
| kasten-k10-acm-alerts-infra.yaml | PrometheusRule for infrastructure alerts (PV usage) | kasten-io |
Apply Order
The ServiceMonitor and both alert rule files are applied to every cluster running Kasten (hub and each managed cluster). The dashboard ConfigMap is the only hub-only resource — it lives in open-cluster-management-observability, not kasten-io, and is covered separately in Deploying the Kasten ACM Dashboard.
Repeat Steps 1–4 below on the hub cluster and then on each managed cluster.
Before applying the ServiceMonitor, edit kasten-k10-acm-servicemonitor.yaml and set the replacement field in metricRelabelings to the ACM managed cluster name for the cluster you are currently targeting. The placeholder is REPLACE_WITH_ACM_CLUSTER_NAME on a fresh copy — but if you have previously used this file on another cluster, the placeholder will already be substituted with that cluster's name. Always verify the current value before applying, not just that the placeholder is gone.
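One way to catch a stale cluster name is to compare the replacement value in the file against the cluster you are about to target. The sketch below is illustrative only: the helper names are hypothetical, the sample manifest fragment (including `targetLabel: cluster`) is an assumption rather than the real file contents, and the regular expression assumes the value sits on the same line as the `replacement:` key.

```python
import re

PLACEHOLDER = "REPLACE_WITH_ACM_CLUSTER_NAME"
# Matches "replacement: value" lines, with or without a leading list dash.
_REPLACEMENT = re.compile(
    r"^[ \t]*(?:-[ \t]*)?replacement:[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def replacement_values(manifest_text):
    """Return every value assigned to a replacement key, quotes stripped."""
    return [v.strip("\"'") for v in _REPLACEMENT.findall(manifest_text)]


def targets_cluster(manifest_text, expected_name):
    """True only if the expected name is set and no placeholder remains."""
    values = replacement_values(manifest_text)
    return expected_name in values and PLACEHOLDER not in values


# Hypothetical fragment standing in for the real ServiceMonitor file.
fresh = (
    "metricRelabelings:\n"
    "  - targetLabel: cluster\n"
    "    replacement: REPLACE_WITH_ACM_CLUSTER_NAME\n"
)
reused = fresh.replace(PLACEHOLDER, "managed-a")

print(targets_cluster(fresh, "managed-b"))   # False: placeholder still present
print(targets_cluster(reused, "managed-b"))  # False: file names another cluster
print(targets_cluster(reused, "managed-a"))  # True
```

The second case is the one the paragraph above warns about: the placeholder is gone, yet the file still points at a different cluster.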
Step 1 — Apply the ServiceMonitor (Kasten metric federation into UWM):
oc apply -f kasten-k10-acm-servicemonitor.yaml -n kasten-io
Verify the object was accepted by the API server:
oc get servicemonitor kasten-k10-federation -n kasten-io
Expected: NAME / AGE line with kasten-k10-federation. This only confirms the manifest was accepted — use the Verification section below to confirm UWM is actually scraping Kasten metrics.
Step 2 — Apply the Kasten-native alert rules:
oc apply -f kasten-k10-acm-alerts-k10.yaml -n kasten-io
Step 3 — Apply the infrastructure alert rules (kubelet PV metrics):
oc apply -f kasten-k10-acm-alerts-infra.yaml -n kasten-io
Step 4 — Verify the rules were accepted:
oc get prometheusrule -n kasten-io
Expected output:
NAME AGE
kasten-k10-acm-alerts-infra <age>
kasten-k10-acm-alerts-k10 <age>
How the Two Alert Files Work Differently
Understanding this prevents confusion when troubleshooting routing.
kasten-k10-acm-alerts-k10.yaml — UWM Prometheus scope
Has the label openshift.io/prometheus-rule-evaluation-scope: "leaf-prometheus".
Kasten Prometheus (kasten-io)
→ ServiceMonitor federation → UWM Prometheus (openshift-user-workload-monitoring)
→ evaluates KastenBackupJobFailed, KastenBackupJobsWithFailures,
KastenBackupJobSkipped, KastenBackupDurationHigh, KastenRetireJobFailed
→ UWM Alertmanager → MCO Hub AM (via additionalAlertmanagerConfigs)
→ ACM Observe → Alerts
Kasten metrics (action_backup_ended_overall, catalog_actions_count, snapshot_storage_size_bytes, metering_pvc_size, catalog_storage_artifact_count) live only in Kasten's bundled Prometheus. The ServiceMonitor federates them into UWM Prometheus so these rules can evaluate them.
kasten-k10-acm-alerts-infra.yaml — Thanos Ruler scope
Has no prometheus-rule-evaluation-scope label.
OCP kubelet scrape → OCP Thanos Querier (built-in)
→ Thanos Ruler (openshift-user-workload-monitoring)
→ evaluates KastenPVNearFullWarning, KastenPVNearFullCritical
→ main OCP Alertmanager (openshift-monitoring)
→ MCO Hub AM (via MCO's own additionalAlertmanagerConfigs)
→ ACM Observe → Alerts
Kubelet metrics (kubelet_volume_stats_*) are already in OCP's monitoring stack. No ServiceMonitor needed.
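The routing difference between the two files comes down to a single label on the PrometheusRule. As a rough illustration (the function name is hypothetical, and the mapping reflects only the behavior described in this section), the decision can be written as:

```python
SCOPE_LABEL = "openshift.io/prometheus-rule-evaluation-scope"


def evaluator_for(rule_labels):
    """Name the component that evaluates a rule, based on its labels."""
    if rule_labels.get(SCOPE_LABEL) == "leaf-prometheus":
        # Evaluated directly by UWM Prometheus (the k10 alerts file).
        return "UWM Prometheus"
    # No scope label: evaluated by Thanos Ruler (the infra alerts file).
    return "Thanos Ruler"


k10_labels = {SCOPE_LABEL: "leaf-prometheus"}
infra_labels = {}

print(evaluator_for(k10_labels))    # UWM Prometheus
print(evaluator_for(infra_labels))  # Thanos Ruler
```

When a rule group is missing during verification, checking this label first tells you which component to query.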
Verification
Run all checks below on each cluster (hub and every managed cluster) after applying. Switch oc context between clusters as needed.
Check that Kasten metrics are being scraped by UWM Prometheus (~2 minutes after apply):
oc exec -n openshift-user-workload-monitoring $(oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}') -- sh -c 'curl -sg "http://localhost:9090/api/v1/query?query=action_backup_ended_overall" | python3 -c "import json,sys; d=json.load(sys.stdin); print(len(d[\"data\"][\"result\"]), \"series found\")"'
Expected: 1 series found (or more on multi-policy clusters). If 0: the ServiceMonitor hasn't scraped yet, or Kasten's prometheus-server pod isn't running.
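The inline python in the command above counts the entries in the query response. A more readable sketch of the same parsing, which also separates a failed query from an empty result, might look like the following. The function name and the sample payloads are illustrative; the envelope (`status`, `data.result`) follows the Prometheus HTTP API query response format.

```python
import json


def count_series(response_text):
    """Return the number of series in a Prometheus instant-query response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        # A failed query is not the same as zero series; surface the error.
        raise RuntimeError(f"query failed: {body.get('error', 'unknown error')}")
    return len(body["data"]["result"])


# Trimmed, illustrative payloads (not captured from a real cluster).
one_series = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "action_backup_ended_overall"},
             "value": [1700000000, "3"]}
        ],
    },
})
no_series = json.dumps(
    {"status": "success", "data": {"resultType": "vector", "result": []}}
)

print(count_series(one_series), "series found")  # 1 series found
print(count_series(no_series), "series found")   # 0 series found
```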
Check that PrometheusRules are loaded into UWM Prometheus:
oc exec -n openshift-user-workload-monitoring $(oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}') -- sh -c 'curl -sg "http://localhost:9090/api/v1/rules" | python3 -c "import json,sys; [print(g[\"name\"]) for g in json.load(sys.stdin)[\"data\"][\"groups\"] if \"kasten\" in g[\"name\"]]"'
Expected: kasten.backup printed.
Check that the infra rules are loaded into Thanos Ruler:
oc exec -n openshift-user-workload-monitoring $(oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=thanos-ruler -o jsonpath='{.items[0].metadata.name}') -- sh -c 'curl -sg "http://localhost:10902/api/v1/rules" | python3 -c "import json,sys; [print(g[\"name\"]) for g in json.load(sys.stdin)[\"data\"][\"groups\"] if \"kasten\" in g[\"name\"]]"'
Expected: kasten.pv printed.
Trigger a test alert: run a backup policy and let it fail (or cancel a running job), then check that the alert has reached UWM Alertmanager:
oc exec -n openshift-user-workload-monitoring $(oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=alertmanager -o jsonpath='{.items[0].metadata.name}') -c alertmanager -- sh -c 'curl -sg "http://localhost:9093/api/v2/alerts" | python3 -c "import json,sys; [print(a[\"labels\"][\"alertname\"]) for a in json.load(sys.stdin) if \"Kasten\" in a[\"labels\"].get(\"alertname\",\"\")]"'
Note: Both the RHOS Alerting UI and ACM Observe → Alerts default to showing Platform alerts only. Kasten alerts are not platform alerts, so uncheck the Platform filter to see them. This filter resets on every page refresh.
Tuning
KastenBackupDurationHigh threshold: The default of 130% (30% above baseline) is a starting point. To raise or lower the sensitivity, edit the 1.3 multiplier in kasten-k10-acm-alerts-k10.yaml. The baseline window is 1 hour — environments with highly variable backup schedules may benefit from a longer window (e.g., [6h]).
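To see what a given multiplier means in practice, the arithmetic can be worked through directly. This sketch only models the comparison described above (current average versus baseline times multiplier); it does not reproduce the actual rule expression in kasten-k10-acm-alerts-k10.yaml, and the function names are hypothetical.

```python
def duration_threshold(baseline_seconds, multiplier=1.3):
    """Duration above which the alert condition would hold."""
    return baseline_seconds * multiplier


def is_duration_high(current_seconds, baseline_seconds, multiplier=1.3):
    """True when the current average exceeds baseline * multiplier."""
    return current_seconds > duration_threshold(baseline_seconds, multiplier)


baseline = 600  # a 10-minute average backup, in seconds

# With the default 1.3 multiplier the threshold is about 780 seconds.
print(is_duration_high(800, baseline))       # True
print(is_duration_high(700, baseline))       # False
# Raising the multiplier to 1.5 moves the threshold to about 900 seconds.
print(is_duration_high(800, baseline, 1.5))  # False
```

A larger multiplier makes the alert less sensitive; a smaller one makes it fire on shorter slowdowns.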
KastenPVNearFullWarning / KastenPVNearFullCritical thresholds: Defaults are 50% (warning) and 85% (critical). Adjust based on your PVC size and data growth rate. Smaller PVCs (for example, 10Gi) with rapid growth should alert at lower thresholds.
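The effect of the PV thresholds on a small volume can be checked with a short calculation. The sketch below assumes usage is computed as used bytes divided by capacity bytes and that each threshold is a strict "exceeds" comparison; the exact expression in kasten-k10-acm-alerts-infra.yaml may differ, and the function name is hypothetical.

```python
GIB = 1024 ** 3


def pv_severity(used_bytes, capacity_bytes, warning_pct=50.0, critical_pct=85.0):
    """Return the highest severity whose threshold is exceeded, or None."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    used_pct = 100.0 * used_bytes / capacity_bytes
    if used_pct > critical_pct:
        return "critical"
    if used_pct > warning_pct:
        return "warning"
    return None


# A 10Gi PVC: 6Gi used is 60%, 9Gi used is 90%.
print(pv_severity(6 * GIB, 10 * GIB))  # warning
print(pv_severity(9 * GIB, 10 * GIB))  # critical
print(pv_severity(4 * GIB, 10 * GIB))  # None
```

On a 10Gi volume the gap between the warning and critical defaults is only 3.5Gi, which is why fast-growing small PVCs may need lower thresholds.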
Troubleshooting
| Symptom | Check |
|---|---|
| jsonpath … array index out of bounds on the alertmanager exec command | UWM Alertmanager not enabled: uwm-alertmanager-config.yaml must include alertmanager: enabled: true. Apply the config, wait ~60s for the pod to start. |
| KeyError: 'data' / "The Alertmanager v1 API was deprecated … removed as of version 0.28.0" | Use the v2 API: replace /api/v1/alerts with /api/v2/alerts and remove ["data"] from the python snippet, because v2 returns a bare array. |
| Rules applied but no series after 5 min | oc get servicemonitor kasten-k10-federation -n kasten-io exists? Kasten's prometheus-server pod running? |
| kasten.backup group missing from UWM Prometheus | Missing leaf-prometheus label on PrometheusRule, or UWM not enabled |
| kasten.pv group missing from Thanos Ruler | Thanos Ruler pod not running (oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=thanos-ruler) |
| Alerts firing locally but not in ACM Observe | additionalAlertmanagerConfigs not configured, or secrets not copied |
| KastenBackupJobsWithFailures never clears | Expected: it stays firing until Kasten retires the failed action. Retire it from the Kasten dashboard. |
| increase() alert fires once then stops | Correct behavior: KastenBackupJobFailed is a momentary trigger. KastenBackupJobsWithFailures is the persistent companion. |
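The KeyError row above stems from the difference in response shape: the v2 alerts endpoint returns a bare JSON array, so there is no "data" key to index. A small parsing sketch that expects the v2 shape and fails with a clearer message otherwise could look like this. The function name and sample payloads are illustrative, and the non-list payload is a stand-in rather than an exact copy of any Alertmanager response.

```python
import json


def kasten_alert_names(response_text):
    """Return sorted, de-duplicated Kasten alert names from a v2 response."""
    alerts = json.loads(response_text)
    if not isinstance(alerts, list):
        # Anything other than a bare array is not a v2 /api/v2/alerts reply.
        raise ValueError("expected a JSON array; check that the URL uses /api/v2/alerts")
    names = {a.get("labels", {}).get("alertname", "") for a in alerts}
    return sorted(n for n in names if "Kasten" in n)


# Trimmed, illustrative v2 payload: real alerts carry more fields.
v2_payload = json.dumps([
    {"labels": {"alertname": "KastenBackupJobFailed", "severity": "warning"}},
    {"labels": {"alertname": "Watchdog", "severity": "none"}},
])

print(kasten_alert_names(v2_payload))  # ['KastenBackupJobFailed']
```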