Version: 9.0.2 (latest)

Configuring Kasten ACM Alerts

This guide shows how to deploy Prometheus alerting for Veeam Kasten on OpenShift clusters managed by Red Hat Advanced Cluster Management (RHACM) with the MultiCluster Observability (MCO) add-on. Once the resources described here are applied, alerts are routed automatically to ACM Observe → Alerts.

For more information on Prometheus Alerts and using alerts in Red Hat ACM, please visit:

What You Get

Alert	Severity	Fires when
`KastenBackupJobFailed`	warning	A backup job fails or is cancelled (triggers once per event)
`KastenBackupJobsWithFailures`	warning	Unresolved failed backup actions exist in K10 catalog (persists until resolved)
`KastenBackupJobSkipped`	warning	Skipped backup actions exist in K10 catalog (typically indicates a mirror backup job did not run)
`KastenBackupDurationHigh`	warning	Average backup duration exceeds 130% of the 1-hour baseline
`KastenRetireJobFailed`	warning	A retire job fails (triggers once per event)
`KastenPVNearFullWarning`	warning	K10 catalog or jobs PVC exceeds 50% capacity
`KastenPVNearFullCritical`	critical	K10 catalog or jobs PVC exceeds 85% capacity

Prerequisites

On each cluster (hub and managed):

Kasten installed in the kasten-io namespace.
User Workload Monitoring enabled. Check with:
```
oc get configmap cluster-monitoring-config -n openshift-monitoring -o jsonpath='{.data.config\.yaml}'
```
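If enableUserWorkload: true is not present, enable it. This merge patch replaces the entire config.yaml key, so if the ConfigMap already contains other settings, add the line with oc edit instead of running the patch: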
```
oc -n openshift-monitoring patch configmap cluster-monitoring-config --type merge -p '{"data":{"config.yaml":"enableUserWorkload: true\n"}}'
```
Wait ~60 seconds for the openshift-user-workload-monitoring pods to start.

info

Enabling User Workload Monitoring alone does not create an Alertmanager pod in openshift-user-workload-monitoring; the UWM Alertmanager must be enabled separately. No action is needed at this point: the MCO Hub Alertmanager routing prerequisite below applies uwm-alertmanager-config.yaml, which both enables the Alertmanager and configures routing in a single step.
Logged in with cluster-admin or a role that can create PrometheusRules and ServiceMonitors in kasten-io.
MCO Hub Alertmanager routing configured on every cluster (hub and managed). UWM Prometheus routes alerts through UWM Alertmanager, not the main OCP Alertmanager. To reach ACM Observe → Alerts, UWM Alertmanager on each cluster must be configured to forward to the MCO hub Alertmanager.

a. Copy the MCO auth secrets from openshift-monitoring to openshift-user-workload-monitoring on each cluster:
```
oc get secret observability-alertmanager-accessor -n openshift-monitoring -o yaml | sed 's/namespace: openshift-monitoring/namespace: openshift-user-workload-monitoring/' | oc apply -f -
oc get secret hub-alertmanager-router-ca -n openshift-monitoring -o yaml | sed 's/namespace: openshift-monitoring/namespace: openshift-user-workload-monitoring/' | oc apply -f -
```
b. Look up the MCO Alertmanager hostname, which replaces <HUB_ALERTMANAGER_HOSTNAME> in uwm-alertmanager-config.yaml. This value comes from the hub cluster: run the command once on the hub, then reuse the same hostname for every managed cluster:
```
oc get route alertmanager -n open-cluster-management-observability -o jsonpath='{.spec.host}'
```
warning

Do not run the route lookup on a managed cluster — open-cluster-management-observability only exists on the hub. Managed clusters use the same hub hostname value.

Download the config file, edit it to set <HUB_ALERTMANAGER_HOSTNAME>, then apply on each cluster (switching context as needed):
```
curl -sO https://docs.kasten.io/downloads/9.0.2/prometheus/alerts/acm/uwm-alertmanager-config.yaml
# Edit the file and replace <HUB_ALERTMANAGER_HOSTNAME>
oc apply -f uwm-alertmanager-config.yaml -n openshift-user-workload-monitoring
```
info

This file does two things in one apply: enables the UWM Alertmanager (alertmanager: enabled: true) and configures it to forward to the MCO hub Alertmanager. Both are required — applying it once covers both. It is safe to re-apply if already done. If user-workload-monitoring-config already exists in openshift-user-workload-monitoring, use oc edit or merge the content into the existing config.yaml key rather than overwriting it.
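The manual edit can also be scripted. The following is a minimal sketch, assuming the placeholder appears literally as <HUB_ALERTMANAGER_HOSTNAME> in the downloaded file. The function name set_hub_hostname and the output file name are illustrative and not part of the Kasten tooling.

```shell
# Replace the <HUB_ALERTMANAGER_HOSTNAME> placeholder and print the result.
# Usage: set_hub_hostname <input-file> <hostname> > <output-file>
set_hub_hostname() (
  file="$1"
  host="$2"
  # Stop if the placeholder is missing, for example because the file was
  # already edited for a different hub.
  if ! grep -q '<HUB_ALERTMANAGER_HOSTNAME>' "$file"; then
    echo "placeholder not found in $file" >&2
    exit 1
  fi
  # '|' is the sed delimiter; hostnames do not contain it.
  sed "s|<HUB_ALERTMANAGER_HOSTNAME>|$host|g" "$file"
)

# Example (not executed here; uwm-alertmanager-config.rendered.yaml is a
# hypothetical output name):
# HUB_HOST=$(oc get route alertmanager -n open-cluster-management-observability -o jsonpath='{.spec.host}')
# set_hub_hostname uwm-alertmanager-config.yaml "$HUB_HOST" > uwm-alertmanager-config.rendered.yaml
```

Writing to a separate output file keeps the download pristine, so the same template can be reused if the hub hostname ever changes.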

Wait ~60 seconds for the UWM Alertmanager pods to start:
```
oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=alertmanager
```
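Rather than waiting a fixed interval, the readiness check can be polled. The sketch below is one possible approach; the retry helper is a hypothetical name, and the attempt count and delay are arbitrary starting values.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() (
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      exit 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  exit 1
)

# Example (not executed here): up to 12 attempts, 10 seconds apart.
# retry 12 10 oc wait pod -n openshift-user-workload-monitoring \
#   -l app.kubernetes.io/name=alertmanager --for=condition=Ready --timeout=5s
```

The wrapper is useful because oc wait returns an error immediately when no pods match the selector yet, which is the usual state right after the config is applied.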

Download Files

Download all files before starting. The ServiceMonitor and alert rule files are applied to every cluster running Kasten (hub and each managed cluster):

curl -sO https://docs.kasten.io/downloads/9.0.2/prometheus/alerts/acm/kasten-k10-acm-servicemonitor.yaml
curl -sO https://docs.kasten.io/downloads/9.0.2/prometheus/alerts/acm/kasten-k10-acm-alerts-k10.yaml
curl -sO https://docs.kasten.io/downloads/9.0.2/prometheus/alerts/acm/kasten-k10-acm-alerts-infra.yaml

File	Purpose	Apply namespace
`kasten-k10-acm-servicemonitor.yaml`	Federates Kasten-native metrics into UWM Prometheus	`kasten-io`
`kasten-k10-acm-alerts-k10.yaml`	PrometheusRule for Kasten-native alerts (backup, retire)	`kasten-io`
`kasten-k10-acm-alerts-infra.yaml`	PrometheusRule for infrastructure alerts (PV usage)	`kasten-io`

Apply Order

The ServiceMonitor and both alert rule files are applied to every cluster running Kasten (hub and each managed cluster). The dashboard ConfigMap is the only hub-only resource — it lives in open-cluster-management-observability, not kasten-io, and is covered separately in Deploying the Kasten ACM Dashboard.

Repeat Steps 1–4 below on the hub cluster and then on each managed cluster.

warning

Before applying the ServiceMonitor, edit kasten-k10-acm-servicemonitor.yaml and set the replacement field in metricRelabelings to the ACM managed cluster name for the cluster you are currently targeting. The placeholder is REPLACE_WITH_ACM_CLUSTER_NAME on a fresh copy — but if you have previously used this file on another cluster, the placeholder will already be substituted with that cluster's name. Always verify the current value before applying, not just that the placeholder is gone.
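The check described in this warning can be sketched as a small shell function. It assumes the value sits on a line of the form `replacement: <name>` (optionally quoted) in the ServiceMonitor file; the function name check_cluster_name and the cluster name in the example are hypothetical.

```shell
# Verify that the ServiceMonitor file carries the expected ACM cluster name.
# Usage: check_cluster_name <file> <expected-cluster-name>
check_cluster_name() (
  file="$1"
  expected="$2"
  if grep -q 'REPLACE_WITH_ACM_CLUSTER_NAME' "$file"; then
    echo "placeholder still present in $file" >&2
    exit 1
  fi
  # Assumption: the value is on a 'replacement: <name>' line, optionally quoted.
  if grep -Eq "^[[:space:]-]*replacement:[[:space:]]*[\"']?$expected[\"']?[[:space:]]*\$" "$file"; then
    echo "replacement is set to $expected"
  else
    echo "no replacement line matches $expected; found:" >&2
    grep -E 'replacement:' "$file" >&2
    exit 1
  fi
)

# Example (not executed here; prod-east is a hypothetical cluster name):
# check_cluster_name kasten-k10-acm-servicemonitor.yaml prod-east
```

The function fails both when the placeholder is untouched and when the file still holds another cluster's name, which covers the reuse case described above.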

Step 1 — Apply the ServiceMonitor (Kasten metric federation into UWM):

oc apply -f kasten-k10-acm-servicemonitor.yaml -n kasten-io

Verify the object was accepted by the API server:

oc get servicemonitor kasten-k10-federation -n kasten-io

Expected: a row listing kasten-k10-federation under the NAME and AGE columns. This only confirms the manifest was accepted; use the Verification section below to confirm UWM is actually scraping Kasten metrics.

Step 2 — Apply the Kasten-native alert rules:

oc apply -f kasten-k10-acm-alerts-k10.yaml -n kasten-io

Step 3 — Apply the infrastructure alert rules (kubelet PV metrics):

oc apply -f kasten-k10-acm-alerts-infra.yaml -n kasten-io

Step 4 — Verify the rules were accepted:

oc get prometheusrule -n kasten-io

Expected output:

NAME                          AGE
kasten-k10-acm-alerts-infra   <age>
kasten-k10-acm-alerts-k10     <age>
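When several clusters are involved, Steps 1 to 4 can be wrapped in a function and called once per cluster. This is a sketch under stated assumptions: the downloaded ServiceMonitor file is still pristine (it contains REPLACE_WITH_ACM_CLUSTER_NAME), and each cluster has its own oc context. The function name, the rendered file name, and the context and cluster names in the example are hypothetical.

```shell
# Render the ServiceMonitor for one cluster, then apply all three manifests.
# Usage: apply_kasten_alerting <oc-context> <acm-cluster-name>
# Set OC=echo for a dry run that only prints the commands.
apply_kasten_alerting() (
  ctx="$1"
  cluster="$2"
  oc_bin="${OC:-oc}"
  rendered="servicemonitor-$cluster.yaml"
  # Render from the pristine download so that a name used for a previous
  # cluster cannot leak into this one.
  sed "s|REPLACE_WITH_ACM_CLUSTER_NAME|$cluster|g" \
    kasten-k10-acm-servicemonitor.yaml > "$rendered" || exit 1
  for f in "$rendered" kasten-k10-acm-alerts-k10.yaml kasten-k10-acm-alerts-infra.yaml; do
    "$oc_bin" --context "$ctx" apply -f "$f" -n kasten-io || exit 1
  done
  # Step 4: list the accepted rules.
  "$oc_bin" --context "$ctx" get prometheusrule -n kasten-io
)

# Example (not executed here; names are hypothetical):
# apply_kasten_alerting hub-admin local-cluster
# apply_kasten_alerting east-admin prod-east
```

The oc context and the ACM managed cluster name are passed separately because they are independent values and often differ.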

How the Two Alert Files Work Differently

Understanding this prevents confusion when troubleshooting routing.

`kasten-k10-acm-alerts-k10.yaml` — UWM Prometheus scope

Has the label openshift.io/prometheus-rule-evaluation-scope: "leaf-prometheus".

Kasten Prometheus (kasten-io)
  → ServiceMonitor federation → UWM Prometheus (openshift-user-workload-monitoring)
    → evaluates KastenBackupJobFailed, KastenBackupJobsWithFailures,
      KastenBackupJobSkipped, KastenBackupDurationHigh, KastenRetireJobFailed
    → UWM Alertmanager → MCO Hub AM (via additionalAlertmanagerConfigs)
    → ACM Observe → Alerts

Kasten metrics (action_backup_ended_overall, catalog_actions_count, snapshot_storage_size_bytes, metering_pvc_size, catalog_storage_artifact_count) live only in Kasten's bundled Prometheus. The ServiceMonitor federates them into UWM Prometheus so these rules can evaluate them.

`kasten-k10-acm-alerts-infra.yaml` — Thanos Ruler scope

Has no prometheus-rule-evaluation-scope label.

OCP kubelet scrape → OCP Thanos Querier (built-in)
  → Thanos Ruler (openshift-user-workload-monitoring)
    → evaluates KastenPVNearFullWarning, KastenPVNearFullCritical
    → main OCP Alertmanager (openshift-monitoring)
    → MCO Hub AM (via MCO's own additionalAlertmanagerConfigs)
    → ACM Observe → Alerts

Kubelet metrics (kubelet_volume_stats_*) are already in OCP's monitoring stack. No ServiceMonitor needed.
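To see at a glance which path a given rule file takes, the label check can be sketched as below. This is a plain text match on the manifest, not a YAML parser, and rule_scope is a hypothetical helper name.

```shell
# Report which evaluator is expected to load a PrometheusRule manifest,
# based on the presence of the leaf-prometheus scope label.
# Usage: rule_scope <prometheusrule-file>
rule_scope() (
  if grep -Eq 'openshift\.io/prometheus-rule-evaluation-scope:[[:space:]]*"?leaf-prometheus"?' "$1"; then
    echo "uwm-prometheus"
  else
    echo "thanos-ruler"
  fi
)

# Example (not executed here):
# rule_scope kasten-k10-acm-alerts-k10.yaml
# rule_scope kasten-k10-acm-alerts-infra.yaml
#
# For objects already on the cluster, the labels can be inspected with:
# oc get prometheusrule kasten-k10-acm-alerts-k10 -n kasten-io --show-labels
```

If a Kasten-native alert never appears in UWM Prometheus, running this against the applied file is a quick way to rule out a missing label.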

Verification

Run all checks below on each cluster (hub and every managed cluster) after applying. Switch oc context between clusters as needed.

Check that Kasten metrics are being scraped by UWM Prometheus (~2 minutes after apply):

oc exec -n openshift-user-workload-monitoring $(oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}') -- sh -c 'curl -sg "http://localhost:9090/api/v1/query?query=action_backup_ended_overall" | python3 -c "import json,sys; d=json.load(sys.stdin); print(len(d[\"data\"][\"result\"]), \"series found\")"'

Expected: 1 series found (or more on multi-policy clusters). If 0: the ServiceMonitor hasn't scraped yet, or Kasten's prometheus-server pod isn't running.

Check that PrometheusRules are loaded into UWM Prometheus:

oc exec -n openshift-user-workload-monitoring $(oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}') -- sh -c 'curl -sg "http://localhost:9090/api/v1/rules" | python3 -c "import json,sys; [print(g[\"name\"]) for g in json.load(sys.stdin)[\"data\"][\"groups\"] if \"kasten\" in g[\"name\"]]"'

Expected: kasten.backup printed.

Check that the infra rules are loaded into Thanos Ruler:

oc exec -n openshift-user-workload-monitoring $(oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=thanos-ruler -o jsonpath='{.items[0].metadata.name}') -- sh -c 'curl -sg "http://localhost:10902/api/v1/rules" | python3 -c "import json,sys; [print(g[\"name\"]) for g in json.load(sys.stdin)[\"data\"][\"groups\"] if \"kasten\" in g[\"name\"]]"'

Expected: kasten.pv printed.

Trigger a test alert: run a backup policy and let it fail (or cancel a running job), then check that the alert has reached the UWM Alertmanager:

info

Both the OpenShift console Alerting UI and ACM Observe → Alerts default to showing Platform alerts only. Kasten alerts are not platform alerts, so clear the Platform filter to see them. The filter resets on every page refresh.

oc exec -n openshift-user-workload-monitoring $(oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=alertmanager -o jsonpath='{.items[0].metadata.name}') -c alertmanager -- sh -c 'curl -sg "http://localhost:9093/api/v2/alerts" | python3 -c "import json,sys; [print(a[\"labels\"][\"alertname\"]) for a in json.load(sys.stdin) if \"Kasten\" in a[\"labels\"].get(\"alertname\",\"\")]"'

Tuning

KastenBackupDurationHigh threshold: The default of 130% (30% above baseline) is a starting point. To raise or lower the sensitivity, edit the 1.3 multiplier in kasten-k10-acm-alerts-k10.yaml. The baseline window is 1 hour — environments with highly variable backup schedules may benefit from a longer window (e.g., [6h]).
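The effect of the multiplier can be checked with a little arithmetic before editing the rule. The sketch below only mirrors the comparison described above (current average versus multiplier times baseline); it is not the PromQL expression from the rule file, and duration_is_high is a hypothetical helper name.

```shell
# Succeed when the current average duration exceeds multiplier x baseline.
# Usage: duration_is_high <current-seconds> <baseline-seconds> [multiplier]
# The multiplier defaults to 1.3, matching the 130% default.
duration_is_high() {
  awk -v cur="$1" -v base="$2" -v mult="${3:-1.3}" \
    'BEGIN { exit !(cur > base * mult) }'
}

# Example: with a 600-second baseline the default threshold is 780 seconds.
if duration_is_high 800 600; then
  echo "800s against a 600s baseline: alert condition met"
fi
if ! duration_is_high 700 600; then
  echo "700s against a 600s baseline: below threshold"
fi
```

Lowering the multiplier to 1.1 would make the 700-second case meet the condition as well, since the threshold drops to 660 seconds.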

KastenPVNearFullWarning / KastenPVNearFullCritical thresholds: Defaults are 50% (warning) and 85% (critical). Adjust based on your PVC size and data growth rate. Smaller PVCs (10Gi) with rapid growth should alert at lower thresholds.
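As a rough aid for choosing thresholds, the following sketch classifies a PVC's usage against the default percentages. It illustrates the arithmetic only and does not reproduce the rule expressions; pv_severity is a hypothetical helper name.

```shell
# Classify PVC usage against warning and critical percentages.
# Usage: pv_severity <used-bytes> <capacity-bytes> [warn-pct] [crit-pct]
# Defaults match the documented thresholds: 50 (warning) and 85 (critical).
pv_severity() (
  used="$1"
  cap="$2"
  warn="${3:-50}"
  crit="${4:-85}"
  # Integer math: used/cap > pct/100 is equivalent to used*100 > cap*pct.
  if [ $((used * 100)) -gt $((cap * crit)) ]; then
    echo critical
  elif [ $((used * 100)) -gt $((cap * warn)) ]; then
    echo warning
  else
    echo ok
  fi
)

# Example: a 10Gi PVC (10737418240 bytes) with 6Gi (6442450944 bytes) used.
pv_severity 6442450944 10737418240
```

On a 10Gi PVC the default warning threshold corresponds to 5Gi used, so passing a lower warn-pct as the third argument shows how much earlier an alert would fire.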

Troubleshooting

Symptom	Check
`jsonpath … array index out of bounds` on the alertmanager exec command	UWM Alertmanager not enabled — `uwm-alertmanager-config.yaml` must include `alertmanager: enabled: true`. Apply the config, wait ~60s for the pod to start.
`KeyError: 'data'` / `"The Alertmanager v1 API was deprecated … removed as of version 0.28.0"`	Use the v2 API: replace `/api/v1/alerts` with `/api/v2/alerts` and remove `["data"]` from the python snippet — v2 returns a bare array.
Rules applied but no series after 5 min	`oc get servicemonitor kasten-k10-federation -n kasten-io` exists? Kasten's `prometheus-server` pod running?
`kasten.backup` group missing from UWM Prometheus	Missing `leaf-prometheus` label on PrometheusRule, or UWM not enabled
`kasten.pv` group missing from Thanos Ruler	Thanos Ruler pod not running (`oc get pod -n openshift-user-workload-monitoring -l app.kubernetes.io/name=thanos-ruler`)
Alerts firing locally but not in ACM Observe	`additionalAlertmanagerConfigs` not configured, or secrets not copied
`KastenBackupJobsWithFailures` never clears	Expected — it stays firing until Kasten retires the failed action. Retire it from the Kasten dashboard.
`increase()` alert fires once then stops	Correct behavior — `KastenBackupJobFailed` is a momentary trigger. `KastenBackupJobsWithFailures` is the persistent companion.
