Version: 9.0.2 (latest)

Deploying the Kasten ACM Dashboard

Once Kasten metrics are flowing to the ACM Thanos backend (see Step 4: Verifying Metrics Collection), you can deploy the Kasten multi-cluster Grafana dashboard into the ACM Observability Grafana instance. The dashboard provides a unified view of backup jobs, restore points, storage usage, and policy status across all clusters managed by RHACM.

What You Get

A single multi-cluster Kasten dashboard in Grafana, available under ACM Observe → Dashboards → General, with the following data and features:

Per-cluster selector that filters all panels by protected Kasten cluster
Policy and application selectors
Backup job history, success/failure rates, and restore point counts
Local snapshot storage usage and PVC utilization
License usage and consumption for single-cluster Kasten deployments and for Multi-Cluster Manager license leasing deployments

How Dashboard Injection Works

MultiCluster Observability (MCO) runs a dashboard loader sidecar (grafana-dashboard-loader) inside the Grafana pod. The loader watches the open-cluster-management-observability namespace for ConfigMaps that carry the label grafana-custom-dashboard: "true" and automatically posts them to Grafana's internal API. This is the supported injection path; no direct Grafana API access is required.

ConfigMap (open-cluster-management-observability)
  label: grafana-custom-dashboard: "true"
    ↓
grafana-dashboard-loader sidecar detects ConfigMap
    ↓
Loader POSTs dashboard JSON to Grafana API (internal)
    ↓
Dashboard appears in ACM Observe → Dashboards → General

The loader retries up to 40 times at 10-second intervals if the upload fails, so it is safe to apply the ConfigMap before Grafana is fully ready.
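The retry behaviour described above can be pictured with a small sketch. This is not the loader's actual code; it only models "up to 40 attempts, 10 seconds apart" so the worst-case wait is easy to reason about. The function name and the injectable `sleep` parameter are illustrative assumptions.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 40       # retry budget stated in the text above
RETRY_INTERVAL_S = 10   # seconds between attempts, as stated above


def upload_with_retry(upload: Callable[[], bool],
                      max_attempts: int = MAX_ATTEMPTS,
                      interval_s: float = RETRY_INTERVAL_S,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call upload() until it returns True and return the attempts used.

    Raises RuntimeError once the retry budget is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        if upload():
            return attempt
        if attempt < max_attempts:
            sleep(interval_s)  # wait before the next attempt
    raise RuntimeError(f"upload failed after {max_attempts} attempts")


# Approximate upper bound on how long the loader keeps trying,
# ignoring the time each request itself takes.
worst_case_wait_s = MAX_ATTEMPTS * RETRY_INTERVAL_S
```

Under these numbers the retry window is roughly 400 seconds, which is why a ConfigMap applied shortly before Grafana becomes ready is still picked up.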

Additional Metrics Allowlist

The Kasten ACM Dashboard uses several metrics beyond the core set added to the allowlist in Step 1 of the ACM Observability setup. Before deploying the dashboard, update the allowlist ConfigMap on the Hub cluster to include these additional entries:

Hub Cluster

Run this command on the ACM Hub cluster.

oc apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  uwl_metrics_list.yaml: |
    names:
      - action_started_total
      - action_skipped_total
      - action_ended_total
      - action_duration_seconds_bucket
      - action_duration_seconds_sum
      - action_duration_seconds_count
      - action_backup_ended_overall
      - action_backup_ended_count
      - action_backup_duration_seconds_sum_overall
      - action_restore_ended_overall
      - action_restore_duration_seconds_sum_overall
      - action_export_ended_overall
      - action_export_duration_seconds_sum_overall
      - action_import_ended_overall
      - action_import_duration_seconds_sum_overall
      - action_retire_ended_overall
      - catalog_actions_count
      - catalog_storage_artifact_count
      - compliance_count
      - licenses_info
      - mc_licenses_info
      - metering_license_compliance_status
      - metering_pvc_size
      - multicluster_fulfilled_leases
      - multicluster_licensed_nodes_contributed
      - multicluster_licenses
      - multicluster_nodes_issued
      - multicluster_nodes_requested
      - multicluster_unfulfilled_leases
      - nodes_count
      - policies_count
      - profiles_count
      - snapshot_storage_size_bytes
  metrics_list.yaml: |
    names:
      - kubelet_volume_stats_used_bytes
      - kubelet_volume_stats_capacity_bytes
EOF

info

Kasten metrics are collected via User Workload Monitoring (UWM) and must be listed under uwl_metrics_list.yaml. Kubelet PV metrics (kubelet_volume_stats_*) are collected via the standard metrics-collector path and go under metrics_list.yaml. Without the kubelet entries, the PV Capacity panels show "No data." If the observability-metrics-custom-allowlist ConfigMap already exists, merge these entries into the existing sections rather than replacing the whole ConfigMap.

ACM's metrics-collector picks up allowlist changes automatically — no restarts are needed.
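When the allowlist ConfigMap already exists, the merge described in the note above amounts to a duplicate-free union of the names lists. The following is a minimal sketch of that merge for one section; `merge_names` is a hypothetical helper, and reading or writing the ConfigMap itself is left out.

```python
from typing import Iterable, List


def merge_names(existing: Iterable[str], additional: Iterable[str]) -> List[str]:
    """Return the existing names followed by any additional names not yet present."""
    merged: List[str] = []
    seen = set()
    for name in list(existing) + list(additional):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


# Example: an existing metrics_list.yaml section gains the kubelet entries.
current = ["kubelet_volume_stats_used_bytes"]
wanted = ["kubelet_volume_stats_used_bytes", "kubelet_volume_stats_capacity_bytes"]
merged_names = merge_names(current, wanted)
```

Apply the same merge separately to the uwl_metrics_list.yaml and metrics_list.yaml sections so that entries stay under the collection path they belong to.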

Step 1 — Download the Dashboard JSON

Download the dashboard JSON:

curl -sO https://docs.kasten.io/downloads/9.0.2/prometheus/dashboards/grafana/kasten-k10-acm-dashboard-slim.json

The dashboard is also published to Grafana Cloud Dashboards if you need the latest upstream version.

Grafana Cloud download requires manual edits

If you download from Grafana Cloud instead of using the curl command above, you must edit the JSON before wrapping it in a ConfigMap:

Remove the __inputs, __elements, and __requires blocks entirely — these are present in all Grafana Cloud downloads and will cause import errors in MCO Grafana.
Set the top-level "id" field to null.
Keep the top-level "uid" field exactly as downloaded.

The version available via curl from docs.kasten.io has these edits pre-applied.
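If you do start from a Grafana Cloud download, the three edits listed above can be scripted. The sketch below is one possible way to do it with the Python standard library; `clean_grafana_export` is a hypothetical helper name, and it assumes the download is a single JSON object.

```python
import json

# Top-level blocks that the text above says must be removed.
EXPORT_ONLY_KEYS = ("__inputs", "__elements", "__requires")


def clean_grafana_export(raw_json: str) -> str:
    """Apply the manual edits to a Grafana Cloud dashboard export."""
    dashboard = json.loads(raw_json)
    for key in EXPORT_ONLY_KEYS:
        dashboard.pop(key, None)   # remove the block if present
    dashboard["id"] = None         # serialized as JSON null
    # The top-level "uid" is deliberately left exactly as downloaded.
    return json.dumps(dashboard, indent=2)
```

A typical use would be to read the downloaded file, pass its text through this function, and write the result to the file name used in Step 2.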

Step 2 — Wrap the JSON in a ConfigMap

MCO's dashboard loader requires the JSON to be delivered as a ConfigMap with specific labels. Create a file named kasten-acm-dashboard-cm.yaml.

A quick way to produce a correctly structured ConfigMap from the JSON:

oc create configmap kasten-acm-dashboard \
  --from-file=kasten-acm-dashboard.json=./kasten-k10-acm-dashboard-slim.json \
  -n open-cluster-management-observability \
  --dry-run=client -o yaml | \
  python3 -c "import sys; c=sys.stdin.read(); c=c.replace('metadata:\n', 'metadata:\n  labels:\n    grafana-custom-dashboard: \"true\"\n    general-folder: \"true\"\n', 1); print(c, end='')" \
  > kasten-acm-dashboard-cm.yaml

Review the output before applying to confirm the labels and namespace are correct.
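As an alternative to patching the generated YAML as text, the same ConfigMap can be assembled as a data structure and written out as JSON, which oc apply also accepts. This is a sketch under that assumption, not a replacement for the documented command; `build_dashboard_configmap` is a hypothetical helper, and the names and labels are the ones used in this step.

```python
import json


def build_dashboard_configmap(dashboard_json: str) -> dict:
    """Wrap dashboard JSON text in a ConfigMap manifest for the dashboard loader."""
    json.loads(dashboard_json)  # fail early if the dashboard is not valid JSON
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "kasten-acm-dashboard",
            "namespace": "open-cluster-management-observability",
            "labels": {
                "grafana-custom-dashboard": "true",  # required by the loader
                "general-folder": "true",            # places it in General
            },
        },
        # The dashboard is stored verbatim, so its uid is preserved.
        "data": {"kasten-acm-dashboard.json": dashboard_json},
    }


def render_manifest(dashboard_json: str) -> str:
    """Return the manifest as JSON text suitable for saving to a file."""
    return json.dumps(build_dashboard_configmap(dashboard_json), indent=2)
```

Building the manifest this way avoids depending on the exact text layout of the oc create output, at the cost of producing JSON rather than YAML.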

warning

Do not use Red Hat's generate-dashboard-configmap-yaml.sh tool — it strips the uid field and causes 412 conflicts on subsequent updates.

The resulting ConfigMap must look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kasten-acm-dashboard
  namespace: open-cluster-management-observability
  labels:
    grafana-custom-dashboard: "true"   # Required — loader checks for this exact key/value
    general-folder: "true"             # Places dashboard in General folder
data:
  kasten-acm-dashboard.json: |
    { ... dashboard JSON ... }
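Before applying, it can help to check the manifest against the structure shown above. The sketch below assumes the manifest is available as JSON, for example from adding -o json instead of -o yaml to the dry-run command; `check_dashboard_configmap` is a hypothetical helper and only covers the requirements stated on this page.

```python
import json
from typing import List

EXPECTED_NAMESPACE = "open-cluster-management-observability"
REQUIRED_LABEL = "grafana-custom-dashboard"


def check_dashboard_configmap(manifest_json: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    cm = json.loads(manifest_json)
    problems: List[str] = []
    meta = cm.get("metadata") or {}
    if meta.get("namespace") != EXPECTED_NAMESPACE:
        problems.append("namespace is not " + EXPECTED_NAMESPACE)
    if (meta.get("labels") or {}).get(REQUIRED_LABEL) != "true":
        problems.append('label grafana-custom-dashboard: "true" is missing')
    data = cm.get("data") or {}
    if not data:
        problems.append("ConfigMap has no data keys")
    for key, body in data.items():
        try:
            dashboard = json.loads(body)
        except ValueError:
            problems.append(key + " is not valid JSON")
            continue
        # The uid must survive wrapping, otherwise later updates conflict.
        if not isinstance(dashboard, dict) or not dashboard.get("uid"):
            problems.append(key + " has no top-level uid")
    return problems
```

An empty result only confirms the points checked here; the loader log in Step 4 remains the authoritative confirmation.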

Step 3 — Apply the ConfigMap to the Hub Cluster

Hub Cluster

Run this command on the ACM Hub cluster. The namespace must be open-cluster-management-observability — the loader only watches its own namespace. Applying to any other namespace has no effect.

oc apply -f kasten-acm-dashboard-cm.yaml -n open-cluster-management-observability

Step 4 — Verify the Loader Detected the ConfigMap

oc logs -n open-cluster-management-observability \
  $(oc get pod -n open-cluster-management-observability \
    -l app=multicluster-observability-grafana \
    -o jsonpath='{.items[0].metadata.name}') \
  -c grafana-dashboard-loader --since=2m

Look for a line like:

Successfully updated dashboard kasten-acm-dashboard

Allow 30–60 seconds after applying the ConfigMap before expecting a success line — the loader retries every 10 seconds, and earlier entries in the --since=2m window may show retry attempts from the initial load cycle.

If you see name-exists or 412, see Dashboard Troubleshooting below.

Step 5 — Open ACM Grafana

Navigate to the ACM console → Observe → Dashboards → General → Kasten Multi-Cluster.

Dashboard Variables

The dashboard provides four drop-down variables at the top:

Variable	Description
`datasource`	Thanos datasource (auto-selected)
`cluster_name`	Filter by managed cluster; `All` shows aggregate across all clusters
`policy`	Filter backup panels by Kasten policy name
`app`	Filter by application/namespace

Updating the Dashboard

To pick up a newly published version of the dashboard, or to apply local changes:

Download the latest JSON using the curl command from Step 1, or from the Grafana Cloud Dashboards page.
Rebuild the ConfigMap YAML using the same method as in Step 2. Keep the same ConfigMap name, and leave the uid field in the JSON unchanged.

Re-apply:

oc apply -f kasten-acm-dashboard-cm.yaml -n open-cluster-management-observability

The loader detects the ConfigMap update and re-posts the dashboard to Grafana with overwrite: true. The existing dashboard is replaced in-place — no manual deletion needed.

Dashboard Metrics Reference

The following Kasten Prometheus metrics are used in the dashboard. All metrics must flow to the MCO Thanos backend via Prometheus remote_write for the dashboard to display data.

Metric	Description
`action_started_total`	Counter of started actions; labeled by `action` (backup, restore, export, import, report, run, retire), `scope` (namespace, cluster), `app`, `app_namespace`, `policy`, `subtype`
`action_skipped_total`	Counter of skipped actions; same labels as `action_started_total`
`action_ended_total`	Counter of ended actions; same labels as `action_started_total` plus `state` (succeeded, failed, cancelled)
`action_duration_seconds`	Histogram of action duration in seconds; same labels as `action_ended_total` — exposed as `action_duration_seconds_sum`, `action_duration_seconds_count`, `action_duration_seconds_bucket`
`action_backup_ended_overall`	(deprecated) Counter of completed backup actions, labeled by state (success/failed/cancelled)
`action_backup_ended_count`	(deprecated) Count of ended backup actions per policy and application
`action_backup_duration_seconds_sum_overall`	(deprecated) Cumulative backup action duration in seconds
`action_restore_ended_overall`	(deprecated) Counter of completed restore actions, labeled by state
`action_restore_duration_seconds_sum_overall`	(deprecated) Cumulative restore action duration in seconds
`action_export_ended_overall`	(deprecated) Counter of completed export actions, labeled by state
`action_export_duration_seconds_sum_overall`	(deprecated) Cumulative export action duration in seconds
`action_import_ended_overall`	(deprecated) Counter of completed import actions, labeled by state
`action_import_duration_seconds_sum_overall`	(deprecated) Cumulative import action duration in seconds
`action_retire_ended_overall`	(deprecated) Counter of completed retire actions, labeled by state (success/failed/cancelled)
`catalog_actions_count`	Gauge of current actions in the Kasten catalog, labeled by type, status, and namespace
`catalog_storage_artifact_count`	Count of stored artifacts in the Kasten catalog, labeled by category and retirement status
`compliance_count`	Count of policy compliance states across managed applications
`policies_count`	Count of Kasten policies, labeled by action type
`profiles_count`	Count of Kasten location profiles, labeled by status
`metering_pvc_size`	Total PVC capacity allocated to the Kasten catalog storage
`snapshot_storage_size_bytes`	Total local snapshot storage consumed
`nodes_count`	Count of nodes labeled by type (`licensed_total`, `tainted`, etc.) for capacity and compliance tracking
`licenses_info`	License metadata including node capacity, license type, and compliance status
`metering_license_compliance_status`	Per-cluster license compliance status (compliant/non-compliant)
`mc_licenses_info`	MCM hub: per-cluster license state with `cluster` and `license_info_state` labels; states are `locally_installed`, `nodes_contributed`, `nodes_leased`, `expired`
`multicluster_licenses`	MCM hub: count of unique licenses across all managed clusters (hub-global, no `cluster_name` label)
`multicluster_licensed_nodes_contributed`	MCM hub: sum of license node limits across all managed clusters — note this is the sum of limits, not nodes actively contributed (hub-global, no `cluster_name` label)
`multicluster_nodes_issued`	MCM hub: total licensed nodes issued to managed clusters
`multicluster_nodes_requested`	MCM hub: total licensed nodes requested by managed clusters
`multicluster_fulfilled_leases`	MCM hub: count of fulfilled license leases across managed clusters
`multicluster_unfulfilled_leases`	MCM hub: count of unfulfilled license lease requests; a nonzero value signals that demand exceeds capacity

action_backup_ended_overall vs action_backup_ended_count (deprecated)

Both counters increment on the same code path — once per backup action transition to a completed state (succeeded/failed/cancelled). The difference is in labels and initialization:

	`action_backup_ended_overall`	`action_backup_ended_count`
Labels	`state` only	`app`, `policy`, `state`
Pre-initialized at startup	Yes — one count per completion state	No — only increments on real events

The _overall metric is pre-populated at Kasten startup so that all state label combinations exist in Prometheus before any backups run. That initialization is a real counter increment, which means increase() over a window that includes a Kasten restart will show +1 per state regardless of actual backup activity. The dashboard uses action_backup_ended_count in per-cluster and per-application table panels for accurate counts, and _overall only in headline stat panels where the pre-initialization artifact is a minor cosmetic issue.

Both of these metrics are deprecated. The consolidated action_ended_total{action="backup", ...} metric is not pre-initialized at startup, so it does not have the startup-artifact issue and can be used consistently across all panel types.
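The startup artifact can be illustrated with a simplified model of how a counter increase is computed across a restart. The sketch below is not Prometheus code: it omits extrapolation, assumes the series keeps the same labels across the restart, and uses made-up sample values. It only shows why a counter that restarts at 1 adds a phantom increment while one that restarts at 0 does not.

```python
from typing import Sequence


def raw_increase(samples: Sequence[float]) -> float:
    """Sum counter growth over consecutive samples, handling resets.

    A drop in value is treated as a counter reset, and the value seen
    after the reset is counted in full.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # reset: everything after the restart counts
    return total


# Illustrative samples with a restart in the middle and no real backups.
pre_initialized = [5, 5, 1, 1]      # counter re-created at 1 on startup
not_pre_initialized = [5, 5, 0, 0]  # counter starts again from 0

phantom = raw_increase(pre_initialized)
clean = raw_increase(not_pre_initialized)
```

In this model the pre-initialized series reports an increase of 1 with no backup activity, while the series that restarts from zero reports none.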

ServiceMonitor and cluster_name

If using User Workload Monitoring (UWM) federation via kasten-k10-acm-servicemonitor.yaml, the ServiceMonitor contains a metricRelabelings entry that overrides cluster_name on the UWM scrape path. Before applying it to each cluster, replace the placeholder value with the correct ACM managed cluster name:

replacement: REPLACE_WITH_ACM_CLUSTER_NAME

Download and apply the ServiceMonitor:

curl -sO https://docs.kasten.io/downloads/9.0.2/prometheus/alerts/acm/kasten-k10-acm-servicemonitor.yaml
# Edit the file and set replacement: <your-acm-cluster-name>
oc apply -f kasten-k10-acm-servicemonitor.yaml -n kasten-io

Do not reuse a ServiceMonitor file that was previously applied to another cluster — it will carry the old cluster name and produce a duplicate metric stream in Thanos.
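One way to avoid carrying an old cluster name forward is to always render the ServiceMonitor from the freshly downloaded template and to fail if the placeholder is missing. The following is a minimal sketch of that idea; `render_servicemonitor` is a hypothetical helper that works on the file's text and does not parse YAML.

```python
PLACEHOLDER = "REPLACE_WITH_ACM_CLUSTER_NAME"


def render_servicemonitor(template: str, cluster_name: str) -> str:
    """Substitute the ACM managed cluster name into the ServiceMonitor text."""
    if not cluster_name or cluster_name == PLACEHOLDER:
        raise ValueError("a real ACM managed cluster name is required")
    if PLACEHOLDER not in template:
        # The file was probably already edited for another cluster.
        raise ValueError("placeholder not found; start from a fresh download")
    return template.replace(PLACEHOLDER, cluster_name)
```

Because the function refuses input without the placeholder, passing it a file that was already edited for another cluster raises an error instead of silently producing a duplicate metric stream.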

Dashboard Troubleshooting

Dashboard not appearing after 5 minutes:

oc logs -n open-cluster-management-observability \
  $(oc get pod -n open-cluster-management-observability \
    -l app=multicluster-observability-grafana \
    -o jsonpath='{.items[0].metadata.name}') \
  -c grafana-dashboard-loader --since=10m

Log message	Cause	Fix
No log lines mentioning `kasten`	Loader did not detect the ConfigMap	Check the ConfigMap label: `oc get cm kasten-acm-dashboard -n open-cluster-management-observability -o jsonpath='{.metadata.labels}'` — must have `grafana-custom-dashboard: "true"`
`name-exists` / 412	Dashboard with same title but different `uid` exists in Grafana	Delete the conflicting dashboard from the Grafana UI, then delete and re-apply the ConfigMap
`version-mismatch` / 412	Dashboard `uid` exists with a version mismatch	Loader retries automatically with `overwrite: true`; allow for the full retry budget (up to 40 retries at 10-second intervals, roughly 400 s)
`context deadline exceeded`	Grafana pod not ready	Wait for Grafana to fully start; the loader will retry
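The table above can be turned into a small log scanner. This sketch matches only the message fragments quoted on this page; the exact wording of the loader's log lines may differ between versions, so treat the markers as assumptions. `diagnose` is a hypothetical helper that takes the saved output of the oc logs command as text.

```python
from typing import List

# (log fragment, suggested action), taken from the troubleshooting table.
KNOWN_MARKERS = [
    ("Successfully updated dashboard", "dashboard loaded"),
    ("name-exists", "delete the conflicting dashboard, then re-apply the ConfigMap"),
    ("version-mismatch", "wait for the loader to retry with overwrite"),
    ("context deadline exceeded", "wait for Grafana to become ready"),
]


def diagnose(log_text: str, name: str = "kasten") -> List[str]:
    """Return suggested actions for the loader log, in order of first appearance."""
    findings: List[str] = []
    for line in log_text.splitlines():
        for marker, hint in KNOWN_MARKERS:
            if marker in line and hint not in findings:
                findings.append(hint)
    if not findings and name not in log_text:
        findings.append("ConfigMap not detected: check the grafana-custom-dashboard label")
    return findings
```

Several findings can appear for one log window, for example a timeout from the initial load cycle followed by a later success.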

Check if a conflicting dashboard exists:

Port-forward to Grafana and query the API:

oc port-forward -n open-cluster-management-observability \
  $(oc get pod -n open-cluster-management-observability \
    -l app=multicluster-observability-grafana \
    -o jsonpath='{.items[0].metadata.name}') 3001:3001 &
PF_PID=$!   # remember the background port-forward so it can be stopped later
sleep 3     # give the port-forward a moment to start listening

curl -s "http://localhost:3001/api/search?query=Kasten" \
  -H "X-Forwarded-User: WHAT_YOU_ARE_DOING_IS_VOIDING_SUPPORT_0000000000000000000000000000000000000000000000000000000000000000"

kill "$PF_PID"   # stop the port-forward when finished

If multiple results appear, delete the one whose uid does not match the downloaded JSON's uid field from the Grafana UI.
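Comparing uids by eye is error-prone when several results come back. The sketch below filters a saved search response for entries whose uid differs from the one in the downloaded JSON. It assumes the response is a JSON array of objects with uid, title, and type fields, with folders reported as type dash-folder; `conflicting_dashboards` is a hypothetical helper.

```python
import json
from typing import Dict, List


def conflicting_dashboards(search_response: str, expected_uid: str) -> List[Dict]:
    """Return search hits that are dashboards with a uid other than expected_uid."""
    results = json.loads(search_response)
    return [
        hit for hit in results
        if hit.get("type") != "dash-folder"   # skip folders
        and hit.get("uid") != expected_uid
    ]
```

The search matches any dashboard whose title contains "Kasten", so confirm that a reported entry really has the same title as the Kasten dashboard before deleting it.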

PV Utilization panel shows "No data" but kubelet metrics exist in Thanos:

ACM's metrics-collector labels kubelet metrics with cluster, not cluster_name. The dashboard queries kubelet panels using cluster=~"$cluster_name" for this reason. If you have customized the dashboard JSON, ensure kubelet metric queries filter on the cluster label, not cluster_name.
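For a customized dashboard, the label check described above can be automated. The sketch below walks the dashboard JSON, collects every string stored under an expr key, and reports kubelet queries that filter on a cluster_name label. It is a heuristic, and `kubelet_queries_using_cluster_name` is a hypothetical helper; the $cluster_name variable on the right-hand side of a matcher is intentionally not flagged.

```python
import json
import re
from typing import Iterator, List

# Matches cluster_name used as a label name, i.e. followed by a matcher operator.
LABEL_MATCHER = re.compile(r"\bcluster_name\s*(=~|!~|!=|=)")


def iter_exprs(node) -> Iterator[str]:
    """Yield every string stored under an 'expr' key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                yield value
            else:
                yield from iter_exprs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_exprs(item)


def kubelet_queries_using_cluster_name(dashboard_json: str) -> List[str]:
    """Return kubelet query expressions that filter on the cluster_name label."""
    dashboard = json.loads(dashboard_json)
    return [
        expr for expr in iter_exprs(dashboard)
        if "kubelet_" in expr and LABEL_MATCHER.search(expr)
    ]
```

An expression that combines a kubelet metric with a Kasten metric may legitimately use cluster_name on the Kasten side, so review each reported query rather than rewriting it blindly.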

All panels show "No data":

Kasten metrics are not yet flowing to MCO Thanos. Check remote_write on each Kasten cluster:

oc logs -n kasten-io \
  $(oc get pod -n kasten-io -l app=prometheus,release=k10 \
    -o jsonpath='{.items[0].metadata.name}') \
  -c prometheus-server | grep "remote_write\|send"

If using the Thanos Receive direct endpoint, verify the port. The remote-write port is typically 19291 — port 10901 is gRPC and returns a protocol error. Refer to Troubleshooting Metrics Collection for the full port reference.
