Deploying the Kasten ACM Dashboard
Once Kasten metrics are flowing to the ACM Thanos backend (see Step 4: Verifying Metrics Collection), you can deploy the Kasten multi-cluster Grafana dashboard into the ACM Observability Grafana instance. The dashboard provides a unified view of backup jobs, restore points, storage usage, and policy status across all clusters managed by RHACM.
What You Get
A single multi-cluster Kasten dashboard in Grafana, accessible from ACM Observe → Dashboards → General, with the following data and features:
- Per-cluster selector that filters all panels by protected Kasten cluster
- Policy and application selectors
- Backup job history, success/failure rates, and restore point counts
- Local snapshot storage usage and PVC utilization
- License usage and consumption for single-cluster Kasten deployments and for Multi-Cluster Manager license leasing deployments
How Dashboard Injection Works
MCO runs a dashboard loader sidecar (grafana-dashboard-loader) inside the Grafana pod. The loader watches for ConfigMaps in the open-cluster-management-observability namespace that carry the label grafana-custom-dashboard: "true" and automatically posts them to Grafana's internal API. This is the supported injection path — no direct Grafana API access is required.
ConfigMap (open-cluster-management-observability)
label: grafana-custom-dashboard: "true"
↓
grafana-dashboard-loader sidecar detects ConfigMap
↓
Loader POSTs dashboard JSON to Grafana API (internal)
↓
Dashboard appears in ACM Observe → Dashboards → General
The loader retries up to 40 times at 10-second intervals if the upload fails, so it is safe to apply the ConfigMap before Grafana is fully ready.
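The selection rule described in this section can be written as a small predicate, which is handy for checking a manifest before applying it. This is a minimal sketch, not the loader's actual code: the function name loader_would_pick_up is hypothetical, and it encodes only the two conditions stated above (namespace and label).

```python
OBSERVABILITY_NAMESPACE = "open-cluster-management-observability"
DASHBOARD_LABEL = "grafana-custom-dashboard"


def loader_would_pick_up(configmap: dict) -> bool:
    """Return True if a ConfigMap manifest meets the two documented conditions."""
    metadata = configmap.get("metadata", {})
    labels = metadata.get("labels") or {}
    return (
        metadata.get("namespace") == OBSERVABILITY_NAMESPACE
        and labels.get(DASHBOARD_LABEL) == "true"  # the string "true", not a boolean
    )


# Illustrative manifest, parsed into a dict
example = {
    "metadata": {
        "name": "kasten-acm-dashboard",
        "namespace": OBSERVABILITY_NAMESPACE,
        "labels": {DASHBOARD_LABEL: "true"},
    }
}
print(loader_would_pick_up(example))
```

The label value is compared as a string because Kubernetes label values are always strings, which is why the YAML quotes "true".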
Additional Metrics Allowlist
The Kasten ACM Dashboard uses several metrics beyond the core set added to the allowlist in Step 1 of the ACM Observability setup. Before deploying the dashboard, update the allowlist ConfigMap on the Hub cluster to include these additional entries:
Run this command on the ACM Hub cluster.
oc apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  uwl_metrics_list.yaml: |
    names:
      - action_backup_ended_overall
      - action_backup_ended_count
      - action_backup_duration_seconds_sum_overall
      - action_restore_ended_overall
      - action_restore_duration_seconds_sum_overall
      - action_export_ended_overall
      - action_export_duration_seconds_sum_overall
      - action_import_ended_overall
      - action_import_duration_seconds_sum_overall
      - action_retire_ended_overall
      - catalog_actions_count
      - catalog_storage_artifact_count
      - compliance_count
      - licenses_info
      - mc_licenses_info
      - metering_license_compliance_status
      - metering_pvc_size
      - multicluster_fulfilled_leases
      - multicluster_licensed_nodes_contributed
      - multicluster_licenses
      - multicluster_nodes_issued
      - multicluster_nodes_requested
      - multicluster_unfulfilled_leases
      - nodes_count
      - policies_count
      - profiles_count
      - snapshot_storage_size_bytes
  metrics_list.yaml: |
    names:
      - kubelet_volume_stats_used_bytes
      - kubelet_volume_stats_capacity_bytes
EOF
Kasten metrics are collected via User Workload Monitoring (UWM) and must be listed under uwl_metrics_list.yaml. Kubelet PV metrics (kubelet_volume_stats_*) are collected via the standard metrics-collector path and go under metrics_list.yaml. Without the kubelet entries, the PV Capacity panels show "No data." If the observability-metrics-custom-allowlist ConfigMap already exists, merge these entries into the existing sections rather than replacing the whole ConfigMap.
ACM's metrics-collector picks up allowlist changes automatically — no restarts are needed.
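If the allowlist ConfigMap already exists, the merge can be done by hand or with a small helper. The following is a minimal sketch of an order-preserving, de-duplicating merge of one names list. The function name and the existing entries shown (including my_team_custom_metric) are hypothetical, and reading or writing the YAML itself is left out.

```python
def merge_allowlist_names(existing: list[str], additions: list[str]) -> list[str]:
    """Append names that are not already present, keeping the existing order."""
    merged = list(existing)
    seen = set(existing)
    for name in additions:
        if name not in seen:
            merged.append(name)
            seen.add(name)
    return merged


# Hypothetical entries already present under uwl_metrics_list.yaml
existing_uwl = ["catalog_actions_count", "my_team_custom_metric"]
# A few of the Kasten names from the list above
kasten_uwl = ["action_backup_ended_overall", "catalog_actions_count", "policies_count"]

merged_uwl = merge_allowlist_names(existing_uwl, kasten_uwl)
print("\n".join(f"- {name}" for name in merged_uwl))
```

Run the same merge separately for the uwl_metrics_list.yaml and metrics_list.yaml sections so that Kasten and kubelet metrics stay in their respective lists.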
Step 1 — Download the Dashboard JSON
Download the dashboard JSON:
curl -sO https://docs.kasten.io/downloads/8.5.9/prometheus/dashboards/grafana/kasten-k10-acm-dashboard-slim.json
The dashboard is also published to Grafana Cloud Dashboards if you need the latest upstream version.
If you download from Grafana Cloud instead of using the curl command above, you must edit the JSON before wrapping it in a ConfigMap:
- Remove the __inputs, __elements, and __requires blocks entirely. These are present in all Grafana Cloud downloads and will cause import errors in MCO Grafana.
- Set the top-level "id" field to null.
- Keep the top-level "uid" field exactly as downloaded.
The version available via curl from docs.kasten.io has these edits pre-applied.
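For a Grafana Cloud download, the three edits listed above can be scripted rather than made by hand. This is a minimal sketch under the assumption that the blocks sit at the top level of the JSON document. The function name is hypothetical, and the sample values (including example-uid) are placeholders rather than the real dashboard's contents.

```python
import json


def prepare_grafana_cloud_dashboard(raw_json: str) -> str:
    """Apply the documented edits to a Grafana Cloud dashboard export."""
    dashboard = json.loads(raw_json)
    # Remove the export-only blocks if they are present.
    for key in ("__inputs", "__elements", "__requires"):
        dashboard.pop(key, None)
    # None is serialized as JSON null.
    dashboard["id"] = None
    # "uid" is intentionally left exactly as downloaded.
    return json.dumps(dashboard, indent=2)


# Placeholder export used only to exercise the function
sample = json.dumps({
    "__inputs": [{"name": "DS_PROMETHEUS"}],
    "__elements": {},
    "__requires": [],
    "id": 42,
    "uid": "example-uid",
    "title": "Kasten Multi-Cluster",
})
cleaned = json.loads(prepare_grafana_cloud_dashboard(sample))
print(sorted(cleaned))
```

To use it on a real download, read the file contents into a string, pass it through the function, and write the result to the file name used in Step 2.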
Step 2 — Wrap the JSON in a ConfigMap
MCO's dashboard loader requires the JSON to be delivered as a ConfigMap with specific labels. Create a file named kasten-acm-dashboard-cm.yaml.
A quick way to produce a correctly structured ConfigMap from the JSON:
oc create configmap kasten-acm-dashboard \
--from-file=kasten-acm-dashboard.json=./kasten-k10-acm-dashboard-slim.json \
-n open-cluster-management-observability \
--dry-run=client -o yaml | \
python3 -c "import sys; c=sys.stdin.read(); c=c.replace('metadata:\n', 'metadata:\n  labels:\n    grafana-custom-dashboard: \"true\"\n    general-folder: \"true\"\n', 1); print(c, end='')" \
> kasten-acm-dashboard-cm.yaml
Review the output before applying to confirm the labels and namespace are correct.
Do not use Red Hat's generate-dashboard-configmap-yaml.sh tool — it strips the uid field and causes 412 conflicts on subsequent updates.
The resulting ConfigMap must look like this:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kasten-acm-dashboard
  namespace: open-cluster-management-observability
  labels:
    grafana-custom-dashboard: "true"   # Required: loader checks for this exact key/value
    general-folder: "true"             # Places dashboard in General folder
data:
  kasten-acm-dashboard.json: |
    { ... dashboard JSON ... }
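As an alternative to the shell pipeline in this step, the same ConfigMap can be assembled in Python and written out as a JSON manifest, which oc apply -f accepts alongside YAML. This is a sketch only: the function name is hypothetical, and the dashboard body below is a stand-in for the real file contents.

```python
import json


def build_dashboard_configmap(dashboard_json: str) -> dict:
    """Wrap dashboard JSON text in a ConfigMap manifest with the required labels."""
    json.loads(dashboard_json)  # fail early if the input is not valid JSON
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "kasten-acm-dashboard",
            "namespace": "open-cluster-management-observability",
            "labels": {
                "grafana-custom-dashboard": "true",
                "general-folder": "true",
            },
        },
        # The dashboard is stored as a string value, not as nested JSON.
        "data": {"kasten-acm-dashboard.json": dashboard_json},
    }


# Stand-in dashboard body; a real run would read kasten-k10-acm-dashboard-slim.json
demo = build_dashboard_configmap('{"uid": "example-uid", "title": "Kasten Multi-Cluster"}')
print(json.dumps(demo, indent=2))
```

Because the dashboard text is passed through unchanged, its uid field is preserved, which is the property this step depends on.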
Step 3 — Apply the ConfigMap to the Hub Cluster
Run this command on the ACM Hub cluster. The namespace must be open-cluster-management-observability — the loader only watches its own namespace. Applying to any other namespace has no effect.
oc apply -f kasten-acm-dashboard-cm.yaml -n open-cluster-management-observability
Step 4 — Verify the Loader Detected the ConfigMap
oc logs -n open-cluster-management-observability \
$(oc get pod -n open-cluster-management-observability \
-l app=multicluster-observability-grafana \
-o jsonpath='{.items[0].metadata.name}') \
-c grafana-dashboard-loader --since=2m
Look for a line like:
Successfully updated dashboard kasten-acm-dashboard
Allow 30–60 seconds after applying the ConfigMap before expecting a success line — the loader retries every 10 seconds, and earlier entries in the --since=2m window may show retry attempts from the initial load cycle.
If you see name-exists or 412, see Dashboard Troubleshooting below.
Step 5 — Open ACM Grafana
Navigate to the ACM console → Observe → Dashboards → General → Kasten Multi-Cluster.
Dashboard Variables
The dashboard provides four drop-down variables at the top:
| Variable | Description |
|---|---|
| datasource | Thanos datasource (auto-selected) |
| cluster_name | Filter by managed cluster; All shows aggregate across all clusters |
| policy | Filter backup panels by Kasten policy name |
| app | Filter by application/namespace |
Updating the Dashboard
When a new version of the dashboard is published, or to make local changes:
- Download the latest JSON using the curl command from Step 1, or from the Grafana Cloud Dashboards page.
- Rebuild the ConfigMap YAML using the same method as in Step 2, keeping the same ConfigMap name and the same uid field in the JSON.
- Re-apply:
oc apply -f kasten-acm-dashboard-cm.yaml -n open-cluster-management-observability
The loader detects the ConfigMap update and re-posts the dashboard to Grafana with overwrite: true. The existing dashboard is replaced in-place — no manual deletion needed.
Dashboard Metrics Reference
The following Kasten Prometheus metrics are used in the dashboard. All metrics must flow to the MCO Thanos backend via Prometheus remote_write for the dashboard to display data.
| Metric | Description |
|---|---|
| action_backup_ended_overall | Counter of completed backup actions, labeled by state (success/failed/cancelled) |
| action_backup_ended_count | Count of ended backup actions per policy and application |
| action_backup_duration_seconds_sum_overall | Cumulative backup action duration in seconds |
| action_restore_ended_overall | Counter of completed restore actions, labeled by state |
| action_restore_duration_seconds_sum_overall | Cumulative restore action duration in seconds |
| action_export_ended_overall | Counter of completed export actions, labeled by state |
| action_export_duration_seconds_sum_overall | Cumulative export action duration in seconds |
| action_import_ended_overall | Counter of completed import actions, labeled by state |
| action_import_duration_seconds_sum_overall | Cumulative import action duration in seconds |
| action_retire_ended_overall | Counter of completed retire actions, labeled by state (success/failed/cancelled) |
| catalog_actions_count | Gauge of current actions in the Kasten catalog, labeled by type, status, and namespace |
| catalog_storage_artifact_count | Count of stored artifacts in the Kasten catalog, labeled by category and retirement status |
| compliance_count | Count of policy compliance states across managed applications |
| policies_count | Count of Kasten policies, labeled by action type |
| profiles_count | Count of Kasten location profiles, labeled by status |
| metering_pvc_size | Total PVC capacity allocated to the Kasten catalog storage |
| snapshot_storage_size_bytes | Total local snapshot storage consumed |
| nodes_count | Count of nodes labeled by type (licensed_total, tainted, etc.) for capacity and compliance tracking |
| licenses_info | License metadata including node capacity, license type, and compliance status |
| metering_license_compliance_status | Per-cluster license compliance status (compliant/non-compliant) |
| mc_licenses_info | MCM hub: per-cluster license state with cluster and license_info_state labels; states are locally_installed, nodes_contributed, nodes_leased, expired |
| multicluster_licenses | MCM hub: count of unique licenses across all managed clusters (hub-global, no cluster_name label) |
| multicluster_licensed_nodes_contributed | MCM hub: sum of license node limits across all managed clusters; note this is the sum of limits, not nodes actively contributed (hub-global, no cluster_name label) |
| multicluster_nodes_issued | MCM hub: total licensed nodes issued to managed clusters |
| multicluster_nodes_requested | MCM hub: total licensed nodes requested by managed clusters |
| multicluster_fulfilled_leases | MCM hub: count of fulfilled license leases across managed clusters |
| multicluster_unfulfilled_leases | MCM hub: count of unfulfilled license lease requests; a signal that demand exceeds capacity |
action_backup_ended_overall vs action_backup_ended_count
Both counters increment on the same code path: once per backup action transition to a completed state (succeeded/failed/cancelled). The difference is in labels and initialization:
| | action_backup_ended_overall | action_backup_ended_count |
|---|---|---|
| Labels | state only | app, policy, state |
| Pre-initialized at startup | Yes, one count per completion state | No, only increments on real events |
The _overall metric is pre-populated at Kasten startup so that all state label combinations exist in Prometheus before any backups run. That initialization is a real counter increment, which means increase() over a window that includes a Kasten restart will show +1 per state regardless of actual backup activity. The dashboard uses action_backup_ended_count in per-cluster and per-application table panels for accurate counts, and _overall only in headline stat panels where the pre-initialization artifact is a minor cosmetic issue.
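The pre-initialization artifact can be illustrated with a simplified model of how a reset-aware increase is computed. This is a sketch under stated assumptions: simple_increase is a hypothetical helper that sums deltas between consecutive samples and counts the full post-reset value after a decrease. The real Prometheus increase() function additionally extrapolates to the window boundaries, and the sample values are made up.

```python
def simple_increase(samples: list[float]) -> float:
    """Sum of increases between consecutive samples, treating a drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter reset: the series restarted from zero, so the whole
            # post-reset value counts as an increase.
            total += cur
    return total


# Hypothetical scrapes of action_backup_ended_overall{state="failed"} across a
# Kasten restart with no real failures: the counter restarts and the
# pre-initialization increments it to 1.
overall_failed = [5, 5, 1, 1]
print(simple_increase(overall_failed))

# Hypothetical scrapes of action_backup_ended_count for the same period: the
# series is not re-created until a real event occurs, so no new samples arrive.
count_failed = [5, 5]
print(simple_increase(count_failed))
```

In this model the _overall series reports an increase of 1 even though no backup failed, while the _count series reports 0, which matches the behavior described above.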
cluster_name
If using User Workload Monitoring (UWM) federation via kasten-k10-acm-servicemonitor.yaml, the ServiceMonitor contains a metricRelabelings entry that overrides cluster_name on the UWM scrape path. Before applying it to each cluster, replace the placeholder value with the correct ACM managed cluster name:
replacement: REPLACE_WITH_ACM_CLUSTER_NAME
Download and apply the ServiceMonitor:
curl -sO https://docs.kasten.io/downloads/8.5.9/prometheus/alerts/acm/kasten-k10-acm-servicemonitor.yaml
# Edit the file and set replacement: <your-acm-cluster-name>
oc apply -f kasten-k10-acm-servicemonitor.yaml -n kasten-io
Do not reuse a ServiceMonitor file that was previously applied to another cluster — it will carry the old cluster name and produce a duplicate metric stream in Thanos.
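One way to guard against reusing an already-edited file is to perform the substitution in a script that refuses to run when the placeholder is missing. This is a minimal sketch: the function name and the cluster name prod-east-1 are hypothetical, and the fragment stands in for the downloaded ServiceMonitor file.

```python
PLACEHOLDER = "REPLACE_WITH_ACM_CLUSTER_NAME"


def set_cluster_name(servicemonitor_yaml: str, cluster_name: str) -> str:
    """Replace the placeholder, failing if the file was already edited."""
    if PLACEHOLDER not in servicemonitor_yaml:
        raise ValueError(
            "placeholder not found; this file may already carry another cluster's name"
        )
    return servicemonitor_yaml.replace(PLACEHOLDER, cluster_name)


# Fragment standing in for the freshly downloaded file
fragment = "replacement: REPLACE_WITH_ACM_CLUSTER_NAME\n"
edited = set_cluster_name(fragment, "prod-east-1")
print(edited, end="")
```

Passing the edited text back through the function raises ValueError, which is the signal to download a fresh copy for the next cluster.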
Dashboard Troubleshooting
Dashboard not appearing after 5 minutes:
oc logs -n open-cluster-management-observability \
$(oc get pod -n open-cluster-management-observability \
-l app=multicluster-observability-grafana \
-o jsonpath='{.items[0].metadata.name}') \
-c grafana-dashboard-loader --since=10m
| Log message | Cause | Fix |
|---|---|---|
| No log lines mentioning kasten | Loader did not detect the ConfigMap | Check the ConfigMap label: oc get cm kasten-acm-dashboard -n open-cluster-management-observability -o jsonpath='{.metadata.labels}'; it must have grafana-custom-dashboard: "true" |
| name-exists / 412 | Dashboard with same title but different uid exists in Grafana | Delete the conflicting dashboard from the Grafana UI, then delete and re-apply the ConfigMap |
| version-mismatch / 412 | Dashboard uid exists with a version mismatch | Loader retries automatically with overwrite: true; wait for a retry cycle (~40 retries × 10 s) |
| context deadline exceeded | Grafana pod not ready | Wait for Grafana to fully start; the loader will retry |
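When the log output is long, the table above can be applied mechanically by matching each line against the listed substrings. This is a rough sketch: the function name and category labels are hypothetical, and the sample log lines are invented for illustration. Only the matched substrings and the success message come from this page, and the loader's exact log format may differ.

```python
def classify_loader_line(line: str) -> str:
    """Map a loader log line to one of the cases in the troubleshooting table."""
    if "Successfully updated dashboard" in line:
        return "ok"
    if "name-exists" in line:
        return "title-conflict"
    if "version-mismatch" in line:
        return "version-conflict"
    if "context deadline exceeded" in line:
        return "grafana-not-ready"
    if "412" in line:
        # A bare 412 does not say which of the two conflict cases applies.
        return "conflict-unspecified"
    return "other"


# Invented lines; only the success message matches text quoted on this page.
sample_lines = [
    "Successfully updated dashboard kasten-acm-dashboard",
    "failed to create dashboard: name-exists (412)",
    "request failed: context deadline exceeded",
]
results = [classify_loader_line(line) for line in sample_lines]
print(results)
```

Specific messages are checked before the bare 412 test because both conflict rows in the table share that status code.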
Check if a conflicting dashboard exists:
Port-forward to Grafana and query the API:
oc port-forward -n open-cluster-management-observability \
$(oc get pod -n open-cluster-management-observability \
-l app=multicluster-observability-grafana \
-o jsonpath='{.items[0].metadata.name}') 3001:3001 &
curl -s "http://localhost:3001/api/search?query=Kasten" \
-H "X-Forwarded-User: WHAT_YOU_ARE_DOING_IS_VOIDING_SUPPORT_0000000000000000000000000000000000000000000000000000000000000000"
If multiple results appear, use the Grafana UI to delete the one whose uid does not match the uid field in the downloaded JSON.
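The comparison can also be done on the saved search response. This is a minimal sketch that assumes the response is the JSON array returned by the Grafana search API, where each entry carries uid and title fields. The function name and all uid values shown are placeholders.

```python
import json


def find_conflicting_dashboards(search_results: list[dict], expected_uid: str) -> list[dict]:
    """Return search hits whose uid differs from the uid in the downloaded JSON."""
    return [hit for hit in search_results if hit.get("uid") != expected_uid]


# Placeholder search response; real uid values will differ.
response_text = json.dumps([
    {"uid": "example-uid", "title": "Kasten Multi-Cluster"},
    {"uid": "stale-uid", "title": "Kasten Multi-Cluster"},
])
conflicts = find_conflicting_dashboards(json.loads(response_text), "example-uid")
for hit in conflicts:
    print(hit["uid"], hit["title"])
```

The expected uid is the top-level uid value from the dashboard JSON downloaded in Step 1.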
PV Utilization panel shows "No data" but kubelet metrics exist in Thanos:
ACM's metrics-collector labels kubelet metrics with cluster, not cluster_name. The dashboard queries kubelet panels using cluster=~"$cluster_name" for this reason. If you have customized the dashboard JSON, ensure kubelet metric queries filter on the cluster label, not cluster_name.
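For a customized dashboard, a quick scan of the JSON can flag kubelet queries that still filter on cluster_name. This sketch assumes panel queries are stored as strings under expr keys, as in Grafana's Prometheus targets. The function name is hypothetical and the two-panel dashboard is a made-up example.

```python
import re

# Matches cluster_name used as a label matcher (followed by =, !=, =~ or !~),
# but not the "$cluster_name" variable reference inside a quoted value.
LABEL_MATCHER = re.compile(r"\bcluster_name\s*(=~|!~|!=|=)")


def find_bad_kubelet_exprs(node) -> list[str]:
    """Recursively collect kubelet queries that filter on the cluster_name label."""
    bad = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                if "kubelet_volume_stats" in value and LABEL_MATCHER.search(value):
                    bad.append(value)
            else:
                bad.extend(find_bad_kubelet_exprs(value))
    elif isinstance(node, list):
        for item in node:
            bad.extend(find_bad_kubelet_exprs(item))
    return bad


# Made-up dashboard: the first query is correct, the second is not.
dashboard = {"panels": [
    {"targets": [{"expr": 'kubelet_volume_stats_used_bytes{cluster=~"$cluster_name"}'}]},
    {"targets": [{"expr": 'kubelet_volume_stats_capacity_bytes{cluster_name=~"$cluster_name"}'}]},
]}
bad_exprs = find_bad_kubelet_exprs(dashboard)
print(bad_exprs)
```

The check is per expression, so a query that combines a kubelet metric with a Kasten metric filtered on cluster_name would be flagged even if the kubelet selector is correct. Treat hits as candidates to review rather than confirmed errors.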
All panels show "No data":
Kasten metrics are not yet flowing to MCO Thanos. Check remote_write on each Kasten cluster:
oc logs -n kasten-io \
$(oc get pod -n kasten-io -l app=prometheus,release=k10 \
-o jsonpath='{.items[0].metadata.name}') \
-c prometheus-server | grep "remote_write\|send"
If using the Thanos Receive direct endpoint, verify the port. The remote-write port is typically 19291 — port 10901 is gRPC and returns a protocol error. Refer to Troubleshooting Metrics Collection for the full port reference.