Version: 8.5.9 (latest)

Deploying the Kasten ACM Dashboard

Once Kasten metrics are flowing to the ACM Thanos backend (see Step 4: Verifying Metrics Collection), you can deploy the Kasten multi-cluster Grafana dashboard into the ACM Observability Grafana instance. The dashboard provides a unified view of backup jobs, restore points, storage usage, and policy status across all clusters managed by RHACM.

What You Get

A single multi-cluster Kasten dashboard in Grafana, accessible from ACM Observe → Dashboards → General, with the following data and features:

  • Per-cluster selector that filters all panels by protected Kasten cluster
  • Policy and application selectors
  • Backup job history, success/failure rates, and restore point counts
  • Local snapshot storage usage and PVC utilization
  • License usage and consumption for single-cluster Kasten deployments and for Multi-Cluster Manager license leasing deployments

How Dashboard Injection Works

MCO runs a dashboard loader sidecar (grafana-dashboard-loader) inside the Grafana pod. The loader watches for ConfigMaps in the open-cluster-management-observability namespace that carry the label grafana-custom-dashboard: "true" and automatically posts them to Grafana's internal API. This is the supported injection path — no direct Grafana API access is required.

ConfigMap (open-cluster-management-observability)
label: grafana-custom-dashboard: "true"
        ↓
grafana-dashboard-loader sidecar detects ConfigMap
        ↓
Loader POSTs dashboard JSON to Grafana API (internal)
        ↓
Dashboard appears in ACM Observe → Dashboards → General

The loader retries up to 40 times at 10-second intervals if the upload fails, so it is safe to apply the ConfigMap before Grafana is fully ready.
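The retry budget described above works out to roughly 40 × 10 s = 400 s (about 6 minutes 40 seconds) of waiting before the loader gives up. The sketch below illustrates that fixed-interval retry pattern. It is not the loader's actual code: `retry_upload`, `upload`, and the injectable `sleep` parameter are hypothetical names used only for illustration.

```python
import time
from typing import Callable

MAX_RETRIES = 40        # retry count stated in the text above
RETRY_INTERVAL_S = 10   # retry interval stated in the text above


def retry_upload(upload: Callable[[], bool],
                 retries: int = MAX_RETRIES,
                 interval_s: float = RETRY_INTERVAL_S,
                 sleep: Callable[[float], None] = time.sleep) -> int:
    """Call upload() until it succeeds; return the number of attempts made.

    Raises TimeoutError once the initial attempt and all retries have failed.
    """
    for attempt in range(1, retries + 2):  # 1 initial attempt + `retries` retries
        if upload():
            return attempt
        if attempt <= retries:
            sleep(interval_s)  # wait before the next retry
    raise TimeoutError(f"upload still failing after {retries} retries")


def worst_case_wait_s(retries: int = MAX_RETRIES,
                      interval_s: float = RETRY_INTERVAL_S) -> float:
    """Total time spent sleeping if every attempt fails."""
    return retries * interval_s
```

Passing a no-op `sleep` makes the pattern easy to exercise without waiting, which is why it is a parameter here.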

Additional Metrics Allowlist

The Kasten ACM Dashboard uses several metrics beyond the core set added to the allowlist in Step 1 of the ACM Observability setup. Before deploying the dashboard, update the allowlist ConfigMap on the Hub cluster to include these additional entries:

Hub Cluster

Run this command on the ACM Hub cluster.

oc apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  uwl_metrics_list.yaml: |
    names:
      - action_backup_ended_overall
      - action_backup_ended_count
      - action_backup_duration_seconds_sum_overall
      - action_restore_ended_overall
      - action_restore_duration_seconds_sum_overall
      - action_export_ended_overall
      - action_export_duration_seconds_sum_overall
      - action_import_ended_overall
      - action_import_duration_seconds_sum_overall
      - action_retire_ended_overall
      - catalog_actions_count
      - catalog_storage_artifact_count
      - compliance_count
      - licenses_info
      - mc_licenses_info
      - metering_license_compliance_status
      - metering_pvc_size
      - multicluster_fulfilled_leases
      - multicluster_licensed_nodes_contributed
      - multicluster_licenses
      - multicluster_nodes_issued
      - multicluster_nodes_requested
      - multicluster_unfulfilled_leases
      - nodes_count
      - policies_count
      - profiles_count
      - snapshot_storage_size_bytes
  metrics_list.yaml: |
    names:
      - kubelet_volume_stats_used_bytes
      - kubelet_volume_stats_capacity_bytes
EOF
info

Kasten metrics are collected via User Workload Monitoring (UWM) and must be listed under uwl_metrics_list.yaml. Kubelet PV metrics (kubelet_volume_stats_*) are collected via the standard metrics-collector path and go under metrics_list.yaml. Without the kubelet entries, the PV Capacity panels show "No data." If the observability-metrics-custom-allowlist ConfigMap already exists, merge these entries into the existing sections rather than replacing the whole ConfigMap.

ACM's metrics-collector picks up allowlist changes automatically — no restarts are needed.
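When the allowlist ConfigMap already exists, the merge mentioned above amounts to an order-preserving, de-duplicating union of each `names` list. A minimal sketch of that step follows; `merge_metric_names` is a hypothetical helper, and the "existing" list is made up for illustration rather than taken from any real cluster.

```python
from typing import Iterable, List


def merge_metric_names(existing: Iterable[str], additions: Iterable[str]) -> List[str]:
    """Return the existing names followed by any additions not already present."""
    merged: List[str] = []
    seen = set()
    for name in list(existing) + list(additions):
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


# Hypothetical existing section, plus two entries from the list above.
existing_names = ["catalog_actions_count", "action_backup_ended_overall"]
new_names = ["action_backup_ended_overall", "snapshot_storage_size_bytes"]
print(merge_metric_names(existing_names, new_names))
```

The same merge would be applied separately to the `uwl_metrics_list.yaml` and `metrics_list.yaml` sections, since they feed different collection paths.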

Step 1 — Download the Dashboard JSON

Download the dashboard JSON:

curl -sO https://docs.kasten.io/downloads/8.5.9/prometheus/dashboards/grafana/kasten-k10-acm-dashboard-slim.json

The dashboard is also published to Grafana Cloud Dashboards if you need the latest upstream version.

Grafana Cloud download requires manual edits

If you download from Grafana Cloud instead of using the curl command above, you must edit the JSON before wrapping it in a ConfigMap:

  • Remove the __inputs, __elements, and __requires blocks entirely — these are present in all Grafana Cloud downloads and will cause import errors in MCO Grafana.
  • Set the top-level "id" field to null.
  • Keep the top-level "uid" field exactly as downloaded.

The version available via curl from docs.kasten.io has these edits pre-applied.
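For a Grafana Cloud download, the three manual edits listed above can be scripted. The following is a minimal sketch using only the Python standard library; `prepare_dashboard` and the command-line file names are hypothetical, and the result should still be reviewed before it is wrapped in a ConfigMap.

```python
import copy
import json
import sys

# Export-only blocks named in the list above.
EXPORT_ONLY_KEYS = ("__inputs", "__elements", "__requires")


def prepare_dashboard(dashboard: dict) -> dict:
    """Return a copy with export-only blocks removed and "id" set to null."""
    cleaned = copy.deepcopy(dashboard)
    for key in EXPORT_ONLY_KEYS:
        cleaned.pop(key, None)
    cleaned["id"] = None  # serialized by json.dump as null
    # "uid" is deliberately left exactly as downloaded.
    return cleaned


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (placeholder file names): python3 prepare_dashboard.py in.json out.json
    with open(sys.argv[1], encoding="utf-8") as src:
        data = json.load(src)
    with open(sys.argv[2], "w", encoding="utf-8") as dst:
        json.dump(prepare_dashboard(data), dst, indent=2)
```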

Step 2 — Wrap the JSON in a ConfigMap

MCO's dashboard loader requires the JSON to be delivered as a ConfigMap with specific labels. Create a file named kasten-acm-dashboard-cm.yaml.

A quick way to produce a correctly structured ConfigMap from the JSON:

oc create configmap kasten-acm-dashboard \
  --from-file=kasten-acm-dashboard.json=./kasten-k10-acm-dashboard-slim.json \
  -n open-cluster-management-observability \
  --dry-run=client -o yaml | \
  python3 -c "import sys; c=sys.stdin.read(); c=c.replace('metadata:\n', 'metadata:\n  labels:\n    grafana-custom-dashboard: \"true\"\n    general-folder: \"true\"\n', 1); print(c, end='')" \
  > kasten-acm-dashboard-cm.yaml

Review the output before applying to confirm the labels and namespace are correct.

warning

Do not use Red Hat's generate-dashboard-configmap-yaml.sh tool — it strips the uid field and causes 412 conflicts on subsequent updates.

The resulting ConfigMap must look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kasten-acm-dashboard
  namespace: open-cluster-management-observability
  labels:
    grafana-custom-dashboard: "true"   # Required: loader checks for this exact key/value
    general-folder: "true"             # Places dashboard in General folder
data:
  kasten-acm-dashboard.json: |
    { ... dashboard JSON ... }
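Because `oc apply -f` accepts JSON manifests as well as YAML, the same structure can also be generated without any text substitution. The sketch below builds the ConfigMap as a Python dict using only the standard library. `build_configmap` is a hypothetical helper, and the stand-in dashboard string is illustrative; a real run would read the downloaded JSON file instead.

```python
import json

NAMESPACE = "open-cluster-management-observability"


def build_configmap(dashboard_json: str,
                    name: str = "kasten-acm-dashboard",
                    data_key: str = "kasten-acm-dashboard.json") -> dict:
    """Wrap raw dashboard JSON text in a ConfigMap manifest (as a dict)."""
    json.loads(dashboard_json)  # fail early if the dashboard is not valid JSON
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": NAMESPACE,
            "labels": {
                "grafana-custom-dashboard": "true",  # required by the loader
                "general-folder": "true",            # places it in General
            },
        },
        "data": {data_key: dashboard_json},
    }


# Stand-in dashboard for illustration only.
manifest = build_configmap('{"uid": "example-uid", "title": "Example"}')
print(json.dumps(manifest, indent=2))
```

The dashboard text is stored unchanged, so the `uid` field is preserved exactly as in the source JSON.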

Step 3 — Apply the ConfigMap to the Hub Cluster

Hub Cluster

Run this command on the ACM Hub cluster. The namespace must be open-cluster-management-observability — the loader only watches its own namespace. Applying to any other namespace has no effect.

oc apply -f kasten-acm-dashboard-cm.yaml -n open-cluster-management-observability

Step 4 — Verify the Loader Detected the ConfigMap

oc logs -n open-cluster-management-observability \
$(oc get pod -n open-cluster-management-observability \
-l app=multicluster-observability-grafana \
-o jsonpath='{.items[0].metadata.name}') \
-c grafana-dashboard-loader --since=2m

Look for a line like:

Successfully updated dashboard kasten-acm-dashboard

Allow 30–60 seconds after applying the ConfigMap before expecting a success line — the loader retries every 10 seconds, and earlier entries in the --since=2m window may show retry attempts from the initial load cycle.

If you see name-exists or 412, see Dashboard Troubleshooting below.

Step 5 — Open ACM Grafana

Navigate to the ACM console → Observe → Dashboards → General → Kasten Multi-Cluster.

Dashboard Variables

The dashboard provides four drop-down variables at the top:

| Variable | Description |
| --- | --- |
| datasource | Thanos datasource (auto-selected) |
| cluster_name | Filter by managed cluster; All shows the aggregate across all clusters |
| policy | Filter backup panels by Kasten policy name |
| app | Filter by application/namespace |

Updating the Dashboard

To pick up a newly published version of the dashboard, or to apply local changes:

  1. Download the latest JSON using the curl command from Step 1, or from the Grafana Cloud Dashboards page.

  2. Rebuild the ConfigMap YAML using the same method in Step 2, keeping the same ConfigMap name and uid field from the JSON.

  3. Re-apply:

    oc apply -f kasten-acm-dashboard-cm.yaml -n open-cluster-management-observability

The loader detects the ConfigMap update and re-posts the dashboard to Grafana with overwrite: true. The existing dashboard is replaced in-place — no manual deletion needed.

Dashboard Metrics Reference

The following Kasten Prometheus metrics are used in the dashboard. All metrics must flow to the MCO Thanos backend via Prometheus remote_write for the dashboard to display data.

| Metric | Description |
| --- | --- |
| action_backup_ended_overall | Counter of completed backup actions, labeled by state (success/failed/cancelled) |
| action_backup_ended_count | Count of ended backup actions per policy and application |
| action_backup_duration_seconds_sum_overall | Cumulative backup action duration in seconds |
| action_restore_ended_overall | Counter of completed restore actions, labeled by state |
| action_restore_duration_seconds_sum_overall | Cumulative restore action duration in seconds |
| action_export_ended_overall | Counter of completed export actions, labeled by state |
| action_export_duration_seconds_sum_overall | Cumulative export action duration in seconds |
| action_import_ended_overall | Counter of completed import actions, labeled by state |
| action_import_duration_seconds_sum_overall | Cumulative import action duration in seconds |
| action_retire_ended_overall | Counter of completed retire actions, labeled by state (success/failed/cancelled) |
| catalog_actions_count | Gauge of current actions in the Kasten catalog, labeled by type, status, and namespace |
| catalog_storage_artifact_count | Count of stored artifacts in the Kasten catalog, labeled by category and retirement status |
| compliance_count | Count of policy compliance states across managed applications |
| policies_count | Count of Kasten policies, labeled by action type |
| profiles_count | Count of Kasten location profiles, labeled by status |
| metering_pvc_size | Total PVC capacity allocated to the Kasten catalog storage |
| snapshot_storage_size_bytes | Total local snapshot storage consumed |
| nodes_count | Count of nodes labeled by type (licensed_total, tainted, etc.) for capacity and compliance tracking |
| licenses_info | License metadata including node capacity, license type, and compliance status |
| metering_license_compliance_status | Per-cluster license compliance status (compliant/non-compliant) |
| mc_licenses_info | MCM hub: per-cluster license state with cluster and license_info_state labels; states are locally_installed, nodes_contributed, nodes_leased, expired |
| multicluster_licenses | MCM hub: count of unique licenses across all managed clusters (hub-global, no cluster_name label) |
| multicluster_licensed_nodes_contributed | MCM hub: sum of license node limits across all managed clusters; note this is the sum of limits, not nodes actively contributed (hub-global, no cluster_name label) |
| multicluster_nodes_issued | MCM hub: total licensed nodes issued to managed clusters |
| multicluster_nodes_requested | MCM hub: total licensed nodes requested by managed clusters |
| multicluster_fulfilled_leases | MCM hub: count of fulfilled license leases across managed clusters |
| multicluster_unfulfilled_leases | MCM hub: count of unfulfilled license lease requests (a signal that demand exceeds capacity) |

action_backup_ended_overall vs action_backup_ended_count

Both counters increment on the same code path — once per backup action transition to a completed state (succeeded/failed/cancelled). The difference is in labels and initialization:

| | action_backup_ended_overall | action_backup_ended_count |
| --- | --- | --- |
| Labels | state only | app, policy, state |
| Pre-initialized at startup | Yes: one count per completion state | No: only increments on real events |

The _overall metric is pre-populated at Kasten startup so that all state label combinations exist in Prometheus before any backups run. That initialization is a real counter increment, which means increase() over a window that includes a Kasten restart will show +1 per state regardless of actual backup activity. The dashboard uses action_backup_ended_count in per-cluster and per-application table panels for accurate counts, and _overall only in headline stat panels where the pre-initialization artifact is a minor cosmetic issue.
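The pre-initialization artifact can be illustrated with a simplified model of `increase()`. The sketch below sums sample-to-sample deltas and treats any drop as a counter reset, which is the core of Prometheus's reset handling; it ignores the extrapolation Prometheus applies at window boundaries. The sample values are hypothetical, and it assumes the pre-initialized counter reappears at 1 after a restart, as the "+1 per state" behavior described above implies.

```python
from typing import Sequence


def simple_increase(samples: Sequence[float]) -> float:
    """Approximate increase(): sum of deltas, treating any drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter reset: the counter restarted from zero, so the whole
            # post-reset value counts as new increase.
            total += cur
    return total


# One state label: steady at 5, then Kasten restarts and the pre-initialized
# counter reappears at 1 although no backups ran.
print(simple_increase([5, 5, 5, 1, 1]))  # 1.0

# Same restart, plus one real backup before it and one after it.
print(simple_increase([5, 6, 1, 1, 2]))  # 3.0, although only 2 backups ran
```

Under this model the extra +1 appears only in windows that span a restart, which matches why the artifact is tolerable in headline stat panels but not in per-application tables.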

ServiceMonitor and cluster_name

If using User Workload Monitoring (UWM) federation via kasten-k10-acm-servicemonitor.yaml, the ServiceMonitor contains a metricRelabelings entry that overrides cluster_name on the UWM scrape path. Before applying it to each cluster, replace the placeholder value with the correct ACM managed cluster name:

replacement: REPLACE_WITH_ACM_CLUSTER_NAME

Download and apply the ServiceMonitor:

curl -sO https://docs.kasten.io/downloads/8.5.9/prometheus/alerts/acm/kasten-k10-acm-servicemonitor.yaml
# Edit the file and set replacement: <your-acm-cluster-name>
oc apply -f kasten-k10-acm-servicemonitor.yaml -n kasten-io

Do not reuse a ServiceMonitor file that was previously applied to another cluster — it will carry the old cluster name and produce a duplicate metric stream in Thanos.
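To reduce the risk of applying a file that still carries the placeholder or another cluster's name, the substitution can be scripted against a freshly downloaded copy. The following is a minimal sketch: `set_cluster_name` is a hypothetical helper, and the cluster name `prod-east-1` is made up for illustration.

```python
PLACEHOLDER = "REPLACE_WITH_ACM_CLUSTER_NAME"


def set_cluster_name(manifest_text: str, cluster_name: str) -> str:
    """Replace the placeholder; refuse to proceed if it is missing."""
    if not cluster_name.strip():
        raise ValueError("cluster name must not be empty")
    if PLACEHOLDER not in manifest_text:
        # A missing placeholder suggests the file was already edited,
        # possibly for a different cluster.
        raise ValueError("placeholder not found; start from a fresh download")
    return manifest_text.replace(PLACEHOLDER, cluster_name)


# Hypothetical fragment and cluster name, for illustration only.
fragment = "replacement: REPLACE_WITH_ACM_CLUSTER_NAME\n"
print(set_cluster_name(fragment, "prod-east-1"), end="")
```

Failing when the placeholder is absent is a deliberate choice here: it turns the "reused file" mistake into an error instead of a silent duplicate metric stream.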

Dashboard Troubleshooting

Dashboard not appearing after 5 minutes:

oc logs -n open-cluster-management-observability \
$(oc get pod -n open-cluster-management-observability \
-l app=multicluster-observability-grafana \
-o jsonpath='{.items[0].metadata.name}') \
-c grafana-dashboard-loader --since=10m

| Log message | Cause | Fix |
| --- | --- | --- |
| No log lines mentioning kasten | Loader did not detect the ConfigMap | Check the ConfigMap label: oc get cm kasten-acm-dashboard -n open-cluster-management-observability -o jsonpath='{.metadata.labels}'. It must include grafana-custom-dashboard: "true" |
| name-exists / 412 | Dashboard with same title but different uid exists in Grafana | Delete the conflicting dashboard from the Grafana UI, then delete and re-apply the ConfigMap |
| version-mismatch / 412 | Dashboard uid exists with a version mismatch | Loader retries automatically with overwrite: true; wait for a retry cycle (~40 retries × 10 s) |
| context deadline exceeded | Grafana pod not ready | Wait for Grafana to fully start; the loader will retry |
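The lookup in the table above can be applied to saved loader output mechanically. The sketch below matches only the substrings listed in the table; the exact log line format is an assumption, and `diagnose` is a hypothetical helper, not part of any Kasten or ACM tooling.

```python
# (substring to look for, short diagnosis), taken from the table above.
KNOWN_MESSAGES = (
    ("name-exists", "title conflict: same title, different uid; delete the conflicting dashboard"),
    ("version-mismatch", "version mismatch: the loader retries with overwrite, wait for a retry cycle"),
    ("context deadline exceeded", "Grafana pod not ready: the loader will retry"),
)


def diagnose(log_text: str) -> str:
    """Return a short diagnosis for loader log output."""
    lines = log_text.splitlines()
    for marker, diagnosis in KNOWN_MESSAGES:
        if any(marker in line for line in lines):
            return diagnosis
    if not any("kasten" in line.lower() for line in lines):
        return "ConfigMap not detected: check the grafana-custom-dashboard label"
    return "no known error message found"


# Hypothetical log excerpt for illustration.
print(diagnose("failed to import dashboard: name-exists (412)"))
```

Output from the `oc logs` command could be redirected to a file and passed to `diagnose` as a string.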

Check if a conflicting dashboard exists:

Port-forward to Grafana and query the API:

oc port-forward -n open-cluster-management-observability \
$(oc get pod -n open-cluster-management-observability \
-l app=multicluster-observability-grafana \
-o jsonpath='{.items[0].metadata.name}') 3001:3001 &

curl -s "http://localhost:3001/api/search?query=Kasten" \
-H "X-Forwarded-User: WHAT_YOU_ARE_DOING_IS_VOIDING_SUPPORT_0000000000000000000000000000000000000000000000000000000000000000"

If multiple results appear, delete the one whose uid does not match the downloaded JSON's uid field from the Grafana UI.
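The comparison described above can be done on the JSON array returned by Grafana's search endpoint. A minimal sketch follows, assuming the standard search response fields `uid`, `title`, and `type`; `conflicting_dashboards` is a hypothetical helper and the response body is made up for illustration.

```python
import json
from typing import List


def conflicting_dashboards(search_results: List[dict], expected_uid: str) -> List[dict]:
    """Return dashboard hits whose uid differs from the uid in the downloaded JSON."""
    return [
        hit for hit in search_results
        if hit.get("type", "dash-db") == "dash-db"  # skip folders
        and hit.get("uid") != expected_uid
    ]


# Hypothetical response body; real uids and titles will differ.
response_body = """[
  {"uid": "uid-from-json", "title": "Kasten Multi-Cluster", "type": "dash-db"},
  {"uid": "stale-uid", "title": "Kasten Multi-Cluster", "type": "dash-db"}
]"""
for hit in conflicting_dashboards(json.loads(response_body), "uid-from-json"):
    print(hit["uid"], hit["title"])
```

The expected uid would be read from the top-level `uid` field of the dashboard JSON downloaded in Step 1.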

PV Utilization panel shows "No data" but kubelet metrics exist in Thanos:

ACM's metrics-collector labels kubelet metrics with cluster, not cluster_name. The dashboard queries kubelet panels using cluster=~"$cluster_name" for this reason. If you have customized the dashboard JSON, ensure kubelet metric queries filter on the cluster label, not cluster_name.
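For a customized dashboard, this check can be automated by scanning panel queries. The sketch below assumes the usual Grafana dashboard layout (`panels`, optionally nested under row panels, each with `targets[].expr`); `kubelet_queries_on_wrong_label` is a hypothetical helper and the sample dashboard is illustrative.

```python
import re
from typing import Iterator, List

# Matches cluster_name used as a label matcher (=, =~, !=, !~), but not the
# Grafana variable reference "$cluster_name".
CLUSTER_NAME_MATCHER = re.compile(r"(?<!\$)\bcluster_name\s*(=~|!~|!=|=)")


def iter_panels(panels: List[dict]) -> Iterator[dict]:
    """Yield panels, descending into row panels that nest their children."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def kubelet_queries_on_wrong_label(dashboard: dict) -> List[str]:
    """Return titles of panels whose kubelet queries filter on cluster_name."""
    flagged = []
    for panel in iter_panels(dashboard.get("panels", [])):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if "kubelet_volume_stats_" in expr and CLUSTER_NAME_MATCHER.search(expr):
                flagged.append(panel.get("title", "<untitled>"))
                break  # one hit per panel is enough
    return flagged


# Illustrative dashboard: the first panel is correct, the second is not.
sample = {"panels": [
    {"title": "PV OK", "targets": [
        {"expr": 'kubelet_volume_stats_used_bytes{cluster=~"$cluster_name"}'}]},
    {"title": "PV broken", "targets": [
        {"expr": 'kubelet_volume_stats_used_bytes{cluster_name=~"$cluster_name"}'}]},
]}
print(kubelet_queries_on_wrong_label(sample))
```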

All panels show "No data":

Kasten metrics are not yet flowing to MCO Thanos. Check remote_write on each Kasten cluster:

oc logs -n kasten-io \
  $(oc get pod -n kasten-io -l app=prometheus,release=k10 \
  -o jsonpath='{.items[0].metadata.name}') \
  -c prometheus-server | grep -E "remote_write|send"

If using the Thanos Receive direct endpoint, verify the port. The remote-write port is typically 19291 — port 10901 is gRPC and returns a protocol error. Refer to Troubleshooting Metrics Collection for the full port reference.
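A quick sanity check of a configured remote-write URL against the two ports mentioned above might look like the following sketch. `check_remote_write_url` is a hypothetical helper, the example hostname is made up, and 19291 is treated only as the typical port, so any other value is reported for manual confirmation and not rejected.

```python
from urllib.parse import urlsplit

REMOTE_WRITE_PORT = 19291  # typical Thanos Receive remote-write port (per the text above)
GRPC_PORT = 10901          # gRPC port; not usable for remote_write


def check_remote_write_url(url: str) -> str:
    """Classify the port of a remote-write URL."""
    port = urlsplit(url).port
    if port == GRPC_PORT:
        return "wrong port: 10901 is gRPC and returns a protocol error"
    if port == REMOTE_WRITE_PORT:
        return "ok: typical remote-write port"
    if port is None:
        return "no explicit port: confirm the endpoint routes to remote-write"
    return f"unusual port {port}: confirm against the port reference"


# Hypothetical endpoint for illustration only.
print(check_remote_write_url("http://thanos-receive.example.internal:10901/api/v1/receive"))
```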