Version: 8.5.8 (latest)

Red Hat ACM Observability

Veeam Kasten can be integrated with Red Hat Advanced Cluster Management (ACM) Observability Service to provide centralized monitoring across your Red Hat OpenShift fleet. This integration leverages Prometheus remote_write to push metrics from the Kasten cluster to the ACM Hub.

ACM Observability uses Observatorium — an open-source, multi-tenant metrics and logs platform built on Thanos — as its API gateway. The Observatorium API handles authentication, tenant routing, and forwards metrics to the underlying Thanos Receive component for storage.

Overview

Kasten supports two connectivity modes for pushing metrics to the ACM Observatorium API:

| Mode | When to use | Auth | Protocol |
| --- | --- | --- | --- |
| Same-cluster | Kasten is installed on the ACM Hub cluster | THANOS-TENANT header (auto-injected) | HTTP |
| Cross-cluster (mTLS) | Kasten is on a managed cluster, writing to the Hub | Client certificate | HTTPS |
info

The Observatorium API uses the URL path to identify the tenant and routes requests internally. In same-cluster mode, Kasten writes directly to Thanos Receive (bypassing the Observatorium API), so a THANOS-TENANT header is required. In cross-cluster mode, Kasten writes through the Observatorium API, which extracts the tenant from the URL path and handles routing automatically.

Prerequisites

  • Red Hat ACM installed on the Hub cluster.
  • MultiClusterObservability enabled and configured on the ACM Hub.
  • Veeam Kasten installed (or ready to be installed) on the target cluster.
  • On OpenShift: Kasten must be installed with scc.create: true so that the Prometheus pod can run. Without this, the pod fails with SCC errors (runAsUser / fsGroup not in the allowed range).
  • For cross-cluster (mTLS) mode only:
    • A client certificate signed by the ACM Observability client CA.
    • The client certificate must include a Subject Alternative Name (SAN) and have OU=acm in the subject.
    • See Step 2 below.

Step 1: Gather ACM Configuration

Ensure that the ACM Observability service is running and gather the necessary connection details. You can verify service availability and retrieve configuration details either through the Red Hat ACM UI or via the CLI as described below.

Retrieve Configuration via Web Console

  1. Verify Installation:

    • Log in to your OpenShift Console on the Hub Cluster.
    • Ensure you are in the local-cluster view.
    • Click Search in the left navigation menu.
    • In the Resources dropdown, type and select MultiClusterObservability.
    • Click on the observability resource instance.
    • Ensure the status is Ready.
  2. Find the Tenant ID:

    • Navigate to Infrastructure > Clusters.
    • Select the local-cluster (the Hub cluster).
    • On the Overview tab, locate the Cluster ID. This is your Tenant ID.

Retrieve Configuration via CLI

Hub Cluster

All commands in Step 1 must be run on the ACM Hub cluster.

  1. Verify the Observability Service:

    oc get multiclusterobservability -n open-cluster-management-observability

    Expected output (verify the status is Ready):

    NAME            STATUS   AGE
    observability   Ready    14d
  2. Identify the Remote Write URL — choose the URL that matches your deployment:

    • Same-cluster (HTTP) — writes directly to Thanos Receive: http://observability-observatorium-api.open-cluster-management-observability.svc:8080/api/v1/receive

    • Cross-cluster (HTTPS / mTLS) — writes through the Observatorium API: https://observability-observatorium-api.open-cluster-management-observability.svc:8080/api/metrics/v1/default/api/v1/receive

      If Kasten cannot reach the internal service, use the external route instead:

      ROUTE_HOST=$(oc get route observatorium-api -n open-cluster-management-observability -o jsonpath='{.spec.host}')
      echo "https://${ROUTE_HOST}/api/metrics/v1/default/api/v1/receive"
    warning

    Do not use /api/v1/receive for cross-cluster mode — that path bypasses authentication.

  3. Find the Tenant ID:

    oc get clusterversion version -o jsonpath='{.spec.clusterID}'

    This is the Hub Cluster ID used as hubThanosTenantId in the Kasten Helm values (see Step 3).
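Because a mistyped Tenant ID surfaces only later as a "no matching hashring" error (see Troubleshooting), it is worth sanity-checking the value's shape before wiring it into the Helm values. A small sketch; the ID below is an illustrative placeholder, not a real cluster ID:

```shell
# Illustrative placeholder - in practice, capture the real value with:
#   ACM_TENANT_ID=$(oc get clusterversion version -o jsonpath='{.spec.clusterID}')
ACM_TENANT_ID="1a2b3c4d-5e6f-7890-abcd-ef1234567890"

# Hub cluster IDs are UUIDs; catch copy-paste mistakes early
if echo "$ACM_TENANT_ID" | grep -Eqi '^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$'; then
  echo "tenant id looks valid"
else
  echo "unexpected tenant id format: ${ACM_TENANT_ID}" >&2
fi
```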

Add Kasten Metrics to the MCO Custom Allowlist

Hub Cluster

Run this command on the ACM Hub cluster, regardless of whether you used the Web Console or CLI path above.

MCO's metrics-collector only forwards metrics that appear in the custom allowlist. Without this step, K10 metrics will not flow through MCO even if Prometheus remote_write is functioning correctly. Apply the following ConfigMap on the Hub cluster:

oc apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  metrics_list.yaml: |
    names:
      - action_backup_ended_count
      - action_restore_ended_count
      - action_export_ended_count
      - action_import_ended_count
      - action_run_ended_count
      - catalog_actions_count
      - catalog_storage_artifact_count
      - metering_pvc_size
      - policies_count
      - policy_run_count
EOF
note

If the observability-metrics-custom-allowlist ConfigMap already exists in your environment, merge these entries into the existing metrics_list.yaml rather than replacing the ConfigMap.
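One way to script that merge is to export the current list, append only the missing names, and rebuild the ConfigMap from the merged file. A sketch; the "existing" file contents below are an illustrative stand-in for what you would actually export from the Hub:

```shell
# In practice, export the live list first (on the Hub cluster):
#   oc get cm observability-metrics-custom-allowlist \
#     -n open-cluster-management-observability \
#     -o jsonpath='{.data.metrics_list\.yaml}' > metrics_list.yaml
# Illustrative stand-in for the exported file:
cat > metrics_list.yaml <<'EOF'
names:
  - existing_metric_total
EOF

# Append each Kasten metric only if it is not already listed (idempotent)
for m in action_backup_ended_count action_restore_ended_count catalog_actions_count; do
  grep -qxF "  - $m" metrics_list.yaml || echo "  - $m" >> metrics_list.yaml
done

# Then rebuild the ConfigMap from the merged file (on the Hub cluster):
#   oc create configmap observability-metrics-custom-allowlist \
#     -n open-cluster-management-observability \
#     --from-file=metrics_list.yaml --dry-run=client -o yaml | oc apply -f -
cat metrics_list.yaml
```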

Step 2: Prepare mTLS Client Certificate

Skip this step

If Kasten is on the same cluster as the ACM Hub (same-cluster mode), skip to Step 3.

This step applies only to cross-cluster deployments where Kasten communicates with the ACM Hub over HTTPS with mutual TLS. For detailed instructions on preparing mTLS certificates, refer to the Red Hat ACM documentation:

Exporting metrics to external endpoints — Red Hat ACM 2.15

That documentation covers the tls_config format, certificate requirements, and how to configure external metric endpoints with mTLS.

Key points for Kasten integration:

  • The client certificate must be signed by the ACM client CA (observability-client-ca-certs Secret on the Hub cluster). Certificates from other CAs are rejected.
  • The certificate must include at least one Subject Alternative Name (SAN) — DNS or IP. Certificates with only a Common Name (CN) are rejected.
  • The certificate subject must include OU=acm (Organizational Unit) to map to the RBAC group with write permissions.
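Both requirements can be checked on any candidate certificate with openssl. The self-signed certificate generated below is only a stand-in to make the check reproducible; a real certificate must be signed by the ACM client CA:

```shell
# Stand-in certificate for illustration only - the real k10-client.crt
# must be signed by the ACM client CA (observability-client-ca-certs).
# Requires OpenSSL 1.1.1+ for -addext / -ext.
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -keyout demo.key -out demo.crt \
  -subj "/O=kasten/OU=acm/CN=k10-remote-write" \
  -addext "subjectAltName=DNS:k10.example.com" 2>/dev/null

# Requirement 1: subject must carry OU=acm
openssl x509 -in demo.crt -noout -subject

# Requirement 2: at least one SAN (DNS or IP) must be present
openssl x509 -in demo.crt -noout -ext subjectAltName
```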

Once you have the signed client certificate, client key, and server CA, create the Kubernetes resources in the Kasten namespace:

Target Cluster

Run these commands on the cluster where Kasten is (or will be) installed.

# Create the client certificate Secret (must use 'tls' type for tls.crt / tls.key keys)
kubectl create secret tls prometheus-client-cert \
  -n kasten-io \
  --cert=k10-client.crt \
  --key=k10-client.key

# Create the server CA ConfigMap (key must be 'ca.crt')
kubectl create configmap observability-ca-cert \
  -n kasten-io \
  --from-file=ca.crt=ca.crt
Certificate Rotation

When the client certificate expires, Prometheus remote_write will fail with TLS handshake errors. Update the Secret with the renewed certificate and restart the Prometheus pod:

kubectl create secret tls prometheus-client-cert \
  -n kasten-io \
  --cert=k10-client.crt \
  --key=k10-client.key \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl rollout restart deployment/prometheus-server -n kasten-io

Monitor prometheus_remote_storage_samples_failed_total for early detection of certificate expiry.
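To catch expiry before the handshake failures start, openssl's -checkend flag can test a certificate against a rotation window. The throwaway 90-day certificate below stands in for k10-client.crt so the example is reproducible:

```shell
# Throwaway 90-day certificate so the check is self-contained;
# point CERT at the real k10-client.crt in practice
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -keyout demo.key -out demo.crt -subj "/OU=acm/CN=demo" 2>/dev/null
CERT=demo.crt

# -checkend exits 0 if the cert is still valid N seconds from now
if openssl x509 -in "$CERT" -noout -checkend $((30*24*3600)) >/dev/null; then
  echo "OK: more than 30 days of validity left"
else
  echo "WARN: ${CERT} expires within 30 days - rotate the Secret now"
fi
```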

Step 3: Configure Kasten for Remote Write

Target Cluster

Run all commands in this step on the cluster where Kasten is (or will be) installed.

Configure Kasten's embedded Prometheus to push metrics to the ACM Hub. This is done by updating the Kasten Helm values.

Advanced Configuration

You can fine-tune the Prometheus remote write behavior by adding standard Prometheus configuration options under prometheus.server.remote_write[0]. Options such as queue_config and write_relabel_configs are supported and passed directly to the Prometheus server configuration. Find more details in Prometheus Remote Write.
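As an illustration, a values fragment tuning the send queue and adding an extra relabel filter might look like the sketch below. The numbers are illustrative starting points, not recommendations, and the URL is a placeholder:

```yaml
prometheus:
  server:
    remote_write:
      - url: "https://<observatorium-api>/api/metrics/v1/default/api/v1/receive"
        queue_config:
          capacity: 10000              # samples buffered per shard
          max_samples_per_send: 2000   # batch size per request
          batch_send_deadline: 30s     # flush a partial batch after this long
          max_backoff: 1m              # cap retry backoff on transient errors
        write_relabel_configs:
          - source_labels: [__name__]  # drop unwanted series before sending
            regex: "go_.*"
            action: drop
```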

1. Set Your Environment Variables

Copy-paste the block below into your terminal and edit the values as needed:

# ── Required ─────────────────────────────────────────────────────────────────
# Hub Cluster ID — run on the Hub cluster (see Step 1)
export ACM_TENANT_ID="$(oc get clusterversion version -o jsonpath='{.spec.clusterID}')"

# Remote Write URL — keep exactly ONE of the following lines uncommented (see Step 1):
# Same-cluster (HTTP):
export REMOTE_WRITE_URL="http://observability-observatorium-api.open-cluster-management-observability.svc:8080/api/v1/receive"
# Cross-cluster (HTTPS / mTLS):
# export REMOTE_WRITE_URL="https://observability-observatorium-api.open-cluster-management-observability.svc:8080/api/metrics/v1/default/api/v1/receive"

# ── Required on non-OpenShift (auto-detected on OpenShift) ───────────────────
export CLUSTER_NAME="us-east-prod-01"
export CLUSTER_ID="$(oc get clusterversion version -o jsonpath='{.spec.clusterID}' 2>/dev/null || echo 'set-me')"

# ── Optional ─────────────────────────────────────────────────────────────────
# Deep-link from the ACM dashboard back to this Kasten instance
export K10_DASHBOARD_URL="https://k10.apps.example.com/k10/"

2. Prepare the Helm Values

cat > k10-values.yaml <<EOF
clusterName: "${CLUSTER_NAME}"

global:
  acm:
    enabled: true
    hubThanosTenantId: "${ACM_TENANT_ID}"
    managedClusterId: "${CLUSTER_ID}"
    ## Cross-cluster only — uncomment to enable mTLS (see Step 2):
    # tls:
    #   clientCertSecretName: "prometheus-client-cert"
    #   serverCAConfigMapName: "observability-ca-cert"
    #   # insecureSkipVerify: false

prometheus:
  server:
    remote_write:
      - url: "${REMOTE_WRITE_URL}"
        dashboardUrl: "${K10_DASHBOARD_URL}"
        metricsRegex: "(backup|restore|import|export|job|policy|action|catalog|process).*"
    resources:
      requests:
        cpu: 750m
        memory: 1.5Gi

## Required on OpenShift — allows Prometheus to run with the correct security context
scc:
  create: true
EOF
tip

On OpenShift, you can omit clusterName and managedClusterId — Kasten auto-detects both. For cross-cluster mode, uncomment the tls: block, ensure the required Secret and ConfigMap exist (see Step 2), and set REMOTE_WRITE_URL to the HTTPS endpoint.

TLS Helm Values Reference

| Helm Value | Description |
| --- | --- |
| global.acm.tls.clientCertSecretName | Name of a kubernetes.io/tls Secret in the Kasten namespace containing tls.crt and tls.key. Setting this activates mTLS and disables THANOS-TENANT header injection (the tenant is identified via the URL path instead). |
| global.acm.tls.serverCAConfigMapName | Name of a ConfigMap in the Kasten namespace containing the server CA certificate (key: ca.crt). Used for TLS server verification. |
| global.acm.tls.insecureSkipVerify | When true, skips server certificate verification. Use only when connecting to an internal service URL with a self-signed or unverifiable certificate. Default: false. |

3. Apply the Configuration

Install or Upgrade Kasten to apply the changes. Run these commands on the target cluster.

# For new installations
helm install k10 kasten/k10 -n kasten-io --create-namespace -f k10-values.yaml

# For existing installations
helm upgrade k10 kasten/k10 --reuse-values -n kasten-io -f k10-values.yaml
Inline Flags

You can also pass individual values using --set flags instead of a values file. See the Helm documentation for details.

Step 4: Verification

Target Cluster

Verification steps 1–2 run on the cluster where Kasten is installed. Steps 3–4 run on the Hub cluster.

1. Verify Prometheus Remote Write Counters

Port-forward to the Kasten Prometheus and check the remote storage counters:

kubectl port-forward -n kasten-io deployment/prometheus-server 9090:9090 &

# Check samples sent (should increase over time)
curl -s "http://localhost:9090/k10/prometheus/api/v1/query?query=prometheus_remote_storage_samples_total"

# Check for failures (should be 0)
curl -s "http://localhost:9090/k10/prometheus/api/v1/query?query=prometheus_remote_storage_samples_failed_total"

A healthy integration shows samples_total increasing steadily and samples_failed_total at 0:

{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1740000000,"13469"]}]}}

If samples_failed_total returns a non-zero value, check the Troubleshooting section.
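The JSON these queries return can also be checked mechanically. A sketch that parses a response and flags a non-zero failure counter; the canned RESP below stands in for the live curl output:

```shell
# Canned response for illustration - in practice:
#   RESP=$(curl -s "http://localhost:9090/k10/prometheus/api/v1/query?query=prometheus_remote_storage_samples_failed_total")
RESP='{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1740000000,"0"]}]}}'

# Pull the counter value out of the Prometheus query response
FAILED=$(echo "$RESP" | python3 -c "import sys, json; r = json.load(sys.stdin)['data']['result']; print(r[0]['value'][1] if r else '0')")

if [ "$FAILED" = "0" ]; then
  echo "remote_write healthy: no failed samples"
else
  echo "remote_write failing: $FAILED samples dropped" >&2
fi
```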

2. Verify Kasten Prometheus Logs

Check the Prometheus logs to ensure remote_write is active and not encountering sustained errors:

kubectl logs -n kasten-io -l app=prometheus -c prometheus-server --tail=50 | grep -E "WARN|ERR|remote"
note

Occasional Failed to send batch, retrying warnings during startup (WAL replay) are normal and resolve automatically. Sustained errors indicate a configuration problem.

Hub Cluster

Steps 3–4 must be run on the ACM Hub cluster.

3. Verify Metrics in ACM Thanos

Query the Thanos query-frontend for a Kasten metric:

kubectl port-forward -n open-cluster-management-observability \
  svc/observability-thanos-query-frontend 9090:9090 &

curl -s "http://localhost:9090/api/v1/query?query=catalog_actions_count" | python3 -m json.tool

Look for results tagged with your cluster name and "application": "k10". Each metric also carries a cluster_uid label (the OpenShift cluster ID), which gives a stable identifier for queries and dashboards even if cluster names overlap:

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "catalog_actions_count",
          "application": "k10",
          "cluster_name": "us-east-prod-01",
          "cluster_uid": "1a2b3c4d-5e6f-7890-abcd-ef1234567890"
        },
        "value": [1740000000, "221"]
      }
    ]
  }
}

If no results appear, allow 2–5 minutes for metrics to propagate, then recheck.

4. Verify Metrics in ACM Grafana

  1. Open the ACM Grafana dashboard on the Hub cluster.
  2. Navigate to Explore (or Drilldown in newer versions).
  3. Select the Thanos (or observatorium) datasource.
  4. Query for a Kasten metric, for example: catalog_actions_count.
  5. Verify that the metric is visible and tagged with your cluster name.

Troubleshooting

Common Installation Errors

Prometheus pod stuck in FailedCreate on OpenShift

The Prometheus server ReplicaSet reports FailedCreate with a message like:

unable to validate against any security context constraint: [...] .spec.securityContext.fsGroup: Invalid value: []int64{65534}: 65534 is not an allowed group

  • Cause: Kasten was installed without scc.create: true. The Prometheus pod runs as UID/GID 65534, which is outside the default OpenShift SCC allowed range.

  • Fix: Upgrade Kasten with scc.create: true:

    helm upgrade k10 kasten/k10 --reuse-values -n kasten-io --set scc.create=true

"A valid .Values.global.acm.hubThanosTenantId is required"

This error occurs during helm install or helm upgrade if the Tenant ID is missing.

  • Cause: global.acm.hubThanosTenantId is not set in your values file.
  • Fix: Retrieve the Hub Cluster ID (see Step 1) and add it to your k10-values.yaml.

"global.acm.managedClusterId is required when global.acm.enabled is true"

This error occurs if Kasten cannot auto-detect the Cluster ID (e.g., on non-OpenShift platforms) and it was not provided.

  • Cause: You are installing on a platform where Cluster ID auto-detection is not supported, or RBAC permissions prevent lookup.
  • Fix: Add global.acm.managedClusterId to your k10-values.yaml (use the value from $CLUSTER_ID).

"clusterName is required when prometheus.server.remote_write is configured"

This error occurs if Kasten cannot determine a name for the cluster.

  • Cause: You are installing on a non-OpenShift platform (or auto-detection failed) and did not provide a clusterName.
  • Fix: Add clusterName: "my-cluster-name" to your k10-values.yaml.

Runtime Errors (Prometheus Logs)

"server returned HTTP status 400 Bad Request: Client sent an HTTP request to an HTTPS server"

  • Cause: The remote write URL uses http:// but the Observatorium API expects https://.
  • Fix: Change the URL scheme to https:// and configure the TLS settings (global.acm.tls).

"server returned HTTP status 401 Unauthorized"

  • Cause: The Observatorium API requires authentication. This typically occurs when using the authenticated URL path (/api/metrics/v1/{tenant}/...) without a valid client certificate.
  • Fix: Configure mTLS by setting global.acm.tls.clientCertSecretName and ensuring the client certificate meets the requirements (SAN present, OU=acm).

"server returned HTTP status 400 ... could not determine subject"

  • Cause: The client certificate does not contain a Subject Alternative Name (SAN). The Observatorium API extracts the client identity from SANs, not from the Common Name (CN).
  • Fix: Regenerate the client certificate with at least one SAN (DNS or IP). See Step 2.

"server returned HTTP status 500 ... no matching hashring to handle tenant"

This error appears in the Prometheus logs (kubectl logs ... -c prometheus-server).

  • Cause: The hubThanosTenantId provided is incorrect, or the tenant name in the URL path does not match a configured tenant on the Observatorium API.
  • Fix: Verify that the Tenant ID matches the Hub Cluster ID exactly. For cross-cluster mode, also verify the tenant name in the URL path (typically default).

"EOF" or "use of closed network connection"

  • Cause: Transient TLS connection issues, often seen during Prometheus startup (WAL replay) when it sends a burst of samples.
  • Fix: These are typically self-resolving — Prometheus retries automatically. Check prometheus_remote_storage_samples_failed_total; if it stays at 0, the retries are succeeding. If errors persist, verify network connectivity to the Observatorium API service.

malformed HTTP response "\x00\x00\x06\x04..." in Prometheus logs

  • Cause: You may be connecting to the Thanos Receive service on the wrong port. Prometheus remote_write speaks HTTP/1.1; if the target port speaks gRPC, it returns a binary frame that Prometheus cannot parse.

  • To diagnose: Inspect the ports exposed by the Thanos Receive service on the Hub cluster:

    oc get service -n open-cluster-management-observability \
      -l app.kubernetes.io/name=thanos-receive \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.ports[*]} {.name}: {.port}{"\n"}{end}{end}'

    Look for the port named remote-write (typically 19291). Port 10901 is gRPC and will produce this error; port 10902 is HTTP admin/health. Only the remote-write port accepts Prometheus remote_write traffic.

  • Fix: If you are using the Observatorium API URL (observability-observatorium-api at port 8080) as documented in Step 1, this error should not occur. If you are connecting directly to the Thanos Receive service, update your remote_write URL to use the port named remote-write (typically 19291) instead of port 10901.

Connectivity Reference

| Deployment | URL | Auth |
| --- | --- | --- |
| Kasten on Hub cluster | http://observability-observatorium-api.open-cluster-management-observability.svc:8080/api/v1/receive | THANOS-TENANT header (auto-injected by Kasten) |
| Kasten on managed cluster (internal) | https://observability-observatorium-api.open-cluster-management-observability.svc:8080/api/metrics/v1/default/api/v1/receive | mTLS client certificate |
| Kasten on managed cluster (external route) | https://&lt;route-host&gt;/api/metrics/v1/default/api/v1/receive | mTLS client certificate |

Deploying the Kasten ACM Dashboard

Once Kasten metrics are flowing to the ACM Thanos backend (see Step 4: Verification), you can deploy the Kasten multi-cluster Grafana dashboard into the ACM Observability Grafana instance. The dashboard provides a unified view of backup jobs, restore points, storage usage, and policy status across all clusters managed by RHACM.

What You Get

A single Grafana dashboard accessible from ACM Observe → Dashboards → General with:

  • Cluster selector that filters all panels by protected Kasten cluster
  • Policy and application selectors
  • Backup job history, success/failure rates, and restore point counts
  • Local snapshot storage usage and PVC utilization

How Dashboard Injection Works

MCO runs a dashboard loader sidecar (grafana-dashboard-loader) inside the Grafana pod. The loader watches for ConfigMaps in the open-cluster-management-observability namespace that carry the label grafana-custom-dashboard: "true" and automatically posts them to Grafana's internal API. This is the supported injection path — no direct Grafana API access is required.

ConfigMap in open-cluster-management-observability with label grafana-custom-dashboard: "true"
  → grafana-dashboard-loader sidecar detects the ConfigMap
  → loader POSTs the dashboard JSON to the Grafana API (internal)
  → dashboard appears in ACM Observe → Dashboards → General

The loader retries up to 40 times at 10-second intervals if the upload fails, so it is safe to apply the ConfigMap before Grafana is fully ready.

Additional Metrics Allowlist

The Kasten ACM Dashboard uses several metrics beyond the core set added to the allowlist in Step 1. Before deploying the dashboard, update the allowlist ConfigMap on the Hub cluster to include these additional entries:

Hub Cluster

Run this command on the ACM Hub cluster.

oc apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  metrics_list.yaml: |
    names:
      - action_backup_ended_count
      - action_backup_ended_overall
      - action_backup_duration_seconds_sum_overall
      - action_restore_ended_count
      - action_restore_ended_overall
      - action_restore_duration_seconds_sum_overall
      - action_export_ended_overall
      - action_export_duration_seconds_sum_overall
      - action_import_ended_overall
      - action_import_duration_seconds_sum_overall
      - action_run_ended_count
      - catalog_actions_count
      - catalog_storage_artifact_count
      - compliance_count
      - metering_pvc_size
      - policies_count
      - policy_run_count
      - profiles_count
      - kubelet_volume_stats_used_bytes
      - kubelet_volume_stats_capacity_bytes
EOF
note

kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes are Kubernetes kubelet metrics used by the PV Capacity section of the dashboard. ACM does not collect these by default. Without them, the PV Capacity panels show "No data." If the observability-metrics-custom-allowlist ConfigMap already exists, merge these entries into the existing metrics_list.yaml rather than replacing the whole ConfigMap.

ACM's metrics-collector picks up allowlist changes automatically — no restarts are needed.

Step 1 — Download the Dashboard JSON

Download the dashboard JSON from the Grafana Cloud Dashboards repository:

  1. Go to the Grafana Cloud Dashboards page for the Kasten multi-cluster dashboard.
  2. Click Download JSON to save the raw dashboard JSON file locally (for example, kasten-acm-dashboard.json).

Step 2 — Wrap the JSON in a ConfigMap

MCO's dashboard loader requires the JSON to be delivered as a ConfigMap with specific labels. Create a file named kasten-acm-dashboard-cm.yaml.

A quick way to produce a correctly structured ConfigMap from the downloaded JSON:

oc create configmap kasten-acm-dashboard \
  --from-file=kasten-acm-dashboard.json=./kasten-acm-dashboard.json \
  --dry-run=client -o yaml | \
  oc label --local -f - grafana-custom-dashboard=true general-folder=true -o yaml \
  > kasten-acm-dashboard-cm.yaml

Review the output before applying to confirm the labels and namespace are correct.

Required JSON edits before running the command above:

  • Remove the __inputs, __elements, and __requires fields if present — these are Grafana provisioning metadata and will cause import errors in MCO Grafana.
  • Set "id": null — Grafana assigns a new numeric ID on import.
  • Keep the "uid" field exactly as downloaded — the loader uses the uid as a stable identifier for updates. Removing or changing it causes a name-exists 412 conflict if the dashboard was previously loaded.
warning

Do not use Red Hat's generate-dashboard-configmap-yaml.sh tool — it strips the uid field and causes 412 conflicts on subsequent updates.
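The JSON edits above can be scripted instead of done by hand. The sketch below operates on a minimal stand-in file (the real kasten-acm-dashboard.json comes from the download step) and deliberately keeps the uid field intact:

```shell
# Minimal stand-in for the downloaded dashboard JSON (illustration only)
cat > kasten-acm-dashboard.json <<'EOF'
{"__inputs": [], "__requires": [], "id": 5, "uid": "kasten-acm", "title": "Kasten Multi-Cluster"}
EOF

python3 - <<'PY'
import json

with open("kasten-acm-dashboard.json") as f:
    dash = json.load(f)

# Provisioning metadata causes import errors in MCO Grafana
for key in ("__inputs", "__elements", "__requires"):
    dash.pop(key, None)

dash["id"] = None  # Grafana assigns a new numeric id on import
assert "uid" in dash, "keep the uid - the loader uses it for updates"

with open("kasten-acm-dashboard.json", "w") as f:
    json.dump(dash, f, indent=2)
PY

cat kasten-acm-dashboard.json
```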

The resulting ConfigMap must look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kasten-acm-dashboard
  namespace: open-cluster-management-observability
  labels:
    grafana-custom-dashboard: "true"  # Required — loader checks for this exact key/value
    general-folder: "true"            # Places dashboard in General folder
data:
  kasten-acm-dashboard.json: |
    { ... dashboard JSON ... }

Step 3 — Apply the ConfigMap to the Hub Cluster

Hub Cluster

Run this command on the ACM Hub cluster. The namespace must be open-cluster-management-observability — the loader only watches its own namespace. Applying to any other namespace has no effect.

oc apply -f kasten-acm-dashboard-cm.yaml -n open-cluster-management-observability

Step 4 — Verify the Loader Detected the ConfigMap

oc logs -n open-cluster-management-observability \
  $(oc get pod -n open-cluster-management-observability \
    -l app=multicluster-observability-grafana \
    -o jsonpath='{.items[0].metadata.name}') \
  -c grafana-dashboard-loader --since=2m

Look for a line like:

Successfully updated dashboard kasten-acm-dashboard

If you see name-exists or 412, see Dashboard Troubleshooting below.

Step 5 — Open ACM Grafana

Navigate to the ACM console → Observe → Dashboards → General → Kasten Multi-Cluster.

Dashboard Variables

The dashboard provides four drop-down variables at the top:

| Variable | Description |
| --- | --- |
| datasource | Thanos datasource (auto-selected) |
| cluster_name | Filter by managed cluster; All shows aggregate across all clusters |
| policy | Filter backup panels by Kasten policy name |
| app | Filter by application/namespace |

Updating the Dashboard

When a new version of the dashboard is published to Grafana Cloud Dashboards:

  1. Download the updated JSON from the Grafana Cloud Dashboards page.

  2. Rebuild the ConfigMap YAML using the same method in Step 2, keeping the same ConfigMap name and uid field from the JSON.

  3. Re-apply:

    oc apply -f kasten-acm-dashboard-cm.yaml -n open-cluster-management-observability

The loader detects the ConfigMap update and re-posts the dashboard to Grafana with overwrite: true. The existing dashboard is replaced in-place — no manual deletion needed.

Dashboard Metrics Reference

The following Kasten Prometheus metrics are used in the dashboard. All metrics must flow to the MCO Thanos backend via Prometheus remote_write for the dashboard to display data.

| Metric | Description |
| --- | --- |
| action_backup_ended_overall | Counter of completed backup actions, labeled by state (success/failed/cancelled) |
| action_backup_ended_count | Count of ended backup actions per policy and application |
| action_backup_duration_seconds_sum_overall | Cumulative backup action duration in seconds |
| action_restore_ended_overall | Counter of completed restore actions, labeled by state |
| action_restore_duration_seconds_sum_overall | Cumulative restore action duration in seconds |
| action_export_ended_overall | Counter of completed export actions, labeled by state |
| action_export_duration_seconds_sum_overall | Cumulative export action duration in seconds |
| action_import_ended_overall | Counter of completed import actions, labeled by state |
| action_import_duration_seconds_sum_overall | Cumulative import action duration in seconds |
| catalog_actions_count | Gauge of current actions in the Kasten catalog, labeled by type, status, and namespace |
| catalog_storage_artifact_count | Count of stored artifacts in the Kasten catalog, labeled by category and retirement status |
| compliance_count | Count of policy compliance states across managed applications |
| policies_count | Count of Kasten policies, labeled by action type |
| profiles_count | Count of Kasten location profiles, labeled by status |
| metering_pvc_size | Total PVC capacity allocated to the Kasten catalog storage |

Cluster Name Configuration

The cluster_name label on Kasten metrics is set by Kasten's bundled Prometheus as an external label on the remote_write path — it is not present on metrics when Kasten's Prometheus is queried directly. Kasten auto-detects this value from the OpenShift infrastructure name (the node prefix, for example aro-mycluster-4kmhs), which may not match the ACM managed cluster name shown in the dashboard variable list.

To ensure cluster_name in the dashboard matches the ACM managed cluster name, two changes are required per cluster:

1. Set Prometheus external labels explicitly in Kasten Helm values:

helm upgrade k10 kasten/k10 --reuse-values -n kasten-io \
  --set prometheus.server.global.external_labels.cluster_name=<acm-managed-cluster-name>

Or add the following to your k10-values.yaml and run helm upgrade:

prometheus:
  server:
    global:
      external_labels:
        cluster_name: <acm-managed-cluster-name>
note

Setting global.clusterName alone is insufficient — it controls Kasten's application-level cluster identity but does not update the Prometheus external label used in remote_write.

2. Update the ServiceMonitor (if using User Workload Monitoring federation):

If you applied kasten-k10-acm-servicemonitor.yaml, it contains a metricRelabelings entry that overrides cluster_name on the UWM federation path. Before applying the ServiceMonitor to each cluster, replace the placeholder with the correct ACM cluster name:

replacement: REPLACE_WITH_ACM_CLUSTER_NAME

Change it to match the ACM managed cluster name for that cluster, then apply:

oc apply -f kasten-k10-acm-servicemonitor.yaml -n kasten-io

Dashboard Troubleshooting

Dashboard not appearing after 5 minutes:

oc logs -n open-cluster-management-observability \
  $(oc get pod -n open-cluster-management-observability \
    -l app=multicluster-observability-grafana \
    -o jsonpath='{.items[0].metadata.name}') \
  -c grafana-dashboard-loader --since=10m
| Log message | Cause | Fix |
| --- | --- | --- |
| No log lines mentioning kasten | Loader did not detect the ConfigMap | Check the ConfigMap label: oc get cm kasten-acm-dashboard -n open-cluster-management-observability -o jsonpath='{.metadata.labels}' — must have grafana-custom-dashboard: "true" |
| name-exists / 412 | Dashboard with same title but different uid exists in Grafana | Delete the conflicting dashboard from the Grafana UI, then delete and re-apply the ConfigMap |
| version-mismatch / 412 | Dashboard uid exists with a version mismatch | Loader retries automatically with overwrite: true — wait for a retry cycle (~40 retries × 10 s) |
| context deadline exceeded | Grafana pod not ready | Wait for Grafana to fully start; the loader will retry |

Check if a conflicting dashboard exists:

Port-forward to Grafana and query the API:

oc port-forward -n open-cluster-management-observability \
  $(oc get pod -n open-cluster-management-observability \
    -l app=multicluster-observability-grafana \
    -o jsonpath='{.items[0].metadata.name}') 3001:3001 &

curl -s "http://localhost:3001/api/search?query=Kasten" \
  -H "X-Forwarded-User: WHAT_YOU_ARE_DOING_IS_VOIDING_SUPPORT_0000000000000000000000000000000000000000000000000000000000000000"

If multiple results appear, delete the one whose uid does not match the downloaded JSON's uid field from the Grafana UI.

PV Utilization panel shows "No data" but kubelet metrics exist in Thanos:

ACM's metrics-collector labels kubelet metrics with cluster, not cluster_name. The dashboard queries kubelet panels using cluster=~"$cluster_name" for this reason. If you have customized the dashboard JSON, ensure kubelet metric queries filter on the cluster label, not cluster_name.

All panels show "No data":

Kasten metrics are not yet flowing to MCO Thanos. Check remote_write on each Kasten cluster:

oc logs -n kasten-io \
  $(oc get pod -n kasten-io -l app=prometheus,release=k10 \
    -o jsonpath='{.items[0].metadata.name}') \
  -c prometheus-server | grep "remote_write\|send"

If using the Thanos Receive direct endpoint, verify the port. The remote-write port is typically 19291 — port 10901 is gRPC and returns a protocol error. Refer to Troubleshooting for the full port reference.