Monitoring
K10 enables centralized monitoring of all its activity by integrating with Prometheus. In particular, it exposes a Prometheus endpoint from which a central system can extract data.
K10 can be installed with Grafana in the same namespace. This instance of Grafana is set up to automatically query metrics from K10's Prometheus instance. It also comes with a pre-created dashboard that helps visualize the status of K10's operations such as backup, restore, export, and import of applications.
This section documents how to install and enable Grafana and Prometheus, how to use the metrics currently exposed, how to generate alerts and reports based on these metrics, and how to integrate with external tools.
Using K10's Prometheus Endpoint
Exporting metrics from K10 is enabled by default. However, if you explicitly disabled it at install time, it can be re-enabled with a helm upgrade command (see more about upgrade at Upgrading K10) that modifies the existing installation. The upgrade option to add is:
--set prometheus.server.enabled=true
Note
By default, Prometheus is configured with a persistent storage size of 8Gi and a retention period of 30d. These can be changed with --set prometheus.server.persistentVolume.size=<size> and --set prometheus.server.retention=<days>. The complete list of configurable parameters can be found at Advanced Install Options.
Once Prometheus is enabled, metrics can be consumed from your Prometheus system by enabling the following scrape config:
- job_name: k10
  scrape_interval: 15s
  honor_labels: true
  scheme: http
  metrics_path: '/<k10-release-name>/prometheus/federate'
  params:
    'match[]':
      - '{__name__=~"jobs.*"}'
  static_configs:
    - targets:
        - 'prometheus-server.kasten-io.svc.cluster.local'
      labels:
        app: "k10"
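Before wiring the scrape config into a central Prometheus, it can help to know the exact URL the job will request so that it can be checked by hand from inside the cluster. The following is a minimal sketch that assembles that URL from the fields of the scrape config above. The helper name federate_url and the release name "k10" are assumptions for illustration only; substitute your own release name.

```python
from typing import List
from urllib.parse import urlencode


def federate_url(host: str, release_name: str, selectors: List[str],
                 scheme: str = "http") -> str:
    """Build the federation URL that the scrape job would request.

    host         -- the scrape target from static_configs
    release_name -- the K10 Helm release name used in metrics_path
    selectors    -- the series selectors listed under params 'match[]'
    """
    # Each selector is sent as its own match[] query parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{scheme}://{host}/{release_name}/prometheus/federate?{query}"


url = federate_url(
    "prometheus-server.kasten-io.svc.cluster.local",
    "k10",  # assumed release name, not taken from the documentation
    ['{__name__=~"jobs.*"}'],
)
print(url)
```

Requesting the printed URL from a pod in the cluster should return the matching series in the Prometheus text exposition format if the endpoint is reachable.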
K10 Metrics
While K10 exports a number of metrics, the jobs_duration metric is the easiest one for monitoring job status because it is already aggregated. This metric captures the running time of jobs that have completed, whether they succeeded or failed.
The jobs_duration metric is a Prometheus histogram, and thus it comprises three sub-metrics: jobs_duration_count, jobs_duration_sum, and jobs_duration_bucket. The metric has a single label, status, which can take the values succeeded or failed.
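Because jobs_duration is a histogram, its three sub-metrics can be combined with standard PromQL idioms. The sketch below generates a few such expressions for a chosen time window. The function name, the dictionary keys, and the choice of the 95th percentile are illustrative assumptions, and the unit of the resulting durations is whatever unit K10 records the metric in.

```python
from typing import Dict


def jobs_duration_queries(window: str = "1h") -> Dict[str, str]:
    """Return PromQL expressions derived from the jobs_duration histogram."""
    return {
        # Mean running time of jobs completed in the window, per status:
        # the growth of the summed durations divided by the growth of the count.
        "avg_duration": (
            f"sum by (status) (rate(jobs_duration_sum[{window}])) / "
            f"sum by (status) (rate(jobs_duration_count[{window}]))"
        ),
        # Estimated 95th percentile of running time, computed from the buckets.
        "p95_duration": (
            "histogram_quantile(0.95, "
            f"sum by (le) (rate(jobs_duration_bucket[{window}])))"
        ),
        # Number of jobs that completed with status "failed" in the window.
        "failed_jobs": (
            f'sum(increase(jobs_duration_count{{status="failed"}}[{window}]))'
        ),
    }


for name, expr in jobs_duration_queries("6h").items():
    print(f"{name}: {expr}")
```

Each printed expression can be pasted into the Prometheus expression browser or used as a Grafana panel query.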
K10 Action Metrics
As K10 performs various actions throughout the system, metrics are collected about those actions. Counts for both cluster and application-specific actions are collected.
These action metrics include labels describing the context of the action. For actions specific to an application, the application name is included as app. For actions initiated by a policy, the policy name is included as policy. For ended actions, the final status is included as state (i.e. succeeded, failed, or cancelled).
Tip
When using K10 Multi-Cluster Manager, a cluster label must also be included. The cluster label should match the name of a secondary cluster or be blank ("") to query metrics for the primary cluster.
Separate metrics are collected for the number of times the action was started, ended, or skipped. This is indicated by the suffix of the metric (i.e. _started_count, _ended_count, or _skipped_count).
Metrics are collected for the following actions:
backup and backup_cluster
restore and restore_cluster
export
import
report
run
For example, to query the number of successful backups for each application in the past 24 hours:
sum by (app) (round(increase(action_backup_ended_count{state="succeeded"}[24h])))
Or to query the number of failed restores across all applications for the past hour:
sum(round(increase(action_restore_ended_count{state="failed"}[1h])))
Important
Because action metrics are reported as counters, the increase or rate functions must be used when querying them. See Prometheus query functions for more information.
Additional documentation on querying data from Prometheus can be found in the Prometheus docs.
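The action metric names and labels described in this section follow a regular pattern, so queries like the two examples above can be generated rather than typed by hand. The following is a minimal sketch of such a generator. The function action_query and its parameters are hypothetical helpers, not part of K10; only the metric naming scheme and the labels come from this section.

```python
from typing import Optional


def action_query(action: str, suffix: str = "ended", window: str = "24h",
                 by: Optional[str] = None, **labels: str) -> str:
    """Build a PromQL query for a K10 action counter.

    action -- e.g. "backup", "restore", "export", "import", "report", "run"
    suffix -- "started", "ended", or "skipped"
    by     -- optional label to group the result by, e.g. "app"
    labels -- label matchers such as state="succeeded" or cluster=""
    """
    metric = f"action_{action}_{suffix}_count"
    if labels:
        # Sort the matchers so the generated query is deterministic.
        matchers = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        metric += "{" + matchers + "}"
    # Counters must be wrapped in increase() (or rate()) before aggregating.
    inner = f"round(increase({metric}[{window}]))"
    aggregate = f"sum by ({by}) " if by else "sum"
    return f"{aggregate}({inner})"


# Successful backups per application over the past 24 hours.
print(action_query("backup", by="app", state="succeeded"))
# Failed restores across all applications over the past hour.
print(action_query("restore", window="1h", state="failed"))
# Same backup query restricted to the primary cluster (Multi-Cluster Manager).
print(action_query("backup", by="app", state="succeeded", cluster=""))
```

The first two calls reproduce the example queries shown earlier in this section, and the third adds the blank cluster label used to select the primary cluster.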
Cluster License Status
K10 exports the metering_license_compliance_status metric related to cluster license compliance. This metric contains information on when the cluster was out of license compliance.
The metering_license_compliance_status metric is a Prometheus gauge, and has a value of 1 if the cluster's license status is compliant and 0 otherwise. To see the timeline of when K10 was out of license compliance, the metering_license_compliance_status metric can be queried and graphed.
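Besides graphing, the compliance timeline can be reduced to a list of out-of-compliance periods. The sketch below does this for a series of (timestamp, value) samples in the shape returned by a Prometheus range query, where each value arrives as a string. The function name and the sample data are hypothetical and serve only to illustrate the 1/0 semantics of the gauge.

```python
from typing import List, Sequence, Tuple


def noncompliant_intervals(
    samples: Sequence[Tuple[float, str]]
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamp pairs during which the gauge was 0.

    samples must be sorted by timestamp. An interval that is still open at
    the last sample is closed at that sample's timestamp.
    """
    intervals = []
    start = None
    last_ts = None
    for ts, value in samples:
        compliant = float(value) != 0
        if not compliant and start is None:
            start = ts  # compliance lost at this sample
        elif compliant and start is not None:
            intervals.append((start, ts))  # compliance regained
            start = None
        last_ts = ts
    if start is not None:
        intervals.append((start, last_ts))
    return intervals


# Hypothetical samples taken 60 seconds apart.
samples = [(0, "1"), (60, "0"), (120, "0"), (180, "1"), (240, "0")]
print(noncompliant_intervals(samples))
```

With the sample data shown, the cluster is reported as out of compliance from timestamp 60 until 180, and again from 240 through the end of the series.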
Using K10's Grafana Endpoint
Installation
To enable or disable Grafana and Prometheus, set the following helm value when installing or upgrading K10. The value is enabled by default.
--set grafana.enabled=true
Accessing Grafana from K10's dashboard
Click on the "Usage and Reports" card on K10's dashboard.
Click on "More Charts and Alerts" to access the instance of Grafana installed with K10.
Charts and Graphs
The Grafana dashboard can be used to monitor how many application-scoped or cluster-scoped actions (backup, restore, export, and import) have completed, failed, or been skipped.
It shows the number of policy runs that have completed or been skipped.
The amount of disk space consumed and the percentage of free space available in K10's stateful services (catalog, jobs, and logging) are also shown.
Grafana Alerts
Grafana can be used to create alerts based on an alert rule. Each rule uses a query that fetches data from a data source. Each query involves a metric such as the K10 metrics described in a previous section.
Before adding an alert, make sure you have performed more than one backup action.
Select Notification channels from the Alerting menu.
Click on Add Channel.
Add a channel name and select Slack as the channel type.
In Optional Slack settings, add the Webhook URL.
In Notification settings, select the Default setting.
Click on the Test button. If the webhook is configured properly, a test notification will be sent successfully.
Click on the Save button.
To add alerts in Grafana, click on + → Dashboard.
On the new Dashboard screen, click on Add an empty panel.
In the Edit Panel, select Prometheus as the Datasource.
In the Metrics browser, select the action_backup_started_count metric.
Click on the Use query button.
In the top-right corner of the Edit Panel, change the type of visualization to Graph (old).
Change the query name from A to action backup started.
In the top-right corner, click on the Save button.
Go to the Alert tab and click on Create Alert. Configure a name for the alert rule. In the Conditions section, select the query named action backup started created in the previous step. Set IS ABOVE to a value that is more than the current number of backup actions that were started on this cluster (that is, more than the number of backup actions you have executed in your cluster).
In the Notifications section, the Send To option will default to the notification channel created in a previous step.
Click on the Save button in the top-right corner, followed by the Apply button.
Once the number of backup actions started crosses the threshold set in the alert rule, notifications should appear in the Slack channel.
Generating Reports
K10 Reporting provides regular insights into key performance and operational states of the system. It uses Prometheus to obtain information about action runs and storage consumption. For more information about K10 Reporting, see Reporting.