K10 Disaster Recovery

K10 Disaster Recovery (DR) aims to protect K10 from underlying infrastructure failures. In particular, this feature provides the ability to recover the K10 platform from a variety of disasters, such as the accidental deletion of K10, failure of the underlying storage that K10 uses for its catalog, or even the accidental destruction of the Kubernetes cluster on which K10 is deployed.

Overview

K10 enables Disaster Recovery with the help of an internal policy that backs up its own data stores and saves them to an object storage bucket or an NFS file storage location configured using a Location Profile.

External Storage Configuration

To enable K10 Disaster Recovery, a Location Profile must be configured. The profile uses an object storage bucket or an NFS file storage location to store data from K10's internal data stores, and the cluster must have write permissions to this location.
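
For example, if the Location Profile will point at an S3-compatible object store, the credentials it references are typically stored in a Kubernetes Secret. The following is a minimal sketch assuming AWS-style access keys; the bucket name and region are supplied when creating the profile itself, and the secret name k10-s3-secret is a placeholder:

# Sketch: credentials Secret referenced by an S3 Location Profile
# (assumes AWS-style keys; adjust for your storage provider)
$ kubectl create secret generic k10-s3-secret \
   --namespace kasten-io \
   --from-literal aws_access_key_id=<access-key-id> \
   --from-literal aws_secret_access_key=<secret-access-key>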

Note

A VBR location profile cannot be used as a destination for DR backups.

Enabling K10 Disaster Recovery

The K10 Disaster Recovery settings are accessible via the Disaster Recovery page under the Settings menu in the navigation sidebar. For new installations, these settings are also accessible using the link located within the alerts panel.

../_images/dr_dashboard.png

Select the Disaster Recovery page under the Settings menu in the navigation sidebar, and then click the Enable K10 DR button to begin the process.

../_images/dr_enable.png

Enabling K10 Disaster Recovery requires selecting a Location Profile for the exported K10 Disaster Recovery backups and providing a passphrase for encrypting the snapshot data.

The passphrase can be provided as a raw string or as a reference to a secret in HashiCorp Vault or AWS Secrets Manager.

Enable Disaster Recovery by clicking the Enable K10 DR button.

../_images/dr_enable_passphrase.png

Note

If providing a raw passphrase, save it securely outside the cluster.

../_images/dr_enable_vault.png

Note

Using HashiCorp Vault requires that K10 is configured to access Vault.

../_images/dr_enable_aws.png

Note

Using AWS Secrets Manager requires that an AWS Infrastructure Profile exists with adequate permissions.

Cluster ID

A confirmation message with the cluster ID will be displayed when Disaster Recovery is enabled. This ID is used as a prefix within the object storage or NFS file storage location where K10's exported data store backups are saved.

Note

Save the cluster ID securely; it is required to recover K10 from a disaster.

../_images/dr_disable.png

The cluster ID value can also be obtained using the following kubectl command:

# Extract UUID of the `default` namespace
$ kubectl get namespace default -o jsonpath="{.metadata.uid}{'\n'}"
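
If the ID will be needed later, for example when installing the kasten/k10restore helm chart, it can be captured in a shell variable (a convenience sketch):

# Store the cluster ID in a shell variable for later use
$ CLUSTER_ID=$(kubectl get namespace default -o jsonpath="{.metadata.uid}")
$ echo "${CLUSTER_ID}"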

K10 Disaster Recovery Policy

A policy called k10-disaster-recovery-policy which implements K10 Disaster Recovery will automatically be created when Disaster Recovery is enabled. This policy can be viewed from the Policies page in the navigation sidebar.

Click Run Once on the k10-disaster-recovery-policy to start a backup. The data exported by K10 for Disaster Recovery purposes will be encrypted via AES-256-GCM.

../_images/dr_policy.png
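
The policy can also be run from the command line by creating a RunAction resource referencing it. The following is a sketch based on K10's actions API; adjust the namespace if K10 is installed somewhere other than kasten-io:

# Sketch: trigger the DR policy by creating a RunAction resource
$ cat <<EOF | kubectl create -f -
apiVersion: actions.kio.kasten.io/v1alpha1
kind: RunAction
metadata:
  generateName: run-k10-disaster-recovery-policy-
spec:
  subject:
    kind: Policy
    name: k10-disaster-recovery-policy
    namespace: kasten-io
EOF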

Warning

After enabling K10 Disaster Recovery, it is essential that you copy and save the following to successfully recover K10 from a disaster:

  1. The cluster ID displayed on the disaster recovery page

  2. The Disaster Recovery passphrase provided above

  3. The credentials and object storage bucket or the NFS file storage information (used in the Location Profile configuration above)

Without this information, K10 Disaster Recovery will not be possible.

Disabling K10 Disaster Recovery

K10 Disaster Recovery can be disabled by clicking the Disable K10 DR button on the K10 Disaster Recovery page, which is found under the Settings menu in the navigation sidebar.

../_images/dr_disable.png

Recovering K10 From a Disaster

Recovering from a K10 backup involves the following sequence of actions:

  1. Create a Kubernetes Secret, k10-dr-secret, using the passphrase provided while enabling Disaster Recovery

  2. Install a fresh K10 instance in the same namespace as the above Secret

  3. Provide bucket information and credentials for the object storage location or NFS file storage location where previous K10 backups are stored

  4. Restore the K10 backup

  5. Uninstall the k10restore instance after recovery (recommended)

Note

If the K10 backup is stored in an NFS file storage location, the same NFS share must be reachable from the recovery cluster and mounted on all nodes where K10 will be installed.

Specifying a Disaster Recovery Passphrase

K10 Disaster Recovery encrypts all artifacts using the AES-256-GCM algorithm. The passphrase entered while enabling Disaster Recovery is used for this encryption. On the cluster used for K10 recovery, the Secret k10-dr-secret must therefore be created with that same passphrase in the K10 namespace (default kasten-io).

The passphrase can be provided as a raw string or as a reference to a secret in HashiCorp Vault or AWS Secrets Manager.

Specifying the passphrase as a raw string:

$ kubectl create secret generic k10-dr-secret \
   --namespace kasten-io \
   --from-literal key=<passphrase>
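
To confirm the Secret holds the expected value, the stored passphrase can be decoded back (a verification sketch; avoid doing this on a shared terminal):

# Decode the stored passphrase for verification
$ kubectl get secret k10-dr-secret --namespace kasten-io \
   -o jsonpath="{.data.key}" | base64 --decode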

Specifying the passphrase as a HashiCorp Vault secret:

$ kubectl create secret generic k10-dr-secret \
   --namespace kasten-io \
   --from-literal source=vault \
   --from-literal vault-kv-version=<version-of-key-value-secrets-engine> \
   --from-literal vault-mount-path=<path-where-key-value-engine-is-mounted> \
   --from-literal vault-secret-path=<path-from-mount-to-passphrase-key> \
   --from-literal key=<name-of-passphrase-key>

# Example
$ kubectl create secret generic k10-dr-secret \
   --namespace kasten-io \
   --from-literal source=vault \
   --from-literal vault-kv-version=KVv1 \
   --from-literal vault-mount-path=secret \
   --from-literal vault-secret-path=k10 \
   --from-literal key=passphrase

The supported values for vault-kv-version are KVv1 and KVv2.

Note

Using a passphrase from HashiCorp Vault also requires enabling HashiCorp Vault authentication when installing the kasten/k10restore helm chart. See Enable HashiCorp Vault using Token Auth or Enable HashiCorp Vault using Kubernetes Auth.

Specifying the passphrase as an AWS Secrets Manager secret:

$ kubectl create secret generic k10-dr-secret \
   --namespace kasten-io \
   --from-literal source=aws \
   --from-literal aws-region=<aws-region-for-secret> \
   --from-literal key=<aws-secret-name>

# Example
$ kubectl create secret generic k10-dr-secret \
   --namespace kasten-io \
   --from-literal source=aws \
   --from-literal aws-region=us-east-1 \
   --from-literal key=k10/dr/passphrase

Reinstall K10

Note

If you are reinstalling K10 on the same cluster, it is important to clean up the namespace in which K10 was previously installed before creating the passphrase Secret described above.

# Delete the kasten-io namespace.
$ kubectl delete namespace kasten-io
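
Namespace deletion is asynchronous; one way to block until it has fully completed before reinstalling is the following sketch:

# Wait until the namespace is fully removed
$ kubectl wait --for=delete namespace/kasten-io --timeout=300s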

K10 must be reinstalled before recovery. Please follow the instructions here.
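
As a minimal sketch, a standard helm-based reinstall into the kasten-io namespace looks like the following; refer to the installation documentation for platform-specific options:

# Sketch: minimal helm-based reinstall of K10
$ helm repo add kasten https://charts.kasten.io/
$ helm repo update
$ kubectl create namespace kasten-io
$ helm install k10 kasten/k10 --namespace=kasten-io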

Provide External Storage Configuration

Create a Location Profile with the object storage location or NFS file storage location where K10 backups are stored.

Restoring K10 Backup with Iron Bank K10 Images

The general instructions found in Restore K10 Backup can be used for restoring K10 using Iron Bank hardened images with a few changes.

Specific helm values are used to ensure that K10 restore helm chart only uses Iron Bank images. The values file must be downloaded by running:

$ curl -sO https://docs.kasten.io/ironbank/k10restore-ironbank-values.yaml

Note

This file is protected and should not be modified. All other values must be specified using the corresponding helm flags, such as --set or --values.

Credentials for Registry1 must be provided to pull the images successfully. These should already have been created as part of re-deploying a new K10 instance; therefore, only the name of the secret needs to be provided here.

The following set of flags should be added to the instructions found in Restore K10 Backup to use Iron Bank images for K10 disaster recovery:

...
--values=<PATH TO DOWNLOADED k10restore-ironbank-values.yaml> \
--set-json 'imagePullSecrets=[{"name": "k10-ecr"}]' \
...
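
Putting this together, a complete invocation might look like the following sketch; the secret name k10-ecr mirrors the example above, and sourceClusterID and profile.name are described under Restore K10 Backup:

$ helm install k10-restore kasten/k10restore --namespace=kasten-io \
    --values=<PATH TO DOWNLOADED k10restore-ironbank-values.yaml> \
    --set-json 'imagePullSecrets=[{"name": "k10-ecr"}]' \
    --set sourceClusterID=<source-clusterID> \
    --set profile.name=<location-profile-name>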

Restore K10 Backup

Requirements:

  • Source cluster ID

  • Name of Location Profile from the previous step

# Install the helm chart that creates the K10 restore job and wait for completion of the `k10-restore` job
# Assumes that K10 is installed in 'kasten-io' namespace.
$ helm install k10-restore kasten/k10restore --namespace=kasten-io \
    --set sourceClusterID=<source-clusterID> \
    --set profile.name=<location-profile-name>
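
One way to wait for the restore job and inspect its output from the command line is the following sketch, assuming the job is named k10-restore as noted above:

# Wait for the restore job to complete, then view its logs
$ kubectl wait --namespace=kasten-io --for=condition=complete \
    job/k10-restore --timeout=60m
$ kubectl logs --namespace=kasten-io job/k10-restore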

The restore job always restores the restore point catalog and artifact information. If the restore of other resources (options include profiles, policies, and secrets) needs to be skipped, the skipResource flag can be used.

# e.g. to skip restore of profiles and policies, helm install command will be as follows:
$ helm install k10-restore kasten/k10restore --namespace=kasten-io \
    --set sourceClusterID=<source-clusterID> \
    --set profile.name=<location-profile-name> \
    --set skipResource="profiles\,policies"

The timeout of the entire restore process can be configured via the helm value restore.timeout. This field is an integer and the value is in minutes.

# e.g. to specify the restore timeout, helm install command will be as follows:
$ helm install k10-restore kasten/k10restore --namespace=kasten-io \
    --set sourceClusterID=<source-clusterID> \
    --set profile.name=<location-profile-name> \
    --set restore.timeout=<timeout-in-minutes>

If the Disaster Recovery Location Profile was configured for Immutable Backups, K10 can be restored to an earlier point in time. The protection period chosen when creating the profile dictates how far in the past the point-in-time can be. Set the pointInTime helm value to the desired timestamp.

# e.g. to restore K10 to 15:04:05 UTC on Jan 2, 2022:
$ helm install k10-restore kasten/k10restore --namespace=kasten-io \
    --set sourceClusterID=<source-clusterID> \
    --set profile.name=<location-profile-name> \
    --set pointInTime="2022-01-02T15:04:05Z"

See Immutable Backups Workflow for additional information.

Enable HashiCorp Vault using Token Auth

Create a Kubernetes secret with the Vault token.

$ kubectl create secret generic vault-creds \
    --namespace kasten-io \
    --from-literal vault_token=<vault-token>

Warning

This may cause the token to be stored in shell history.
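
One way to avoid this is to read the token from a local file instead of passing it inline (a sketch; delete the file once the Secret has been created):

# Read the Vault token from a file rather than the command line
$ kubectl create secret generic vault-creds \
    --namespace kasten-io \
    --from-file vault_token=./vault-token.txt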

Use these additional parameters when installing the kasten/k10restore helm chart.

--set vault.enabled=true \
--set vault.address=<vault-server-address> \
--set vault.secretName=<name-of-secret-with-vault-creds>

Enable HashiCorp Vault using Kubernetes Auth

Refer to Configuring Vault Server For Kubernetes Auth prior to installing the kasten/k10restore helm chart.

Use these additional parameters when installing the kasten/k10restore helm chart.

--set vault.enabled=true \
--set vault.address=<vault-server-address> \
--set vault.role=<vault-kubernetes-authentication-role-name> \
--set vault.serviceAccountTokenPath=<service-account-token-path> # optional

vault.role is the name of the Vault Kubernetes authentication role binding the K10 service account and namespace to the Vault policy.

vault.serviceAccountTokenPath is optional and defaults to /var/run/secrets/kubernetes.io/serviceaccount/token.

Restore K10 Backup in an Air-Gapped Environment

For air-gapped installations, it is assumed that the k10offline tool was used to push the images to a private container registry. The following command can be used to instruct k10restore to run in air-gapped mode.

# Install the helm chart that creates the K10 restore job and wait for completion of the `k10-restore` job
# Assumes that K10 is installed in 'kasten-io' namespace.
$ helm install k10-restore kasten/k10restore --namespace=kasten-io \
    --set airgapped.repository=repo.example.com \
    --set sourceClusterID=<source-clusterID> \
    --set profile.name=<location-profile-name>

Restoring K10 Backup with Google Workload Identity Federation

K10 can be restored from a Google Cloud Storage bucket using Google Workload Identity Federation. Please follow the instructions provided here to restore K10 with this option.

Using the Restored K10 in Place of the Original

The newly restored K10 includes a safety mechanism to prevent it from performing critical background maintenance operations on backup data in storage. These operations are exclusive, meaning only one K10 instance should perform them at a time. The DR-restored K10 initially assumes that it does not have permission to perform these maintenance tasks, in case the original source K10 is still running, for example when testing the DR restore procedure in a secondary test cluster while the primary production K10 is still active.

If there are no other K10 instances accessing the same sets of backup data (i.e., when the original K10 has been uninstalled and only the new DR-restored K10 remains), you can signal that the new K10 is now eligible to take over the maintenance duties by deleting the following resource:

# Delete the k10-dr-remove-to-get-ownership configmap in the K10 namespace.
$ kubectl delete configmap --namespace=kasten-io k10-dr-remove-to-get-ownership
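
To check whether the safety mechanism is still in place before making the cutover, the configmap's presence can be verified first (a sketch):

# Verify the ownership configmap still exists
$ kubectl get configmap --namespace=kasten-io k10-dr-remove-to-get-ownership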

Important

It is critical that you delete this resource only when you are prepared to make the permanent cutover to the new DR-restored K10 instance. Running multiple K10 instances simultaneously, each assuming ownership, can corrupt backup data.

Cluster-Scoped Resource Recovery

Prior to recovering applications, it may be desirable to restore cluster-scoped resources. Cluster-scoped resources may be needed for cluster configuration or as part of application recovery.

Upon completion of the Disaster Recovery Restore job, go to the Applications card, hover over the Cluster-Scoped Resources card, click the restore icon, and select a cluster restore point to recover from.

../_images/clusterscoped.png

Application Recovery

Upon completion of the Disaster Recovery Restore job, go to the Applications card and select Removed from the Filter by status drop-down menu. Click restore under the application and select a restore point to recover from.

../_images/removed_applications.png

Uninstall k10restore

The K10restore instance can be uninstalled with the helm uninstall command.

# e.g. to uninstall K10restore from the kasten-io namespace
$ helm uninstall k10-restore --namespace=kasten-io

Recovering with the Operator

Recovering from a K10 backup involves the following sequence of actions:

  1. Install a fresh K10 instance.

  2. Configure a Location Profile from where the K10 backup will be restored.

  3. Create a Kubernetes Secret named k10-dr-secret in the same namespace as the K10 install, with the passphrase given when disaster recovery was enabled on the previous K10 instance. The commands are detailed here.

  4. Create a K10restore instance. The required values are

    • Cluster ID - value given when disaster recovery was enabled on the previous K10 instance.

    • Profile name - name of the Location Profile configured in Step 2.

    and the optional values are

    • Point in time - time (RFC3339) at which to evaluate restore data. Example "2022-01-02T15:04:05Z".

    • Resources to skip - can be used to skip restore of specific resources. Example "profiles,policies".

    After recovery, deleting the k10restore instance is recommended.

Operator K10restore form view with Enable HashiCorp Vault set to False

../_images/dr_operator_passphrase.png

Operator K10restore form view with Enable HashiCorp Vault set to True

../_images/dr_operator_vault.png