etcd Backup (OpenShift Container Platform)

If the Kubernetes cluster was set up through OpenShift Container Platform (OCP), the etcd pods run in the openshift-etcd namespace.

Before backing up the etcd cluster, create a Secret, in either a new temporary namespace or an existing one, that holds the etcd cluster endpoint, the labels of the etcd pods, and the namespace in which the etcd pods run. On OCP, the etcd pods most likely carry the labels app=etcd,etcd=true and run in the openshift-etcd namespace. The following commands create a temporary namespace and a Secret for accessing the etcd member:

$ oc create ns etcd-backup

$ oc create secret generic etcd-details \
     --from-literal=endpoints=https://10.0.133.5:2379 \
     --from-literal=labels=app=etcd,etcd=true \
     --from-literal=etcdns=openshift-etcd \
     --namespace etcd-backup

Note

Make sure that the provided endpoints, labels, and etcdns values are correct. K10 uses these labels to identify a member of the etcd cluster and then backs up the running etcd cluster.
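
Because a wrong value in the Secret only surfaces when the backup runs, the three values can be sanity-checked first. The sketch below is a minimal example and not part of K10: the function name validate_etcd_details and the exact patterns are assumptions for illustration (an https endpoint with a host and a numeric port, comma-separated key=value labels, and a non-empty namespace).

```shell
# Hypothetical helper: checks the shape of the values stored in the
# etcd-details Secret. The patterns are illustrative assumptions,
# not rules enforced by K10.
validate_etcd_details() {
    endpoints=$1   # e.g. https://10.0.133.5:2379
    labels=$2      # e.g. app=etcd,etcd=true
    etcdns=$3      # e.g. openshift-etcd

    # Endpoint is expected to look like https://<address>:<port>.
    if ! printf '%s\n' "$endpoints" | grep -Eq '^https://[^/:]+:[0-9]+$'; then
        echo "endpoints: expected https://<address>:<port>, got '$endpoints'" >&2
        return 1
    fi

    # Labels are expected as comma-separated key=value pairs.
    if ! printf '%s\n' "$labels" | grep -Eq '^[^=,]+=[^=,]+(,[^=,]+=[^=,]+)*$'; then
        echo "labels: expected key=value[,key=value...], got '$labels'" >&2
        return 1
    fi

    # Namespace must not be empty.
    if [ -z "$etcdns" ]; then
        echo "etcdns: namespace must not be empty" >&2
        return 1
    fi
    return 0
}

# Example call with the values used in this section:
#   validate_etcd_details "https://10.0.133.5:2379" "app=etcd,etcd=true" "openshift-etcd"
```

The check only validates the format of the values. Whether the endpoint is reachable and the labels match running pods still has to be confirmed against the cluster.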

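To determine the IP address to use in the endpoints value, run the command below. It prints the internal IP of the first master node; the endpoint is that address with the https:// scheme and port 2379, as in the example above: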

$ oc get node -l node-role.kubernetes.io/master="" -ojsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}'
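
The command above prints only an IP address, while the Secret expects a full URL. The following sketch shows one way to assemble it. The function name etcd_endpoint_from_ip is hypothetical, and the default port 2379 is taken from the example endpoint in this section.

```shell
# Hypothetical helper: turns a node IP into the URL form used for the
# "endpoints" key of the etcd-details Secret.
etcd_endpoint_from_ip() {
    ip=$1
    port=${2:-2379}   # default taken from the example endpoint above
    if [ -z "$ip" ]; then
        echo "no IP address given" >&2
        return 1
    fi
    printf 'https://%s:%s\n' "$ip" "$port"
}

# Possible usage together with the node lookup shown above:
#   NODE_IP=$(oc get node -l node-role.kubernetes.io/master="" \
#       -ojsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
#   etcd_endpoint_from_ip "$NODE_IP"
```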

To keep other workloads in the etcd-backup namespace out of the backup, label the etcd-details Secret so that only this Secret is included. Run the following command to label the Secret:

$ oc label secret -n etcd-backup etcd-details include=true

Backup

To create the Blueprint resource that K10 will use to back up etcd, run the command below:

$ oc --namespace kasten-io apply -f \
    https://raw.githubusercontent.com/kanisterio/kanister/0.69.0/examples/etcd/etcd-in-cluster/ocp/etcd-incluster-ocp-blueprint.yaml

Once the Blueprint is created, annotate the Secret created above so that K10 uses it with this Blueprint when backing up the etcd pod. The following command annotates the Secret with the name of the Blueprint:

$ oc annotate secret -n etcd-backup etcd-details kanister.kasten.io/blueprint='etcd-blueprint'
secret/etcd-details annotated

Once the Secret is annotated, use K10 to back up etcd through the new namespace. If the Secret was labeled as described earlier, it alone can be included in the backup by adding a resource filter on the include=true label when creating the policy.

Note

The backup location of etcd can be found by looking at the Kanister artifact of the created restore point.

Restore

To restore the etcd cluster, follow the mechanism documented by OpenShift, with minor modifications. The OpenShift documentation provides a cluster restore script (cluster-restore.sh). That script needs minor changes because it also expects a backup of the static pod manifests, which is not taken in this case. The modified version of the restore script can be found here.

Before starting the restore process, make sure these prerequisites are met:

SSH connectivity to all the leader nodes
Among all the leader nodes, choose one node to be the restore node
A command-line utility to download etcd backup from object store (e.g., the aws CLI)

The below steps should be followed to restore the etcd cluster:

Using the aws CLI, download the etcd backup to a specific location on the restore node (e.g., /var/home/core/etcd-backup).
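
As a rough sketch of the download step, the wrapper below calls aws s3 cp with the destination used in this section as its default. The function name download_etcd_backup, the DRY_RUN switch, and the S3 URI in the usage comment are assumptions for illustration. The actual object location has to be read from the Kanister artifact of the restore point, and the sketch assumes the backup is stored as a single object.

```shell
# Hypothetical wrapper around "aws s3 cp". With DRY_RUN=1 it only prints
# the command it would run, which helps when checking the paths first.
download_etcd_backup() {
    src=$1                                  # S3 URI from the Kanister artifact
    dest=${2:-/var/home/core/etcd-backup}   # default path from this section
    if [ -z "$src" ]; then
        echo "usage: download_etcd_backup <s3-uri> [destination]" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "aws s3 cp $src $dest"
    else
        aws s3 cp "$src" "$dest"
    fi
}

# Example with a placeholder URI (not a real location):
#   DRY_RUN=1 download_etcd_backup s3://example-bucket/path/to/etcd-backup
```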

Stop the static pods on all other leader nodes by moving their manifests out of the staticPodPath directory (i.e., /etc/kubernetes/manifests):

# Move etcd pod manifest
$ sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp

# Make sure etcd pod has been stopped
$ sudo crictl ps | grep etcd

# Move api server pod manifest
$ sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp
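
A container does not disappear the instant its manifest is moved, so the crictl check above may need to be repeated. The sketch below polls until a name no longer appears in the container listing. The function name wait_until_stopped and its retry defaults are assumptions, and the listing command can be overridden through the second argument.

```shell
# Hypothetical helper: polls a container listing until no line mentions
# the given name. Returns 0 once the name is gone, 1 on timeout.
wait_until_stopped() {
    name=$1
    lister=${2:-sudo crictl ps}   # command that lists running containers
    tries=${3:-30}                # number of polls (assumed default)
    delay=${4:-5}                 # seconds between polls (assumed default)

    i=0
    while [ "$i" -lt "$tries" ]; do
        # The listing command is intentionally unquoted so that it is
        # split into a command and its arguments.
        if ! $lister | grep -q -- "$name"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "container matching '$name' is still running" >&2
    return 1
}

# Possible usage after moving each manifest:
#   wait_until_stopped etcd
#   wait_until_stopped kube-apiserver
```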

Move the etcd data directory to a different location on all leader nodes other than the restore node:

$ sudo mv /var/lib/etcd/ /tmp

Run the modified cluster-ocp-restore.sh script with the location of the etcd backup:

$ sudo ./cluster-ocp-restore.sh /var/home/core/etcd-backup

Restart the Kubelet service on all of the leader nodes:

$ sudo systemctl restart kubelet.service

Verify that the single etcd node has been started:

$ oc get pods -n openshift-etcd
NAME                                                           READY     STATUS      RESTARTS   AGE
etcd-ip-10-0-149-197.us-west-1.compute.internal                1/1       Running     0          3m57s
installer-2-ip-10-0-149-197.us-west-1.compute.internal         0/1       Completed   0          7h54m
installer-2-ip-10-0-166-99.us-west-1.compute.internal          0/1       Completed   0          7h53m
installer-2-ip-10-0-212-253.us-west-1.compute.internal         0/1       Completed   0          7h52m
revision-pruner-2-ip-10-0-149-197.us-west-1.compute.internal   0/1       Completed   0          7h51m
revision-pruner-2-ip-10-0-166-99.us-west-1.compute.internal    0/1       Completed   0          7h51m
revision-pruner-2-ip-10-0-212-253.us-west-1.compute.internal   0/1       Completed   0          7h51m

Force an etcd redeployment by running the command below:

$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

# Verify all nodes are updated to latest version
$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

# Wait for output like the following, which confirms that all etcd nodes are at the latest revision
AllNodesAtLatestRevision
3 nodes are at revision 3
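
The check above can be scripted by testing the first line of the query output. The sketch below assumes the two-line output format shown above (reason first, then message). The function name rollout_complete is hypothetical.

```shell
# Hypothetical helper: reads the jsonpath output (reason, then message)
# from stdin and succeeds only if the reason is AllNodesAtLatestRevision.
rollout_complete() {
    # Read the first line; tolerate a missing trailing newline.
    IFS= read -r reason || [ -n "$reason" ] || return 1
    [ "$reason" = "AllNodesAtLatestRevision" ]
}

# Possible polling loop built on the query shown above:
#   until oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}' \
#         | rollout_complete; do
#       sleep 10
#   done
```

The same helper works unchanged for any resource queried with the NodeInstallerProgressing condition, since only the reason line is inspected.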

Force rollout for the API Server control plane component:

# API Server
$ oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

Wait for all API server pods to get to the latest revision:

$ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

Force rollout for the Controller Manager control plane component:

$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

Wait for all Controller Manager pods to get to the latest revision:

$ oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

Force rollout for the Scheduler control plane component:

$ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

Wait for all Scheduler pods to get to the latest revision:

$ oc get kubescheduler -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
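
The three control plane rollouts above follow the same patch-then-wait pattern, so they can be driven by one loop. The following is a minimal sketch under stated assumptions: the function names, retry defaults, and UTC timestamp format are illustrative (the reason string only needs to be unique). Note that the first poll after a patch may still report the previous revision as complete, so a more careful version would also compare revision numbers.

```shell
# Hypothetical helper: forces a redeployment of one operator resource and
# polls until the NodeInstallerProgressing condition reports
# AllNodesAtLatestRevision.
redeploy_and_wait() {
    resource=$1        # e.g. kubeapiserver
    tries=${2:-60}     # number of polls (assumed default)
    delay=${3:-10}     # seconds between polls (assumed default)

    reason="recovery-$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    oc patch "$resource" cluster --type=merge \
        -p "{\"spec\": {\"forceRedeploymentReason\": \"$reason\"}}" || return 1

    query='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
    i=0
    while [ "$i" -lt "$tries" ]; do
        if oc get "$resource" -o=jsonpath="$query" | grep -qx 'AllNodesAtLatestRevision'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for $resource" >&2
    return 1
}

# Rolls out the API Server, Controller Manager, and Scheduler in the
# order used in this section, stopping at the first failure.
rollout_control_plane() {
    for resource in kubeapiserver kubecontrollermanager kubescheduler; do
        redeploy_and_wait "$resource" "$@" || return 1
    done
}

# Possible usage: rollout_control_plane        (defaults)
#                 rollout_control_plane 120 5  (120 polls, 5 s apart)
```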

Verify that all the etcd pods are in the Running state. If they are, the etcd cluster has been restored successfully:

$ oc get pods -n openshift-etcd | grep etcd
etcd-ip-10-0-149-197.us-west-1.compute.internal                4/4       Running     0          19m
etcd-ip-10-0-166-99.us-west-1.compute.internal                 4/4       Running     0          20m
etcd-ip-10-0-212-253.us-west-1.compute.internal                4/4       Running     0          20m
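
The final check can also be automated by parsing the pod listing. The sketch below assumes the column layout shown above (NAME, READY, STATUS) and treats every pod whose name starts with etcd- as part of the check. The function name all_etcd_pods_ready is hypothetical.

```shell
# Hypothetical helper: reads "oc get pods" rows from stdin and succeeds
# only if at least one etcd- pod is listed and every such pod is Running
# with all of its containers ready (READY column n/n).
all_etcd_pods_ready() {
    awk '
        $1 ~ /^etcd-/ {
            seen++
            split($2, ready, "/")
            if (ready[1] != ready[2] || $3 != "Running") bad++
        }
        END {
            if (seen > 0 && bad == 0) exit 0
            exit 1
        }
    '
}

# Possible usage:
#   oc get pods -n openshift-etcd | all_etcd_pods_ready && echo "etcd restored"
```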