Orchestrated Application Failover
Moving applications across clusters can be used for performing production failovers.
Production application failover is a process by which a standby production cluster assumes operations when a primary cluster fails or primary operations are abnormally terminated.
A cluster running K10 along with a production application can use workflows provided by K10 to organize an orchestrated failover to another cluster.
The following section provides an example set of actions (the Failover Action) that can be performed to fail over an application between clusters. The actual actions should be adapted to each application and cluster configuration.
In addition, most industries have requirements to test their DR plans at regular intervals. With this in mind, this page contains another example set of actions (Failover and FailoverTest Actions) to help with building your own DR procedure.
Failover Action
The Failover Action is a set of recommended steps to make a production application DR-ready. It is assumed that the customer's cloud infrastructure can switch the production DNS name between clusters in case of a production outage. In general, failover of components other than the application running on a Kubernetes cluster is out of scope for this document.
Step 1: Cluster preparation
To organize DR infrastructure, at least one standby Kubernetes cluster equipped with K10 is required. Depending on production needs, a standby cluster can live in the same cloud environment as the primary cluster or in a different one.
This document assumes that an application to be failed over is installed in a primary cluster and that it has a working DNS configuration.
Note
See Installing K10 for details about K10 installation options.
Step 2: Selection of an external storage
To store application backups, the primary and standby clusters must both have network connectivity to at least one external storage location (S3, Google Cloud, Azure, etc.).
To allow access to an external storage location, a corresponding location profile must be configured on both the primary and standby clusters.
Note
See Location Configuration for details about location profiles.
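As a quick sanity check for this step, the following sketch confirms that a location profile with the same name is present on both clusters. It is an illustration under stated assumptions rather than a K10-provided tool: the kubeconfig context names (primary, standby), the profile name (s3-dr-profile) and the kasten-io namespace are hypothetical and should be replaced with your own values.

```shell
# Sketch: confirm that a location profile exists on every cluster involved in DR.
# Assumptions (not from this page): kubeconfig contexts "primary" and "standby",
# a profile named "s3-dr-profile", and K10 installed in the "kasten-io" namespace.

# Returns 0 if the named profile exists in the given kubeconfig context.
profile_exists() {
  local context="$1"
  local profile="$2"
  local namespace="${3:-kasten-io}"
  kubectl --context "$context" --namespace "$namespace" \
    get profiles.config.kio.kasten.io "$profile" >/dev/null 2>&1
}

# Prints one status line per cluster.
# Returns non-zero if any cluster lacks the profile.
check_profile_on_clusters() {
  local profile="$1"
  shift
  local context
  local missing=0
  for context in "$@"; do
    if profile_exists "$context" "$profile"; then
      echo "$context: profile $profile found"
    else
      echo "$context: profile $profile MISSING"
      missing=1
    fi
  done
  return "$missing"
}
```

For example: check_profile_on_clusters s3-dr-profile primary standby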
Step 3: Backup configuration
To back up a production application and store it in an external storage location, a backup policy with exports enabled must be configured on the primary cluster.
Depending on needs, the backup policy can also retain periodic local snapshots on the primary cluster for faster local restore, at the expense of some local resources.
Note
Refer to Application-Scoped Policies for details about backup + export policy configuration.
Step 4: Restore configuration
To restore an application on the standby side, an import policy must be configured on the standby cluster. Depending on whether the standby cluster is deployed in the same environment as the primary cluster or in a different one, the import policy may need to apply transforms to application resources (such as Ingress) whose configuration differs between environments.
An import policy can be configured with scheduled runs, or it can be run on demand, for example in the event of an outage or to test that the import + restore process is working.
Note
Refer to Migrating Applications for details about import + restore configuration. More information about transforms can be found on the Transforms page.
Step 5: Triggering failover
When an outage occurs on the primary cluster, the import policy on the standby side should be invoked to download an exported backup of the application and restore it.
After any external resources required by the application (such as external DNS entries) have also been failed over, a copy of the production application should be up and running on the standby cluster instead.
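An import policy can also be triggered from the command line instead of the dashboard. The sketch below renders a RunAction manifest for a given policy. The field layout reflects the K10 RunAction API as understood at the time of writing and should be checked against the API reference for your K10 version. The policy name echoserver-import and the standby context are hypothetical.

```shell
# Sketch: render a RunAction manifest that asks K10 to run a policy once.
# Assumptions (not from this page): the policy lives in the "kasten-io"
# namespace and the RunAction field names match your K10 version.

# Prints the manifest to stdout so it can be reviewed before being submitted.
render_run_action() {
  local policy="$1"
  local namespace="${2:-kasten-io}"
  cat <<EOF
apiVersion: actions.kio.kasten.io/v1alpha1
kind: RunAction
metadata:
  generateName: run-${policy}-
spec:
  subject:
    kind: Policy
    name: ${policy}
    namespace: ${namespace}
EOF
}
```

For example: render_run_action echoserver-import | kubectl --context standby create --filename=-

Because the manifest uses generateName rather than a fixed name, it has to be submitted with kubectl create; kubectl apply rejects objects that have no name.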
FailoverTest Action
The FailoverTest Action is a guided example of the steps required to build a demonstration of the procedure described in the previous section.
It is assumed that a production cluster is up and running and that K10 has been installed properly.
For demonstration purposes, a sample Kubernetes application based on the gcr.io/google_containers/echoserver:1.10 image has been deployed:
$ kubectl create namespace echoserver
$ kubectl create deployment echoserver --image=gcr.io/google_containers/echoserver:1.10 --namespace echoserver
$ kubectl expose deployment echoserver --type=NodePort --port=8080 --namespace echoserver
To provide external access to the application, an Ingress resource that uses the NGINX ingress controller has been configured:
$ kubectl apply --namespace=echoserver --filename=- <<EOF
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echoserver-ingress
  namespace: echoserver
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: ImplementationSpecific
        backend:
          service:
            name: echoserver
            port:
              number: 8080
EOF
To make sure that the application is accessible via its external IP address, the following command has been executed:
$ kubectl get ingress echoserver-ingress --namespace echoserver
NAME                 CLASS    HOSTS   ADDRESS         PORTS   AGE
echoserver-ingress   <none>   *       34.83.228.252   80      41s
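The address shown above only confirms that the ingress controller has published an endpoint. A minimal sketch of going one step further and probing the application over HTTP could look like this. The helper names ingress_address and probe_http are illustrative, and the 5-second timeout is an arbitrary choice.

```shell
# Sketch: read the external address of an Ingress and probe it over HTTP.
# The helper names and the timeout are illustrative assumptions.

# Prints the first load-balancer IP reported in the Ingress status, falling back
# to the hostname (some cloud providers report a DNS name instead of an IP).
ingress_address() {
  local name="$1"
  local namespace="$2"
  local addr
  addr="$(kubectl get ingress "$name" --namespace "$namespace" \
    --output jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)" || addr=""
  if [ -z "$addr" ]; then
    addr="$(kubectl get ingress "$name" --namespace "$namespace" \
      --output jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null)" || addr=""
  fi
  [ -n "$addr" ] && echo "$addr"
}

# Returns 0 if an HTTP GET to the address succeeds within 5 seconds.
probe_http() {
  curl --silent --fail --max-time 5 --output /dev/null "http://$1/"
}
```

For example: probe_http "$(ingress_address echoserver-ingress echoserver)" && echo "echoserver is reachable"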
Step 1: Deploying a standby cluster
For this step a standby Kubernetes cluster with K10 installed is required. It is used as the target for the failover operation.
For this demonstration, another Kubernetes cluster has been provisioned and Traefik has been used as the default ingress controller.
Step 2: Location profile configuration
At this step, an external storage location should be prepared for exporting application backups.
For the purpose of this example, an AWS S3 bucket has been used, and location profiles have been created for it on both the primary and standby instances of K10.
Step 3: Backup configuration
For the purpose of this demonstration, a backup + export policy has been configured on the primary cluster.
Note
For the purpose of this demonstration, the "on-demand" frequency has been used and the policy has been run on demand.
Step 4: Restore configuration
This step requires an import + restore policy to be configured on the standby cluster.
For the purpose of this demonstration, an import + restore policy has been configured without a schedule, and has been run on demand.
Since the two clusters use different ingress controllers, Ingress resources need to be reconfigured on the standby cluster to use the Traefik ingress controller instead of NGINX. To achieve this, transforms have been added to the import + restore policy on the standby cluster to remove the NGINX Ingress annotations and add Traefik-related settings.
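The transforms configured in the dashboard are not reproduced here. As an illustration only, and assuming that transforms are expressed as JSON Patch (RFC 6902) style operations, the change to the ingress class annotation could be written as a single replace operation. The helper name ingress_class_patch is hypothetical.

```shell
# Sketch: build a JSON Patch that switches the ingress class annotation.
# This shows the shape of the operation, not the exact transform used
# in the demonstration.

ingress_class_patch() {
  local class="$1"
  # "~1" is the JSON Pointer escape for "/" inside the annotation key
  # "kubernetes.io/ingress.class".
  printf '[{"op":"replace","path":"/metadata/annotations/kubernetes.io~1ingress.class","value":"%s"}]\n' "$class"
}
```

The effect of the operation can be previewed against the original Ingress on the primary cluster without changing it:
kubectl patch ingress echoserver-ingress --namespace echoserver --type=json --patch "$(ingress_class_patch traefik)" --dry-run=client --output yaml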
Step 5: Run-once backup
An initial backup + export has to be successfully performed on the primary cluster.
This can be achieved via the Run Once button on the Policies page.
Before moving to the next step, make sure that the corresponding policy run has completed.
Step 6: Run-once restore
After the backup has completed on the primary cluster, a restore on the standby cluster should be initiated by clicking the Run Once button on the previously created import policy.
The import policy run should complete before moving to the next step.
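When a policy run is started from the command line rather than the dashboard, its completion can be polled in a script. The sketch below assumes that the run is represented by a RunAction object whose status.state field eventually becomes Complete or Failed. The state names and the RunAction name passed in should be verified against your K10 version.

```shell
# Sketch: poll a RunAction until it reaches a terminal state.
# Assumptions (not from this page): the object exposes .status.state and uses
# "Complete" and "Failed" as terminal values.

# Usage: wait_for_run_action NAME [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 on completion, 1 on failure, 2 on timeout.
wait_for_run_action() {
  local name="$1"
  local attempts="${2:-60}"
  local delay="${3:-10}"
  local state=""
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$(kubectl get runactions.actions.kio.kasten.io "$name" \
      --output jsonpath='{.status.state}' 2>/dev/null)" || state=""
    case "$state" in
      Complete) echo "RunAction $name completed"; return 0 ;;
      Failed)   echo "RunAction $name failed" >&2; return 1 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Timed out waiting for RunAction $name (last state: ${state:-unknown})" >&2
  return 2
}
```

With the defaults, the function checks every 10 seconds for up to 10 minutes. Large exports or imports may need a longer window.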
Step 7: Checking an application copy
After the import and restore have completed, verify that the application is up and running on the standby cluster and that it is accessible externally.
In this case, the Ingress resource was restored with a Traefik ingress class annotation, and the application is now accessible via the Traefik load balancer's IP address:
$ kubectl get ingress echoserver-ingress --namespace echoserver -o yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik
  creationTimestamp: "2023-01-12T06:46:51Z"
  generation: 1
  name: echoserver-ingress
  namespace: echoserver
  resourceVersion: "778114"
  uid: de70a1eb-b90b-4167-bd37-c7d233ce58a7
spec:
  rules:
  - http:
      paths:
      - backend:
          service:
            name: echoserver
            port:
              number: 8080
        path: /
        pathType: ImplementationSpecific
status:
  loadBalancer: {}
$ kubectl get service traefik --namespace traefik
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                      AGE
traefik   LoadBalancer   10.95.252.188   34.168.126.252   80:32340/TCP,443:32305/TCP   22m