Moving applications across clusters can be used for performing production
failovers.
Production application failover is a process by which a standby production
cluster assumes operations when a primary cluster fails or primary operations
are abnormally terminated.
A cluster running Veeam Kasten along with a production
application can use workflows provided by Veeam Kasten to organize an
orchestrated failover to another cluster.
The following section provides an example set of actions (the Failover Action)
that can be performed to fail over an application between clusters.
Actual actions should be adapted to each application and cluster configuration.
In addition, most industries have requirements to test their DR plans at
regular intervals. With this in mind, this page contains another example
set of actions (Failover and FailoverTest Actions) to help with building
your own DR procedure.
The Failover Action is a set of recommended steps to make a production
application DR-ready. It is assumed that the customer's cloud infrastructure
can switch production DNS names between clusters in case of a production outage.
In general, failover of components other than the application running on a
Kubernetes cluster is out of scope for this document.
To organize DR infrastructure, at least one standby Kubernetes cluster
equipped with Veeam Kasten is required. Depending on production needs,
a standby cluster can live in the same cloud environment or in a
different one.
This document assumes that an application to be failed over is installed
in a primary cluster and that it has a working DNS configuration.
To store application backups, the primary and standby clusters must have
network connectivity to at least one external storage location (S3,
Google Cloud, Azure, etc.).
To allow access to an external storage location, a corresponding
location profile must be configured on both the primary and standby
clusters.
To back up a production application and store it in an external storage
location, a backup policy with exports enabled must be configured on
the primary cluster.
Depending on needs, the backup policy can also retain periodic local snapshots
on the primary cluster for faster local restore, at the expense of some local
resources.
To restore an application on the standby side, an import policy must
be configured on the standby cluster. Depending on whether the standby cluster
is deployed in the same environment or in a different one, the import policy
may need to apply transformations to application resources
(such as Ingress) that require different configurations between
environments.
An import policy can be configured with scheduled runs or can be run on
demand, for example in the event of an outage, or to test that the
import + restore process is working.
Note
Refer to Migrating Applications for details about
import + restore configuration. More information about transforms can
be found on the Transforms page.
When an outage occurs on the primary cluster, an import policy on
the standby side should be invoked to download an exported
backup of the application and restore it.
After any external resources required by the application have also been
failed over (such as external DNS entries), a copy of the production
application should be up and running on the standby cluster instead.
At this step, an external storage location should be prepared for exporting
application backups.
For the purpose of this example, an AWS S3 bucket has been used and
location profiles have been created for it on both the primary and standby
instances of Veeam Kasten.
This step requires an import + restore policy to be configured
on the standby cluster.
For the purpose of this demonstration, an import + restore policy
has been configured without a schedule, and has been run on demand.
Since the two clusters use different ingress controllers, Ingress
resources need to be reconfigured on the standby cluster to use
the Traefik ingress controller instead of NGINX. To achieve this,
transforms have been added to the import + restore policy
on the standby cluster to remove NGINX Ingress annotations
and add Traefik-related settings.
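As a rough illustration of what such transforms do, the sketch below applies
a few JSON Patch style operations (RFC 6902) to an Ingress manifest held as a
Python dictionary. It is not the policy used in this example: the resource
name, the specific annotations, and the helper apply_ops are assumptions made
for illustration, and the helper only covers the add, replace, and remove
operations on nested mappings.

```python
import copy


def _unescape(token):
    # JSON Pointer (RFC 6901): "~1" encodes "/" and "~0" encodes "~".
    return token.replace("~1", "/").replace("~0", "~")


def apply_ops(resource, ops):
    """Apply add/replace/remove operations to a nested dict and return a copy."""
    result = copy.deepcopy(resource)
    for op in ops:
        tokens = [_unescape(t) for t in op["path"].split("/")[1:]]
        parent = result
        for token in tokens[:-1]:
            parent = parent[token]
        key = tokens[-1]
        if op["op"] == "remove":
            del parent[key]
        elif op["op"] in ("add", "replace"):
            parent[key] = op["value"]
        else:
            raise ValueError(f"unsupported operation: {op['op']}")
    return result


# Hypothetical Ingress as exported from the primary (NGINX) cluster.
ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "Ingress",
    "metadata": {
        "name": "demo-app",  # illustrative name
        "annotations": {
            "kubernetes.io/ingress.class": "nginx",
            "nginx.ingress.kubernetes.io/rewrite-target": "/",
        },
    },
}

# Illustrative operations: drop an NGINX annotation, switch the ingress
# class, and add a Traefik annotation. "/" inside a key is written as "~1".
ops = [
    {"op": "remove",
     "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1rewrite-target"},
    {"op": "replace",
     "path": "/metadata/annotations/kubernetes.io~1ingress.class",
     "value": "traefik"},
    {"op": "add",
     "path": "/metadata/annotations/traefik.ingress.kubernetes.io~1router.entrypoints",
     "value": "web"},
]

restored = apply_ops(ingress, ops)
print(restored["metadata"]["annotations"])
```

The original dictionary is left untouched, which makes it easy to compare the
resource before and after the operations while tuning them for a real
environment.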
After a backup is completed on the primary cluster, a restore on the
standby cluster should be initiated by clicking the RunOnce
button on the import policy previously created.
The import policy run should be completed before moving to the next
step.
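For scripted failovers, an on-demand run can also be requested without the
dashboard. The sketch below builds a RunAction manifest as JSON, on the
assumption that the standby cluster exposes the RunAction resource in the
actions.kio.kasten.io/v1alpha1 API group and that Veeam Kasten is installed in
the kasten-io namespace. The policy name is hypothetical, and the field layout
should be checked against the API reference for the installed version before
use.

```python
import json


def build_run_action(policy_name, namespace="kasten-io"):
    """Return a manifest asking Veeam Kasten to run the named policy once."""
    return {
        "apiVersion": "actions.kio.kasten.io/v1alpha1",
        "kind": "RunAction",
        # generateName lets the API server pick a unique name for each run.
        "metadata": {"generateName": f"run-{policy_name}-"},
        "spec": {
            "subject": {
                "kind": "Policy",
                "name": policy_name,
                "namespace": namespace,
            }
        },
    }


# "standby-import-policy" is a placeholder for the import policy's name.
manifest = build_run_action("standby-import-policy")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with
kubectl create -f, since generateName requires create rather than apply.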
After the import and restore are completed, verify that the
application is up and running on the standby cluster
and that it is accessible externally.
In this case, the Ingress resource was restored with a Traefik
ingress class annotation and the application is now accessible
via the Traefik load balancer's IP address.
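The external accessibility check can be automated as part of a failover test.
The following is a minimal sketch, assuming the application answers plain HTTP
requests with a 2xx status when healthy. The helper names wait_until_ready and
http_probe, the retry count, and the delay are illustrative choices, not part
of Veeam Kasten.

```python
import time
import urllib.request


def wait_until_ready(probe, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, timeouts, and HTTP errors all derive from
            # OSError; treat them as "not ready yet" and retry.
            pass
        if attempt < attempts:
            sleep(delay_seconds)
    return False


def http_probe(url, timeout=5.0):
    """Build a probe that is True when the URL responds with a 2xx status."""
    def probe():
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    return probe
```

A failover test could then call wait_until_ready(http_probe(url)), with url
set to the address of the Traefik load balancer on the standby cluster, and
treat a False result as a failed test.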