In my previous post I did a proof of concept of recreating my infrastructure at a secondary node. While it worked, it was highly manual, and it meant some downtime until I actually got around to doing it.
A proper DR solution, however, should be pre-made, ready to be enacted. It can still be partly manual, but the more scripted it is and the more pre-made – and in many cases, pre-running – infrastructure there is, the more you can trust it will actually work when you need it.
With this in mind, I set off on creating a proper DR infrastructure.
The first decision: Separate DR cluster or same Kubernetes cluster?
DR-wise, you don’t want to create too many dependencies between production and DR; you want to make sure they are independent enough. In my case, as an enthusiast with limited infrastructure, I found it tempting to test multi-node Kubernetes clusters, so I chose to keep the DR node as part of the same cluster. DR-wise, it probably wasn’t wise, but it gives me some extra infrastructure to run my workloads – and play with technology – on, so I went for keeping it in the same cluster.
But running a multi-node cluster with only two nodes isn’t necessarily wise. If there is a network issue and your cluster splits in two, you want your cluster to have a pre-arranged agreement about who is the real environment. Usually, you’ll want the part with the majority of nodes to keep running without the smaller part, while a part of the cluster that recognizes it’s in the minority should silently stop pretending it knows what the heck is going on and not try to change any cluster state. This mechanism is called quorum. In a two-node cluster, you can’t get a majority: any split leaves both nodes alone, with no basis to decide whether they should be the boss or not.
The solution to this might be to introduce a third node. It could be a node that does nothing much but vote, without running any workloads. I contemplated running such a node on my general-purpose VPS for a while. Maybe I’ll try it out at some point, who knows?
Alas, in my case, serving the world isn’t my server’s main purpose; most of all it’s my home automation system that runs stuff in my home. Even if it’s alone in the world, I want it to perform this duty – as it does today. So, I decided on a quorum of only one node, as I don’t want more infrastructure at home. I’m a minimalist there.
So, what about my second node? There is a separate mode you can run it in: agent. It won’t run the API server, it won’t have the state database of its own, and it’s fully dependent on the master node for much of its tasks – although, at least for a while, it might continue running its workloads should the master go down.
Other than this, it’s a node as good as any; all cluster mechanisms, like distributed workloads, work as if it’s a full node. So my DR node became an agent in my cluster in its spare time – which hopefully is most of the time. But it also means that I still have work to do to make it a full-blown node with its own view of the world if I have to run my main workloads from it – in fact, that task I have already tested in my previous blog post, recreating the cluster from my etcd backup.
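For reference, joining a node as an agent is pretty simple – a minimal sketch assuming k3s, where the server URL and token are placeholders:

# Install k3s in agent mode, pointing it at the existing master node (URL and token are placeholders)
curl -sfL https://get.k3s.io | K3S_URL=https://hassio:6443 K3S_TOKEN=<node-token> sh -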
Oh, and there’s one other thing I have done to gain a bit more control over what workloads get placed where: a taint. It’s sort of a warning sign not to schedule any workloads on remote. On a workload, I can specify a toleration so it ignores the taint and is still allowed to be scheduled there, and then I can select the node with a nodeSelector if I want to pin it specifically to that node. On a POD specification, this is how it looks:
nodeSelector:
  kubernetes.io/hostname: remote
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "remote"
  effect: "NoSchedule"
A POD with that added to the specification will ignore the taint and always run on remote, which is what I’d want my pre-configured/maybe-running DR workloads to do.
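The taint itself is set on the node; matching the toleration above, it would look something like this:

kubectl taint nodes remote dedicated=remote:NoSchedule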
Improved DR: Preconfigured (and some pre-running) workloads.
In my etcd database, there are two nodes the cluster knows about: hassio (my main node) and remote (my agent). Should hassio go down and I want to bring up remote, it will become a master node named remote, look at the workloads, and bring up whatever it can. For some workloads that don’t have any state on disk (or where it hardly changes), I can probably leave it up to the cluster where to schedule them; for others I’ll need to do more manual work.
This blog (wordpress), for example, has both a database and the document root of the web server.
It’s possible to make a replicated database solution, although running two masters is a different game. Solutions exist for that both for mysql and for postgresql, but they’re not out of the box and straightforward, so I’ll leave that for another day. I might want to play with it at some point, but I can actually tolerate some downtime; I just want a predictable and easy way to get it up – on a secondary node, if I have to. So for now, I’ll not do multi-master replication.
At the time of writing, I actually haven’t implemented any database replication solution, but I expect that to change. For mostly-read workloads like wordpress, it might be tempting to always have an up-to-date database in the DR environment. We’ll see, but for now I stick to separate copies of the database for DR. The added bonus for me is that the separate setup can double as a pre-prod setup – if I have an active DR, I can do my changes in DR first, before doing them in production. It’s not something you’d do in a mission-critical environment, there you’d have dedicated test clusters. But in my case, it is a fine trade-off.
The other issue is persistent storage (disks). There are two ways to achieve this in a replicated setup:
- Network file systems like NFS have their use cases. Both nodes will see the same files. But it introduces more network latency, increases complexity, and creates more dependencies on the network and on the central file server – the latter I don’t actually have, and in my minimalist cluster it’d end up running on the master node, so I’d have manual work to get it moved to the DR node anyhow…
- Distributed file systems like Ceph are what you’d use in a professional multi-master setting, but those systems all eat RAM and system resources for lunch, and for a home user like me that is a luxury I can’t really afford.
So for now, we’ll go with something slightly more manual.
Creating an active DR wordpress
If you have read my previous blog post Kubernetes configuration as code – Gitea and ArgoCD, you’ll know that I have all the configuration to recreate my workloads ready-made. So, we’ll start off by creating a separate installation of wordpress, duplicating all the yaml, just with different naming. I have opted to keep it in the wordpress namespace; others might prefer a different namespace. The advantage of keeping it in the same namespace is that the secrets can be the same – I don’t need to copy e.g. database passwords, because in a copied database instance, the passwords will of course also be copied.
I have also created separate storage pools and storage classes. In prod, I have zfs-storage-znvm for SSD disks and zfs-storage-nas for spinning disks. In DR, I don’t have that large an SSD and most of the storage is on spinning disks, but I have still created zfs-storage-znvm-dr and zfs-storage-nas-dr to keep the yaml configuration consistent (and probably even possible to generate with scripts, a lot of it, when I come to think of it).
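For illustration, a DR storage class could look roughly like this – a sketch assuming the OpenEBS ZFS LocalPV provisioner; the pool path and topology key will vary with your setup:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-storage-znvm-dr
provisioner: zfs.csi.openebs.io   # assuming the OpenEBS ZFS LocalPV CSI driver
parameters:
  poolname: "backup/znvm/k3s"     # placeholder: the ZFS dataset on the DR node
  fstype: "zfs"
allowedTopologies:                # only provision volumes on the DR node
- matchLabelExpressions:
  - key: kubernetes.io/hostname
    values:
    - remote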
So, my database storage, which in production is:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    io.kompose.service: database-vegard
  name: wordpress-db
  namespace: wordpress
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: zfs-storage-znvm
  resources:
    requests:
      storage: 10Gi
becomes
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    io.kompose.service: database-vegard
  name: wordpress-db-dr
  namespace: wordpress
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: zfs-storage-znvm-dr
  resources:
    requests:
      storage: 10Gi
I have mostly opted to just add -dr to all the resources, so I’ll not detail all the changes – they are all pretty straightforward and follow the same scheme. I have even named my DR blog https://vegard-dr.blog.engen.priv.no/ – try it out; if I haven’t broken it, it works and has the early content from this blog, but not, for example, this post (until I decide to copy and refresh the data, more on that later).
My workload specifications will need the aforementioned section to allow them to run on remote – and tell them to stay there:
nodeSelector:
  kubernetes.io/hostname: remote
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "remote"
  effect: "NoSchedule"
It should then find and bind to the volumes I have created on the DR node. All configuration, like configmaps (some might be reusable across nodes) and secrets, already exists and is usable for the DR workloads.
Permanent DR Infrastructure
There are, however, a few things needed for the world to reach wordpress and the other infrastructure.
Traefik
As previously described in other blog posts, I am using Traefik as my reverse proxy. To be truly independent from my main setup, I want DR to have its own traefik instance, traefik-dr, with an ingressclass traefik-dr. These are named traefik-external in my main setup. I used to have a traefik-internal, but decided it wasn’t actually needed.
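The ingressclass itself is a small resource; a minimal sketch of traefik-dr could look like this, assuming a reasonably recent Traefik:

apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: traefik-dr
spec:
  controller: traefik.io/ingress-controller   # Traefik's controller name; older versions use traefik.containo.us/ingress-controller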
Ingressroute
The ingressroute that binds wordpress to the external instance is also a bit special. Designing a DR solution, you’ll always try to go for a configuration where the number of changes needed to activate DR is as small as possible. I can make an ingressroute that accepts both the production and the DR hostname without reconfiguration:
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: wordpress-ingressroute-dr
  namespace: traefik-external
  annotations:
    kubernetes.io/ingress.class: "traefik-dr"
spec:
  entryPoints:
  - websecure
  routes:
  - match: Host(`vegard-dr.blog.engen.priv.no`) || Host(`vegard.blog.engen.priv.no`)
    kind: Rule
    middlewares:
    - name: redirect-to-https
    services:
    - name: wordpress-dr
      namespace: wordpress
      port: 80
  tls:
    certResolver: letsencrypt
As you can see, it will accept (and generate certificates for) both the DR and the production hostname, so there’s one less thing that needs to be changed if prod is down and I want to bring it up on DR.
Load balancers, DNS and ingress traffic.
In my previous DR attempt, I decided to run everything incoming through my main site. But that means I’m still dependent upon my home network being up, so I wanted to make this a bit better.
I have created an external load balancer IPv6 pool for DR, called loadbalancer-ipv6-pool-dr, using a network from a /48 (yes, a pretty generous provider) that is routed to my VPS. For IPv4, I have only one IP address, which actually makes it a bit different – but not radically – from the main setup. I don’t have to deal with NAT and port forwarding, but I have to hardcode the IP address in the load balancer resources. I also have no Unifi Gateway to handle firewall policies for me, so the security of the node itself becomes a bit more important.
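For reference, the pool itself is roughly a Calico IPPool dedicated to load balancer addresses – a sketch, with a placeholder CIDR:

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: loadbalancer-ipv6-pool-dr
spec:
  cidr: 2001:db8:1234:dead::/120   # placeholder prefix carved out of the routed /48
  allowedUses:
  - LoadBalancer                   # only hand these addresses out as service load balancer IPs
  disabled: false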
To get the routing working and make my remote node behave as a part of my network, I have also, of course, set up more BGP routing to make sure all my internal networks can talk to each other, including the networks at the remote node.
I have written about most of my BGP experiments here, here and here, and what I had to do here is quite similar to these blog posts.
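As an illustration, peering the remote node with the router on the VPS side is just another Calico BGPPeer – a sketch with placeholder peer address and AS number:

apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: remote-upstream
spec:
  node: remote                 # only peer from the DR node
  peerIP: 2001:db8:1234::1     # placeholder: the router/VPS side of the session
  asNumber: 64512              # placeholder AS number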
One of these days, I’ll get around to writing a more thorough blog post about networking in Kubernetes, too.
So, I have a loadbalancer for wordpress with IPv6:
apiVersion: v1
kind: Service
metadata:
  name: traefik-vegardblog-dr
  namespace: traefik-external
  annotations:
    projectcalico.org/ipv6pools: '["loadbalancer-ipv6-pool-dr"]'
    external-dns.alpha.kubernetes.io/hostname: vegard-dr.blog.engen.priv.no
    external-dns/external: "true"
    external-dns.alpha.kubernetes.io/ttl: "300"
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  ipFamilyPolicy: SingleStack
  ipFamilies:
  - IPv6
  ports:
  - name: web
    port: 80
  - name: websecure
    port: 443
  selector:
    app: traefik-dr
If externaldns is actually up, it will create a record in public DNS too.
For IPv4, just as on-prem, I have created:
apiVersion: v1
kind: Service
metadata:
  name: traefik-external-ipv4-dr
  namespace: traefik-external
spec:
  allocateLoadBalancerNodePorts: true
  loadBalancerIP: <public IP>
  externalIPs:
  - <public ip>
  externalTrafficPolicy: Local
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: web
    port: 80
    protocol: TCP
    targetPort: 80
  - name: websecure
    port: 443
    protocol: TCP
    targetPort: 443
  selector:
    app: traefik-dr
  sessionAffinity: None
  type: LoadBalancer
And for DNS for IPv4, I create services of type ExternalName, as I did earlier:
kind: Service
apiVersion: v1
metadata:
  name: wordpress-name-dr
  namespace: wordpress
  annotations:
    external-dns.alpha.kubernetes.io/hostname: vegard-dr.blog.engen.priv.no
    external-dns/external: "true"
    external-dns.alpha.kubernetes.io/ttl: "300"
spec:
  type: ExternalName
  externalName: <external IP>
These are of course a bit optimistic; if externaldns isn’t up and something happens that changes the IPv6 address, I’ll need to change DNS manually. At the time of writing, I don’t have any externaldns running in DR – but that might change…
Syncing data
With all this, the infrastructure is created, but I still need to synchronize the data, as what I’ve created so far is an uninitialized wordpress instance.
As mentioned in The road to enterprise at home: A DR-test!, I have a backup strategy where all my zfs volumes are synced to the backup node – which is how I got started thinking about running it as a DR node; I was already halfway there with the data synced to the node. But I don’t want to run my DR directly off the backup volumes, I want to copy the data to the freshly created zfs filesystem that my DR wordpress deploy created. Here ZFS comes to the rescue, and this can be done in a pretty lightweight way! You can actually clone a fully writeable new filesystem based on a snapshot of another filesystem.
A word of warning, though: the new filesystem will have a dependency on the snapshot, so you’ll not be able to destroy that snapshot before you have destroyed the clone. This actually bit me, because my backup strategy depends on syncing the changes that have happened since the last backup to the backup filesystem, and at least the way this is implemented in syncoid (a part of sanoid), it fails when a snapshot has been created on the destination since the last sync. My solution was to create the snapshots on the source system, and then let syncoid sync the snapshot over – since it’s meant to preserve the snapshots from the source system.
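In practice that looks roughly like this – a sketch run on the main node, where the source dataset name is a placeholder:

# On the source (hassio): create the snapshot the clone will later be based on
zfs snapshot tank/k3s/pvc-577cf97a-08a7-483e-b6ab-8367e0c6e6bb@clone
# Let syncoid replicate it to the DR node; --no-sync-snap tells it not to create its own snapshots
syncoid --no-sync-snap tank/k3s/pvc-577cf97a-08a7-483e-b6ab-8367e0c6e6bb root@remote:backup/encrypted/znvm/k3s/pvc-577cf97a-08a7-483e-b6ab-8367e0c6e6bb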
But, back to the clone. The process to copy the database volume is as follows:
Scale down wordpress-db-dr:
kubectl scale -n wordpress deployment wordpress-db-dr --replicas 0
In my case, I actually needed to temporarily change the specification of wordpress-db-dr in the argocd repository to stop argocd from detecting the drift and setting it back to 1 replica… There are more ways to do that, though (see the sketch below).
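For instance, ArgoCD can be told to ignore the replica count on that particular deployment – a sketch of what that could look like in the Application spec (illustrative only):

spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    name: wordpress-db-dr
    namespace: wordpress
    jsonPointers:
    - /spec/replicas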
Find the volume names of the relevant file systems:
hassio% kubectl get pvc -n wordpress
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS          VOLUMEATTRIBUTESCLASS   AGE
wordpress-db        Bound    pvc-577cf97a-08a7-483e-b6ab-8367e0c6e6bb   10Gi       RWO            zfs-storage-znvm      <unset>                 181d
wordpress-db-dr     Bound    pvc-115939d9-5381-40f7-b573-eee9befaf142   10Gi       RWO            zfs-storage-znvm-dr   <unset>                 8d
wordpress-html      Bound    pvc-232f16b3-3d12-40f0-ac9a-09241fffd551   100Gi      RWO            zfs-storage-nas       <unset>                 181d
wordpress-html-dr   Bound    pvc-ea99e934-58dc-414d-85e5-b41ae3e36fc5   100Gi      RWO            zfs-storage-nas-dr    <unset>
On DR (remote), destroy the file system you want to clone to:
zfs destroy backup/znvm/k3s/pvc-115939d9-5381-40f7-b573-eee9befaf142
(because that’s what it was called in my system)
Then I need to clone it:
zfs clone backup/encrypted/znvm/k3s/pvc-577cf97a-08a7-483e-b6ab-8367e0c6e6bb@clone backup/znvm/k3s/pvc-115939d9-5381-40f7-b573-eee9befaf142
…and then came the stuff I struggled with, but I finally figured it out. To make it mountable by the POD, you need to make sure it’s unmounted (zfs will often automatically mount it), and you need to make sure the mountpoint is set to legacy:
zfs umount backup/znvm/k3s/pvc-115939d9-5381-40f7-b573-eee9befaf142
zfs set mountpoint=legacy backup/znvm/k3s/pvc-115939d9-5381-40f7-b573-eee9befaf142
Now you can scale up wordpress-db-dr again, and it will contain the same as the snapshot:
kubectl scale -n wordpress deployment wordpress-db-dr --replicas 1
This clone is actually not a full copy; it shares the blocks that are common up to the snapshot, and only if there are changes in the clone will extra disk space be used.
This is actually a neat trick to save some disk space, and as I am paying for disk space at the provider, duplicating data can add up to real money. Some people will be less concerned about this, though. For those, a zfs send piped to a zfs receive will probably feel safer. In a professional setting, I’d probably shell out for the disk space to duplicate. Here’s the command for a full copy:
zfs send backup/encrypted/znvm/k3s/pvc-577cf97a-08a7-483e-b6ab-8367e0c6e6bb@clone | zfs recv backup/znvm/k3s/pvc-115939d9-5381-40f7-b573-eee9befaf142
Now, the DR volume will contain a fully independent copy.
Me, I trust the zfs clone mechanisms and save disk space, but I can tolerate some surprises at the backup/DR node in the future…
The process to copy the other volume is exactly the same.
But yet another minor detail…
The cloned wordpress will believe it’s called vegard.blog.engen.priv.no, not vegard-dr.blog.engen.priv.no. When you run production off it, that’s probably what you want, but if you want to be able to test it independently you need to rename it to vegard-dr.blog.engen.priv.no. The way I did that was to inject WP_HOME and SITE_URL environment variables with this name, sourced from an environment-specific configuration file. But this also means that if I want to actually run DR, not only test it, I need to change that config map.
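For illustration, the injection in the DR deployment can look something like this – the configmap name here is made up:

# In the wordpress-dr container spec; wordpress-dr-env is a hypothetical configmap holding the DR hostname
env:
- name: WP_HOME
  valueFrom:
    configMapKeyRef:
      name: wordpress-dr-env
      key: WP_HOME
- name: SITE_URL
  valueFrom:
    configMapKeyRef:
      name: wordpress-dr-env
      key: SITE_URL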
Activating DR for wordpress.
There are a few steps to activate DR, but not many.
- I’d probably want to resync the data to the latest backup (which already resides on the same node).
- I need to change the identity so that it believes it’s vegard.blog.engen.priv.no, in the aforementioned configmap.
- I need to change the main load balancer resource traefik-external-wordpress, to:
- Serve an IP address from the loadbalancer-ipv6-pool-dr pool (see the sketch after this list)
- Change traefik-name (the production version) to point to the DR IPv4 address.
- If externaldns isn’t up, I need to update DNS manually.
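For the IPv6 part, the change boils down to swapping the pool annotation on the production load balancer service – roughly like this (the name of the production pool is an assumption here):

# Excerpt from traefik-external-wordpress after activating DR
metadata:
  annotations:
    projectcalico.org/ipv6pools: '["loadbalancer-ipv6-pool-dr"]'   # was the production pool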
That wasn’t too bad for a DR activating process, was it?
Other tested DR components
- Keycloak was easily cloned. I decided to not ever run it independently, so it believes it’s called keycloak.engen.priv.no. If I want to use it, I just need to make sure production DNS points to DR. The configuration is mostly static, so I just need to make a new clone now and then.
- Redis is actually a dependency of WordPress in my case, so I needed to make a separate installation. I just give it a permanent DR identity, because it’s easiest that way. There is no data sync; the caching is independent.
- Nextcloud was approximately the same process as for wordpress, though with a few more variables to change. I also decided to clean up some hardcoded values in the filesystem configuration, so that it’s all governed by the Kubernetes configuration and I can easily clone with new data without having to do any changes in the cloned data.
With all this done, the next step is probably a better DR test than my first one! Stay tuned!