In my previous blog post, I got as far as having an identical gitea in DR, with the same repositories that exist on-prem. They will, of course, not stay identical for very long without a way to keep them in sync. Before starting to migrate applications to applicationsets and create them in DR, I needed to decide on the mechanism, and there I basically had two choices:
- DR can continuously poll PROD for changes. The disadvantage with this is that I can’t push any changes to a pull-replica, so if PROD goes down, I can’t make any changes without turning off the replication, and I can’t turn replication back on without recreating the repository from scratch as a replica. It’s probably a doable strategy, but I chose the other one.
- PROD can push changes to DR, periodically and/or on every commit. It’s a force-push, so it will overwrite any changes I make in DR. However, should PROD be down, I can still make changes in DR, which I might have to, e.g. to activate deactivated DR components. If I am unsure whether PROD is totally dead, I can make sure it can’t push by deleting the credentials it uses to push, recreating them when PROD is alive again.
As stated, I went for option two. I won’t go into the details of how to actually do that; it is well documented, for example here.
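I won’t document my exact setup here either, but purely to illustrate the idea, a periodic force-push could look roughly like the CronJob below, running in the PROD cluster. The image, secret, namespace and repository URLs are placeholders, not my actual configuration.

# Hypothetical sketch only: mirror one repository from PROD gitea to DR gitea on a schedule
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mirror-bootstrap-repo
  namespace: gitea
spec:
  schedule: "*/15 * * * *"        # every 15 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: mirror
              image: alpine/git:latest
              env:
                - name: DR_REPO_URL            # e.g. https://user:token@<dr-repo-url>
                  valueFrom:
                    secretKeyRef:
                      name: dr-push-credentials
                      key: url
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -e
                  # clone everything (all branches and tags) from PROD...
                  git clone --mirror <prod-repo-url> /tmp/repo
                  cd /tmp/repo
                  # ...and force-push it to DR, overwriting whatever is there
                  git push --mirror "$DR_REPO_URL"

Deleting the dr-push-credentials secret (or the token behind it) is then enough to stop PROD from overwriting DR while DR is active.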
In my old repo, I have one apps-of-apps that contains all my other ArgoCD applications, so in reality the process is quite similar to the bootstrap repo migration process.
I do, however, want PROD to be connected to the production gitea and DR to the DR gitea. That is simply done by making the repoURL a variable, so here’s my appsets applicationset that contains all the other gitea-hosted applicationsets:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: appsets-bootstrap
  namespace: argocd
spec:
  goTemplate: true
  generators:
    # ===== PROD =====
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  env: prod
          - list:
              elements:
                - appName: appsets-prod
                  repoURL: <prod-repo-url>
                  env: prod
    # ===== DR =====
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  env: dr
          - list:
              elements:
                - appName: appsets-dr
                  repoURL: <dr-repo-url>
                  env: dr
  template:
    metadata:
      name: "appsets-{{ .env }}"
      namespace: argocd
      annotations:
        argocd.argoproj.io/sync-wave: "5"
    spec:
      project: default
      source:
        repoURL: "{{ .repoURL }}"
        targetRevision: main
        path: application
      destination:
        server: https://kubernetes.default.svc
        namespace: appsets
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
This creates applications for prod and DR that pull from different repositories. This applicationset specification, of course, lives in the bootstrap repo, which makes it the final step in the bootstrap process.
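For illustration, the PROD half of that matrix renders to an Application roughly like this (give or take the labels and ownership metadata the ApplicationSet controller adds on its own):

# Roughly what ArgoCD generates from the PROD generator pair above
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: appsets-prod
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "5"
spec:
  project: default
  source:
    repoURL: <prod-repo-url>
    targetRevision: main
    path: application
  destination:
    server: https://kubernetes.default.svc
    namespace: appsets
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true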
Creating applicationsets for workloads
The migration of applications to applicationsets is pretty similar to the process in part 3, but since these are applications that are meant to be exposed externally, there are also some differences.
Instead of each environment having different hostnames, the hostnames are now the same and should be active in only one of production and DR, so we also need some DR activation mechanisms and procedures.
These applications also don’t necessarily need to run before activating DR; it’s fine to spin them up on DR activation. We also need to make sure the content is as close to the last known good production content as is feasible. In general, I use Longhorn disaster recovery volumes for this, but for CloudNativePG I am better off continuously pulling the database backup as a replica, which is activated as a master when I need to activate DR.
I don’t have both of these mechanisms in the same application anywhere in my cluster, so I’m going to start with the application for this blog, which uses Longhorn DR volumes. I am using MariaDB for it, as that’s what WordPress prefers, and MariaDB is happy to just spin up on a copy of the files from production.
The wordpress applicationset
I’ll start off by just showing the applicationset specification:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: wordpress
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  env: prod
          - list:
              elements:
                - appName: wordpress-prod
                  appPath: application/wordpress/prod
                  namespace: wordpress
                  enabled: "true"
                  repoURL: "<prod-repo-url>"
                  env: prod
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  env: dr
          - list:
              elements:
                - appName: wordpress-dr
                  appPath: application/wordpress/dr
                  namespace: wordpress
                  enabled: "false"
                  repoURL: "<dr-repo-url>"
                  env: dr
  template:
    metadata:
      name: "{{ .appName }}"
      labels:
        app.kubernetes.io/name: wordpress
        env: "{{ .env }}"
    spec:
      project: default
      source:
        repoURL: "{{ .repoURL }}"
        targetRevision: main
        path: "{{ .appPath }}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{ .namespace }}"
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
  templatePatch: |
    {{- if eq .enabled "false" }}
    spec:
      source:
        kustomize:
          patches:
            - target:
                group: ""
                version: v1
                kind: Service
                # name: app # optionally scope to specific Services
                # labelSelector: "app=wordpress"
              patch: |-
                - op: add
                  path: /metadata/annotations/external-dns~1external
                  value: "false"
    {{- else }}
    spec:
      source:
        kustomize:
          patches:
            - target:
                group: "longhorn.io"
                version: v1beta2
                kind: Volume
                # name: app # optionally scope to specific Volumes
                # labelSelector: "app=wordpress"
              patch: |-
                - op: add
                  path: /spec/Standby
                  value: false
    {{- end }}
The first part of it is well-known, except that I introduce another variable, enabled. Flipping this is my DR activation mechanism.
The magic for that is in the templatePatch section. It tests whether this variable is true or false, and patches the application YAML accordingly:
- If enabled is false, I set an annotation external-dns/external: false on every Service (overwriting a true value if present), making sure no DNS record is created (see the sketch after this list). The reason I do it like this is that I can then set external-dns/external: true in the repo if a Service should be in external-dns, and leave it out if not. It simply makes the logic easier: one more false doesn’t matter, since external-dns simply won’t do anything with it, but one more true can create unwanted DNS records.
- If enabled is true, I patch any Longhorn Volume YAMLs with Standby: false. Again, if they were already active this is a no-op, but if a volume has Standby: true the patch flips it, activating the DR volume.
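To make the annotation logic concrete, here is a hypothetical Service as it could look in the repo; the name, port and LoadBalancer type are made up, and I’m assuming external-dns is configured to only create records for objects where this annotation is true, which is what the logic above relies on. When enabled is false, the templatePatch rewrites the annotation to "false" and no record is published.

# Hypothetical example Service as committed in the repo (name/port made up)
apiVersion: v1
kind: Service
metadata:
  name: wordpress
  namespace: wordpress
  annotations:
    # opt-in: external-dns publishes a record only while this is "true";
    # the DR templatePatch above rewrites it to "false"
    external-dns/external: "true"
spec:
  type: LoadBalancer
  selector:
    app: wordpress
  ports:
    - port: 80
      targetPort: 8080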
I am not currently doing this for wordpress, but if I also add this patch under the false test, I can keep the wordpress pod from running when DR is not activated, saving resources on a potentially resource-limited DR node:
- target:
    group: apps
    version: v1
    kind: Deployment
    # name: app # optionally scope to specific Deployments
  patch: |-
    - op: replace
      path: /spec/replicas
      value: 0
This won’t work for the MariaDB pod, as that’s run by the MariaDB operator and uses other mechanisms, but that pod won’t really start anyhow until the volume with the database is activated.
I have mentioned the DR volumes before. I can specify the PVCs normally, but in DR I need to add explicit resources for the PV and the Longhorn Volume, so that the PVC binds to them instead of provisioning its own:
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: wordpress-files
  namespace: longhorn-system
spec:
  Standby: true
  accessMode: rwo
  backingImage: ""
  backupCompressionMethod: lz4
  numberOfReplicas: 1
  dataLocality: best-effort
  diskSelector:
    - hdd
  fromBackup: s3://klauvsteinen-longhorn-backups@eu-central-1/?backup=backup-cd62b8e088204625&volume=pvc-7241107b-5109-4d29-a36f-663c56de8a98
  frontend: "blockdev"
  size: "107374182400"
Flipping enabled in the applicationset will change Standby to false in this resource, activating the volume and making the PVC healthy.
For the PV, we have this, which binds to that volume and serves it to the PVC:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: wordpress-files
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-rwo-local-hdd
  persistentVolumeReclaimPolicy: Retain
  claimRef: # <- pre-bind to the PVC the operator will create
    namespace: wordpress
    name: wordpress-files
  csi:
    driver: driver.longhorn.io
    volumeHandle: wordpress-files # <- your Longhorn volume name
    fsType: ext4
Once I have used DR, I will need to delete and recreate these resources, probably based on new backup IDs, but in real life this is very often a manual process. It’s usually more urgent to switch to DR than to switch back.
For the external-dns annotations, there’s another potential gotcha. Once the DNS entries are live for prod, DR won’t manage them. If the prod cluster is truly dead, nothing in prod will remove the DNS records either, so I might need to fix this manually by removing some DNS records, at least until I find a better mechanism.
CloudNativePG
I mentioned that for my PostgreSQL databases, I use another mechanism. Here’s how I define my DR Keycloak database:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-db
  namespace: keycloak
  annotations:
    # Avoid first-sync dry-run errors in Argo CD before CRDs are present
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
    # (Optional) apply DB slightly before your app; adjust if you like
    argocd.argoproj.io/sync-wave: "1"
spec:
  # Single-instance (adjust to >1 if you later want HA)
  instances: 1
  # Pin a PG image version you want to run
  imageName: ghcr.io/cloudnative-pg/postgresql:16
  # Storage: adjust class/size to your environment
  storage:
    size: 10Gi
    storageClass: longhorn-db-local-ssd
  backup:
    barmanObjectStore:
      destinationPath: "s3://keycloak-backups/dr/"
      endpointURL: "http://minioprod.minio.svc.cluster.local:9000"
      s3Credentials:
        accessKeyId:
          name: keycloak-backup-secret
          key: AWS_ACCESS_KEY_ID
        secretAccessKey:
          name: keycloak-backup-secret
          key: AWS_SECRET_ACCESS_KEY
  # Set the postgres superuser password from a basic-auth secret
  enableSuperuserAccess: true
  superuserSecret:
    name: keycloak-superuser # type: kubernetes.io/basic-auth (username/password)
  # Bootstrap the application database and owner
  bootstrap:
    recovery:
      source: origin
  replica:
    enabled: true
    source: prod
  externalClusters:
    - name: origin
      barmanObjectStore:
        destinationPath: s3://keycloak-backups/
        serverName: prod-new/keycloak-db
        endpointURL: http://minioprod.minio.svc.cluster.local:9000/
        s3Credentials:
          accessKeyId:
            name: keycloak-backup-secret
            key: AWS_ACCESS_KEY_ID
          secretAccessKey:
            name: keycloak-backup-secret
            key: AWS_SECRET_ACCESS_KEY
        wal:
          maxParallel: 8
    - name: prod
      barmanObjectStore:
        destinationPath: s3://keycloak-backups/
        serverName: prod-new/keycloak-db
        endpointURL: https://minio.engen.priv.no/
        s3Credentials:
          accessKeyId:
            name: keycloak-backup-secret
            key: AWS_ACCESS_KEY_ID
          secretAccessKey:
            name: keycloak-backup-secret
            key: AWS_SECRET_ACCESS_KEY
        wal:
          maxParallel: 8
This bootstraps from the backup (synchronized to DR, in case prod is down), but then continuously pulls from the prod backup directly, to avoid fetching from a potentially stale DR minio replica. This keeps the database synchronized pretty much up to the point of the disaster that makes you activate DR.
To promote this to a master, you simply need to change enabled: true to enabled: false in the replica section, and this can be specified in the ArgoCD applicationset specification much as we did for the standby volumes.
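As a sketch, assuming a keycloak applicationset built the same way as the wordpress one (same enabled variable), the promotion could be expressed as an extra templatePatch branch; the target kind and the /spec/replica/enabled path come straight from the Cluster manifest above.

# Sketch: templatePatch branch for a hypothetical keycloak applicationset;
# when DR is activated (enabled flipped to "true"), the replica cluster is
# promoted by turning replica mode off.
templatePatch: |
  {{- if eq .enabled "true" }}
  spec:
    source:
      kustomize:
        patches:
          - target:
              group: postgresql.cnpg.io
              version: v1
              kind: Cluster
            patch: |-
              - op: replace
                path: /spec/replica/enabled
                value: false
  {{- end }}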
Summary
I don’t need to create DR for all applications; for example, it doesn’t make sense to run my Home Assistant in the cloud. There are also applications that are less important, which I just don’t bother running in DR, saving on the resources of my for-now-rented DR node. I have created an applicationset singlenodeapps which just contains a list of applications I need to run per node, a different set in PROD and DR. For now, there are none in DR and quite a few in my PROD cluster.
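For reference, a condensed sketch of what such a singlenodeapps applicationset could look like, reusing the clusters+list matrix pattern from earlier; the app names and paths are placeholders, not my actual list.

# Condensed sketch of singlenodeapps: one list of apps per environment
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: singlenodeapps
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  env: prod
          - list:
              elements:
                - appName: homeassistant            # placeholder name/path
                  appPath: application/homeassistant/prod
                  namespace: homeassistant
                # ...more PROD-only apps
    # no DR elements yet, so nothing is generated for the DR cluster
  template:
    metadata:
      name: "{{ .appName }}"
    spec:
      project: default
      source:
        repoURL: <prod-repo-url>
        targetRevision: main
        path: "{{ .appPath }}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{ .namespace }}"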
I don’t have a way to say «activate DR for all applications»; activation is a manual step of changing a flag in the repository for each application. The bonus is that if, for some reason, you get corruption in one of the workloads while the others are fine, you can opt to run only that workload in DR, leaving the rest on the PROD node.
I can’t easily do isolated DR tests in this setup. A live DR test involves potentially synchronizing data back into production afterwards. That can, of course, be done with a Longhorn restore from a DR volume backup, so it’s definitely doable.
If I want to do isolated DR tests, I could potentially avoid activating the external-dns entries based on some variable, so that the workload isn’t accessible from the outside. This might work for some workloads, but others might need better isolation.
I have done a live DR test of this blog, and it was pretty neat to see everything spin up and work without issues on my DR node.