Running Longhorn with multiple replicas over a long distance isn’t exactly recommended. Longhorn works best on a local network, where the network is stable and you don’t lose connectivity between the nodes all that often.
I ignored this for a while, because I wanted to test a dual-node Longhorn setup, but in the end I paid for it with the loss of volumes. In short: several of my volumes had both replicas declared faulty, and Longhorn just didn’t know what to do. There might be ways to recover that are a bit hackish, the data is still there after all, but I decided it was a good opportunity to test recovery by restoring. My data isn’t that dynamic.
I naively thought at first that I could just press «restore» on a volume, and everything would automagically happen. Imho, that would be a usability improvement. Instead, you need to create a new volume which is seeded from the backup. You use the same strategy to create replicated volumes for disaster recovery, except there you’d want continuous syncing until you activate the volume; here you just want to create it from the latest backup. You can actually do this from the user interface, which is probably fine, but I’ll give the YAML anyhow.
To create a restored volume:
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: nextcloud-files # name of the restored volume
  namespace: longhorn-system
spec:
  frontend: blockdev
  accessMode: rwo
  fromBackup: "s3://longhorn-backups@minio/?backup=backup-f801bbc8f89e44cb&volume=pvc-6b7294af-5360-40ff-8c27-f30234d5cf1c" # URL of ANY backup of the source volume
  numberOfReplicas: 2
  dataEngine: v1
  diskSelector:
    - hdd
This block does exactly what it looks like: it creates a new volume from that backup. The URL can be found in the Longhorn user interface when looking at the backups. The backup parameter differs for each backup run, so it’s not really possible to hardcode it before you need it. The bonus is that you can decide which backup you want to restore from, if you keep multiple generations – and you’d likely want to, as it’s cheap to do so.
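As an aside, if you’d rather stay on the command line, the same URL also shows up on the Backup objects in the longhorn-system namespace (kubectl -n longhorn-system get backups.longhorn.io -o yaml). Trimmed down, a Backup object looks roughly like this – the exact fields can differ between Longhorn versions, but the url in the status is the string you want:
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  name: backup-f801bbc8f89e44cb
  namespace: longhorn-system
status:
  state: Completed
  url: "s3://longhorn-backups@minio/?backup=backup-f801bbc8f89e44cb&volume=pvc-6b7294af-5360-40ff-8c27-f30234d5cf1c"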
Then, you’ll need to hand-craft a PV that uses it:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextcloud-files
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 100Gi
  claimRef:
    name: nextcloud-files
    namespace: nextcloud
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: nextcloud-files
The claimRef is what binds it to the nextcloud-files PVC. You’ll need to recreate that PVC, but as long as this PV exists, it can be the exact same definition as before – the PVC will find it and not provision a new volume. The capacity also needs to match on all levels. At the very least, you cannot specify a smaller capacity here than the size of the volume.
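For reference, a PVC along these lines will bind to it – the storageClassName is just an assumption here, keep whatever the original definition used, and make sure the requested storage matches the 100Gi above:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nextcloud-files
  namespace: nextcloud
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn   # assumed; use whatever the original PVC specified
  resources:
    requests:
      storage: 100Gi           # must match the PV capacity and the volume size
  volumeName: nextcloud-files  # optional, but makes the intended binding explicit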
And that’s really it. You don’t really need to check this into your ArgoCD or anything; if you have checked in the PVC, that is enough.