Disclaimer: Quite a bit of this is outdated. After a week or so, I decided to redo it all – mainly because I wanted IPv6 supported inside, and I thought that the fact that my multus macvlans supported IPv6 was proof I was all good. Turns out that with multus and macvlans, you live quite a bit outside of the Kubernetes concepts, and even the built-in network security mechanisms don't apply to them. I decided to redo it with IPv6 and without multus – because at least in k3s, you simply can not enable IPv6 after having created the cluster. Do not make the same mistake as me, go for IPv6 from the start. IPv4 is legacy; it's high time we start treating IPv6 as a first class citizen.
After my docker networking deep dive, I decided to explore another container orchestration system, Kubernetes. Kubernetes is more feature-rich, has more built-in security and isolation mechanisms, and comes with more networking options out of the box, but has somewhat of a learning curve. For a nerd, that might be seen as an added bonus – there's tons to learn! In this post I focus mostly on my choices and the concepts, and I'll go more in depth into the different pieces later. To give a full overview of my setup, I'll skip a fair bit of the installation/configuration details, which I guess I'll correct somewhat in followup posts. If you want me to go into details on some of this, either be patient and I might get around to it, or leave a comment (which I'll need to approve. There's way too much comment spam going around, so I won't leave it unmoderated).
Kubernetes Implementation
There are a few different implementations of Kubernetes. The real deal is K8s, and from there it goes down to more docker-overlay-like options such as Minikube. Since my infrastructure is pretty limited, but I still wanted a «real deal» feeling to it, I chose K3s, which is fairly lightweight but still gives you a good experience, tailored for more limited production workloads. Installing it is fairly simple, with one installation script that just tends to work.
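The whole installation boils down to the one-liner from the k3s documentation (do have a look at what you pipe into sh before running it):

$ curl -sfL https://get.k3s.io | sh -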
Networking
The next thing to choose is how to manage networking. Kubernetes has the concept of the Container Network Interface (CNI), that is, plugins that manage the networking for you. K3s comes with Flannel as the default CNI, which is simple and might have done the job. But since I am on a learning trip, I decided to choose something a bit more advanced, so I went with Calico. It is more feature-rich, comes with better security features, and all sorts of fun stuff to play with. Eager to get going, I didn't dive too deep into this. There are also options like Cilium, plus likely several others.
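If you want to go the Calico route too, the rough recipe is to install k3s without Flannel and then apply the Calico manifests. A sketch – the Calico version, and thereby the exact manifest URLs, will have moved on by the time you read this:

$ curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--flannel-backend=none --disable-network-policy" sh -
$ kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/tigera-operator.yaml
$ kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/custom-resources.yaml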
I wanted to continue with my strategy of one outside-facing network where my DMZ-facing applications can be exposed, and one inside network which is Kubernetes only. Calico can't deliver that natively, and the solution to that is usually Multus, a CNI that you use as an addon to your main CNI. Multus enables you to have multiple network interfaces in a Pod. My internal jury is still out on to what extent this is a good idea, but so far I run a setup pretty similar to the one in my docker networking series, with an incoming DMZ (a multus network) and an outgoing DMZ (another multus network) where containers that should be allowed to speak to the world can attach. I chose to stick with macvlan on top of the 802.1q interfaces, as that gives a good isolation between the host and the kubernetes network.
hassio% ifconfig kdmz01
kdmz01: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::1e69:7aff:fe64:12e1  prefixlen 64  scopeid 0x20<link>
        inet6 2a01:799:393:f10b:1e69:7aff:fe64:12e1  prefixlen 64  scopeid 0x0<global>
        ether 1c:69:7a:64:12:e1  txqueuelen 1000  (Ethernet)
        RX packets 1304998  bytes 223638900 (223.6 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1200547  bytes 1075326210 (1.0 GB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
hassio% ifconfig kdmzout
kdmzout: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::1e69:7aff:fe64:12e1  prefixlen 64  scopeid 0x20<link>
        inet6 2a01:799:393:f10c:1e69:7aff:fe64:12e1  prefixlen 64  scopeid 0x0<global>
        ether 1c:69:7a:64:12:e1  txqueuelen 1000  (Ethernet)
        RX packets 15518299  bytes 38629815999 (38.6 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11650235  bytes 18813143467 (18.8 GB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
Then I need to define the networks in Kubernetes, which you do through a NetworkAttachmentDefinition. My incoming DMZ is defined with this configuration, which I apply the standard way with kubectl apply -f dmz.yaml:
hassio% cat dmz.yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: dmz-net
  namespace: traefik
spec:
  config: '{
    "cniVersion": "0.4.0",
    "type": "macvlan",
    "master": "kdmz01",
    "mode": "bridge",
    "ipam": {
      "type": "host-local",
      "ranges": [[{ "subnet": "192.168.29.0/24", "rangeStart": "192.168.29.10", "rangeEnd": "192.168.29.250", "gateway": "192.168.29.1" }]],
      "default-gateway": [ "192.168.29.1" ]
    }
  }'
I have another similar one that I call dmz-out, with the range 192.168.31.0/24.
A Pod will by default end up with a main interface managed by Calico; in addition, we can attach this network to a pod in its configuration, which Multus will handle. We'll cover that later.
Kubernetes will by default give you a fairly large and fairly flat network, managed by your CNI. I have created a pool, 10.151.0.0/16, that Calico will manage. For the internal network I saw no reason so far to go with static IP addresses, but rather let Calico hand them out on its own, as Kubernetes has built-in DNS so you look things up by name instead of using the addresses directly. The traditional security model of Kubernetes is not built so much around the kind of network separation I played with in docker; you rather attach network policies to a group of resources that is dynamically handled by your CNI. One of the key concepts for structuring this is, however, namespaces.
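For reference, such a pool can be declared to Calico with an IPPool resource. A sketch of what mine looks like – the encapsulation mode and the natOutgoing setting are assumptions that depend on your environment:

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.151.0.0/16
  vxlanMode: Always       # assumption: vxlan encapsulation; ipipMode is the alternative
  natOutgoing: true       # SNAT pod traffic that leaves the pool
  nodeSelector: all()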
Namespaces
A namespace is a grouping of resources. You can loosely compare it to a docker stack, in that you can install and name things in it without much regard to what exists in other namespaces. There are by default no security implications in separating things into namespaces, but you can use a namespace as a grouping unit in a security rule, so it still helps. Using my previous example of this blog, it is now installed in the namespace wordpress.
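Creating one is about as simple as Kubernetes gets – kubectl create namespace wordpress does it, or declaratively:

apiVersion: v1
kind: Namespace
metadata:
  name: wordpress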
I decided to go with a different reverse proxy, Traefik, mainly because I wanted to try something different than nginx, and because it’s feature-rich and flexible. I run Traefik in its own namespace, traefik, and with a connection both to the internal network and to the DMZ, much like I did in my docker setup.
Pods
Pods are where you run your containers. Typically there will be one container per pod, and you start/stop the pod, not the containers. There are exceptions where you'd have several containers in a pod, but those I'll cover later – or in the next post; I have a feeling there will be several posts…
In my wordpress namespace, I have two pods, one for the wordpress application container and one for my mariadb container. But each of these need not be a single pod; there can be multiple, for redundancy and fault tolerance, so you'll generally specify them in a deployment, which manages a set of similar pods – and handles distribution over several nodes, restart strategy, scaling, and a ton of things I haven't gotten to play with yet, partly because my Kubernetes cluster is a tiny Intel NUC sitting on the top of a drawer trying to look innocent because of the WAF. So far, the deployments I have specified myself all consist of a single Pod, where one is terminated before the next one starts. All of these will of course need some permanent storage, which in Kubernetes is specified in the form of Persistent Volume Claims. That's a topic that likely warrants more of a deep dive.
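Just to give a flavour of it here: a claim like the wordpress-html one referenced below might look something like this (the size is a placeholder, and whether you need a storageClassName depends on your storage setup):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wordpress-html
  namespace: wordpress
spec:
  accessModes:
    - ReadWriteOnce       # a single node mounts it read-write
  resources:
    requests:
      storage: 5Gi        # placeholder size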
So, my WordPress Deployment looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  labels:
    io.kompose.service: wordpress-vegard
  name: wordpress-app
  namespace: wordpress
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: wordpress-vegard
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        kompose.cmd: kompose convert
        kompose.version: 1.34.0 (cbf2835db)
        k8s.v1.cni.cncf.io/networks: '[
          { "name": "dmz-out", "namespace": "infrastructure", "ips": [ "192.168.31.13/24" ], "mac": "02:42:ac:11:00:04" }
          ]'
        mutating-webhook.network: "dmz-out"
      labels:
        io.kompose.service: wordpress-vegard
        app: wordpress
        webhook-network: "true"
    spec:
      containers:
        - env:
            - name: WORDPRESS_DB_NAME
              valueFrom:
                secretKeyRef:
                  name: database
                  key: MYSQL_DATABASE
            - name: WORDPRESS_DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: wordpress-vegard-cm1
                  key: db-host
            - name: WORDPRESS_DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: password
                  key: MYSQL_PASSWORD
            - name: WORDPRESS_DB_USER
              valueFrom:
                secretKeyRef:
                  name: user
                  key: WORDPRESS_DB_USER
          image: wordpress:latest
          name: wordpress-app
          volumeMounts:
            - mountPath: /var/www/html
              name: wordpress-html
            - mountPath: /usr/local/etc/php/conf.d/upload.ini
              name: wordpress-vegard-cm1
              subPath: upload.ini
      restartPolicy: Always
      volumes:
        - name: wordpress-html
          persistentVolumeClaim:
            claimName: wordpress-html
        - configMap:
            items:
              - key: upload.ini
                path: upload.ini
            name: wordpress-vegard-cm1
          name: wordpress-vegard-cm1
First, I specify some metadata, how many replicas I want, and some labels and annotations that you can use to do pretty cool things. The k8s.v1.cni.cncf.io/networks annotation is the one that tells Multus to connect another network to the Pod. To be able to use it in my external firewall rules, should I wish to, I hardcode the IPv4 address, while for IPv6 I let it use autoconfiguration. If I hardcode the MAC address, the IPv6 address will also be static, provided my network doesn't change. I could have opted for hard-coded IPv6 addresses and turned off autoconfiguration, but if my ISP should decide to give me another IPv6 prefix, at least all of my Kubernetes setup will reconfigure all by itself, leaving only external firewall rules and external DNS for me to worry about…. (I have some ideas on how to solve that with automation too, but that's for another day).
Then, I specify the container. There are some environment variables the docker image expects, then there's the docker image itself, and then there are some persistent volumes – which is a topic worthy of a blog post in itself. There are also two concepts that are good to know about:
- configMaps, which are a place to store application configuration, either in the form of key/value pairs or in the form of full configuration files that you can mount into the runtime environment of the container. I have used both here.
- Secrets (referenced through secretKeyRef above), which are much the same but geared toward secret values (passwords, encryption keys, etc). There's much more to those than what I have covered so far: you can encrypt them, or even store them in an external vault of your choosing. Courtesy of Keeper, I actually have Keeper Secrets Manager, which I tested in my backup solution and still run. That will allow me to test the Kubernetes External Secrets Operator for Keeper, and knowing myself, I'll get tempted to do so in the not-too-distant future.
I'll not go into the details of how you create and specify configmaps and secrets; there are some options there.
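For illustration, one imperative way of doing it with kubectl, matching the names used in the manifests here (the values are of course placeholders):

$ kubectl -n wordpress create secret generic database --from-literal=MYSQL_DATABASE=wordpress
$ kubectl -n wordpress create secret generic password --from-literal=MYSQL_PASSWORD=changeme
$ kubectl -n wordpress create secret generic user --from-literal=WORDPRESS_DB_USER=wordpress
$ kubectl -n wordpress create configmap wordpress-vegard-cm1 --from-literal=db-host=db.wordpress.svc.cluster.local --from-file=upload.ini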
For completeness, here's my Mariadb deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    kompose.cmd: kompose convert
    kompose.version: 1.34.0 (cbf2835db)
  labels:
    io.kompose.service: mariadb-vegard
    app: mariadb
  name: wordpress-db
  namespace: wordpress
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: mariadb-vegard
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        kompose.cmd: kompose convert
        kompose.version: 1.34.0 (cbf2835db)
      labels:
        io.kompose.service: mariadb-vegard
        app: mariadb
    spec:
      containers:
        - env:
            - name: MYSQL_DATABASE
              valueFrom:
                secretKeyRef:
                  name: database
                  key: MYSQL_DATABASE
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: password
                  key: MYSQL_PASSWORD
          image: mariadb
          name: wordpress-db
          volumeMounts:
            - mountPath: /var/lib/mysql
              name: wordpress-db
      restartPolicy: Always
      volumes:
        - name: wordpress-db
          persistentVolumeClaim:
            claimName: wordpress-db
This one is a bit simpler, as I don't expose it to multus; it shouldn't have a need to access the external world. However, by default it will still be able to, unless I add a network policy that blocks it – that I'll probably cover in a dedicated security-related blog post….
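A minimal sketch of what such a policy could look like – untested in this exact form, but the idea is to deny all egress from the mariadb pods; replies to allowed incoming connections are still permitted, since network policies are stateful:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mariadb-deny-egress
  namespace: wordpress
spec:
  podSelector:
    matchLabels:
      app: mariadb
  policyTypes:
    - Egress
  egress: []              # empty list: no egress allowed at all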
Services
You'll usually not configure the different parts of your systems to communicate directly with other pods; there's another abstraction layer called services. A service will usually live on a more static IP address, at least one more long-lived than the containers that die, restart and get new IP addresses all the time.
I specify two services here, my wordpress service and my mariadb service.
First, my wordpress service. Note the selector towards the bottom: it selects based on pod labels, and I had app: wordpress specified in my wordpress deployment specification, so it'll automatically find the Pods to connect to based on that label.
apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    app.kubernetes.io/name: wordpress
    app.kubernetes.io/version: 11.5.1
    helm.sh/chart: grafana-8.9.0
  name: wordpress
  namespace: wordpress
spec:
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: wordpress-app
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: wordpress
  sessionAffinity: None
  type: ClusterIP
My mariadb service is pretty similar:
apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/version: 11.5.1
    helm.sh/chart: grafana-8.9.0
  name: db
  namespace: wordpress
spec:
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: wordpress-db
      port: 3306
      protocol: TCP
      targetPort: 3306
  selector:
    app: mariadb
  sessionAffinity: None
  type: ClusterIP
Now, here's another piece of magic. When I deploy this, I can, from my whole Kubernetes cluster, get to my wordpress instance's web port by connecting to the hostname wordpress.wordpress.svc.cluster.local on port 80. WordPress will use db.wordpress.svc.cluster.local:3306 for its DB connection (db being the name of the service above). The second wordpress in these names is the namespace. You might of course not want things exposed clusterwide, but that you'll handle with network policies.
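An easy way to convince yourself of this is to resolve the names from a throwaway pod – a one-off along these lines (busybox's nslookup is a bit temperamental, but usually does the job):

$ kubectl run -n wordpress dns-test --rm -it --restart=Never --image=busybox -- nslookup db.wordpress.svc.cluster.local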
There are some bits and pieces I haven't gone into the details of, like how to specify namespaces, configmaps, secrets and persistent volumes.
To expose it through traefik, I have a final piece of magic, an IngressRoute, that I deploy in the traefik namespace:
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: wordpress-ingressroute
  namespace: traefik
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`vegard.blog.engen.priv.no`)
      kind: Rule
      middlewares:
        - name: redirect-to-https
      services:
        - name: wordpress
          namespace: wordpress
          port: 80
  tls:
    certResolver: letsencrypt
Traefik
I have installed traefik in its own namespace, traefik. I opted to use helm to install it, which will make it easier to update:
$ helm repo add traefik https://traefik.github.io/charts
$ helm repo update
$ helm install traefik traefik/traefik --namespace traefik
Had I not been such a beginner, I'd have fed it a -f values.yaml argument to have it automatically configured the way I wanted, but I ended up managing the Deployment manually:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: traefik
  namespace: traefik
  labels:
    app: traefik
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: traefik
  template:
    metadata:
      labels:
        app: traefik
      annotations:
        k8s.v1.cni.cncf.io/networks: '[
          { "name": "dmz-net", "namespace": "traefik", "ips": ["192.168.29.15/24"], "mac": "02:42:ac:11:00:03" }
          ]'
    spec:
      initContainers:
        - name: setup-routes
          image: busybox
          command:
            - sh
            - -c
            - |
              ip route del default
              ip route add default via 192.168.29.1 dev net1
              ip route add 10.0.0.0/8 via 169.254.1.1 dev eth0
              ipv6_prefix=`ip -6 address list scope global dev net1 | grep inet6 | awk '{print $2}' | cut -f1-4 -d':'`
              for i in `seq 1 100`; do ip -6 address add $ipv6_prefix:1:0:0:$i dev net1; done
          securityContext:
            capabilities:
              add: ["NET_ADMIN"]
      containers:
        - name: traefik
          image: traefik:v3.3.2
          args:
            - "--api.insecure=true"
            - "--accesslog=true"
            - "--providers.kubernetescrd"
            - "--providers.kubernetescrd.allowCrossNamespace=true"
            - "--providers.kubernetesingress"
            - "--entrypoints.web.address=:80"
            - "--entrypoints.websecure.address=:443"
            - "--certificatesresolvers.letsencrypt.acme.email=your-email@example.com"
            - "--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json"
            - "--certificatesresolvers.letsencrypt.acme.dnschallenge.provider=linode"
            - "--certificatesresolvers.letsencrypt.acme.dnschallenge.delaybeforecheck=10"
            - "--certificatesresolvers.letsencrypt.acme.dnschallenge.resolvers=8.8.8.8:53,1.1.1.1:53"
          env:
            - name: LINODE_TOKEN
              valueFrom:
                secretKeyRef:
                  name: linode-dns-token
                  key: LINODE_TOKEN
          ports:
            - name: web
              containerPort: 80
              protocol: TCP
            - name: websecure
              containerPort: 443
              protocol: TCP
          volumeMounts:
            - mountPath: "/data"
              name: traefik-volume
      volumes:
        - name: traefik-volume
          persistentVolumeClaim:
            claimName: traefik-volume
As with wordpress, I expose it on a macvlan. Here, I have also shown a piece of magic I left out in the wordpress deployment. I use it there too, but I have some secret sauce that adds it (did I tell you there's material for more blog posts here too?).
The «init-container» section is a special container that runs before the traefik container starts. I have given it more rights, so that it is able to change things like networking configuration, which even root isn't allowed to do inside a container. In my docker networks I solved the same thing through docker events, but this is a more integrated and elegant way to solve the same problem. The init-container is shut down before the traefik container starts, but the networking remains set up, so that traefik's default route goes out through the macvlan, while 10.0.0.0/8 (all the kubernetes-internal networking ends up there) is routed to where the default route was previously going.
For good measure I add 100 sequential IPv6 addresses to use in AAAA records externally, but that's so far only eye candy – they all respond to all hostnames anyhow.
The traefik container has the web and websecure entrypoints specified, and I connected my wordpress IngressRoute to the websecure entrypoint. As has been the norm for years, I want my http calls for vegard.blog.engen.priv.no to redirect to https. For that, I'll use a Traefik Middleware:
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: redirect-to-https
  namespace: traefik
spec:
  redirectScheme:
    scheme: https
    permanent: true
I can use this in a second ingressroute for my wordpress:
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: wordpress-ingressroute-http
  namespace: traefik
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`vegard.blog.engen.priv.no`)
      kind: Rule
      middlewares:
        - name: redirect-to-https
      services:
        - name: noop
          kind: TraefikService
There’s likely more elegant ways to do this, but this works.
There's also setup there for DNS challenges through my DNS provider, which is Linode. That functionality comes out of the box; I just need to provide it a token in a secret. The certificates will of course have to be persisted, in the volume mounted at /data, in the file /data/acme.json.
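The linode-dns-token secret referenced in the Deployment above can be created like this (with a real token from Linode, obviously):

$ kubectl -n traefik create secret generic linode-dns-token --from-literal=LINODE_TOKEN=<your-linode-api-token>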
The external part of this is of course DNS, port forwarding on your external router for IPv4, and opening the firewall for the IPv6 addresses, which are already in a public range.
More services
My other web services will run pretty similarly. I configure the functionality in its own namespace, and then I create IngressRoutes for traefik to handle the rest. As long as you configure your services correctly and your ingress routes point to the services correctly, magic just seems to happen, and a new site appears, ready to take traffic from the moment you point the DNS names at it.
Migration from and coexistence with my docker setup
Now, since I actually already have services running in docker, and since I only have one external IPv4 address, I have a problem. Where should ports 80 and 443 go?
For the transition period, I let them keep going to the docker nginx and handle the IPv4 traffic there. In nginx, I just swapped out the backend services and sent the traffic through traefik instead of into the previous docker stack. Elegant? Maybe not. It worked better with IPv6: there, migrated hostnames could point directly to my Kubernetes traefik setup, while non-migrated workloads could keep pointing to the docker nginx.
Conclusion? Summary?
This was fun. Before getting around to writing this blog post, I had played with a ton of Kubernetes features, some not mentioned yet. I have also migrated the rest of the services from docker so I could point the firewall port forward to traefik instead of docker.
Is it any better than before? Not all that much. I am essentially doing the same. Some things are more elegant, and there are more features in Kubernetes that will allow me to improve parts of this, perhaps by adding redundancy here and there.
I really like the persistent volumes. I use ZFS as storage, and there are all these neat features that make Kubernetes provision the volumes automagically once you configure it in Kubernetes (more on that later).
I also like the Traefik setup and the ingressroutes that just makes things appear. I am pretty sure it’d even be possible to create automatic hooks here and there to have DNS records appear in my external DNS provider, via the API.
It’s more complex than my Docker setup, but I can easily extend it more and make it more resilient. And I’m a geek. I love to tinker. That’s a reason all by itself.