May 17, 2024 | 30 min. read

Alright, this is the big project that I've been building up to for a while now. To reiterate previous posts, I have four Raspberry Pis in my homelab that run everything: Jellyfin, qBittorrent, Calibre, PiHole, BIND, etc. I've acquired each of the Pis over the span of a few years and I haven't put in the effort to set up any infrastructure as code, so they've all experienced quite a bit of configuration drift. They each have slightly different versions of my dotfiles, they differ in which utilities and bash scripts are installed, the versions of Raspbian they run are quite varied, etc. It's not too bad to manage four fairly unique hosts and remember which is which, but each new host or VM I add will only compound that complexity. Essentially, I need to start treating the machines as cattle, not pets.

Additionally, I want to be able to treat all of the machines as identical nodes of a single cluster of compute, rather than configure each machine and allocate services individually. This is primarily for convenience, since I would like to be able to just throw YAML at a container orchestrator until it spits out the services I want. Secondarily, it would be nice to have some resiliency if one of the Pis' SD cards kicks the bucket or something. Instead of an outage, whatever services were running on the Pi would (ideally) just be rescheduled onto the other nodes of the cluster.

Game Plan

As a result, there are two big projects that I've been working on for a while now. The first is installing NixOS on every machine that hosts a service in the lab (so not the NAS or router, but all the Pis). I've been using NixOS on my desktop and several laptops for about two years now, and it's been awesome. If you haven't heard of NixOS, it's a really cool project. (Disclaimer: I maintain a few packages in nixpkgs and I've contributed some documentation to the project, so I do have some skin in the game beyond being a user. I swear I'm not trying to shill!) It's a Linux distro built on top of Nix, a purely functional package manager that aims to make software 100% reproducible (package derivations are written in a domain-specific functional programming language also called Nix, which can get a little confusing). It accomplishes this by viewing every build artifact as a pure function of "source code and build tools in, software out" - binaries, libraries, documentation, you name it. With this capability to describe builds functionally, NixOS allows totally declarative management of your operating system. No more fiddling with config files in /etc and losing them, or misremembering if it's foo or foo-dev or foo-git from the AUR that has that one weird thing you need. Every single package and service you want on your machine (and every option you want to specify for those packages/services) is declared in a single configuration.nix file. Obviously, this is very handy for wrangling a fleet of servers like the homelab.
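To give a flavor of what that looks like, here's a minimal, hypothetical configuration.nix - the hostname and package choices are made up for illustration:

```nix
# /etc/nixos/configuration.nix - a minimal, hypothetical example
{ config, pkgs, ... }:

{
  networking.hostName = "example-pi";

  # Every package on the system is declared here...
  environment.systemPackages = with pkgs; [ git vim htop ];

  # ...and so are the services, along with all of their options.
  services.openssh.enable = true;

  system.stateVersion = "23.11";
}
```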

Managing configs is just half the battle, though - managing compute is a whole other beast. For that task, I've been eyeing self-hosting Kubernetes for a while now. I've worked with managed Kubernetes instances quite a bit in the past for work (mainly EKS), but self-hosting is a different animal entirely. With EKS, you can basically click a few buttons in the AWS console and all of the hard stuff is done for you; you can start writing (read: copy/pasting) Deployments in a few minutes. When self-hosting, there's so much more work - you need to set up static addresses for each node, stand up a Certificate Authority, generate TLS certs, install a CNI plugin, etc. Just looking at Kubernetes the Hard Way is enough to make your head spin.

Nihil Sub Nix Novum

As you might have seen in a previous blog post, I've been gradually installing Nix on all of my machines for a while now, and I have the system down pretty well.

My NixOS configs live in my dotfiles, where I have a single Nix Flake that contains all of my Nixified machines' configs. I've tried pretty hard to template things such that all common functionality (e.g. every machine should use the same fonts, every machine should have Flakes enabled, all the RasPis should be more or less identical, etc.) lives in a common folder (src) that can be pulled in with the imports attribute in each machine's respective NixOS module (example). This isn't totally flawless, though, as some things are nearly identical across machines but just different enough to make a mess of things. For example, NixOS uses the fonts.packages attribute to determine which fonts are installed, but nix-darwin uses fonts.fonts. For the most part I've just let things stay duplicated, but here and there I do explicit pkgs.stdenv.isLinux checks to help determine which attributes need to be present in a NixOS module.
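A hedged sketch of that pattern - the paths and package names here are illustrative, not my exact layout:

```nix
# One machine's module, pulling shared config in from the common folder.
{ config, lib, pkgs, ... }:

{
  imports = [ ../../src/common.nix ];

  # Same fonts everywhere, but the attribute name differs by platform:
  # NixOS wants fonts.packages, while nix-darwin wants fonts.fonts.
  fonts =
    if pkgs.stdenv.isLinux
    then { packages = [ pkgs.jetbrains-mono ]; }
    else { fonts = [ pkgs.jetbrains-mono ]; };
}
```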

As with any declarative configuration tool, the one really tricky part is secrets management. Part of the fun of having configuration as code is checking your configs into git and linking the repo on /r/unixporn, but this gets a lot trickier when you want your declarative configuration to contain stuff like user passwords or private SSH keys. There are a few different tools for this in NixOS (this blog post gives a great overview of what's available), but unfortunately none of them are first-party. For a while, I went with just .gitignore'ing a common/secrets.nix file that contains all my sensitive data and doing an import ../secrets.nix to fetch that data in any of the hosts' configs that needed it. This is pretty suboptimal for two reasons:

1. When the derivation is built, the secret data is copied into /nix/store in plaintext and is visible to anyone with permission to view the derivation.

2. I'm far too lazy to manually copy that secrets.nix file around to all of my machines. If I want to remotely rebuild any machine that needs secret data, I have to do it from the machine that has the secrets.nix file.

I ended up settling on agenix to properly manage my secrets, and it's been a pretty good experience so far. It's built on top of a general-purpose encryption tool called age, and uses SSH keys to asymmetrically encrypt/decrypt any secrets you want to add to your NixOS configs. Since secrets are encrypted at rest, you can check them into public git repos with a clear conscience. The six-step tutorial in the GitHub repo is literally all you need to get up and running.
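Wiring a secret into a NixOS config is only a couple of lines. A sketch, assuming the secret was already encrypted with the agenix CLI (the secret name here is just an example):

```nix
{
  # secrets/k3s-token.age was created with `agenix -e k3s-token.age`,
  # encrypted against the public keys listed in secrets/secrets.nix.
  age.secrets.k3s-token.file = ../secrets/k3s-token.age;

  # At activation time, agenix decrypts it with the host's SSH key and
  # exposes the plaintext at config.age.secrets.k3s-token.path.
}
```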

That being said (and I'm not 100% on this, please correct me if I'm wrong), I think there's a chicken-and-egg problem that would prevent the kind of zero-click installs that I've been doing so far. With the older secrets.nix method of managing secrets, I could build an image on any machine with the secrets.nix file, burn it onto the installation media, and boot from it on the machine I want to install NixOS on. After that, no config is needed; everything that I need on the machine was included in the install image. With agenix, if you want to share a secret with a machine or user, you have to explicitly state in the config which SSH public keys can access each secret. That's great and all for granular access controls, but what about the scenario I've described above, where I'm writing the configs for a machine that doesn't exist yet (and therefore doesn't have any SSH public keys)?

I guess the simplest solution here is abandoning the "zero-click" part of things and just adding a few manual steps to the install process. The new flow would basically be something like this (with the key commands sketched after the list):

1. Write the new machine's configs and create the install image as usual
2. Boot the new machine from the image and SSH into it
3. Grab the newly generated host keys
4. Add the host's public key to the permitted public keys for the agenix secrets
5. Re-encrypt the secrets and add them to the new machine's configs
6. Rebuild the new machine with the updated config and encrypted secrets
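In shell terms, the middle steps would look roughly like this (the hostname is a placeholder, and I haven't battle-tested this exact flow):

```bash
# Grab the new machine's freshly generated host key...
ssh-keyscan -t ed25519 newbox.lab.janissary.xyz

# ...add it to the publicKeys lists in secrets/secrets.nix, then
# re-encrypt every secret against the updated key list:
agenix --rekey

# Finally, rebuild the new machine remotely with the updated config:
nixos-rebuild switch --flake .#newbox --target-host root@newbox.lab.janissary.xyz
```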

That makes me a little sad, though - I really enjoyed being able to stick an SD card into a machine, power it on, and immediately SSH into a fully-configured server. Regardless, it's still a huge step up from doing all of my configuration by hand on multiple machines and having zero record of what's present on any given machine.

K3s, Please

On the declarative compute side, things are also going pretty swimmingly. While researching other homelabbers' self-hosted Kubernetes clusters, the suggestion I got again and again was to check out a project called K3s. Turns out, it fits my use case pretty much perfectly.

It's a Kubernetes "distribution" optimized for edge and IoT devices (like Raspberry Pis!) that does all of the heavy lifting in terms of installation and setup. In fact, the installation process (for both control plane and worker nodes) is a simple curl | bash command. (I know a lot of people absolutely hate that pattern for installing software, but I've gotta be honest: I don't find the arguments super compelling. I've seen a lot of people say that it's a security nightmare, but I fail to see how it differs from running any other arbitrary binary on your system. I mean, downloading a package from the AUR (for example) is basically doing the same thing, in that you're sourcing a random binary/bash script that you're really trusting to not act maliciously. If we've learned anything from the xz debacle, it should be that even "properly" packaged software can have pretty gnarly side effects.) K3s also includes some pretty handy stuff out-of-the-box, like Traefik for proxying services and automatic TLS cert provisioning for HTTPS.
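For reference, the documented one-liners look like this (the server address and token are placeholders):

```bash
# On the server (control plane) node:
curl -sfL https://get.k3s.io | sh -

# On each agent node, pointing at the server and its join token
# (found at /var/lib/rancher/k3s/server/node-token on the server):
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.1.42:6443 \
    K3S_TOKEN=<node-token> sh -
```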

As part of this project, I also decided to actually learn Kubernetes. Like I mentioned earlier, I have a good amount of experience wrangling managed Kubernetes instances at work, but very little of that was greenfield infrastructure development. Mostly, it was following playbooks that others had written or updating pre-existing manifests. I had bounced off the official docs a few times in the past, so this time I read through Kubernetes Up and Running, which was highly recommended and definitely more approachable. It was a pretty quick read, and (not to brag) didn't actually have too much stuff I didn't already know about. What I did find surprising, however, is how little a vanilla self-hosted Kubernetes cluster actually implements. If you want to use Ingresses, you have to install a separate Ingress Controller. LoadBalancer-type Services also don't do anything out of the box, unless you're running a managed K8s cluster from a cloud provider (or, like k3s, you have something like ServiceLB). Ditto with networking - you're practically required to install a Container Network Interface (CNI) plugin like Calico or Flannel to be able to network between pods. I'm sure there are good reasons to keep this functionality out of the K8s project itself (I don't even pretend to be enough of a distributed systems wizard to understand its inner workings), but it sure makes learning it harder.

Regardless, having the servers run NixOS made installation even easier. It's literally just adding the following to each machine's configuration.nix:

```nix
services.k3s = {
  enable = true;
  role = "agent";
  serverAddr = "https://192.168.1.42:6443";
  tokenFile = config.age.secrets.k3s-token.path;
};
```
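The server node's config is just as terse. A sketch, assuming the same agenix-managed token (treat this as approximate rather than a copy of my exact config):

```nix
# On the lone server (control plane) node:
services.k3s = {
  enable = true;
  role = "server";
  tokenFile = config.age.secrets.k3s-token.path;
};
```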

I didn't want to go with a super-fancy High Availability setup, so I just made heracles (a Raspberry Pi Model 4B) the server node, with ixion (another 4B) and athena (a 3B) as the agent nodes. (If you've noticed that the last Pi, gorgon, isn't part of the cluster: it's only a Model 2B, so it's way too wimpy to run most of the stuff in the lab.)

```
❯ k get nodes
NAME       STATUS   ROLES                       AGE    VERSION
athena     Ready    <none>                      43d    v1.28.6+k3s2
heracles   Ready    control-plane,etcd,master   138d   v1.29.3+k3s1
ixion      Ready    <none>                      116d   v1.27.6+k3s1
```

Next, I just had to migrate all of my disparate docker-compose.yml files that used to run my services into Kubernetes manifests. This wasn't too hard - most Deployments kind of read like a Docker Compose file but with a little more info - but it did take a lot of copying and pasting. Part of me wanted to utilize Helm to template things and make them more DRY, but I figured I was already reaching the limit of justifiable complexity by running K8s in the first place.
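To illustrate how directly the two map, here's a hypothetical minimal service (not one of mine) ported from Compose to a Deployment:

```yaml
# The docker-compose.yml version:
#
#   services:
#     whoami:
#       image: traefik/whoami:v1.10
#       ports: ["8080:80"]
#
# ...and the same thing as a Deployment, plus the labels that a
# Service's selector will use to find the pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whoami
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      containers:
        - name: whoami
          image: traefik/whoami:v1.10
          ports:
            - containerPort: 80
```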

The only tricky part of porting my existing setup to k3s was getting my funky Traefik config to work. You can read more about it in this post, but I have a local BIND instance that is the authoritative DNS server for all of my services' and machines' hostnames (anything in *.lab.janissary.xyz). I could never quite get BIND to cooperate with DNS challenges for ACME, so I had to point Traefik at 1.1.1.1 specifically for ACME challenges. Since that DNS server lives outside the homelab, it only sees the records delegated to DigitalOcean (which is *.janissary.xyz). I then had to create a wildcard record in DigitalOcean for *.lab.janissary.xyz, so that Lego (the ACME client Traefik uses) can properly create the TXT records that ACME uses to verify ownership of domains. This is kind of a huge hack and maybe a misuse of DNS, but it manages to work smoothly enough. Anytime I need a new TLS cert, Traefik is able to generate one as expected, and I haven't experienced any issues with it yet. (I have no idea how TLS certificates work beyond "needed for HTTPS encryption", but isn't there some mechanism in the cert that says "this certificate is only valid for this hostname at this IP address"? If so, I'm surprised that hasn't broken anything: I'm generating TLS certs as if they're certifying connections to something in DigitalOcean, but they're actually being used to certify connections to machines in the homelab, which obviously have totally different IP addresses. As far as I can tell, certs are bound to hostnames via Subject Alternative Names rather than IP addresses, which would explain why this works, but I should probably get around to actually learning how TLS works.) Porting this setup to k3s took some digging through the documentation, but I managed to get it working by writing this small HelmChartConfig:

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    additionalArguments:
      - --api.insecure
      - --accesslog
      - --providers.kubernetescrd
      - --certificatesresolvers.myresolver.acme.dnschallenge.provider=digitalocean
      - --certificatesresolvers.myresolver.acme.dnschallenge.resolvers=1.1.1.1:53
      - --certificatesresolvers.myresolver.acme.email=<my email>
      - --certificatesresolvers.myresolver.acme.storage=/data/acme.json
    env:
      - name: DO_AUTH_TOKEN
        valueFrom:
          secretKeyRef:
            name: digitalocean-auth-token
            key: token
```

That manifest essentially edits the Traefik deployment that k3s sets up by default. After symlinking it into the k3s server's /var/lib/rancher/k3s/server/manifests/traefik-config.yaml and restarting k3s.service, it was good to go.
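Concretely, that's just the following on the server node (the source path is wherever the manifest lives in your config repo; mine here is illustrative):

```bash
sudo ln -s /etc/homelab/traefik-config.yaml \
    /var/lib/rancher/k3s/server/manifests/traefik-config.yaml
sudo systemctl restart k3s.service
```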

Passive Ingressive

After verifying that the cluster was working with the "hello world" Nginx Deployment, I started migrating the old Docker Compose services into k8s. To make sure that they were routable at their usual domain names, I went ahead and created an IngressRoute for each service. IngressRoutes are the Custom Resource Definition (CRD) baked into Traefik that allows it to apply its custom routing logic to Kubernetes Services. An example IngressRoute manifest for my Jellyfin server looks like this:

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: jellyfin-ingress
  labels:
    app.kubernetes.io/name: jellyfin
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - match: Host(`watch.lab.janissary.xyz`)
      kind: Rule
      services:
        - name: jellyfin
          port: jellyfin-http
  tls:
    certResolver: myresolver
```

It's pretty easy to follow if you're familiar with Traefik's YAML file syntax. Any request coming in on the web (port 80) or websecure (port 443) entrypoint that matches the hostname watch.lab.janissary.xyz goes to the jellyfin-http port of the jellyfin Service, and the TLS certificate resolver for those requests is the one named myresolver (which is just the ACME resolver created in the HelmChartConfig from earlier). Just to test out Kubernetes' capabilities for dynamic configuration, I decided to install Homepage, which is a handy little start page that can hook into the Kubernetes API and dynamically display the different services you have running. I'll probably go into more detail on how it works in a future post, but with Homepage installed I can add annotations to IngressRoute objects like so:

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: jellyfin-ingress
  labels:
    app.kubernetes.io/name: jellyfin
  annotations:
    gethomepage.dev/href: "https://watch.lab.janissary.xyz"
    gethomepage.dev/icon: "jellyfin.png"
    gethomepage.dev/enabled: "true"
    gethomepage.dev/description: Media Server
    gethomepage.dev/group: Media
    gethomepage.dev/name: Jellyfin
    gethomepage.dev/pod-selector: ""
    gethomepage.dev/widget.type: "jellyfin"
    gethomepage.dev/widget.url: "https://watch.lab.janissary.xyz"
    gethomepage.dev/widget.key: "<my api key>"
    gethomepage.dev/widget.enableBlocks: "true"
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - match: Host(`watch.lab.janissary.xyz`)
      kind: Rule
      services:
        - name: jellyfin
          port: jellyfin-http
  tls:
    certResolver: myresolver
```

...and a little shortcut widget will appear in the Homepage console, with a live status of how many shows/movies are in Jellyfin, any movies in progress, etc. It's pretty handy, although I still just use my browser's URL bar autocomplete as my "homepage" more often than not.

Storage Stories

One of the central selling points of Kubernetes is that anything running in the cluster is ephemeral. For example, when you create a Deployment, you don't care where the underlying pod runs. It can get scheduled to a node, the node can get nuked, and the pod will be rescheduled to another node without skipping a beat. This is great!

However, that means that any data within the pod or on the node itself, much like in a Docker container, can't be guaranteed to stick around. (There are PersistentVolume types like hostPath and local, but these require nodeAffinity or are otherwise dependent on certain nodes being healthy to schedule pods onto them. That kind of defeats the point of "fault-tolerant distributed container orchestration", though, so I didn't seriously consider them.) For any kind of persistent storage available to multiple pods, you need a PersistentVolume that data can live in.

For this, I dedicated a directory of my Synology NAS (aka hesiod) to contain all of the homelab's PersistentVolumes. In /Lab, every service that needs some persistent read/write storage gets a dedicated PersistentVolume in a directory named after the service. Each PersistentVolume also has a 1:1 PersistentVolumeClaim, since they're usually per-service.

I'm not sure if this is an optimal way to do things. I don't really have any backups, and I definitely haven't put a lot of thought into how to size the volumes themselves. I mostly assume that since persistent storage is usually just storing settings, 1Mi should do it, and for anything with a lot of settings or an SQLite database I figure 1Gi should be more than enough. I've tried to mount ConfigMaps as files where I can, since most services just use persistent storage as a place to read settings from. Unfortunately this isn't usually possible, since services commonly store settings in an SQLite DB or read and write their configuration files when settings are updated via a web UI.
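For the services that do read plain config files, the ConfigMap-as-a-file pattern looks something like this (the service name, file, and image are all stand-ins):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: someservice-config
data:
  # The file's contents live right in the manifest...
  settings.ini: |
    [server]
    port = 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: someservice
spec:
  replicas: 1
  selector:
    matchLabels:
      app: someservice
  template:
    metadata:
      labels:
        app: someservice
    spec:
      volumes:
        # ...and get projected into the pod as a read-only file at
        # /etc/someservice/settings.ini.
        - name: config
          configMap:
            name: someservice-config
      containers:
        - name: someservice
          image: nginx:1.25 # stand-in image
          volumeMounts:
            - name: config
              mountPath: /etc/someservice
              readOnly: true
```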

I'll probably revisit how I've set up storage in the future. I feel like there's some easy wins in terms of reducing toil and improving resiliency here.

Naming Things

The last part of the effort to get Kubernetes up and running was actually getting DNS hooked up to all my services. This is also the part where I'm least sure if I'm doing things the "right way", but most of my research on this topic either catered to cloud-managed k8s setups or was SEO listicle slop that gave zero actual information.

Prior to K3s, I just had a record in my BIND DB file that pointed *.lab.janissary.xyz at the machine running the Traefik container. The BIND server still lives outside of Kubernetes, in a single Docker container on athena (I figured I should limit the blast radius in case the k8s cluster has an outage), so all I had to do was edit the record to point at heracles.lab.janissary.xyz instead. Although Traefik runs as a DaemonSet and is available on every machine, I figured that since heracles was the server, it might as well be the place all the requests go.

That was a pretty terrible idea. I'll probably do a full post-mortem of the outage in a future post, but to summarize: the Tailscale key for heracles expired one afternoon, and out of nowhere I could neither reach any *.lab.janissary.xyz service nor interact with the Kubernetes API via kubectl. This revealed heracles as a pretty huge single point of failure, since it was responsible for both all HTTP routing and all Kubernetes API requests. As part of the recovery, I edited the *.lab.janissary.xyz record to point to the IP of every machine in the cluster. It won't prevent the Kubernetes API from going down again, but some DNS-level load balancing will help increase the resiliency of the web services, since any node can now route HTTP requests via the Traefik ServiceLB DaemonSet. The new BIND DB file (the Tailnet view of the split setup, at least) now looks like this:

```
$ORIGIN lab.janissary.xyz.
$TTL 60m
@        IN SOA ns.lab.janissary.xyz. admin.janissary.xyz. (
             2023061301 ; serial
             4h         ; refresh
             15m        ; retry
             8h         ; expire
             4m         ; negative caching ttl
         )
         IN NS ns.lab.janissary.xyz.
ns       IN A 100.110.34.140 ; athena
gorgon   IN A 100.65.15.148
athena   IN A 100.66.244.10
ixion    IN A 100.83.81.25
hesiod   IN A 100.118.5.140
fission  IN A 100.96.128.120 ; m1 macbook
mammon   IN A 100.76.196.136 ; nixos desktop
heracles IN A 100.91.41.135
; remaining records are services proxied by traefik
*.lab.janissary.xyz. IN A 100.66.244.10
*.lab.janissary.xyz. IN A 100.83.81.25
*.lab.janissary.xyz. IN A 100.91.41.135
```

Like I said earlier, though, I'm wondering if this is the right way to do things. Kubernetes already has the idea of LoadBalancer Services (granted, the implementation is provided by third parties), which should do some of this already. Kubernetes also has its own DNS server in CoreDNS for naming services and things (like jellyfin.default.svc.cluster.local) - maybe I should actually be relying on that more? (I really need to check out CoreDNS. I always thought it was just a K8s-specific thing until a coworker raved about it as a general-purpose DNS server. Just from the simplicity of the Corefile syntax compared to BIND's, I'm tempted to make the switch.)
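At least for in-cluster traffic, those names already resolve without any help from BIND. A quick way to check (the pod name and image are arbitrary):

```bash
# Spin up a throwaway pod and resolve a Service's cluster-internal name:
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
    nslookup jellyfin.default.svc.cluster.local
```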

All in All

After verifying that I could reach all the old services at their usual domain names with full functionality, I went ahead and gave the old docker-compose.yml files a warrior's death (rm -rf ./**/docker-compose.yml). After porting everything over, I think I'm willing to say that the K3s experiment/project was a success. I'll be the first to say that Kubernetes is 100% more verbose and more work to set up than Docker Compose, and certainly more work than just manually installing software. All of the following YAML files combined are required just to set up Jellyfin, for example:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: jellyfin-cache
spec:
  storageClassName: nfs
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 1Gi
  nfs:
    path: /volume1/Media/Lab/jellyfin/cache
    server: hesiod.lab.janissary.xyz
    readOnly: false
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: jellyfin-config
spec:
  storageClassName: nfs
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 1Gi
  nfs:
    path: /volume1/Media/Lab/jellyfin/config
    server: hesiod.lab.janissary.xyz
    readOnly: false
```

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-cache-claim
spec:
  storageClassName: nfs
  volumeName: jellyfin-cache
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-config-claim
spec:
  storageClassName: nfs
  volumeName: jellyfin-config
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
```

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jellyfin
  labels:
    app.kubernetes.io/name: jellyfin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jellyfin
  template:
    metadata:
      labels:
        app: jellyfin
    spec:
      volumes:
        - name: jellyfin-config
          persistentVolumeClaim:
            claimName: jellyfin-config-claim
        - name: jellyfin-cache
          persistentVolumeClaim:
            claimName: jellyfin-cache-claim
        - name: jellyfin-data
          persistentVolumeClaim:
            claimName: nfs-claim
      containers:
        - name: jellyfin
          image: linuxserver/jellyfin:10.8.4
          env:
            - name: TZ
              value: America/Los_Angeles
            - name: PUID
              value: "1024"
            - name: PGID
              value: "100"
          ports:
            - containerPort: 8096
              name: http-tcp
              protocol: TCP
            - containerPort: 8920
              name: https-tcp
              protocol: TCP
            - containerPort: 1900
              name: dlna-udp
              protocol: UDP
            - containerPort: 7359
              name: discovery-udp
              protocol: UDP
          volumeMounts:
            - mountPath: /config
              name: jellyfin-config
            - mountPath: /cache
              name: jellyfin-cache
            - mountPath: /data
              name: jellyfin-data
      restartPolicy: Always
```

```yaml
apiVersion: v1
kind: Service
metadata:
  name: jellyfin
  labels:
    app.kubernetes.io/name: jellyfin
spec:
  selector:
    app: jellyfin
  type: NodePort
  ports:
    - name: jellyfin-http
      protocol: TCP
      port: 80
      targetPort: http-tcp
    - name: jellyfin-https
      protocol: TCP
      port: 443
      targetPort: https-tcp
```

...and that's not including the IngressRoute I showed earlier. That being said, most of this stuff is just boilerplate copy/paste YAML. The small portion that actually matters - storage, networking, load balancing - abstracts away a huge amount of toil and complexity. So far, Kubernetes has fulfilled its side of the bargain when it comes to declarative orchestration of arbitrary services.

In fact, it's probably a bit too much of a breeze to stand up new services; I went a little hog wild and installed a ton of new stuff. I won't go into too much detail for now, though. That's for next time ;)