Hi. I'm jon.404, a Unix/Linux/Database/Openstack/Kubernetes Administrator, AWS/GCP/Azure Engineer, mathematics enthusiast, and amateur philosopher. This is where I rant about that which upsets me, laugh about that which amuses me, and jabber about that which holds my interest most: *nix.
On-prem kubernetes, part 3.5
Tags kubernetes, linux, networking, openbsd, virtualization
Posts in this series:
Project Goals and Description: Background info and goals
Preparing the installers: pxeboot configs
Installing the Xen Hosts: installing Debian/Xen dom0
Installing the K8s VMs: installing the k8s domUs
Initializing the Control Plane: Bootstrapping a bare-bones HA Kubernetes Cluster
Installation/Configuration of Calico/MetalLB/ingress-nginx: Installing the CNI/Network Infrastructure
Installation/Configuration of LVM-CSI, S3-CSI, and Kadalu (GlusterFS) (this post): Installing the CSIs for Persistent Volumes
Installation/Configuration of cert-manager: Installing/Configuring cert-manager
Automating the boring bits: Installing/Configuring ArgoCD and GitOps Concepts
Authentication considerations: Installing/Configuring Authelia/Vault and LDAP/OAuth Integrations
Authentication configurations: Securing Applications with Authelia
Staying up to date: Keeping your cluster up-to-date
Github for example configuration files: rocket357/on-prem-kubernetes
Overview
At this point we could easily deploy stateless applications to the cluster and call it good (well, there's still plenty to come with respect to certificates, metrics, secrets, monitoring, etc...we're by no means done yet!), but at some point we're going to want applications to be able to store state and data within the cluster. This is accomplished via Container Storage Interface (CSI) services, which allow for the configuration of storage for pods.
Types of Storage
There are many ways of storing data within a kubernetes cluster, and many ways for pods to consume that data. You could, for instance, put a set of configuration key-values in a kubernetes ConfigMap, and then mount that ConfigMap in a pod as a file (or perhaps even as a directory of files, depending on your needs). You can do the same with a kubernetes Secret, which, by the way, is not secure in the sense of being encrypted or otherwise stored in a "safe" fashion (i.e. kept from prying eyes). You can't read a Secret without RBAC permissions to read it, of course, but the data in a Secret is stored base64-encoded, not encrypted in any fashion. base64 encoding makes it safe to store binary data as text, but it offers zero confidentiality.
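To make the "zero confidentiality" point concrete: base64 round-trips trivially, so anyone who can read the Secret object can recover the plaintext. A quick demonstration (`hunter2` is, of course, just a placeholder password):

```shell
# base64 is a reversible encoding, not encryption
printf 'hunter2' | base64
# aHVudGVyMg==
printf 'aHVudGVyMg==' | base64 -d
# hunter2
```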
Another type of storage you're likely to want to use is the types of storage your pod can write to in bulk. ConfigMaps and Secrets are great for what they do, but you can't have a postgres database keep its data in either (could you imagine a complete kube-api call/re-write of the configmap/secret for every. single. database. row. update.?). For this we'll need a CSI driver. If you browse through the list of CSI drivers, you'll see that there are (non-exhaustive list):
Cloud-Specific Drivers
- Alibaba
- AWS
- Azure
- CloudScale
- DigitalOcean
- Google Cloud
- Hetzner
- IBM Cloud
- Linode
- Oracle Cloud
- Qing Cloud
- Tencent Cloud
- Vultr
- Yandex

"Build your own cloud" solutions
- CephFS/RBD drivers
- Cinder (OpenStack)
- HyperV
- Longhorn
- Portworx
- vSphere
- TrueNAS

Hardware network storage
- Datatom Infinity
- Dell EMC
- Dothill (SeaGate)
- Hitachi
- HPE
- NetApp
- Synology

"Traditional" network filesystems
- BeeGFS
- democratic-csi (ZFS)
- GlusterFS
- JuiceFS
- KaDalu (Gluster)
- MooseFS
- NFS
- SeaweedFS
- SMB
Point is, if you can write data to it, it probably has a K8s CSI written for it (not all CSI drivers are equal, however!). Some of the CSI drivers support a dizzying array of options (looking at you, AWS EBS), and some are incredibly simple, such as the "sample" driver HostPath (just mounts a filesystem path in the k8s worker host into the pod...which is a bad idea for scalability/reliability/availability since it ties a pod to a specific host, hence it being a "sample" driver...but don't let that fool you, in a pinch it works, *especially* for host specific pods, like DaemonSets).
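For illustration, here's a minimal (hypothetical -- the pod name, image, and paths are made up) hostPath volume. Note that the path lives on whichever node happens to schedule the pod, which is exactly why it doesn't scale:

```shell
# write a throwaway manifest demonstrating a hostPath volume mount
cat <<'EOF' > hostpath-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    hostPath:
      path: /var/lib/demo        # lives on the scheduling node only!
      type: DirectoryOrCreate
EOF
```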
Cutting down the List
For our purposes, we really only need a few of these options. We're running on-prem, so all of the cloud-specific drivers are out of the question, and I don't have an expensive NAS or iSCSI network storage array at home (this is all commodity hardware), so my choices are from the "build your own cloud" and "traditional network filesystem" lists. But here's the kicker: we're using open source. At the heart and soul of Linux (and the larger, encompassing FOSS movement in general) is the concept that the people with the itch can write the code to scratch the problem, so the community has a variety of offerings not listed in the "official" Kubernetes lists (also, to note, the "official" list literally has instructions at the top to open a pull request if you want your CSI driver added, and that the information contained in the list is community-driven by the CSI driver maintainers!). How's that for open-source?
It's helpful to determine what *types* of storage we need, and what properties those storages should have.
Persistent Local Data
I'm a database guy at heart, so we're going to be storing database things, preferably in PostgreSQL. The postgres-operator bits I discussed previously set up a multi-host database cluster, so we'll have replicas that we can fail over to in the event that a primary pod goes down. If the primary pod is running with a HostPath on, say, worker-27, and worker-27 dies, we (most likely) lose the HostPath data. This is bad, unless of course we have a replicated copy of the data on a hot standby on a different host that can take over the primary role as soon as it detects that the old primary is dead. Ideally the data wouldn't be attached to a specific host at all (i.e. an EBS volume that any host can mount), and perhaps you're running a Ceph cluster at home and can afford such data integrity luxuries, but I have three simple hosts (well, four) and an ssh server with a few TB of storage as well (this will come into play at some point...stay tuned). For database purposes, an LVM volume on the workers is sufficient for my needs. LVM volumes are preferable to HostPath since we can decouple the data from the path, and cleaning up an LVM persistent storage configuration is a bit easier than cleaning up HostPaths.
For LVM data, there is a great lvm-csi driver from Metalstack that works well.
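Once the driver is installed (covered below), pods consume LVM-backed storage through an ordinary PVC. A sketch, assuming the lvm-linear StorageClass name used later in this post; the claim name and size are made up:

```shell
cat <<'EOF' > demo-lvm-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-lvm-claim
spec:
  accessModes:
    - ReadWriteOnce          # LVM volumes are node-local, single-writer
  storageClassName: lvm-linear
  resources:
    requests:
      storage: 5Gi
EOF
# kubectl apply -f demo-lvm-pvc.yaml
```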
Object Storage
On the other hand, object storage is probably a good idea for data that doesn't necessarily need to be blazing fast, but can hold a large variety of stuff. One of the applications I'll eventually deploy to this cluster is Komga, one of my favorite "local storage" reader-type webapps. I use it to store epub/PDFs for all my books: crafting, survival, howtos, reference tables, etc... Komga needs two types of storage. The first is a persistent place to store configuration and "indexes" into the second type: object storage, for bulk object/data storage. The objects in our object storage aren't written or updated often...it's just a collection of files that are uploaded infrequently (i.e. the raw epub/PDFs, which only change when we add new books or update old ones). The index bits, however, contain metadata and the like, and are updated whenever you add new files, read portions of said files, edit the descriptions of files, etc... so they're updated more frequently than the raw bulk objects themselves. This storage needs to be local and fast, and can reside in LVM as well. Since this data can be rebuilt by scanning the object storage, it doesn't need to be replicated.
But getting back to the object storage: I used to work for Amazon, and I've had an AWS account for many, many years, so I'll just deploy the csi-s3 driver for this data (ironically from Yandex, since the S3 API is stable and well-known).
Replicated Local Storage
The last class of storage is stuff that we'd like to keep replicated across hosts, so if one host dies we don't lose the data. This is essentially the same thing as the PostgreSQL storage above, but without the postgres-operator auto-configuration to replicate the data automatically. Since this storage type is "missing" the operator bits (not every webapp will contain replication logic, and rightfully so), we need a different mechanism for accomplishing the same replication. GlusterFS, MooseFS, and friends are all good choices here, but they come with the overhead of replication (everything is a tradeoff in computing), so they tend to be slower than something like local LVM for writing. This is an acceptable tradeoff for applications that need data integrity and availability, but not necessarily blazing performance. Databases certainly need the performance, so they would be a bad fit here for all but the smallest datasets, but other applications can probably live with sub-par read and (particularly) write speed. An example might be gotify, which I use for messaging to my phone when something important happens (i.e. a patch is made available for OpenBSD, or my HomeAssistant server runs a specific "fixit" automation, or there's a new login to my private gitea server, etc...). If I receive an alert regarding a new login to my gitea server, and I haven't checked my phone yet, I don't want to lose the alert if gotify's pod restarts due to a host rebooting or losing a disk. Thus, I need this data to be replicated across hosts, such that starting gotify back up on a different host preserves the data so I don't lose alerts.
I'm partial to KaDalu here, since I've used it before and it is fairly straightforward to set up. It's an operator that configures a replicated GlusterFS backing store across your hosts.
Installation/Configuration of the CSI drivers
The CSI drivers can be installed via helm:

Code:
# install lvm driver
helm install --repo https://helm.metal-stack.io csi-driver-lvm helm/csi-driver-lvm --set lvm.devicePattern='/dev/xvdb'
# install s3 driver
helm install csi-s3 yandex-s3/csi-s3 -n k8s-csi-s3 --create-namespace --values k8s-csi-s3-values.yaml
# install kadalu operator/driver
# first download the chart and set the default env
K8S_DIST=kubernetes
curl -sL https://github.com/kadalu/kadalu/releases/latest/download/kadalu-helm-chart.tgz -o /tmp/kadalu-helm-chart.tgz
# next install operator
helm install operator --namespace kadalu --create-namespace /tmp/kadalu-helm-chart.tgz --set operator.enabled=true --set global.kubernetesDistro=$K8S_DIST
# now install the csi driver the operator will manage
helm install csi-nodeplugin --namespace kadalu /tmp/kadalu-helm-chart.tgz --set csi-nodeplugin.enabled=true --set global.kubernetesDistro=$K8S_DIST
# now we need to tell kadalu what host devices to use, and for that we need the kadalu kubectl plugin...so let's install it!
curl -fsSL https://github.com/kadalu/kadalu/releases/latest/download/install.sh | sudo bash -x
# and set up a storageclass by telling kadalu what hosts/drives to use...
kubectl kadalu storage-add storage-pool-1 --device kube1:/dev/xvdc
The lvm driver needs lvm.devicePattern set, which I've set in the example above to the second virtual disk on our workers (the second one listed in the xen worker configs, which we set to xvdb). The s3 csi needs a values file, which includes bits such as the AWS access key, secret key, region, bucket name, etc... to properly configure the driver, and the kadalu driver needs to know what kubernetes distribution we're running (openshift, rke, microk8s, etc...) so we'll pass the default "kubernetes" since we're a vanilla on-prem k8s installation.
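For reference, a k8s-csi-s3-values.yaml might look roughly like the following. The key names follow my understanding of the yandex-cloud/k8s-csi-s3 chart and may differ between chart versions, and the credentials/endpoint/bucket are placeholders -- check the chart's values.yaml before relying on this:

```shell
cat <<'EOF' > k8s-csi-s3-values.yaml
secret:
  accessKey: "AKIAEXAMPLEEXAMPLE"        # placeholder
  secretKey: "examplesecretkeyexample"   # placeholder
  endpoint: "https://s3.us-east-1.amazonaws.com"
storageClass:
  singleBucket: "my-k8s-objects"         # hypothetical bucket name
EOF
```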
And here is the first big "oopsie" of the deployment: there is no /dev/xvdc on these devices. I've only added xvda and xvdb in the xen configuration, and while I could create PVCs to use as gluster storage (constructed on top of the LVM csi we've installed), I'm going to demonstrate a fairly common maintenance routine and add xvdc to all of the worker nodes.
Fixing the lack of Prior Proper Planning
Here's how we'll go about fixing this. First, we'll need to shrink the "pvcs" LV on each host, but if we did all of them at the same time, our applications would go down for the duration of the maintenance. Instead, we need to pick the host with the fewest high-priority applications (easy to do at the moment, we're just getting started!), drain that host (wait for the applications to get rescheduled on other hosts), then lvreduce the pvcs logical volume so we can add the /dev/system/glusterfs LV.
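The per-host routine above, sketched as pseudocode-ish shell (the middle step happens on the Xen dom0, not via kubectl):

```shell
for node in k8s-worker-1 k8s-worker-2 k8s-worker-3 k8s-worker-4; do
    kubectl drain "$node"    # (extra flags turn out to be needed; see the drain attempt below)
    # ...shut down the domU, lvreduce the pvcs LV and lvcreate the
    #    glusterfs LV on its Xen host, then boot the domU back up...
    kubectl uncordon "$node"
done
```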
(Note: I've added a few applications here and there so we can see what this would look like if real applications were up and running in the cluster).

Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | grep k8s-worker-1
calico-system   calico-kube-controllers-69bd6d9685-9hzf5   1/1   Running   0   19h    10.244.230.1   k8s-worker-1   <none>   <none>
calico-system   calico-node-kzbz8                          1/1   Running   0   19h    10.1.9.1       k8s-worker-1   <none>   <none>
calico-system   csi-node-driver-qt9rd                      2/2   Running   0   19h    10.244.230.5   k8s-worker-1   <none>   <none>
default         csi-driver-lvm-controller-0                3/3   Running   0   44h    10.244.230.3   k8s-worker-1   <none>   <none>
default         csi-driver-lvm-plugin-5jncm                3/3   Running   0   44h    10.244.230.2   k8s-worker-1   <none>   <none>
k8s-csi-s3      csi-s3-w28nw                               2/2   Running   0   40h    10.244.230.8   k8s-worker-1   <none>   <none>
kadalu          kadalu-csi-nodeplugin-9jwf7                3/3   Running   0   30m    10.244.230.6   k8s-worker-1   <none>   <none>
kube-system     kube-proxy-qmqrt                           1/1   Running   0   2d9h   10.1.9.1       k8s-worker-1   <none>   <none>
kube-system     metrics-server-945fcf89c-5qkhh             1/1   Running   0   43h    10.244.230.4   k8s-worker-1   <none>   <none>
metallb         metallb-speaker-5d7sm                      4/4   Running   0   41h    10.1.9.1       k8s-worker-1   <none>   <none>
That looks really busy, but if you look closely, most of these are system/CNI/CSI pods and not application pods (we haven't spoken about metrics-server or kube-proxy, but suffice it to say they're kubernetes system-level pods and won't require any special handling; also, since Calico does the BGP advertising, the metallb-speaker is doing absolutely nothing right now). So really the only pods we need to concern ourselves with here are the calico-kube-controller and the csi-driver-lvm-controller.
Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | egrep 'k8s-worker-(2|3)' | sort -k8
calico-system      calico-node-q4qc2                          1/1   Running   0   19h   10.1.9.2        k8s-worker-2   <none>   <none>
calico-system      calico-typha-ff6ff5cd8-g5cjj               1/1   Running   0   19h   10.1.9.2        k8s-worker-2   <none>   <none>
kube-system        kube-proxy-kg4km                           1/1   Running   0   45h   10.1.9.2        k8s-worker-2   <none>   <none>
metallb            metallb-speaker-h9xvl                      4/4   Running   0   41h   10.1.9.2        k8s-worker-2   <none>   <none>
tigera-operator    tigera-operator-7f8cd97876-htz6t           1/1   Running   0   19h   10.1.9.2        k8s-worker-2   <none>   <none>
authelia           authdb-1                                   1/1   Running   0   10h   10.244.140.6    k8s-worker-2   <none>   <none>
calico-system      csi-node-driver-rdh2v                      2/2   Running   0   19h   10.244.140.1    k8s-worker-2   <none>   <none>
cert-manager       cert-manager-cainjector-84cfdc869c-trm2d   1/1   Running   0   44h   10.244.140.3    k8s-worker-2   <none>   <none>
cert-manager       cert-manager-webhook-649b4d699f-k9szc      1/1   Running   0   44h   10.244.140.4    k8s-worker-2   <none>   <none>
default            csi-driver-lvm-plugin-9z7km                3/3   Running   0   44h   10.244.140.2    k8s-worker-2   <none>   <none>
kadalu             kadalu-csi-nodeplugin-wphdv                3/3   Running   0   37m   10.244.140.7    k8s-worker-2   <none>   <none>
k8s-csi-s3         csi-s3-vsxpp                               2/2   Running   0   40h   10.244.140.11   k8s-worker-2   <none>   <none>
calico-system      calico-node-mqcj6                          1/1   Running   0   19h   10.1.9.3        k8s-worker-3   <none>   <none>
calico-system      calico-typha-ff6ff5cd8-6kxf6               1/1   Running   0   19h   10.1.9.3        k8s-worker-3   <none>   <none>
kube-system        kube-proxy-4dl55                           1/1   Running   0   45h   10.1.9.3        k8s-worker-3   <none>   <none>
metallb            metallb-speaker-scwxw                      4/4   Running   0   41h   10.1.9.3        k8s-worker-3   <none>   <none>
calico-apiserver   calico-apiserver-76dd5f76bd-7ltsr          1/1   Running   0   19h   10.244.69.196   k8s-worker-3   <none>   <none>
calico-system      csi-node-driver-cmdxm                      2/2   Running   0   19h   10.244.69.193   k8s-worker-3   <none>   <none>
cert-manager       cert-manager-7bfbbd5f46-sn724              1/1   Running   0   44h   10.244.69.195   k8s-worker-3   <none>   <none>
default            csi-driver-lvm-plugin-4dgbd                3/3   Running   0   44h   10.244.69.194   k8s-worker-3   <none>   <none>
k8s-csi-s3         csi-s3-k4lqs                               2/2   Running   0   40h   10.244.69.205   k8s-worker-3   <none>   <none>
kadalu             kadalu-csi-nodeplugin-r2rjf                3/3   Running   0   37m   10.244.69.200   k8s-worker-3   <none>   <none>
kadalu             operator-58ddcb697c-b622v                  1/1   Running   0   38m   10.244.69.199   k8s-worker-3   <none>   <none>
kube-system        metrics-server-945fcf89c-tfk4f             1/1   Running   0   43h   10.244.69.201   k8s-worker-3   <none>   <none>

Workers 2 and 3 are similar, though 2 has one of the authdb pods (authelia's postgres-operator-run postgres backing store) as well as the tigera operator, and 3 has the kadalu operator and cert-manager (more on this in a future post). The operators aren't directly involved with application traffic, as they simply manage application configuration, health, and such, so restarting them shouldn't be super-impactful. Typically when operators start up they'll gather data on all of their resources and immediately check the health of the resources they manage; only if they find problems will they start making changes. So only if an operator and its managed resources land on the same host will we have issues (and then only minor ones).
Let's go ahead and drain k8s-worker-1 and resize the pvcs lv there.

Code:
jon@k8s-master-1:~$ kubectl drain k8s-worker-1
node/k8s-worker-1 cordoned
error: unable to drain node "k8s-worker-1" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm, cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-945fcf89c-5qkhh], continuing command...
There are pending nodes to be drained:
 k8s-worker-1
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm
cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-945fcf89c-5qkhh
Just as the docs say, we can't drain DaemonSet pods. These are pods that are scheduled specifically on this node, so they can't migrate to another node. That's fine, we can --ignore-daemonsets. As for the emptyDir issue with metrics-server, that's data that can be reconstructed ("top" for nodes and pods, essentially), so it's ok to delete it as well with --delete-emptydir-data.

Code:
jon@k8s-master-1:~$ kubectl drain k8s-worker-1 --ignore-daemonsets --delete-emptydir-data
node/k8s-worker-1 already cordoned
Warning: ignoring DaemonSet-managed Pods: calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm
evicting pod kube-system/metrics-server-945fcf89c-5qkhh
evicting pod calico-system/calico-kube-controllers-69bd6d9685-9hzf5
evicting pod default/csi-driver-lvm-controller-0
pod/calico-kube-controllers-69bd6d9685-9hzf5 evicted
pod/csi-driver-lvm-controller-0 evicted
pod/metrics-server-945fcf89c-5qkhh evicted
node/k8s-worker-1 drained
jon@k8s-master-1:~$
We got a success message back, so we should be good to go. For giggles, let's see where the controllers from 1 went:

Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | grep controller
calico-system   calico-kube-controllers-69bd6d9685-rwb99    1/1   Running   0   2m1s    10.244.69.202   k8s-worker-3   <none>   <none>
default         csi-driver-lvm-controller-0                 3/3   Running   0   119s    10.244.79.78    k8s-worker-4   <none>   <none>
ingress-nginx   ingress-nginx-controller-798796947c-6ckcg   1/1   Running   0   41h     10.244.79.80    k8s-worker-4   <none>   <none>
kube-system     kube-controller-manager-k8s-master-1        1/1   Running   0   45h     10.1.8.1        k8s-master-1   <none>   <none>
kube-system     kube-controller-manager-k8s-master-2        1/1   Running   0   2d10h   10.1.8.2        k8s-master-2   <none>   <none>
kube-system     kube-controller-manager-k8s-master-3        1/1   Running   0   2d10h   10.1.8.3        k8s-master-3   <none>   <none>
metallb         metallb-controller-5f9bb77dcd-z8vqs         1/1   Running   0   41h     10.244.79.68    k8s-worker-4   <none>   <none>
The Calico controller went to 3, and the LVM controller went to 4. They're up and healthy (the 1/1 and 3/3 mean there is 1 container in the Calico controller pod and it is healthy (it would be 0/1 if unhealthy), and 3 containers in the LVM controller, all healthy).
Nice, let's fix k8s-worker-1. I've shut down k8s-worker-1 and ssh'd to the xen host it's running on. xl list checks to ensure it isn't running (it was still listed, so I xl destroyed it...remember, all the important bits are on LVM, so destroying the xen domU doesn't *remove* the data). Now we can get to work on the LVM configuration for k8s-worker-1.

Code:
root@xen1:~# lvdisplay /dev/system/pvcs
  --- Logical volume ---
  LV Path                /dev/system/pvcs
  LV Name                pvcs
  VG Name                system
  LV UUID                ty5IXt-RIcM-Ki5x-txuy-mj59-6yS7-fZFwWP
  LV Write Access        read/write
  LV Creation host, time xen1, 2023-12-11 10:45:57 -0600
  LV Status              available
  # open                 0
  LV Size                <239.11 GiB
  Current LE             61211
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:3

root@xen1:~# lvreduce -L 140G /dev/system/pvcs
  WARNING: Reducing active logical volume to 140.00 GiB.
  THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce system/pvcs? [y/n]: y
  Size of logical volume system/pvcs changed from <239.11 GiB (61211 extents) to 140.00 GiB (35840 extents).
  Logical volume system/pvcs successfully resized.
root@xen1:~# lvcreate /dev/system -n glusterfs -l 100%FREE
  Logical volume "glusterfs" created.
Now we need to add the glusterfs lv to k8s-worker-1 and boot it back up. Adjust the /etc/xen/k8s-worker.cfg file to add the lv as xvdc:
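Something along these lines -- the existing disk entries here are assumptions based on the earlier Xen posts (the root LV name in particular is a guess); the xvdc line is the addition:

```
disk = [
        'phy:/dev/system/k8s-worker-1,xvda,w',
        'phy:/dev/system/pvcs,xvdb,w',
        'phy:/dev/system/glusterfs,xvdc,w'
]
```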
And boot it back up. ssh to k8s-worker-1 and pvresize /dev/xvdb, so we're not running into csi-lvm issues later (Since we shrank the lv in the xen host, the *pv* in the VM will be smaller, but will have the old size cached...this will cause csi-lvm provisioning to fail for future pvcs!). Once that's done, wait a bit for it to go ready:
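Concretely (command forms only; the first runs on the worker, the second from a master):

```shell
# on k8s-worker-1: refresh LVM's cached PV size after the shrink
sudo pvresize /dev/xvdb
# from a master: wait for the node to report Ready again
kubectl get nodes k8s-worker-1 --watch
```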
Don't forget to uncordon the worker, or it won't be able to schedule pods!
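That is, once the node is Ready again:

```shell
kubectl uncordon k8s-worker-1
```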
Let's repeat those steps on workers 2-4 now. First I'll check if authdb on 2 is the current primary db pod:
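Assuming the Zalando postgres-operator (whose pods carry a spilo-role label), something like this identifies the current primary -- verify the label name against whichever operator you deployed:

```shell
kubectl get pods -n authelia -l spilo-role=master -o wide
```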
Ok, we hit a snag. authdb-1 is in a stuck state, so we need to figure out what's going on. A few seconds later, I've come to the realization that I pointed archiving/backups to a machine that isn't reachable from k8s, so postgres is going to be broken right now. Meh. My bad. I'll have to fix that later. For now, let's kubectl drain, then lvreduce pvcs, and kubectl uncordon on k8s-worker-{2,3,4}, one at a time. Side note, I had to force delete the authdb-1 pod since it was in an indefinite hold waiting for a backup. This normally wouldn't happen and occurred because I haven't set up my ssh backups properly yet.
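For reference, the force delete looks like this (use sparingly -- it skips graceful termination):

```shell
kubectl delete pod authdb-1 -n authelia --grace-period=0 --force
```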
k8s-worker-4 has a larger disk, and so more pods get scheduled to it, so draining it will take a bit longer. In fact, it won't complete because postgres-operator sets a Pod Disruption Budget on authdb, meaning we can't take authdb-0 offline right now because it's the primary and authdb-1 is broken due to the ssh-backups. Sigh...force delete on authdb-0 time.
Once everyone is back up and happy, and xvdc is present on all (with ~100G each), we can create replicated glusterfs storageclasses across the cluster:
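Using the kadalu kubectl plugin installed earlier, a three-way replicated pool across the workers' new xvdc devices looks roughly like this. The pool name is mine, and you should check `kubectl kadalu storage-add --help` for the exact flag spelling in your version:

```shell
kubectl kadalu storage-add replica3-pool --type Replica3 \
    --device k8s-worker-1:/dev/xvdc \
    --device k8s-worker-2:/dev/xvdc \
    --device k8s-worker-3:/dev/xvdc
```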
Checking the drivers
At this point you should be able to see the following:
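Namely, one StorageClass per driver/pool. The exact names depend on the charts and the kadalu pool name, but roughly:

```shell
kubectl get storageclass
# illustrative -- names will vary with your chart/pool choices:
#   lvm-linear (default), lvm-mirror, lvm-striped, csi-s3, kadalu.<pool-name>
```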
I've marked lvm-linear as the default storageclass, so it will be used unless a specific storageclass is requested when pvcs are created.
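For completeness, marking a StorageClass as the default is just an annotation:

```shell
kubectl patch storageclass lvm-linear -p \
  '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
```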
Next Steps
That was a lot to take in. Storage in Kubernetes isn't really complex; it's just diverse, with tons of options. Cutting it down to a few storageclasses is really simple, though, and that's what we've done in this blog post. As a bonus, we also discussed the proper way to do maintenance in kubernetes and some of the gotchas that can occur along the way.
Next time, we'll discuss cert-manager and ingress-nginx integrations for automatically requesting/using certs from LetsEncrypt.
Cheers!
Project Goals and Description: Background info and goals
Preparing the installers: pxeboot configs
Installing the Xen Hosts: installing Debian/Xen dom0
Installing the K8s VMs: installing the k8s domUs
Initializing the Control Plane: Bootstrapping a bare-bones HA Kubernetes Cluster
Installation/Configuration of Calico/MetalLB/ingress-nginx: Installing the CNI/Network Infrastructure
Installation/Configuration of LVM-CSI, S3-CSI, and Kadalu (GlusterFS) (this post): Installing the CSIs for Persistent Volumes
Installation/Configuration of cert-manager: Installing/Configuring cert-manager
Automating the boring bits: Installing/Configuring ArgoCD and GitOps Concepts
Authentication considerations: Installing/Configuring Authelia/Vault and LDAP/OAuth Integrations
Authentication configurations: Securing Applications with Authelia
Staying up to date: Keeping your cluster up-to-date
Github for example configuration files: rocket357/on-prem-kubernetes
Overview
At this point in time, we could easily deploy any stateless applications to the cluster and call it good (well, there's still plenty to come with respect to certificates, metrics, secrets, monitoring, etc...we're by no means done yet!) but at some point we're going to want for applications to be able to store state and data within the cluster. This is accomplished via Container Storage Interface services, which allow for the configuration of storage for pods.
Types of Storage
There are many ways of storing data within a kubernetes cluster, and many ways for pods to consume that data. You could, for instance, put a set of configuration key-values in a kubernetes ConfigMap, and then mount that ConfigMap in a pod as a file (or perhaps even as a directory of files depending on your needs). You can do the same with a kubernetes Secret, which, by the way, are not secure in the sense that they are encrypted or otherwise stored in a "safe" fashion (i.e. keep out prying eyes). You can't read a secret without RBAC permissions to read it, of course, but the data in a secret is stored in base64 encoding, not encrypted in any fashion. base64 encoding makes it safe to store binary data as text, but it offers zero confidentiality.
Another type of storage you're likely to want to use is the types of storage your pod can write to in bulk. ConfigMaps and Secrets are great for what they do, but you can't have a postgres database keep its data in either (could you imagine a complete kube-api call/re-write of the configmap/secret for every. single. database. row. update.?). For this we'll need a CSI driver. If you browse through the list of CSI drivers, you'll see that there are (non-exhaustive list):
Cloud-Specific Drivers
- Alibaba
- AWS
- Azure
- CloudScale
- DigitalOcean
- Google Cloud
- Hetzner
- IBM Cloud
- Linode
- Oracle Cloud
- Qing Cloud
- Tencent Cloud
- Vultr
- Yandex
"Build your own cloud" solutions
- CephFS/RBD drivers
- Cinder (OpenStack)
- HyperV
- Longhorn
- Portworx
- vSphere
- TrueNAS
Hardware network storage
- Datatom Infinity
- Dell EMC
- Dothill (SeaGate)
- Hitachi
- HPE
- NetApp
- Synology
"Traditional" network filesystems
- BeeGFS
- democratic-csi (ZFS)
- GlusterFS
- JuiceFS
- KaDalu (Gluster)
- MooseFS
- NFS
- SeaweedFS
- SMB
Point is, if you can write data to it, it probably has a K8s CSI written for it (not all CSI drivers are equal, however!). Some of the CSI drivers support a dizzying array of options (looking at you, AWS EBS), and some are incredibly simple, such as the "sample" driver HostPath (just mounts a filesystem path in the k8s worker host into the pod...which is a bad idea for scalability/reliability/availability since it ties a pod to a specific host, hence it being a "sample" driver...but don't let that fool you, in a pinch it works, *especially* for host specific pods, like DaemonSets).
Cutting down the List
For our purposes, we really only need a few of these options. We're running on-prem, so all of the cloud-specific drivers are out of the question, and I don't have an expensive NAS or iSCSI network storage array at home (this is all commodity hardware), so my choices are from the "build your own cloud" and "traditional network filesystem" lists. But here's the kicker: we're using open source. At the heart and soul of Linux (and the larger, encompassing FOSS movement in general) is the concept that the people with the itch can write the code to scratch the problem, so the community has a variety of offerings not listed in the "official" Kubernetes lists (also, to note, the "official" list literally has instructions at the top to open a pull request if you want your CSI driver added, and that the information contained in the list is community-driven by the CSI driver maintainers!). How's that for open-source?
It's helpful to determine what *types* of storage we need, and what properties those storages should have.
Persistent Local Data
I'm a database guy at heart, so we're going to be storing database things, preferably in PostgreSQL. The postgres-operator bits I discussed previously set up a multi-host database cluster, so we'll have replicas that we can failover to in the event that a primary pod goes down. If the primary pod is running with a HostPath, for instance, on worker-27, and worker-27 dies, we lose the HostPath data (most likely). This is bad, unless of course we have a replicated copy of the data on a hot standby on a different host that can take over the primary role as soon as it detects that the old primary is dead. Ideally the data wouldn't be attached to a specific host (i.e. an EBS volume that any host can mount), and perhaps you're running a Ceph cluster at home and can afford such data integrity luxuries, but I have three simple hosts (well, four) and an ssh server that has a few TB of storage as well (this will come in to play at some point...stay tuned). For database purposes, an LVM volume on the workers is sufficient for my needs. LVM volumes are preferred to HostPath since we can decouple the data and the path and cleaning up an LVM persistent storage configuration is a bit easier than cleaning up HostPaths.
For LVM data, there is a great lvm-csi driver from Metalstack that works well.
Object Storage
On the other hand, having object storage is probably a good idea for storage that doesn't necessarily need to be blazing fast, but can store a large variety of stuff. One of the applications I'll eventually deploy to this cluster is Komga, one of my favorite "local storage" reader-type webapps. I use it to store epub/PDFs for all my books, crafting, survival, howtos, reference tables, etc... Komga needs two types of storage. The first is a persistent place to store configuration and "indexes" into...the other type of storage (object storage) for bulk object/data storage. The objects in our object storage aren't written or updated often...it's just a collection of files that are uploaded infrequently (i.e. the raw epub/PDFs that are only uploaded when we add new books or update old books). The index bits, however, contain metadata and the like, and are updated when you add new files, read portions of said files, edit the descriptions of files, etc... so it is updated more frequently than the raw bulk objects themselves. This needs to be local and fast, and can reside in LVM as well. This data can be rebuilt by scanning the object storage, so it doesn't need to be replicated.
But getting back to the object storage, I used to work for Amazon, and I've had an AWS account for many, many years, so I'll just deploy the aws-s3-csi driver for this data (ironically from Yandex since the S3 api is stable and well-known).
Replicated Local Storage
The last class of storage is stuff that we'd like to keep replicated across hosts, so if one host dies we don't lose the data. This is essentially the same thing as the PostgreSQL storage above, but without the postgres-operator auto-configuration to make it replicate the data automatically. Since this storage type is "missing" the operator bits (not every webapp will contain replication logic, and rightfully so), we need a different mechanism for accomplishing the same replication. GlusterFS, MooseFS, and friends are all good choices here, but they come with the overhead of replication (everything is a tradeoff in computing) so they tend to be slower than something like local LVM for writing. This is an acceptable tradeoff for applications that need data integrity and availability, but not necessarily blazing performance. Databases certainly need the performance, so they would be a bad choice here for all but the smallest datasets, but other applications could probably live with sub-par read and (particularly) write speed. And example might be gotify, which I use for messaging to my phone when something important happens (i.e. a patch is made available for OpenBSD, or my HomeAssistant Server runs a specific "fixit" automation, or there's a new login to my private gitea server, etc... If I receive an alert regarding a new login to my gitea server, and I haven't checked my phone yet, I don't want to lose the alert if gotify's pod restarts due to a host rebooting or losing a disk. Thus, I need this data to be replicated across hosts, such that starting gotify back up on a different host will persist the data so I don't lose alerts.
I'm partial to KaDalu here, since I've used it before and it is fairly straightforward to setup. It's an operator that configures a replicated GlusterFS backing store across your hosts.
Installation/Configuration of the CSI drivers
The CSI drivers can be installed via helm:
Code:
# install lvm driver helm install --repo https://helm.metal-stack.io csi-driver-lvm helm/csi-driver-lvm --set lvm.devicePattern='/dev/xvdb' # install s3 driver helm install csi-s3 yandex-s3/csi-s3 -n k8s-csi-s3 --create-namespace --values k8s-csi-s3-values.yaml # install kadalu operator/driver # first download the chart and set the default env K8S_DIST=kubernetes curl -sL https://github.com/kadalu/kadalu/releases/latest/download/kadalu-helm-chart.tgz -o /tmp/kadalu-helm-chart.tgz # next install operator helm install operator --namespace kadalu --create-namespace /tmp/kadalu-helm-chart.tgz --set operator.enabled=true --set global.kubernetesDistro=$K8S_DIST # now install the csi driver the operator will manage helm install csi-nodeplugin --namespace kadalu /tmp/kadalu-helm-chart.tgz --set csi-nodeplugin.enabled=true --set global.kubernetesDistro=$K8S_DIST # now we need to tell kadalu what host devices to use, and for that we need the kadalu kubectl plugin...so let's install it! curl -fsSL https://github.com/kadalu/kadalu/releases/latest/download/install.sh | sudo bash -x # and set up a storageclass by telling kadalu what hosts/drives to use... kubectl kadalu storage-add storage-pool-1 --device kube1:/dev/xvdc
And here is the first big "oopsie" of the deployment: there is no /dev/xvdc on these devices. I've only added xvda and xvdb in the xen configuration, and while I could create PVCs to use as gluster storage (constructed on top of the LVM csi we've installed), I'm going to demonstrate a fairly common maintenance routine and add xvdc to all of the worker nodes.
Fixing the lack of Prior Proper Planning
Here's how we'll go about fixing this. First, we'll need to shrink the "pvcs" LV on each host, but if we did all of them at the same time, our applications would go down for the duration of the maintenance. Instead, we need to pick the host with the fewest high-priority applications (easy to do at the moment, we're just getting started!), drain that host (wait for the applications to get rescheduled on different hosts), then lvreduce the pvcs logical volume so we can add /dev/system/glusterfs lv.
(Note: I've added a few applications here and there so we can see what this would look like if real applications were up and running in the cluster).
Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | grep k8s-worker-1 calico-system calico-kube-controllers-69bd6d9685-9hzf5 1/1 Running 0 19h 10.244.230.1 k8s-worker-1 <none> <none> calico-system calico-node-kzbz8 1/1 Running 0 19h 10.1.9.1 k8s-worker-1 <none> <none> calico-system csi-node-driver-qt9rd 2/2 Running 0 19h 10.244.230.5 k8s-worker-1 <none> <none> default csi-driver-lvm-controller-0 3/3 Running 0 44h 10.244.230.3 k8s-worker-1 <none> <none> default csi-driver-lvm-plugin-5jncm 3/3 Running 0 44h 10.244.230.2 k8s-worker-1 <none> <none> k8s-csi-s3 csi-s3-w28nw 2/2 Running 0 40h 10.244.230.8 k8s-worker-1 <none> <none> kadalu kadalu-csi-nodeplugin-9jwf7 3/3 Running 0 30m 10.244.230.6 k8s-worker-1 <none> <none> kube-system kube-proxy-qmqrt 1/1 Running 0 2d9h 10.1.9.1 k8s-worker-1 <none> <none> kube-system metrics-server-945fcf89c-5qkhh 1/1 Running 0 43h 10.244.230.4 k8s-worker-1 <none> <none> metallb metallb-speaker-5d7sm 4/4 Running 0 41h 10.1.9.1 k8s-worker-1 <none> <none>
Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | egrep 'k8s-worker-(2|3)' | sort -k8 calico-system calico-node-q4qc2 1/1 Running 0 19h 10.1.9.2 k8s-worker-2 <none> <none> calico-system calico-typha-ff6ff5cd8-g5cjj 1/1 Running 0 19h 10.1.9.2 k8s-worker-2 <none> <none> kube-system kube-proxy-kg4km 1/1 Running 0 45h 10.1.9.2 k8s-worker-2 <none> <none> metallb metallb-speaker-h9xvl 4/4 Running 0 41h 10.1.9.2 k8s-worker-2 <none> <none> tigera-operator tigera-operator-7f8cd97876-htz6t 1/1 Running 0 19h 10.1.9.2 k8s-worker-2 <none> <none> authelia authdb-1 1/1 Running 0 10h 10.244.140.6 k8s-worker-2 <none> <none> calico-system csi-node-driver-rdh2v 2/2 Running 0 19h 10.244.140.1 k8s-worker-2 <none> <none> cert-manager cert-manager-cainjector-84cfdc869c-trm2d 1/1 Running 0 44h 10.244.140.3 k8s-worker-2 <none> <none> cert-manager cert-manager-webhook-649b4d699f-k9szc 1/1 Running 0 44h 10.244.140.4 k8s-worker-2 <none> <none> default csi-driver-lvm-plugin-9z7km 3/3 Running 0 44h 10.244.140.2 k8s-worker-2 <none> <none> kadalu kadalu-csi-nodeplugin-wphdv 3/3 Running 0 37m 10.244.140.7 k8s-worker-2 <none> <none> k8s-csi-s3 csi-s3-vsxpp 2/2 Running 0 40h 10.244.140.11 k8s-worker-2 <none> <none> calico-system calico-node-mqcj6 1/1 Running 0 19h 10.1.9.3 k8s-worker-3 <none> <none> calico-system calico-typha-ff6ff5cd8-6kxf6 1/1 Running 0 19h 10.1.9.3 k8s-worker-3 <none> <none> kube-system kube-proxy-4dl55 1/1 Running 0 45h 10.1.9.3 k8s-worker-3 <none> <none> metallb metallb-speaker-scwxw 4/4 Running 0 41h 10.1.9.3 k8s-worker-3 <none> <none> calico-apiserver calico-apiserver-76dd5f76bd-7ltsr 1/1 Running 0 19h 10.244.69.196 k8s-worker-3 <none> <none> calico-system csi-node-driver-cmdxm 2/2 Running 0 19h 10.244.69.193 k8s-worker-3 <none> <none> cert-manager cert-manager-7bfbbd5f46-sn724 1/1 Running 0 44h 10.244.69.195 k8s-worker-3 <none> <none> default csi-driver-lvm-plugin-4dgbd 3/3 Running 0 44h 10.244.69.194 k8s-worker-3 <none> <none> k8s-csi-s3 csi-s3-k4lqs 2/2 
Running 0 40h 10.244.69.205 k8s-worker-3 <none> <none> kadalu kadalu-csi-nodeplugin-r2rjf 3/3 Running 0 37m 10.244.69.200 k8s-worker-3 <none> <none> kadalu operator-58ddcb697c-b622v 1/1 Running 0 38m 10.244.69.199 k8s-worker-3 <none> <none> kube-system metrics-server-945fcf89c-tfk4f 1/1 Running 0 43h 10.244.69.201 k8s-worker-3 <none> <none>
Let's go ahead and drain k8s-worker-1 and resize the pvcs lv there.
Code:
jon@k8s-master-1:~$ kubectl drain k8s-worker-1 node/k8s-worker-1 cordoned error: unable to drain node "k8s-worker-1" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm, cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-945fcf89c-5qkhh], continuing command... There are pending nodes to be drained: k8s-worker-1 cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-945fcf89c-5qkhh
Code:
jon@k8s-master-1:~$ kubectl drain k8s-worker-1 --ignore-daemonsets --delete-emptydir-data
node/k8s-worker-1 already cordoned
Warning: ignoring DaemonSet-managed Pods: calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm
evicting pod kube-system/metrics-server-945fcf89c-5qkhh
evicting pod calico-system/calico-kube-controllers-69bd6d9685-9hzf5
evicting pod default/csi-driver-lvm-controller-0
pod/calico-kube-controllers-69bd6d9685-9hzf5 evicted
pod/csi-driver-lvm-controller-0 evicted
pod/metrics-server-945fcf89c-5qkhh evicted
node/k8s-worker-1 drained
jon@k8s-master-1:~$
Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | grep controller
calico-system   calico-kube-controllers-69bd6d9685-rwb99    1/1   Running   0   2m1s    10.244.69.202   k8s-worker-3   <none>   <none>
default         csi-driver-lvm-controller-0                 3/3   Running   0   119s    10.244.79.78    k8s-worker-4   <none>   <none>
ingress-nginx   ingress-nginx-controller-798796947c-6ckcg   1/1   Running   0   41h     10.244.79.80    k8s-worker-4   <none>   <none>
kube-system     kube-controller-manager-k8s-master-1        1/1   Running   0   45h     10.1.8.1        k8s-master-1   <none>   <none>
kube-system     kube-controller-manager-k8s-master-2        1/1   Running   0   2d10h   10.1.8.2        k8s-master-2   <none>   <none>
kube-system     kube-controller-manager-k8s-master-3        1/1   Running   0   2d10h   10.1.8.3        k8s-master-3   <none>   <none>
metallb         metallb-controller-5f9bb77dcd-z8vqs         1/1   Running   0   41h     10.244.79.68    k8s-worker-4   <none>   <none>
Nice, let's fix k8s-worker-1. I've shut down k8s-worker-1 and ssh'd to the Xen host it runs on. `xl list` confirms whether it's still running (it was still listed, so I `xl destroy`ed it...remember, all the important bits are on LVM, so destroying the Xen domU doesn't *remove* the data). Now we can get to work on the LVM configuration for k8s-worker-1.
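For reference, the teardown on the dom0 looks roughly like this (a sketch; the domain name matches this series' naming, adjust for your hosts):

```shell
# On the Xen dom0 hosting the worker:
xl list                    # check whether the domU is still listed
xl shutdown k8s-worker-1   # ask the guest to power off cleanly (if it responds)
xl destroy k8s-worker-1    # hard-stop it if it lingers; the LVM-backed disks survive
```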
Code:
root@xen1:~# lvdisplay /dev/system/pvcs
  --- Logical volume ---
  LV Path                /dev/system/pvcs
  LV Name                pvcs
  VG Name                system
  LV UUID                ty5IXt-RIcM-Ki5x-txuy-mj59-6yS7-fZFwWP
  LV Write Access        read/write
  LV Creation host, time xen1, 2023-12-11 10:45:57 -0600
  LV Status              available
  # open                 0
  LV Size                <239.11 GiB
  Current LE             61211
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:3

root@xen1:~# lvreduce -L 140G /dev/system/pvcs
  WARNING: Reducing active logical volume to 140.00 GiB.
  THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce system/pvcs? [y/n]: y
  Size of logical volume system/pvcs changed from <239.11 GiB (61211 extents) to 140.00 GiB (35840 extents).
  Logical volume system/pvcs successfully resized.
root@xen1:~# lvcreate /dev/system -n glusterfs -l 100%FREE
  Logical volume "glusterfs" created.
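As a sanity check on those numbers: this VG uses 4 MiB extents (61211 extents x 4 MiB comes out to just under 239.11 GiB, matching the `lvdisplay` output), so the 140 GiB target should map to exactly the extent count `lvreduce` reported:

```shell
# 140 GiB at 4 MiB per extent -> expected extent count after the lvreduce
extent_mib=4
target_gib=140
echo $(( target_gib * 1024 / extent_mib ))   # prints 35840, matching lvreduce's output
```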
With the `glusterfs` LV created, add it to the domU's disk line in its Xen config so it appears in the VM as xvdc:
Code:
disk = [ 'phy:/dev/system/worker,xvda,w', 'phy:/dev/system/pvcs,xvdb,w', 'phy:/dev/system/glusterfs,xvdc,w' ]
Code:
jon@k8s-master-1:~$ kubectl get node k8s-worker-1
NAME           STATUS                     ROLES    AGE     VERSION
k8s-worker-1   Ready,SchedulingDisabled   <none>   2d10h   v1.28.2
jon@k8s-master-1:~$ kubectl uncordon k8s-worker-1
node/k8s-worker-1 uncordoned
jon@k8s-master-1:~$ kubectl get node k8s-worker-1
NAME           STATUS   ROLES    AGE     VERSION
k8s-worker-1   Ready    <none>   2d10h   v1.28.2
Let's repeat those steps on workers 2-4 now. First, I'll check whether the authdb pod on worker 2 is the current primary db pod:
Code:
jon@k8s-master-1:~$ kubectl exec -it authdb-1 -n authelia -c postgres -- patronictl list
+ Cluster: authdb ---------+---------+------------------+----+-----------+
| Member   | Host         | Role    | State            | TL | Lag in MB |
+----------+--------------+---------+------------------+----+-----------+
| authdb-0 | 10.244.79.72 | Leader  | running          |  1 |           |
| authdb-1 | 10.244.140.6 | Replica | creating replica |    |   unknown |
+----------+--------------+---------+------------------+----+-----------+
k8s-worker-4 has a larger disk, so more pods get scheduled to it, and draining it will take a bit longer. In fact, it won't complete at all, because postgres-operator sets a Pod Disruption Budget on authdb: we can't take authdb-0 offline right now, since it's the primary and authdb-1 is broken due to the ssh-backups. Sigh...time to force delete authdb-0.
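The force delete itself is the standard kubectl escape hatch (pod name from the patronictl output above). Use it with care: it bypasses the eviction API, and therefore the PDB, entirely:

```shell
# See the PDB that's blocking the drain
kubectl get pdb -n authelia
# Delete the primary pod directly; --force --grace-period=0 skips graceful
# shutdown, and the operator's StatefulSet recreates the pod afterwards
kubectl delete pod authdb-0 -n authelia --force --grace-period=0
```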
Once everyone is back up and happy, and xvdc is present on all (with ~100G each), we can create replicated glusterfs storageclasses across the cluster:
Code:
# three replicas
kubectl kadalu storage-add glusterfs-pool-1 --type Replica3 \
    --device k8s-worker-1:/dev/xvdc \
    --device k8s-worker-2:/dev/xvdc \
    --device k8s-worker-3:/dev/xvdc
# and a single replica for demo purposes...
kubectl kadalu storage-add glusterfs-pool-2 --type Replica1 --device k8s-worker-4:/dev/xvdc
At this point you should be able to see the following:
Code:
jon@k8s-master-1:~$ kubectl get storageclass
NAME                              PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
csi-driver-lvm-linear (default)   lvm.csi.metal-stack.io   Delete          WaitForFirstConsumer   true                   45h
csi-driver-lvm-mirror             lvm.csi.metal-stack.io   Delete          WaitForFirstConsumer   true                   45h
csi-driver-lvm-striped            lvm.csi.metal-stack.io   Delete          WaitForFirstConsumer   true                   45h
csi-s3                            ru.yandex.s3.csi         Delete          Immediate              false                  42h
kadalu.glusterfs-pool-1           kadalu                   Delete          Immediate              true                   3m19s
kadalu.glusterfs-pool-2           kadalu                   Delete          Immediate              true                   78s
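If you want to confirm the new pools actually provision volumes, a throwaway PVC does the trick (the claim name and size here are just for illustration; GlusterFS-backed volumes support ReadWriteMany):

```shell
# Create a small test PVC against the replicated kadalu pool
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gluster-test
spec:
  storageClassName: kadalu.glusterfs-pool-1
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF
# With Immediate binding mode, the claim should go Bound right away
kubectl get pvc gluster-test
# Clean up when done
kubectl delete pvc gluster-test
```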
Next Steps
That was a lot to take in. Storage in Kubernetes isn't really complex; it's just diverse, with tons of options. Cutting it down to a few storageclasses keeps things simple, though, and that's what we've done in this post. As a bonus, we also covered the proper way to do node maintenance in Kubernetes and some of the gotchas that can crop up along the way.
Next time, we'll discuss cert-manager and ingress-nginx integrations for automatically requesting/using certs from LetsEncrypt.
Cheers!