Hi. I'm jon.404, a Unix/Linux/Database/Openstack/Kubernetes Administrator, AWS/GCP/Azure Engineer, mathematics enthusiast, and amateur philosopher. This is where I rant about that which upsets me, laugh about that which amuses me, and jabber about that which holds my interest most: *nix.
On-prem kubernetes, part 3.5
Tags kubernetes, linux, networking, openbsd, virtualization
Posts in this series:
Project Goals and Description: Background info and goals
Preparing the installers: pxeboot configs
Installing the Xen Hosts: installing Debian/Xen dom0
Installing the K8s VMs: installing the k8s domUs
Initializing the Control Plane: Bootstrapping a bare-bones HA Kubernetes Cluster
Installation/Configuration of Calico/MetalLB/ingress-nginx: Installing the CNI/Network Infrastructure
Installation/Configuration of LVM-CSI, S3-CSI, and Kadalu (GlusterFS) (this post): Installing the CSIs for Persistent Volumes
Installation/Configuration of cert-manager: Installing/Configuring cert-manager
Automating the boring bits: Installing/Configuring ArgoCD and GitOps Concepts
Authentication considerations: Installing/Configuring Authelia/Vault and LDAP/OAuth Integrations
Authentication configurations: Securing Applications with Authelia
Staying up to date: Keeping your cluster up-to-date
Github for example configuration files: rocket357/on-prem-kubernetes
Overview
At this point we could easily deploy stateless applications to the cluster and call it good (well, there's still plenty to come with respect to certificates, metrics, secrets, monitoring, etc...we're by no means done yet!), but at some point we're going to want applications to be able to store state and data within the cluster. This is accomplished via Container Storage Interface (CSI) services, which allow for the configuration of storage for pods.
Types of Storage
There are many ways of storing data within a kubernetes cluster, and many ways for pods to consume that data. You could, for instance, put a set of configuration key-values in a kubernetes ConfigMap, and then mount that ConfigMap in a pod as a file (or perhaps even as a directory of files, depending on your needs). You can do the same with a kubernetes Secret, which, by the way, is not secure in the sense of being encrypted or otherwise stored in a "safe" fashion (i.e. kept from prying eyes). You can't read a Secret without RBAC permissions to read it, of course, but the data in a Secret is stored base64-encoded, not encrypted in any fashion. base64 encoding makes it safe to store binary data as text, but it offers zero confidentiality.
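To make the "zero confidentiality" point concrete: base64 round-trips trivially, so anyone who can read the Secret object can recover the plaintext. A quick demonstration (`hunter2` is, of course, just a placeholder password):

```shell
# base64 is a reversible encoding, not encryption
printf 'hunter2' | base64
# aHVudGVyMg==
printf 'aHVudGVyMg==' | base64 -d
# hunter2
```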
Another type of storage you're likely to want to use is the types of storage your pod can write to in bulk. ConfigMaps and Secrets are great for what they do, but you can't have a postgres database keep its data in either (could you imagine a complete kube-api call/re-write of the configmap/secret for every. single. database. row. update.?). For this we'll need a CSI driver. If you browse through the list of CSI drivers, you'll see that there are (non-exhaustive list):
Cloud-Specific Drivers
- Alibaba
- AWS
- Azure
- CloudScale
- DigitalOcean
- Google Cloud
- Hetzner
- IBM Cloud
- Linode
- Oracle Cloud
- Qing Cloud
- Tencent Cloud
- Vultr
- Yandex

"Build your own cloud" solutions
- CephFS/RBD drivers
- Cinder (OpenStack)
- HyperV
- Longhorn
- Portworx
- vSphere
- TrueNAS

Hardware network storage
- Datatom Infinity
- Dell EMC
- Dothill (SeaGate)
- Hitachi
- HPE
- NetApp
- Synology

"Traditional" network filesystems
- BeeGFS
- democratic-csi (ZFS)
- GlusterFS
- JuiceFS
- KaDalu (Gluster)
- MooseFS
- NFS
- SeaweedFS
- SMB
Point is, if you can write data to it, it probably has a K8s CSI written for it (not all CSI drivers are equal, however!). Some of the CSI drivers support a dizzying array of options (looking at you, AWS EBS), and some are incredibly simple, such as the "sample" driver HostPath (just mounts a filesystem path in the k8s worker host into the pod...which is a bad idea for scalability/reliability/availability since it ties a pod to a specific host, hence it being a "sample" driver...but don't let that fool you, in a pinch it works, *especially* for host specific pods, like DaemonSets).
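For illustration, here's a minimal (hypothetical -- the pod name, image, and paths are made up) hostPath volume. Note that the path lives on whichever node happens to schedule the pod, which is exactly why it doesn't scale:

```shell
# write a throwaway manifest demonstrating a hostPath volume mount
cat <<'EOF' > hostpath-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    hostPath:
      path: /var/lib/demo        # lives on the scheduling node only!
      type: DirectoryOrCreate
EOF
```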
Cutting down the List
For our purposes, we really only need a few of these options. We're running on-prem, so all of the cloud-specific drivers are out of the question, and I don't have an expensive NAS or iSCSI network storage array at home (this is all commodity hardware), so my choices are from the "build your own cloud" and "traditional network filesystem" lists. But here's the kicker: we're using open source. At the heart and soul of Linux (and the larger, encompassing FOSS movement in general) is the concept that the people with the itch can write the code to scratch the problem, so the community has a variety of offerings not listed in the "official" Kubernetes lists (also, to note, the "official" list literally has instructions at the top to open a pull request if you want your CSI driver added, and that the information contained in the list is community-driven by the CSI driver maintainers!). How's that for open-source?
It's helpful to determine what *types* of storage we need, and what properties those storages should have.
Persistent Local Data
I'm a database guy at heart, so we're going to be storing database things, preferably in PostgreSQL. The postgres-operator bits I discussed previously set up a multi-host database cluster, so we'll have replicas that we can fail over to in the event that a primary pod goes down. If the primary pod is running with a HostPath on, say, worker-27, and worker-27 dies, we (most likely) lose the HostPath data. This is bad, unless of course we have a replicated copy of the data on a hot standby on a different host that can take over the primary role as soon as it detects that the old primary is dead. Ideally the data wouldn't be attached to a specific host at all (i.e. an EBS volume that any host can mount), and perhaps you're running a Ceph cluster at home and can afford such data integrity luxuries, but I have three simple hosts (well, four) and an ssh server with a few TB of storage as well (this will come into play at some point...stay tuned). For database purposes, an LVM volume on the workers is sufficient for my needs. LVM volumes are preferable to HostPath since we can decouple the data from the path, and cleaning up an LVM persistent storage configuration is a bit easier than cleaning up HostPaths.
For LVM data, there is a great lvm-csi driver from Metalstack that works well.
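Once the driver is installed (covered below), pods consume LVM-backed storage through an ordinary PVC. A sketch, assuming the lvm-linear StorageClass name used later in this post; the claim name and size are made up:

```shell
cat <<'EOF' > demo-lvm-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-lvm-claim
spec:
  accessModes:
    - ReadWriteOnce          # LVM volumes are node-local, single-writer
  storageClassName: lvm-linear
  resources:
    requests:
      storage: 5Gi
EOF
# kubectl apply -f demo-lvm-pvc.yaml
```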
Object Storage
On the other hand, object storage is probably a good idea for data that doesn't necessarily need to be blazing fast, but can hold a large variety of stuff. One of the applications I'll eventually deploy to this cluster is Komga, one of my favorite "local storage" reader-type webapps. I use it to store epub/PDFs for all my books: crafting, survival, howtos, reference tables, etc... Komga needs two types of storage. The first is a persistent place to store configuration and "indexes" into the second type: object storage, for bulk object/data storage. The objects in our object storage aren't written or updated often...it's just a collection of files that are uploaded infrequently (i.e. the raw epub/PDFs, which only change when we add new books or update old ones). The index bits, however, contain metadata and the like, and are updated whenever you add new files, read portions of said files, edit the descriptions of files, etc... so they're updated more frequently than the raw bulk objects themselves. This storage needs to be local and fast, and can reside in LVM as well. Since this data can be rebuilt by scanning the object storage, it doesn't need to be replicated.
But getting back to the object storage: I used to work for Amazon, and I've had an AWS account for many, many years, so I'll just deploy the csi-s3 driver for this data (ironically from Yandex, since the S3 API is stable and well-known).
Replicated Local Storage
The last class of storage is stuff that we'd like to keep replicated across hosts, so if one host dies we don't lose the data. This is essentially the same thing as the PostgreSQL storage above, but without the postgres-operator auto-configuration to replicate the data automatically. Since this storage type is "missing" the operator bits (not every webapp will contain replication logic, and rightfully so), we need a different mechanism for accomplishing the same replication. GlusterFS, MooseFS, and friends are all good choices here, but they come with the overhead of replication (everything is a tradeoff in computing), so they tend to be slower than something like local LVM for writing. This is an acceptable tradeoff for applications that need data integrity and availability, but not necessarily blazing performance. Databases certainly need the performance, so they would be a bad fit here for all but the smallest datasets, but other applications can probably live with sub-par read and (particularly) write speed. An example might be gotify, which I use for messaging to my phone when something important happens (i.e. a patch is made available for OpenBSD, or my HomeAssistant server runs a specific "fixit" automation, or there's a new login to my private gitea server, etc...). If I receive an alert regarding a new login to my gitea server, and I haven't checked my phone yet, I don't want to lose the alert if gotify's pod restarts due to a host rebooting or losing a disk. Thus, I need this data to be replicated across hosts, such that starting gotify back up on a different host preserves the data so I don't lose alerts.
I'm partial to KaDalu here, since I've used it before and it is fairly straightforward to set up. It's an operator that configures a replicated GlusterFS backing store across your hosts.
Installation/Configuration of the CSI drivers
The CSI drivers can be installed via helm:

Code:
# install lvm driver
helm install --repo https://helm.metal-stack.io csi-driver-lvm helm/csi-driver-lvm --set lvm.devicePattern='/dev/xvdb'
# install s3 driver
helm install csi-s3 yandex-s3/csi-s3 -n k8s-csi-s3 --create-namespace --values k8s-csi-s3-values.yaml
# install kadalu operator/driver
# first download the chart and set the default env
K8S_DIST=kubernetes
curl -sL https://github.com/kadalu/kadalu/releases/latest/download/kadalu-helm-chart.tgz -o /tmp/kadalu-helm-chart.tgz
# next install operator
helm install operator --namespace kadalu --create-namespace /tmp/kadalu-helm-chart.tgz --set operator.enabled=true --set global.kubernetesDistro=$K8S_DIST
# now install the csi driver the operator will manage
helm install csi-nodeplugin --namespace kadalu /tmp/kadalu-helm-chart.tgz --set csi-nodeplugin.enabled=true --set global.kubernetesDistro=$K8S_DIST
# now we need to tell kadalu what host devices to use, and for that we need the kadalu kubectl plugin...so let's install it!
curl -fsSL https://github.com/kadalu/kadalu/releases/latest/download/install.sh | sudo bash -x
# and set up a storageclass by telling kadalu what hosts/drives to use...
kubectl kadalu storage-add storage-pool-1 --device kube1:/dev/xvdc
The lvm driver needs lvm.devicePattern set, which I've set in the example above to the second virtual disk on our workers (the second one listed in the xen worker configs, which we set to xvdb). The s3 csi needs a values file, which includes bits such as the AWS access key, secret key, region, bucket name, etc... to properly configure the driver, and the kadalu driver needs to know what kubernetes distribution we're running (openshift, rke, microk8s, etc...) so we'll pass the default "kubernetes" since we're a vanilla on-prem k8s installation.
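For reference, a k8s-csi-s3-values.yaml might look roughly like the following. The key names follow my understanding of the yandex-cloud/k8s-csi-s3 chart and may differ between chart versions, and the credentials/endpoint/bucket are placeholders -- check the chart's values.yaml before relying on this:

```shell
cat <<'EOF' > k8s-csi-s3-values.yaml
secret:
  accessKey: "AKIAEXAMPLEEXAMPLE"        # placeholder
  secretKey: "examplesecretkeyexample"   # placeholder
  endpoint: "https://s3.us-east-1.amazonaws.com"
storageClass:
  singleBucket: "my-k8s-objects"         # hypothetical bucket name
EOF
```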
And here is the first big "oopsie" of the deployment: there is no /dev/xvdc on these devices. I've only added xvda and xvdb in the xen configuration, and while I could create PVCs to use as gluster storage (constructed on top of the LVM csi we've installed), I'm going to demonstrate a fairly common maintenance routine and add xvdc to all of the worker nodes.
Fixing the lack of Prior Proper Planning
Here's how we'll go about fixing this. First, we'll need to shrink the "pvcs" LV on each host, but if we did all of them at the same time, our applications would go down for the duration of the maintenance. Instead, we need to pick the host with the fewest high-priority applications (easy to do at the moment, we're just getting started!), drain that host (wait for the applications to get rescheduled on other hosts), then lvreduce the pvcs logical volume so we can add the /dev/system/glusterfs LV.
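The per-host routine above, sketched as pseudocode-ish shell (the middle step happens on the Xen dom0, not via kubectl):

```shell
for node in k8s-worker-1 k8s-worker-2 k8s-worker-3 k8s-worker-4; do
    kubectl drain "$node"    # (extra flags turn out to be needed; see the drain attempt below)
    # ...shut down the domU, lvreduce the pvcs LV and lvcreate the
    #    glusterfs LV on its Xen host, then boot the domU back up...
    kubectl uncordon "$node"
done
```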
(Note: I've added a few applications here and there so we can see what this would look like if real applications were up and running in the cluster).

Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | grep k8s-worker-1
calico-system   calico-kube-controllers-69bd6d9685-9hzf5   1/1   Running   0   19h    10.244.230.1   k8s-worker-1   <none>   <none>
calico-system   calico-node-kzbz8                          1/1   Running   0   19h    10.1.9.1       k8s-worker-1   <none>   <none>
calico-system   csi-node-driver-qt9rd                      2/2   Running   0   19h    10.244.230.5   k8s-worker-1   <none>   <none>
default         csi-driver-lvm-controller-0                3/3   Running   0   44h    10.244.230.3   k8s-worker-1   <none>   <none>
default         csi-driver-lvm-plugin-5jncm                3/3   Running   0   44h    10.244.230.2   k8s-worker-1   <none>   <none>
k8s-csi-s3      csi-s3-w28nw                               2/2   Running   0   40h    10.244.230.8   k8s-worker-1   <none>   <none>
kadalu          kadalu-csi-nodeplugin-9jwf7                3/3   Running   0   30m    10.244.230.6   k8s-worker-1   <none>   <none>
kube-system     kube-proxy-qmqrt                           1/1   Running   0   2d9h   10.1.9.1       k8s-worker-1   <none>   <none>
kube-system     metrics-server-945fcf89c-5qkhh             1/1   Running   0   43h    10.244.230.4   k8s-worker-1   <none>   <none>
metallb         metallb-speaker-5d7sm                      4/4   Running   0   41h    10.1.9.1       k8s-worker-1   <none>   <none>
That looks really busy, but if you look closely, most of these are system/CNI/CSI pods and not application pods (we haven't spoken about metrics-server or kube-proxy, but suffice it to say they're kubernetes system-level pods and won't require any special handling; also, since Calico does the BGP advertising, the metallb-speaker is doing absolutely nothing right now). So really the only pods we need to concern ourselves with here are the calico-kube-controller and the csi-driver-lvm-controller.
Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | egrep 'k8s-worker-(2|3)' | sort -k8
calico-system      calico-node-q4qc2                          1/1   Running   0   19h   10.1.9.2        k8s-worker-2   <none>   <none>
calico-system      calico-typha-ff6ff5cd8-g5cjj               1/1   Running   0   19h   10.1.9.2        k8s-worker-2   <none>   <none>
kube-system        kube-proxy-kg4km                           1/1   Running   0   45h   10.1.9.2        k8s-worker-2   <none>   <none>
metallb            metallb-speaker-h9xvl                      4/4   Running   0   41h   10.1.9.2        k8s-worker-2   <none>   <none>
tigera-operator    tigera-operator-7f8cd97876-htz6t           1/1   Running   0   19h   10.1.9.2        k8s-worker-2   <none>   <none>
authelia           authdb-1                                   1/1   Running   0   10h   10.244.140.6    k8s-worker-2   <none>   <none>
calico-system      csi-node-driver-rdh2v                      2/2   Running   0   19h   10.244.140.1    k8s-worker-2   <none>   <none>
cert-manager       cert-manager-cainjector-84cfdc869c-trm2d   1/1   Running   0   44h   10.244.140.3    k8s-worker-2   <none>   <none>
cert-manager       cert-manager-webhook-649b4d699f-k9szc      1/1   Running   0   44h   10.244.140.4    k8s-worker-2   <none>   <none>
default            csi-driver-lvm-plugin-9z7km                3/3   Running   0   44h   10.244.140.2    k8s-worker-2   <none>   <none>
kadalu             kadalu-csi-nodeplugin-wphdv                3/3   Running   0   37m   10.244.140.7    k8s-worker-2   <none>   <none>
k8s-csi-s3         csi-s3-vsxpp                               2/2   Running   0   40h   10.244.140.11   k8s-worker-2   <none>   <none>
calico-system      calico-node-mqcj6                          1/1   Running   0   19h   10.1.9.3        k8s-worker-3   <none>   <none>
calico-system      calico-typha-ff6ff5cd8-6kxf6               1/1   Running   0   19h   10.1.9.3        k8s-worker-3   <none>   <none>
kube-system        kube-proxy-4dl55                           1/1   Running   0   45h   10.1.9.3        k8s-worker-3   <none>   <none>
metallb            metallb-speaker-scwxw                      4/4   Running   0   41h   10.1.9.3        k8s-worker-3   <none>   <none>
calico-apiserver   calico-apiserver-76dd5f76bd-7ltsr          1/1   Running   0   19h   10.244.69.196   k8s-worker-3   <none>   <none>
calico-system      csi-node-driver-cmdxm                      2/2   Running   0   19h   10.244.69.193   k8s-worker-3   <none>   <none>
cert-manager       cert-manager-7bfbbd5f46-sn724              1/1   Running   0   44h   10.244.69.195   k8s-worker-3   <none>   <none>
default            csi-driver-lvm-plugin-4dgbd                3/3   Running   0   44h   10.244.69.194   k8s-worker-3   <none>   <none>
k8s-csi-s3         csi-s3-k4lqs                               2/2   Running   0   40h   10.244.69.205   k8s-worker-3   <none>   <none>
kadalu             kadalu-csi-nodeplugin-r2rjf                3/3   Running   0   37m   10.244.69.200   k8s-worker-3   <none>   <none>
kadalu             operator-58ddcb697c-b622v                  1/1   Running   0   38m   10.244.69.199   k8s-worker-3   <none>   <none>
kube-system        metrics-server-945fcf89c-tfk4f             1/1   Running   0   43h   10.244.69.201   k8s-worker-3   <none>   <none>

Workers 2 and 3 are similar, though 2 has one of the authdb pods (authelia's postgres-operator-run postgres backing store) as well as the tigera operator, and 3 has the kadalu operator and cert-manager (more on this in a future post). The operators aren't directly involved with application traffic, as they simply manage application configuration, health, and such, so restarting them shouldn't be super-impactful. Typically when operators start up they'll gather data on all of their resources and immediately check the health of the resources they manage; only if they find problems will they start making changes. So only if an operator and its managed resources land on the same host will we have issues (and then only minor ones).
Let's go ahead and drain k8s-worker-1 and resize the pvcs lv there.

Code:
jon@k8s-master-1:~$ kubectl drain k8s-worker-1
node/k8s-worker-1 cordoned
error: unable to drain node "k8s-worker-1" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm, cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-945fcf89c-5qkhh], continuing command...
There are pending nodes to be drained:
 k8s-worker-1
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm
cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-945fcf89c-5qkhh
Just as the docs say, we can't drain DaemonSet pods. These are pods that are scheduled specifically on this node, so they can't migrate to another node. That's fine, we can --ignore-daemonsets. As for the emptyDir issue with metrics-server, that's data that can be reconstructed ("top" for nodes and pods, essentially), so it's ok to delete it as well with --delete-emptydir-data.

Code:
jon@k8s-master-1:~$ kubectl drain k8s-worker-1 --ignore-daemonsets --delete-emptydir-data
node/k8s-worker-1 already cordoned
Warning: ignoring DaemonSet-managed Pods: calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm
evicting pod kube-system/metrics-server-945fcf89c-5qkhh
evicting pod calico-system/calico-kube-controllers-69bd6d9685-9hzf5
evicting pod default/csi-driver-lvm-controller-0
pod/calico-kube-controllers-69bd6d9685-9hzf5 evicted
pod/csi-driver-lvm-controller-0 evicted
pod/metrics-server-945fcf89c-5qkhh evicted
node/k8s-worker-1 drained
jon@k8s-master-1:~$
We got a success message back, so we should be good to go. For giggles, let's see where the controllers from 1 went:

Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | grep controller
calico-system   calico-kube-controllers-69bd6d9685-rwb99    1/1   Running   0   2m1s    10.244.69.202   k8s-worker-3   <none>   <none>
default         csi-driver-lvm-controller-0                 3/3   Running   0   119s    10.244.79.78    k8s-worker-4   <none>   <none>
ingress-nginx   ingress-nginx-controller-798796947c-6ckcg   1/1   Running   0   41h     10.244.79.80    k8s-worker-4   <none>   <none>
kube-system     kube-controller-manager-k8s-master-1        1/1   Running   0   45h     10.1.8.1        k8s-master-1   <none>   <none>
kube-system     kube-controller-manager-k8s-master-2        1/1   Running   0   2d10h   10.1.8.2        k8s-master-2   <none>   <none>
kube-system     kube-controller-manager-k8s-master-3        1/1   Running   0   2d10h   10.1.8.3        k8s-master-3   <none>   <none>
metallb         metallb-controller-5f9bb77dcd-z8vqs         1/1   Running   0   41h     10.244.79.68    k8s-worker-4   <none>   <none>
The Calico controller went to 3, and the LVM controller went to 4. They're up and healthy (the 1/1 and 3/3 mean there is 1 container in the Calico controller pod and it is healthy (it would be 0/1 if unhealthy), and 3 containers in the LVM controller, all healthy).
Nice, let's fix k8s-worker-1. I've shut down k8s-worker-1 and ssh'd to the xen host it's running on. xl list checks to ensure it isn't running (it was still listed, so I xl destroyed it...remember, all the important bits are on LVM, so destroying the xen domU doesn't *remove* the data). Now we can get to work on the LVM configuration for k8s-worker-1.

Code:
root@xen1:~# lvdisplay /dev/system/pvcs
  --- Logical volume ---
  LV Path                /dev/system/pvcs
  LV Name                pvcs
  VG Name                system
  LV UUID                ty5IXt-RIcM-Ki5x-txuy-mj59-6yS7-fZFwWP
  LV Write Access        read/write
  LV Creation host, time xen1, 2023-12-11 10:45:57 -0600
  LV Status              available
  # open                 0
  LV Size                <239.11 GiB
  Current LE             61211
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:3

root@xen1:~# lvreduce -L 140G /dev/system/pvcs
  WARNING: Reducing active logical volume to 140.00 GiB.
  THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce system/pvcs? [y/n]: y
  Size of logical volume system/pvcs changed from <239.11 GiB (61211 extents) to 140.00 GiB (35840 extents).
  Logical volume system/pvcs successfully resized.
root@xen1:~# lvcreate /dev/system -n glusterfs -l 100%FREE
  Logical volume "glusterfs" created.
Now we need to add the glusterfs lv to k8s-worker-1 and boot it back up. Adjust the /etc/xen/k8s-worker.cfg file to add the lv as xvdc:
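Something along these lines -- the existing disk entries here are assumptions based on the earlier Xen posts (the root LV name in particular is a guess); the xvdc line is the addition:

```
disk = [
        'phy:/dev/system/k8s-worker-1,xvda,w',
        'phy:/dev/system/pvcs,xvdb,w',
        'phy:/dev/system/glusterfs,xvdc,w'
]
```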
And boot it back up. ssh to k8s-worker-1 and pvresize /dev/xvdb, so we're not running into csi-lvm issues later (Since we shrank the lv in the xen host, the *pv* in the VM will be smaller, but will have the old size cached...this will cause csi-lvm provisioning to fail for future pvcs!). Once that's done, wait a bit for it to go ready:
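Concretely (command forms only; the first runs on the worker, the second from a master):

```shell
# on k8s-worker-1: refresh LVM's cached PV size after the shrink
sudo pvresize /dev/xvdb
# from a master: wait for the node to report Ready again
kubectl get nodes k8s-worker-1 --watch
```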
Don't forget to uncordon the worker, or it won't be able to schedule pods!
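That is, once the node is Ready again:

```shell
kubectl uncordon k8s-worker-1
```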
Let's repeat those steps on workers 2-4 now. First I'll check if authdb on 2 is the current primary db pod:
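Assuming the Zalando postgres-operator (whose pods carry a spilo-role label), something like this identifies the current primary -- verify the label name against whichever operator you deployed:

```shell
kubectl get pods -n authelia -l spilo-role=master -o wide
```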
Ok, we hit a snag. authdb-1 is in a stuck state, so we need to figure out what's going on. A few seconds later, I've come to the realization that I pointed archiving/backups to a machine that isn't reachable from k8s, so postgres is going to be broken right now. Meh. My bad. I'll have to fix that later. For now, let's kubectl drain, then lvreduce pvcs, and kubectl uncordon on k8s-worker-{2,3,4}, one at a time. Side note, I had to force delete the authdb-1 pod since it was in an indefinite hold waiting for a backup. This normally wouldn't happen and occurred because I haven't set up my ssh backups properly yet.
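For reference, the force delete looks like this (use sparingly -- it skips graceful termination):

```shell
kubectl delete pod authdb-1 -n authelia --grace-period=0 --force
```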
k8s-worker-4 has a larger disk, and so more pods get scheduled to it, so draining it will take a bit longer. In fact, it won't complete because postgres-operator sets a Pod Disruption Budget on authdb, meaning we can't take authdb-0 offline right now because it's the primary and authdb-1 is broken due to the ssh-backups. Sigh...force delete on authdb-0 time.
Once everyone is back up and happy, and xvdc is present on all (with ~100G each), we can create replicated glusterfs storageclasses across the cluster:
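Using the kadalu kubectl plugin installed earlier, a three-way replicated pool across the workers' new xvdc devices looks roughly like this. The pool name is mine, and you should check `kubectl kadalu storage-add --help` for the exact flag spelling in your version:

```shell
kubectl kadalu storage-add replica3-pool --type Replica3 \
    --device k8s-worker-1:/dev/xvdc \
    --device k8s-worker-2:/dev/xvdc \
    --device k8s-worker-3:/dev/xvdc
```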
Checking the drivers
At this point you should be able to see the following:
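Namely, one StorageClass per driver/pool. The exact names depend on the charts and the kadalu pool name, but roughly:

```shell
kubectl get storageclass
# illustrative -- names will vary with your chart/pool choices:
#   lvm-linear (default), lvm-mirror, lvm-striped, csi-s3, kadalu.<pool-name>
```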
I've marked lvm-linear as the default storageclass, so it will be used unless a specific storageclass is requested when pvcs are created.
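For completeness, marking a StorageClass as the default is just an annotation:

```shell
kubectl patch storageclass lvm-linear -p \
  '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
```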
Next Steps
That was a lot to take in. Storage in Kubernetes isn't really complex; it's just diverse, with tons of options. Cutting it down to a few storageclasses is really simple, though, and that's what we've done in this blog post. As a bonus, we also discussed the proper way to do maintenance in kubernetes and some of the gotchas that can occur along the way.
Next time, we'll discuss cert-manager and ingress-nginx integrations for automatically requesting/using certs from LetsEncrypt.
Cheers!
Project Goals and Description: Background info and goals
Preparing the installers: pxeboot configs
Installing the Xen Hosts: installing Debian/Xen dom0
Installing the K8s VMs: installing the k8s domUs
Initializing the Control Plane: Bootstrapping a bare-bones HA Kubernetes Cluster
Installation/Configuration of Calico/MetalLB/ingress-nginx: Installing the CNI/Network Infrastructure
Installation/Configuration of LVM-CSI, S3-CSI, and Kadalu (GlusterFS) (this post): Installing the CSIs for Persistent Volumes
Installation/Configuration of cert-manager: Installing/Configuring cert-manager
Automating the boring bits: Installing/Configuring ArgoCD and GitOps Concepts
Authentication considerations: Installing/Configuring Authelia/Vault and LDAP/OAuth Integrations
Authentication configurations: Securing Applications with Authelia
Staying up to date: Keeping your cluster up-to-date
Github for example configuration files: rocket357/on-prem-kubernetes
Overview
At this point in time, we could easily deploy any stateless applications to the cluster and call it good (well, there's still plenty to come with respect to certificates, metrics, secrets, monitoring, etc...we're by no means done yet!) but at some point we're going to want for applications to be able to store state and data within the cluster. This is accomplished via Container Storage Interface services, which allow for the configuration of storage for pods.
Types of Storage
There are many ways of storing data within a kubernetes cluster, and many ways for pods to consume that data. You could, for instance, put a set of configuration key-values in a kubernetes ConfigMap, and then mount that ConfigMap in a pod as a file (or perhaps even as a directory of files depending on your needs). You can do the same with a kubernetes Secret, which, by the way, are not secure in the sense that they are encrypted or otherwise stored in a "safe" fashion (i.e. keep out prying eyes). You can't read a secret without RBAC permissions to read it, of course, but the data in a secret is stored in base64 encoding, not encrypted in any fashion. base64 encoding makes it safe to store binary data as text, but it offers zero confidentiality.
Another type of storage you're likely to want to use is the types of storage your pod can write to in bulk. ConfigMaps and Secrets are great for what they do, but you can't have a postgres database keep its data in either (could you imagine a complete kube-api call/re-write of the configmap/secret for every. single. database. row. update.?). For this we'll need a CSI driver. If you browse through the list of CSI drivers, you'll see that there are (non-exhaustive list):
Cloud-Specific Drivers
- Alibaba
- AWS
- Azure
- CloudScale
- DigitalOcean
- Google Cloud
- Hetzner
- IBM Cloud
- Linode
- Oracle Cloud
- Qing Cloud
- Tencent Cloud
- Vultr
- Yandex
"Build your own cloud" solutions
- CephFS/RBD drivers
- Cinder (OpenStack)
- HyperV
- Longhorn
- Portworx
- vSphere
- TrueNAS
Hardware network storage
- Datatom Infinity
- Dell EMC
- Dothill (SeaGate)
- Hitachi
- HPE
- NetApp
- Synology
"Traditional" network filesystems
- BeeGFS
- democratic-csi (ZFS)
- GlusterFS
- JuiceFS
- KaDalu (Gluster)
- MooseFS
- NFS
- SeaweedFS
- SMB
Point is, if you can write data to it, it probably has a K8s CSI written for it (not all CSI drivers are equal, however!). Some of the CSI drivers support a dizzying array of options (looking at you, AWS EBS), and some are incredibly simple, such as the "sample" driver HostPath (just mounts a filesystem path in the k8s worker host into the pod...which is a bad idea for scalability/reliability/availability since it ties a pod to a specific host, hence it being a "sample" driver...but don't let that fool you, in a pinch it works, *especially* for host specific pods, like DaemonSets).
Cutting down the List
For our purposes, we really only need a few of these options. We're running on-prem, so all of the cloud-specific drivers are out of the question, and I don't have an expensive NAS or iSCSI network storage array at home (this is all commodity hardware), so my choices are from the "build your own cloud" and "traditional network filesystem" lists. But here's the kicker: we're using open source. At the heart and soul of Linux (and the larger, encompassing FOSS movement in general) is the concept that the people with the itch can write the code to scratch the problem, so the community has a variety of offerings not listed in the "official" Kubernetes lists (also, to note, the "official" list literally has instructions at the top to open a pull request if you want your CSI driver added, and that the information contained in the list is community-driven by the CSI driver maintainers!). How's that for open-source?
It's helpful to determine what *types* of storage we need, and what properties those storages should have.
Persistent Local Data
I'm a database guy at heart, so we're going to be storing database things, preferably in PostgreSQL. The postgres-operator bits I discussed previously set up a multi-host database cluster, so we'll have replicas that we can failover to in the event that a primary pod goes down. If the primary pod is running with a HostPath, for instance, on worker-27, and worker-27 dies, we lose the HostPath data (most likely). This is bad, unless of course we have a replicated copy of the data on a hot standby on a different host that can take over the primary role as soon as it detects that the old primary is dead. Ideally the data wouldn't be attached to a specific host (i.e. an EBS volume that any host can mount), and perhaps you're running a Ceph cluster at home and can afford such data integrity luxuries, but I have three simple hosts (well, four) and an ssh server that has a few TB of storage as well (this will come in to play at some point...stay tuned). For database purposes, an LVM volume on the workers is sufficient for my needs. LVM volumes are preferred to HostPath since we can decouple the data and the path and cleaning up an LVM persistent storage configuration is a bit easier than cleaning up HostPaths.
For LVM data, there is a great lvm-csi driver from Metalstack that works well.
Object Storage
On the other hand, having object storage is probably a good idea for storage that doesn't necessarily need to be blazing fast, but can store a large variety of stuff. One of the applications I'll eventually deploy to this cluster is Komga, one of my favorite "local storage" reader-type webapps. I use it to store epub/PDFs for all my books, crafting, survival, howtos, reference tables, etc... Komga needs two types of storage. The first is a persistent place to store configuration and "indexes" into...the other type of storage (object storage) for bulk object/data storage. The objects in our object storage aren't written or updated often...it's just a collection of files that are uploaded infrequently (i.e. the raw epub/PDFs that are only uploaded when we add new books or update old books). The index bits, however, contain metadata and the like, and are updated when you add new files, read portions of said files, edit the descriptions of files, etc... so it is updated more frequently than the raw bulk objects themselves. This needs to be local and fast, and can reside in LVM as well. This data can be rebuilt by scanning the object storage, so it doesn't need to be replicated.
But getting back to the object storage, I used to work for Amazon, and I've had an AWS account for many, many years, so I'll just deploy the aws-s3-csi driver for this data (ironically from Yandex since the S3 api is stable and well-known).
Replicated Local Storage
The last class of storage is stuff that we'd like to keep replicated across hosts, so if one host dies we don't lose the data. This is essentially the same thing as the PostgreSQL storage above, but without the postgres-operator auto-configuration to make it replicate the data automatically. Since this storage type is "missing" the operator bits (not every webapp will contain replication logic, and rightfully so), we need a different mechanism for accomplishing the same replication. GlusterFS, MooseFS, and friends are all good choices here, but they come with the overhead of replication (everything is a tradeoff in computing) so they tend to be slower than something like local LVM for writing. This is an acceptable tradeoff for applications that need data integrity and availability, but not necessarily blazing performance. Databases certainly need the performance, so they would be a bad choice here for all but the smallest datasets, but other applications could probably live with sub-par read and (particularly) write speed. And example might be gotify, which I use for messaging to my phone when something important happens (i.e. a patch is made available for OpenBSD, or my HomeAssistant Server runs a specific "fixit" automation, or there's a new login to my private gitea server, etc... If I receive an alert regarding a new login to my gitea server, and I haven't checked my phone yet, I don't want to lose the alert if gotify's pod restarts due to a host rebooting or losing a disk. Thus, I need this data to be replicated across hosts, such that starting gotify back up on a different host will persist the data so I don't lose alerts.
I'm partial to KaDalu here, since I've used it before and it is fairly straightforward to setup. It's an operator that configures a replicated GlusterFS backing store across your hosts.
Installation/Configuration of the CSI drivers
The CSI drivers can be installed via helm:
Code:
# install lvm driver helm install --repo https://helm.metal-stack.io csi-driver-lvm helm/csi-driver-lvm --set lvm.devicePattern='/dev/xvdb' # install s3 driver helm install csi-s3 yandex-s3/csi-s3 -n k8s-csi-s3 --create-namespace --values k8s-csi-s3-values.yaml # install kadalu operator/driver # first download the chart and set the default env K8S_DIST=kubernetes curl -sL https://github.com/kadalu/kadalu/releases/latest/download/kadalu-helm-chart.tgz -o /tmp/kadalu-helm-chart.tgz # next install operator helm install operator --namespace kadalu --create-namespace /tmp/kadalu-helm-chart.tgz --set operator.enabled=true --set global.kubernetesDistro=$K8S_DIST # now install the csi driver the operator will manage helm install csi-nodeplugin --namespace kadalu /tmp/kadalu-helm-chart.tgz --set csi-nodeplugin.enabled=true --set global.kubernetesDistro=$K8S_DIST # now we need to tell kadalu what host devices to use, and for that we need the kadalu kubectl plugin...so let's install it! curl -fsSL https://github.com/kadalu/kadalu/releases/latest/download/install.sh | sudo bash -x # and set up a storageclass by telling kadalu what hosts/drives to use... kubectl kadalu storage-add storage-pool-1 --device kube1:/dev/xvdc
And here is the first big "oopsie" of the deployment: there is no /dev/xvdc on these devices. I've only added xvda and xvdb in the xen configuration, and while I could create PVCs to use as gluster storage (constructed on top of the LVM csi we've installed), I'm going to demonstrate a fairly common maintenance routine and add xvdc to all of the worker nodes.
Fixing the lack of Prior Proper Planning
Here's how we'll go about fixing this. First, we'll need to shrink the "pvcs" LV on each host, but if we did all of them at the same time, our applications would go down for the duration of the maintenance. Instead, we need to pick the host with the fewest high-priority applications (easy to do at the moment, we're just getting started!), drain that host (wait for the applications to get rescheduled on different hosts), then lvreduce the pvcs logical volume so we can add /dev/system/glusterfs lv.
(Note: I've added a few applications here and there so we can see what this would look like if real applications were up and running in the cluster).
Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | grep k8s-worker-1 calico-system calico-kube-controllers-69bd6d9685-9hzf5 1/1 Running 0 19h 10.244.230.1 k8s-worker-1 <none> <none> calico-system calico-node-kzbz8 1/1 Running 0 19h 10.1.9.1 k8s-worker-1 <none> <none> calico-system csi-node-driver-qt9rd 2/2 Running 0 19h 10.244.230.5 k8s-worker-1 <none> <none> default csi-driver-lvm-controller-0 3/3 Running 0 44h 10.244.230.3 k8s-worker-1 <none> <none> default csi-driver-lvm-plugin-5jncm 3/3 Running 0 44h 10.244.230.2 k8s-worker-1 <none> <none> k8s-csi-s3 csi-s3-w28nw 2/2 Running 0 40h 10.244.230.8 k8s-worker-1 <none> <none> kadalu kadalu-csi-nodeplugin-9jwf7 3/3 Running 0 30m 10.244.230.6 k8s-worker-1 <none> <none> kube-system kube-proxy-qmqrt 1/1 Running 0 2d9h 10.1.9.1 k8s-worker-1 <none> <none> kube-system metrics-server-945fcf89c-5qkhh 1/1 Running 0 43h 10.244.230.4 k8s-worker-1 <none> <none> metallb metallb-speaker-5d7sm 4/4 Running 0 41h 10.1.9.1 k8s-worker-1 <none> <none>
Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | egrep 'k8s-worker-(2|3)' | sort -k8 calico-system calico-node-q4qc2 1/1 Running 0 19h 10.1.9.2 k8s-worker-2 <none> <none> calico-system calico-typha-ff6ff5cd8-g5cjj 1/1 Running 0 19h 10.1.9.2 k8s-worker-2 <none> <none> kube-system kube-proxy-kg4km 1/1 Running 0 45h 10.1.9.2 k8s-worker-2 <none> <none> metallb metallb-speaker-h9xvl 4/4 Running 0 41h 10.1.9.2 k8s-worker-2 <none> <none> tigera-operator tigera-operator-7f8cd97876-htz6t 1/1 Running 0 19h 10.1.9.2 k8s-worker-2 <none> <none> authelia authdb-1 1/1 Running 0 10h 10.244.140.6 k8s-worker-2 <none> <none> calico-system csi-node-driver-rdh2v 2/2 Running 0 19h 10.244.140.1 k8s-worker-2 <none> <none> cert-manager cert-manager-cainjector-84cfdc869c-trm2d 1/1 Running 0 44h 10.244.140.3 k8s-worker-2 <none> <none> cert-manager cert-manager-webhook-649b4d699f-k9szc 1/1 Running 0 44h 10.244.140.4 k8s-worker-2 <none> <none> default csi-driver-lvm-plugin-9z7km 3/3 Running 0 44h 10.244.140.2 k8s-worker-2 <none> <none> kadalu kadalu-csi-nodeplugin-wphdv 3/3 Running 0 37m 10.244.140.7 k8s-worker-2 <none> <none> k8s-csi-s3 csi-s3-vsxpp 2/2 Running 0 40h 10.244.140.11 k8s-worker-2 <none> <none> calico-system calico-node-mqcj6 1/1 Running 0 19h 10.1.9.3 k8s-worker-3 <none> <none> calico-system calico-typha-ff6ff5cd8-6kxf6 1/1 Running 0 19h 10.1.9.3 k8s-worker-3 <none> <none> kube-system kube-proxy-4dl55 1/1 Running 0 45h 10.1.9.3 k8s-worker-3 <none> <none> metallb metallb-speaker-scwxw 4/4 Running 0 41h 10.1.9.3 k8s-worker-3 <none> <none> calico-apiserver calico-apiserver-76dd5f76bd-7ltsr 1/1 Running 0 19h 10.244.69.196 k8s-worker-3 <none> <none> calico-system csi-node-driver-cmdxm 2/2 Running 0 19h 10.244.69.193 k8s-worker-3 <none> <none> cert-manager cert-manager-7bfbbd5f46-sn724 1/1 Running 0 44h 10.244.69.195 k8s-worker-3 <none> <none> default csi-driver-lvm-plugin-4dgbd 3/3 Running 0 44h 10.244.69.194 k8s-worker-3 <none> <none> k8s-csi-s3 csi-s3-k4lqs 2/2 
Running 0 40h 10.244.69.205 k8s-worker-3 <none> <none> kadalu kadalu-csi-nodeplugin-r2rjf 3/3 Running 0 37m 10.244.69.200 k8s-worker-3 <none> <none> kadalu operator-58ddcb697c-b622v 1/1 Running 0 38m 10.244.69.199 k8s-worker-3 <none> <none> kube-system metrics-server-945fcf89c-tfk4f 1/1 Running 0 43h 10.244.69.201 k8s-worker-3 <none> <none>
Let's go ahead and drain k8s-worker-1 and resize the pvcs lv there.
Code:
jon@k8s-master-1:~$ kubectl drain k8s-worker-1 node/k8s-worker-1 cordoned error: unable to drain node "k8s-worker-1" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm, cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-945fcf89c-5qkhh], continuing command... There are pending nodes to be drained: k8s-worker-1 cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-945fcf89c-5qkhh
Code:
jon@k8s-master-1:~$ kubectl drain k8s-worker-1 --ignore-daemonsets --delete-emptydir-data
node/k8s-worker-1 already cordoned
Warning: ignoring DaemonSet-managed Pods: calico-system/calico-node-kzbz8, calico-system/csi-node-driver-qt9rd, default/csi-driver-lvm-plugin-5jncm, k8s-csi-s3/csi-s3-w28nw, kadalu/kadalu-csi-nodeplugin-9jwf7, kube-system/kube-proxy-qmqrt, metallb/metallb-speaker-5d7sm
evicting pod kube-system/metrics-server-945fcf89c-5qkhh
evicting pod calico-system/calico-kube-controllers-69bd6d9685-9hzf5
evicting pod default/csi-driver-lvm-controller-0
pod/calico-kube-controllers-69bd6d9685-9hzf5 evicted
pod/csi-driver-lvm-controller-0 evicted
pod/metrics-server-945fcf89c-5qkhh evicted
node/k8s-worker-1 drained
jon@k8s-master-1:~$
Code:
jon@k8s-master-1:~$ kubectl get pods --all-namespaces -o wide | grep controller
calico-system   calico-kube-controllers-69bd6d9685-rwb99    1/1   Running   0   2m1s    10.244.69.202   k8s-worker-3   <none>   <none>
default         csi-driver-lvm-controller-0                 3/3   Running   0   119s    10.244.79.78    k8s-worker-4   <none>   <none>
ingress-nginx   ingress-nginx-controller-798796947c-6ckcg   1/1   Running   0   41h     10.244.79.80    k8s-worker-4   <none>   <none>
kube-system     kube-controller-manager-k8s-master-1        1/1   Running   0   45h     10.1.8.1        k8s-master-1   <none>   <none>
kube-system     kube-controller-manager-k8s-master-2        1/1   Running   0   2d10h   10.1.8.2        k8s-master-2   <none>   <none>
kube-system     kube-controller-manager-k8s-master-3        1/1   Running   0   2d10h   10.1.8.3        k8s-master-3   <none>   <none>
metallb         metallb-controller-5f9bb77dcd-z8vqs         1/1   Running   0   41h     10.244.79.68    k8s-worker-4   <none>   <none>
Nice, let's fix k8s-worker-1. I've shut down k8s-worker-1 and ssh'd to the Xen host it runs on. `xl list` confirms whether it's still running (it was still listed, so I `xl destroy`ed it...remember, all the important bits are on LVM, so destroying the Xen domU doesn't *remove* the data). Now we can get to work on the LVM configuration for k8s-worker-1.
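For reference, the teardown on the dom0 looks roughly like this (a sketch; the domain name matches this series' naming, adjust for your hosts):

```shell
# On the Xen dom0 hosting the worker:
xl list                    # check whether the domU is still listed
xl shutdown k8s-worker-1   # ask the guest to power off cleanly (if it responds)
xl destroy k8s-worker-1    # hard-stop it if it lingers; the LVM-backed disks survive
```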
Code:
root@xen1:~# lvdisplay /dev/system/pvcs
  --- Logical volume ---
  LV Path                /dev/system/pvcs
  LV Name                pvcs
  VG Name                system
  LV UUID                ty5IXt-RIcM-Ki5x-txuy-mj59-6yS7-fZFwWP
  LV Write Access        read/write
  LV Creation host, time xen1, 2023-12-11 10:45:57 -0600
  LV Status              available
  # open                 0
  LV Size                <239.11 GiB
  Current LE             61211
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:3

root@xen1:~# lvreduce -L 140G /dev/system/pvcs
  WARNING: Reducing active logical volume to 140.00 GiB.
  THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce system/pvcs? [y/n]: y
  Size of logical volume system/pvcs changed from <239.11 GiB (61211 extents) to 140.00 GiB (35840 extents).
  Logical volume system/pvcs successfully resized.
root@xen1:~# lvcreate /dev/system -n glusterfs -l 100%FREE
  Logical volume "glusterfs" created.
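As a sanity check on those numbers: this VG uses 4 MiB extents (61211 extents x 4 MiB comes out to just under 239.11 GiB, matching the `lvdisplay` output), so the 140 GiB target should map to exactly the extent count `lvreduce` reported:

```shell
# 140 GiB at 4 MiB per extent -> expected extent count after the lvreduce
extent_mib=4
target_gib=140
echo $(( target_gib * 1024 / extent_mib ))   # prints 35840, matching lvreduce's output
```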
With the `glusterfs` LV created, add it to the domU's disk line in its Xen config so it appears in the VM as xvdc:
Code:
disk = [ 'phy:/dev/system/worker,xvda,w', 'phy:/dev/system/pvcs,xvdb,w', 'phy:/dev/system/glusterfs,xvdc,w' ]
Code:
jon@k8s-master-1:~$ kubectl get node k8s-worker-1
NAME           STATUS                     ROLES    AGE     VERSION
k8s-worker-1   Ready,SchedulingDisabled   <none>   2d10h   v1.28.2
jon@k8s-master-1:~$ kubectl uncordon k8s-worker-1
node/k8s-worker-1 uncordoned
jon@k8s-master-1:~$ kubectl get node k8s-worker-1
NAME           STATUS   ROLES    AGE     VERSION
k8s-worker-1   Ready    <none>   2d10h   v1.28.2
Let's repeat those steps on workers 2-4 now. First, I'll check whether the authdb pod on worker 2 is the current primary db pod:
Code:
jon@k8s-master-1:~$ kubectl exec -it authdb-1 -n authelia -c postgres -- patronictl list
+ Cluster: authdb ---------+---------+------------------+----+-----------+
| Member   | Host         | Role    | State            | TL | Lag in MB |
+----------+--------------+---------+------------------+----+-----------+
| authdb-0 | 10.244.79.72 | Leader  | running          |  1 |           |
| authdb-1 | 10.244.140.6 | Replica | creating replica |    |   unknown |
+----------+--------------+---------+------------------+----+-----------+
k8s-worker-4 has a larger disk, so more pods get scheduled to it, and draining it will take a bit longer. In fact, it won't complete at all, because postgres-operator sets a Pod Disruption Budget on authdb: we can't take authdb-0 offline right now, since it's the primary and authdb-1 is broken due to the ssh-backups. Sigh...time to force delete authdb-0.
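The force delete itself is the standard kubectl escape hatch (pod name from the patronictl output above). Use it with care: it bypasses the eviction API, and therefore the PDB, entirely:

```shell
# See the PDB that's blocking the drain
kubectl get pdb -n authelia
# Delete the primary pod directly; --force --grace-period=0 skips graceful
# shutdown, and the operator's StatefulSet recreates the pod afterwards
kubectl delete pod authdb-0 -n authelia --force --grace-period=0
```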
Once everyone is back up and happy, and xvdc is present on all (with ~100G each), we can create replicated glusterfs storageclasses across the cluster:
Code:
# three replicas
kubectl kadalu storage-add glusterfs-pool-1 --type Replica3 \
    --device k8s-worker-1:/dev/xvdc \
    --device k8s-worker-2:/dev/xvdc \
    --device k8s-worker-3:/dev/xvdc
# and a single replica for demo purposes...
kubectl kadalu storage-add glusterfs-pool-2 --type Replica1 --device k8s-worker-4:/dev/xvdc
At this point you should be able to see the following:
Code:
jon@k8s-master-1:~$ kubectl get storageclass
NAME                              PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
csi-driver-lvm-linear (default)   lvm.csi.metal-stack.io   Delete          WaitForFirstConsumer   true                   45h
csi-driver-lvm-mirror             lvm.csi.metal-stack.io   Delete          WaitForFirstConsumer   true                   45h
csi-driver-lvm-striped            lvm.csi.metal-stack.io   Delete          WaitForFirstConsumer   true                   45h
csi-s3                            ru.yandex.s3.csi         Delete          Immediate              false                  42h
kadalu.glusterfs-pool-1           kadalu                   Delete          Immediate              true                   3m19s
kadalu.glusterfs-pool-2           kadalu                   Delete          Immediate              true                   78s
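If you want to confirm the new pools actually provision volumes, a throwaway PVC does the trick (the claim name and size here are just for illustration; GlusterFS-backed volumes support ReadWriteMany):

```shell
# Create a small test PVC against the replicated kadalu pool
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gluster-test
spec:
  storageClassName: kadalu.glusterfs-pool-1
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF
# With Immediate binding mode, the claim should go Bound right away
kubectl get pvc gluster-test
# Clean up when done
kubectl delete pvc gluster-test
```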
Next Steps
That was a lot to take in. Storage in Kubernetes isn't really complex; it's just diverse, with tons of options. Cutting it down to a few storageclasses keeps things simple, though, and that's what we've done in this post. As a bonus, we also covered the proper way to do node maintenance in Kubernetes and some of the gotchas that can crop up along the way.
Next time, we'll discuss cert-manager and ingress-nginx integrations for automatically requesting/using certs from LetsEncrypt.
Cheers!