On-prem Kubernetes, Part 7

Posted 12-27-2023 at 04:44 AM by rocket357
Updated 01-08-2024 at 11:00 AM by rocket357

Posts in this series:
  1. Background info and goals
  2. pxeboot configs
  3. installing Debian/Xen dom0
  4. installing the k8s domUs
  5. Bootstrapping a bare-bones HA Kubernetes Cluster
  6. Installing the CNI/Network Infrastructure
  7. Installing the CSIs for Persistent Volumes
  8. Installing/Configuring cert-manager
  9. Installing/Configuring ArgoCD and GitOps Concepts
  10. Installing/Configuring Authelia/Vault and LDAP/OAuth Integrations
  11. Securing Applications with Authelia
  12. (this post) Keeping your cluster up-to-date

Github for example configuration files: rocket357/on-prem-kubernetes

Overview

As luck would have it, around the time I thought it might be a good idea to post here about procedures for upgrading the cluster, a new patch release came out in the Kubernetes 1.28.x series. I'm currently running 1.28.4, and 1.28.5 is available. Since a patch upgrade is essentially a lower-impact upgrade, it's a good place to start; later, when I upgrade to 1.29, I'll post again about a full minor version upgrade.

The beauty of Kubernetes is that you can perform an upgrade with minimal impact to your end users. Assuming you have sufficient spare capacity, and the applications running in your cluster are stateless (and those that are not stateless are replicated to hot standbys), you can perform the upgrade in the middle of the day. Many of the applications I'm running (such as WBO) are *not* stateless/replicated, so there will be impact whenever the worker node WBO is running on is upgraded; that's sadly unavoidable, since the application isn't written with HA in mind. For the applications we're running in HA, however, there should be minimal to no impact (i.e. reconnects/retries of requests at worst, assuming the application's remaining pods can handle the load and sessions aren't pinned to specific pods).

kubeadm has a command that outputs the upgrade plan, which outlines any manual steps that might be required for an upgrade. So on the first control plane node we'll unlock kubeadm, upgrade it to the target version (using apt), then relock it at the version we just installed. Once that's done, we'll run kubeadm upgrade plan and note the output.
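
A minimal sketch of that unlock/upgrade/relock step, assuming the nodes pull Kubernetes packages from the upstream apt repository (the exact package revision available may differ; check with apt-cache madison kubeadm):

Code:
# unlock the kubeadm package, upgrade it, and lock it again
apt-mark unhold kubeadm
apt-get update && apt-get install -y kubeadm='1.28.5-*'
apt-mark hold kubeadm

# confirm the new kubeadm version, then generate the upgrade plan
kubeadm version
kubeadm upgrade plan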

Example plan output

For my cluster, the plan looks like this:

Code:
root@k8s-master-1:~# kubeadm upgrade plan
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.28.4
[upgrade/versions] kubeadm version: v1.28.5
I1224 21:22:39.030268  105244 version.go:256] remote version is much newer: v1.29.0; falling back to: stable-1.28
[upgrade/versions] Target version: v1.28.5
[upgrade/versions] Latest version in the v1.28 series: v1.28.5

Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT       TARGET
kubelet     7 x v1.28.2   v1.28.5

Upgrade to the latest version in the v1.28 series:

COMPONENT                 CURRENT   TARGET
kube-apiserver            v1.28.4   v1.28.5
kube-controller-manager   v1.28.4   v1.28.5
kube-scheduler            v1.28.4   v1.28.5
kube-proxy                v1.28.4   v1.28.5
CoreDNS                   v1.10.1   v1.10.1
etcd                      3.5.9-0   3.5.9-0

You can now apply the upgrade by executing the following command:

        kubeadm upgrade apply v1.28.5

_____________________________________________________________________


The table below shows the current state of component configs as understood by this version of kubeadm.
Configs that have a "yes" mark in the "MANUAL UPGRADE REQUIRED" column require manual config upgrade or
resetting to kubeadm defaults before a successful upgrade can be performed. The version to manually
upgrade to is denoted in the "PREFERRED VERSION" column.

API GROUP                 CURRENT VERSION   PREFERRED VERSION   MANUAL UPGRADE REQUIRED
kubeproxy.config.k8s.io   v1alpha1          v1alpha1            no
kubelet.config.k8s.io     v1beta1           v1beta1             no
_____________________________________________________________________
As you can see, we're upgrading from 1.28.4 to 1.28.5. As this is a single patch release, there is very little in the way of manual work (but note that kubelet is listed as requiring a manual upgrade on every node!).

The upgrade procedure

At a high level, the process would be thus:
  1. unlock/update/relock kubeadm on all nodes (including workers)
  2. Apply the upgrade on the first control plane node
    1. ssh k8s-master-1
    2. kubeadm upgrade apply
    3. kubectl drain k8s-master-1 --ignore-daemonsets
    4. unlock/upgrade/relock/restart kubelet on the first control plane node (apt-mark/apt-get/systemctl; see the sketch after this list)
    5. kubectl uncordon k8s-master-1
    6. wait for everything to settle and for all nodes to be Ready again
  3. For each control plane node remaining (one at a time!):
    1. ssh k8s-master-#
    2. kubeadm upgrade node
    3. drain the control plane node being upgraded
    4. unlock/upgrade/relock/restart kubelet on that node (apt-mark/apt-get/systemctl)
    5. uncordon that node
    6. wait for everything to settle and for all nodes to be Ready again
  4. Upgrade your CNI (if required)
  5. For each worker node (one at a time, or possibly more at a time if you have enough spare capacity):
    1. kubeadm upgrade node
    2. drain the worker node(s) being upgraded (run this from somewhere with admin kubectl access, e.g. a control plane node)
    3. unlock/upgrade/relock/restart kubelet on the worker node(s) being upgraded (apt-mark/apt-get/systemctl)
    4. uncordon the worker node(s)
    5. wait for everything to settle and for all nodes to be Ready again
  6. Perform final app checks and resolve as needed
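
The per-node kubelet steps referenced above look the same on control plane and worker nodes. Here's a rough sketch for a single worker; the hostname k8s-worker-1 is a placeholder, and the package revision pattern is an assumption about your apt repo:

Code:
# on the worker itself (after upgrading kubeadm there): upgrade the node's local config
kubeadm upgrade node

# from a machine with admin kubectl access: drain the worker
kubectl drain k8s-worker-1 --ignore-daemonsets

# back on the worker: unlock, upgrade, relock, and restart the kubelet
apt-mark unhold kubelet kubectl
apt-get update && apt-get install -y kubelet='1.28.5-*' kubectl='1.28.5-*'
apt-mark hold kubelet kubectl
systemctl daemon-reload && systemctl restart kubelet

# from the kubectl machine again: allow pods to schedule on the worker once more
kubectl uncordon k8s-worker-1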

Running the upgrade - control plane first!

Per the plan output, to apply the upgrade we simply need to run kubeadm upgrade apply v1.28.5 on a single master node. I'll use k8s-master-1 (unless there is an operational issue, I always start with -1 for consistency...and honestly, if there's an operational issue, we shouldn't be upgrading the cluster until everything is healthy!). Once that is done on k8s-master-1, we need to drain k8s-master-1, upgrade the kubelet to 1.28.5, verify the kubelet is up and running (systemctl status kubelet), and then uncordon k8s-master-1. Once that's complete and k8s-master-1 shows Ready status, we can run kubeadm upgrade node on k8s-master-2, and then finally on k8s-master-3.
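
Once k8s-master-1 is uncordoned, a quick way to confirm it has rejoined and is reporting the upgraded kubelet (the VERSION column tracks the kubelet version):

Code:
# verify all nodes are Ready and check the reported kubelet versions
kubectl get nodes -o wide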

The upgrade node procedure is basically the same thing as the upgrade apply we ran on k8s-master-1, minus the etcd bits. In other words, kubeadm upgrade node performs a single-node software upgrade (minus the kubelet, which we'll upgrade via apt afterwards; this is also the procedure we'll be using shortly on the workers), whereas kubeadm upgrade apply, in addition to upgrading the first node, also applies any API changes between versions and ensures any other etcd data that needs to be updated is updated. Since etcd is replicated, we don't need to "run the upgrade" again on k8s-master-2 and k8s-master-3; we just need to upgrade their software versions, drain, upgrade/restart kubelet, and finally uncordon them. Very important, however: do this one node at a time. Kubernetes (and really etcd, as the underlying datastore) relies on raft, a distributed consensus protocol, for data consistency. If you take two out of three masters down at once in a system using raft, the remaining master cannot get a majority quorum, so it cannot guarantee data consistency and API calls will fail. That aside, the etcd bits are already handled by the upgrade apply step, so all we have to do for -2 and -3 is a node upgrade.
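
A quick sanity check I like to run between control plane nodes; this assumes kubeadm regenerated the static pod manifests in kube-system as usual, so the control plane pods should be reporting the new image tags:

Code:
# list each kube-system control plane pod and its first container image so you can eyeball the version tags
kubectl get pods -n kube-system -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[0].image' | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd'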

Upgrade your CNI

Calico's upgrade, since we utilized the helm chart for the tigera-operator, is simply:
  1. Upgrade the operator (note the version embedded in the URL! Change as required!)
    1. kubectl apply --server-side --force-conflicts -f https://raw.githubusercontent.com/pr...ator-crds.yaml
    2. helm upgrade calico projectcalico/tigera-operator
  2. Upgrade the CNI (note the version embedded in the URL! Change as required!)
    1. curl https://raw.githubusercontent.com/pr...-operator.yaml -O
    2. kubectl replace -f tigera-operator.yaml
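
Before moving on, I like to watch the operator and the Calico pods roll back out. This assumes the helm chart installed the operator with the default deployment name in the tigera-operator namespace, and that the operator manages the Calico pods in calico-system (adjust if your install differs):

Code:
# wait for the tigera-operator deployment to finish rolling out
kubectl rollout status deployment/tigera-operator -n tigera-operator

# watch the Calico pods cycle back to Running
kubectl get pods -n calico-system -w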

Once this is done, it's a good idea to perform a few connectivity tests to your applications (once everything has settled back down) to ensure your CNI is still operational.
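
Nothing fancy is required for those connectivity tests; a throwaway pod that can reach a known in-cluster service is usually enough. The service name and namespace below are placeholders:

Code:
# spin up a temporary pod and hit a known cluster-internal service (placeholder names)
kubectl run cni-check --rm -it --restart=Never --image=busybox:1.36 -- \
  wget -qO- http://some-service.some-namespace.svc.cluster.local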

Upgrade the workers

Now we're into the meat and potatoes of the upgrade. No one (besides the kube admin) cares if kubelet is operational, or if kube-apiserver is HA; end users care if the applications are available. The worker nodes run the applications within the cluster, so at this point we're hitting potential impact to end users (if you are the end user and you don't care about some downtime, there are far fewer considerations for worker upgrades). When you drain a node, Kubernetes evicts the pods on that node; their controllers reschedule replacement pods onto other workers, and traffic shifts to the new pods as they become Ready. Once that is done, you can upgrade/restart kubelet on the drained worker and uncordon it so pods can be scheduled there once again.

When you move to the next worker, it's likely that many of its pods will be rescheduled onto the worker you just upgraded. Scheduling still honors pod affinity/antiAffinity, of course, so you don't end up in a situation where both the primary and the replica of a database land on the same worker, or too many pods of a given HA application end up on the same worker, which could cause an application outage if that worker went down. There are also PodDisruptionBudgets to consider: they control how many pods of a given deployment/statefulset/replicaset are allowed to be offline at a given time, which is incredibly useful for maintaining HA/capacity during an upgrade, but can also cause a few issues during worker upgrades if a PDB would be violated by evicting a given pod.
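
On that last point, it's worth checking which PDBs exist before you start draining workers, and for an HA app that doesn't ship one, a simple PDB is cheap insurance. The app name, namespace, and label selector below are placeholders:

Code:
# list existing PodDisruptionBudgets so you know what might slow down or block a drain
kubectl get poddisruptionbudgets -A

# example: keep at least one pod of a (hypothetical) HA app available during drains
kubectl create poddisruptionbudget my-app-pdb -n my-namespace \
  --selector=app=my-app --min-available=1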

Once you've worked through all of the worker node upgrades, you can perform your final application checks and resolve issues as needed. It's especially important to check the stateful applications, such as our DBs, to ensure replication came back up successfully (discovering broken replication during your *next* rolling upgrade would SUCK, would it not?). For postgres-operator/spilo pods, you would want to do something like:

Code:
root@k8s-master-2:~# kubectl exec -it authdb-0 -n authelia -- patronictl list
Defaulted container "postgres" out of: postgres, date (init)
+ Cluster: authdb ---------+---------+--------------+----+-----------+
| Member   | Host          | Role    | State        | TL | Lag in MB |
+----------+---------------+---------+--------------+----+-----------+
| authdb-0 | 10.244.79.118 | Replica | start failed |    |   unknown |
| authdb-1 | 10.244.140.44 | Leader  | running      |  7 |           |
+----------+---------------+---------+--------------+----+-----------+
Oof...good thing we checked. The good news is that authelia doesn't do anything stupid like route SELECTs to replicas ONLY (I've seen applications that do stuff like that to "pre-optimize" against an overloaded primary, which only serves to cause an outage if your replicas are unavailable...SMH). At this point we just troubleshoot to figure out why the replica failed: it could be something simple, like a WAL file failing to apply, or something more complex, like a diverged timeline. If you want to "sledgehammer" the issue, though, just take a fresh backup on the primary, then either reinit the replica or, more aggressively, completely nuke the replica AND its PVC and let the statefulset bring up a fresh replica (which will sync from the latest backup before streaming replication kicks back in from the primary). Note that this approach works best for small datasets. If you have several hundred GB, or even TB, to restore from backup, it raises the question of why you're running that dataset inside Kubernetes at all...and if you're at that point, I can only assume you know how to recover from a corrupted WAL file or diverged timelines.
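
For the sledgehammer path on a postgres-operator/spilo cluster, the commands look roughly like this. The PVC name is whatever your statefulset actually created (check with kubectl get pvc -n authelia), so treat pgdata-authdb-0 as a placeholder:

Code:
# option 1: ask Patroni to re-initialize the broken replica in place
kubectl exec -it authdb-1 -n authelia -- patronictl reinit authdb authdb-0

# option 2 (more aggressive): delete the replica's PVC and pod, then let the
# statefulset recreate both and resync from the latest backup
kubectl delete pvc pgdata-authdb-0 -n authelia --wait=false   # finalizer holds it until the pod is gone
kubectl delete pod authdb-0 -n authelia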

The authelia database is sitting at ~850 MB, so sledgehammer is the appropriate approach.

Next Steps

I haven't decided what to post next. If you have any thoughts, hit me up on mastodon (@jon404@ioc.exchange) or comment here if you'd prefer. If I don't get any suggestions, I'll probably do a major version upgrade and some application-specific stuff in a future post.

Cheers!