On-prem kubernetes, Part 2.5

Posted 12-13-2023 at 03:58 PM by rocket357
Updated 12-27-2023 at 04:47 AM by rocket357

Posts in this series:
Project Goals and Description: Background info and goals
Preparing the installers: pxeboot configs
Installing the Xen Hosts: installing Debian/Xen dom0
Installing the K8s VMs: installing the k8s domUs
Initializing the Control Plane (this post): Bootstrapping a bare-bones HA Kubernetes Cluster
Installation/Configuration of Calico/MetalLB/ingress-nginx: Installing the CNI/Network Infrastructure
Installation/Configuration of LVM-CSI, S3-CSI, and Kadalu (GlusterFS): Installing the CSIs for Persistent Volumes
Installation/Configuration of cert-manager: Installing/Configuring cert-manager
Automating the boring bits: Installing/Configuring ArgoCD and GitOps Concepts
Authentication considerations: Installing/Configuring Authelia/Vault and LDAP/OAuth Integrations
Authentication configurations: Securing Applications with Authelia
Staying up to date: Keeping your cluster up-to-date

GitHub repo with example configuration files: rocket357/on-prem-kubernetes

Overview

It is time to begin, finally. We're heading into the world of Kubernetes, ready or not.

I'd mentioned in part 1.999999 that I'd like to keep the haproxy/keepalived instances on the control plane VMs. Due to the way I'm configuring haproxy, it turns out to be easier to have haproxy/keepalived "above" the control plane VMs, so I've gone back on my previous words and installed haproxy/keepalived on dom0. This avoids port conflicts on the control plane VMs, and it works all the same.

The changes are reflected in the GitHub repo.

HAProxy and Keepalived

If you need to, install haproxy and keepalived on the Xen hosts where your control plane domUs will run (actually, put them on whatever Xen hosts you feel like...I won't tell you how to live). Then pull the configs from the GitHub repo linked above and place them in /etc/haproxy and /etc/keepalived, respectively. Keepalived's config will require some tweaking: "unicast_src_ip" should be the IP of the Xen host keepalived is running on, and "unicast_peer" should be a list of the *other* Xen hosts where keepalived is running. Yes, this means each Xen host running keepalived gets a "unique" config =)

It goes without saying, but "virtual_ipaddress" should match on them all, too. The virtual IP should also have a DNS record that you can resolve at least locally. I used "kube-apiserver.$MYTLD", and I'll use that in the commands as I go along.
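The real configs live in the repo linked above, but here's a rough sketch of what the keepalived side looks like (the interface name and IPs below are made up, so swap in your own):

Code:
vrrp_instance kube_apiserver {
    state BACKUP
    interface eth0                 # whichever interface faces your k8s network
    virtual_router_id 51
    priority 100                   # bump this on your preferred "primary" host
    advert_int 1
    unicast_src_ip 10.0.0.2        # this Xen host
    unicast_peer {
        10.0.0.3                   # the *other* Xen hosts running keepalived
        10.0.0.4
    }
    virtual_ipaddress {
        10.0.0.10/24               # the VIP behind kube-apiserver.$MYTLD
    }
}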

At this point you should have HAProxy complaining that none of the backend hosts in kube-apiserver are up. That's a good sign, because none of the kubelets/kube-apiserver/kube-schedulers/etcd pods are up.
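The haproxy side boils down to a TCP frontend on 6443 forwarding to the kube-apiserver port on each control plane domU. Again, the real config is in the repo; this is just a minimal sketch with made-up IPs:

Code:
frontend kube-apiserver-frontend
    bind *:6443
    mode tcp
    option tcplog
    default_backend kube-apiserver

backend kube-apiserver
    mode tcp
    option tcp-check
    balance roundrobin
    server k8s-master-1 10.0.1.11:6443 check
    server k8s-master-2 10.0.1.12:6443 check
    server k8s-master-3 10.0.1.13:6443 check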

For reasons unknown to me, I had some issues with haproxy hitting the correct hosts during the init phase of setting up kubernetes, so I systemctl stop'd haproxy and keepalived on all of the hosts except for the Xen host where the first control plane domU was running. That was probably overkill, but it worked.
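If you want to take the same overkill approach, on each Xen host you're benching it's just:

Code:
sudo systemctl stop haproxy keepalived
Just remember to start them back up once the control plane is happy.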

Kubeadm Init (Build the first Kubernetes Node!)

Ok, if you've used kubeadm before, you know that something like this will bring up a single control plane kubelet:

Code:
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16
We're getting fancy and doing HA, though, so (my apologies for the overarching complexity here) you need to run it as:

Code:
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --upload-certs \
  --control-plane-endpoint=kube-apiserver.$MYTLD
:mind-blown:, right?

That's essentially the core of setting up an HA kubernetes control plane (well, the first node of it, of course). If everything goes well (it should, because you're just bootstrapping the initial node), then it's time to join the remaining control plane nodes to your cluster. If stuff broke, check that you don't still have swap enabled, because kubelet hates swap (swapoff -a, then comment the swap line out in /etc/fstab). Also ensure that your load balancer is passing traffic; that could cause a failure as well.
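If swap is the culprit, something along these lines handles both the running system and the next reboot (eyeball your /etc/fstab before letting sed loose on it):

Code:
sudo swapoff -a
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab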

Join Others to the Cluster

This is where it truly gets fun. There are a gazillion things that can go wrong when adding additional control plane nodes, so check the logs, pay attention to systemctl status or journalctl output, and don't give up. Again, kubelet hates swap, so double-check that before you run the join command, and if you're running firewalls, double-check the ports against this list (accurate for Kubernetes 1.28, to my knowledge); there's a firewall sketch just after the list:
  • Port 6443 – Kubernetes API server.
  • Ports 2379-2380 – etcd server client API.
  • Port 10250 – Kubelet API.
  • Port 10259 – kube-scheduler.
  • Port 10257 – kube-controller-manager.
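
If you do run a host firewall on the control plane domUs (I'm not assuming you do), the iptables version of that list looks roughly like this:

Code:
# control plane ports, Kubernetes 1.28
sudo iptables -A INPUT -p tcp --dport 6443 -j ACCEPT        # kube-apiserver
sudo iptables -A INPUT -p tcp --dport 2379:2380 -j ACCEPT   # etcd server client API
sudo iptables -A INPUT -p tcp --dport 10250 -j ACCEPT       # kubelet API
sudo iptables -A INPUT -p tcp --dport 10259 -j ACCEPT       # kube-scheduler
sudo iptables -A INPUT -p tcp --dport 10257 -j ACCEPT       # kube-controller-manager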

At the bottom of the kubeadm init output, you'll get a join command for workers, and if you supplied a working control-plane-endpoint to kubeadm init, you'll get another join command specifically for additional control plane nodes. Run that control plane join command (as root) on your other control plane nodes, and barring any stupidity (such as what I'm about to cover next), you'll be in business.
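For reference, the control plane flavor of that join command looks roughly like this (the token, discovery hash, and certificate key below are placeholders; use the exact values printed by your own kubeadm init):

Code:
sudo kubeadm join kube-apiserver.$MYTLD:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <certificate-key>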

We Interrupt This Blog Post For A Special Announcement

Now, not that I know anything about causing outages, but if you were for instance to perhaps do something dumb, like...I don't know...run xl create with a Xen domU config file that you copy/pasta'd from a control plane node to create your first worker node, and you forgot to update the LVM mapping in the file (which still points to your control plane node's root disk), and without realizing it you launched an automated pxeboot install to said LVM mapping, you'd end up with a very unhappy control plane node.

Don't ask how I know that. I'm totally guessing here.

Point is, schtuff happens. Sometimes you gotta shoot a node in the head and bring up a new one for $REASONS. I'm going to tell you how to accomplish that, because I'm always prepared like that (right?).

Ideally, you just re-install the domU (I wouldn't trust the LVM image that had both an installer and a running OS writing to it...just wipe it and start over!), then run a join command for the new node. Sometimes, though, that's going to hang at checking etcd's health (etcd is the "distributed, reliable key-value store" that kubernetes uses as its backing store). I learned this the fun way, but essentially if you nuke a control plane node so it can't be removed cleanly, you still have references to it in etcd. If you then attempt to re-join the node with the same name, it will assume the etcd data is still present on the (newly nuked) node, and the join command will fail. When this happens, you need to update etcd to remove the node from the key-value store. The method I used for this is etcdctl, running on my second control plane host (the first one is the one I nuked), and I essentially performed a "member remove" on the first host.

A note to the wise: before you go manually editing etcd's data, check that your surviving kube-apiserver pods aren't "cross-talking" to etcd on other hosts. Not that it'd matter in a situation like this (etcd on $NUKED_NODE is already down, amirite?), but it's good to get into the habit now. In my configuration (mostly defaults), each kube-apiserver only talks to the local etcd on the same host, so...control plane #1's kube-apiserver is dead, no worries there.
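For the curious, the etcdctl incantation looks something like this, run on a surviving control plane node (this assumes etcdctl is installed there and that you're using kubeadm's default etcd certificate paths):

Code:
# list the members so you can grab the nuked node's member ID
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list

# then remove it by member ID
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member remove <MEMBER_ID>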

Ok, once you've removed the nuked host, you should be able to run the join command again and get everyone happy once more. \o/

If you ever need to generate a new join command for your control plane hosts, you can run the following:

Code:
echo $(kubeadm token create --print-join-command) --control-plane --certificate-key $(kubeadm init phase upload-certs --upload-certs | grep -vw -e certificate -e Namespace)
(Stolen unceremoniously during a panicked reinstall of a control plane node last night from this stackoverflow answer.)

Ok, we're in business! Now just run the worker join command on your worker nodes and copy the generated /etc/kubernetes/admin.conf file somewhere kubectl can find it.
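The standard kubeadm suggestion works fine for that last bit (run it as your regular user on whatever machine you'll drive the cluster from, after copying admin.conf over to it):

Code:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
With kubectl pointed at that config, you should be able to see something like this now: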

Code:
$ kubectl get nodes
NAME           STATUS     ROLES           AGE     VERSION
k8s-master-1   NotReady   control-plane   5h59m   v1.28.2
k8s-master-2   NotReady   control-plane   19h     v1.28.2
k8s-master-3   NotReady   control-plane   19h     v1.28.2
k8s-worker-1   NotReady   <none>          18h     v1.28.2
k8s-worker-2   NotReady   <none>          6h39m   v1.28.2
k8s-worker-3   NotReady   <none>          6h32m   v1.28.2
k8s-worker-4   NotReady   <none>          175m    v1.28.2
Luckily, replacing a nuked worker node is nowhere near as complex an operation as replacing a nuked control plane node, so reinstalling and re-running the join command there should be fairly straightforward.
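If the replacement worker reuses the old name, I'd also delete the stale Node object first. Roughly (the node name below is just an example, and the worker join values come from kubeadm token create --print-join-command):

Code:
# from a machine with kubectl access
kubectl delete node k8s-worker-2

# on the freshly reinstalled worker
sudo kubeadm join kube-apiserver.$MYTLD:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>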

A Side Note for Future Reference

If you need to take a node down for maintenance, it's always good to cordon/drain it first so the workloads it's carrying get rescheduled properly onto other hosts, then shut the system down and do whatever you need to do (see the Kubernetes docs for more info on that).
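In practice that's just (node name is an example, as before):

Code:
kubectl drain k8s-worker-1 --ignore-daemonsets --delete-emptydir-data
# ...do the maintenance, then...
kubectl uncordon k8s-worker-1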

Next Steps

If you look at the output above, you'll note the nodes aren't Ready. That's because we haven't installed a CNI (Container Network Interface) yet, so we can't actually *do* anything in the cluster. We also need to get some CSI (Container Storage Interface) going for data persistence, plus a host of other fun things that I'll cover in my next blog post.

Cheers!