Kubernetes Upgrade via kOps

There are a few different ways to install/configure/upgrade and maintain a Kubernetes cluster. This post reviews how to upgrade a Kubernetes installation that is installed and maintained via kOps. It will also go through troubleshooting the deployment, to take some of the “mystery” out of how kOps performs these functions.

https://kops.sigs.k8s.io/

https://github.com/kubernetes/kops

Prepare for Upgrade

Before upgrading, the cluster must be reviewed and validated to ensure that the newer version of Kubernetes will not break the current configuration. I recommend the following:

Update Kops config

Before kops can be upgraded, the kops configuration file needs to be reviewed and updated. Version updates often include breaking changes. These items need to be reviewed and updated via the kops config file, to ensure that the upgrade works correctly.

kops release notes – https://kops.sigs.k8s.io/releases/1.30-notes/
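Before editing anything, it can help to export the current cluster spec and review it against the release notes. A minimal sketch, assuming the cluster state lives in an S3 state store:

# Export the current cluster spec so it can be reviewed against the release notes
kops get cluster --name <cluster-name> --state s3://<cluster-state-store> -o yaml > kops_config.yaml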

Upgrade the kops version

The version of the kops-controller running on the cluster is entirely dependent on the version of the kops binary used to run the kops cluster update commands. This can cause a lot of problems if kops cluster upgrades aren’t managed effectively. If two different engineers run upgrades using different versions of the binary, the version of the kops-controller can go back and forth, potentially causing a lot of issues.
One potential solution is to run all cluster upgrades through a Continuous Deployment process. This forces all changes to be reviewed and run via one standardized system, for example by pinning the kops binary version in the pipeline (see the sketch below).
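As an illustration, a pipeline step might install a pinned kops release so every run uses the same binary. A minimal sketch; the version number is only an example:

# Install a pinned kops release so every pipeline run uses the same binary
KOPS_VERSION=v1.30.0
curl -Lo kops "https://github.com/kubernetes/kops/releases/download/${KOPS_VERSION}/kops-linux-amd64"
chmod +x kops
sudo mv kops /usr/local/bin/kops
kops version   # confirm the pinned version is in use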

Updating kops is fairly straightforward.

  1. Run kops upgrade on the cluster, using a newer version of the kops binary than the one currently running (see the sketch after this list).
  2. Once the upgrade is complete, the control-plane nodes need to be rolled.
Definition: A roll is when each node is restarted one at a time in an orderly fashion. Each node is validated before the next node is restarted.
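
A minimal sketch of step 1, assuming the newer kops binary is already installed and on the PATH:

# Preview the changes the newer kops binary wants to make to the cluster spec
kops upgrade cluster --name <cluster-name> --state s3://<cluster-state-store>

# Apply the changes to the cluster spec in the state store
kops upgrade cluster --name <cluster-name> --state s3://<cluster-state-store> --yes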

Apply Kubernetes Version Changes to the Cluster

The following line needs to be updated in the kops config file:

kubernetesVersion: <New Kubernetes Version>
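
For context, the field sits under spec in the kops cluster manifest. A minimal sketch; names and version are placeholders:

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: <cluster-name>
spec:
  kubernetesVersion: <New Kubernetes Version>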

Once updated, the kops config needs to be pushed to the cluster state file, then the cluster must be updated.

# Update Cluster State
kops replace -f kops_config.yaml --state s3://<cluster-state-store>

# Update the Cluster
kops update cluster --name <cluster-name> --yes --state s3://<cluster-state-store>
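
Before rolling any nodes, it can be useful to validate the cluster. A minimal sketch:

# Validate the cluster; --wait polls until validation passes or the timeout is hit
kops validate cluster --name <cluster-name> --state s3://<cluster-state-store> --wait 10m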

Roll the Control Plane Nodes

Once the cluster has been updated, all of the nodes in the cluster must be restarted. On restart the nodes will be bootstrapped using the updated Kubernetes binaries. The control-plane nodes need to be restarted first.

# This will list all of the instance groups, and whether or not they need to be rolled
kops rolling-update cluster --name <cluster-name> --state s3://<cluster-state-store> 

# This will list only the specified instance groups, and whether or not they need to be rolled
# The instance groups are defined in your cluster configuration. In this case all of the control-plane nodes are in three separate instance groups, to isolate them from the other nodes.
kops rolling-update cluster --name <cluster-name> --state s3://<cluster-state-store> --instance-groups <control-plane-1>,<control-plane-2>


# The --yes flag will actually perform the restart of the specified instance groups.
kops rolling-update cluster --name <cluster-name> --state s3://<cluster-state-store> --instance-groups <control-plane-1>,<control-plane-2> --yes
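
Once the control-plane instance groups have been rolled, the new Kubernetes version should show up on those nodes. A minimal sketch:

# The VERSION column should show the new Kubernetes version on the rolled nodes
kubectl get nodes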

Roll the rest of the Nodes

# This will list all of the instance groups, and whether or not they need to be rolled
kops rolling-update cluster --name <cluster-name> --state s3://<cluster-state-store> 

# This will restart all of the nodes that need to be restarted
kops rolling-update cluster --name <cluster-name> --state s3://<cluster-state-store> --yes

Troubleshooting Rolling Nodes

In the background, the

kops rolling-update cluster --name <cluster-name> --state s3://<cluster-state-store> --yes

command follows this general approach:

  • Choose a node that is in the Needs Update state
  • Drain the node from the cluster
    • This can potentially cause a lot of problems. Namely, if a pod is evicted, an application can go down if it is not configured correctly. Some workloads can tolerate being offline for a few seconds or minutes, so this is not always an issue. Where it is an issue, pod disruption budgets (PDBs) need to be created (see the example after this list).
  • Terminate the node
  • Allow the kops controller to notice the node is missing, and to create a new one.
  • The new node will be bootstrapped with the updated kubernetes binaries
  • The new node is validated then added to the cluster.
  • Then the rolling-restart will move to the next node
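
As an illustration, here is a minimal PodDisruptionBudget that keeps at least one replica of a hypothetical my-app workload available while nodes are drained; all names are placeholders:

# Keep at least one my-app pod running during voluntary disruptions such as node drains
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: <namespace>
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app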

Notes:

kops rolling-update has many other flags to allow it to behave differently than the explanation above.
If a pod can’t be evicted, the upgrade will get stuck at that point. Eventually the node will be skipped and the next node will be rolled.
If a node doesn’t start up correctly, the process will hang at that point. The validation will never succeed if the node doesn’t start up. This will cause the command to time out without rolling all the nodes.
Rolling update can be cancelled at any time. If you cancel after a node has been terminated, the kops-controller will automatically create a new one.
If pods can’t be terminated, the deletion may need to be forced via:

kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

Be careful with this. It will vaporize the pod, and could potentially bring down a running application.
If pods get stuck in the Pending state for too long, it may mean that the upgrade has caused issues with a pod’s taints/tolerations, and that pods are not starting up correctly.

This could cause many nodes to restart while the applications do not run effectively, because too many pods are down. Again, setting appropriate PDBs for critical applications is very important to prevent outages when restarting nodes.

kubectl get pods -A | grep Pending

Pods should not stay in this state for more than a few minutes.
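
To dig into why a pod is stuck in Pending, describing it will show the scheduler’s events; names are placeholders:

# The Events section at the bottom explains why the pod cannot be scheduled
kubectl describe pod <pod-name> -n <namespace>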
