Achieving High Availability #805

adampetrovic · 2023-06-23T07:04:04Z

adampetrovic
Jun 23, 2023

Hi there. I am new to Kubernetes and your repo has been immensely helpful, thank you!

As part of my learning I decided to start migrating my homelab cluster to kube and am sorta building the plane while its flying.

Everything has been reasonably straight forward so far, however I am having trouble achieving a HA setup. I appreciate none of this is specific to this particular repo, but given the components are familiar with those using this template, maybe people have ideas or could learn from the outcome of this discussion.

My setup is as follows: (repo: https://github.com/adampetrovic/k8s-home)

4 physical boxes running Proxmox. Each k8s node is a Ubuntu 22.04 VM
Across those 4 boxes: 7 nodes (3 master & 4 worker). Each master is colocated with a worker on the same physical node
External services via cloudflare tunnels
Internal services have a CNAME pointing to the ingress (which is an internal IP on my network, not METALLB_INGRESS IP)

In my pursuit of learning I have tried adding a bit of chaos engineering into my testing by shutting a master node and its colocated worker node to see what happens. Whenever I kill a master node though, things break in strange ways. Presumably because I'm not running everything in an optimal fashion.

Firstly, the node enters NotReady state and takes about a minute to reach that state. Following that, my understanding is that by default, pods won't be rescheduled until 300s passes.

Looking at the typical service that relies on postgres and redis, the pods enter a 'Terminating' state forever and never get rescheduled onto another healthy node:

NAME                              READY   STATUS        RESTARTS      AGE     IP            NODE         NOMINATED NODE   READINESS GATES
cloudnative-pg-77bfc6b77f-mxwd2   1/1     Running       1 (12m ago)   7h12m   10.42.6.137   k8s-node-7   <none>           <none>
pgadmin-b4c9cc7c9-t6sb6           1/1     Running       0             5m10s   10.42.6.51    k8s-node-7   <none>           <none>
postgres-1                        1/1     Running       0             23h     10.42.6.132   k8s-node-7   <none>           <none>
postgres-2                        1/1     Terminating   0             23h     10.42.1.11    k8s-node-2   <none>           <none>
postgres-3                        1/1     Running       0             23h     10.42.3.194   k8s-node-4   <none>           <none>
redis-node-0                      3/3     Terminating   0             23h     10.42.0.66    k8s-node-1   <none>           <none>
redis-node-1                      3/3     Terminating   0             3d20h   10.42.1.170   k8s-node-2   <none>           <none>
redis-node-2                      3/3     Running       1 (10m ago)   23h     10.42.6.170   k8s-node-7   <none>           <none>

My immediate thought is to label my nodes to indicate which physical node they reside on, then ensure that pods don't 'double up' on the same physical node. Though the questions I have are:

Why do pods stay in 'Terminating' state forever and not get recreated on another node?

Judging from this stackoverflow answer, this is by design. Pods won't get rescheduled until the node is forcefully deleted. My follow question to this is, is there any way around this? In a truly HA setup, user intervention shouldn't be required in order for the system to continue working.

Why can't my apps continue speaking to postgres and redis when only postgres-2 is affected (leaving 2 other nodes) and redis-node-2 (presumably a read replica) still should work?

Thanks in advance for any advice.

Answered by onedr0p

Jun 23, 2023

Looking at the typical service that relies on postgres and redis, the pods enter a 'Terminating' state forever and never get rescheduled onto another healthy node

This is very dependent on the application and what underlaying storage is being used. Try this with a stateless service (like echo-server) and it should behave as you expect.

Why do pods stay in 'Terminating' state forever and not get recreated on another node?

That is a feature of statefulset, they will live on the node they were created on unless there is human interaction to say otherwise or the node becomes healthy again. The only other alternative is to use a deployment instead but even that will leave pods stuck in ter…

View full answer

onedr0p · 2023-06-23T12:14:14Z

onedr0p
Jun 23, 2023
Maintainer

Looking at the typical service that relies on postgres and redis, the pods enter a 'Terminating' state forever and never get rescheduled onto another healthy node

This is very dependent on the application and what underlaying storage is being used. Try this with a stateless service (like echo-server) and it should behave as you expect.

Why do pods stay in 'Terminating' state forever and not get recreated on another node?

That is a feature of statefulset, they will live on the node they were created on unless there is human interaction to say otherwise or the node becomes healthy again. The only other alternative is to use a deployment instead but even that will leave pods stuck in terminating due to using local-path. Even using rook-ceph or longhorn storage classes a deployment will still be stuck in terminating due to nuances with those CSI because they not automatically give up the volume attachment.

Why can't my apps continue speaking to postgres and redis when only postgres-2 is affected (leaving 2 other nodes) and redis-node-2 (presumably a read replica) still should work?

I don't really have a great answer here because it is pretty much up to the applications to figure out what to do, not much on kubernetes.

Basically stateful workloads in Kubernetes are still pretty painful but you will learn to live with their nuances.

1 reply

adampetrovic Jun 27, 2023
Author

Thanks for the response. I've spent some time digging deeper. Essentially it comes down to getting into the weeds of all the underlying primatives.

My takeaways:

Accurate liveness probes, without this, Kube really doesn't know what it should be doing with a prod and when.
On the above, Longhorn works fine when the pod attached to a PVC can be released as RWO volumes are then able to be reattached to a newly created pod spun up in its place. This was ultimately the problem with cloudnative-pg, the replicas weren't able to be promoted to master because they couldn't attach to their volume.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Achieving High Availability #805

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Achieving High Availability #805

adampetrovic Jun 23, 2023

Replies: 1 comment · 1 reply

onedr0p Jun 23, 2023 Maintainer

adampetrovic Jun 27, 2023 Author

adampetrovic
Jun 23, 2023

Replies: 1 comment 1 reply

onedr0p
Jun 23, 2023
Maintainer

adampetrovic Jun 27, 2023
Author