Architecting Workflows For Reliability

“Kubernetes Will Kill Your Pod”
Kubernetes is designed for stateless, scalable web applications: apps where, if one process dies, another process can be dropped in its place. Kubernetes makes one promise: it will kill your pods.
Dear user,
I will kill your pod:
* If I want the node for something more important.
* If I’m draining the node, or scaling down a cluster.
* If it runs out of memory (because you got the config wrong).
* If I overcommitted.
* Hardware failure (computer catches fire).
* Kernel panic.
* Absolutely any reason I feel like. I’m sorry, I am who I am.
All the best,
Kubernetes xx
Kubernetes expects applications built on it to be tolerant of disruption, both voluntary and involuntary, so apps must be designed with that in mind.
Let’s think about how a deployment deals with this. If a pod is deleted, the deployment will simply create another pod.
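To make that concrete, here is a minimal Deployment sketch (the name and image are illustrative): because the spec asks for a fixed number of replicas, the controller notices a deleted pod and creates a replacement.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web          # illustrative name
spec:
  replicas: 3              # the controller keeps three pods running at all times
  selector:
    matchLabels:
      app: hello-web
  template:
    metadata:
      labels:
        app: hello-web
    spec:
      containers:
      - name: web
        image: nginx:alpine
        ports:
        - containerPort: 80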
What About A Workflow?
By default, if a workflow pod is deleted, the task is marked as failed, and the workflow fails. This can be a big problem, e.g.:
- The task was running for a very long time, e.g. a Spark job.
- The task could be costly, e.g. using a lot of expensive GPU time.
You can retry the workflow, but there’s no guarantee another disruption won’t happen.
Retry Strategy
Your first defence against pod deletion is a retry strategy. The most important detail here is that your task needs to be “retryable”, i.e. if it is interrupted, it can be run again just fine. Often this means it is idempotent, i.e. running it again produces exactly the same result.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-container-
spec:
  entrypoint: retry-container
  templates:
  - name: retry-container
    retryStrategy:
      limit: "10"
    container:
      image: python:alpine3.6
      command: ["python", "-c"]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
Many teams use activeDeadlineSeconds to ensure that tasks fail quickly if something is wrong, so think about how retryStrategy, activeDeadlineSeconds, and backoff interact.
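For example, a template along these lines (the limits and durations below are illustrative, not recommendations) combines a per-attempt deadline with a capped exponential backoff:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-backoff-
spec:
  entrypoint: flaky
  templates:
  - name: flaky
    activeDeadlineSeconds: 300   # each attempt is killed if it runs longer than 5 minutes
    retryStrategy:
      limit: "5"
      retryPolicy: "Always"      # retry on failures and errors, including pod deletion
      backoff:
        duration: "10s"          # wait 10s before the first retry
        factor: "2"              # double the wait on each subsequent retry
        maxDuration: "5m"        # stop retrying after 5 minutes in total
    container:
      image: python:alpine3.6
      command: ["python", "-c"]
      args: ["import random, sys; sys.exit(random.choice([0, 1, 1]))"]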
Adding retry logic to the code called in a task is often faster than waiting for Argo to create a new pod. If you have a Golang task, you can use https://pkg.go.dev/k8s.io/client-go/util/retry#OnError, for example.
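The same idea works in any language. As a rough sketch, using a script template in the style of the earlier examples (the flaky function here just simulates a transient failure), transient errors are retried in-process with a small backoff before the task ever fails the pod:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-in-code-
spec:
  entrypoint: retry-in-code
  templates:
  - name: retry-in-code
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random, sys, time

        def flaky():
            # stands in for a call that fails transiently about half the time
            if random.random() < 0.5:
                raise RuntimeError("transient failure")

        for attempt in range(5):
            try:
                flaky()
                sys.exit(0)
            except RuntimeError:
                time.sleep(2 ** attempt)  # in-process exponential backoff
        sys.exit(1)  # give up and let Argo's retryStrategy take over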
This is even more effective if your tasks are short-running, less than 10 minutes. If you break large tasks into small tasks, the impact of a pod deletion is smaller.
Another best practice is work avoidance, i.e. writing your tasks in such a way that, if they are retried, they can finish faster.
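One common pattern is to record completed work and skip it on retry. A rough sketch, assuming a pre-existing PVC (named work-pvc here purely for illustration) that holds a marker file:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: work-avoidance-
spec:
  entrypoint: avoid-work
  templates:
  - name: avoid-work
    container:
      image: alpine:latest
      command: [sh, -c]
      # skip the expensive step if a previous attempt already finished it
      args: ["[ -f /work/done ] && echo 'already done' || (echo 'doing expensive work' && sleep 60 && touch /work/done)"]
      volumeMounts:
      - name: work
        mountPath: /work
  volumes:
  - name: work
    persistentVolumeClaim:
      claimName: work-pvc   # assumes an existing PVC; the name is illustrative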
If you use a resource template, you can use retry with it too: if the underlying pod that creates the resource is deleted, the retry simply tries to create the same resource again (see the sketch after the list below). For this to work:
- Make sure your resource has a deterministic name, e.g. my-resource-{{workflow.name}}.
- Use the “apply” action instead of the “create” action.
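A minimal sketch, using a ConfigMap as a stand-in for whatever resource you actually create:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-resource-
spec:
  entrypoint: create-config
  templates:
  - name: create-config
    retryStrategy:
      limit: "5"
    resource:
      action: apply            # "apply" is safe to repeat; "create" would fail on retry
      manifest: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: my-resource-{{workflow.name}}   # deterministic name, so retries target the same object
        data:
          hello: world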
Learn about resource templates
Reducing The Chance Of Pod Deletion
The first tool you might try is a pod disruption budget. This protects your pods from voluntary disruptions, such as a node drain or a cluster scale-down.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: default-pdb-support-
spec:
  entrypoint: pdbcreate
  serviceAccountName: default
  podDisruptionBudget:
    minAvailable: "9999" # Provide an arbitrarily big number if you don't know how many pods the workflow creates
  templates:
  - name: pdbcreate
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 10"]
But remember, this cannot prevent involuntary disruptions, such as hardware failure.
Learn more about pod disruption budgets
You can also try cost-optimising your workflows to use fewer resources. This reduces the chance of pods being deleted because a node runs out of resources, and your workflow will cost less too.
Learn about workflow cost optimisation
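As a rough sketch, right-sizing requests and limits on a task looks like this (the values are illustrative; measure your own tasks before picking numbers):
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: right-sized-
spec:
  entrypoint: small-task
  templates:
  - name: small-task
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 10"]
      resources:
        requests:
          cpu: 100m          # ask only for what the task actually needs
          memory: 64Mi
        limits:
          memory: 128Mi      # a realistic limit avoids OOM kills caused by a wrong config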
We’ll soon have a blog post in collaboration with Kubecost that will go into more depth on this.
Summary
A workflow is like a small Kubernetes app, and by keeping this in mind we can design our workflows to be robust to the kinds of disruption that any Kubernetes app is subject to. By using retry strategies, work avoidance, and pod disruption budgets, we can write robust workflows that only fail if the tasks themselves fail.