The design of Kubernetes has its origins at Google as a platform for running microservices. It does support batch jobs, but they do not get the same level of attention as long-running services. The main difference between a bare Pod and a Job is that a Job's Pods are expected to run to completion rather than be restarted indefinitely.
The Pod abstraction assumes that the CPU and memory usage of the processes running inside a Pod are predictable and fairly constant. When a Pod’s memory usage is larger than its allocation, the assumption is that there is a bug or memory leak and the process should be killed.
For Pods the default resource allocation is “Best Effort”. When no CPU or memory requests or limits are defined, the Pod falls into the BestEffort QoS class: it has no resource guarantees and is the first to be evicted when a node runs short of memory. During my initial explorations with K8s I created 10 Best Effort Jobs that compiled software. With no requests declared, the scheduler had nothing to spread them out by, and it started all the Pods on a single Node. The compile jobs quickly used up all available memory on the node and K8s started killing the Pods seemingly at random.
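For reference, the Jobs in that experiment looked roughly like the sketch below (the name, image, and build command are placeholders, not the actual manifest). With no resources section at all, the Pod is classified as BestEffort:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: compile-job                # hypothetical name
spec:
  backoffLimit: 0                  # do not retry a failed build
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: build
        image: example.com/builder:latest   # placeholder image
        command: ["make", "-j8"]            # placeholder build command
        # no resources: block, so the QoS class is BestEffort
```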
This behavior makes sense when the Pods are stateless, part of a Deployment, and can easily be moved to other nodes. Of course a compile job can just be restarted, but it is wasteful to throw away the work in progress. In this case “Best Effort” doesn’t really seem appropriate for running batch jobs.
Using Pod anti-affinity with the node hostname and/or labels, the K8s scheduler can be made to spread the Pods out over a set of Nodes. This way, if the nodes have enough resources, “Best Effort” Pods have a better chance of not exhausting any single node. But it isn’t a guarantee, and it is still difficult to predict whether a Job/Pod will finish.
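A Pod template fragment along these lines is what I have in mind; the app: yocto-build label is an assumption, and a preferred (soft) rule could be used instead of the required one shown here:

```yaml
metadata:
  labels:
    app: yocto-build                          # hypothetical label
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname   # spread by node hostname
        labelSelector:
          matchLabels:
            app: yocto-build                  # keep build Pods on separate nodes
```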
If “Best Effort” isn’t appropriate, then the next step is to give each Job CPU and memory requests and limits. The question then becomes: what should those values be? A Burstable Pod has a request (the lower bound used for scheduling) that is smaller than its limit; a Guaranteed Pod has requests equal to its limits. Because the limits on a node can add up to more than the node actually has, a Burstable Pod allows a form of resource overcommitment. If the processes in the Pod ever try to allocate more memory than the limit, the allocation will fail and K8s will kill the Pod.
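As a sketch, the two QoS classes look like this in a container spec (the numbers are arbitrary):

```yaml
# Burstable: requests lower than limits
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 8Gi

# Guaranteed: requests equal to limits
# (if only limits are set, the requests default to the same values)
resources:
  limits:
    cpu: "4"
    memory: 8Gi
```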
The two default Pod resources, CPU and memory, behave differently when they are exhausted. CPU overcommitment is handled by the kernel scheduler, which divides CPU time according to its scheduling policy; a Pod that attempts to use more CPU time than it was allocated is simply throttled. CPU is therefore considered a compressible resource.
Memory is different: if there isn’t any memory available, attempted allocations will fail. Memory is considered an incompressible resource, and there isn’t a way to throttle a process’s memory usage. The only option is swap, which can allow allocations to proceed but comes with problems like thrashing. With containers it gets even more complicated because, by default, the kernel does not account for swap in a container’s memory usage.
So the only way to prevent a Pod from being killed when it uses too much memory is to set the memory limit high enough. This is where the predictability of microservices makes figuring out a maximum memory limit much easier.
But what about large and unpredictable Yocto builds? Yocto provides nearly infinite configuration options and supports two forms of build parallelization: parallel make/ninja jobs within a recipe and parallel package (task) builds. These options speed up the build but make the package build order non-deterministic. The shared state (sstate) cache can speed up builds further, but it makes predicting memory usage even more difficult because it is impossible to know ahead of time which packages will be rebuilt.
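For reference, those two knobs are set in conf/local.conf; the values below are arbitrary examples:

```
# conf/local.conf (example values)
BB_NUMBER_THREADS = "8"   # how many BitBake tasks (packages) run in parallel
PARALLEL_MAKE = "-j 8"    # parallelism passed to make/ninja within each task
```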
What about using swap to temporarily give more memory to a Pod that has used all of its allocation? K8s does not support swap and requires that swap be disabled. As far as I can tell the main issue is how to account for swap memory. Swap cannot simply be added to the memory of the machine because it is much slower. The current kube tools do not track swap usage, and the kernel does not enable swap accounting by default for performance reasons. Swap can improve performance by paging unused memory out to disk and giving applications more RAM. However, a Pod using swap could thrash the entire node, causing failures in Pods that are not using too many resources.
There is work underway to add alpha swap support to K8s 1.22, but it has many limitations and may be restricted to “Best Effort” Pods. Exactly how swap will be managed and accounted for is still an open question. At some point swap may be part of the solution, but it isn’t feasible now.
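From what I can tell, the 1.22 alpha is gated behind the NodeSwap feature gate and kubelet settings along these lines; this is only a sketch of the alpha as it stands, and the details may well change:

```yaml
# KubeletConfiguration fragment (alpha, subject to change)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NodeSwap: true
failSwapOn: false            # let the kubelet start with swap enabled
memorySwap:
  swapBehavior: LimitedSwap  # or UnlimitedSwap
```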
Without a way to make memory compressible, there are only workarounds and tradeoffs.
It is the combination of the following:

- memory is an incompressible resource and swap is not available,
- Yocto build memory usage is large and hard to predict, and
- K8s kills any Pod that exceeds its memory limit

that makes this a very difficult problem. The only “proper” solution would be to make the builds have more predictable memory usage. This would require a feedback mechanism in make/ninja/bitbake to adjust the number of running processes based on the current container memory usage, along the lines of the sketch below.
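To be clear, nothing like this exists in make, ninja, or bitbake today. As a very rough sketch of the idea, a build tool could read the container’s cgroup memory usage and limit and scale its parallelism down as headroom disappears; the cgroup file paths below cover cgroup v2 and v1, and the scaling policy is entirely made up:

```python
#!/usr/bin/env python3
"""Rough sketch: choose a parallel job count based on how close the
container is to its cgroup memory limit. Purely illustrative."""
from pathlib import Path


def read_first(paths):
    # Return the first readable integer value among the given cgroup files.
    for p in paths:
        f = Path(p)
        if f.exists():
            value = f.read_text().strip()
            if value != "max":          # cgroup v2 reports "max" when unlimited
                return int(value)
    return None


def suggested_jobs(max_jobs=8):
    used = read_first([
        "/sys/fs/cgroup/memory.current",                  # cgroup v2
        "/sys/fs/cgroup/memory/memory.usage_in_bytes",    # cgroup v1
    ])
    limit = read_first([
        "/sys/fs/cgroup/memory.max",
        "/sys/fs/cgroup/memory/memory.limit_in_bytes",
    ])
    if used is None or limit is None:
        return max_jobs               # no limit visible: keep full parallelism
    headroom = 1.0 - used / limit
    # Scale the job count down as memory headroom shrinks (arbitrary policy).
    return max(1, int(max_jobs * headroom))


if __name__ == "__main__":
    print(suggested_jobs())
```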