Deep PostgreSQL Thoughts: The Linux Assassin

Joe Conway
PostgreSQL

If you run Linux in production for any significant amount of time, you have likely run into the "Linux Assassin", that is, the OOM (out-of-memory) killer. When Linux detects that the system is using too much memory, it will identify processes for termination and, well, assassinate them. The OOM killer has a noble role in ensuring a system does not run out of memory, but this can lead to unintended consequences.

For years the PostgreSQL community has made recommendations on how to set up Linux systems to keep the Linux Assassin away from PostgreSQL processes, which I will describe below. These recommendations carried forward from bare metal machines to virtual machines, but what about containers and Kubernetes?

Below is an explanation of experiments and observations I've made on how the Linux Assassin works in conjunction with containers and Kubernetes, and methods to keep it away from PostgreSQL clusters in your environment.

Community Guidance

The first PostgreSQL community mailing list thread on the topic is circa 2003, and the first commit is right about the same time. The exact method suggested to skirt the Linux OOM killer has changed slightly since that time, but it was, and still is, to avoid memory overcommit, in recent years by setting vm.overcommit_memory=2.

Avoidance of memory overcommit means that when a PostgreSQL backend process requests memory and the request cannot be met, the kernel returns an error which PostgreSQL handles appropriately. Therefore, although the offending client then receives an error from PostgreSQL, importantly the client connection is not killed, nor are any other PostgreSQL child processes (see below).
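As a quick illustration, the following Python sketch reports which overcommit mode a host is running with. The function names are made up for this example; the file path and the meaning of the three modes come from the Linux kernel's sysctl documentation.

```python
from pathlib import Path

# Meanings of vm.overcommit_memory as documented for the Linux kernel.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(raw: str) -> str:
    """Translate the raw contents of the sysctl file into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def follows_postgres_guidance(raw: str) -> bool:
    """True only for mode 2, the setting the community guidance recommends."""
    return int(raw.strip()) == 2


if __name__ == "__main__":
    path = Path("/proc/sys/vm/overcommit_memory")
    if path.exists():  # the file is only present on Linux
        raw = path.read_text()
        verdict = "ok" if follows_postgres_guidance(raw) else "not recommended"
        print(f"{describe_overcommit(raw)}: {verdict}")
```

Run on a database host, this gives a one-line answer to "are we following the guidance or not".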

In addition, or when that is not possible, the guidance specifies setting oom_score_adj=-1000 for the parent "postmaster" process via the privileged startup mechanism (e.g. service script or systemd unit file), and resetting oom_score_adj=0 for all child processes via two environment variables that are read during child process startup. This ensures that should the OOM killer need to reap one or more processes, the postmaster will be protected, and the most likely candidate to get killed will be a client backend. That way the damage can be minimized.
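To make the child-process half of that concrete, here is a rough Python imitation of what happens at child startup. The variable names PG_OOM_ADJUST_FILE and PG_OOM_ADJUST_VALUE are taken from the PostgreSQL documentation rather than from this article; the function itself is a hypothetical stand-in (the real logic lives in PostgreSQL's C code), and the demonstration writes to a scratch file instead of the real /proc/self/oom_score_adj.

```python
import os
import tempfile


def reset_child_oom_score_adj(environ=os.environ) -> bool:
    """Mimic what a freshly forked child does with the two variables.

    Returns True if a value was written, False if PG_OOM_ADJUST_FILE is unset.
    """
    adjust_file = environ.get("PG_OOM_ADJUST_FILE")
    if adjust_file is None:
        return False
    # Per the PostgreSQL docs, zero is assumed when the value is not set.
    value = environ.get("PG_OOM_ADJUST_VALUE", "0")
    with open(adjust_file, "w") as f:
        f.write(value)
    return True


if __name__ == "__main__":
    # Demonstrate against a scratch file rather than the real /proc entry.
    with tempfile.TemporaryDirectory() as tmp:
        fake = os.path.join(tmp, "oom_score_adj")
        env = {"PG_OOM_ADJUST_FILE": fake, "PG_OOM_ADJUST_VALUE": "0"}
        reset_child_oom_score_adj(env)
        with open(fake) as f:
            print("child oom_score_adj is now", f.read())
```

In a real deployment the postmaster is started with oom_score_adj=-1000 by the privileged startup mechanism, and the two variables are exported in that same startup environment so every child puts itself back at 0.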

Host level OOM killer mechanics

It is worth a small detour to cover the OOM killer in a bit more detail in order to understand what oom_score_adj does. The true details are complex, with a long and sordid history (for a nice, though certainly not all-inclusive, summary of articles on the OOM killer, see LWN), so this description is still very superficial.

At the host OS level, when the system becomes too short of memory, the OOM killer kicks in. In a nutshell, it will determine which process has the highest value for oom_score, and kill it with a SIGKILL signal. The value of oom_score for a process is essentially "percentage of host memory consumed by this process" times 10 (let's call that "memory score"), plus oom_score_adj.

The value of oom_score_adj may be set to any value in the range -1000 to +1000, inclusive. As mentioned above, note that oom_score_adj=-1000 is a magic value in that the OOM killer will never reap a process with this setting.

Combining these two bits of kernel trivia results in the value of oom_score ranging from 0 to 2000. For example, a process with oom_score_adj=-998 that uses 100% of host memory (i.e. a "memory score" of 1000) has an oom_score equal to 2 (1000 + -998), and a process with oom_score_adj=500 that uses 50% of host memory (i.e. a "memory score" of 500) has an oom_score equal to 1000 (500 + 500). Obviously this means that a process consuming a large portion of system memory with a high oom_score_adj is at or near the top of the list for the OOM killer.
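The arithmetic above can be written down as a tiny Python function. This encodes only the simplified model described in this section (memory score plus adjustment, floored at zero), not the kernel's actual badness calculation, and the function name is invented for the example.

```python
def simplified_oom_score(mem_percent: float, oom_score_adj: int) -> float:
    """oom_score under the simplified model described in the text.

    mem_percent   -- percentage of host memory used by the process (0-100)
    oom_score_adj -- per-process adjustment, -1000 to +1000 inclusive
    """
    if not -1000 <= oom_score_adj <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    if not 0 <= mem_percent <= 100:
        raise ValueError("mem_percent must be between 0 and 100")
    memory_score = mem_percent * 10  # "percentage of host memory" times 10
    # The text gives 0 as the lowest possible oom_score, so floor at 0.
    return max(0, memory_score + oom_score_adj)


if __name__ == "__main__":
    print(simplified_oom_score(100, -998))  # first example from the text
    print(simplified_oom_score(50, 500))    # second example from the text
```

The two calls reproduce the worked examples: 2 for the nearly protected process using all of memory, and 1000 for the half-of-memory process with a +500 adjustment.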

CGroup Level OOM Killer Mechanics

The OOM killer works pretty much the same at the cgroup level, except for a couple of small but important differences.

First of all, the OOM killer is triggered when the sum of memory consumed by the cgroup processes exceeds the assigned cgroup memory limit. While running a shell in a container, the former can be read from /sys/fs/cgroup/memory/memory.usage_in_bytes and the latter from /sys/fs/cgroup/memory/memory.limit_in_bytes.

Secondly, only processes within the offending cgroup are targeted. But the cgroup process with the highest oom_score is still the first one to go.
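As a sketch of how the two files mentioned above might be used from inside a container, the following Python function reads them and reports how close the cgroup is to its limit. It assumes the cgroup v1 layout given in the text; the function name and the default-directory constant are made up for this example.

```python
from pathlib import Path

# cgroup v1 memory controller directory as seen from inside a container.
CGROUP_V1_MEMORY_DIR = Path("/sys/fs/cgroup/memory")


def cgroup_memory_fraction(base: Path = CGROUP_V1_MEMORY_DIR) -> float:
    """Return current cgroup memory usage as a fraction of its limit."""
    usage = int((base / "memory.usage_in_bytes").read_text())
    limit = int((base / "memory.limit_in_bytes").read_text())
    return usage / limit


if __name__ == "__main__":
    if CGROUP_V1_MEMORY_DIR.exists():  # absent outside cgroup v1 systems
        print(f"cgroup memory in use: {cgroup_memory_fraction():.1%}")
```

Once this fraction reaches 1.0 the cgroup is at its limit, which is the point at which the OOM killer goes looking for the process in the cgroup with the highest oom_score.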

Why OOM killer avoidance is important for PostgreSQL

Some of the reasons for this emphasis on OOM Killer avoidance are:

  • Lost committed transactions: if the postmaster (or in HA setups the controlling Patroni process) is killed, and replication is asynchronous (which is usually the case), transactions that have been committed on the primary database may be lost entirely when the database cluster fails over to a replica.
  • Lost active connections: if a client backend process is killed, the postmaster assumes shared memory may have been corrupted, and as a result it kills all active database connections and goes into crash recovery (rolls forward through transaction logs since the last checkpoint).
  • Lost inflight transactions: when client backend processes are killed, transactions which have been started but not committed will be lost entirely. At that point the client application is the only source for the inflight data.
  • Down time: A PostgreSQL cluster has only a single writable primary node. If it goes down, at least some application down time is incurred.
  • Reset statistics: the crash recovery process causes collected statistics to be reset (i.e. zeroed out). This affects maintenance operations such as autovacuum and autoanalyze, which in turn will cause performance degradation, or in severe cases outages (e.g. due to running out of disk space). It also affects the integrity of monitoring data collected on PostgreSQL, potentially causing lost alerts.

Undoubtedly there are others neglected here.

Issues related to Kubernetes

There are several problems related to the OOM killer when PostgreSQL is run under Kubernetes which are noteworthy:

Overcommit

Kubernetes actively sets vm.overcommit_memory=1. This leads to promiscuous overcommit behavior and is in direct contrast with PostgreSQL best practice. It greatly increases the probability that OOM Killer reaping will be necessary.

cgroup OOM behavior

Even worse, an OOM kill can happen even when the host node does not have any memory pressure. When the memory usage of a cgroup (pod) exceeds its memory limit, the OOM killer will reap one or more processes in the cgroup.

OOM Score adjust

oom_score_adj values are almost completely out of the control of the PostgreSQL pods, preventing any attempt at following the long-established best practices described above. I have created an issue on the Kubernetes GitHub for this, but unfortunately it has not gotten much traction.

Swap

Kubernetes defaults to enforcing swap disabled. This is in direct opposition to the recommendation of Linux kernel developers. For example, see Chris Down's excellent blog on why swap should not be disabled. In particular, I have observed dysfunctional behaviors in memory-constrained cgroups when switching from I/O-dominant workloads to anonymous-memory-intensive ones. Evidence of other folks who have run into this issue can be seen in this article discussing the need for swap:

"There is also a known issue with memory cgroups, buffer cache and the OOM killer. If you don’t use cgroups and you’re short on memory, the kernel is able to start flushing dirty and clean cache, reclaim some of that memory and give it to whoever needs it. In the case of cgroups, for some reason, there is no such reclaim logic for the clean cache, and the kernel prefers to trigger the OOM killer, who then gets rid of some useful process."

There is also an issue on the Kubernetes GitHub for this problem, which is still being debated more than three years later.

Kubernetes QoS and Side Effects

Kubernetes defines 3 Quality of Service (QoS) levels. They impact more than just OOM killer behavior, but for the purposes of this paper only the OOM killer behavior will be addressed. The levels are:

  • Guaranteed: the memory limit and request are both set and equal for all containers in the pod.
  • Burstable: no memory limit, but with a memory request for all containers in the pod.
  • Best Effort: everything else.

With a Guaranteed QoS pod the values for oom_score_adj are almost as desired; PostgreSQL might not be targeted in a host memory pressure scenario. But the cgroup "kill if memory limit exceeded" behavior is undesirable. Relevant characteristics are as follows:

  • oom_score_adj=-998: this is good, but not the recommended -1000 (OOM killer disabled).
  • The documented environment variables are able to successfully reset oom_score_adj=0 for the postmaster children which is also good.

With a Burstable QoS pod, oom_score_adj values are set very high, and with surprising semantics (smaller requested memory leads to higher oom_score_adj). This makes PostgreSQL a prime target if/when the host node is under memory pressure. If the host node had vm.overcommit_memory=2, this situation would be tolerable because OOM kills would be unlikely if not impossible. However, as noted above, Kubernetes recommends/sets vm.overcommit_memory=1. Relevant characteristics are as follows:

  • The cgroup memory constraint OOM killer behavior does not apply -- this is good
  • oom_score_adj=(1000 - 10 * (percent avail mem requested)). This is a slight simplification: there is also an enforced minimum value of 2 and a maximum value of 999. The result is that a very small pod gets a higher score adjustment than a very large one. E.g. a pod requesting 1% of available memory will get oom_score_adj=990, while one requesting 50% of available memory will get oom_score_adj=500. This in turn means that if the smaller pod is idle, using essentially no resources, it might have, for example, oom_score=(0.1*10)+990=991, while the larger pod might be using 40% of system memory and get oom_score=(40*10)+500=900.
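The Burstable formula and its surprising consequence can be checked with a few lines of Python. This is a sketch of the simplified formula as stated above, including the clamp to the range 2 to 999, and is not the actual Kubernetes source; both function names are invented for the example.

```python
def burstable_oom_score_adj(percent_requested: float) -> int:
    """oom_score_adj for a Burstable pod under the simplified formula.

    percent_requested -- memory request as a percentage of available memory
    """
    raw = 1000 - 10 * percent_requested
    # Enforced minimum of 2 and maximum of 999, as noted in the text.
    return int(min(999, max(2, raw)))


def simplified_pod_oom_score(mem_percent_used: float, adj: int) -> float:
    """Memory score (percent of host memory used, times 10) plus adjustment."""
    return mem_percent_used * 10 + adj


if __name__ == "__main__":
    small = burstable_oom_score_adj(1)    # pod requesting 1% of memory
    large = burstable_oom_score_adj(50)   # pod requesting 50% of memory
    # Idle small pod versus busy large pod, as in the example above.
    print("small pod:", small, simplified_pod_oom_score(0.1, small))
    print("large pod:", large, simplified_pod_oom_score(40, large))
```

The idle small pod ends up with the higher oom_score, so it is the one the OOM killer would pick first under host memory pressure.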

Desired behavior

  • The ideal solution would be for the kernel to provide a mechanism equivalent to vm.overcommit_memory=2, except acting at the cgroup level. In other words, allow a process making an excessive memory request within a cgroup to receive an "out of memory" error instead of using the OOM killer to enforce the constraint. This would be ideal because most users seem to want Guaranteed QoS pods, but currently the memory limit enforcement via the OOM killer is a problem.
  • Another desired change is for Kubernetes to provide a mechanism to allow certain pods (with suitable RBAC controls on which ones) to override the oom_score_adj values which are currently set based on QoS heuristics. This would allow PostgreSQL pods to actively set oom_score_adj to recommended values. Hence the PostgreSQL postmaster process could have the recommended oom_score_adj=-1000, the PostgreSQL child processes could be set to oom_score_adj=0, and Burstable QoS pods would be a more reasonable alternative.
  • Finally, running Kubernetes with swap enabled should not be such a no-no. It took some digging, and I have not personally tested it, but a workaround is mentioned in the very long GitHub issue discussed earlier.

Impact and mitigation

In typical production scenarios the OOM killer semantics described above may never be an issue. Essentially, if your pods are sized well, hopefully based on testing and experience, and you do not allow execution of arbitrary SQL, the OOM killer will probably never strike.

On development systems, OOM killer action might be more likely to occur, but probably not so often as to be a real problem.

However, if the OOM killer has caused distress or consternation in your environment, here are some suggested workarounds.

Option 1:

  • Ensure your pod is Guaranteed QoS (memory limit and memory request sizes set the same).
  • Monitor cgroup memory usage and alert on a fairly conservative threshold, e.g.
    50% of the memory limit setting.
  • Monitor and alert on OOM Killer events.
  • Adjust memory limit/request for the actual maximum memory use based on
    production experience.
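The two monitoring bullets above might be sketched as follows in Python. Several things here are assumptions rather than anything stated in this article: that the kernel exposes an "oom_kill N" counter line (as in cgroup v1's memory.oom_control on reasonably recent kernels, or cgroup v2's memory.events), and all of the function names. The 50% default threshold is the one suggested above.

```python
def over_threshold(usage_bytes: int, limit_bytes: int, threshold: float = 0.5) -> bool:
    """True when cgroup memory usage is at or above the alert threshold."""
    return usage_bytes / limit_bytes >= threshold


def parse_oom_kill_count(text: str) -> int:
    """Extract the counter from a 'key value' style cgroup file.

    Assumption: the file contains a line of the form 'oom_kill N'.
    """
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "oom_kill":
            return int(parts[1])
    raise KeyError("no oom_kill field found")


def new_oom_kills(previous_count: int, current_text: str) -> int:
    """Number of OOM kills since the last time the counter was sampled."""
    return max(0, parse_oom_kill_count(current_text) - previous_count)


if __name__ == "__main__":
    sample = "oom_kill_disable 0\nunder_oom 0\noom_kill 3\n"  # made-up contents
    print("alert on usage:", over_threshold(600, 1000))
    print("new OOM kills since last sample:", new_oom_kills(1, sample))
```

A real setup would feed these checks from whatever metrics pipeline is already in place; the point is only that both signals are cheap to collect.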

Option 2:

  • Ensure your pod is Burstable QoS (with a memory request, but without a memory limit).
  • Monitor Kubernetes host memory usage and alert on a fairly conservative
    threshold, e.g. 50% of physical memory.
  • Monitor and alert on OOM Killer events.
  • Adjust Kubernetes host settings to ensure OOM killer is never invoked.

Option 3:

  • Accept the fact that some OOM Killer events will occur. Monitoring history
    will inform the statistical likelihood and expected frequency of occurrence.
  • Ensure your application is prepared to retry transactions for lost connections.
  • Run a High Availability cluster.
  • Depending on actual workload and usage patterns, the OOM killer event
    frequency may be equal or nearly equal to zero.
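For the "prepared to retry" bullet in Option 3, a minimal Python sketch of a retry wrapper might look like this. ConnectionLost is a hypothetical stand-in for whatever error your database driver raises when the backend disappears, and run_with_retry is likewise an invented name; neither comes from a real library.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def run_with_retry(txn, attempts: int = 3, delay_seconds: float = 0.0):
    """Call txn() until it succeeds or the attempts are used up.

    txn must redo the whole transaction from the start (reconnect, begin,
    statements, commit), because uncommitted work is gone after an OOM kill.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return txn()
        except ConnectionLost as exc:
            last_error = exc
            # Linear backoff gives crash recovery some time to finish.
            time.sleep(delay_seconds * (attempt + 1))
    raise last_error


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_txn():
        # Fails twice, as if the backend was killed, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionLost("backend went away")
        return "committed"

    print(run_with_retry(flaky_txn), "after", calls["n"], "attempts")
```

This only helps if the application can regenerate the inflight data, which ties back to the "lost inflight transactions" point earlier.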

Future work

Crunchy Data is actively working with the PostgreSQL, Kubernetes, and Linux Kernel communities to improve the OOM killer behavior. Some possible longer term solutions include:

  • Linux kernel: cgroup level overcommit_memory control
  • Kubernetes: oom_score_adj override control, swap enablement normalized
  • Crunchy: Explore possible benefits from using cgroup v2 under kube 1.19+

Summary

The dreaded Linux Assassin has been around for many years and shows no signs of retiring soon. But you can avoid being targeted through careful planning, configuration, monitoring, and alerting. The world of containers and Kubernetes brings new challenges, but the requirements for diligent system administration remain very much the same.
