Swap Fire on My 7.6GB VPS: A Nightmare That Started with a Kernel

#life #vps #linux #kernel

Why Did Swap Usage Spike on My 7.6GB VPS?

This morning, I received an alert from my server: swap usage had risen abnormally. For a VPS with 7.6 GB of RAM, this was an unexpected situation. Normally, I barely used my swap space. To understand the situation, I immediately connected to the server and checked memory with the htop command.

The htop output showed far more swap in use than I expected. That suggested either that the running applications needed more memory than usual or that something was leaking memory. My first suspect was a kernel update I had recently applied.
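htop shows the total, but it helps to see which processes actually hold swapped-out pages. The following is a minimal sketch, assuming a Linux /proc filesystem; the function name swap_by_process is my own, and the proc_root parameter exists only so the function can be pointed at a test directory.

```shell
#!/bin/sh
# Print "<swap KiB> <pid> <name>" for every process with swapped-out pages,
# largest first. Reads the VmSwap field from /proc/<pid>/status.
swap_by_process() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        # A process can exit between the glob and the read; skip it quietly.
        [ -r "$status" ] || continue
        awk '
            /^Name:/   { name = $2 }
            /^Pid:/    { pid = $2 }
            /^VmSwap:/ { swap = $2 }
            END { if (swap > 0) print swap, pid, name }
        ' "$status" 2>/dev/null
    done | sort -rn
}

# Show the ten biggest swap users on the current machine.
swap_by_process | head -n 10
```

If this list stays small while total swap usage keeps climbing, user-space processes are probably not the ones leaking, which points the search elsewhere.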

Kernel Update and Swap Usage

A few days ago, I had updated the Linux kernel on my server to the latest stable version. These updates typically patch security vulnerabilities and improve performance. However, they can sometimes lead to unexpected side effects. Issues related to kernel modules, in particular, can have significant impacts on the system's overall memory management.

A jump in swap usage right after this update seemed unlikely to be a coincidence. I immediately started examining the server's logs. I looked specifically at /var/log/syslog and the journalctl output, searching for signs that the kernel was hitting errors during memory allocation.
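A log search like this can be narrowed with a small filter. This is a sketch rather than the exact commands I ran: the pattern list is an assumption about which messages are worth catching, and filter_memory_errors is a hypothetical helper name.

```shell
#!/bin/sh
# Keep only log lines that usually accompany kernel memory trouble.
# The pattern list is an illustrative starting point, not an exhaustive one.
filter_memory_errors() {
    grep -E -i 'out of memory|oom-kill|invoked oom-killer|page allocation failure|allocation failed'
}

# Kernel messages from the current boot, if journalctl is available.
# grep exits non-zero when nothing matches, hence the "|| true".
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b --no-pager 2>/dev/null | filter_memory_errors || true
fi
```

The same filter works on the traditional log file, for example by running filter_memory_errors < /var/log/syslog.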

ℹ️ The Importance of Swap Space

Swap space is an area on disk that the operating system uses to hold memory pages temporarily when physical RAM runs short. As swap usage increases, disk I/O becomes heavier, and the server's overall performance drops noticeably.

Finding the Root Cause: The Debugging Process

After reviewing the logs, I noticed errors tied to one particular kernel module. The module is reached through a socket interface and had recently been patched to close a CVE (Common Vulnerabilities and Exposures) entry. The patch appeared to have introduced a regression in the module's memory management.

To understand this error more clearly, I also used the dmesg command. The dmesg output confirmed that the kernel was having trouble with memory allocation and deallocation. In some cases, memory allocated by the kernel was not being freed correctly. Kernel memory itself cannot be swapped out, so every leaked page left less RAM for applications and pushed more of their pages into swap over time.
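One way to watch for this kind of leak is to compare the unreclaimable slab figure in /proc/meminfo over time. The sketch below assumes the leak shows up in the SUnreclaim field, which is not guaranteed for every kernel leak; both function names are hypothetical.

```shell
#!/bin/sh
# Print the value (in KiB) of one /proc/meminfo field, e.g. SUnreclaim.
# The optional second argument lets the function read a saved snapshot.
meminfo_kib() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Growth of unreclaimable slab memory between two snapshot files, in KiB.
slab_growth_kib() {
    before=$(meminfo_kib SUnreclaim "$1")
    after=$(meminfo_kib SUnreclaim "$2")
    echo $((after - before))
}
```

To use it, copy /proc/meminfo to a file, wait a while under normal load, copy it again, and pass both files to slab_growth_kib. A figure that keeps growing without a matching rise in workload is worth investigating.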

Which CVE and Which Module?

Through my research, I traced the problem to the algif_aead kernel module. This module exposes the kernel crypto API's AEAD (authenticated encryption) ciphers to user space through AF_ALG sockets. A recently released security patch had closed a vulnerability in it, but the patch itself was causing a memory leak. This was a classic example of a "fix" turning into a "break."
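Anyone wondering whether their own system is exposed can start by checking whether the module is loaded at all. This is a minimal sketch, assuming the module is built as a loadable module; if it is compiled into the kernel it will not appear in /proc/modules. The helper name module_loaded is my own.

```shell
#!/bin/sh
# Succeed if the named module appears in a /proc/modules-style listing.
# The first column of each line is the module name.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

if [ -r /proc/modules ] && module_loaded algif_aead; then
    echo "algif_aead is loaded"
else
    echo "algif_aead is not loaded"
fi
```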

At this point, I realized that this problem was not limited to my VPS alone and could occur on similar systems. Kernel-level issues like these can pose significant risks, especially for servers hosting high-traffic or sensitive applications.

⚠️ Careful Application of CVE Patches

While security patches are critically important, updates must be applied with caution, especially at the kernel level. Testing patches in a staging environment and observing potential side effects before applying them to production can prevent serious issues.

Temporary Solution: Managing Swap Usage

Since the root cause was at the kernel level, finding a quick permanent solution was difficult. I had to wait for the kernel developers to release a new patch. In the meantime, I needed to implement temporary solutions to ensure the server's stability.

First, I lowered the swappiness value so that the kernel would be less eager to swap. swappiness controls how strongly the kernel prefers swapping out application memory over dropping page cache when it needs to reclaim RAM. Lowering the value keeps application data in RAM for longer.

Additionally, I tried to reduce overall memory pressure by changing when some memory-intensive applications ran and by switching to alternatives that consume less memory. Although this was a temporary measure, it helped bring swap usage under control.

Swap Management with `sysctl` Settings

To adjust the swappiness value, I used the sysctl command. To make it permanent, I also added the necessary settings to the /etc/sysctl.conf file.

<figure>
  <Image src={cover} alt="A graph showing swap usage on a Linux terminal." />
</figure>

# Check the current swappiness value
cat /proc/sys/vm/swappiness

# Lower swappiness to 10 (default is usually 60)
sudo sysctl vm.swappiness=10

# To make it permanent, add to /etc/sysctl.conf
# vm.swappiness=10

These settings reduced the server's tendency to use swap space. However, this did not solve the problem fundamentally; it only alleviated the symptoms. The real solution would come with a new kernel patch or the correction of the existing patch's error.
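Editing /etc/sysctl.conf by hand works, but repeated edits tend to leave duplicate lines behind. The helper below is a sketch of one way to keep the file tidy; set_sysctl_conf is a hypothetical function, not part of sysctl.

```shell
#!/bin/sh
# Set key = value in a sysctl-style config file, replacing an existing
# assignment of the same key instead of appending a duplicate.
set_sysctl_conf() {
    key="$1"; value="$2"; file="$3"
    tmp=$(mktemp) || return 1
    # Escape dots so "vm.swappiness" is matched literally.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    # Drop any earlier assignment of this key, keep every other line.
    if [ -f "$file" ]; then
        grep -v -E "^[[:space:]]*${pattern}[[:space:]]*=" "$file" > "$tmp" || true
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Write through the existing file so its owner and mode are preserved.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Run as root, set_sysctl_conf vm.swappiness 10 /etc/sysctl.conf followed by sysctl -p applies the value and keeps it across reboots.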

💡 What Should the Swappiness Value Be?

A commonly suggested range for servers is 10 to 30, although the kernel accepts any value from 0 to 100 (up to 200 on kernel 5.8 and later). The right value depends on the server's intended use and its memory usage profile. Lower values may be preferred for latency-sensitive applications.

Permanent Solution: New Kernel Patch and Aftermath

A few days later, the kernel developers released a new patch that fixed the memory leak in the algif_aead module. I immediately applied this patch to my server and restarted my system.

With the new kernel version, swap usage returned to normal. Seeing swap usage drop to almost zero in the htop output was a relief. This experience once again demonstrated how critical kernel updates are and how they can sometimes lead to unexpected problems.

Lessons Learned and Future Steps

One of the most important lessons I learned from this incident is the need to manage kernel updates carefully in production environments. After applying a patch, it's important to monitor the server closely for a period and observe any potential side effects.

Furthermore, I realized how crucial it is to have an automated monitoring system in place for my servers. Receiving automatic alerts when critical metrics like swap usage change suddenly helps me detect problems early.
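A basic version of such a check fits in a few lines of shell. This is a sketch under my own assumptions: the function names are hypothetical, and the threshold is whatever suits the server, not a recommended value.

```shell
#!/bin/sh
# Print swap usage as a whole-number percentage, read from /proc/meminfo
# (or from a snapshot file passed as the first argument).
swap_used_percent() {
    awk '
        /^SwapTotal:/ { total = $2 }
        /^SwapFree:/  { free = $2 }
        END {
            if (total > 0) printf "%d\n", (total - free) * 100 / total
            else print 0
        }
    ' "${1:-/proc/meminfo}"
}

# Return non-zero and print a warning when usage exceeds the limit.
check_swap() {
    limit="$1"
    used=$(swap_used_percent "$2")
    if [ "$used" -gt "$limit" ]; then
        echo "WARNING: swap usage at ${used}% (limit ${limit}%)"
        return 1
    fi
    return 0
}
```

Calling check_swap 20 from a cron job and mailing any output is enough to catch a slow climb like the one described here, well before the server starts thrashing.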

In the future, I plan to test kernel updates in a staging environment before deploying them to production. This will reduce potential risks and prevent downtime on my production servers. This "swap fire" incident on my small VPS served as a reminder of the continuous learning and adaptation required in system administration.
