Trying to find cause of poor memory management in recent kernels

ordealbyfire83 · 01-07-2020, 06:57 PM

It seems that recent kernel versions (roughly 4.x and later) do not handle virtual memory/cache as reliably as older versions. In my case, I have a particular ThinkPad with a solid state drive, on which I use dm_crypt for full disk encryption. After some time, this system hangs: the mouse cannot be moved, the last several seconds of audio perhaps play endlessly, and the disk activity light stays on steady. The only way out is a hard reset, after which there are no errors in any of the logs. Sometimes there is file system corruption, sometimes there is not. I have lived with this for the last 2 years, and it seemed this issue was inevitable.

I know of someone with the exact same model of ThinkPad, with all the exact same innards, using the exact same OS block-for-block (just dd'ed it over) on a traditional rotating hard drive with no encryption. That laptop can routinely hibernate/resume and such for months on end without issue. Mine, on the other hand, rarely lasts more than a week. I asked for help with this on another, unrelated forum, where it was suggested that buggy microcode might be the culprit, and I was basically told that such is life with Linux on the desktop nowadays. I should add that I've tried every I/O scheduler available; some bring on the bug sooner than others, but they all get there eventually.

The above "experiment" seems to suggest that encryption or the use of an SSD is the culprit. So I used a traditional hard drive for some time and hit the same issue. Most likely a dm_crypt problem. Or so it seems.

After reading up extensively on similar bug reports like this one ( https://bugzilla.kernel.org/show_bug.cgi?id=12309 ), which claims that the bug was fixed by a code change though no code was actually attached, I decided to delve deeper into the kernel documentation for virtual memory.

The disk activity light puzzled me, and the best I could figure is that there was extensive swapping even though this should not be the case. So...I set out to make sure the kernel doesn't swap no matter what. This particular laptop has 4 GB of memory, some of which is shared with the GPU, so free -m tells me that 3685 MB is actually available for use. There is 2 GB of swap. So I set /proc/sys/vm/overcommit_memory to "2" (meaning to disallow overcommitting) and /proc/sys/vm/overcommit_ratio to "44". With 2 GB of swap, that ratio puts the commit limit at roughly the size of physical RAM (compare CommitLimit and MemTotal in the /proc/meminfo output below). This *should* keep everything neatly in RAM. (I didn't disable swap because I want to hibernate.) I also set cache pressure to 150.
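
For reference, the limit that overcommit_memory=2 enforces can be worked out by hand. What follows is a minimal sketch, assuming the formula given in the kernel's overcommit accounting documentation (RAM minus reserved huge pages, times overcommit_ratio percent, plus swap) and assuming no huge pages are reserved. The function name commit_limit_kb is made up for illustration; the inputs are the MemTotal and SwapTotal values from the /proc/meminfo output later in this post.

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio, hugetlb_kb=0):
    """Approximate CommitLimit (in kB) under vm.overcommit_memory=2.

    Assumed formula: (RAM - huge pages) * ratio / 100 + swap.
    hugetlb_kb defaults to 0, i.e. no huge pages reserved (an assumption).
    """
    ram_share_kb = (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100
    return ram_share_kb + swap_total_kb


# MemTotal and SwapTotal as reported in the post, with overcommit_ratio = 44.
limit = commit_limit_kb(3773792, 2099224, 44)
print(limit)  # 3759692, the same figure /proc/meminfo reports as CommitLimit
```

With a ratio of 44 and about 2 GB of swap, the limit comes out roughly 14 MB below MemTotal, which fits the stated goal of keeping all committed memory within physical RAM.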

One thing I noticed is that my cooling fan hardly ever runs now. It ran nearly constantly before, and I thought that was normal for this laptop. I'm not sure what the correlation is.

It also means that applications cannot hog all the memory and force arbitrary OOM actions. I would rather see a friendly message from the application saying that it cannot allocate memory than potentially have the kernel kill off the dm_crypt machinery (!), leaving my running OS in limbo. To test this I opened up several TIFFs in GIMP until I saw a message saying something like fork() cannot allocate memory.

For a week I used this laptop heavily, processing large, high-resolution TIFFs in GIMP, playing full HD x264 videos deliberately larger than the CPU can reasonably handle, and the like. All the while, running "free -m" showed the cache growing and shrinking as it is designed to do. After a week, though, things started going awry.

Typically at this point I would have experienced the hang, but I think not overcommitting staved it off. But GIMP failed to open a small 640x480 jpg and I could not open a text editor. At that point free -m gave me this:

user@hostname:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3685         416        2114         667        1154        2168
Swap:          2050           3        2046

These numbers seem to indicate there is still A LOT of RAM available in proportion to the total amount. Likewise, I could not evict those pesky 3 MB from swap:

user@hostname:~$ sudo swapoff -a
swapoff: /dev/mapper/swap: swapoff failed: Cannot allocate memory

user@hostname:~$ cat /proc/meminfo
MemTotal: 3773792 kB
MemFree: 2166992 kB
MemAvailable: 2223152 kB
Buffers: 46252 kB
Cached: 1064608 kB
SwapCached: 436 kB
Active: 580996 kB
Inactive: 904532 kB
Active(anon): 426100 kB
Inactive(anon): 628044 kB
Active(file): 154896 kB
Inactive(file): 276488 kB
Unevictable: 16 kB
Mlocked: 16 kB
SwapTotal: 2099224 kB
SwapFree: 2095160 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 374248 kB
Mapped: 98696 kB
Shmem: 679476 kB
Slab: 67712 kB
SReclaimable: 35380 kB
SUnreclaim: 32332 kB
KernelStack: 4688 kB
PageTables: 14728 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 3759692 kB
Committed_AS: 1982152 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
DirectMap4k: 467584 kB
DirectMap2M: 3459072 kB

Perhaps someone can spot something "off" in here? I'm not totally sure what all of this means.
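
One way to read a dump like this is to pull out the commit accounting fields and see how much headroom is left under strict overcommit. The following is only a sketch: parse_meminfo and commit_headroom_kb are illustrative names, and SAMPLE is a handful of lines copied from the output above rather than a live read of /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip blank or malformed lines
        fields[name.strip()] = int(rest.split()[0])
    return fields


def commit_headroom_kb(fields):
    """kB that could still be committed before hitting CommitLimit."""
    return fields["CommitLimit"] - fields["Committed_AS"]


# A few lines copied from the dump in this post.
SAMPLE = """\
MemTotal: 3773792 kB
MemFree: 2166992 kB
MemAvailable: 2223152 kB
Shmem: 679476 kB
SwapTotal: 2099224 kB
SwapFree: 2095160 kB
CommitLimit: 3759692 kB
Committed_AS: 1982152 kB
"""

info = parse_meminfo(SAMPLE)
print("commit headroom (kB):", commit_headroom_kb(info))            # 1777540
print("swap in use (kB):", info["SwapTotal"] - info["SwapFree"])    # 4064
```

By this accounting, Committed_AS was still about 1.7 GB below CommitLimit when the dump was taken, so the commit limit as reported does not look exhausted. To check a live system, pass the contents of /proc/meminfo to parse_meminfo instead of SAMPLE.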

All of the above is with kernel 4.14.68 (64-bit), 7 days of uptime, 6 hibernate/resume cycles, no caches dropped, the deadline I/O scheduler, dirty_ratio 10, and dirty_background_ratio 5.
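
When comparing runs like this, it can help to record the VM tunables alongside the uptime. Below is a minimal sketch that reads them from /proc/sys/vm. The helper name read_vm_settings is hypothetical, and it assumes that "cache pressure" in this post refers to vm.vfs_cache_pressure. The I/O scheduler lives elsewhere, under /sys/block, and is not covered here.

```python
from pathlib import Path

# Tunables mentioned in this thread; vfs_cache_pressure is an assumption
# about what "cache pressure" refers to.
VM_KEYS = (
    "overcommit_memory",
    "overcommit_ratio",
    "vfs_cache_pressure",
    "dirty_ratio",
    "dirty_background_ratio",
)


def read_vm_settings(base="/proc/sys/vm", keys=VM_KEYS):
    """Return {key: int, or None if the file is missing or unreadable}."""
    settings = {}
    for key in keys:
        try:
            settings[key] = int((Path(base) / key).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            settings[key] = None
    return settings
```

Calling read_vm_settings() with no arguments reads the live values; the base parameter exists only so the function can be pointed at a copy of the directory for testing.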

After this, I exited my desktop environment, undid a couple init scripts (networkmanager/dbus/bluetooth), dropped caches, swapoff, swapon, drop caches again, and then brought everything back up. Then I could open GIMP as usual along with my text editor.

So, to conclude: had I not dropped the caches manually, this computer would not have been able to accomplish much. But by barring overcommitting we may have stopped a pesky hang.

pan64 · 01-08-2020, 02:23 PM

Hm. Did you consider a hardware-related error or overheating?

ordealbyfire83 · 01-08-2020, 09:22 PM

Well, I've read that these ThinkPads are very aggressive with cooling. This is handled in the embedded controller, which isn't even replaceable with Libreboot etc. Then again, I'm running the CPU pegged at the minimum 800 MHz, because this kernel version makes my frequency scaling applet go crazy. Temperatures are usually < 50 Celsius, so I'm assuming the hardware is OK.

pan64 · 01-09-2020, 01:01 AM

Ah, assumptions. Probably better to check RAM, disk and temperature (and fan). Unfounded assumptions can lead to a lot of unnecessary work.

ordealbyfire83 · 01-30-2020, 07:53 PM

Just to follow up on this, I've been watching the temperatures over the last couple of weeks and they have been consistently cool, with the fan working properly, and so forth. I've now reached 31 days of uptime here, including 23 hibernate/resume cycles and more than a few suspend/resume cycles. I've never had anything close to that much on this laptop before.

The tentative conclusion is that using dm_crypt plus overcommitting somehow causes lockups. I doubt this issue will see any attention, because the steps to reproduce it are not concrete - just "live your life" until it shows up, basically.