[SOLVED] How can I write a program to trigger memory direct reclaim ?
First of all, I would like to kindly ask for your assistance in teaching me how to write a program that can trigger direct memory reclamation.
I wrote a program that utilizes mmap to allocate memory as below.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#define PAGE_SZ 4096
#define GB (1UL << 30)
int random_number(int min, int max) {
    return min + rand() % (max - min + 1);
}
int main(void) {
    /* allocate 1 GB anonymously and touch every page so it is backed by RAM */
    char *p = mmap(NULL, GB, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    for (size_t i = 0; i < GB; i += PAGE_SZ)
        p[i] = (char)random_number(0, 255);
    pause();  /* keep the mapping alive */
}
However, I noticed that only the available field decreased, while the free field did not show a significant decrease. This made it difficult to trigger direct memory reclamation.
So I am wondering how to make both the free and available fields decrease simultaneously?
I would greatly appreciate any advice or suggestions on how to write testing programs that can effectively trigger direct memory reclamation.
The reason why I want to conduct this test is as follows.
Recently, my server experienced several instances of hang, which lasted from several seconds to even ten minutes.
We have launched a program to monitor the call trace of processes in the 'D' state and the memory status.
Let me explain what I found during the hanging period first.
1) There were many processes in the 'D' state with a similar call trace: page_fault->...->filemap_fault()->__lock_page_or_retry()->io_schedule().
No processes were hung in __alloc_pages_slowpath.
2) The system triggered a direct memory reclaim, as observed through the logging that monitors the change of pgsteal_direct in /proc/vmstat every 2 seconds. The changes can range from tens to even ten million.
3) The available field and free field of 'free -h' were 15GB and 2GB respectively, before and after the direct reclamation occurred.
And here is some information about the server.
The server has 128GB of memory.
The value of /proc/sys/vm/min_free_kbytes is set to the equivalent of 2GB (2097152 kB).
The swap file is disabled.
Based on this information, I believe:
1) Some processes triggered direct reclamation but did not hang.
2) The direct reclamation evicted some pages that belonged to other processes. (As far as I know, direct reclamation only evicts clean pages when it occurs.)
3) A group of processes triggered page faults, causing a high system load and resulting in them getting stuck.
To validate my hypothesis, I need to reproduce direct reclamation.
Last edited by plznobug; 10-10-2023 at 10:02 PM.
Reason: More information
Can you share more details about the hung program? Is it running Java?
Memory costs the same empty as it does 100% full (power consumption wise) and the Linux kernel treats empty memory as wasted memory. So it will do other things with it like use it as a cache. free will show all usage including memory treated as cache. This is normal and good. Aiming for no memory usage or high values of empty memory leaves resources unused and is not ideal for a server (otherwise you can just give your program fewer resources).
You may be chasing the wrong path to solve your issue so, if you could, share some details about the original issue.
Thank you very much for your attention to my question. I have revised the original question and added more information.
First, read the link I posted: In defence of swap: common misconceptions. That will explain how it really works (and why you should enable swap anyway). It will also explain why your system hangs.
As far as I can see, your problem is not the reclaiming itself, but that there is nothing to reclaim.
You must not implement your own memory management to fix this issue; give the kernel more room to do its job, that's all.
probably: https://stackoverflow.com/questions/...claim-in-linux
Thank you very much for your response. I will carefully read these articles. Perhaps enabling swap can solve my problem. However, I am more interested in understanding the reason for the system hang, as I believe it will allow me to learn more about the kernel.
“As far as I see your problem is not the reclaiming itself, but there is nothing to reclaim.”
I apologize, but I don't quite understand this statement. Why do you say there is nothing to reclaim? Isn't there still around 10GB of available memory?
Furthermore, I have observed a phenomenon on the server that I cannot comprehend.
Currently, I am killing some processes based on the free field. For example, when the free field approaches 2GB, I kill the process with the highest VmRSS. New processes will then be scheduled onto this server shortly afterwards.
I have noticed that after killing a process, both the free and available fields increase simultaneously. But after the new process starts, the free field decreases rapidly while the available field remains mostly unchanged.
For example, when the kill occurs, the "free -h" command shows that the system has 3GB free and 40GB available. After killing a task that consumes more than 6GB, the free field becomes 10GB and the available field becomes 48GB. But when a new task is scheduled to this server, the free field decreases to 3GB at a rate of 1GB/s and the available field initially decreases by 1GB but then remains at 47GB.
At the same time, by monitoring /proc/vmstat, I observed that nr_file_pages increased from 10991321 to 11079287 after killing a task, and kept increasing to 12488812 after the new task started.
Based on my understanding, alloc_page should first consume the available field before consuming the free field. Therefore, I cannot comprehend this phenomenon.
I really want to know why this is happening, and I can't think of any program that can reproduce this phenomenon.
Oh, I should also mention that it is HDD in the server.
Yes, I mean there is no available (or enough) memory to reclaim. If you have 128 GB of RAM and only a few GB are free - you have actually run out of RAM.
(For example, if a process needs 10 GB, it cannot start if the system only has 2 GB of free RAM).
I read the articles they shared (quite helpful) and I changed vm.swappiness=100 for my system since it is SSD backed. I am running a heavily distributed application and it has 8CPU cores and 32GB of RAM for its controlling node (that schedules out work to distributed compute nodes).
The application takes up 24GB of memory, disk caching about 4GB, and 6GB is available. I enabled a file-based swapfile (16GB of swap), changed vm.swappiness to 100, and am reviewing the application performance profiles.
Currently, zero swap is used because my system is not under memory pressure. The fact that I have 6GB of memory available (i.e. the kernel will drop disk cache in memory immediately any time an application requests more memory) means there's no need to move pages from physical memory to disk (swap).
In my case, for these reasons, there's no memory to reclaim. The Linux kernel is working just fine with the workload at hand.
If you have no swap file, the memory of running applications is typically non-reclaimable, so the kernel will attempt to reclaim other types of memory in use (mainly page cache), which can cause I/O issues and affect overall performance.
By having a swap file you give the Linux kernel the option to move infrequently accessed application memory to disk while keeping frequently accessed memory in physical memory. The Linux kernel will use swappiness as a factor to control how it manages what goes to disk or not. In the case of the articles shared (and associated talks) you can also use cgroups or cgroupsV2 to control swappiness, memory, and other resource constraints on an individual process.
Because you have plenty of available memory, your OS does not need to reclaim at this moment. However, because your application did hang during a reclaim, you definitely should be using a swap file so that the Linux kernel can persist infrequently used pages to disk (specifically application memory, which is not otherwise reclaimable because the kernel assumes the application needs it).
In summary, by having swap you give the Linux kernel more memory-management options, so that your application does not hang due to a reclaim operation. You should verify this with application performance monitoring (APM) and system monitoring to better understand your application's performance profile so that you can tune the system you have.