OOM killer even though there is memory available
We currently have two identical servers with exactly the same setup: server A was set up first, and server B was restored from a recovery of server A so that the software is identical too.
Both servers are running a single Xeon CPU with 18GB RAM and 8GB swap.
Server A is running perfectly fine with no issues.
Server B seems to be a lot slower in writing data to disk.
To test this, I ran a timed dd from /dev/zero to an 8GB file on both servers, which showed that this really is the case and not just what users were saying.
It becomes more interesting if I increase this 8GB zero file to 16GB. On server A this is no problem and completes in under 2 minutes with sync included. On server B, once total RAM use reaches around 13GB (rising because of caching), the OOM killer starts, even though there is still plenty of free RAM and the swap space hasn't even been touched.
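For anyone wanting to repeat the comparison without dd, here is a rough Python sketch of the same kind of timed write. It streams zeros to a file and includes the final fsync in the timing, similar in spirit to timing dd plus sync. The function name, the 16 MiB demo size and the temp-file path are placeholders of mine, not anything taken from the servers above; point it at a file on the array under test and raise the size to reproduce the real load.

```python
import os
import tempfile
import time


def timed_zero_write(path, total_bytes, block_size=1024 * 1024):
    """Write total_bytes of zeros to path, fsync, and return elapsed seconds."""
    block = bytes(block_size)  # a buffer of zero bytes, like reading /dev/zero
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())  # count the flush to disk in the timing
    return time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical demo values: a small file in the temp directory.
    demo_path = os.path.join(tempfile.gettempdir(), "zero_write_demo.bin")
    demo_bytes = 16 * 1024 * 1024
    seconds = timed_zero_write(demo_path, demo_bytes)
    print(f"wrote {demo_bytes} bytes in {seconds:.2f}s")
    os.remove(demo_path)
```

Running the same size against the same array on both servers gives two directly comparable numbers.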
I am running memtest86+ on server B at the moment; it has basically finished and is not showing any errors.
The OOM message is:
I see from the OOM errors that it reports "DMA32: empty", but why is this only on one server and not the other?
I have tried the dd command on different RAID arrays installed on the server, and the two I tried both cause the same problem. They are on the same RAID controller, which uses CCISS.
Does anyone have any idea what is causing this?
I don't see any hint in your post about what processes are using the 8GB of swap space and how different the use of swap space and ram might be between the two servers.
The free command will tell you about overall use of swap space. The RES column of top (after you sort on that column) is one of several places you can get a rough idea of memory use per process. Getting info on per process use of swap space may be much harder. Post some of the easier to get memory statistics and that might give some clue about where you need to take a closer look.
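If it helps, something like the following Python sketch could collect those numbers in one go. It reads /proc/meminfo for the overall picture and the VmRSS and VmSwap lines of /proc/<pid>/status for each process. The helper names are made up for this example, and VmSwap only appears on kernels that export it, so treat a missing value as "unknown" rather than zero.

```python
import glob


def parse_kb_fields(text, wanted):
    """Return {field: kB} for 'Name:  123 kB' lines whose name is in wanted."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in wanted:
            parts = rest.split()
            if parts and parts[0].isdigit():
                values[name] = int(parts[0])
    return values


def process_name(text):
    """Return the value of the 'Name:' line of a /proc/<pid>/status file."""
    for line in text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return "?"


def process_memory():
    """Return (rss_kb, swap_kb, name) tuples, largest resident size first."""
    rows = []
    for status_path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(status_path) as f:
                text = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        fields = parse_kb_fields(text, {"VmRSS", "VmSwap"})
        # Kernel threads have no VmRSS; report them as 0 here.
        rows.append((fields.get("VmRSS", 0), fields.get("VmSwap", 0),
                     process_name(text)))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(parse_kb_fields(f.read(), {"MemTotal", "MemFree", "Cached",
                                             "SwapTotal", "SwapFree"}))
    except OSError:
        print("no /proc/meminfo here (not a Linux system?)")
    for rss, swap, name in process_memory()[:10]:
        print(f"{name:20s} RSS {rss:>10d} kB  swap {swap:>10d} kB")
```

Posting the output from both servers, taken while the dd is running, would make the difference between them easier to see.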
You may need to increase swap space. A large amount of unused swap space may be necessary for insurance against OOM during unusual conditions. If you are using a significant fraction of the 8GB of swap, then it is not enough for insurance. Disk space is cheap. Giving more of it to swap space may avoid serious problems.
It sounds like you ALSO have a problem with disk writes on server B. I don't know anything about diagnosing that problem. Slow disk writes could easily be the direct cause of the OOM and could be the only reason the OOM occurs only on one of the two servers. But I think the OOM still indicates either you are using more anonymous memory in some process(es) than you intended, or you configured less swap space than your workload needs for safe operation.
Thanks for the quick reply.
On both servers, none of the 8GB swap space is actually used during the dd file creation, and on server B there is still 5-6GB of free RAM (going by top) when the OOM killer kicks in. This is part of why I don't understand the reason for the OOM killer to kick in.
I will get the output of free and post it when I can redo the dd test.
Increasing the swap space is not an issue; however, as mentioned above, swap is not actually being used at all and free RAM seems to be available.
You can avoid such conditions in Linux by changing the "over commit" settings. It is possible the over commit settings in your server B somehow got changed to non default values that more readily cause OOM. If so, it would be best to set them back to default.
It is also possible that you have some unusual use of address space by some of your processes that (combined with default over commit settings) leads to the system thinking it needs an absurdly large amount of free swap space. In that case some experts (in other threads) have advised adjusting the over commit settings so the system no longer thinks it needs so much swap space. That approach can work well if you know what you're doing, but most people in that situation misunderstand the documentation of over commit settings and get the details wrong. Simply providing the excess swap space that the system thinks it needs may be easier and safer than messing with over commit settings.
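A quick way to check whether the over commit settings differ between the two servers is to read them from /proc/sys/vm on each one. The Python sketch below does that and flags anything other than the kernel defaults, which are overcommit_memory = 0 and overcommit_ratio = 50. The function and dictionary names are just for illustration.

```python
# Meaning of the values accepted by /proc/sys/vm/overcommit_memory.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting (no overcommit beyond the commit limit)",
}


def describe_overcommit(mode_text, ratio_text):
    """Return (mode, ratio, label, is_default) from the two sysctl file contents."""
    mode = int(mode_text.strip())
    ratio = int(ratio_text.strip())
    label = OVERCOMMIT_MODES.get(mode, "unknown mode")
    is_default = (mode == 0 and ratio == 50)
    return mode, ratio, label, is_default


if __name__ == "__main__":
    try:
        with open("/proc/sys/vm/overcommit_memory") as f:
            mode_text = f.read()
        with open("/proc/sys/vm/overcommit_ratio") as f:
            ratio_text = f.read()
    except OSError:
        print("over commit settings not readable here (not a Linux system?)")
    else:
        mode, ratio, label, is_default = describe_overcommit(mode_text, ratio_text)
        print(f"overcommit_memory={mode} ({label}), overcommit_ratio={ratio}")
        print("defaults" if is_default else "NOT the kernel defaults")
```

If server B prints something different from server A, that is worth looking at before touching anything else.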
I really do not think the amount of swap space is the issue here. This is only creating a 16GB file.
I have however increased the swap space to 32GB and then retried creating a 16GB file and this is still no better.
Here is the OOM message:
The server seems to have locked up, so I was unable to run the free command but you can see the memory usage from the above OOM message.
The only way I can see swap space being involved that would make sense is if the disks are struggling so much that the system cannot write data out quickly enough.
I am going to check the RAID controller settings on both servers to confirm they are set up identically, but otherwise I am still not sure how the DMA32 zone has become empty or why the system is not able to use the swap space.
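To compare the memory zones on the two servers while the dd is running, a small Python sketch like this could read /proc/buddyinfo, which lists the number of free blocks of each order per zone. It assumes the usual "Node N, zone NAME count count ..." line layout, and the function name is only for this example. It does not explain the "empty" in the OOM report by itself, but it should show whether a DMA32 zone is listed at all on each server and how many free pages it has.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): {"counts": [...], "free_pages": n}} from buddyinfo text."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone DMA32 <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(c) for c in parts[4:]]
        # A free block of order k spans 2**k pages.
        free_pages = sum(c << order for order, c in enumerate(counts))
        zones[(node, zone)] = {"counts": counts, "free_pages": free_pages}
    return zones


if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as f:
            text = f.read()
    except OSError:
        print("no /proc/buddyinfo here (not a Linux system?)")
    else:
        for (node, zone), info in sorted(parse_buddyinfo(text).items()):
            print(f"node {node} zone {zone:8s} free pages: {info['free_pages']}")
```

Running it on both servers before and during the dd would show whether the zone layout itself differs between them.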