Swapping causes processes & system to hang

MirceaKitsune · 11-26-2013, 05:50 AM

This is the most severe issue I've been experiencing since moving to Linux. I already discussed it on the openSUSE forums, but it appears this might be a general issue of the Kernel. I'd like to know if anyone else has this problem, knows more about what exactly is going on, or can estimate when it will be fixed.

This is what's happening: After a given amount of uptime (typically about 12 to 48 hours without restarting) the machine suddenly hangs. The image remains frozen on the screen, sound stops, and the mouse and keyboard stop responding (even toggling the NumLock / CapsLock / ScrollLock LEDs does nothing). But unlike similar freezes I've had in Windows, the busy LED (red bulb on the computer case) doesn't stay permanently lit; it just flashes occasionally as it does normally... indicating the machine isn't completely frozen and is still alive in the background. Still, the only way to recover is to power off the machine and restart it.

At first I thought the fglrx (proprietary ATI) driver was the cause. But after switching back to the OSS Radeon driver with a distribution upgrade, I found that the issue still exists, although it's more rare and takes slightly different forms. It took me months of examining and putting things together to realize what was truly triggering this: SWAP.

Apparently, swapping can cause the system to hang completely for long periods of time... even if barely any SWAP is used. Because I have a lot of RAM, my SWAP usage is almost always 0 GB. While this is the case, everything works perfectly fine. But sooner or later, something adds a little bit of data to the SWAP. The moment SWAP has as little as 0.01 GB usage, it's only a few hours until random processes start to hang and the machine stops responding. The system freezes are probably when this happens to a vital component, such as the X server.
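For anyone who wants to log SWAP usage over time instead of watching a system monitor, here is a minimal sketch in Python. It assumes the usual /proc/meminfo layout with SwapTotal and SwapFree lines in kB; the function name is made up for illustration.

```python
def swap_used_kib(meminfo_text):
    """Return the amount of swap in use (kB) given the text of /proc/meminfo.

    Assumes the standard "Key:   value kB" layout; swap_used_kib is a
    hypothetical helper, not part of any tool mentioned in this thread.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # The first token after the colon is the numeric value.
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]
```

On a live system the input would be the contents of /proc/meminfo, for example `swap_used_kib(open("/proc/meminfo").read())`. A result of about 10000 kB corresponds to the 0.01 GB mentioned above.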

These are my specifications, in case any are relevant: I'm running openSUSE 13.1 x64, with KDE 4.11.2, on Kernel 3.11.6. I have a Radeon HD 6870 card, running on the latest official versions of Mesa and the free Radeon driver. The previous setup in which I experienced this was openSUSE 12.3 x64, with KDE 4.10.5, on Kernel 3.7.10, and with the fglrx driver instead. I have 9 GB of triple-channel RAM (3 x 1 GB + 3 x 2 GB sticks) and 8 GB of SWAP on my primary drive, on an Intel Core i7 920 CPU (4 cores, 8 threads).

Has anyone else experienced kernel swapping causing processes to freeze for long amounts of time? Even when little SWAP and / or RAM are used, and going as far as freezing vital parts of the system? Is the problem known and being worked on? What are the workarounds (apart from no longer using SWAP)?

johnsfine · 11-26-2013, 08:34 AM

Quote:

Originally Posted by MirceaKitsune

It took me months of examining and putting things together to realize what was truly triggering this: SWAP.

Apparently, swapping can cause the system to hang completely for long periods of time... even if barely any SWAP is used.
...
The moment SWAP has as little as 0.01 GB usage, it's only a few hours until random processes start to hang and the machine stops responding. The system freezes are probably when this happens to a vital component, such as the X server.

I think you are jumping to the conclusion that SWAP is a factor on far too little evidence. SWAP usage is more likely to be another symptom of the underlying cause, rather than any step in the chain of events leading to the hang.

I would wild guess (also on too little evidence) that your problem is internal to the X server process.

MirceaKitsune · 11-26-2013, 11:57 AM

Quote:

Originally Posted by johnsfine

I think you are jumping to the conclusion that SWAP is a factor on far too little evidence. SWAP usage is more likely to be another symptom of the underlying cause, rather than any step in the chain of events leading to the hang.

I would wild guess (also on too little evidence) that your problem is internal to the X server process.

It is true there's no 100% certain evidence it's the SWAP. But everything I tested and noticed points only to this conclusion.

A few days ago I did an even more direct experiment; I waited for SWAP to get big enough to cause trouble. Luckily, only non-vital processes started freezing or lagging this time. Once I was sure the issue started, I used "swapoff -a" followed by "swapon -a" to restart and clear the swap. From that moment, everything went back to normal and there were no more freezes.

For multiple reasons (not just to avoid this bug), I set vm.swappiness to 10 instead of the default of 60. This should cause SWAP to almost never be used. Over the next days / weeks, I will see if the freezes are permanently gone now. If not, I will also test with the SWAP disabled entirely.
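To confirm the swappiness value is actually configured to persist across reboots, one option is to look for it in /etc/sysctl.conf. A rough sketch in Python, assuming the usual `key = value` format with `#` or `;` comment lines; the helper name is hypothetical:

```python
def read_sysctl_setting(conf_text, key):
    """Return the last value assigned to `key` in sysctl.conf-style text.

    Returns None if the key is never set. Assumes one "key = value"
    pair per line and comment lines starting with '#' or ';'.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            value = rest.strip()  # later assignments override earlier ones
    return value
```

The value currently in effect can be read separately from /proc/sys/vm/swappiness, so the two can be compared.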

syg00 · 11-27-2013, 02:37 AM

The pastebin you referenced on the openSUSE forum shows a segmentation fault in X - did you contact X.org support as the log suggested?

MirceaKitsune · 11-27-2013, 05:50 AM

Quote:

Originally Posted by syg00

The pastebin you referenced on the openSUSE forum shows a segmentation fault in X - did you contact X.org support as the log suggested?

Hmmm... I didn't read the log in-depth before posting it. Maybe the SWAP issue causes processes to get such faults? It might have been due to fglrx which I got rid of though, so it could be part of a problem that's solved. I shall bring this to the Xorg team too... but the issue is so unclear I can't tell who exactly should see it.

As I might have said, it's not always complete system freezes. Sometimes I get various processes dying in place, which doesn't seem like X's doing. Last night I got KWin + Plasma-Desktop + other non-vital processes all freezing suddenly, but managed to open KSysGuard to look at what was happening. Surprisingly, SWAP wasn't shown to be in use during this time. Only other thing of relevance was that the field which shows CPU usage was saying "disk"... which I assume means the process got suspended to disk for some reason. Either way, everything recovered after a few minutes this time.
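On the "disk" status: as far as I know, this usually corresponds to the kernel's D state (uninterruptible sleep, typically a process waiting on I/O) rather than the process being suspended to disk. A small sketch in Python for pulling the state letter out of a /proc/<pid>/stat line, assuming the standard layout `pid (comm) state ...`; the function name is made up for illustration:

```python
def process_state(stat_text):
    """Return the one-letter state from the text of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split after the LAST ')' rather than on spaces.
    """
    after_comm = stat_text[stat_text.rindex(")") + 1:]
    return after_comm.split()[0]


def is_uninterruptible(stat_text):
    """True if the process is in state D (uninterruptible sleep)."""
    return process_state(stat_text) == "D"
```

Processes that sit in state D for a long time are stuck waiting on the kernel, which would fit with programs appearing frozen and then recovering on their own.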

MirceaKitsune · 11-29-2013, 06:57 PM

Several minutes ago, the potential holy grail of the system freezes has been delivered to me. After approximately 3 days of uptime, I suddenly started getting temporary system freezes (a few seconds each) which soon led to a permanent system hang. This time I waited to see what happens rather than instantly powering off the computer. After a minute, the system went into a console where information about what happened was printed. Everything was still completely frozen, but the text was readable.

With no print-screen function available, I took a photo of the screen. It's a little blurry, but good enough so the text can be read. If anyone can't understand what the last lines say, I can translate them. In essence, it seems to talk about a "panic", "cpu lockup", and "paging request":

http://i43.tinypic.com/29woopg.jpg

syg00 · 11-29-2013, 09:39 PM

Ok, that's a kernel oops, and earlier X had a segfault.
Could be a real error in the kernel, but I'd start looking at hardware as well. Start with memtest - I'd let it run (at least) all night.

MirceaKitsune · 11-30-2013, 03:36 PM

Since I didn't have much to do today, I decided to boot into Memtest 4.20 (from the Clonezilla CD) and let it run for 8 hours straight. It did 6 passes with the default tests, and no errors were found. From what I saw many people say, 6 passes should be more than enough to be sure the RAM is fine... even when overclocked. So this hopefully means a bad memory module is off the list of suspects.

http://i44.tinypic.com/34evmsz.jpg

MirceaKitsune · 12-14-2013, 09:52 AM

I started testing various components which I considered could be at fault. First thing I tried was turning off Skype, but it seems it's unrelated. Next up I disabled SWAP, by using "swapoff -a" immediately after login... yet I still got a Kernel panic several minutes ago after 4 days of uptime.

I have now disabled some BIOS settings which perform built-in overclocking of the CPU and RAM (CPU frequency multiplier back to 20x instead of 21x, Extreme Memory Profile disabled). If this fails too, I'll try unplugging my webcam and other USB devices to see if any are behind this (unlikely since they aren't in use at the moment of the crash). If that fails as well, I'm out of ideas.

I also noticed something else: The small freezes (1 to 3 seconds each minute) appear to be unrelated to the permanent hangs. Soon after starting up the machine I opened up Second Life, and while it was running the system would hiccup every now and then. Once I closed SL it stopped, and there were no more problems for 3 more days until the system hang came. I also had SWAP turned off during this test.

Also, here's the Kernel panic I got this time. The message is somewhat different than the first picture I posted:

http://i40.tinypic.com/qsljd4.jpg

onamatic · 01-02-2014, 06:32 PM

AMD Athlon(tm) II X4 620 Processor
RADEON HD4200 Radeon Driver
KDE 4.10.5
Kernel 3.8.0-34-generic

I'm a bit late to the party, but I'll just chip in and say the symptoms sound very similar to those I've experienced recently. I get the occasional freeze when no activity has taken place for an hour or so. Sometimes it's recoverable by flipping to the console and back, which suggests Xorg got disconnected somewhere; sometimes a reboot is the only answer. Yesterday I saw evidence of kernel panics on two processors. On previous wobbly systems I've set nolapic_timer as a kernel parameter, which completely cured all instability; whether it might help an Intel architecture I've no idea, but maybe worth a try?

MirceaKitsune · 01-11-2014, 06:22 PM

After several weeks of observing and testing, I believe it's safe to say I found and solved the problem, and this case is finally closed. I no longer seem to have any system crashes, even after many days of uptime.

As I was away for Christmas and New Year, I used my laptop for two weeks, which has the exact same software including video driver. It reached over 10 days of uptime without any problem. This was the first indication that it wasn't the issue I used to know long ago (fglrx, which crashed both my PC and laptop) and that my desktop computer was the only one affected. At this moment, my PC has also reached 5 days of uptime for the first time.

The problem was the BIOS setting "Extreme Memory Profile". What it does is run the RAM modules at a higher frequency, which is considered safe and supposed to work although it's higher than the factory defaults. Considering this, I didn't suspect it until recently. My RAM modules are meant to work at 1066 MHz, and this put them at 1600 MHz. They do work well like this, but only for a few days at a time.
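To put those numbers in perspective, a quick back-of-the-envelope calculation in Python (the function name is just for illustration) shows how far above the rated frequency the modules were running:

```python
def overclock_percent(rated_mhz, actual_mhz):
    """Percentage by which actual_mhz exceeds rated_mhz."""
    return (actual_mhz - rated_mhz) / rated_mhz * 100.0


# Frequencies quoted in this post: 1066 MHz rated, 1600 MHz with the profile on.
print(round(overclock_percent(1066, 1600), 1))  # prints 50.1
```

So the profile was pushing the memory roughly 50% past what the sticks are rated for, compared to only 5% for the CPU multiplier change (21x instead of 20x).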

SWAP was indeed unrelated. I was linking it to the issue because both the SWAP coming into use and the system crashing happened after a few days of uptime. Also, the temporary freezes (system becoming completely unresponsive for 10 - 30 seconds) are unrelated, and the system always recovers from them. The only annoyance here is that if a program freezes for too long, I get an error that "Klauncher could not communicate to the application" or something.

So yeah... it was overclocking after all. Even if you didn't increase any frequencies manually, be on the lookout for BIOS settings that automatically overclock certain components.