LinuxQuestions.org (/questions/)
-   Linux - Server (https://www.linuxquestions.org/questions/linux-server-73/)
-   -   100% Swap usage with loads of free memory (https://www.linuxquestions.org/questions/linux-server-73/100-swap-usage-with-loads-of-free-memory-849452/)

AlienWolf 12-10-2010 04:13 AM

100% Swap usage with loads of free memory
 
Hi everybody,

I wonder if anybody could help shed light on an obscure issue I faced on one of my servers this morning that very nearly caused it to crash.

Nagios had alerted me that the server had a very high load average, exceeding the critical level (17+). When I logged onto the server I found that all 4GB of swap was in use, despite there being 15GB+ of free memory (and that's not even counting memory in cache and buffers!).

Because it seems all the heavily used pages were being stored in swap, the I/O wait on the server became very high, and 4 kswapd daemons were each taking up nearly 100% of the available CPU.

This coincided with an error reported by Bacula during a backup job while it was changing to a bad tape... From /var/log/bacula.log:

Code:

10-Dec 02:11 bacula-sd JobId 1898: End of medium on Volume "4097" Bytes=434,170,000,000 Blocks=217,084 at 10-Dec-2010 02:11.
10-Dec 02:11 bacula-sd JobId 1898: 3307 Issuing autochanger "unload slot 4097, drive 0" command.
10-Dec 02:12 bacula-sd JobId 1898: 3301 Issuing autochanger "loaded? drive 0" command.
10-Dec 02:12 bacula-sd JobId 1898: 3302 Autochanger "loaded? drive 0", result: nothing loaded.
10-Dec 02:12 bacula-sd JobId 1898: 3304 Issuing autochanger "load slot 4096, drive 0" command.
10-Dec 02:13 bacula-sd JobId 1898: 3305 Autochanger "load slot 4096, drive 0", status is OK.
10-Dec 02:13 bacula-sd JobId 1898: Volume "4096" previously written, moving to end of data.
10-Dec 03:51 bacula-sd JobId 1898: Error: Unable to position to end of data on device "Tape-1" (/dev/IBMtape0n): ERR=dev.c:1384 read error on "Tape-1" (/dev/IBMtape0n). ERR=Input/output error.

10-Dec 03:51 bacula-sd JobId 1898: Marking Volume "4096" in Error in Catalog.
10-Dec 03:51 bacula-sd JobId 1898: 3307 Issuing autochanger "unload slot 4096, drive 0" command.
10-Dec 03:58 bacula-sd JobId 1898: 3301 Issuing autochanger "loaded? drive 0" command.
10-Dec 03:58 bacula-sd JobId 1898: 3302 Autochanger "loaded? drive 0", result: nothing loaded.
10-Dec 03:58 bacula-sd JobId 1898: 3304 Issuing autochanger "load slot 4098, drive 0" command.
10-Dec 03:58 bacula-sd JobId 1898: 3305 Autochanger "load slot 4098, drive 0", status is OK.
10-Dec 03:59 bacula-sd JobId 1898: Wrote label to prelabeled Volume "4098" on device "Tape-1" (/dev/IBMtape0n)
10-Dec 03:59 bacula-sd JobId 1898: New volume "4098" mounted on device "Tape-1" (/dev/IBMtape0n) at 10-Dec-2010 03:59.

At the same time, these messages started occurring in /var/log/messages:

Code:

Dec 10 03:51:47 07 kernel: Mem-info:
Dec 10 03:51:47 07 kernel: Node 0 DMA per-cpu:
Dec 10 03:51:47 07 kernel: cpu 0 hot: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 0 cold: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 1 hot: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 1 cold: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 2 hot: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 2 cold: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 3 hot: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 3 cold: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 4 hot: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 4 cold: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 5 hot: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 5 cold: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 6 hot: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 6 cold: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 7 hot: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: cpu 7 cold: high 0, batch 1 used:0
Dec 10 03:51:47 07 kernel: Node 0 DMA32 per-cpu:
Dec 10 03:51:47 07 kernel: cpu 0 hot: high 186, batch 31 used:162
Dec 10 03:51:47 07 kernel: cpu 0 cold: high 62, batch 15 used:48
Dec 10 03:51:47 07 kernel: cpu 1 hot: high 186, batch 31 used:0
Dec 10 03:51:47 07 kernel: cpu 1 cold: high 62, batch 15 used:0
Dec 10 03:51:47 07 kernel: cpu 2 hot: high 186, batch 31 used:0
Dec 10 03:51:47 07 kernel: cpu 2 cold: high 62, batch 15 used:0
Dec 10 03:51:47 07 kernel: cpu 3 hot: high 186, batch 31 used:18
Dec 10 03:51:47 07 kernel: cpu 3 cold: high 62, batch 15 used:0
Dec 10 03:51:47 07 kernel: cpu 4 hot: high 186, batch 31 used:159
Dec 10 03:51:47 07 kernel: cpu 4 cold: high 62, batch 15 used:56
...
Dec 10 03:51:47 07 kernel: Node 3 HighMem per-cpu: empty
Dec 10 03:51:47 07 kernel: Free pages:      732052kB (0kB HighMem)
Dec 10 03:51:47 07 kernel: Active:4232128 inactive:3071288 dirty:158210 writeback:0 unstable:0 free:183320 slab:256840 mapped-file:289545 mapped-anon:3805487 pagetables:13063
Dec 10 03:51:47 07 kernel: Node 0 DMA free:10796kB min:4kB low:4kB high:4kB active:0kB inactive:0kB present:10356kB pages_scanned:0 all_unreclaimable? yes
Dec 10 03:51:47 07 kernel: lowmem_reserve[]: 0 3512 9067 9067
Dec 10 03:51:47 07 kernel: Node 0 DMA32 free:213332kB min:2500kB low:3124kB high:3748kB active:1794108kB inactive:1463220kB present:3596296kB pages_scanned:64 all_unreclaimable? no
Dec 10 03:51:47 07 kernel: lowmem_reserve[]: 0 0 5555 5555
Dec 10 03:51:47 07 kernel: Node 0 Normal free:41028kB min:3952kB low:4940kB high:5928kB active:3409444kB inactive:1471120kB present:5688320kB pages_scanned:0 all_unreclaimable? no
Dec 10 03:51:47 07 kernel: lowmem_reserve[]: 0 0 0 0
Dec 10 03:51:47 07 kernel: Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Dec 10 03:51:47 07 kernel: lowmem_reserve[]: 0 0 0 0
Dec 10 03:51:47 07 kernel: Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
...

Well, to cut a long story short, I fixed the problem by disabling the swap partition with 'swapoff'. After about 30 minutes all the swap was freed and the server went back to normal. I don't dare reactivate the swap partition, and unfortunately, as this is a live server which currently has no failover, I can't reboot either :(
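
For anyone who hits the same thing, this is roughly the procedure I used (the device path below is just an example - check the output of "swapon -s" for the real one):

Code:

# swapon -s                          # list the active swap devices
# swapoff /dev/VolGroup00/LogVol01   # example path; blocks while pages migrate back to RAM
# watch -n 5 free -m                 # watch the swap usage drain to zero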

Server Spec:
4 * Dual-Core AMD Opteron(tm) Processor 8214
32GB DDR2 ECC RAM
RHEL 5.5, 2.6.18-194.11.3.el5 SMP x86_64
Running many KVM VMs (all CentOS x86_64); the kksmd daemon is in use.
bacula-dir Version: 5.0.0
IBM Tape Drive using lin_tape module version 1.34.0 according to modinfo

And before anybody asks ;)
# sysctl vm.swappiness
vm.swappiness = 10
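
(For the record, that's set persistently in /etc/sysctl.conf and reloaded with "sysctl -p" - the lines below are just what I'd expect to see, not a paste from this box:)

Code:

# grep swappiness /etc/sysctl.conf
vm.swappiness = 10
# sysctl -p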

Any insight would be greatly appreciated.

AlucardZero 12-10-2010 07:07 AM

Those lines from /var/log/messages look like part of a kernel oops or BUG. Can you post their start? A snapshot of "free" and "top" would also have helped. For example, "top" would have told us whether your load was in iowait (which it very well might have been, due to waiting for the bad tape).
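
If it happens again, grab the state before you fix it - something along these lines would do (the file names are just examples):

Code:

# free -m      > /tmp/free-$(date +%F-%H%M).txt
# top -b -n 1  > /tmp/top-$(date +%F-%H%M).txt
# vmstat 5 6   > /tmp/vmstat-$(date +%F-%H%M).txt

The si/so columns in the vmstat output tell you whether the box was swapping in, swapping out, or both.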

AlienWolf 12-13-2010 03:32 AM

1 Attachment(s)
Attached is the full log from /var/log/messages, from the time the problem occurred until it was fixed.

I think the first few stack traces are just because the bacula process was hung waiting for the tape to wind to the correct location, not because it had actually crashed. After those there are a few page allocation failures from the lin_tape module - maybe that could be the culprit?

Unfortunately I didn't think to save the output of top or free. I do remember that the iowait% was very high, maybe around 80%. I don't think it was caused by bacula waiting for the tape - an iostat would have confirmed that - rather, I think the iowait was due to memory being paged from swap back to RAM. I also remember that there were 4 processes (kswapd0-3) each taking approximately 99% CPU. The free output also showed that swap was 100% used while there was at least 15GB of RAM free, excluding buffers and cache.
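
For next time, something like this is what I'd run to separate tape wait from swap I/O (the interval and count are arbitrary):

Code:

# iostat -x 5 3    # %util and await per block device - the swap disk vs. everything else
# vmstat 5 5       # si = memory swapped in per second, so = memory swapped out per second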

AlienWolf 12-14-2010 04:30 AM

Got the problem occurring again... help!

kswapd0 has high CPU usage and is causing about 7% iowait. The server is getting a bit sluggish, but nothing critical.
If I enable swap with swapon, the swap gets written to very quickly, and that makes the problem worse. I disabled the swap again with swapoff.
The cache seems to be increasing very quickly, which is unusual for this server. Here is the output from top:

Code:

top - 10:27:00 up 56 days,  1:18,  2 users,  load average: 4.90, 5.16, 4.95
Tasks: 239 total,  3 running, 236 sleeping,  0 stopped,  0 zombie
Cpu(s):  0.2%us, 27.2%sy,  0.0%ni, 66.7%id,  5.6%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  33032524k total, 22336544k used, 10695980k free,    76996k buffers
Swap:        0k total,        0k used,        0k free,  5817564k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                 
19416 root      15  0 4476m 4.1g 664m S 70.7 12.9  2478:03 qemu-kvm                                                               
  573 root      11  -5    0    0    0 D 51.5  0.0  94:08.77 kswapd0                                                                 
 2968 root      10  -5    0    0    0 R 37.9  0.0  14073:21 kksmd                                                                   
 6661 root      15  0 2804m 2.5g 274m S 37.9  8.0  7042:30 qemu-kvm                                                               
19559 root      15  0 4892m 1.7g 703m S 12.6  5.5 264:29.56 qemu-kvm                                                               
 6430 root      15  0 2791m 2.5g 419m S  7.6  8.0  3803:21 qemu-kvm                                                               
 5106 root      10  -5    0    0    0 S  1.0  0.0  23:42.15 gfs2_logd                                                               
    1 root      15  0 10352  196  88 S  0.0  0.0  0:34.74 init                                                                   
    2 root      RT  -5    0    0    0 S  0.0  0.0  0:02.07 migration/0                                                             
    3 root      34  19    0    0    0 S  0.0  0.0  0:01.19 ksoftirqd/0                                                             
    4 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 watchdog/0                                                             
    5 root      RT  -5    0    0    0 S  0.0  0.0  0:20.57 migration/1                                                             
    6 root      34  19    0    0    0 S  0.0  0.0  0:27.44 ksoftirqd/1                                                             
    7 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 watchdog/1                                                             
    8 root      RT  -5    0    0    0 S  0.0  0.0  0:34.27 migration/2                                                             
    9 root      34  19    0    0    0 S  0.0  0.0  0:01.77 ksoftirqd/2                                                             
  10 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 watchdog/2                                                             
  11 root      RT  -5    0    0    0 S  0.0  0.0  0:54.93 migration/3                                                             
  12 root      34  19    0    0    0 S  0.0  0.0  0:03.16 ksoftirqd/3                                                             
  13 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 watchdog/3                                                             
  14 root      RT  -5    0    0    0 S  0.0  0.0  0:06.03 migration/4                                                             
  15 root      34  19    0    0    0 S  0.0  0.0  0:15.67 ksoftirqd/4                                                             
  16 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 watchdog/4                                                             
  17 root      RT  -5    0    0    0 S  0.0  0.0  0:01.04 migration/5                                                             
  18 root      34  19    0    0    0 S  0.0  0.0  0:01.70 ksoftirqd/5                                                             
  19 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 watchdog/5                                                             
  20 root      RT  -5    0    0    0 S  0.0  0.0  0:00.30 migration/6

Just imagine the cache increasing at about 1MB a second.
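
(To put numbers on it, I'm watching the growth with something like this:)

Code:

# watch -n 1 'grep -E "^(MemFree|Buffers|Cached|SwapCached)" /proc/meminfo'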

Before this happened...
I removed the lin_tape module and updated to version 1.48 of the lin_tape module and daemon.
I then restarted bacula and started a quick job to test the tape.
The tape was loaded in the changer and started winding to the correct location; at this point the kswapd0 process started taking about 50% CPU and the cache started increasing. I tried stopping bacula-sd with no success; the last resort is to rmmod lin_tape...
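
For reference, the module reload itself was roughly this (init script names may differ on your install):

Code:

# /etc/init.d/bacula-sd stop          # stop whatever is holding the tape device open
# lsmod | grep lin_tape               # make sure the use count is 0
# rmmod lin_tape
# modprobe lin_tape
# modinfo lin_tape | grep -i version  # confirm the new version loaded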

Help :)

AlienWolf 12-14-2010 12:21 PM

Well there has been an interesting turn of events...

After stopping all non-critical VMs and services on the server, the kswapd daemons all went down to 0% CPU and 0% iowait.
I then restarted bacula and performed a backup. Interestingly, I noticed that as soon as bacula-sd started winding the tape to the correct location, all cache was purged from memory.
I'm thinking this cache purge is what caused kswapd to start aggressively trying to cache memory again, causing a high load on the server.
Well, bacula finished the backup successfully, and eventually, when the cache had filled back to around 600MB, kswapd went down to 0% CPU and 0% iowait again.
I restarted all of the services and VMs I had shut down earlier, and now the server is back to normal :D

I won't call this solved yet because when bacula decides to change tapes next this could all kick off again... we will see.

BTW, my understanding of kswapd and memory cache may be way off. If it is I hope someone will enlighten me :)

syg00 12-14-2010 08:33 PM

kswapd doesn't deal with (page) cache directly - and certainly doesn't "cache memory again" (if you mean what I think you might mean).

High kswapd usage is (usually) a symptom of another, deeper problem - it's trying either:
- to swap pages out due to a low-memory event,
- or to swap pages in because a storage reference caused a page fault on a swapped-out page.
Which can, of course, lead to thrashing.
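
You can usually tell which of the two from vmstat:

Code:

# vmstat 5

A sustained nonzero "so" column means pages are being pushed out (a low-memory event); a sustained "si" means swapped-out pages are being faulted back in. Both at once is the thrashing case.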

You look like you're having problems in the DMA zone - I can't imagine you'd ever want to see "all_unreclaimable? yes". The lin_tape module might indeed be a good first suspect. Of course, it could be an innocent bystander trying to help, just like kswapd. Stopping the VMs having that effect looks a bit suspect.

I wonder if this is exacerbated by the AMD processors being separate NUMA nodes - that means the memory is split up as well, so no single task (bacula) has access to all of the (physical) DMA{32} region.
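
If numactl is installed, this would show how the memory is actually split across the nodes, and /proc/buddyinfo shows what each zone has left:

Code:

# numactl --hardware    # per-node memory sizes and free memory
# cat /proc/buddyinfo   # free page counts per node, zone and allocation order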

