Hi dear fellows,
I'm afraid I have a huge issue with my newest Fedora 14 server. I recently migrated to Fedora 14 from CentOS 5, which was very stable but had ancient packages and libraries, and my users were revolting...
The machine is a HP ProLiant 360 G7, with 12G RAM and 6 SAS drives in RAID 5.
After I migrated to Fedora 14, I noticed that for some reason, over the course of about 24 hours, all usable RAM "disappears" and applications are forced down to swap space. Needless to say, I didn't have this issue on CentOS. The server does heavy IO as per its function (it's a heavily loaded file processing server and user simulation computing station among other things, which causes lots of random IO), so I thought it might be the cache, but then I realized it cannot be, because Linux uses only "unused" RAM for caching and frees it up as soon as an app needs it.
Then I thought to check "slabtop" to see what's going on in kernel memory. Unfortunately I don't have a screenshot from just before the latest crash, but there's one cache displayed by slabtop which slowly, byte by byte, creeps over all available RAM, eventually forcing applications down to swap. This is kmalloc-64, and as you can see from the copy-paste below, it's building up again even now...
Code:
Active / Total Objects (% used) : 9118075 / 9153600 (99.6%)
Active / Total Slabs (% used) : 152157 / 152157 (100.0%)
Active / Total Caches (% used) : 75 / 94 (79.8%)
Active / Total Size (% used) : 704083.01K / 718307.90K (98.0%)
Minimum / Average / Maximum Object : 0.01K / 0.08K / 11.19K
   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
8176704 8176427  99%    0.06K 127761       64    511044K kmalloc-64
 554814  553957  99%    0.10K  14226       39     56904K buffer_head
  82824   79807  96%    0.55K   2856       29     45696K radix_tree_node
  73836   66062  89%    0.19K   1758       42     14064K dentry
  23940   19264  80%    0.19K    570       42      4560K kmalloc-192
  22016   21908  99%    0.01K     43      512       172K kmalloc-8
  19737   19737 100%    0.08K    387       51      1548K sysfs_dir_cache
......
By this time tomorrow, all 12GB RAM will just about be eaten up by kmalloc-64, and even a simple SSH login won't be possible.
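To put numbers on the growth rate, the cache size can be sampled over time. The following is only a minimal sketch: it assumes the usual /proc/slabinfo column order (name, active_objs, num_objs, objsize, ...), and the function names and log path are made up for illustration. Reading /proc/slabinfo normally requires root.

```shell
#!/bin/sh
# Print the approximate size in KB of one slab cache, computed as
# num_objs * objsize (columns 3 and 4 of a slabinfo-format file).
# Assumption: the file follows the standard /proc/slabinfo layout.
slab_kb() {
    cache="$1"
    file="${2:-/proc/slabinfo}"
    awk -v name="$cache" '$1 == name { printf "%d\n", $3 * $4 / 1024 }' "$file"
}

# Append one "epoch-seconds cache-name size-in-KB" sample to a log file.
# Arguments: cache name, log file, optional slabinfo-format file.
log_slab_kb() {
    printf '%s %s %s\n' "$(date +%s)" "$1" "$(slab_kb "$1" "${3:-/proc/slabinfo}")" >> "$2"
}
```

Run as root, something like `while sleep 600; do log_slab_kb kmalloc-64 /var/tmp/kmalloc-64.log; done` would record one sample every ten minutes, which should make it possible to line the growth up with whatever the box was doing at the time.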
Nothing I tried in order to force reclaiming of this memory was successful. Among other things, I tried reloading X, changing runlevels, restarting all kinds of processes... and I went as far as trying ugly things like:
echo 1 > /proc/sys/vm/zone_reclaim_mode
echo 150 > /proc/sys/vm/vfs_cache_pressure
sync; echo 3 > /proc/sys/vm/drop_caches
... to no avail whatsoever. Nothing I did had any effect on this thing short of a clean reboot.
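One possible explanation for why none of this helps: as far as I know, drop_caches only releases page cache and reclaimable slab objects (dentries and inodes), while the generic kmalloc-* caches are normally accounted as unreclaimable slab. A quick way to check the split is to read the SReclaimable and SUnreclaim fields from /proc/meminfo. A minimal sketch, where the function name is just illustrative:

```shell
#!/bin/sh
# Print the reclaimable and unreclaimable slab totals (in kB)
# from a meminfo-format file, /proc/meminfo by default.
slab_split() {
    file="${1:-/proc/meminfo}"
    awk '$1 == "SReclaimable:" || $1 == "SUnreclaim:" { print $1, $2 }' "$file"
}
```

If SUnreclaim is the figure that keeps climbing alongside kmalloc-64, then no cache-dropping knob is likely to free it, and the thing to look for is whatever kernel code is allocating and never freeing.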
I thought that some user app running on the box might be causing this... so I denied access to everyone for one day and allowed only the regular file processing load to go on. The problem persisted: even the most basic system daemons were eventually forced into swap, and then the box was dead.
FYI, my kernel: "Linux gepard.unixhosting.local 2.6.35.11-83.fc14.x86_64 #1 SMP Mon Feb 7 07:06:44 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux"
Obviously I need some help on either: 1) how to force-reclaim this space periodically, or better still, 2) find out what's causing this if this is my fault and get this over with...
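For option 2, one avenue that might work, assuming this kernel uses the SLUB allocator (the kmalloc-64 naming suggests so) and is booted with user tracking enabled, e.g. `slub_debug=U,kmalloc-64` on the kernel command line: SLUB then exposes per-cache allocation call sites in /sys/kernel/slab/kmalloc-64/alloc_calls, one line per call site, starting with an object count. A rough sketch for ranking them follows; the function name and defaults are illustrative, not anything standard:

```shell
#!/bin/sh
# Show the N call sites holding the most objects in a SLUB cache.
# Each alloc_calls line begins with an object count, so a reverse
# numeric sort puts the heaviest allocator first.
# Assumption: SLUB with slub_debug=U active for this cache.
top_alloc_sites() {
    cache="${1:-kmalloc-64}"
    n="${2:-10}"
    file="${3:-/sys/kernel/slab/$cache/alloc_calls}"
    sort -rn "$file" | head -n "$n"
}
```

If one call site dominates and its count keeps rising between runs, the function name on that line should point at the driver or subsystem responsible.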
Please, anyone who has a clue on what could be going on, help me. I've exhausted my Google options and I'm completely stuck.