Old 09-05-2011, 10:15 PM   #1
nwrk
LQ Newbie
 
Registered: Sep 2011
Location: New Caledonia
Posts: 8

Rep: Reputation: Disabled
Very high SLAB usage, hard to understand


Hi there,

I'm not used to asking for help, but this time I have followed every piece of information I could find and never found a solution, so I'm trying here since there seem to be a lot of knowledgeable people.

On a virtual server hosted under VMware, I have some LXC containers hosting a dozen PHP websites. It's not a very high load for a VM with 2 CPUs at 2.5GHz and 3GB of memory, but it is swapping... My analysis is the following:

Code:
www2 ~ # uname -a
Linux www2 2.6.38-gentoo-r6 #9 SMP Fri Jun 24 14:28:08 NCT 2011 x86_64 Intel(R) Xeon(R) CPU E5420 @ 2.50GHz GenuineIntel GNU/Linux

www2 ~ # free -m
             total       used       free     shared    buffers     cached
Mem:          3018       2915        103          0         11         38
-/+ buffers/cache:       2864        153
Swap:         4095        916       3179
As you can see, a lot of swap space is used even though there is no real reason for it. In fact, if I sum up all the cgroups' used memory, it comes to this:

Code:
www2 ~ # sum=0; for f in /cgroup/*/memory.usage_in_bytes; do sum=$[$sum+$(<$f)]; done; echo $[sum / 1024 / 1024]
259
Meaning that I have 259MB of memory used by the containers. After investigating, I found that the missing memory is going into the SLAB allocator, which should be fine too. atop gives the following stats:

Code:
MEM | tot    2.9G | free  121.2M | cache  33.5M | buff    9.2M | slab    2.4G |
So the SLAB is eating a lot of memory, which wouldn't be a problem if its space were being reclaimed, since the kernel advertises ~2GB of reclaimable SLAB space:

Code:
www2 / # grep SReclaimable /proc/meminfo 
SReclaimable:    2177344 kB
Furthermore, slabtop gives me the following information :

Code:
 Active / Total Objects (% used)    : 6541983 / 6578453 (99.4%)
 Active / Total Slabs (% used)      : 615834 / 615841 (100.0%)
 Active / Total Caches (% used)     : 131 / 227 (57.7%)
 Active / Total Size (% used)       : 2324507.76K / 2329858.41K (99.8%)
 Minimum / Average / Maximum Object : 0.02K / 0.35K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
2507580 2504636  99%    0.19K 125379       20    501516K dentry

2491500 2491494  99%    0.61K 415250        6   1661000K inode_cache
And this is where I'm stuck: I can't find a way to identify what is holding 99% of the inode_cache and dentry slabs. I'm sorry if I missed a link somewhere before asking, but I really think I have exhausted the information Google could give me about this.

I would be grateful if anyone can point me to a tool that gives more information about what is holding the SLAB objects, or to a way to make the kernel reclaim the space it claims is reclaimable.
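
(For reference, the only generic knob I know of is drop_caches; a minimal test, assuming the documented behaviour where mode 2 asks the kernel to free reclaimable slab objects such as dentries and inodes, would be something like this -- though I'd rather understand what is pinning the objects than blindly drop caches:)

Code:
www2 ~ # sync
www2 ~ # echo 2 > /proc/sys/vm/drop_caches    # ask the kernel to drop dentry and inode caches
www2 ~ # grep SReclaimable /proc/meminfo      # check whether the reclaimable slab actually shrank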

Thanks a lot for (at least) reading !

Last edited by nwrk; 09-05-2011 at 10:39 PM.
 
Old 09-06-2011, 04:37 PM   #2
mulyadi.santosa
Member
 
Registered: Sep 2011
Posts: 96

Rep: Reputation: 15
Hi nwrk

Reading your post, especially your slabtop output (I assume you used the default sorting criterion, which is the number of objects), I would say that your Linux server is doing quite a lot (maybe a massive amount!) of file reads and writes.

In short, every time a file is read its inode is cached; this cache is called the inode_cache (icache). Furthermore, since a file can only be reached by resolving the directory it resides in, the directory entry is read and cached as well; that cache is named dentry (directory entry).
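
If you want to look at those two caches directly, the standard procfs view is enough (just a sketch; reading /proc/slabinfo usually needs root):

Code:
grep -E '^(dentry|inode_cache)' /proc/slabinfo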

Add to that the fact that your "cached" amount (shown by free) is low, and I would deduce that the I/O is so frequent that your slab allocator (that is what kernel folks call it; it's like a cache manager for kernel data structures) decides not to release the memory. That could be fine, but it could also be a leak.

I suggest, if possible, upgrading the kernel to the latest longterm stable release. It might also be useful to switch to another slab allocator (usually SLOB, afaik) and see if that helps. For that change you need to recompile the kernel yourself.
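
To check which allocator your current kernel was built with, something like this should do (a sketch; it assumes either /proc/config.gz is enabled or that the configured source tree of the running kernel sits at /usr/src/linux, as is common on Gentoo):

Code:
zgrep -E 'CONFIG_SL[AUO]B=' /proc/config.gz         # if CONFIG_IKCONFIG_PROC is enabled
grep -E 'CONFIG_SL[AUO]B=' /usr/src/linux/.config   # or read the build config directly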

Hope it helps....
 
Old 09-06-2011, 09:19 PM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,125

Rep: Reputation: 4120
If you're running Gentoo, you have a kernel source tree. Have a look at ../Documentation/vm/slub.txt - I haven't tried this debug info, but it certainly looks promising. Might add it to my "to-do" list sometime ...
Unless you've done something very specific/silly, you *will* be using the SLUB allocator.
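
From a quick skim, the interesting part (untested by me, so treat it only as a sketch) is booting with user tracking enabled for the suspect cache and then reading the per-cache files under /sys/kernel/slab, e.g.:

Code:
# append to the kernel command line, then reboot (one cache at a time):
slub_debug=U,dentry
# afterwards, list where the live objects were allocated from:
cat /sys/kernel/slab/dentry/alloc_calls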

There have been examples of problems with the allocator (though I haven't noticed one reported for a while) - if you feel you have hit one, that's a reportable kernel bug.
 
Old 09-07-2011, 12:54 AM   #4
nwrk
LQ Newbie
 
Registered: Sep 2011
Location: New Caledonia
Posts: 8

Original Poster
Rep: Reputation: Disabled
Hi syg00 and mulyadi.santosa, thanks for your useful answers.

I didn't know the SLUB allocator was the new default; I hadn't changed that option and it was set to "SLAB". I must admit I was wondering about SLUB, as its description states "SLUB can use memory efficiently and has enhanced diagnostics", which looks really interesting in my case.

Upgrading to the latest mainline kernel was also going to be my next move; it sadly requires a reboot, but that seems necessary given your answers. Anyway, when the memory fills up with caches, the load goes to 60 or more and there is no solution other than a reboot.

Regarding what looks like a bug: slabtop reports 99% usage of the caches, around 2.5M objects in both the inode_cache and dentry caches, while lsof -n | wc -l tells me there are at most 19k open files. If you think I've hit a bug because of that, I will report it on the LKML; but those are people I only disturb as a last resort, because they are dangerous.
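
(If it helps, these are the kernel's own counters I plan to compare against the lsof figure next time the memory fills up; nothing fancy, just the standard procfs files:)

Code:
www2 ~ # cat /proc/sys/fs/dentry-state   # nr_dentry, nr_unused, age_limit, ...
www2 ~ # cat /proc/sys/fs/inode-nr       # nr_inodes, nr_free_inodes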

For now I will follow your good advice (which also confirms my intuitions), build a vanilla 3.0.4 (instead of the 2.6.38-gentoo) with SLUB, and reboot. I'll tell you about the results, but expect a delay of around 14 days before I can confirm whether caches that should be reclaimed are in fact not being reclaimed.

Thanks again and see you !

Last edited by nwrk; 09-07-2011 at 04:56 AM.
 
Old 09-07-2011, 02:51 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,125

Rep: Reputation: 4120
You really need to be on the SLUB allocator - a lot of effort (over years now) has been put into improving its efficiency and slab consolidation.
I don't think I'd want to be disturbing the residents on LKML with SLAB issues unless you can prove the problem(s) also affect SLUB. As you say, they can be a bit touchy ...

Will wait with interest.
 
1 members found this post helpful.
Old 09-07-2011, 02:58 AM   #6
mulyadi.santosa
Member
 
Registered: Sep 2011
Posts: 96

Rep: Reputation: 15
Hi nwrk

Well, sometimes reporting it to the LKML is a good move. But since you mention the -gentoo kernel, I think you should report it to the Gentoo kernel dev team first. I'm not saying you have to; it's just that if you do, and it is indeed a corner-case bug, you indirectly help probably hundreds, maybe thousands, of people who might have the same workload as you.

Regarding the slab allocator choice: under normal conditions any of SLAB, SLUB or SLOB should be fine. What I see here looks like a leak, and that could happen with any allocator you choose. Since you mention LXC, I suspect it could come from LXC rather than from the bare Linux kernel.

Ehm, about the kernel version, I still suggest you pick the latest longterm stable release. The reason is that its stability is usually better than a plain "stable" release; after all, the fixes are backported from the latest stable into it. As we speak you could choose between 2.6.35.14, 2.6.34.10, 2.6.33.19, or 2.6.32.46. Also pay attention to the "Kernel hacking" section during kernel configuration: the items there usually chew up extra memory, so pick them carefully.
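
For a quick look at which debug options are currently switched on (a rough sketch, assuming the source tree of the running kernel sits at /usr/src/linux):

Code:
grep '^CONFIG_DEBUG' /usr/src/linux/.config | grep '=y$'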
 
1 members found this post helpful.
Old 09-07-2011, 04:48 AM   #7
nwrk
LQ Newbie
 
Registered: Sep 2011
Location: New Caledonia
Posts: 8

Original Poster
Rep: Reputation: Disabled
Okay, then I'll just try changing the slab implementation from SLAB to SLUB and see. I'll stick with 2.6.38-gentoo-r6, which is the kernel + Gentoo patchset that the Gentoo people considered stable when I built the server (it is 2.6.39-r3 now). FYI, these kernels include bugfix patches too.

If the problem persists, I'll try to contact the Gentoo kernel maintainers as you suggested. That will be a good step before going to the LKML.

(off subject: there seems to be a problem resolving kernel.org right now; tried from New Caledonia and France... I'm still lucky it seems...)

Last edited by nwrk; 09-08-2011 at 10:53 PM.
 
Old 09-07-2011, 05:14 AM   #8
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,125

Rep: Reputation: 4120
Same here - maybe they took it down for a rebuild after they were compromised.
 
Old 09-07-2011, 09:38 PM   #9
mulyadi.santosa
Member
 
Registered: Sep 2011
Posts: 96

Rep: Reputation: 15
Hi

Quote:
Originally Posted by nwrk View Post
Okay, then I'll just try changing the slab implementation from SLAB to SLUB and see. I'll stick with 2.6.38-gentoo-r6, which is the kernel + Gentoo patchset that the Gentoo people considered stable when I built the server (it is 2.6.39-r3 now). FYI, these kernels include bugfix patches too.
Good choice, I think. Picking one that is supported by your distro of choice will make life easier. After all, 3.x is still quite new.

Quote:
Originally Posted by nwrk View Post
(off subject: there seems to be a problem resolving kernel.org right now; tried from New Caledonia and France... I'm still lucky it seems...)
I got a slowdown too, and I agree it might have something to do with the recent compromise.
 
Old 09-18-2011, 03:02 AM   #10
nwrk
LQ Newbie
 
Registered: Sep 2011
Location: New Caledonia
Posts: 8

Original Poster
Rep: Reputation: Disabled
Hi again,

It doesn't look any better:

Code:
www2 ~ # free -m
             total       used       free     shared    buffers     cached
Mem:          3018       2822        195          0         44        101
-/+ buffers/cache:       2677        341
Swap:         4095        351       3744
www2 ~ # atop |grep MEM
MEM | tot    2.9G | free  177.9M | cache 104.9M | buff   47.3M | slab    1.8G |
I'm upgrading to the latest stable gentoo kernel (2.6.39-gentoo-r3) to see.
 
Old 09-18-2011, 09:34 AM   #11
mulyadi.santosa
Member
 
Registered: Sep 2011
Posts: 96

Rep: Reputation: 15
Hi....

Quote:
Originally Posted by nwrk View Post
Code:
www2 ~ # atop |grep MEM
MEM | tot    2.9G | free  177.9M | cache 104.9M | buff   47.3M | slab    1.8G |
I'm upgrading to the latest stable gentoo kernel (2.6.39-gentoo-r3) to see.
Sheesh, 1.8 GiB for slab... that's a lot. Could you run:
slabtop -s c
and let us know the five biggest cache names on your system?

PS: So far I still sense a leak somewhere, but pinning it down might need deeper Linux kernel memory tracing, which can be an unpleasant and quite complicated task.

NB: AFAIK the only tool I can recall is "kmemleak", but that would slow your whole machine down by several orders of magnitude, so I am not sure it would be feasible here.
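
If you ever do want to try it, the rough workflow (from memory, so treat it as a sketch; it assumes a kernel rebuilt with CONFIG_DEBUG_KMEMLEAK=y) is:

Code:
mount -t debugfs nodev /sys/kernel/debug   # if debugfs is not already mounted
echo scan > /sys/kernel/debug/kmemleak     # trigger an immediate scan
cat /sys/kernel/debug/kmemleak             # list suspected leaks with their stack traces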
 
1 members found this post helpful.
Old 09-18-2011, 02:59 PM   #12
nwrk
LQ Newbie
 
Registered: Sep 2011
Location: New Caledonia
Posts: 8

Original Poster
Rep: Reputation: Disabled
Hi,

Thanks again for the answer. I forgot to take a "shot" of slabtop, but it's still the same: ratios like 65% inode_cache, 33% dentry and maybe 1% for everything else, with 99% to 100% usage. I suppose that usage percentage is calculated through something like a refcount, and since LXC is still quite new, maybe there's a problem releasing the references (unshared refcounts?). The structure is quite simple and LXC-like: the filesystems are mounted by the main system and bind-mounted into the containers' namespaces. I also have one BTRFS volume. Since BTRFS and LXC have both been improved in Linux 3.0, I think I'll upgrade if the problem is not solved, even if that kernel is not in Gentoo's stable branch -- I have an evil nerd side

FWIW :

Code:
www2 ~ # mount |sed -e 's/vg-h_[^ ]*/vg-h_***/' -e 's,/lxc/[^/]*/,/lxc/***/,'
rootfs on / type rootfs (rw)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=10240k,nr_inodes=386055,mode=755)
devpts on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)
/dev/sda2 on / type ext2 (rw,noatime,user_xattr,acl,barrier=1,data=ordered)
rc-svcdir on /lib64/rc/init.d type tmpfs (rw,nosuid,nodev,noexec,relatime,size=1024k,mode=755)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime)
none on /cgroup type cgroup (rw)
/dev/mapper/vg-usr on /usr type ext4 (rw,noatime)
/dev/mapper/vg-portage on /usr/portage type ext4 (rw,noatime)
/dev/mapper/vg-distfiles on /usr/portage/distfiles type ext4 (rw,noatime)
/dev/mapper/vg-home on /home type ext4 (rw,noatime)
/dev/mapper/vg-opt on /opt type ext4 (rw,noatime)
/dev/mapper/vg-tmp on /tmp type ext4 (rw,noatime)
/dev/mapper/vg-var on /var type ext4 (rw,noatime)
/dev/mapper/vg-vartmp on /var/tmp type ext4 (rw,noatime)
/dev/mapper/vg-hosting_base on /home/hosting/system type btrfs (ro,noatime,compress=lzo)
/dev/mapper/vg-hosting_user--data on /home/hosting/user-data type ext4 (ro,noatime)
/home/hosting/system on /home/hosting/template/system type none (ro,bind)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
/dev/mapper/vg-h_*** on /home/hosting/lxc/***/user-data type ext4 (rw,relatime,barrier=1)
usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=0664,devgid=85)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw,noexec,nosuid,nodev)
 
Old 09-19-2011, 10:13 AM   #13
mulyadi.santosa
Member
 
Registered: Sep 2011
Posts: 96

Rep: Reputation: 15
Hi again...

Quote:
Originally Posted by nwrk View Post
Hi,

Thanks again for the answer. I forgot to take a "shot" of slabtop, but it's still the same: ratios like 65% inode_cache, 33% dentry and maybe 1% for everything else, with 99% to 100% usage.
Alright, now I am quite sure it's a memory leak... in... uhm, btrfs? ext4 is quite stable, so I would point my suspicion toward btrfs. Maybe you enabled some feature that somehow holds onto something for a long time; snapshots (like ZFS has) maybe?

Regarding LXC, it could be an amplifying factor, or another factor acting at the same time. Are you suffering the same thing on any other machine that uses btrfs but not LXC?

Quote:
Originally Posted by nwrk View Post

FWIW :

Code:
/dev/mapper/vg-hosting_base on /home/hosting/system type btrfs (ro,noatime,compress=lzo)
Wait, wait, wait: "compress"? Hmmm, could that be the problem?

PS: care to award me a reputation point?
 
1 members found this post helpful.
Old 09-19-2011, 03:18 PM   #14
nwrk
LQ Newbie
 
Registered: Sep 2011
Location: New Caledonia
Posts: 8

Original Poster
Rep: Reputation: Disabled
Hello,

Yeah, I thought there could be something with the still-young btrfs; my approach was "well, it's read-only and I could benefit quite a lot from a fast compression like LZO". I think you're right to point at it, because even read-only it may hold references and thus cause memory leaks. Maybe it's the compression, but I'm not sure.

My other BTRFS filesystem, on another host, is used as a "buffer" (because I don't trust BTRFS for now): it's mounted with the compress option (defaulting to zlib) to get faster checks of Oracle datafiles. The MD5 and SHA1 sums on the files are always good, and the block-level checks are good too (Oracle maintains checksums at the block level). The difference is that this filesystem is unmounted after use, to keep another safe copy of the backup until the next backup. On that host, slabtop gives me this:

Code:
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 22780   8028  35%    0.19K   1139       20      4556K dentry
   180    142  78%    0.61K     30        6       120K inode_cache

So, following your good advice, I'll let the memory grow for some time and then try an unmount to see what happens. That will be some night work, so please give me some time; the check I have in mind is sketched just below.
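
(Roughly what I plan to run on that host; the mount point name below is made up, and I'm assuming slabtop's -o one-shot flag behaves the same there:)

Code:
slabtop -o -s c | head -n 15   # snapshot sorted by cache size, before the unmount
umount /mnt/backup-buffer      # hypothetical mount point of the btrfs "buffer"
slabtop -o -s c | head -n 15   # compare dentry / inode_cache afterwards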

Quote:
Originally Posted by mulyadi.santosa View Post
PS: care to award me reputation point ?
You mean post rating, right ?
 
Old 09-19-2011, 08:14 PM   #15
mulyadi.santosa
Member
 
Registered: Sep 2011
Posts: 96

Rep: Reputation: 15
Quote:
Originally Posted by nwrk View Post
Hello,

Yeah I though there could something with the quite new btrfs; my approach was "well, it's read only and I could benefit quite a lot of a fast compression like LZO". I think you're right pointing it, because even in read-only it may hold references, thus causing memory leaks. Maybe compression but I'm not sure.
I don't have a very strong belief about it, but I do suspect it's something related to btrfs. Mind you, even Fedora 16 plans to delay using btrfs as the default filesystem. That might have something to do with this kind of stability issue, perhaps...


Quote:
Originally Posted by nwrk View Post
The difference is that this filesystem is unmounted after use, to keep another safe copy of the backup until the next backup. On that host, slabtop gives me this:

Code:
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 22780   8028  35%    0.19K   1139       20      4556K dentry
   180    142  78%    0.61K     30        6       120K inode_cache
OK, I assume the field to the left of the cache name is the cache size... that looks sane to me. And the percentage is the usage percentage (active objects vs. total objects in that cache), I guess. Again, looks sane...

So, to summarize so far:
A high slab percentage, especially for dentry and inode_cache, is actually not rare. The thing is, as I noticed in your earlier posts, it forces swapping. So these slabs (or maybe others) are probably being kept (or pinned) in RAM, and the prime suspect here is your btrfs. LXC could be an amplifying factor too.


Quote:
Originally Posted by nwrk View Post
You mean post rating, right ?
Yes, please (if I have helped you somehow)
 
  

