Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
I have often been frustrated by that same problem on Windows, and I've never found any kind of workaround there. It is a flaw in that OS's memory management algorithms, and there doesn't seem to be anything a user can do about it.
I haven't seen much of this behavior in Linux and I never really investigated it in Linux. I expect Linux gives the expert user more control over such things, but I don't know the details of what you ought to adjust.
From your description, I can't quite tell whether you fully understand the problem behavior, so I'll try to fill in some details:
The program is actively using more memory than the OS is letting the program keep resident. So the program is constantly soft faulting pages from the cache into its resident set, while the OS is bumping other pages out of that resident set into the cache. Then those other pages will be soft faulted soon after.
So if the memory management algorithms were tuned better, the process would have a larger resident set and there would be fewer pages pushed back into cache.
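One quick way to see this thrashing in action is to watch a process's fault counters and resident set over time. Here's a minimal sketch using procps `ps` on Linux; the PID used below is just the current shell for illustration, so substitute the PID of your actual job:

```shell
# Substitute your job's PID here; $$ (this shell) is used only for illustration.
PID=$$

# min_flt: soft (minor) faults -- pages pulled back from cache, no disk I/O.
# maj_flt: hard (major) faults -- pages that had to be read from disk.
# rss:     resident set size in kB.
ps -o pid,min_flt,maj_flt,rss -p "$PID"

# Sample again after a pause. If min_flt climbs rapidly between samples
# while rss stays flat, the process is cycling pages between its
# resident set and the cache, as described above.
sleep 1
ps -o pid,min_flt,maj_flt,rss -p "$PID"
```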
Whenever there is a performance problem of any kind, point-in-time snapshots of the system state are usually not particularly helpful. What was the state when the job started? Did it immediately change, or was the change gradual? Was the state constant, or did it change over time?
These are all critical questions to better understanding the overall behavior.
If you really want to take a different approach, and it's really pretty easy, install collectl and turn it on. It will sample almost everything the system is doing and write the results to a log in /var/log/collectl, taking samples every 10 seconds. Not to worry, the load is ~0.1% of a CPU.
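For reference, getting it recording typically looks something like the following. The -i (sample interval) and -f (log location) flags are from collectl's documentation, but check your version's man page; on most distros simply enabling the packaged service does the same thing:

```shell
# Run collectl as a daemon, sampling every 10 seconds and writing
# logs under /var/log/collectl (its usual default location):
sudo collectl -D -i 10 -f /var/log/collectl

# Or, more commonly, just enable the packaged service:
sudo systemctl enable --now collectl    # older systems: service collectl start
```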
Now, wait a day, or at least until your application has run for a while or, better yet, finished - and does the state return to normal when it does?
At this point, if you install collectl-utils on a system that has a webserver running on it, you'll find a nifty tool called colplot. Browse to http://hostname/colplot and you should see colplot start up. You point it at a directory containing collectl plot files and tell it to plot everything. You'll see 24-hour plots of virtually everything your system is doing, and hopefully the answer lies in the data.
btw - this same technique will work with ANY data but you need:
- the data visible as plots
- sufficient types of data: cpu, network, disk, memory at minimum but more types are better
- samples taken at a reasonable frequency and 10 seconds seems to work very well
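If you don't want to install anything, the same idea can be sketched in a few lines of shell. This is a toy sampler, not a collectl replacement; the /tmp log path and the three-iteration loop are just for illustration (a real sampler would loop forever with a 10-second sleep):

```shell
# Toy sampler: one CSV row per sample with a timestamp, the 1-minute
# load average, and free memory. Loops 3 times for illustration only.
LOG=/tmp/sysstats.csv
echo "epoch,loadavg1,mem_free_kb" > "$LOG"
for i in 1 2 3; do
    load=$(cut -d' ' -f1 /proc/loadavg)
    memfree=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    printf '%s,%s,%s\n' "$(date +%s)" "$load" "$memfree" >> "$LOG"
    # in a real sampler: sleep 10
done
cat "$LOG"
```

Plot the resulting CSV with whatever you like; the point is simply regular, timestamped samples of several resource types in one place.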
collectl and collectd are 2 totally different tools. I had never heard of collectd when I wrote collectl and don't even know which came first. collectl is based on Ron Urban's collect tool, which ran on DEC's Tru64 Unix. When Linux was becoming more visible in High Performance Computing, I ported the collect functionality to collectl; hence the name: collect for Linux.
While I can't tell you the difference, the focus of collectl has always been
- support the broadest set of performance counters around (and I think it really does)
- run with a relatively frequent monitoring rate of sampling every 10 seconds, though process sampling runs at once a minute since it is heavier-weight
- be lightweight enough that people will just turn it on and leave it running, and it does tend to use ~0.1% of a CPU
As an aside, it's not unusual to find collectl on some of the largest and fastest clusters in the world. If you look at the list of the top 500 clusters, collectl runs on most, if not all, of the HP systems there.
I am wondering if the user that is running the application is hitting a limit situation. Might want to check into how system resources are allowed to be used by checking ulimit -a as the user that is running the application.
The only relevant limit would be resident set size. I'm pretty sure that in Linux ulimit tracks that value (RLIMIT_RSS), but it isn't backed by any enforcement in the OS.
It appears something is limiting the resident set size, but I'm pretty sure that something isn't the value managed by ulimit.
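For reference, checking the limits is straightforward in bash; the relevant one here is "max memory size" (ulimit -m, i.e. RLIMIT_RSS), which, as noted above, modern Linux kernels record but no longer enforce:

```shell
# Show all resource limits for the current session:
ulimit -a

# Just the max resident set size (RLIMIT_RSS), in kB.
# Usually "unlimited"; even when set, modern kernels ignore it.
ulimit -m
```

Remember to run this as the user that owns the application, since limits are per-user and per-session.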
Thank you all for your help. Interesting to know about collectl; it sounds quite useful, so I'll give it a try.
With respect to the initial post, the behaviour I was reporting happened on three servers at the same time, even though they had already run similar processes from the queue. After stopping the jobs and resubmitting them, they ran flawlessly, so it must have been a transient situation that halted them in that strange state.
I'd bet on something related to the shared file system or the network, but what I don't know yet is why it affected only those three and not the rest. I'll have to keep an eye on them to see if it happens again.
Aha! You raise an interesting question: did this happen at exactly the same time, or approximately the same time? This is why it's so important to run a tool like collectl. It actually synchronizes its sampling down to the msec level, so if you're running ntp on all your machines, every 10 seconds all counters will be sampled within a couple of msec of each other. Then when the problem reoccurs, which it probably will, you will be able to compare behaviors. Perhaps it happens on one machine first and 'spreads' to the others, or maybe it happens simultaneously. Could it be a shared resource like the network? Hard to say without detailed data (and I do mean detailed).