RHEL Server Rel 5.4 freezes with large jobs

ArthurGoldberg · 02-25-2010, 07:06 PM

Hello

We're running
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
$ uname -r
2.6.18-164.11.1.el5

It hosts an Apache/2.2.3 web server. We also run apache-tomcat-5.5.23. Most of our programs are mod_perl. Sometimes our users input over-sized data sets, or queries that generate too much output. (I realize that we should try to prevent them from doing that, but right now I'm looking for a more general solution.)

When a large job runs it can 'freeze' our system. The system becomes unresponsive to everything, including command-line commands. Sometimes it unfreezes after a while. Once, in this situation, I was able to create a high-priority shell. top reported:

PID USER PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
349 root 10 -5    0   0   0 R 37.7  0.0 5:53.56 kswapd1
348 root 20 -5    0   0   0 R 35.8  0.0 5:57.67 kswapd0

It froze again yesterday, sometime around 14:50; around that time /var/log/messages shows:

Feb 24 14:34:47 ourMachine avahi-daemon[3133]: Invalid query packet.
Feb 24 14:35:18 ourMachine last message repeated 6 times
Feb 24 14:35:18 ourMachine last message repeated 2 times
Feb 24 14:35:32 ourMachine setroubleshoot: SELinux is preventing the http daemon from connecting to network port 3306 For complete SELinux messages. run sealert -l 0afcfa46-07b8-48eb-aec3-e7dda9872b84
Feb 24 14:35:34 ourMachine avahi-daemon[3133]: Invalid query packet.
Feb 24 14:55:06 ourMachine last message repeated 6 times
Feb 24 15:00:44 ourMachine last message repeated 3 times
Feb 24 15:00:55 ourMachine last message repeated 5 times
Feb 24 15:01:21 ourMachine dhclient: DHCPREQUEST on eth0 to 128.122.128.24 port 67
Feb 24 15:09:51 ourMachine kernel: hald-addon-stor invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Feb 24 15:09:51 ourMachine kernel:
Feb 24 15:09:51 ourMachine kernel: Call Trace:
Feb 24 15:09:51 ourMachine kernel: [<ffffffff800c6076>] out_of_memory+0x8e/0x2f3
Feb 24 15:09:51 ourMachine kernel: [<ffffffff8000f487>] __alloc_pages+0x245/0x2ce
Feb 24 15:09:51 ourMachine kernel: [<ffffffff80017812>] cache_grow+0x133/0x3c1
Feb 24 15:09:51 ourMachine kernel: [<ffffffff8005c2e5>] cache_alloc_refill+0x136/0x186
Feb 24 15:09:51 ourMachine kernel: [<ffffffff8000ac12>] kmem_cache_alloc+0x6c/0x76
Feb 24 15:09:51 ourMachine kernel: [<ffffffff80012658>] getname+0x25/0x1c2
Feb 24 15:09:51 ourMachine kernel: [<ffffffff80019cba>] do_sys_open+0x17/0xbe
Feb 24 15:09:51 ourMachine kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Feb 24 15:09:51 ourMachine kernel:
Feb 24 15:09:51 ourMachine kernel: Mem-info:
Feb 24 15:09:48 ourMachine dhclient: DHCPREQUEST on eth0 to 128.122.128.24 port 67
Feb 24 15:03:45 ourMachine avahi-daemon[3133]: Invalid query packet.
Feb 24 15:09:54 ourMachine kernel: Node 0 DMA per-cpu:
Feb 24 15:09:54 ourMachine dhclient: DHCPREQUEST on eth0 to 128.122.128.24 port 67
Feb 24 15:09:54 ourMachine avahi-daemon[3133]: Invalid query packet.
Feb 24 15:09:54 ourMachine kernel: cpu 0 hot: high 0, batch 1 used:0
Feb 24 15:09:55 ourMachine dhclient: DHCPACK from 128.122.128.24
Feb 24 15:09:55 ourMachine avahi-daemon[3133]: Invalid query packet.
Feb 24 15:09:55 ourMachine kernel: cpu 0 cold: high 0, batch 1 used:0
Feb 24 15:09:56 ourMachine kernel: cpu 1 hot: high 0, batch 1 used:0
Feb 24 15:09:56 ourMachine kernel: cpu 1 cold: high 0, batch 1 used:0
Feb 24 15:09:56 ourMachine kernel: cpu 2 hot: high 0, batch 1 used:0
Feb 24 15:09:56 ourMachine kernel: cpu 2 cold: high 0, batch 1 used:0
Feb 24 15:09:56 ourMachine kernel: cpu 3 hot: high 0, batch 1 used:0
Feb 24 15:09:56 ourMachine kernel: cpu 3 cold: high 0, batch 1 used:0
Feb 24 15:09:57 ourMachine kernel: Node 0 DMA32 per-cpu:
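The OOM-killer line is easy to lose among the avahi and dhclient noise. A minimal sketch for pulling just those events out of syslog-format text follows; oom_events is a hypothetical helper name, and it assumes the "Mon DD HH:MM:SS host kernel: NAME invoked oom-killer" layout seen in the excerpt above.

```shell
#!/bin/sh
# Print the timestamp and the triggering process name for each
# OOM-killer event found in syslog-format text on stdin.
# Assumed layout: "Mon DD HH:MM:SS host kernel: NAME invoked oom-killer: ..."
oom_events() {
    awk '/invoked oom-killer/ { print $1, $2, $3, $6 }'
}
```

Something like `oom_events < /var/log/messages` would list the events. Note that the process named here is the one whose allocation failed, which is not necessarily the one using the memory; hald-addon-stor is probably just an unlucky bystander.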

Observing the machine, I see at least one very busy disk. I suspect that some high-priority system process (perhaps kswapd) is using all the CPUs, preventing anything else from running. Unfortunately, I cannot find much info on kswapd, or on debugging this problem.
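For context, kswapd is the kernel thread that reclaims memory pages when free memory runs low. Both kswapd threads being busy, together with a busy disk and the oom-killer message, is consistent with the box running out of RAM and swapping heavily. One way to check is to watch free memory and swap usage while a large job runs. The sketch below reads /proc/meminfo-format text on stdin; mem_summary is a hypothetical helper name, and it relies only on the MemFree, SwapTotal and SwapFree fields.

```shell
#!/bin/sh
# Summarise free RAM and swap in use from /proc/meminfo-format text on stdin.
# All values in /proc/meminfo are reported in kB.
mem_summary() {
    awk '
        $1 == "MemFree:"   { free = $2 }
        $1 == "SwapTotal:" { stotal = $2 }
        $1 == "SwapFree:"  { sfree = $2 }
        END { printf "MemFree=%d kB SwapUsed=%d kB\n", free, stotal - sfree }
    '
}
```

Running `mem_summary < /proc/meminfo` every few seconds from the high-priority shell should show whether SwapUsed climbs as the system slows down. The si and so columns of `vmstat 5` give a similar picture of swap traffic.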

Thanks
Arthur

John VV · 02-25-2010, 07:30 PM

Well, mod_perl can bog a system down,
but I routinely work with 5 GB to 9 GB (or bigger) imaging data sets on CentOS 5.4.

Can you define what you consider a "too large" file, and what they are doing with it?

mesiol · 02-26-2010, 12:34 AM

Hi,

You can limit the resources for the user running apache/tomcat to keep the system from becoming completely unresponsive.
Take a look at ulimit.

ArthurGoldberg · 02-26-2010, 11:14 AM

Thanks, folks.

I investigated ulimit. These are our current ulimits:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 135168
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 135168
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Our box has 16 GB RAM. Right now ps (ps -ale --sort=-vsize) says (I added commas for readability - wish code would do that):

S UID  PID PPID C PRI NI       RSS        SZ WCHAN TTY     TIME CMD
S  48 3490 3484 1  75  0 7,861,380 2,071,971 stext ?   00:35:10 httpd
S  48 3569 3484 0  76  0 1,217,284   402,132 stext ?   00:05:51 httpd
S   0 3200    1 1  79  0   140,716   336,470 stext ?   00:26:26 java
S  48 3488 3484 0  76  0   364,932   199,448 stext ?   00:01:23 httpd
S  48 3571 3484 0  75  0   312,572   175,107 stext ?   00:01:46 httpd

where
RSS = resident set size, the non-swapped physical memory that a task has used (in kilobytes).
SZ = size in physical pages of the core image of the process, including text, data, and stack space. (This is the sz field that ps -l prints, in pages rather than kilobytes; the rough swap-space estimate is the separate size field.)
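A quick unit check on the two columns, under the assumption that the SZ column printed by ps -l is the sz field, which the ps man page describes as a size in pages rather than kilobytes. The 4 KiB page size is also an assumption (`getconf PAGESIZE` reports the real value in bytes), and pages_to_kib is a hypothetical helper name.

```shell
#!/bin/sh
# Convert a page count to KiB.
# $1 = number of pages, $2 = page size in KiB (defaults to 4, an assumption).
pages_to_kib() {
    echo $(( $1 * ${2:-4} ))
}

# Largest httpd above: SZ of 2,071,971 pages
# pages_to_kib 2071971    # prints 8287884
```

That is roughly 7.9 GiB of total size against about 7.5 GiB resident, so nearly all of that process is sitting in RAM on a 16 GB box.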

The biggest httpd seems too big. Perhaps it allocated a bunch of memory and never freed it.

I think that the important ulimit options for us are:

-d The maximum size of a process’s data segment
-l The maximum size that may be locked into memory
-v The maximum amount of virtual memory available to the shell

I'm thinking of trying a 2 GB limit on data segments with "ulimit -d 2000000".

For example, see http://httpd.apache.org/docs/2.0/vhosts/fd-limits.html:
#!/bin/sh
ulimit -S -n 100
exec httpd

We could start httpd with
#!/bin/sh
ulimit -d 2000000
exec /usr/sbin/apachectl restart

Then processes won't be able to exceed a 2 GB data segment.
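Two caveats on that plan, offered as a sketch rather than a tested recipe. First, as far as I know, on kernels of this generation the -d limit (RLIMIT_DATA) only counts heap grown through brk/sbrk, and glibc's malloc can fall back to mmap, so a process may still grow past it; -v (RLIMIT_AS) caps the whole address space and is usually the more effective knob. Second, limits are inherited when a process is created, so an already-running httpd parent keeps its old limits if it is only signalled to restart; a full stop followed by a start is needed. The helper below, run_limited, is a hypothetical name, and the 2000000 KiB figure is simply carried over from above.

```shell
#!/bin/sh
# Run a command under a virtual-memory (address space) cap given in KiB.
# $1 = limit in KiB, remaining arguments = command to run.
run_limited() {
    limit_kib=$1
    shift
    (
        # The subshell keeps the limit from leaking into the calling shell.
        ulimit -v "$limit_kib" || exit 1
        exec "$@"
    )
}
```

After stopping httpd completely, something like `run_limited 2000000 /usr/sbin/apachectl start` would start it with the cap in place. Virtual size includes shared libraries and mapped files, so 2 GB may turn out to be too tight for a mod_perl httpd; it is a starting point to adjust, not a recommendation.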

For the system call, see http://linux.die.net/man/2/setrlimit, and the underlying calls, "brk, sbrk - change data segment size" at
http://www.kernel.org/doc/man-pages/...an2/brk.2.html.

Comments?