Help! Forking errors!

strider · 04-24-2002, 10:22 AM

I've got some problems. Here's the situation: I upgraded an old RedHat 6.1 machine with some new hardware and installed RedHat 7.2. I've got about 120-130 users on this machine and several databases of info. The problem is that in the morning, when everyone is logging on, I start getting "ksh: cannot fork" or "fork: cannot allocate memory" errors when trying to do just about anything (e.g. ls, who, uptime, top). At first I thought it might be swap space, but I have 2.0G of swap, which is twice my RAM. Then I thought it might be the number of telnet instances allowed by xinetd, so I made it unlimited. I am still having the problem, but it only seems to happen in the morning. Has anyone had this problem before? Thanks in advance for your help.

akohlsmith · 04-24-2002, 10:26 AM

You mention that you've changed your ulimits -- you may have to check that it's taking effect and that something isn't changing it back on you.

How many processes are running on the system in the morning? With that many users and all those databases you may just be running out of PIDs. I believe there are 64-bit PID patches for the kernel and system libraries.

strider · 04-24-2002, 10:32 AM

Here is something from /var/log/messages:

Apr 24 08:25:37 copper xinetd[9273]: telnet: fork failed: Cannot allocate memory (errno =12)
Apr 24 08:25:37 copper xinetd[9273]: service telnet: too many consecutive fork failures
Apr 24 08:25:38 copper xinetd[9273]: pop3: fork failed: Cannot allocate memory (errno = 12)

Here is the top when this is happening (sorry for the mess):

8:28am up 12 days, 22:53, 58 users, load average: 0.26, 0.18, 0.11
656 processes: 655 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 0.3% user, 0.2% system, 0.0% nice, 99.0% idle
CPU1 states: 2.0% user, 2.0% system, 0.0% nice, 95.3% idle
Mem: 1028412K av, 1021992K used, 6420K free, 192K shrd, 62576K buff
Swap: 2040244K av, 0K used, 2040244K free 540496K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
19783 mitchsi 15 0 1428 1428 836 R 3.0 0.1 0:00 top
19786 valerca 10 0 1816 1816 1156 S 0.5 0.1 0:00 fglgo
19787 valerca 10 0 1504 1504 1256 S 0.5 0.1 0:00 sqlexec
19333 alicihe 10 0 1896 1896 1212 S 0.3 0.1 0:00 fglgo
200 root 10 0 0 0 0 SW 0.1 0.0 0:07 kjournald
19600 kimmc 9 0 1864 1864 1216 S 0.1 0.1 0:00 fglgo
19732 root 9 0 820 820 684 S 0.1 0.0 0:00 in.telnetd
1 root 8 0 520 520 452 S 0.0 0.0 0:07 init

And finally from xinetd.conf:
#
# Simple configuration file for xinetd
#
# Some defaults, and include /etc/xinetd.d/

defaults
{
instances = 100
log_type = SYSLOG authpriv
log_on_success = HOST PID
log_on_failure = HOST
cps = 25 30
}

includedir /etc/xinetd.d

Hope this helps you help me!

strider · 04-24-2002, 10:45 AM

I don't think I changed ulimits, just the telnet file under /etc/xinetd.d, and then I restarted xinetd. Maybe that was what you were referring to? Well, I checked the file and the change is still there. Also, I'm not so sure that it would be processes: as you can see from my top there are around 650 right now (during the problem), but I have had well over 1300 on the machine at times with no problems whatsoever. It's an annoying problem.

akohlsmith · 04-24-2002, 11:01 AM

Offhand, what are the limits for the system? (ulimit -Ha and ulimit -Sa)

Can you simulate the problem (like a controlled DoS attack) at all? What are the /proc/sys/fs/file-nr and inode-nr values normally and in the morning? What is swap usage like? (/proc/meminfo)

Sorry for all the questions but this one is intriguing. :-)

strider · 04-24-2002, 11:23 AM

No problem, interest is good.

Here is output from ulimit -Ha and -Sa:

mitchsi on copper.bdsn.com = > ulimit -Ha
time(cpu-seconds) unlimited
file(blocks) unlimited
coredump(blocks) 0
data(kbytes) unlimited
stack(kbytes) unlimited
lockedmem(kbytes) unlimited
memory(kbytes) unlimited
nofiles(descriptors) 1024
processes 4095

mitchsi on copper.bdsn.com = > ulimit -Sa
time(cpu-seconds) unlimited
file(blocks) unlimited
coredump(blocks) 0
data(kbytes) unlimited
stack(kbytes) unlimited
lockedmem(kbytes) unlimited
memory(kbytes) unlimited
nofiles(descriptors) 1024
processes 4095

We thought maybe it had something to do with mail (people checking mail in the morning), but we can't seem to replicate it once it has gone away (the machine's fine now by the way).

file-nr: 23885 7055 49152
inode-nr: 153647 0
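
For anyone reading those file-nr numbers, here is a minimal Python sketch of how they break down. It assumes the three fields are allocated handles, free handles, and the system maximum, which is the layout proc(5) describes for kernels of this era; the function name is made up for illustration.

```python
def parse_file_nr(text):
    """Split a /proc/sys/fs/file-nr line into named fields.

    Assumed field order: allocated handles, free handles, system maximum.
    """
    allocated, free, maximum = (int(field) for field in text.split())
    return {
        "allocated": allocated,
        "free": free,
        "max": maximum,
        # Handles actually in use are the allocated ones that are not free.
        "in_use": allocated - free,
        # How close the allocation is to the system-wide ceiling.
        "pct_of_max": 100.0 * allocated / maximum,
    }


# Values copied from the file-nr line above.
stats = parse_file_nr("23885 7055 49152")
print(stats["in_use"], "handles in use,",
      round(stats["pct_of_max"], 1), "percent of the maximum allocated")
```

Under that reading the system is at roughly half of its file-handle ceiling, so file handles do not look like the bottleneck.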

And right now meminfo says that swap is not being used up:

total: used: free: shared: buffers: cached:
Mem: 1053093888 1042399232 10694656 196608 30420992 452923392
Swap: 2089209856 0 2089209856
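
A rough Python sketch of what the Mem: line above implies. It uses the common rule of thumb that buffers and page cache can be reclaimed by the kernel when programs need memory; that is an assumption, not a guarantee, and the function name is hypothetical.

```python
def summarize_mem(total, used, free, buffers, cached):
    """Estimate how much RAM programs really hold versus what is reclaimable.

    All arguments are byte counts from the Mem: line of /proc/meminfo.
    """
    # Sanity check: on this layout, used + free should add up to total.
    consistent = (used + free == total)
    # Rule of thumb: buffers and cache can be dropped under memory pressure.
    reclaimable = free + buffers + cached
    return {
        "consistent": consistent,
        "reclaimable": reclaimable,
        "held_by_programs": total - reclaimable,
    }


# Values copied from the Mem: line above.
mem = summarize_mem(total=1053093888, used=1042399232, free=10694656,
                    buffers=30420992, cached=452923392)
print("reclaimable MiB:", mem["reclaimable"] // (1024 * 1024))
print("held by programs MiB:", mem["held_by_programs"] // (1024 * 1024))
```

By this estimate a large share of the "used" figure is cache rather than program memory, which fits with swap sitting at zero.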

akohlsmith · 04-24-2002, 12:36 PM

I was doing some reading on forking problems but everything points to memory (apparently the error really is telling the truth, go figure) :-)

I was hoping it was something simple like running out of inodes or file descriptors and the error message was just wrong, but some research seems to go against this idea. You sure seem to have a lot of activity on that system though with inode #s like that!

I can only suggest turning off swap and running mkswap -c on the swap partition to check it for bad blocks. You've been warned: 2G will take a while to check. :-)

From what I've been reading this can be caused when the virtual memory system has been corrupted (bad blocks on swap, etc.) -- what is the kernel and uptime of the machine?

strider · 04-24-2002, 03:24 PM

Hmmm... I may have to try that mkswap -c. The problem is that I don't want to disrupt users, and I don't want to reboot if I don't have to.

Current uptime is 13 days, 77 users, load avg. 0.21, 0.18, 0.18

Thanks for all of your help, btw.

akohlsmith · 04-24-2002, 04:24 PM

Since swap isn't doing any good anyway (and you have none used, which is VERY odd if it was turned on at boot) then swapoff -a won't disrupt anyone, and the mkswap/swapon -a won't affect anyone. (unless it's on the same channel as your main drive, in which case go dunk your head in a toilet and fix it.)

I'd let it run overnight, you shouldn't have any troubles.

aside: normally when there is no memory, init (I think it's init, it may be the kernel itself) will kill off the biggest memory hogs in sequence to free up memory, so the whole fork error is strange to begin with.

aside 2: the reason I say it's very odd that 0 bytes of swap are used (assuming it was turned on at boot) is because Linux will push less-used pages to swap when it runs out of memory, but it will NOT page them back in until they're needed. That almost ALWAYS means that once some swap is used, there is always some showing as used. I wouldn't mind seeing a pagein daemon that gently pages things back in if memory is free, since at least on desktops you end up with a big delay because vmware got paged out since you were doing a compile or something. :-)

strider · 04-25-2002, 11:58 AM

Well, I broke down and called RedHat Linux support. Their response was that it is a problem with the kernel that I am running. We were running the Enterprise kernel, which I guess is optimized for over 4GB of memory. We should be using the SMP kernel. I think this may be the solution, as the machine that is working is running SMP, but the two which are not are running Enterprise. I will be rebooting the machines at lunch and crossing my fingers.
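
A small Python sketch for checking which kernel flavor a machine is running. It assumes the release string reported by the kernel ends in "enterprise" or "smp" for those builds, as in a hypothetical "2.4.7-10enterprise"; that naming is an assumption, so compare against your own uname -r output.

```python
import platform


def kernel_flavor(release):
    """Guess the kernel flavor from a release string's suffix.

    The suffixes are assumed naming conventions, not an official list.
    """
    for suffix in ("enterprise", "smp"):
        if release.endswith(suffix):
            return suffix
    return "other"


# platform.release() returns the same string as `uname -r`.
print(platform.release(), "->", kernel_flavor(platform.release()))
```

Running it on each box gives a quick way to confirm that the failing machines and the working one really do differ in kernel flavor.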

akohlsmith · 04-25-2002, 12:22 PM

... Custom frikkin' kernels. UGH I never even thought to ask about that.

Good luck, I hope this solves your problem!