First thought is that the hardware is old and tired and any kind of serious load kills it.
Somewhat related idea: a bug in the driver for one of the two network devices is killing the machine, and increased activity triggers the bug.
A few ideas for tracking down the problem:
Write a small script to simply log the fact that the box is up. Pay careful attention to the time (and make sure ahead of time that both boxes have good time) when you reboot it. If the script continued to log right up until the time you rebooted it, the box isn't completely dead.
Quote:
#!/bin/bash
# Append a timestamp once a second; the last line shows when the box was last alive.
while :
do
    date >> /tmp/lastup
    sleep 1
done
Also look for gaps in that log. An occasional gap of one second is nothing to worry about, but if there are long gaps, or lots and lots of one and two second gaps, that means the system's getting bogged down very heavily.
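Eyeballing a log with one line per second gets old fast, so here's a rough sketch of a gap finder. It assumes GNU date (for "date -d") and that every line in the log is something "date -d" can parse back; some time zone abbreviations can't be, and lines like that are simply skipped. The function name find_gaps and the default threshold are my own picks, not anything standard.

```shell
#!/bin/bash
# find_gaps LOGFILE [THRESHOLD]
# Print every pair of consecutive timestamps that are more than THRESHOLD
# seconds apart (default 1, i.e. report any skipped second).
find_gaps() {
    local log=$1 threshold=${2:-1}
    local prev="" prev_line="" line now
    while IFS= read -r line; do
        # An empty string would be parsed as "today at midnight", so skip it.
        [ -n "$line" ] || continue
        # Assumption: GNU date. Convert the logged line back to epoch seconds;
        # lines that cannot be parsed are skipped.
        now=$(date -d "$line" +%s 2>/dev/null) || continue
        if [ -n "$prev" ] && [ $((now - prev)) -gt "$threshold" ]; then
            echo "$((now - prev))s between $prev_line and $line"
        fi
        prev=$now
        prev_line=$line
    done < "$log"
}
```

Something like "find_gaps /tmp/lastup 3" would then list only the pauses longer than three seconds, which are the ones worth worrying about.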
If it stops when you first see the symptoms of the box's "death", then it really is locking up hard. That points to either a serious kernel bug (almost certainly in one of the network drivers) or hardware. (My money would be on hardware.)
If the box dies slowly, a trick I've found to be extremely useful is to write a script containing a bunch of commands (examples: "vmstat 5 5", "netstat", "ps -ef", "tail -20 /var/log/messages", "ifconfig", etc.) that you would LIKE to have had the chance to run when the thing started screwing up, with the output appended to a file. Run this script every 5, 10, 15 minutes -- whatever's appropriate to how long it takes to get messed up. (The closer together you're going to run it, the shorter it should be.)
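A minimal sketch of what such a script might look like. The command list is the one from above; the names SNAPSHOT_CMDS and take_snapshot and the log path in the usage note are just placeholders I made up, so change them to suit.

```shell
#!/bin/bash
# Commands you wish you could have run when things went wrong.
# Keep the list short if you plan to run this often ("vmstat 5 5" alone takes 25 seconds).
SNAPSHOT_CMDS=(
    "vmstat 5 5"
    "netstat"
    "ps -ef"
    "tail -20 /var/log/messages"
    "ifconfig"
)

# take_snapshot LOGFILE
# Append one timestamped record holding the output of every command to LOGFILE.
take_snapshot() {
    local log=$1 cmd
    {
        echo "===== snapshot at $(date) ====="
        for cmd in "${SNAPSHOT_CMDS[@]}"; do
            echo "--- $cmd ---"
            # Run each entry through a shell so it can carry its own arguments,
            # and keep going if a command is missing or fails.
            bash -c "$cmd" 2>&1 || echo "(command failed or not installed: $cmd)"
        done
    } >> "$log"
}
```

To use it, add a call such as "take_snapshot /var/log/snapshot.log" (hypothetical path) at the bottom and run the script from cron at whatever interval fits. Pick a log location that survives a reboot, since some systems clean out /tmp at boot.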
Finally, put an entry in /etc/syslog.conf with the selector "*.debug" and point it at a separate file (better not to use your /var/log/messages file), then restart syslogd or send it a HUP so it rereads its config. That might get you some info on the problem that you currently can't see.
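For illustration, the entry could look something like the fragment below. The file name /var/log/debug.log is only an example, and I'm assuming a classic syslogd here; some older versions insist on tabs rather than spaces between the two columns, and some want the target file to exist before they will write to it.

```
# Log everything at debug priority and above, from every facility, to its own file
*.debug						/var/log/debug.log
```

Expect that file to grow quickly, so keep an eye on disk space while it's enabled.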
Hope this helps,
CHL