Getting a snapshot of server activity at freeze time
I'm troubleshooting a problem on the head node of our cluster, where the filesystems on an attached SCSI RAID suddenly become read-only. I get errors in /var/log/messages, and I'm working with the vendor to resolve them.
Separately, and less frequently, the machine stops responding to connections, and when I log in at the console I find it bogged down to a crawl and barely responsive. The only way I can recover is a hard reboot. When this happens, I see nothing in /var/log/messages, just a blank gap for the hours before my reboot.
I'm not sure whether the two problems are related, so what I'm looking for is a suggestion for what I should run to get a snapshot of everything going on on the server at any given time: a time-stamped log more excruciatingly verbose than /var/log/messages, so I can see processor load, logged-in users, and more. It should also be separate from the syslog process that drives /var/log/messages, so that when syslog has recorded nothing, I hopefully have another log to look at to piece together why.
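To make the request concrete, here is a rough Python sketch of the kind of logger I have in mind. It is only an illustration under assumptions, not an existing tool: the function names, the 60-second interval, and the log path are all hypothetical. It records the load average and the output of `who`, appends them to its own file without going through syslog, and calls fsync so the last record has a better chance of surviving a hard reboot.

```python
import os
import subprocess
import time


def run(cmd):
    """Return a command's output, or a short note if it cannot be run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        return result.stdout.strip()
    except (OSError, subprocess.SubprocessError) as exc:
        return f"<{cmd[0]} failed: {exc}>"


def take_snapshot():
    """Build one time-stamped record of load average and logged-in users."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        load = " ".join(f"{x:.2f}" for x in os.getloadavg())
    except (OSError, AttributeError):
        load = "unavailable"
    users = run(["who"])
    return f"=== {stamp} ===\nload average: {load}\nusers:\n{users}\n"


def append_snapshot(path):
    """Append a snapshot and force it to disk, independent of syslog."""
    with open(path, "a") as log:
        log.write(take_snapshot())
        log.flush()
        os.fsync(log.fileno())


def run_forever(path, interval=60):
    """Hypothetical driver loop: one snapshot every `interval` seconds."""
    while True:
        append_snapshot(path)
        time.sleep(interval)
```

Something like `run_forever("/var/log/snapshot.log")` would presumably need to write to a local disk rather than the RAID, since the RAID filesystems are the ones going read-only.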
What would be the best tool for this job?