How can I find out when a server goes down?

eldhochacko · 09-03-2009, 11:29 PM

When I checked my server this morning, I found that it had hung.
How can I tell whether it hung or not?
Which log can I check to find the exact time?

JulianTosh · 09-03-2009, 11:30 PM

You'll need another server to monitor it. Usually via pings or some other kind of service test.
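As a rough illustration of the "service test" idea, here is a minimal Python sketch that a second machine could run. It only checks whether a TCP port accepts connections. The function name `service_is_up` and the default timeout are my own choices, not something from this thread.

```python
import socket


def service_is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling something like `service_is_up("myserver", 22)` from the monitoring box every few minutes, and writing down the time of each failure, gives a rough timeline of when the service stopped answering. The host name and port here are placeholders.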

Check /var/log/messages for the last message to see if there's anything useful.

eldhochacko · 09-03-2009, 11:34 PM

I checked that file, but I couldn't find anything in it that shows the exact time the server went down or hung.

JulianTosh · 09-03-2009, 11:35 PM

If the server freezes... there's not much you can do about that. Check application logs for the last timestamped entry to guesstimate when it went down - and perhaps why too.

That's why it's nice to have another system monitoring it, so you can get a timeline of its state: memory usage, CPU usage, disk usage, etc.
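To find the "last timestamped entry" without scrolling by hand, something like the Python sketch below might help. It assumes classic syslog lines that start with `Mon D HH:MM:SS`, that month names parse in an English locale, and that you pass the year yourself, because those lines do not contain one. The function names and the six-hour threshold are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line: str, year: int) -> datetime:
    """Parse the leading 'Mon D HH:MM:SS' stamp of a classic syslog line."""
    # Raises ValueError if the line has fewer than three fields
    # or the fields are not a valid timestamp.
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, year, threshold=timedelta(hours=6)):
    """Return (before, after) timestamp pairs separated by more than threshold."""
    gaps = []
    previous = None
    for line in lines:
        try:
            current = parse_syslog_time(line, year)
        except ValueError:
            continue  # blank lines, wrapped lines, anything without a stamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Running `find_gaps(open("/var/log/messages"), 2009)` returns each pair of timestamps with a long silence in between. The first timestamp of a pair is the last thing the box logged before going quiet.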

eldhochacko · 09-04-2009, 12:15 AM

Hi Beotch,

Thanks for your reply.

I'm sending my server's system log details. I manually restarted the server on the morning of September 3rd.

We were working in our application until 3 AM on Sep 1,

but nothing from that period shows up in the messages log.

What could cause the log entries for Sep 1 and Sep 2 to be missing?

Aug 31 00:08:46 cginq01 last message repeated 3 times
Aug 31 07:04:12 cginq01 last message repeated 3 times
Aug 31 08:17:49 cginq01 last message repeated 3 times
Aug 31 11:45:44 cginq01 last message repeated 2 times
Aug 31 12:18:49 cginq01 last message repeated 3 times
Aug 31 12:29:36 cginq01 last message repeated 3 times
Aug 31 12:31:58 cginq01 last message repeated 3 times
Aug 31 12:35:56 cginq01 last message repeated 3 times
Aug 31 12:42:09 cginq01 last message repeated 2 times
Aug 31 12:44:12 cginq01 last message repeated 3 times
Aug 31 12:56:44 cginq01 last message repeated 3 times
Aug 31 12:58:47 cginq01 last message repeated 3 times
Aug 31 13:16:35 cginq01 last message repeated 3 times
Aug 31 13:18:23 cginq01 last message repeated 3 times
Aug 31 13:22:10 cginq01 last message repeated 3 times
Aug 31 13:24:31 cginq01 last message repeated 3 times
Aug 31 13:30:31 cginq01 last message repeated 3 times
Aug 31 15:24:28 cginq01 last message repeated 3 times
Aug 31 15:35:20 cginq01 last message repeated 3 times
Aug 31 16:01:04 cginq01 last message repeated 2 times
Aug 31 17:13:44 cginq01 last message repeated 3 times
Aug 31 17:18:30 cginq01 last message repeated 3 times
Sep 3 10:07:10 cginq01 syslogd 1.4.1: restart.
Sep 3 10:07:10 cginq01 audispd: starting audispd
Sep 3 10:07:10 cginq01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 3 10:07:10 cginq01 kernel: Linux version 2.6.18-53.el5 (brewbuilder@hs20-bc1-7.build.redhat.com) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Wed Oct 10 16:34:19 EDT 2007
Sep 3 10:07:10 cginq01 kernel: Command line: ro root=LABEL=/ rhgb quiet
Sep 3 10:07:10 cginq01 kernel: BIOS-provided physical RAM map:
Sep 3 10:07:10 cginq01 kernel: BIOS-e820: 0000000000000000 - 000000000009ac00 (usable)
Sep 3 10:07:10 cginq01 kernel: BIOS-e820: 000000000009ac00 - 00000000000a0000 (reserved)
Sep 3 10:07:10 cginq01 kernel: BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
Sep 3 10:07:10 cginq01 kernel: BIOS-e820: 0000000000100000 - 00000000bffcb440 (usable)
Sep 3 10:07:10 cginq01 kernel: BIOS-e820: 00000000bffcb440 - 00000000bffceac0 (ACPI data)
Sep 3 10:07:10 cginq01 kernel: BIOS-e820: 00000000bffceac0 - 00000000c0000000 (reserved)
Sep 3 10:07:10 cginq01 kernel: BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
Sep 3 10:07:10 cginq01 kernel: BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
Sep 3 10:07:10 cginq01 kernel: BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
Sep 3 10:07:10 cginq01 kernel: DMI 2.4 present.
Sep 3 10:07:10 cginq01 rpc.statd[2805]: Version 1.0.9 Starting
Sep 3 10:07:10 cginq01 kernel: SRAT: PXM 0 -> APIC 0 -> Node 0
Sep 3 10:07:10 cginq01 kernel: SRAT: PXM 0 -> APIC 1 -> Node 0
Sep 3 10:07:10 cginq01 kernel: SRAT: PXM 0 -> APIC 2 -> Node 0
Sep 3 10:07:10 cginq01 kernel: SRAT: PXM 0 -> APIC 3 -> Node 0
Sep 3 10:07:10 cginq01 kernel: SRAT: Node 0 PXM 0 0-c0000000
Sep 3 10:07:10 cginq01 kernel: SRAT: Node 0 PXM 0 0-140000000
Sep 3 10:07:10 cginq01 kernel: SRAT: Node 0 PXM 0 0-1000000000
Sep 3 10:07:10 cginq01 kernel: SRAT: hot plug zone found 140000000 - 1000000000
Sep 3 10:07:10 cginq01 kernel: SRAT: Hotplug region ignored
Sep 3 10:07:10 cginq01 kernel: Bootmem setup node 0 0000000000000000-0000000140000000
Sep 3 10:07:10 cginq01 kernel: Memory for crash kernel (0x0 to 0x0) notwithin permissible range
Sep 3 10:07:10 cginq01 kernel: disabling kdump
Sep 3 10:07:10 cginq01 rpc.statd[2805]: statd running as root. chown /var/lib/nfs/statd/sm to choose different user
Sep 3 10:07:10 cginq01 kernel: ACPI: PM-Timer IO Port: 0x588
Sep 3 10:07:10 cginq01 kernel: ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Sep 3 10:07:10 cginq01 kernel: Processor #0 6:15 APIC version 20
Sep 3 10:07:10 cginq01 kernel: ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Sep 3 10:07:10 cginq01 kernel: Processor #1 6:15 APIC version 20
Sep 3 10:07:10 cginq01 kernel: ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
Sep 3 10:07:10 cginq01 kernel: Processor #2 6:15 APIC version 20
Sep 3 10:07:10 cginq01 kernel: ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
Sep 3 10:07:10 cginq01 kernel: Processor #3 6:15 APIC version 20
Sep 3 10:07:10 cginq01 kernel: ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])

Best Regards
Chacko

JulianTosh · 09-04-2009, 12:21 AM

I'm gonna say your server went down shortly after Aug 31 17:18:30.

Go back further... I want to see what message was being repeated so many times.

Also, what kind of server is it? A web server? If it is, tail the httpd logs and post them too.

eldhochacko · 09-04-2009, 12:28 AM

Hi Beotch,

Thanks.

This is not a web server; we are running SAP ECC6 on it (Red Hat Linux 5).

We were working in the SAP application until 3 AM on Sep 1.

Best Regards
Chacko

JulianTosh · 09-04-2009, 12:33 AM

Go back further in /var/log/messages and grab the log entries. We need to see what message was repeating, so skip back until you see something other than "last message repeated..."
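If skipping back by hand is tedious, a small Python sketch like this one can pull out whichever message each "last message repeated N times" line refers to. It assumes the repeat notice always refers to the nearest earlier real line in the same file. The function name and the sample message in the usage note are made up for illustration.

```python
import re
from collections import Counter

# Matches the repeat notice written by syslogd and captures the count.
REPEAT = re.compile(r"last message repeated (\d+) times?")


def repeated_messages(lines):
    """Total the repeat counts for each real message that precedes a repeat notice."""
    totals = Counter()
    last_real = None
    for line in lines:
        line = line.rstrip("\n")
        match = REPEAT.search(line)
        if match is None:
            if line.strip():
                last_real = line  # remember the newest non-repeat line
        elif last_real is not None:
            # Drop the 'Mon D HH:MM:SS host ' prefix so identical messages group together.
            text = last_real.split(None, 4)[-1]
            totals[text] += int(match.group(1))
    return totals
```

For example, `repeated_messages(open("/var/log/messages")).most_common(5)` lists the five messages with the highest repeat totals, which is what we want to see here.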

JulianTosh · 09-04-2009, 01:35 AM

Ah, OK then... well, the evidence, if any, might be in another application log. Put together a list of the services/daemons that are running on that machine (ssh, iptables, etc.) and start going through their logs for the time that the primary service went down. If you get a list of those services, I can help you look up where they typically store their own log files.

Also, I'd still be curious about what those repeated messages were.

canyonbreeze · 09-04-2009, 12:33 PM

You can install Webmin. It has a function to notify you if your Apache, Postfix, MySQL, etc, goes down. My setup sends a message to my cell phone via email.

JulianTosh · 09-04-2009, 04:35 PM

I love webmin and that's an OK solution for monitoring local services, but it won't help if the server goes down hard. In this case, it needs to be monitored by a separate server and monitoring service that is immune to any volatile states the watched server/service is currently experiencing.

But if that's all you got, it's better than nothing! 8D

eldhochacko · 09-23-2009, 11:46 PM

Hi Beotch,

My Quality Linux server restarted again yesterday morning. What is the actual reason this Linux server keeps restarting again and again?

rsciw · 09-24-2009, 01:29 PM

Quote:

Originally Posted by Admiral Beotch

I love webmin and that's an OK solution for monitoring local services, but it won't help if the server goes down hard. In this case, it needs to be monitored by a separate server and monitoring service that is immune to any volatile states the watched server/service is currently experiencing.

But if that's all you got, it's better than nothing! 8D

Munin imo is also a nice monitoring tool.
Handy to see when something dies off via the graphs

(that said, I don't know Webmin, but I'll check it out)

kschmitt · 09-24-2009, 02:44 PM

You need two things: a monitoring system, and a syslog server. One tells you when there's a problem, the other is used to diagnose the problem.

The monitoring system could be really simple, like putting a script in crontab that pings each server and emails you a list of the ones that don't reply.

or

The monitoring system could be insanely complex and feature-rich, like Zenoss or Hyperic or something.
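For the simple end of that range, a minimal Python sketch of the ping-from-cron idea might look like this. It assumes the Linux `ping` command (`-c` for count, `-W` for timeout in seconds). The function names are hypothetical. It only builds the report text and leaves the emailing to cron, which mails any output a job prints.

```python
import subprocess


def ping(host: str) -> bool:
    """Send one ICMP echo using the system ping (Linux flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable(hosts, probe=ping):
    """Return the hosts for which probe(host) is false, preserving order."""
    return [host for host in hosts if not probe(host)]


def report(down):
    """Build the alert text; an empty string means there is nothing to send."""
    if not down:
        return ""
    return "No ping reply from:\n" + "\n".join(down)
```

A script that prints `report(unreachable([...]))` only when the result is non-empty stays silent while everything is up, so cron only sends mail when a server stops answering.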

Now you need a syslog server! Syslog servers are really easy to set up; just google for setting up a syslog server on your favorite distro. What you want is for every server in your environment to log to that one syslog server. This is important so that all the logs are in one place _outside_ of the server that's having problems. Then you can review the logs while the dead server is being rebuilt, or kept offline for security reasons.

When that's set up, what will happen is: your server logs what it's doing to your syslog server; something goes wrong, and it writes it to syslog; the monitoring system alerts you that there was a problem; you go into your syslog server and read the logs to figure out what happened.
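The setup described above is done in each server's syslog daemon configuration. The same idea can also be sketched at the application level with Python's standard `logging.handlers.SysLogHandler`, which sends records to a remote syslog host over UDP by default. The helper name `make_remote_logger` and the message format are assumptions for illustration only.

```python
import logging
import logging.handlers


def make_remote_logger(name: str, host: str, port: int = 514) -> logging.Logger:
    """Return a logger that sends each record to host:port as a UDP syslog datagram."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler uses UDP unless a different socktype is given.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

With something like `make_remote_logger("myapp", "loghost").info("starting up")`, the entry ends up on the central box rather than only on the machine that might later hang. The names `myapp` and `loghost` are placeholders.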

It's what I do here at work.

I've got a pretty large environment that I take care of (dev), and a more important, but smaller environment (production). Right now I'm monitoring both of them with Zenoss, which is pretty cool, but honestly, for what I _really_ need, a pinging script would do fine. Dev and production each have their own syslog server, and pretty much everything logs to one server or the other. When something goes wrong, Zenoss sends me an email. At that point I hop on syslog and see if it can tell me what happened (most of the time, it can).

Sidenote: All HP JetDirect cards can do syslog! If you point a JD card to your syslog server, you have an easy way of knowing when something is going wrong with your printers. For instance, if I see more than a handful of paper-late jams on one printer in a day, I can be pretty sure the fuser is going.

--Kyle