LinuxQuestions.org

Root server crash: hunting down the cause of the crash (http://www.linuxquestions.org/questions/slackware-14/root-server-crash-hunting-down-the-cause-of-the-crash-4175476201/)

kikinovak 09-06-2013 12:32 PM

Root server crash: hunting down the cause of the crash
 
Hi,

I have a public server running Slackware64 14.0, with the following services:
  • DNS (Bind)
  • LAMP (Apache / PHP / MySQL)
  • IMAP mail (Postfix / Dovecot / Postgrey)
  • Streaming audio (Icecast / MPD)

The server is hosting a few static sites, a few dynamic CMS sites, our local school's management platform, and a small webradio.

It's not very powerful: a single-core processor (VIA Nano processor U2250 (1.6GHz Capable)) and 2 GB RAM.

I've carefully added services and users one by one, each time measuring resources using top, free and the like.

Everything sort of works fine, but nonetheless, every three or four days, the server becomes unresponsive, and the automatic monitoring service ends up rebooting it after a while. So every three or four days, I have something like 15 minutes of downtime, which is not good.

Now I've set up a good dozen local LAN servers for clients, running 24/7/365, without any major problems. None of these machines has ever given me a headache. But now I'm puzzled. I'd like to investigate the cause of these regular crashes of my public machine, but I don't quite know where even to begin.

Any suggestions?

YankeePride13 09-06-2013 12:43 PM

First place to look would be the system logs. Check to see what's going on in there at the time of the crash.

Alien Bob 09-06-2013 01:10 PM

My first guess would be a DDoS against your web server. Or just too many people interested in downloading your MLES.

Eric

kikinovak 09-06-2013 01:19 PM

Quote:

Originally Posted by YankeePride13 (Post 5023183)
First place to look would be the system logs. Check to see what's going on in there at the time of the crash.

I just spent some time leafing through everything in /var/log within +/- 10 minutes of the time of the crash, but there's nothing suspicious.

kikinovak 09-06-2013 01:21 PM

Quote:

Originally Posted by Alien Bob (Post 5023195)
My first guess would be a DDoS against your web server. Or just too many people interested in downloading your MLES.

Eric

My MLES/MLED/MLWS is hosted on another server, so this is not the cause.

Is there any way to know if a DDoS has happened? And if that is the case, are there any countermeasures?

NeoMetal 09-06-2013 03:03 PM

Check weblogs and bind logs for unusual activity. GoAccess might be handy for getting a quick weblog overview.
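A rough first pass over a web log can also be done with standard tools. The sketch below assumes an Apache access log in common or combined format (client address in field 1, timestamp in field 4); the function names `top_clients` and `requests_per_minute` are hypothetical helpers, not part of any package.

```shell
#!/bin/sh
# Hypothetical helpers for a quick look at an Apache access log.
# Assumes common/combined log format: client IP is field 1 and the
# timestamp is field 4, e.g. [06/Sep/2013:12:32:10

# Print the N client addresses with the most requests (default 10).
top_clients() {
    awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Print the N busiest minutes (default 10).
# substr($4, 2, 17) strips the leading "[" and the seconds.
requests_per_minute() {
    awk '{ print substr($4, 2, 17) }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

For example, `top_clients /var/log/httpd/access_log 20` (the log path is an assumption and depends on the Apache configuration). A handful of addresses or a few minutes that dwarf everything else, just before the server stopped responding, would be worth a closer look.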

If you can configure your monitoring to restart the affected services rather than the whole machine, then, assuming that is enough to recover, you might be able to mitigate the downtime at least while you narrow things down.
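As a minimal sketch of that idea, assuming the monitoring can run a local script: `check_or_restart` below is a hypothetical helper that runs a check command and, only if the check fails, logs a line and runs a restart command. The check and restart commands shown in the usage note are assumptions to adapt.

```shell
#!/bin/sh
# Hypothetical helper: restart one service only when its check fails.
#   $1 = label for the log line
#   $2 = check command (exit status 0 means healthy)
#   $3 = restart command
check_or_restart() {
    name="$1"
    check="$2"
    restart="$3"
    if sh -c "$check" >/dev/null 2>&1; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') $name check failed, restarting"
    sh -c "$restart"
}
```

A possible call for the web server would be `check_or_restart httpd 'curl -fsS --max-time 10 -o /dev/null http://localhost/' '/etc/rc.d/rc.httpd restart'`, run from cron every few minutes with its output appended to a log file. The timestamps of the restarts then show which service fails first.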

TracyTiger 09-06-2013 03:27 PM

Disk Space
 
On a less sophisticated level you may want to see if it's running out of disk space. The somewhat regular failure suggests a memory leak or full disk (temporary files) is worth investigating.

Once a needed disk partition is full all sorts of symptoms can appear. In a previous life managing lots of UNIX servers this problem used to bite me about once a year. I eventually learned to check for resource exhaustion first.

kikinovak 09-06-2013 03:46 PM

Quote:

Originally Posted by TracyTiger (Post 5023263)
On a less sophisticated level you may want to see if it's running out of disk space. The somewhat regular failure suggests a memory leak or full disk (temporary files) is worth investigating.

Once a needed disk partition is full all sorts of symptoms can appear. In a previous life managing lots of UNIX servers this problem used to bite me about once a year. I eventually learned to check for resource exhaustion first.

No, that's not it.

Code:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       145G  7.6G  130G   6% /
/dev/sda1        92M   34M   53M  40% /boot
tmpfs           986M     0  986M   0% /dev/shm


TracyTiger 09-06-2013 04:16 PM

Although it's just one of a million things that can cause your system to crash, to check for resource exhaustion you need to know what the state of the resource is just before it crashes, not when the system is running without problems.

In the past I set up cron jobs to regularly log the suspect areas to look for patterns. In some cases I took snapshots of resource usage every few seconds (keeping only the last few minutes' worth) as the system went from healthy to broken in less than a minute.
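A minimal sketch of such a snapshot job follows. The log path, the line limit and the function name `snapshot` are assumptions for illustration; the commands it records (uptime, free, df, ps) are the usual suspects for memory and disk exhaustion.

```shell
#!/bin/sh
# Append one timestamped resource snapshot to a log file and keep
# only the most recent lines, so the log cannot fill the disk itself.
SNAPLOG="${SNAPLOG:-/var/log/resource-snapshot.log}"   # assumed path
KEEP_LINES="${KEEP_LINES:-5000}"                       # assumed limit

snapshot() {
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        uptime                            # load averages
        free -m                           # RAM and swap, in MB
        df -h                             # disk usage per filesystem
        ps aux --sort=-rss | head -n 11   # header + 10 largest processes by memory
    } >> "$SNAPLOG" 2>&1
    # Trim the log to the last KEEP_LINES lines.
    tail -n "$KEEP_LINES" "$SNAPLOG" > "$SNAPLOG.tmp" && mv "$SNAPLOG.tmp" "$SNAPLOG"
}

# To use from cron, end the script with a call to: snapshot
```

Saved as, say, /usr/local/sbin/snapshot.sh with the final call enabled, a crontab entry of `* * * * * /usr/local/sbin/snapshot.sh` records one snapshot per minute. For snapshots every few seconds, a loop such as `while true; do snapshot; sleep 5; done` does the same outside cron. After the next crash, the last entries before the reboot show whether memory, swap or disk was running out.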

But of course don't waste time on this if it's not the likely problem.

Regarding a network based problem ...
You're not new to the game so you probably already know that DoS/DDoS is a general term that can take many forms. A basic firewall using netfilter (iptables) can eliminate the basic ones by limiting packets in different ways and by preventing table exhaustion and incomplete sessions. Iptables has helped me narrow down and find network based attacks on a couple of occasions.
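As a rough sketch of what such limiting can look like for the web server, the rules below use the standard conntrack, connlimit and limit matches. The port, the thresholds and the function name `apply_limits` are illustrative assumptions, not tuned values. The script prints the commands by default; set IPT=iptables and run it as root to apply them.

```shell
#!/bin/sh
# Sketch of basic netfilter rate limiting for HTTP.
# Dry run by default: the commands are printed, not executed.
IPT="${IPT:-echo iptables}"

apply_limits() {
    # Let packets of already established connections through.
    $IPT -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    # Drop packets that do not belong to any valid connection.
    $IPT -A INPUT -m conntrack --ctstate INVALID -j DROP
    # Refuse more than 30 simultaneous HTTP connections per source address.
    $IPT -A INPUT -p tcp --dport 80 --syn -m connlimit --connlimit-above 30 -j DROP
    # Accept new HTTP connections up to 50 per second (burst of 100).
    $IPT -A INPUT -p tcp --dport 80 --syn -m limit --limit 50/second --limit-burst 100 -j ACCEPT
    # Log a sample of the excess, then drop it.
    $IPT -A INPUT -p tcp --dport 80 --syn -m limit --limit 5/minute -j LOG --log-prefix "HTTP-FLOOD: "
    $IPT -A INPUT -p tcp --dport 80 --syn -j DROP
}
```

Because the rules are appended with -A, they only take effect if no earlier rule in the INPUT chain already accepts the traffic. Lines tagged HTTP-FLOOD in the kernel log around the time of a hang would point towards a network-based cause.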

I'm probably not mentioning anything that you don't already know. My understanding is basic so others can probably suggest newer and more efficient tools to protect your network and discover problems.

Many of us have spent days trying to solve a software problem that turns out to be an intermittent hardware failure. Don't forget that possibility.

allend 09-06-2013 08:42 PM

What is the form factor of the server? That CPU is associated with notebook designs. Could this be an overheating problem?

ReaperX7 09-06-2013 10:18 PM

Crashes can be one of many things especially if it's software related, but if it's doing it every 3-4 days, then most likely it's something to do with a service you're running that is generating disk usage issues, or a piece of hardware that is slowly failing.

A few questions I might ask:

1. How old is the hardware? (Each component's age would help.)

2. You said you run an IMAP mail service, correct? How is the disk usage for the mail system, and how long does it take for the services to generate more than 20GB of disk usage including log files?

3. What temperature does your system idle at? Less or more than 50 degrees Celsius?

4. Do you run a software SPI (Stateful Packet Inspection) firewall like iptables or a hardware firewall like a Barracuda Networks brand firewall?

5. One last question: have you ever done a stress test in which a single service runs by itself for at least 5 days, to check for instabilities, before adding other services?

My educated guess points towards the mail service allocating too much disk space for itself and then bringing down the server by overtaxing the hard disk with temporary files. If necessary, could you allocate a separate server just for mail services alone?

vdemuth 09-07-2013 01:02 AM

Well,

It seems that you are being guided toward this being a software problem, when it seems pretty obvious (to me at least) that it's a hardware problem. I would suspect that the electrolytic caps on the motherboard are in the early stages of failure and over varying and unpredictable periods of time, which would generally seem to be unrelated to anything the server might be doing, cause a reboot at processor level.
I would expect that the times between reboots will start to get closer together over the next few months until eventually it just won't restart. I see this all the time where I work. On average, out of the 4000+ or so servers we use, at any one time around 10% exhibit this problem and it's always hardware.

truthfatal 09-07-2013 01:18 AM

Template matching works 100% of the time, 80% of the time! Something like this could easily be either hardware or software. If you have the redundancy/capability to deal with downtime, it might be worth taking the machine apart and doing some component isolation if the logs don't prove helpful.
I don't see any mention of using fsck to check the health of your partitions, or something like memtest86+ to check out your RAM. (If you have specific, licensed diagnostic tools for your hardware, that would be a plus.) I'm not great at troubleshooting *nix logfiles, so that's why I'm talking hardware, but still, if the problem is software and can be easily fixed, I'd definitely want to determine that first, especially since a software fix is typically less expensive (in my experience).

kikinovak 09-07-2013 02:10 AM

Thank you very much to everybody for your precious input.

The server itself is not a machine I bought, it's an el-cheapo root server rental offer from the French hosting provider Online (10 euros per month with unlimited bandwidth). It's a single core processor, 2 GB RAM and 160 GB disk space (upgraded to 500 GB on recent offers). The machine comes with either Debian, Ubuntu LTS or CentOS preinstalled, but I managed to install Slackware on it using the Live Rescue session.

I think the right thing to do here would be a simple upgrade to real server hardware. I cringe at the thought of migrating all my freshly installed mail accounts, CMS sites and everything, but I think this would be the least of all evils.

kikinovak 09-07-2013 05:43 AM

Quote:

Originally Posted by kikinovak (Post 5023497)
I think the right thing to do here would be a simple upgrade to real server hardware. I cringe at the thought of migrating all my freshly installed mail accounts, CMS sites and everything, but I think this would be the least of all evils.

OK, just ordered a new server with a big fat hardware upgrade. Costs about thrice as much, but that's the price of sound sleep. In the meantime I'll mark this thread as SOLVED.


All times are GMT -5.