Quote:
Originally Posted by trickykid
A tape drive being down isn't an outage and I'm not talking just backups. Backups being down aren't considered downtime in a production environment, well, at least at no place I've worked. I guess you work in a small shop or you just run perfect hardware, network, proprietary application developed in house, etc and have the perfect co-located facility and or never have maintenance windows that go wrong with everything else. That's what I'm talking about, not just backups, no such thing as downtimes with those, especially when they only run at periodic times and not 24/7.
Oh well, I wish I lived in your perfect world if you never had hardware failure or network, power, etc, with anything else.
I was only giving the backup server as an example. If I hadn't been running Amanda (and had it configured with redundancy and fallbacks), backups probably would have been down for two days, or I would have been scrambling to deal with the backup setup on top of trying to fix the tape drive. So, would a backup outage not be an outage when a user asks for a lost file and you have to tell them you don't have a backup? Personally, I'd call that an unacceptable outage.
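For anyone wondering what that kind of fallback looks like, the key piece in Amanda is the holding disk: dumps land on disk even when the tape drive is down, and you flush them to tape later with amflush. A minimal sketch of the amanda.conf stanza (the path and sizes here are made up for illustration):

    holdingdisk hd1 {
        comment "buffer dumps here when the tape drive is unavailable"
        directory "/dumps/amanda"    # hypothetical path
        use 80 Gb                    # illustrative size
        chunksize 1 Gb
    }

With that in place, a dead drive just means dumps pile up on the holding disk until you run amflush; the nightly backups themselves never stop.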
I suppose you could call my environment relatively small. I'm responsible for a couple of significant departments at a large university: about 20 Sun servers and roughly 10 OpenBSD boxes that serve as filtering bridges and routers. I build redundancy in, I watch logs, and I take a proactive approach to potential failures. I've replaced a couple of drives (in the 8-year-old range) before they failed, when I saw errors creeping up in the logs. SEC (simple event correlator --
http://www.estpak.ee/~risto/sec/) is another simple, elegant tool that can help with this, along with syslog-ng --
http://www.balabit.com/network-secur...ogging-system/
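To give a concrete flavor, here's the sort of SEC rule I mean for catching creeping disk errors before a drive dies outright. The pattern and thresholds are only illustrative; you'd match whatever your kernel actually logs:

    # warn root when a disk logs 5+ errors within an hour
    type=SingleWithThreshold
    ptype=RegExp
    pattern=(sd\d+): .*(medium error|I/O error)
    desc=disk errors accumulating on $1
    action=pipe '%s' mail -s 'disk warning' root
    window=3600
    thresh=5

Feed it your syslog stream (syslog-ng makes it easy to tee everything to a file or FIFO that SEC watches) and you get the kind of early warning that let me swap those drives before they failed.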
On a couple of occasions, I've managed to switch to a new server almost invisibly, with no hiccup in the transition -- just watching DNS updates propagate and user requests move from the old server to the new one (changing TTLs to something short the day before and back to normal the day after). The users might never have noticed, except that performance improved and I told them we had a new server.
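The TTL dance is simple if you've never done it. Hypothetical BIND-style zone-file lines (the 192.0.2.x addresses are placeholders):

    ; day before the cutover: shorten the TTL so caches expire quickly
    www    300    IN  A  192.0.2.10   ; old server, 5-minute TTL
    ; cutover: same short TTL, new address
    www    300    IN  A  192.0.2.20   ; new server
    ; day after: restore the normal TTL
    www    86400  IN  A  192.0.2.20

Bump the zone serial and reload at each step; with a 5-minute TTL, the world follows you to the new address within minutes of the change.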
As for what I run, you could almost say everything, and almost 100% open source. We have our own DNS server, NAT, mail servers, web servers, gene sequencing, databases, wikis, etc. The mail and gene-sequencing setups each involve a dozen or so software applications. There are a couple thousand accounts on the primary server, and the mail logs accumulate about half a million lines a day. Amanda covers my butt on all of that. I'm not responsible for people's desktops or lab computers; there are a few hundred of those, as well as some Silicon Graphics behemoths doing complex modeling in a lab.
We did have some fun with power a couple of years ago. The campus was rebuilding the line to our cluster of buildings, and an outside contractor supplied a tractor-trailer-sized generator to carry the load while they did the work. Something was misconfigured, and the generator pushed 200+ volts down the 110 V lines throughout the entire building for several minutes before they caught it. I was called in at about 4:00 am. All my servers were on SmartUPSes; the overvoltage made them seriously overheat, and their smart circuitry tripped an internal breaker that cut them off from the outside connection. I didn't lose a single server or UPS. Almost as soon as the power situation was straightened out, I had my servers back up.

Then I was running around helping other people who had cheaper UPSes in front of expensive equipment. Their UPSes fried and let the equipment they were "protecting" fry as well, and when the building circuit breakers were switched back on, those fried UPSes would trip them again; we had to find them all and unplug them. The outside contractor's insurance company was out a huge amount of money, and a few people got fired.
I didn't really call that an outage on my part, because it was caused by outside forces beyond my control -- my equipment was just protecting itself, and it was back up as soon as the power was sorted out. That event actually provided the push I needed to convince the administrators to budget for a serious tape library to update our backup system.