Is it possible to achieve HA on a MySQL replicated setup?

Swakoo · 11-05-2006, 09:38 PM

Currently my database server running mysql is on RAID 1, and I have another server querying it to do a daily backup. So that's the only protection I have.

I am looking into doing a more active backup, so I'm exploring replication. But I realise replication doesn't do auto failover... or so I read.

Is there a way to achieve it? That is, an auto-failover, load-balanced environment that ensures HA?

Can anyone please advise?

thanks!

mcrbids · 11-06-2006, 12:07 AM

I've run database-driven stuff for years. HA is over-priced and over-rated in all but the most extreme cases. A cheap-o 1U PIII from eBay is plenty powerful enough for a surprising number of cases, and will usually deliver 3-4 nines (99.9% to 99.99%) on a shoestring.

Once you get ego out of the way, you'd be surprised how rarely anybody actually needs more than this. At 99.9%, you have about 8.8 hours of downtime per year; at 99.99%, about 53 minutes. Be honest - what would happen if your system was down 1 business day every year or so?

To go HA and get to 5 nines (99.999%) means about 5 minutes of downtime per YEAR.
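For reference, here is a quick sketch of the arithmetic behind the "nines". It assumes a 365-day year and counts all downtime, planned or not; the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours of downtime allowed per year at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Print the yearly downtime budget for 3, 4 and 5 nines.
for pct in (99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}%: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

That works out to roughly 8.8 hours at three nines, 53 minutes at four nines, and 5 minutes at five nines.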

And the price is constant vigilance. If hiring a qualified technician full-time in order to shave off that 1 or 2 business days per YEAR is out of the question, it's unlikely that you need to worry about it.

Having any system that "cuts over" automatically in a failure is a tremendous pain in the arse. A techie at the datacenter fat-fingers the switch on a power strip, and the 5 minutes of downtime on your database server morphs into a sleepless night rebuilding your primary database on the server and resetting all your logic servers to use the primary DB server again.

Yuck.

My advice? Write a script that backs up your database every hour and copies it to a remote location with scp or rsync over ssh. If you want, you can have a "hot" backup cheezo PIII that loads the database hourly as well, so that if you have to cut over, you change a setting on your web servers and you're done.
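A minimal sketch of what such an hourly backup script might look like. The dump directory, remote target, and host name are placeholders, not anything from this thread, and it assumes mysqldump can read its credentials from an option file such as ~/.my.cnf and that ssh keys are already set up for the remote host.

```python
import subprocess
from datetime import datetime, timezone

# Placeholder locations: adjust to your own setup.
DUMP_DIR = "/var/backups/mysql"
REMOTE = "backup@backup.example.com:/srv/mysql-dumps/"


def dump_path(now):
    """Timestamped dump file name, one per run."""
    return f"{DUMP_DIR}/dump-{now:%Y%m%d-%H%M}.sql"


def dump_command():
    # --single-transaction takes a consistent snapshot of InnoDB tables
    # without locking them; MyISAM tables do not get that guarantee.
    return ["mysqldump", "--single-transaction", "--all-databases"]


def rsync_command(path):
    # -a preserves file attributes, -z compresses, -e ssh tunnels over ssh.
    return ["rsync", "-az", "-e", "ssh", path, REMOTE]


def run_backup():
    """Dump all databases to a local file, then copy it offsite."""
    path = dump_path(datetime.now(timezone.utc))
    with open(path, "wb") as out:
        subprocess.run(dump_command(), stdout=out, check=True)
    subprocess.run(rsync_command(path), check=True)
    return path
```

An hourly cron job could then call run_backup(). Because check=True raises on a failed dump or copy, a broken backup shows up as an error instead of passing silently.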

Swakoo · 11-06-2006, 01:30 AM

Hi, thanks for your pointers. Noted

Currently, I am already running a spare machine (on RAID 5 though, hehe) which pulls the database from the live production server every day (5am) using rsync over ssh. So that gives me a 24-hour-old backup at best (should my RAID 1 fail). I can't do it hourly as our DB is busy most of the time.

I'm currently studying HA and realise I can't do it with just any distro. I need a cluster-capable distro, like Red Hat Cluster Suite, to achieve my HA and LB... I suppose that's what you meant by the 'feasibility' factor.

Thus I am looking at ways to back up my DB on a 'live' basis, which leads me to exploring replication. But while that gives me an 'almost' by-the-minute backup, it doesn't fail over automatically, and hence, as you mention, it probably only caters to that 8 hours of downtime per year. We were actually down for more than 8 hours this year because the sheer amount of traffic and the database itself are huge.. but well..

So.. do you think I should just rely on replication, and manually point to the 'slave' machine should the master fail... or...? Because initially I thought a 'master-slave' replication setup meant that the slave would kick in if the master goes down. Apparently not.

But as I said, I've noted your points. Very logical indeed.

mcrbids · 11-08-2006, 05:42 PM

Like I've said - replication mostly REDUCES uptime rather than improving it, because there are so many things that can go wrong. I've seen more downtime caused by replication errors, partial switchover failures, and network burps than I've ever seen caused by even catastrophic failure (e.g. a motherboard catching fire).

If you are really sure you want to try HA, my suggestion would be to go ahead and replicate to your backup host, but don't actually use your backup host. If your primary fails, reconfigure your backup host as the primary, and manually change your logic/web servers to use the backup host.

I'd suggest a set of scripts (I'd use SSH with RSA keys so it's automatic) that do this all in one fell swoop, to switch from production to failover, and back again. Test them at least monthly, at night or something. Automate the test, as well, so that it's easily enough done that you might actually do it on a regular basis.
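A rough sketch of the web-server half of such a switchover script. Everything named here is hypothetical: the host names, the /etc/myapp/db.conf path, and the db_host= setting stand in for whatever your application actually uses. It also assumes GNU sed on the web servers and key-based ssh, and it does not promote the replica itself, which would be a separate step.

```python
import subprocess

# Hypothetical hosts and config path: replace with your own.
WEB_SERVERS = ["web1.example.com", "web2.example.com"]
CONFIG_FILE = "/etc/myapp/db.conf"
DB_HOSTS = {"production": "db1.example.com", "failover": "db2.example.com"}


def switch_command(web_server, target):
    """Build the ssh command that repoints one web server at a DB host."""
    db_host = DB_HOSTS[target]  # KeyError on an unknown target is intentional
    remote = f"sed -i 's/^db_host=.*/db_host={db_host}/' {CONFIG_FILE}"
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", web_server, remote]


def switch_all(target, dry_run=True):
    """Repoint every web server; with dry_run, only print the commands."""
    commands = [switch_command(server, target) for server in WEB_SERVERS]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling switch_all("failover") prints what would run; switch_all("failover", dry_run=False) applies it, and switch_all("production", dry_run=False) switches back. Defaulting to a dry run keeps the monthly test cheap and makes it harder to cut over by accident.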

HA is non-trivial, and I've never seen the business case where it was actually warranted. If you can't justify a full-time DBA position to make sure that database is 100% 24x7, you probably should be looking at having a hot backup system and manual failover, manually propagated every hour or so, with a promise of 1-2 hour turnaround during business hours in the case of a failure.

Make sure to have backups offsite. I use rsync over ssh and a set of scripts to do this - it works rather well. http://www.effortlessis.com/backupbuddy

Swakoo · 11-15-2006, 03:34 AM

I see your point on relying on 'manual' forces for that last 1%.

Will be playing with it further.. will come back with questions if I have any.

thanks people!