How do I test if mysql dump succeeds while piping it to gzip

PastulioLive · 07-22-2016, 08:03 AM

Hey everybody,

Small question that may have a very simple answer, but google has failed me (I'm probably not using the correct search terms).

I am writing a script that dumps a mysql database and pipes it into gzip.
The problem is that I want to test if the dump was successfull, but the pipe gzip command always returns succesfull.

Here is the test command I'm using:

Code:

if [ `ssh root@${SERVER} "mysqldump --all-databases" 2>> ${LOG_FILE} | gzip > ${DESTINATION_PATH}/mysql/mysql_$(date +%F).gz` ]
then
  echo "ERROR: Database dump failed!" | tee -a ${LOG_FILE}
  ((ERRORS++))
fi

I've switched the "${SERVER}" variable to a server that does not exist.
The log file does not produce any error and the gzip file contains the letter "W".

Does anybody have any advise on how to best solve this.
(I also could try checking the filesize after the dump, but then I still don't know if the database dump is complete)

Thanks!

gda · 07-22-2016, 09:47 AM

You could use the internal bash array variable $PIPESTATUS which contains the exit status(es) of last executed foreground pipe:

Code:

ssh root@${SERVER} "mysqldump --all-databases" 2>> ${LOG_FILE} | gzip > ${DESTINATION_PATH}/mysql/mysql_$(date +%F).gz
if [ ${PIPESTATUS[0]} -ne 0 ]; then
  echo "ERROR: Database dump failed!" | tee -a ${LOG_FILE}
  ((ERRORS++))
fi

Or even better:

Code:

ssh root@${SERVER} "mysqldump --all-databases" 2>> ${LOG_FILE} | gzip > ${DESTINATION_PATH}/mysql/mysql_$(date +%F).gz 2>> ${LOG_FILE}
buffer=( ${PIPESTATUS[*]} )
if [ ${buffer[0]} -ne 0 ] || [ ${buffer[1]} -ne 0 ]; then
  echo "ERROR: Database dump failed!" | tee -a ${LOG_FILE}
  ((ERRORS++))
fi

rknichols · 07-22-2016, 12:41 PM

You can also use the pipefail option ("set -o pipefail") to get a failure return from a pipe if any command in the pipe fails.

TB0ne · 07-22-2016, 01:10 PM

Personally...I trust none of those methods, but I'm old-school. The only way I'm EVER happy with a backup is if I have tested it, and successfully restored it. I've seen too many gremlins over the years, where you get ZERO error messages, but can't seem to get things working again because <reasons>.

I'd not trust it unless I was able to import that dump file somewhere else, pull up the database, and verify it worked.

gda · 07-22-2016, 01:43 PM

You are completely right: zero returned value does not mean that everything went fine necessarily (even if according to my experience this is not usual). At the same time it is true that there are no many other ways to check for errors any script programmatically. The checks you mentioned are typically done manually...

The use of pipefail is also valid but has the disadvantage to not provide you information about which command in the pipe failed.

TB0ne · 07-22-2016, 01:52 PM

Quote:

Originally Posted by gda

You are completely right: zero returned value does not mean that everything went fine necessarily (even if according to my experience this is not usual). At the same time it is true that there are no many other ways to check for errors any script programmatically. The checks you mentioned are typically done manually...

The use of pipefail is also valid but has the disadvantage to not provide you information about which command in the pipe failed.

Absolutely...you SHOULD get your script to check things, and take every advantage you can get. But there is NEVER a substitute for doing a real test.

PastulioLive · 07-25-2016, 02:07 AM

Thanks for the replies guys!

I am also very skeptical when it comes to backups, and I realise that the only way to truely test it, is to restore the backup.
We unfortunately don't have the time or resources to try a restore on a regular bases (even though we really should).
So I am just looking to catch quick and obvious errors like unable to connect to host, disk full, etc... that will notify me if the backup fails.

I'm hoping that in the future, we will have more time to get it done properly, but I imagine that is only done by automating a backup/restore operation in a test-lab?

I think the pipefail should do for now (since I will want to check the backup if any of the commands in the pipe fail)

Thanks!

TB0ne · 07-25-2016, 08:59 AM

Quote:

Originally Posted by PastulioLive

Thanks for the replies guys!

I am also very skeptical when it comes to backups, and I realise that the only way to truely test it, is to restore the backup. We unfortunately don't have the time or resources to try a restore on a regular bases (even though we really should). So I am just looking to catch quick and obvious errors like unable to connect to host, disk full, etc... that will notify me if the backup fails.

I'm hoping that in the future, we will have more time to get it done properly, but I imagine that is only done by automating a backup/restore operation in a test-lab? I think the pipefail should do for now (since I will want to check the backup if any of the commands in the pipe fail)

No, the pipefail is a great thing to do, and you don't have to test EVERY backup...but if you take a little time and do one every month or so, you'll at least feel a bit better. I wouldn't automate the restore, but that's just me. In the event of a failure, you're going to probably do the restore manually. If it was ME, I would:

Document the ENTIRE PROCESS, with pictures and screen-shots. Put this into PDF form, in a knowledgebase online (if you have one), and in a hard-copy binder somewhere. You want anyone with at least one good eye and the ability to read, to be able to restore system functionality. Single points of failure aren't ever good...even if they're a PERSON. If you're the only one who knows how to do a restore, and it dies when you're on vacation...guess what?
Run some basic queries on the database..pull out records from random tables, and run a stored-procedure or two (if you have them), and make sure you get sane data.
Ask the end-user to verify this data.

I'd only do this periodically...chances are, if you have one solid test, you can do spot-checks from then on.

rknichols · 07-26-2016, 05:16 AM

At some point with every new system or new backup scheme I do a "bare metal restore" test. I take the disk drive out of the machine, replace it with an empty drive of similar size, and see what it takes to get the system up and running by booting from a live cd and restoring from my backups. There are typically some "interesting" moments along the way. (Just how was that disk partitioned? What was the LVM structure? What options were used in making those filesystems? How can I get the same UUIDs back so I don't have to edit configuration files? How many of those files that I don't back up are more annoying to lose than I expected? ...)

TB0ne · 07-26-2016, 07:26 AM

Quote:

Originally Posted by rknichols

At some point with every new system or new backup scheme I do a "bare metal restore" test. I take the disk drive out of the machine, replace it with an empty drive of similar size, and see what it takes to get the system up and running by booting from a live cd and restoring from my backups. There are typically some "interesting" moments along the way. (Just how was that disk partitioned? What was the LVM structure? What options were used in making those filesystems? How can I get the same UUIDs back so I don't have to edit configuration files? How many of those files that I don't back up are more annoying to lose than I expected? ...)

Yes...all of this, with copious documentation. That's the ONLY way you're going to know what's up, and what it takes to get things back, truly.

I worked somewhere years ago, where their 'disaster recovery' tests consisted of:

Taking the senior team members who installed everything to start with off site
Having THEM rebuild everything
Pat themselves on the back because stuff worked.

We pointed out each time, that in the event of a DISASTER (the "D" in "DR Test"), the senior people probably would NOT be available because...you know....disaster? They'd never let us build a documentation set or test anything in a lab at the office either, because we 'didn't have time'. Amazingly, when folks began quitting, those 'tests' began failing.

PastulioLive · 07-27-2016, 08:00 AM

Quote:

Originally Posted by TB0ne

No, the pipefail is a great thing to do, and you don't have to test EVERY backup...but if you take a little time and do one every month or so, you'll at least feel a bit better. I wouldn't automate the restore, but that's just me. In the event of a failure, you're going to probably do the restore manually. If it was ME, I would:

Document the ENTIRE PROCESS, with pictures and screen-shots. Put this into PDF form, in a knowledgebase online (if you have one), and in a hard-copy binder somewhere. You want anyone with at least one good eye and the ability to read, to be able to restore system functionality. Single points of failure aren't ever good...even if they're a PERSON. If you're the only one who knows how to do a restore, and it dies when you're on vacation...guess what?
Run some basic queries on the database..pull out records from random tables, and run a stored-procedure or two (if you have them), and make sure you get sane data.
Ask the end-user to verify this data.

I'd only do this periodically...chances are, if you have one solid test, you can do spot-checks from then on.

Documentation:
I have a Wiki (FOSWIKI) where I document the process of setting up the service (the ones I did not inherit at least).
It includes everything to setup the service from scratch (Linux, packages, dependencies, selinux, firewall, resources (nfs), etc...)
It also includes people who are responsible for the service (and what parts) and a list of common tasks/issues.
It also has common tasks like forgotten passwords or locking yourself out of the application.
The restore procedure will also go here.
I have not gone so far as to put it in a binder (preferably off-site), but I understand that I should.

Basic Queries:
I find this to be a bit tricky. As a sysadmin, you usually don't really use the service, your end users do.
The application I'm trying to back up here is Phabricator for instance.
So I guess I could try to find out how a certain SVN commit is stored in the MySQL database and retrieve it.
Doing this for every service we run seems like an awefull lot of work though.
Also, the procedure would be entirely different for all of the different services we offer.
That sounds like an awefull lot of work, but I guess that's the only way to be sure.
How do you guys feel pretty confident that you know if a restore is properly working?

Aks users to verify the data
+1 to this :-)

Quote:

Originally Posted by TB0ne

Yes...all of this, with copious documentation. That's the ONLY way you're going to know what's up, and what it takes to get things back, truly.

I worked somewhere years ago, where their 'disaster recovery' tests consisted of:

Taking the senior team members who installed everything to start with off site
Having THEM rebuild everything
Pat themselves on the back because stuff worked.

We pointed out each time, that in the event of a DISASTER (the "D" in "DR Test"), the senior people probably would NOT be available because...you know....disaster? They'd never let us build a documentation set or test anything in a lab at the office either, because we 'didn't have time'. Amazingly, when folks began quitting, those 'tests' began failing.

Time is a big issue for me too! I work in an SMO and we have about 100 people working here, but we only have 2 Sysadmins.
I am the "Linux-Guy / IT support guy" (helping our 100 emplyees with random Windows/Application/Network/etc... troubleshooting)
I also have to setup new PC's and take care of the networking backend (cisco switches, firewall, dmz, vpn, patching, port forwards, etc...)
My collegue is the Senior Sysadmin, but he basically only has time to take care of his windows environment, and a lot of administrative stuff (licensing and all kinds of stuff).
(All of this doesn't even mention the time we should be taking read up on security, documentation, etc...)

I just can't imagine being able to do all of this in a 40 hour work-week. How do you guys do it!

Also, our company uses jenkins for continuous integration, so downtime is also not something we can really afford, but I don't have the time to look into/setup high availability.

TB0ne · 07-27-2016, 08:13 AM

Quote:

Originally Posted by PastulioLive

Documentation:
I have a Wiki (FOSWIKI) where I document the process of setting up the service (the ones I did not inherit at least).
It includes everything to setup the service from scratch (Linux, packages, dependencies, selinux, firewall, resources (nfs), etc...)
It also includes people who are responsible for the service (and what parts) and a list of common tasks/issues.
It also has common tasks like forgotten passwords or locking yourself out of the application.
The restore procedure will also go here. I have not gone so far as to put it in a binder (preferably off-site), but I understand that I should.

I use a binder somewhere, because if things *REALLY* go wrong, you at least have that. Web server problems? Network problems? Can't access that site?? The wiki is useless at that point, and you have no idea what's in it. Hard copies NEVER go down, unless you toss them in the shredder.

Quote:

Basic Queries:
I find this to be a bit tricky. As a sysadmin, you usually don't really use the service, your end users do. The application I'm trying to back up here is Phabricator for instance.
So I guess I could try to find out how a certain SVN commit is stored in the MySQL database and retrieve it. Doing this for every service we run seems like an awefull lot of work though. Also, the procedure would be entirely different for all of the different services we offer. That sounds like an awefull lot of work, but I guess that's the only way to be sure. How do you guys feel pretty confident that you know if a restore is properly working?

By doing:

Quote:

Aks users to verify the data
+1 to this :-)

...this. Don't overcomplicate it; if you know you're using SVN/MySQL, then focus on getting those two SERVICES back up. From there, grab an end-user and ask them for five minutes of their time, and get THEM to pull up a record. They know how to use their application probably better than you do.

Quote:

Time is a big issue for me too! I work in an SMO and we have about 100 people working here, but we only have 2 Sysadmins. I am the "Linux-Guy / IT support guy" (helping our 100 emplyees with random Windows/Application/Network/etc... troubleshooting) I also have to setup new PC's and take care of the networking backend (cisco switches, firewall, dmz, vpn, patching, port forwards, etc...) My collegue is the Senior Sysadmin, but he basically only has time to take care of his windows environment, and a lot of administrative stuff (licensing and all kinds of stuff). (All of this doesn't even mention the time we should be taking read up on security, documentation, etc...)

Again, don't overcomplicate it. This is a marathon, not a sprint. Set up a machine somewhere; shove a DVD in, and let it install while you're doing something else. Now you've got a test-bed. Copy your mysqldump to it at some point...take two minutes and SSH into it, and start a copy running if it's huge, and the next day/week/whenever, restore that dump file, and bring up the DB. Look at it. Repeat with any other services you want to test. Spending 15-30 minutes a day doing small things adds up BIG on your testing/documentation procedures.

Quote:

I just can't imagine being able to do all of this in a 40 hour work-week. How do you guys do it!

Simple...most of us don't work just 40 hours a week, and automate the crap out of things. If I had SVN/SQL tests, I'd sure cron something to pull a backup file onto my test system, AND cron a restore of it too, so my test rig would 'just be there' whenever I wanted it. Let the machines work for you, don't work for them.

Quote:

Also, our company uses jenkins for continuous integration, so downtime is also not something we can really afford, but I don't have the time to look into/setup high availability.

This is something to beat your boss over the head with, as this makes no sense. Either you CAN afford downtime (because if you can't test/setup/look into things you are GOING TO HAVE IT), or you cannot (in which case, they need to pony up some $$$ for temps/another employee to take the load off you two). There's no other way.

PastulioLive · 07-27-2016, 08:34 AM

Thanks TB0ne, you are giving me some great advice!

I've been working at this company for 5 years now, but the consultant who ran the Linux and Network left a year or two back and I'm just trying to catch up to everything that this company is running, we provide a ton of services (like 11 bugzilla instances with slightly different configurations for all the subcontractors/clients we are working with).
Besides that, we have a lot of in-house applications that need to be maintained for our development/production/testing process to operate.
Theres also a lot of different things like, svn servers, git servers, jenkins, phabricator, storage/file/ftp servers and shares, webservers, wikis...
The list goes on, as I'm sure it does in a lot of other companies.

I didn't used to work 40 hour weeks either, but since I'm underpaid, I try to keep my hours after work to a minimum (Which, at the end of the day, is just shooting myself in the foot).
I have a new negotiation soon about my pay, so I hope my motivation will return.
(I love the work I do, but if you have to work another job to have a decent standard of living, and have about a 2-3 hour total commute, you are not going be working on backups/scripts after hours)

I just feel like I need an extra me to get things done and was wondering if the problem is me or the workload.
The 15 min/day working on tests may indeed get me a long way there, so I'll definately try it instead of looking at it like a "Project" that I need to take the day for.

I'm sorry for going so far off topic, but It's interesting to get your point of view. These things may be better suited somewhere else.