My office maintains several remote servers for our clients. Our practice has been to keep a few spare emergency backup servers on-site with us so that they can pick up replication from our PostgreSQL database - the idea being that the remote site won't have to be down for an extended length of time while the DB rebuilds on-site.
The database we're copying comes from our main server, which runs SuSE 9.3 with PostgreSQL 8.1.3 installed. The servers we're copying to are SuSE 11 boxes with PostgreSQL 8.4.4 on them. The dump on the main box is done with pg_dump and written out as a pair of SQL files - one for the schema, and one for the data. These are then run as scripts from the command line against an empty copy of the database on each backup box.
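For reference, the process is roughly equivalent to the following (the database and file names here are placeholders for illustration, not our actual ones):
Code:
# On the main server (PostgreSQL 8.1.3) - dump schema and data separately.
# "ourdb" and the file names are placeholders.
pg_dump -s ourdb > ourdb_schema.sql
pg_dump -a ourdb > ourdb_data.sql

# On a backup box (PostgreSQL 8.4.4) - run both against an empty database.
createdb ourdb
psql -d ourdb -f ourdb_schema.sql
psql -d ourdb -f ourdb_data.sql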
It has normally taken a while to load the database, but recently things have gone seriously screwy. While I admit the database has grown somewhat large (the data SQL script shows as 2.7 GB in an ls -lah listing), until fairly recently a backup could be started at the end of the business day, left running overnight, and would be finished by the time the day started again. Now, however, we have had several machines run for more than a day without finishing - in two cases, almost 48 hours without completing the process.
I'm currently trying to figure out where the likeliest problems are, but I'm not all that familiar with Linux's 'performance/maintenance' commands - I had to request help from a local user group just to learn about the existence of the iostat command.
Speaking of iostat, the output from my own run of it is below. I let it run for about 5 minutes, with a 15-second interval between reports.
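The invocation was along these lines (I'm reconstructing this from memory - the -x flag gives the extended device statistics shown, and 20 reports at 15-second intervals covers roughly five minutes):
Code:
# Extended device stats, every 15 seconds, 20 reports (~5 minutes).
# The exact report count is an approximation from memory.
iostat -x 15 20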
Code:
Linux 2.6.34-12-desktop (datatrac) 09/15/11 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.94 0.03 0.71 17.60 0.00 79.72
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.98 862.43 9.04 131.63 1291.16 7960.83 65.77 69.35 492.16 4.64 65.30
avg-cpu: %user %nice %system %iowait %steal %idle
0.45 0.00 0.41 25.20 0.00 73.94
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 180.27 0.00 160.00 0.00 2781.87 17.39 142.78 877.30 6.25 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.26 0.00 0.63 28.36 0.00 69.75
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1145.33 0.60 180.93 187.73 10328.53 57.93 117.61 694.73 5.35 97.12
avg-cpu: %user %nice %system %iowait %steal %idle
0.45 0.00 0.30 43.75 0.00 55.50
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 315.53 0.00 171.60 0.00 4207.47 24.52 155.57 891.52 5.83 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.38 0.00 0.25 46.66 0.00 52.71
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 213.07 0.07 180.40 0.53 3179.73 17.62 154.46 865.49 5.54 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.18 0.00 0.68 30.22 0.00 67.92
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1255.73 0.53 221.47 170.67 11820.80 54.02 121.42 519.80 4.39 97.56
avg-cpu: %user %nice %system %iowait %steal %idle
0.37 0.00 0.27 33.87 0.00 65.49
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 161.67 0.00 144.53 0.00 2511.47 17.38 143.70 1020.23 6.92 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.30 0.00 0.73 34.29 0.00 63.68
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.07 1344.07 0.73 225.20 188.27 12568.00 56.46 109.85 485.40 4.28 96.78
avg-cpu: %user %nice %system %iowait %steal %idle
0.37 0.00 0.24 36.51 0.00 62.88
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 178.47 0.00 159.73 0.00 2760.53 17.28 143.19 890.51 6.26 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.43 0.00 0.31 28.71 0.00 70.55
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 233.47 0.00 165.40 0.00 3322.13 20.09 124.27 792.68 6.04 99.96
avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 0.33 27.73 0.00 71.43
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 349.33 0.13 185.60 51.20 4258.13 23.20 124.83 644.99 5.36 99.60
avg-cpu: %user %nice %system %iowait %steal %idle
0.98 0.00 0.48 31.18 0.00 67.36
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 938.80 0.47 192.33 119.47 9028.80 47.45 115.30 601.93 5.09 98.04
avg-cpu: %user %nice %system %iowait %steal %idle
0.47 0.00 0.33 47.37 0.00 51.83
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 284.67 0.07 173.47 17.07 3721.60 21.54 131.72 746.72 5.76 99.98
avg-cpu: %user %nice %system %iowait %steal %idle
0.52 0.00 0.33 31.28 0.00 67.87
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 311.47 0.13 174.33 17.60 3925.33 22.60 122.64 713.80 5.72 99.72
avg-cpu: %user %nice %system %iowait %steal %idle
1.02 0.00 0.62 27.73 0.00 70.64
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1041.67 0.40 207.67 136.53 9997.87 48.71 121.31 575.26 4.72 98.13
avg-cpu: %user %nice %system %iowait %steal %idle
0.37 0.00 0.25 25.26 0.00 74.12
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 180.47 0.00 151.33 0.00 2665.60 17.61 143.23 953.02 6.61 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.30 0.05 0.65 28.87 0.00 69.13
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1281.53 0.40 215.27 119.47 12002.67 56.21 112.80 517.17 4.53 97.62
avg-cpu: %user %nice %system %iowait %steal %idle
0.35 0.00 0.27 36.36 0.00 63.02
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 189.93 0.00 166.73 0.00 2903.47 17.41 143.69 868.46 6.00 100.01
avg-cpu: %user %nice %system %iowait %steal %idle
0.52 0.00 0.48 45.01 0.00 53.99
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.07 283.20 0.13 165.87 34.67 3640.00 22.14 117.66 696.96 6.02 99.99
avg-cpu: %user %nice %system %iowait %steal %idle
1.14 0.00 0.51 30.89 0.00 67.45
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1124.60 0.47 197.53 136.53 10590.40 54.18 116.74 589.24 4.95 98.08
avg-cpu: %user %nice %system %iowait %steal %idle
0.33 0.00 0.27 29.16 0.00 70.24
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 198.93 0.00 160.87 0.00 2905.07 18.06 143.10 899.60 6.22 100.00
I'm a bit nervous about those numbers - the avgqu-sz and await values are both a lot higher than I can recall from the sample output I've seen online. Then again, the numbers I've seen online were presumably from systems that weren't in the middle of a database reload. I'm not sure whether what I'm seeing is evidence of the real problem, or simply an artifact of the conditions the problem occurs under.
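If I'm reading the units right (rsec/s and wsec/s are in 512-byte sectors), the second report above works out to about 2781.87 x 512 bytes, or roughly 1.4 MB/s of writes, while %util sits at 100 and await is up around 877 ms - which seems awfully slow for a sustained bulk load, if my interpretation is correct.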
Would someone please suggest some other avenues of exploration I could follow to track down the true root problem?