My office maintains several remote servers for our clients. Our practice has been to keep a few spare emergency backup servers on-site with us so that they can pick up replication from our PostgreSQL database - the idea being that the remote site won't have to be down for an extended length of time while the DB rebuilds on-site.
The database we're copying comes from our main server, which runs SuSE 9.3 with PostgreSQL 8.1.3 installed. The servers we're copying to are SuSE 11 boxes with PostgreSQL 8.4.4 on them. The dump on the main box is done with pg_dump and written out as a pair of SQL files - one for the schema, and one for the data. These are then run as scripts from the command line against an empty copy of the database on each backup box.
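For reference, the process is roughly equivalent to the following (the database and file names here are placeholders for illustration, not our actual ones):
Code:
# On the main server (PostgreSQL 8.1.3) - dump schema and data separately.
# "ourdb" and the file names are placeholders.
pg_dump -s ourdb > ourdb_schema.sql
pg_dump -a ourdb > ourdb_data.sql

# On a backup box (PostgreSQL 8.4.4) - run both against an empty database.
createdb ourdb
psql -d ourdb -f ourdb_schema.sql
psql -d ourdb -f ourdb_data.sql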
It has normally taken a while to load the database, but recently things have gone seriously screwy. While I admit the database has grown somewhat large (the data SQL script shows as 2.7 GB in an ls -lah listing), until fairly recently a backup could be started at the end of the business day, left running overnight, and would be finished by the time the day started again. Now, however, we have had several machines run for more than a day without finishing - in two cases, almost 48 hours without completing the process.
I'm currently trying to figure out where the likeliest problems are, but I'm not all that familiar with Linux's 'performance/maintenance' commands - I had to request help from a local user group just to learn about the existence of the iostat command.
Speaking of iostat, the output from my own run of it is below. I let it run for about 5 minutes, with a 15-second interval between reports.
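The invocation was along these lines (I'm reconstructing this from memory - the -x flag gives the extended device statistics shown, and 20 reports at 15-second intervals covers roughly five minutes):
Code:
# Extended device stats, every 15 seconds, 20 reports (~5 minutes).
# The exact report count is an approximation from memory.
iostat -x 15 20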
Code:
Linux 2.6.34-12-desktop (datatrac) 09/15/11 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.94 0.03 0.71 17.60 0.00 79.72
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.98 862.43 9.04 131.63 1291.16 7960.83 65.77 69.35 492.16 4.64 65.30
avg-cpu: %user %nice %system %iowait %steal %idle
0.45 0.00 0.41 25.20 0.00 73.94
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 180.27 0.00 160.00 0.00 2781.87 17.39 142.78 877.30 6.25 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.26 0.00 0.63 28.36 0.00 69.75
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1145.33 0.60 180.93 187.73 10328.53 57.93 117.61 694.73 5.35 97.12
avg-cpu: %user %nice %system %iowait %steal %idle
0.45 0.00 0.30 43.75 0.00 55.50
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 315.53 0.00 171.60 0.00 4207.47 24.52 155.57 891.52 5.83 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.38 0.00 0.25 46.66 0.00 52.71
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 213.07 0.07 180.40 0.53 3179.73 17.62 154.46 865.49 5.54 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.18 0.00 0.68 30.22 0.00 67.92
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1255.73 0.53 221.47 170.67 11820.80 54.02 121.42 519.80 4.39 97.56
avg-cpu: %user %nice %system %iowait %steal %idle
0.37 0.00 0.27 33.87 0.00 65.49
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 161.67 0.00 144.53 0.00 2511.47 17.38 143.70 1020.23 6.92 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.30 0.00 0.73 34.29 0.00 63.68
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.07 1344.07 0.73 225.20 188.27 12568.00 56.46 109.85 485.40 4.28 96.78
avg-cpu: %user %nice %system %iowait %steal %idle
0.37 0.00 0.24 36.51 0.00 62.88
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 178.47 0.00 159.73 0.00 2760.53 17.28 143.19 890.51 6.26 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.43 0.00 0.31 28.71 0.00 70.55
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 233.47 0.00 165.40 0.00 3322.13 20.09 124.27 792.68 6.04 99.96
avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 0.33 27.73 0.00 71.43
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 349.33 0.13 185.60 51.20 4258.13 23.20 124.83 644.99 5.36 99.60
avg-cpu: %user %nice %system %iowait %steal %idle
0.98 0.00 0.48 31.18 0.00 67.36
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 938.80 0.47 192.33 119.47 9028.80 47.45 115.30 601.93 5.09 98.04
avg-cpu: %user %nice %system %iowait %steal %idle
0.47 0.00 0.33 47.37 0.00 51.83
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 284.67 0.07 173.47 17.07 3721.60 21.54 131.72 746.72 5.76 99.98
avg-cpu: %user %nice %system %iowait %steal %idle
0.52 0.00 0.33 31.28 0.00 67.87
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 311.47 0.13 174.33 17.60 3925.33 22.60 122.64 713.80 5.72 99.72
avg-cpu: %user %nice %system %iowait %steal %idle
1.02 0.00 0.62 27.73 0.00 70.64
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1041.67 0.40 207.67 136.53 9997.87 48.71 121.31 575.26 4.72 98.13
avg-cpu: %user %nice %system %iowait %steal %idle
0.37 0.00 0.25 25.26 0.00 74.12
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 180.47 0.00 151.33 0.00 2665.60 17.61 143.23 953.02 6.61 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.30 0.05 0.65 28.87 0.00 69.13
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1281.53 0.40 215.27 119.47 12002.67 56.21 112.80 517.17 4.53 97.62
avg-cpu: %user %nice %system %iowait %steal %idle
0.35 0.00 0.27 36.36 0.00 63.02
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 189.93 0.00 166.73 0.00 2903.47 17.41 143.69 868.46 6.00 100.01
avg-cpu: %user %nice %system %iowait %steal %idle
0.52 0.00 0.48 45.01 0.00 53.99
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.07 283.20 0.13 165.87 34.67 3640.00 22.14 117.66 696.96 6.02 99.99
avg-cpu: %user %nice %system %iowait %steal %idle
1.14 0.00 0.51 30.89 0.00 67.45
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1124.60 0.47 197.53 136.53 10590.40 54.18 116.74 589.24 4.95 98.08
avg-cpu: %user %nice %system %iowait %steal %idle
0.33 0.00 0.27 29.16 0.00 70.24
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 198.93 0.00 160.87 0.00 2905.07 18.06 143.10 899.60 6.22 100.00
I'm a bit nervous about those numbers - the avgqu-sz and await values are both a lot higher than I can recall from the sample output I've seen online. Then again, the numbers I've seen online were presumably from systems that weren't in the middle of a database reload. I'm not sure whether what I'm seeing is evidence of the real problem, or simply an artifact of the conditions the problem occurs under.
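If I'm reading the units right (rsec/s and wsec/s are in 512-byte sectors), the second report above works out to about 2781.87 x 512 bytes, or roughly 1.4 MB/s of writes, while %util sits at 100 and await is up around 877 ms - which seems awfully slow for a sustained bulk load, if my interpretation is correct.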
Would someone please suggest some other avenues of exploration I could follow to track down the true root problem?