Hi. I'm jon.404, a Unix/Linux/Database/OpenStack/Kubernetes Administrator, AWS/GCP/Azure Engineer, mathematics enthusiast, and amateur philosopher. This is where I rant about that which upsets me, laugh about that which amuses me, and jabber about that which holds my interest most: *nix.
PostgreSQL 8.x backup caveat
During the deployment of our latest software revision (a major version bump that migrated all of our databases from MSSQL to Linux/PostgreSQL and all of our web servers from Apache/PHP to Tomcat/GWT), IT operations immediately started hitting snags with database backups. This major revision allowed customers to store documents, photos, etc. directly in the database (something MSSQL 2005 isn't capable of). Doing a simple "pg_dump <opts> <dbname>" resulted in a massive wait...for larger customer databases, the backup operation could easily take 10+ hours to complete.
The solution was to break the backup up into four separate files: one for the schema layout, one for a table containing photos (COPY ... WITH BINARY), one for another table containing documents (COPY ... WITH BINARY), and a final one for all of the remaining data (excluding those two tables). Suddenly backups take mere minutes. Restores are a bit trickier, as you have to restore the schema, then the data, then the binary data (in that order), but that's nothing a shell script can't accomplish.
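For reference, the split described above can be sketched as a pair of shell functions. The database and table names here ("appdb", "photos", "documents") are placeholders, and pg_dump's -T (exclude table) flag needs a reasonably recent release (8.2 or later), so treat this as a starting point rather than a drop-in:

```shell
#!/bin/sh
# Sketch of the four-file split backup. Database and table names are
# placeholders; pg_dump's -T (exclude table) requires 8.2 or later.
split_backup() {
    db=$1
    # 1) schema layout only
    pg_dump -s "$db" > schema.sql
    # 2 and 3) the binary tables, dumped in binary COPY format (no text escaping)
    psql -d "$db" -c "COPY photos TO STDOUT WITH BINARY"    > photos.copy
    psql -d "$db" -c "COPY documents TO STDOUT WITH BINARY" > documents.copy
    # 4) everything else, data only, minus the two binary tables
    pg_dump -a -T photos -T documents "$db" > data.sql
}

restore_backup() {
    db=$1
    # Order matters: schema, then plain data, then the binary tables.
    psql -d "$db" -f schema.sql
    psql -d "$db" -f data.sql
    psql -d "$db" -c "COPY photos FROM STDIN WITH BINARY"    < photos.copy
    psql -d "$db" -c "COPY documents FROM STDIN WITH BINARY" < documents.copy
}
```

Invoke `split_backup appdb` from cron, and a restore into a freshly created database becomes `restore_backup appdb`.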
So it raised an interesting question. Why did the straight pg_dump commands take forever to complete?
The answer is that when you store binary data in a text dump, you have to represent "unprintable characters" somehow so there is no data loss. PostgreSQL 8.x does this with the bytea escape format: bytes that aren't printable ASCII are written out as backslash-octal sequences, and after COPY's own backslash doubling a single byte like 0x00 becomes the five characters \\000. For typical binary data such as photos or documents, the text dump can end up three to four times the size of the raw bytes, and all of that escaping and unescaping takes time...a lot of time. If that sounds inefficient, well, it is.
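To see where the bloat comes from, consider how a single unprintable byte is rendered in a text-mode dump (a toy illustration of the escape format, not PostgreSQL's actual code):

```shell
#!/bin/sh
# A 0x00 byte in a bytea column comes out of a text-mode COPY as the
# five characters \\000 (two backslashes plus three octal digits).
escaped='\\000'
printf 'one binary byte becomes %d characters of dump text\n' "${#escaped}"
```

Printable bytes pass through unchanged, so the overall expansion for real-world binary data lands somewhere between 1x and 5x; either way, every single byte has to be inspected and escaped on the way out and unescaped on the way back in.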
So remember: if your schema includes binary data such as photos or office documents, split off the tables containing the binaries with a COPY ... WITH BINARY command. You'll save yourself a ton of time and storage space.