pgcmp

Purpose

This toolset is intended to perform comparisons (“reconciliation”) of schemas between databases to determine if they are equivalent. There are several contexts where this is useful, each with a somewhat different understanding of equivalent.

Build process
- The build process has database schema in two forms:
  
  Build From Scratch
  A deployable database is built from scratch, from an empty database
  
  Upgrade From Earlier Version
  An older version is upgraded to the latest version using a set of upgrade scripts.
- We need to ensure that the results of building from scratch are identical to those that come from upgrading the old version using upgrade scripts.
- In this case, we do not expect any variations, aside from the fact that OIDs will vary between databases. These databases will be built within the same smoke test environment, and thus have the same sets of users, roles, and such, and there should not be any additional components introduced (as might be the case in production)
- It is therefore considered a build-breaking error to find any variation between the two database schemas
Validating upgrades
- When DBAs are planning to upgrade an instance, using the upgrade scripts, it is essential that their starting point represents a schema that is equivalent to the previous version. Otherwise, if the starting point is wrong, then the upgrade scripts may not provide a proper upgrade to the new version of the schema.
- In this environment, there are a number of acceptable kinds of differences that should be accommodated between the “production” schema and a “development” schema:
  
  Replication
  Development does not include replication, but in production, Slony schema, tables, functions, and such will exist, and the reconciliation should be able to exclude these differences.
  
  DBA Tooling
  DBAs add additional tooling into production databases, generally in their own schemas, to help with monitoring.
  
  Supplementary Applications
  Some applications have add-ons, and, when considering the main application, we may wish to exclude add-ons.
  
  Specialized Ownership
  Often, different users are used in production than are used in development.
- As a result, when reconciling the “old” schema version against the version in production, it will be necessary to accept the need for reconciling entries so that DBAs are not left puzzling over hundreds of differences that may be reasonably expected.
Validating replicas in sync
- When deploying an upgrade, it is not sufficient to know that the “master” node has an agreeable schema; it is essential that all replicas have suitable schemas.
- In the above analysis, it was necessary to accommodate there being considerable difference between the “development” schema and the “production” schema. When comparing replicas, differences should be considered rather more dangerous. Differences to accept/accommodate include:
  
  Node-specific deployment
  In some cases, add-ons are only installed on certain nodes in the cluster.
  
  Origin/Replica differences
  Slony defines some triggers that run only on origin nodes (e.g. - logtrigger()), and others that run only on replica nodes (e.g. - denyaccess()).

The Comparison Script

pgcmp tooling consists of pair of scripts:

This script pulls data about the schema from the two databases that are to be compared, loads that data into a third database where that data is compared in order, and performs a reconciliation of the similarities and differences.

pgcmp-dump: This tool extracts schema data from a database, putting it into a data file for later analysis
pgcmp: This tool uses a pair of files generated by pgcmp-dump, and looks for differences.

Security Considerations

This script requires read-only access to schema data in INFORMATION_SCHEMA and pg_catalog in both databases that are to be compared.

This script requires write access to a third database in which it constructs a series of tables used to perform the reconciliation.

Running pgcmp components

pgcmp-dump

pgcmp-dump uses a series of environment variables to control its activity.

Database connection information

PGDB
Database name

PPGGHOST
Database host

PGUSER
DB User

PGPORT
Port number

PGPGURI
Since PostgreSQL 9.2, libpq supports PGURIs as a singular connection string. pgcmp prefers the use of these URIs. If the individual parameters above are offered, they will be constructed into a URI; if PGURI is populated, it will be used instead.

PGCLABEL: Indicates an identifer for the database’s data to report in analysis output
PGCMPOUTPUT: The name of the file in which to store output, that is, the set of metadata about the database schema that will be used later for comparison with the schema of another database.
PGCMPIGNORESLONYTRIGGERS: defaults to false; if true, then ignore differences involving triggers of the form created by Slony-I

$ PGURI=postgresql://postgres@localhost/test1 PGCMPOUTPUT=/tmp/test-pgcmp-file1 PGCLABEL=db1 ../pgcmp-dump
pgcmp-dump - extracting schema data from database PGURI=[postgresql://postgres@localhost/test1]

Data file containing output: PGCMPOUTPUT[/tmp/test-pgcmp-file1]
Label: PGCLABEL=[db1]

Extracted schema data for postgresql://postgres@localhost/test1
-rw-r--r-- 1 cbbrowne cbbrowne 7350 Nov  3 12:12 /tmp/test-pgcmp-file1

pgcmp

pgcmp uses a pair of input files as produced by pgcmp-dump, loads them into a comparison database, and checks for differences.

It uses a series of environment variables to control its activity.

PGBINDIR: Indicates where to find psql

Database connection information

PGDB
Database name

PPGGHOST
Database host

PGUSER
DB User

PGPORT
Port number

PGURI
Since PostgreSQL 9.2, libpq supports PGURIs as a singular connection string. pgcmp prefers the use of these URIs. If the individual parameters above are offered, they will be constructed into a URI; if PGURI is populated, it will be used instead.

PGCLABEL1: Label identifying data for the first database
PGCLABEL2: Label identifying data for the second database
PGCOMITSCHEMAS: Indicates (as a SQL clause suitable to insert VALUES into a table), a set of schema labels from pg_catalog.pg_namespace that should be omitted from analysis. As identifiers, they may require doublequotes consistent with the function pg_catalog.quote_ident(). pg_catalog and information_schema were automatically left out in pgcmp-dump, so they do not need to be included.

Data file information
PGCWORKDIR
Work directory where work files are placed/found, defaults to /tmp

PGCEXPLANATIONS
Contains records intended to explain expected variations
- Structure is a Postgres COPY of data for a table of expected differences, defined as follows:

create table expected_differences (
    object_type text,
    object_name text,
    difference_type text,
    difference_cause text,
    primary key(object_type, object_name, difference_type, difference_cause)
);

PGCFULLOUTPUT

Contains full output of all records, those that match and those that do not.

Structure is a Postgres COPY of data from a table with the following structure:

create table fulloutput (
  object_name text,
  object_type text,
  label1 text,
  object_definition1 text,
  label2 text,
  object_definition2 text,
  difference_type text,
  difference_cause text,
  importance integer
);

PGCUNEXPLAINED

Contains the set of records that did not match between schemas that did not have explanations provided.

Structure is a Postgres COPY of data from a table with the following structure:

create table unexplained_items (
  object_name text,
  object_schema text,
  object_type text,
  label1 text,
  id1 integer,
  object_definition1 text,
  label2 text,
  id2 integer,
  object_definition2 text,
  difference_type text,
  importance integer
);

PGCBADEXPLAIN

Contains the set of records that did not match between schemas where explanations offered did not match the problem.

Structure is a Postgres COPY of data from a table with the following structure:

create table badexplanations_items (
  object_type text,
  object_schema text,
  object_name text,
  difference_type text,
  difference_cause text,
  importance integer
);

PGCMPINPUT1: Input file containing metadata about the schema for the first database
PGCMPINPUT2: Input file containing metadata about the schema for the second database

Interpretations of differences

object_type

Indicates which kind of object had a difference

object_schema

Indicates the namespace where the object occurs

object_name

Fully qualified name of the object

difference_type

Kind of difference, one of

type	description
match	perfect match across the databases
mismatch	values differ between databases
missing in 1st DB	exists in 2nd database, but not 1st one
missing in 2nd DB	exists in 1st database, but not 2nd one

difference_cause

Indicates whether a difference has been explained satisfactorily via the explanations.txt data

Not Yet Properly Explained: Indicates the difference has not been explained at all
Still Not Properly Explained: Indicates that explanations.txt indicates that the problem was not explained. In effect, the administrator took the data straight out of an earlier pgcmp run, and tried to use it as an explanation, as opposed to analyzing it to determine what should be done about it.
Misexplained: Indicates that the explanation in the explanations.txt file indicated a different sort of difference_type than was found by pgcmp.

importance

Indicates how crucial a discrepancy is. Low importance values indicate the most crucial items. Generally:

NULL: no discrepancy found; no difference to be examined
1: mismatch between databases, likely indicating a problem to be fixed
2: object is present in one database, absent in the other
3 or more: object is present in one database, absent in the other, but is a “child” of some other missing object. For instance, if a schema is missing, then that would be indicated as importance=2, the tables, views, sequences, and such, in that schema that are missing in one database would be marked with importance=3, and columns, indexes, and other attributes of tables/views/sequences would be marked with importance=4. The items with importance of 3 or 4 are the consequence of the importance=2 item; the focus of an analyst should be in the high importance (1,2) items.

Input File: `explanations.txt`

The input file, explanations.txt, provides a set of explanation items that allow an administrator to indicate explanations for discrepancies that are considered acceptable.

For instance:

Production includes Slony: If comparing a development schema that does not include replication against a production schema where Slony has been installed, it is to be expected that all of the Slony objects will comprise a set of “expected” differences.
Production Monitoring: DBAs may add in additional components such as the pgstattuples contrib module, or even additional schemas and tables.
Production Users: The production environment may be expected to have additional users and roles not found in the development schema.

create table expected_differences (
    object_type text,
    object_schema text,
    object_name text,
    difference_type text,
    difference_cause text,
    importance integer,
    primary key(object_type, object_name, difference_type, difference_cause)
);

How To Populate `explanations.txt`

The easiest way to populate this file is by running pgcmp.sh with an empty set of explanations, and then transforming the resulting set of unexplained items into “explained” differences.

insert into expected_differences (object_type, object_schema,
object_name, difference_type, difference_cause) select object_type,
object_schema, object_name, difference_type, 'Slony objects only in
production' from unexplained_items where object_name like '_oxrspro%';

insert into expected_differences (object_type, object_schema,
object_name, difference_type, difference_cause) select object_type,
object_schema, object_name, difference_type, 'contrib objects only in
production' from unexplained_items where object_name like
'postgres_contrib%';

insert into expected_differences (object_type, object_schema,
object_name, difference_type, difference_cause) select object_type,
object_schema, object_name, difference_type, 'Conversion objects to be
removed from production' from unexplained_items where object_name like
'dotpro_conversion%';

\copy expected_differences to '/tmp/expected_differences.txt';

In subsequent runs, these differences become “expected” differences, so that a DBA or QA analyst does not need to spend their attention manually filtering out these expected differences.

Outputs

The process has output in several forms:

Brief report to standard output
Files containing details
Return codes useful for determining success/failure

Brief Report

Here is an example of running a comparison between two schemas:

Parameters: lists values for all the environment variables
Extraction Summary: lists information about the files of extracted schema data
SQL messages: lists commands run against the comparison database
Results Summary: indicates, by object type, statistics on matches, differences, and explanations. This is a summary on the table fulloutput.
Inadequately Explained Items: indicates specific objects that were inadequately explained by the EXPLANATIONS data

File Output

The following files (based on contents of these environment variables) are created and populated via COPY:

FULLOUTPUT: populated from table fulloutput

This contains a full list of all objects examined in both databases, complete with objects, respective definitions, and difference type and cause

UNEXPLAINED
populated from table unexplained_items
This lists all objects where there was some difference, but no item found to explain the difference.

BADEXPLAIN
populated from table badexplanations_items
This lists all objects where there was some difference, and an explanation, but the explanation did not properly explain the difference. For instance, an object was missing from the second database, but the explanation indicated that there should have been a different definition (which indicates that the object was expected to be found in both databases).

Return Codes

If errors are encountered, the script pgcmp.sh will terminate with varying exit codes:

exit 1: If data could not be extracted from the either of the source databases
exit 1: If a connection is not established with the comparison database
exit 2: If the comparison script does not run successfully
exit 3: If not all object differences were adequately explained
exit 0: If all runs to completion, and differences were adequately explained

Thus, generally explaining this:

Comparison ran successfully, found no troublesome differences
Database connectivity problems
Error in processing comparison
Comparison ran, and found irreconcilable differences

These return codes should be useful if running scripts to do automated schema analyses.

Installation Requirements

pgcmp is implemented as a shell script (expects to be run using /bin/bash), and thus requires:

Bash: Available as /bin/bash
psql: Available from a Postgres installation

If building an RPM file, this documentation, in README.org, is transformed to HTML using a Ruby script that uses a Ruby “gem” called org-ruby.

There are several ways the org-ruby gem may be installed:

Manual Installation of Ruby Gem: This Gem may be installed via the command gem install org-ruby. On many systems, this would need to be run by the root user, perhaps using sudo.
Packaged Installation: On Debian (and perhaps derivatives such as Ubuntu), this gem is contained by the package ruby-org, and hence may be installed via apt-get install ruby-org.

Name		Name	Last commit message	Last commit date
Latest commit History 131 Commits
docker		docker
test		test
.gitignore		.gitignore
.gitlab-ci.yml		.gitlab-ci.yml
COPYRIGHT		COPYRIGHT
INSTALL		INSTALL
Makefile		Makefile
README.org		README.org
org-to-html		org-to-html
pgcmp		pgcmp
pgcmp work flow.odg		pgcmp work flow.odg
pgcmp work flow.png		pgcmp work flow.png
pgcmp-dump		pgcmp-dump
pgcmp.spec		pgcmp.spec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pgcmp

Purpose

The Comparison Script

Security Considerations

Running pgcmp components

pgcmp-dump

pgcmp

Interpretations of differences

Input File: `explanations.txt`

How To Populate `explanations.txt`

Outputs

Brief Report

File Output

Return Codes

Installation Requirements

About

Releases

Packages

Languages

cbbrowne/pgcmp

Folders and files

Latest commit

History

Repository files navigation

pgcmp

Purpose

The Comparison Script

Security Considerations

Running pgcmp components

pgcmp-dump

pgcmp

Interpretations of differences

Input File: explanations.txt

How To Populate explanations.txt

Outputs

Brief Report

File Output

Return Codes

Installation Requirements

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Input File: `explanations.txt`

How To Populate `explanations.txt`

Packages