Skip to main content

Codd Design Ideas (Part 2)

Codd is a utility for managing PostgreSQL migrations, written in Haskell. In the Codd Design Ideas blog entry, I wrote about my design ideas, basically listing some features that I wish I had. This blog entry is a continuation of that one.

States, Migrations, and Database Upgrades

In my experiment with migration conflicts, I learned that resolving migration conflicts with Codd is a bit more complicated than I had anticipated. Git only registers conflicts when the migrations change the same column, index, etc. In other cases, Git can automatically merge, but the resulting snapshot is not necessarily correct. The example in my experiment is of migrations that both add a column to the same table, in which case the order of the migrations determines the column order. Resolving such issues involves getting the previous Codd snapshot, resetting the development database, and re-applying the new migration(s) in order to create the new snapshot.

The snapshot representation using hierarchically organized JSON files minimizes Git conflicts, but the column order discrepancy illustrates how these files may be interdependent. In this case, column configuration is stored separately, but each column must have the correct column order index. Adding or deleting a column affects all later columns.

Aside from the database configuration (such as locale) and role configuration, a snapshot is determined by the applied migrations. One generally does not store derivative files in revision control… What if we do not commit the snapshot?

Users could optionally configure a directory for storing snapshots. Users may configure a directory and either commit it or not; it is up to them. When a directory is not configured, a temporary directory can be used. Any number of snapshots can be stored/cached in this directory.

In non-development/test environments, a snapshot directory with the expected snapshots can be provided so that the consistency of the database can be verified like with the current Codd implementation. With multiple snapshots, it is possible to verify the database state before attempting any migrations, not just afterwards.

In development environments, snapshots can be created when an empty database is initialized by applying migrations. I wonder if creating a snapshot for every state would be excessive… When there are many migrations, it might indeed be an issue. Perhaps snapshots could be computed using a temporary database, on demand.

Temporary database functionality could of course be modular. I would like implementations using Docker and Gargoyle, and it should be possible for users to create an implementation for their own infrastructure. A Docker implementation could even make it easy to test across multiple versions of PostgreSQL! (In contrast, Gargoyle uses the version of PostgreSQL pinned in the Nix configuration.)

It may even be worthwhile to make the database modular… This type of migration management and consistency checking works for any type of database. Each implementation needs to create snapshots for the specific database and database version.

To make all of this work, there needs to be a good way to reference migrations and states/snapshots. Codd references migrations by timestamped filename, and there is no need to reference snapshots because only the latest snapshot is used. The codd_schema.sql_migrations table is defined as follows (code).

CREATE TABLE codd_schema.sql_migrations (
  id SERIAL PRIMARY KEY,
  migration_timestamp TIMESTAMPTZ NOT NULL,
  applied_at TIMESTAMPTZ NOT NULL,
  name TEXT NOT NULL,
  UNIQUE (name),
  UNIQUE (migration_timestamp)
);

When the codd up command is used to synchronize a database with the applied transactions, each entry of that table is verified. See the Codd Migration Inverses blog entry for an example of why this is problematic. I would like the database to keep a record of all migrations that were applied, including inverses, but inverses should never be registered in the list of applied migrations that Codd maintains.

It would be interesting to instead manage migrations based on state. Here is an untested, draft replacement for the above table. Note that I once again use the edgar name to avoid confusion (reference).

CREATE SEQUENCE edgar.transition___id___seq;

CREATE TABLE edgar.transition (
  id INTEGER NOT NULL DEFAULT nextval('edgar.transition___id___seq'),
  applied_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
  command TEXT NOT NULL,
  state_from TEXT NOT NULL,
  state_to TEXT NOT NULL,
  file_hash TEXT,
  CONSTRAINT transition_pkey PRIMARY KEY (id)
);

ALTER SEQUENCE edgar.transition___id___seq
  OWNED BY edgar.transition.id;

In this design, each transition specifies the state that is transitioned from (state_from) and the state that is transitioned to (state_to). After considering the many options, I think that using a SHA-256 hash of the (single) snapshot JSON file might be a good choice to reference states. Critically, identical database states have the same hash no matter how they were created. For example, the database state reached by a checkpoint migration must be the same as the state reached by the equivalent sequence of individual migrations, and this state should be referenced consistently. The first transition always migrates from the zero state.

I use a command column instead of a file column because I would like to record database upgrades as well as migrations. (See Codd Experiment 2: PostgreSQL Upgrade.) The idea is to use machine-readable command strings to specify how that transition was performed. Here are some example values:

  • apply("0002-init-schema.sql") specifies that migration file 0002-init-schema.sql was applied.
  • upgrade_checkpoint("14.7", "15.2", "0121-add-arrow.checkpoint.sql") specifies that the database was upgraded from version 14.7 to version 15.2 using the 0121-add-arrow.checkpoint.sql checkpoint migration.
  • upgrade("14.7", "15.2") specifies that the database was upgraded from version 14.7 to version 15.2 without any significant changes to the schema.

The file_hash column can be used to specify the SHA-256 hash of the migration file, when applicable. This information can be used for debugging, and it could also be used to provide more informative error messages.

Note that I use transition as the table name because it represents a transition between states, which may not involve a migration. I considered using step as the surrogate key, but I stuck with id because PostgreSQL sequences are not necessarily sequential.

The Codd table uses a unique index on the migration timestamp, which is the timestamp prepended to the migration filename when it is added in development, to determine order. Note that this means that the table cannot register the same migration being applied twice, but that makes sense because the table must match the list of applied migrations. In this design, the id column determines the order, and the table registers the full history of transitions. It is possible to apply a migration, apply the inverse, and then apply the same migration again after resolving an issue that is unrelated to the migration. This table retains that history.

For the same reason, there is no unique index for the command. This design retains the full history of transitions, and the same transition may be applied more than once, as described above.

Snapshots depend on the database version, as different versions include different low-level details. Snapshots created using the same migrations with different versions of the database therefore have different hash identifiers. Transitions that use upgrade commands specify these states, so transitions are continuous even across database upgrades.

The software should also cache migration information. The hash of each migration can be specified. In addition, the states that each migration transitions from and to can be specified for a given database version. This should be configured separately, because it would be useful to commit this file to the repository even when the snapshot directory is not committed. When snapshots are not committed to the repository, hashes in this file could be used to verify that snapshots that are created on demand perfectly match the snapshots created previously.

The following is a draft example, using dummy data.

[ { "filename": "0001-check-db-config.sql",
    "file_hash": "d7666b963c2969b0014937aae55472eea5098ff21ed3bea8a2e1f595f62856c1",
    "state_from":
      [ { "database": "pg-14",
          "snapshot_hash": "d914176fd50bd7f565700006a31aa97b79d3ad17cee20c8e5ff2061d5cb74817"
        },
        { "database": "pg-15",
          "snapshot_hash": "d914176fd50bd7f565700006a31aa97b79d3ad17cee20c8e5ff2061d5cb74817"
        }
      ],
    "state_to":
      [ { "database": "pg-14",
          "snapshot_hash": "3b0bff2192e5f86be6411daa9199bfdbeb87d6ce6aff66f33a8501c97f1df1af"
        },
        { "database": "pg-15",
          "snapshot_hash": "761b73bbbddf57580a95202dd163ea2b8bff3905ac7c7ae34f50421d64a26c28"
        }
      ]
  },
  ...
]

Note that the file should be formatted with whitespace so that diffs are meaningful and users can easily inspect them. Users may even edit the file in order to remove information that is no longer needed, but the software should provide cache management commands to make this easy. For example, a command could be used to remove state information for pg-14 after all environments have been upgraded to pg-15.

Test Scripts

I wrote about verify scripts, but I realized that checks that should always be run (even in production) could simply be included in the migrations themselves. I think that test scripts would be much more useful. The software could run test scripts only when migrations are performed in test mode. I create such tests all the time, but I currently test manually, and this feature would provide a framework for doing so as part of migration development.

Test scripts could be used to insert test data before a migration and then check the data in the database after the migration. For example, some migrations may work fine on an empty database but fail with a constraint violation in a populated database. Test scripts can ensure that the database is populated when testing migrations.

A test script with extension .test.pre.sql could be run before the migration (or migration step) with the same name. A test script with extension .test.post.sql could be run after the migration (or migration step) with the same name.

Dead Ends

In Codd Migration Inverses, I described a possible scenario where an unforeseen issue causes developers to abort using a (multi-step) migration and create a different (multi-step) migration to use instead. In this case, the migration index numbers overlap. As I clarified in an edit, the filename conventions are not sufficient to create the digraph in this case. The subsequent migration needs to know which state it is based on!

This issue could be resolved by adding more syntax to migration filenames. I think such cases are rare, however, and I have never witnessed one in my career. The migrations directory generally stores all migrations, but I think that removing aborted migrations is preferably to complicating the syntax. The migrations could be moved to an aborted subdirectory, so that they stay in the repository.