Codd Design Ideas (Part 2)
Codd is a utility for managing PostgreSQL migrations, written in Haskell. In the Codd Design Ideas blog entry, I wrote about my design ideas, basically listing some features that I wish I had. This blog entry is a continuation of that one.
States, Migrations, and Database Upgrades
In my experiment with migration conflicts, I learned that resolving migration conflicts with Codd is a bit more complicated than I had anticipated. Git only registers conflicts when the migrations change the same column, index, etc. In other cases, Git can automatically merge, but the resulting snapshot is not necessarily correct. The example in my experiment is of migrations that both add a column to the same table, in which case the order of the migrations determines the column order. Resolving such issues involves getting the previous Codd snapshot, resetting the development database, and re-applying the new migration(s) in order to create the new snapshot.
The snapshot representation using hierarchically organized JSON files minimizes Git conflicts, but the column order discrepancy illustrates how these files may be interdependent. In this case, column configuration is stored separately, but each column must have the correct column order index. Adding or deleting a column affects all later columns.
Aside from the database configuration (such as locale) and role configuration, a snapshot is determined by the applied migrations. One generally does not store derivative files in revision control… What if we do not commit the snapshot?
Users could optionally configure a directory for storing snapshots. Users may configure a directory and either commit it or not; it is up to them. When a directory is not configured, a temporary directory can be used. Any number of snapshots can be stored/cached in this directory.
In non-development/test environments, a snapshot directory with the expected snapshots can be provided so that the consistency of the database can be verified like with the current Codd implementation. With multiple snapshots, it is possible to verify the database state before attempting any migrations, not just afterwards.
In development environments, snapshots can be created when an empty database is initialized by applying migrations. I wonder if creating a snapshot for every state would be excessive… When there are many migrations, it might indeed be an issue. Perhaps snapshots could be computed using a temporary database, on demand.
Temporary database functionality could of course be modular. I would like implementations using Docker and Gargoyle, and it should be possible for users to create an implementation for their own infrastructure. A Docker implementation could even make it easy to test across multiple versions of PostgreSQL! (In contrast, Gargoyle uses the version of PostgreSQL pinned in the Nix configuration.)
It may even be worthwhile to make the database modular… This type of migration management and consistency checking works for any type of database. Each implementation needs to create snapshots for the specific database and database version.
To make all of this work, there needs to be a good way to reference
migrations and states/snapshots. Codd references migrations by
timestamped filename, and there is no need to reference snapshots
because only the latest snapshot is used. The
codd_schema.sql_migrations
table is defined as follows (code).
CREATE TABLE codd_schema.sql_migrations (
id SERIAL PRIMARY KEY,
NOT NULL,
migration_timestamp TIMESTAMPTZ NOT NULL,
applied_at TIMESTAMPTZ NOT NULL,
name TEXT UNIQUE (name),
UNIQUE (migration_timestamp)
);
When the codd up
command is used to synchronize a
database with the applied transactions, each entry of that table is
verified. See the Codd Migration
Inverses blog entry for an example of why this is problematic. I
would like the database to keep a record of all migrations that were
applied, including inverses, but inverses should never be registered in
the list of applied migrations that Codd maintains.
It would be interesting to instead manage migrations based on state.
Here is an untested, draft replacement for the above table.
Note that I once again use the edgar
name to avoid
confusion (reference).
CREATE SEQUENCE edgar.transition___id___seq;
CREATE TABLE edgar.transition (
id INTEGER NOT NULL DEFAULT nextval('edgar.transition___id___seq'),
TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
applied_at NOT NULL,
command TEXT NOT NULL,
state_from TEXT NOT NULL,
state_to TEXT
file_hash TEXT,CONSTRAINT transition_pkey PRIMARY KEY (id)
);
ALTER SEQUENCE edgar.transition___id___seq
BY edgar.transition.id; OWNED
In this design, each transition specifies the state that is
transitioned from (state_from
) and the state that is
transitioned to (state_to
). After considering the many
options, I think that using a SHA-256 hash of the
(single) snapshot JSON file might be a good choice to reference states.
Critically, identical database states have the same hash no matter how
they were created. For example, the database state reached by a
checkpoint migration must be the same as the state reached by the
equivalent sequence of individual migrations, and this state should be
referenced consistently. The first transition always migrates from the
zero state.
I use a command
column instead of a file
column because I would like to record database upgrades as well as
migrations. (See Codd
Experiment 2: PostgreSQL Upgrade.) The idea is to use
machine-readable command strings to specify how that transition was
performed. Here are some example values:
apply("0002-init-schema.sql")
specifies that migration file0002-init-schema.sql
was applied.upgrade_checkpoint("14.7", "15.2", "0121-add-arrow.checkpoint.sql")
specifies that the database was upgraded from version14.7
to version15.2
using the0121-add-arrow.checkpoint.sql
checkpoint migration.upgrade("14.7", "15.2")
specifies that the database was upgraded from version14.7
to version15.2
without any significant changes to the schema.
The file_hash
column can be used to specify the SHA-256 hash of the
migration file, when applicable. This information can be used for
debugging, and it could also be used to provide more informative error
messages.
Note that I use transition
as the table name because it
represents a transition between states, which may not involve a
migration. I considered using step
as the surrogate key,
but I stuck with id
because PostgreSQL sequences are
not necessarily sequential.
The Codd table uses a unique index on the migration timestamp, which
is the timestamp prepended to the migration filename when it is added in
development, to determine order. Note that this means that the table
cannot register the same migration being applied twice, but that makes
sense because the table must match the list of applied migrations. In
this design, the id
column determines the order, and the
table registers the full history of transitions. It is possible to apply
a migration, apply the inverse, and then apply the same migration again
after resolving an issue that is unrelated to the migration. This table
retains that history.
For the same reason, there is no unique index for the command. This design retains the full history of transitions, and the same transition may be applied more than once, as described above.
Snapshots depend on the database version, as different versions include different low-level details. Snapshots created using the same migrations with different versions of the database therefore have different hash identifiers. Transitions that use upgrade commands specify these states, so transitions are continuous even across database upgrades.
The software should also cache migration information. The hash of each migration can be specified. In addition, the states that each migration transitions from and to can be specified for a given database version. This should be configured separately, because it would be useful to commit this file to the repository even when the snapshot directory is not committed. When snapshots are not committed to the repository, hashes in this file could be used to verify that snapshots that are created on demand perfectly match the snapshots created previously.
The following is a draft example, using dummy data.
[ { "filename": "0001-check-db-config.sql",
"file_hash": "d7666b963c2969b0014937aae55472eea5098ff21ed3bea8a2e1f595f62856c1",
"state_from":
[ { "database": "pg-14",
"snapshot_hash": "d914176fd50bd7f565700006a31aa97b79d3ad17cee20c8e5ff2061d5cb74817"
},
{ "database": "pg-15",
"snapshot_hash": "d914176fd50bd7f565700006a31aa97b79d3ad17cee20c8e5ff2061d5cb74817"
}
],
"state_to":
[ { "database": "pg-14",
"snapshot_hash": "3b0bff2192e5f86be6411daa9199bfdbeb87d6ce6aff66f33a8501c97f1df1af"
},
{ "database": "pg-15",
"snapshot_hash": "761b73bbbddf57580a95202dd163ea2b8bff3905ac7c7ae34f50421d64a26c28"
}
]
},
...
]
Note that the file should be formatted with whitespace so that diff
s are
meaningful and users can easily inspect them. Users may even edit the
file in order to remove information that is no longer needed, but the
software should provide cache management commands to make this easy. For
example, a command could be used to remove state information for
pg-14
after all environments have been upgraded to
pg-15
.
Test Scripts
I wrote about verify scripts, but I realized that checks that should always be run (even in production) could simply be included in the migrations themselves. I think that test scripts would be much more useful. The software could run test scripts only when migrations are performed in test mode. I create such tests all the time, but I currently test manually, and this feature would provide a framework for doing so as part of migration development.
Test scripts could be used to insert test data before a migration and then check the data in the database after the migration. For example, some migrations may work fine on an empty database but fail with a constraint violation in a populated database. Test scripts can ensure that the database is populated when testing migrations.
A test script with extension .test.pre.sql
could be run
before the migration (or migration step) with the same name. A test
script with extension .test.post.sql
could be run after the
migration (or migration step) with the same name.
Dead Ends
In Codd Migration Inverses, I described a possible scenario where an unforeseen issue causes developers to abort using a (multi-step) migration and create a different (multi-step) migration to use instead. In this case, the migration index numbers overlap. As I clarified in an edit, the filename conventions are not sufficient to create the digraph in this case. The subsequent migration needs to know which state it is based on!
This issue could be resolved by adding more syntax to migration
filenames. I think such cases are rare, however, and I have never
witnessed one in my career. The migrations directory generally stores
all migrations, but I think that removing aborted migrations is
preferably to complicating the syntax. The migrations could be moved to
an aborted
subdirectory, so that they stay in the
repository.