Codd Design Ideas

Codd is a utility for managing PostgreSQL migrations, written in Haskell. I quite like it, and I have been thinking about the design space. This blog entry describes my current design ideas.

Codd Overview

Codd is schema migration software that maintains an ordered list of applied SQL migrations. Each database includes a table that registers which of these migrations have been applied to that database. Codd maintains a low-level representation of the database and associated roles, which is retrieved from the database, not computed from the SQL migration files. Migrations are generally tested before being put into production, so Codd knows the expected state of the database after applying a migration and can check that the database state matches the expected state within the migration transaction.

Please see the Codd repository for more details about Codd features.

Terminology

The word schema is used for more than one concept. I use it in the general sense (reference). It is (IMHO unfortunately) also used by PostgreSQL to describe namespaces (reference). If I ever talk about this use, I specify PostgreSQL schema.

Codd represents the state of a database (and associated roles) at a given time using JSON, either organized hierarchically by key using many files or in a single JSON file. Migrations are the arrows (morphisms) between these database states. In this blog entry, I use indices to distinguish different states and migrations, and I call the JSON representation a snapshot to emphasize that it is the state at a specific time.

The initial state (\(S_0\)) is empty. The first migration (\(m_1\)) migrates a database from \(S_0\) to \(S_1\), the second migration (\(m_2\)) migrates a database from \(S_1\) to \(S_2\), etc.

Schema migration must manage databases across a number of different types of environments. I use the following terminology:

Production environments: used by customers, contain sensitive data, have SLOs/SLAs
Demonstration environments: used by sales staff and/or potential customers, may contain sensitive data, minimizing issues and downtime is “best effort”
Staging environments: used to test before pushing to production or demonstration environments
Testing environments: used by developers to test during development
Development environments: local environments each used by single developers during development

State Representation

With normal usage, Codd maintains a single snapshot, which represents the state of the database with all of the applied migrations. When managing the filesystem using Git, one can inspect previous states of the database by checking out previous commits, but this requires users to search for an appropriate commit and makes it difficult to reference multiple states.

Note that Codd snapshots are low-level representations. Each snapshot is specific to the version of PostgreSQL that is used to take the snapshot. See the Codd Experiment 2: PostgreSQL Upgrade blog entry for details.

It would be interesting to instead maintain the state of the database across all migrations. This is analogous to how Git is designed. The full history of changes is stored in a Git repository, while the Git workspace contains the state at a specific point in time. With such a design, database migration software can reference all states, which makes many of the following features possible.

A full history of database state across all applied migrations should work across multiple PostgreSQL versions. The software should therefore use a general representation that provides a “view” for each supported version of PostgreSQL.

Migration Tracking

Codd maintains migrations that have been applied to any database, on the filesystem. A migration is applied using the codd add command. The migration is appended to the list and the snapshot is updated only if the migration is successful. Migrations are ordered by the time that they were added, and the codd up command is used to synchronize a database with the current list of applied migrations.

A benefit of this design is that users do not need to worry about including index numbers in migration filenames. A drawback is that migrations that have not been applied must be managed separately. See the Codd Experiment blog entry for details about how I am currently doing this.

I have been thinking about not tracking applied migrations in this way. The migrations directory could instead contain all migration files whether they have been applied or not. Users need to worry about index numbers, but this is necessary to support other features anyway.

A basic migration file is named like NNNN-DESCRIPTION.sql. NNNN is a zero-padded, four-digit index, starting from 1. Other types of migrations are described below. For example, file 0001-init-schema.sql is a first migration that initializes the database schema.

A migration file may be changed while it is still under development, but it should not be changed once it has been used in a production or demonstration environment. I recommend setting the file as read-only (chmod 0444) as a way to remind developers that a migration should no longer be changed.

Verify Scripts

One feature that some schema migration tools have is verify scripts. A verify script is used to verify that a migration is successful. It is run in the same transaction as a migration, and any error raised causes a rollback.

A verify script has extension .verify.sql and corresponds to the migration file that has the same name but with a .sql extension. For example, file 0001-init-schema.verify.sql is a verify script for the above example migration.

Codd checks that the database state matches the expected state, so verify scripts do not need to check schema changes. Migrations often have to deal with data, however, and verify scripts can be useful for verifying critical data.

Migration Inverses

A migration inverse reverts a migration. The term inverse is borrowed from mathematics: a migration composed with its inverse is congruent to the identity function. In the following diagram, migration \(m_2\) migrates from state \(S_1\) to state \(S_2\). Migration \(m_2^{-1}\) is the corresponding inverse, which migrates from state \(S_2\) back to \(S_1\).

Migration inverses are critical for minimizing downtime when there is an unexpected issue, which may not even be related to the database migration. Some companies even require developing and testing migration inverses for all migrations to released services before applying them to production or demonstration environments.

A migration inverse has extension .inverse.sql and corresponds to the migration that has the same name but with a .sql extension. For example, file 0001-init-schema.inverse.sql is an inverse for the above example migration. A migration inverse may have a verify script. For example, file 0001-init-schema.inverse.verify.sql is a verify script for this inverse.

Developers must check that migration inverses are correct. For example, the inverse in the above diagram must satisfy the following laws, where \(1_{S_1}\) represents the identity function for \(S_1\), etc.

\[ 1_{S_1} \cong m_2^{-1} \circ m_2 \]

\[ 1_{S_2} \cong m_2 \circ m_2^{-1} \]

Due to the way that states are constructed, it should be sufficient to just check the first one. Checking requires snapshots for both database states. If the software maintains the state of the database across all migrations, then a command can be provided to make checking inverses very easy.

This cannot be easily done with Codd because only the latest snapshot is stored. Users can check inverses by copying the snapshot and using a temporary development database, however, as follows.

At state \(S_1\), make a copy of the current schema directory \(s_a\) as \(s_b\).
Update the CODD_EXPECTED_SCHEMA_DIR environment variable to point to \(s_b\).
Apply migration \(m_1\). This updates schema directory \(s_b\) to represent state \(S_2\).
Apply migration inverse \(m_1^{-1}\). This updates schema directory \(s_b\) to represent the new state.
Update the CODD_EXPECTED_SCHEMA_DIR environment variable to point to \(s_a\), which still represents state \(S_1\).
Run codd verify-shema to check the new state against state \(S_1\). The inverse reverted the migration if verification is successful.
Cleanup by removing the added migration and inverse from the CODD_MIGRATION_DIRS directory, remove schema directory \(s_b\), and discard the temporary database.

Multi-Step Migrations

IMHO, two things can make database migrations challenging. One is “data migration,” where changes to the schema necessitate transformation of data in order to maintain database constraints. The other is upgrading a distributed system without downtime. Simple migrations can be done using a single step, but complex migrations almost always require multiple steps, usually because software that uses the database must also be upgraded.

See the Multi-Step Migration section of the Codd Experiment blog post for an example of renaming a column. If software is using the old column name, then simply renaming the column causes that software to fail. In that example, the database is migrated to an intermediate state that is compatible with both old and new versions of the software, the software is upgraded to the new version (perhaps using a phased/gradual rollout), and finally the database is migrated to the target state.

Such migrations must be carefully prepared and tested when the cost of production issues is high. One must be especially careful with destructive migrations. For example, destructive commands are often isolated in a separate step that is only applied after backups have been taken and it is certain that the target data is no longer needed. The above example does not remove any data and only requires two steps, while a destructive migration may use a third step to perform deletions or drop tables.

The following diagram illustrates a two-step migration from state \(S_1\) to state \(S_2\), via intermediate state \(S_{1.1}\). Migrations \(m_{2.1}\) and \(m_{2.2}\) may have inverses \(m_{2.1}^{-1}\) and \(m_{2.2}^{-1}\), respectively. Optionally, migration \(m_2\) transitions from state \(S_1\) to state \(S_2\) without an intermediate state, with \(m_2^{-1}\) as the inverse.

Multi-step migration files are named like NNNN.S-DESCRIPTION.sql. S is the migration step number, starting from 1 for each index. For example, files 0002.1-rename-employee-name.sql and 0002.2-rename-employee-name.sql could be used to implement a two-step migration to rename a column. Inverses for these migrations would be in files 0002.1-rename-employee-name.inverse.sql and 0002.2-rename-employee-name.inverse.sql, respectively. Each of these may have a verify script, if necessary. File 0002-rename-employee-name.sql may migrate without steps, with file 0002-rename-employee-name.inverse.sql as the inverse.

Developers must check that migration inverses for each step are correct. For example, the inverses for the steps in the above diagram must satisfy the following laws.

\[ 1_{S_1} \cong m_{2.1}^{-1} \circ m_{2.1} \]

\[ 1_{S_{1.1}} \cong m_{2.1} \circ m_{2.1}^{-1} \]

\[ 1_{S_{1.1}} \cong m_{2.2}^{-1} \circ m_{2.2} \]

\[ 1_{S_2} \cong m_{2.2} \circ m_{2.2}^{-1} \]

If a migration without steps is provided, developers must check that it is congruent to using the steps. With the above example diagram, the following must be satisfied.

\[ m_2 \cong m_{2.2} \circ m_{2.1} \]

If the migration without steps has an inverse, developers must check that is is correct and is congruent to using the step inverses. With the above example diagram, the following must be satisfied.

\[ 1_{S_1} \cong m_2^{-1} \circ m_2 \]

\[ 1_{S_2} \cong m_2 \circ m_2^{-1} \]

\[ m_2^{-1} \cong m_{2.1}^{-1} \circ m_{2.2}^{-1} \]

The number of required checks increases with the number of steps. If the software maintains the state of the database across all migrations, then this can all be automated.

Checkpoints

A checkpoint is a migration that is equivalent to all previous migrations.

In this diagram, migration \(m_3^0\) is a checkpoint that is equivalent to the first three migrations.

\[ m_3^0 \cong m_3 \circ m_2 \circ m_1 \]

Checkpoints are named like NNNN-DESCRIPTION.checkpoint.sql. For example, file 0003-add-phone-table.checkpoint.sql is equivalent to the first three migrations. Note that checkpoints can only be applied to \(S_0\), the empty state. Also note that checkpoints should not be for individual steps. (There is no .S in a checkpoint file name.)

Developers must check that checkpoints are valid. If the software maintains the state of the database across all migrations, then this can be automated. It can be checked manually using Codd in a manner that is similar to the above inverse check.

One use of checkpoints is to clean up the database before release. There may be many changes during early development, and migrations may add columns in a non-optimal order. A checkpoint may be used to initialize a database that is equivalent to all previous migrations except that it changes the column order for good alignment (tutorial). Automated checking of checkpoints should therefore have an option to ignore column order.

Migration Management

A table in the database should track the full migration history, including applications of any inverses. Note that entries may differ in different databases for a number of reasons. Inverses may be needed in some database and not others, and different databases may be initialized using different checkpoints.

The final entry indicates the expected state of the database, and the expected state is \(S_0\) (the empty state) when there are no entries. This state determines which migrations are valid to apply to that database.

A database in state \(S_0\) should be confirmed empty before applying any migrations. Valid migrations include the following.

A first migration (Example: 0001-init-schema.sql)
A first migration step (Example: 0001.1-init-schema.sql)
A checkpoint (Example: 0121-use-beam-conventions.checkpoint.sql)

As an example of migrating from a different state, consider a database that is in state \(S_3\). The software should confirm that the database state matches the expected state before apply any migrations. This can only be done if the software maintains the state of the database across all migrations. Valid migrations include the following.

A migration with index 4 (Example: 0004-add-company.sql)
A first migration step with index 4 (Example: 0004.1-add-company.sql)
A migration inverse with index 3 (Example: 0003-add-phone-table.inverse.sql)
A last migration step inverse with index 3 (Example: 0003.2-rename-employee-name.inverse.sql)

For any of these migrations, verify scripts can be run as part of the transaction, if they exist. The software should also check that the final database state matches the target state, as part of the transaction.

User Interface

What kind of user interface could implement the above ideas? I use program name edgar to describe proposed commands. This distinguishes them from actual codd commands and hopefully avoids any confusion. (Edgar F. Codd invented the relational model.)

Database Commands

These commands are for working with a specific database.

edgar database status checks the status of a database. It first checks if the table that tracks migrations even exists. When it exists, it uses entries of that table to determine the expected state, as described above. Once the expected state is determined, the actual database state can be compared. This command must never modify the database.

This command is roughly equivalent to codd verify-schema, but it is much more versatile. Notably, it can check the status of databases that are not at the most recent state.

edgar database migrations lists migrations that have been applied to the database. It first checks if the table that tracks migrations even exists. If it exists, it lists the applied migrations. Command options can be used to determine how much detail is displayed. This command must never modify the database.

There is no Codd equivalent to this command, but one could easily be implemented.

edgar database migrate MIGRATION migrates the database. It first checks if the table that tracks migrations even exists. When it exists, it uses entries of that table to determine the expected state, as described above. Once the expected state is determined, the actual database state can be compared, and the command exits with an error if the database state does not match the expected state. The on-disk migrations form a graph of states and transitions, and the software must determine a path from the database state to the specified migration (target state). If there is no path, the command exits with an error. If there is a path, then the migration(s) are applied to the database. When the path traverses states that have multi-step migrations and a non-multi-step migration is available, the non-multi-step migration should be used by default. When the initial state is \(S_0\) and a checkpoint can be used, the most recent applicable checkpoint should be used by default. When a migration is applied for the first time, the new state is saved. This command should have a --dry-run option so that users can see the migration plan without performing any migrations. An --interactive option could be provided to allow users to choose to proceed or not. A --multi-step option could also be used to force use of multi-step migrations. A --no-checkpoint option could be used to avoid use of checkpoints.

This command is roughly equivalent to codd up, but it is much more versatile. Notably, it can be used to migrate to states that are not the most recent. It can also migrate “backwards” using inverses.

edgar database snapshot SNAPSHOT creates a snapshot of a database. This snapshot may not correspond to any valid state. This command must never modify the database.

This command is equivalent to codd write-schema.

Verify Commands

These commands are for verifying inverse and checkpoint validity. These commands could be implemented using temporary databases.

None of these commands are possible to implement in Codd because Codd only maintains the current/latest snapshot.

edgar verify inverse INVERSE_MIGRATION verifies a migration inverse.

edgar verify checkpoint CHECKPOINT_MIGRATION verifies a checkpoint migration. Option --ignore-column-order can be specified to ignore column order.

State Commands

These commands are for working with database states and snapshots. They do not use any database.

edgar state list lists the available states. This is \(S_0\) and a state for each migration that has been applied, including intermediate states for migration steps. Command options can be used to determine how much detail is displayed, such as listing migrations that are possible from each state.

A similar command could be implemented for Codd based on the list of applied migrations, but it would be pointless.

edgar state graph creates a graph that illustrates states and available migrations, perhaps writing Graphviz source code. Command options/arguments could be used to render a subset of the graph, select which types of migrations to display, etc.

This command is not relevant to Codd because Codd does not distinguish between different types of migrations.

edgar state export STATE SNAPSHOT exports a state to a snapshot. This makes it easy for users to inspect any state of the database.

This command is not relevant to Codd because Codd only tracks the current/latest state. Codd users instead use Git to checkout a revision of the repository that has a snapshot for the target state.

edgar state convert SOURCE_SNAPSHOT DEST_SNAPSHOT converts between snapshot formats. I considered using commands pack to convert from a hierarchical snapshot to a JSON snapshot and unpack to convert from a JSON snapshot to a hierarchical snapshot, but using convert leaves room for adding more snapshot formats in the future. Future formats could possibly include archives of hierarchical JSON, YAML, etc.

A similar command could be implemented in Codd.

edgar state diff STATE|SNAPSHOT STATE|SNAPSHOT compares states/snapshots. This is useful during development and for debugging issues.

A similar command that just compares snapshots could be implemented in Codd. Codd users should use Git to reference previous snapshots.

edgar state search SNAPSHOT searches for states that are most similar to the specified snapshot. Note that is uses comparison, not equality, so that users can find the most similar states.

This command is not possible to implement in Codd because Codd only tracks the current/latest state.

Conclusion

This is a pretty good initial draft of many features that I would love to have. It may serve as inspiration in the development of Codd.

Some features be implemented while others may never be implemented. Storing the state of the database across all migrations is required to implement the most useful features, but that would be a major change. Now that I have a description of the interface that I want, perhaps I can start to think about how to implement scripts that use Codd to achieve the same goals!

Author

Travis Cardwell

Published

March 29, 2023