Codd Experiment 4: Migration Conflicts
Codd is a utility for managing PostgreSQL migrations, written in Haskell. Codd maintains applied transactions and a snapshot of the database state with all transactions applied. In this blog post, I experiment with migration conflicts. In a large project with many developers, migration conflicts happen quite often, particularly during early stages of development.
Test Project
I am organizing this experiment like the one described in the Codd Experiment blog entry.
I manage all migrations in a migrations
directory and link
to applied migrations in the Codd-managed directory, stored under a
var
directory. See the blog entry for details.
$ mkdir /tmp/codd-conflict
$ cd /tmp/codd-conflict
$ mkdir -p bin config migrations var/codd-{migrations,schema}
$ wget -O bin/codd \
https://github.com/mzabani/codd/releases/download/v0.1.1/codd
$ chmod 0755 bin/codd
This experiment simulates migration merge conflicts between two
developers, Alice and Bob. File config/codd-alice.sh
configures Alice’s development environment.
export CODD_CONNECTION=postgres://postgres:password@127.0.0.1/postgres
export CODD_EXPECTED_SCHEMA_DIR=var/codd-schema
export CODD_MIGRATION_DIRS=var/codd-migrations
export CODD_TXN_ISOLATION=serializable
File config/codd-bob.sh
configures Bob’s development
environment. The only difference is that it uses a different host port,
required since I run the servers simultaneously on the same system.
export CODD_CONNECTION=postgres://postgres:password@127.0.0.1:5433/postgres
export CODD_EXPECTED_SCHEMA_DIR=var/codd-schema
export CODD_MIGRATION_DIRS=var/codd-migrations
export CODD_TXN_ISOLATION=serializable
Test Servers
For this experiment, I use postgres
containers to run PostgreSQL servers. The following command runs Alice’s
development server on host port 5432.
$ docker run \
--detach \
--name "pg_codd_alice" \
--publish "127.0.0.1:5432:5432" \
--env POSTGRES_DB="postgres" \
--env POSTGRES_USER="postgres" \
--env POSTGRES_PASSWORD="password" \
"postgres:11-alpine"
The following command runs Bob’s development server on host port 5433.
$ docker run \
--detach \
--name "pg_codd_bob" \
--publish "127.0.0.1:5433:5432" \
--env POSTGRES_DB="postgres" \
--env POSTGRES_USER="postgres" \
--env POSTGRES_PASSWORD="password" \
"postgres:11-alpine"
Project Initialization
Alice performs project initialization. The following command configures Alice’s environment.
$ source config/codd-alice.sh
This experiment uses Git. The following command initializes the repository.
$ git init
For simplicity, I initialize this test project directly in the
main
branch. The bin
directory is just used to
make this experiment easily repeatable, so I do not include it in the
repository.
$ echo bin > .gitignore
$ git add .gitignore config
$ git commit -m 'Initial commit'
Migration migrations/0001-init-codd.sql
is a special
no-op migration that is required to make Codd initialize the
codd_schema.sql_migrations
table correctly. See the Codd
Initialization section for details.
-- codd: no-txn
SELECT 1;
Alice adds and links this migrations. Note that I replace Codd filename timestamps with ellipsis in the commands below. If you are following along, you will have different timestamps.
$ ./bin/codd add migrations/0001-init-codd.sql
$ ln -s \
../var/codd-migrations/...-0001-init-codd.sql \
migrations/0001-init-codd.sql
Previous migrations must never be modified. The following command makes this migration read-only.
$ chmod 0444 var/codd-migrations/*
The following commands create the initial commit, including this migration and the resulting snapshot.
$ git add migrations var
$ git commit -m 'Add migration 0001-init-codd'
Migration migrations/0002-add-employee.sql
adds an
employee
table. Note that this migration is added in a
separate Git commit so that Git retains a history of every snapshot. If
this migration had been added in the above commit, then the snapshot
representing the state of a database after the first migration has been
applied would not be stored.
CREATE TABLE employee (
id SERIAL NOT NULL,
NOT NULL,
name TEXT CONSTRAINT employee_pkey PRIMARY KEY (id)
);
Alice adds this migration, links it, and sets it read-only.
$ ./bin/codd add migrations/0002-add-employee.sql
$ ln -s \
../var/codd-migrations/...-0002-add-employee.sql \
migrations/0002-add-employee.sql
$ chmod 0444 var/codd-migrations/*
The following commands create a commit that includes this migration and the resulting snapshot.
$ git add migrations var
$ git commit -m 'Add migration 0002-add-employee'
Merge Conflict
Codd represents the current schema (“snapshot”) in
var/codd-schema
using hierarchically organized JSON files.
For example, the following shows this hierarchy for the current
snapshot.
$ find var/codd-schema -type f | sort
var/codd-schema/db-settings
var/codd-schema/roles/postgres
var/codd-schema/schemas/public/objrep
var/codd-schema/schemas/public/sequences/employee_id_seq
var/codd-schema/schemas/public/tables/employee/cols/id
var/codd-schema/schemas/public/tables/employee/cols/name
var/codd-schema/schemas/public/tables/employee/constraints/employee_pkey
var/codd-schema/schemas/public/tables/employee/indexes/employee_pkey
var/codd-schema/schemas/public/tables/employee/objrep
Git merge conflicts only occur when there are multiple changes to the same JSON file. For example, changes to the same column results in a merge conflict, while changing different columns or adding different new columns would not result in conflicts.
I think that actual merge conflicts will be quite rare in practice, but I test one in this section anyway.
Alice (1)
Alice works on a new feature in a feature branch. The following
command creates a branch named alice
.
$ git checkout -b alice
The project requires a contact email for all employees. A migration
cannot simply add a non-nullable column to a table that has existing
rows, because existing rows would violate the NOT NULL
constraint if they have no email
value. There are various
ways to deal with this.
This migration allows the email
column to be null.
Software can then ensure that any new rows include an email address, and
any existing employees can be asked to register an email address. A
later migration can change the column constraints after all rows have an
email address.
For the sake of the example, the hypothetical software searches by email address. A index is added to support this case, and note that it is a partial index because the column is nullable.
Migration migrations/0003-add-employee-email.sql
adds an
email
column to the employee
table, as well as
the partial index.
ALTER TABLE employee
ADD COLUMN email TEXT;
CREATE INDEX employee_email_index
ON employee(email)
WHERE email IS NOT NULL;
Alice adds this migration, links it, and sets it read-only.
$ ./bin/codd add migrations/0003-add-employee-email.sql
$ ln -s \
../var/codd-migrations/...-0003-add-employee-email.sql \
migrations/0003-add-employee-email.sql
$ chmod 0444 var/codd-migrations/*
The following commands create a commit (in the alice
branch) that includes this migration and the resulting snapshot.
$ git add migrations var
$ git commit -m 'Add migration 0003-add-employee-email'
Alice then does further work on the feature. To add another commit, I represent this with a simple addition of a README.
$ echo "Alice was here!" > README.md
$ git add README.md
$ git commit -m 'Add README'
Bob (1)
Simultaneously, Bob works in a branch named bob
, based
on the main
branch.
$ git checkout main
$ git checkout -b bob
The following commands load Bob’s configuration and synchronizes
their development database. Note that the database state is that of the
main
branch, not the alice
branch.
$ source config/codd-bob.sh
$ codd up
For the sake of the example, Bob also adds an email migration.
Migration migrations/0003-add-employee-email.sql
is similar
to Alice’s migration, expect that it configures the index differently.
Bob’s index forces contact email addresses to be unique, and uses
lower
so support case-insensitive searching. Note that it
also uses migration index 3, as it is a migration from state 2 to state
3.
ALTER TABLE employee
ADD COLUMN email TEXT;
CREATE UNIQUE INDEX employee_email_index
ON employee(lower(email))
WHERE email IS NOT NULL;
Note that I would normally name this index
employee_lower_email_unique
, but I do not use this name
because I want a merge conflict.
Bob adds this migration, links it, and sets it read-only.
$ ./bin/codd add migrations/0003-add-employee-email.sql
$ ln -s \
../var/codd-migrations/...-0003-add-employee-email.sql \
migrations/0003-add-employee-email.sql
$ chmod 0444 var/codd-migrations/*
The following commands create a commit (in the bob
branch) that includes this migration and the resulting snapshot.
$ git add migrations var
$ git commit -m 'Add migration 0003-add-employee-email'
Bob then does further work on the feature. This work may include conflicts with Alice’s work, so I represent this with a simple addition of a README.
$ echo "Bob was here!" > README.md
$ git add README.md
$ git commit -m 'Add README'
Bob finishes development before Alice and merges to the
main
branch.
$ git checkout main
$ git merge bob
Alice (2)
Alice inspects the new commits to main
and decides to
incorporate them in the alice
branch.
$ git checkout alice
$ source config/codd-alice.sh
Before doing anything, Alice’s version of the email migration is in
the Codd migration directory, and there is a link to it in the
migrations
directory. The Codd snapshot directory contains
the state of the database with this migration.
When rebasing on main
, Git merges Bob’s migration and
snapshot with Alice’s at a line level. Since the same index is
configured differently, there are merge conflicts.
$ git rebase main
File migrations/0003-add-employee-email.sql
has a merge
conflict. It is a link to a migration file in the
var/codd-migrations
directory. The link from the
main
branch, merged from the bob
branch,
points to Bob’s migration, while the link from the alice
brand points to Alice’s migration. The actual migration files in
var/codd-migrations
do not have merge conflicts because
Codd timestamps them and they have different timestamps.
File
var/codd-schema/schemas/public/tables/employee/indexes/employee_email_index
also has a merge conflict. The JSON is minimized, so the full lines are
marked as different. This is not very user friendly. (I created issue #137 about
this.) The following commands show the differences. I change to the
directory just so the command is not too long.
$ cd var/codd-schema/schemas/public/tables/employee/indexes/
$ diff \
<(sed -n 2p employee_email_index | python -m json.tool) \
<(sed -n 4p employee_email_index | python -m json.tool)
$ cd -
There are two differences. The definition is different, and the
main
version is unique while the alice
version
is not.
5c5
< "definition": "CREATE UNIQUE INDEX employee_email_index ON public.employee USING btree (lower(email)) WHERE (email IS NOT NULL)",
---
> "definition": "CREATE INDEX employee_email_index ON public.employee USING btree (email) WHERE (email IS NOT NULL)",
18c18
< "unique": true
---
> "unique": false
How can Alice resolve the conflicts? First, the duplicate migrations
that are not actually conflicting need to be resolved. Alice and Bob
discuss and decide that a unique constraint should not be used, but
lower
is a good idea. Alice needs to change the migration
to the following.
ALTER TABLE employee
ADD COLUMN email TEXT;
CREATE INDEX employee_email_index
ON employee(lower(email))
WHERE email IS NOT NULL;
This is a combinations of both migrations, and neither Alice’s nor Bob’s snapshot match the resulting state. Perhaps it is best to go back to the previous state and add the updated migration.
The following commands remove both versions of the migration.
$ rm migrations/0003-add-employee-email.sql
$ git rm var/codd-migrations/*-0003-add-employee-email.sql
Alice uses git log
to find the commit hash of the commit
that has the desired snapshot.
$ git log
The following commands restore the snapshot from that commit.
$ rm -rf var/codd-schema
$ git checkout ...hash... var/codd-schema
Alice’s development database needs to be reset as well.
$ docker stop pg_codd_alice | xargs docker rm
$ docker run \
--detach \
--name "pg_codd_alice" \
--publish "127.0.0.1:5432:5432" \
--env POSTGRES_DB="postgres" \
--env POSTGRES_USER="postgres" \
--env POSTGRES_PASSWORD="password" \
"postgres:11-alpine"
$ ./bin/codd up
Alice now saves the updated migration to
migrations/0003-add-employee-email.sql
, adds it, links it,
and sets it read-only.
$ ./bin/codd add migrations/0003-add-employee-email.sql
$ ln -s \
../var/codd-migrations/...-0003-add-employee-email.sql \
migrations/0003-add-employee-email.sql
$ chmod 0444 var/codd-migrations/*
The updated files can now be added to the Git index.
$ git add \
var/codd-migrations/...-0003-add-employee-email.sql \
migrations/0003-add-employee-email.sql \
var/codd-schema/schemas/public/tables/employee/indexes/employee_email_index
Alice continues the rebase.
$ git rebase --continue
The next commit, representing changes to the software also has a conflict.
<<<<<<< HEAD
Bob was here!
=======
Alice was here!
>>>>>>> 97de627 (Add README)
Alice resolves the conflict and finishes the rebase.
$ echo 'Alice and Bob were here!' > README.md
$ git add README.md
$ git rebase --continue
Alice finishes development and merges to the main
branch.
$ git checkout main
$ git merge alice
Bob (2)
Bob, as well as any other developer that used the main
branch after Bob merged the changes, must delete and recreate their
development databases in order to use the updated migration.
$ source config/codd-bob.sh
Bob updates the bob
branch.
$ git checkout bob
$ git rebase main
Bob resets the pg_codd_bob
database.
$ docker stop pg_codd_bob | xargs docker rm
$ docker run \
--detach \
--name "pg_codd_bob" \
--publish "127.0.0.1:5433:5432" \
--env POSTGRES_DB="postgres" \
--env POSTGRES_USER="postgres" \
--env POSTGRES_PASSWORD="password" \
"postgres:11-alpine"
$ ./bin/codd up
No Merge Conflict
It is pretty rare for developers to simultaneously change the same column, index, etc. It is much more common for developers to simultaneously work on disjoint migrations, though. In this case, Git does not register a merge conflict, but the developer still needs to resolve the migration conflict.
At a minimum, a migration index needs to be changed. This is a disadvantage to using migration indices. Why use them? Developers need to organize multi-step migrations and migration inverses. Multi-step migrations are not applied at the same time. (It they can be applied at the same time, then there is no need for a multi-step migration!) Migration inverses are created to handle upgrade problems, which may be completely unrelated to the database. They are usually not used, but it is very important to have them available. Migration indices provide an easy way to organize all migrations in the desired order.
Alice (3)
Alice is up first.
$ git checkout alice
$ source config/codd-alice.sh
Migration migrations/0004-add-employee-address.sql
adds
an address
column to the employee
table.
ALTER TABLE employee
ADD COLUMN address TEXT;
Alice adds this migration, links it, and sets it read-only.
$ ./bin/codd add migrations/0004-add-employee-address.sql
$ ln -s \
../var/codd-migrations/...-0004-add-employee-address.sql \
migrations/0004-add-employee-address.sql
$ chmod 0444 var/codd-migrations/*
The following commands create a commit (in the alice
branch) that includes this migration and the resulting snapshot.
$ git add migrations var
$ git commit -m 'Add migration 0004-add-employee-address'
Bob (3)
Bob is up next.
$ git checkout bob
$ source config/codd-bob.sh
Migration migrations/0004-add-employee-phone.sql
adds a
phone
column to the employee
table. Note that
it also uses migration index 4, as it is a migration from state 3 to
state 4.
ALTER TABLE employee
ADD COLUMN phone TEXT;
Bob adds this migration, links it, and sets it read-only.
$ ./bin/codd add migrations/0004-add-employee-phone.sql
$ ln -s \
../var/codd-migrations/...-0004-add-employee-phone.sql \
migrations/0004-add-employee-phone.sql
$ chmod 0444 var/codd-migrations/*
The following commands create a commit (in the bob
branch) that includes this migration and the resulting snapshot.
$ git add migrations var
$ git commit -m 'Add migration 0004-add-employee-phone'
Bob finishes development and merges to the main
branch.
$ git checkout main
$ git merge bob
Alice (4)
Alice now needs to merge those changes into the alice
branch.
$ git checkout alice
$ source config/codd-alice.sh
There are no conflicts, but just merging does not work. There are two issues:
- A migration index needs to be changed. Bob’s migration was merged first, so Alice’s migration index should be changed from 4 to 5.
- Both migrations add a column to the same table. Though the snapshots
could be merged by Git without conflict, there is a database conflict:
the new columns cannot both have the same column order number. Since
Alice’s migration is second, the
address
column order number must be changed.
Alice needs to do an interactive rebase to edit the commit.
$ git rebase -i main
What is the best way to handle the snapshot change? This change is so
simple that it could be done by hand, but that could be error-prone in
more complex migrations. Setting up the main
state and
adding the updated migration is safer.
Alice makes a copy of the address migration using the new migration index, and then removes the old migration.
$ cp \
var/codd-migrations/*-0004-add-employee-address.sql \
migrations/0005-add-employee-address.sql
$ git rm var/codd-migrations/*-0004-add-employee-address.sql
$ git rm migrations/0004-add-employee-address.sql
Alice uses git log
to find the commit hash of the
main
commit, which has the desired snapshot.
$ git log
The following commands restore the snapshot from that commit.
$ rm -rf var/codd-schema
$ git checkout ...hash... var/codd-schema
Alice’s development database needs to be reset as well.
$ docker stop pg_codd_alice | xargs docker rm
$ docker run \
--detach \
--name "pg_codd_alice" \
--publish "127.0.0.1:5432:5432" \
--env POSTGRES_DB="postgres" \
--env POSTGRES_USER="postgres" \
--env POSTGRES_PASSWORD="password" \
"postgres:11-alpine"
$ ./bin/codd up
Alice now adds the updated migration, links it, and sets it read-only.
$ ./bin/codd add migrations/0005-add-employee-address.sql
$ ln -s \
../var/codd-migrations/...-0005-add-employee-address.sql \
migrations/0005-add-employee-address.sql
$ chmod 0444 var/codd-migrations/*
Checking git status
, the address
column is
indeed updated. The git diff
command shows that the
order
value is incremented as expected. The updated files
can now be added to the Git index.
$ git add migrations var
Alice commits these changes, updating the commit message to use the new migration index, and finishes the rebase.
$ git commit --amend
$ git rebase --continue
Alice merges to the main
branch.
$ git checkout main
$ git merge alice
Bob (4)
In the merge conflict example, Bob needed to reset the
pg_codd_bob
database to use the updated migration. In this
case, the updated migration was never applied to that database, so doing
so is not necessary. Bob can use codd up
to synchronize
it.
$ git checkout bob
$ git rebase main
$ source config/codd-bob.sh
$ codd up
Cleanup
The following commands stop the containers, remove the test project, and exit the shell that includes Codd configuration.
$ docker stop pg_codd_alice pg_codd_bob | xargs docker rm
$ cd /tmp
$ rm -rf codd-conflict
$ exit
Conclusion
Resolving migration conflicts using Codd is a bit more complicated that I had anticipated, even with the trivial migrations that I used for testing. I am afraid that this will be error-prone, particularly for changes that Git does not register conflicts for.
My actual project uses Nix and Gargoyle, and the development databases are managed by the software that we are developing. This makes the above steps even more tricky than simply using Docker. My coworkers might not be very happy with this way of resolving migration conflicts.
Perhaps resolving migration conflicts could be simplified via tooling. Perhaps I should develop some scripts…