Haskell Monorepo GitHub Actions (Part 3)
I improved my work-in-progress GitHub Actions workflow
yesterday, and I realized that I should think through various caching
examples in detail. This blog entry documents these thoughts as well as
something I learned about the stack --snapshot
option.
Cabal
haskell-actions/setup Example
The model cabal workflow with caching example runs the following four steps before the project is built and tested.
- Configure the build generates a plan.json file that specifies the build plan, including the versions of all transitive dependencies.
- Restore cached dependencies loads an existing cache. It searches for key ${os}-ghc-${ghc-version}-cabal-${cabal-version}-plan-${hash}, where ${hash} is the hash of the plan.json file. If no exact match is found, it falls back to a cache whose key starts with ${os}-ghc-${ghc-version}-cabal-${cabal-version}.
- Install dependencies installs the project dependencies only if there was no exact match.
- Save cached dependencies saves a new cache with the above full key only if there was no exact match.
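The four steps above can be sketched as the following workflow fragment. This is a minimal sketch based on the description here, not a verbatim copy of the model workflow: the action version tags, the step ids (setup, cache), and the GHC version are assumptions.

```yaml
# Sketch of the four pre-build steps; version tags, ids, and the GHC
# version are assumptions made for illustration.
steps:
  - uses: actions/checkout@v4

  - name: Set up GHC
    uses: haskell-actions/setup@v2
    id: setup
    with:
      ghc-version: '9.6'   # hypothetical version

  - name: Configure the build
    run: |
      cabal configure --enable-tests --enable-benchmarks
      cabal build all --dry-run
    # The dry run writes dist-newstyle/cache/plan.json without building.

  - name: Restore cached dependencies
    uses: actions/cache/restore@v4
    id: cache
    env:
      key: ${{ runner.os }}-ghc-${{ steps.setup.outputs.ghc-version }}-cabal-${{ steps.setup.outputs.cabal-version }}
    with:
      path: ${{ steps.setup.outputs.cabal-store }}
      key: ${{ env.key }}-plan-${{ hashFiles('**/plan.json') }}
      restore-keys: ${{ env.key }}-

  - name: Install dependencies
    # Skipped on an exact hit, since the store is already up to date.
    if: steps.cache.outputs.cache-hit != 'true'
    run: cabal build all --only-dependencies

  - name: Save cached dependencies
    # Skipped on an exact hit, since the key already exists.
    uses: actions/cache/save@v4
    if: steps.cache.outputs.cache-hit != 'true'
    with:
      path: ${{ steps.setup.outputs.cabal-store }}
      key: ${{ steps.cache.outputs.cache-primary-key }}
```

The project build and test steps would follow the save step.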
A single path is stored in the cache: steps.setup.outputs.cabal-store,
which is ~/.cabal/store on Linux. This directory contains the build
artifacts of the dependencies, including documentation. It does not
include the build artifacts of boot libraries, which are already
provided by the GHC install. It does not include the build artifacts of
the project itself.
Since the cache is saved before the project is built and tested, repeated runs of the workflow use the cache even if project tests fail.
The hash in cache keys captures any change in the generated plan.json file. This includes:
- Changes of direct dependency versions caused by an update to a .cabal file or cabal*.project file
- Changes of transitive dependency versions caused by updates on Hackage that are within the configured dependency constraints, perhaps including Hackage revisions since plan.json includes hashes (not just versions)
Cache keys do not include a timestamp. When there is no exact match
and a cache for a different plan.json
is loaded, that cache
may include older versions of dependencies that are no longer needed.
The third step above installs the new dependencies, and the fourth step
above saves the new cache, which includes those older dependencies. The
size of the cache increases over time.
This example works fine when testing a “monorepo” that includes
multiple Haskell packages. The plan.json file is created using
cabal build all, so it includes the plan for all of the packages. No
changes are needed to test a single configuration.
actions/cache Example
The Cabal example uses a single actions/cache step instead of separate
actions/cache/restore and actions/cache/save steps. Placed before the
project build and test steps, it restores the cache at that point and
saves the cache at the end of the job.
Three paths are stored in the cache: ~/.cabal/packages, ~/.cabal/store,
and dist-newstyle.
The ~/.cabal/packages directory contains the source tarballs of
packages as well as the Hackage index. A test confirms that the source
tarball is not needed when the package build artifacts are in
~/.cabal/store, so it does not look like this needs to be cached. The
Hackage index is pretty large, and it does not seem appropriate to cache
it when the index is updated in the haskell-actions/setup step (or
manually). There is no point in updating it if it is then replaced by
old data from a cache!
The dist-newstyle
directory contains the build artifacts
for the project itself. During development, keeping this directory can
greatly speed up builds, since modules that have not been changed do not
need to be rebuilt. I wonder why the haskell-actions/setup
example does not cache this. Perhaps it is not cached to ensure that the
whole project is rebuilt during every test? Perhaps the large size
outweighs the benefits?
This example uses key ${os}-${ghc-version}-${hash}, where ${hash} is a
hash of all *.cabal, cabal.project, and cabal.project.freeze files. If
no exact match is found, it searches for key ${os}-${ghc-version}. The
hash captures any change to those files. This includes:
- Changes of direct dependency versions
- Changes of package modules
- Unrelated configuration changes such as a change of
description
- Comment and whitespace changes
It does not capture changes of transitive dependency versions caused by
updates on Hackage. The cached content is a function of the project
state as well as the state of Hackage, but this hash only takes the
project state into account. Note that this was not an issue with the
haskell-actions/setup example because the plan.json file takes both
into account.
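The single cache step described in this section might look roughly like the fragment below. It is a sketch written from the description above rather than a copy of the published example; the action version tag and the matrix.ghc variable are assumptions.

```yaml
# Sketch of a single restore-and-save cache step.
# matrix.ghc is a hypothetical matrix variable holding the GHC version.
- name: Cache Cabal directories
  uses: actions/cache@v4
  with:
    path: |
      ~/.cabal/packages
      ~/.cabal/store
      dist-newstyle
    # Only project files are hashed, so Hackage updates do not change the key.
    key: ${{ runner.os }}-${{ matrix.ghc }}-${{ hashFiles('*.cabal', 'cabal.project', 'cabal.project.freeze') }}
    restore-keys: ${{ runner.os }}-${{ matrix.ghc }}-
```

Because the key depends only on files in the repository, two runs on different days can produce the same key even when the solver would pick different transitive dependency versions.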
Cache keys do not include a timestamp, so the size of the cache increases over time.
This example works for a repository with a single Haskell package,
but it does not work for a “monorepo” that includes multiple Haskell
packages. Hashing all .cabal files (**/*.cabal) seems like an
appropriate change to handle this case. When testing multiple
configurations, the appropriate cabal*.project and
cabal*.project.freeze files need to be hashed.
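A monorepo variant of the key could be sketched as follows. The glob patterns come from the paragraph above; the action version tag and the matrix.ghc variable are assumptions, and this simplified version hashes every project file rather than only the one for the configuration under test.

```yaml
# Sketch of a monorepo-aware cache key.
# matrix.ghc is a hypothetical matrix variable holding the GHC version.
- name: Cache Cabal directories (monorepo)
  uses: actions/cache@v4
  with:
    path: |
      ~/.cabal/store
      dist-newstyle
    # '**/*.cabal' matches the .cabal file of every package in the repository.
    key: ${{ runner.os }}-${{ matrix.ghc }}-${{ hashFiles('**/*.cabal', 'cabal*.project', 'cabal*.project.freeze') }}
    restore-keys: ${{ runner.os }}-${{ matrix.ghc }}-
```

Hashing all project files is conservative: a change to one configuration invalidates the caches of the others. Hashing only the project file used by each matrix entry would avoid that at the cost of a slightly more involved key.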
Stack
actions/cache Example
The Stack example caches ~/.stack and .stack-work separately.
The ~/.stack directory contains the Hackage index as well as an SQLite
database of the index. In a test, the size of my ~/.stack after
performing stack update but before doing any builds is more than 1.8GB!
Updating this database is relatively time-consuming, however, so maybe
it is worthwhile to cache this. After building the project, the build
artifacts of the dependencies are stored in ~/.stack/snapshots,
including documentation. This directory does not include the build
artifacts of boot libraries, which are already provided by the GHC
install. It does not include the build artifacts of the project itself.
The ~/.stack cache uses key ${os}-stack-global-${hash1}-${hash2}, where
${hash1} is a hash of stack.yaml and ${hash2} is a hash of package.yaml.
If no exact match is found, it searches for key ${os}-stack-global-.
The hashes capture any change to those files. This includes:
- Changes of direct dependency versions
- Changes of transitive dependency versions, because these are
specified by the resolver configured in
stack.yaml
- Changes of package modules
- Unrelated configuration changes such as a change of
description
- Comment and whitespace changes
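The ~/.stack cache described above might be written roughly as follows. This is a sketch reconstructed from the key description; the action version tag is an assumption.

```yaml
# Sketch of the cache step for Stack's global directory.
- name: Cache ~/.stack
  uses: actions/cache@v4
  with:
    path: ~/.stack
    # Two separate hashes: one for stack.yaml and one for package.yaml.
    key: ${{ runner.os }}-stack-global-${{ hashFiles('stack.yaml') }}-${{ hashFiles('package.yaml') }}
    restore-keys: |
      ${{ runner.os }}-stack-global-
```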
The .stack-work directory contains the build artifacts for the project
itself. It is the equivalent of dist-newstyle when using Cabal. It is
interesting that the haskell-actions/setup Cabal example does not cache
this while the Stack example does. Is there a reason?
The .stack-work cache uses key
${os}-stack-work-${hash1}-${hash2}-${hash3}, where ${hash1} is the hash
of stack.yaml, ${hash2} is the hash of package.yaml, and ${hash3} is the
hash of all .hs source files. The hashes capture any change to those
files. This includes:
- Changes of direct dependency versions
- Changes of transitive dependency versions, because these are
specified by the resolver configured in
stack.yaml
- Changes of package modules
- Changes of .hs source code, including comment and whitespace changes
- Unrelated configuration changes such as a change of description
- Comment and whitespace changes
It does not capture changes of other types of files, which may be loaded into the source using Template Haskell.
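The cache for the project's own build artifacts might look roughly like the fragment below. It is a sketch from the key description, assuming the project-local .stack-work directory in the repository root; the action version tag and the restore-keys prefix are assumptions.

```yaml
# Sketch of the cache step for the project's own build artifacts.
- name: Cache .stack-work
  uses: actions/cache@v4
  with:
    path: .stack-work
    # The third hash covers every .hs file, so any source edit creates a new key.
    key: ${{ runner.os }}-stack-work-${{ hashFiles('stack.yaml') }}-${{ hashFiles('package.yaml') }}-${{ hashFiles('**/*.hs') }}
    # Assumed fallback prefix, so that a source edit can still reuse older artifacts.
    restore-keys: |
      ${{ runner.os }}-stack-work-
```

Files read at compile time through Template Haskell could be covered by adding their patterns to the last hashFiles call.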
Cache keys do not include a timestamp, so the size of the cache increases over time.
It is interesting that this example uses cache keys with separate hashes, while the Cabal example uses a single hash. Separate hashes allow developers to see which configuration differs when comparing cache keys, but I doubt that is very useful. A benefit of using a single hash is that cache keys are shorter. Is there a good reason to use separate hashes?
This example works for a repository with a single Haskell package,
but it does not work for a “monorepo” that includes multiple Haskell
packages. Hashing all package.yaml files (**/package.yaml) seems like an
appropriate change to handle this case. When testing multiple
configurations, the appropriate stack*.yaml files need to be hashed.
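A monorepo variant of the ~/.stack key could be sketched like this. It uses a single combined hash instead of separate ones; the action version tag is an assumption, and the simplified pattern hashes every stack*.yaml file rather than only the one for the configuration under test.

```yaml
# Sketch of a monorepo-aware key for Stack's global directory.
- name: Cache ~/.stack (monorepo)
  uses: actions/cache@v4
  with:
    path: ~/.stack
    # '**/package.yaml' matches the package.yaml of every package.
    key: ${{ runner.os }}-stack-global-${{ hashFiles('stack*.yaml', '**/package.yaml') }}
    restore-keys: |
      ${{ runner.os }}-stack-global-
```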
While Cabal creates a single dist-newstyle directory in the project
root, Stack creates a .stack-work directory in each package
subdirectory as well as in the project root. A .stack-work in a package
subdirectory contains the build artifacts (dist) for that package, while
the .stack-work in the project root contains the installed packages
(install). Perhaps it is worthwhile to cache these separately, and that
is indeed what Simon Michael does in the hledger workflow (source).
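One way to cache the two kinds of .stack-work directories separately is sketched below. This is not taken from the hledger workflow: the cache keys, the action version tag, and the assumption that every package lives one directory below the project root are all illustrative.

```yaml
# Sketch only: the keys and the one-level package layout are assumptions.
- name: Cache installed packages (project root)
  uses: actions/cache@v4
  with:
    path: .stack-work
    key: ${{ runner.os }}-stack-work-root-${{ hashFiles('stack*.yaml', '**/package.yaml') }}

- name: Cache per-package build artifacts
  uses: actions/cache@v4
  with:
    # Wildcard path: matches the .stack-work of each package subdirectory.
    path: '*/.stack-work'
    key: ${{ runner.os }}-stack-work-pkgs-${{ hashFiles('stack*.yaml', '**/package.yaml', '**/*.hs') }}
```

Only the per-package key includes the source hash, so a source edit leaves the root cache key unchanged.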
haskell-actions/setup Example
The haskell-actions/setup documentation does not yet provide a Stack
example with caching. There is a “Document how to cache stack” issue
about creating an example, and it links to a “Blessed recipe how to use
stack on github actions, in particular caching?” issue in the stack
repo.
One idea is to use stack ls dependencies
to get
dependency versions similar to the plan.json
file used in
the Cabal example. The text output shows packages with version numbers,
so it does not capture Hackage revisions. The JSON output includes
additional details, but it still does not capture Hackage revisions.
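The stack ls dependencies idea could be sketched as follows: write the dependency list to a file and hash that file for the cache key. The file name stack-deps.txt, the key prefix, and the action version tag are assumptions made for illustration.

```yaml
# Sketch: derive the cache key from the resolved dependency versions.
# stack-deps.txt is a hypothetical file name.
- name: Record dependency versions
  run: stack ls dependencies > stack-deps.txt

- name: Restore cached dependencies
  uses: actions/cache/restore@v4
  id: cache
  with:
    path: ~/.stack
    key: ${{ runner.os }}-stack-deps-${{ hashFiles('stack-deps.txt') }}
    restore-keys: |
      ${{ runner.os }}-stack-deps-
```

A drawback of this ordering is that Stack has to resolve the snapshot before the cache is restored, so some downloading may happen on every run.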
Another idea is to use stack*.yaml.lock files to get dependency
versions. These files include hashes, so perhaps they capture Hackage
revisions… It seems that stack*.yaml.lock files depend on what is
(already) in ~/.stack/snapshots, however, making them problematic for
this purpose.
Another idea is to cache parts of ~/.stack
separately.
This is not as big of an issue when preventing Stack from installing
GHC, however.
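Preventing Stack from installing GHC could be sketched as follows, assuming GHC is installed by the setup action instead. The action version tag and the GHC version are assumptions; the GHC version would have to match the one required by the configured snapshot.

```yaml
# Sketch: install GHC with the setup action and tell Stack to use it.
- name: Set up GHC and Stack
  uses: haskell-actions/setup@v2
  with:
    ghc-version: '9.6.6'   # hypothetical; must match the snapshot's GHC
    enable-stack: true

- name: Build and test
  # --system-ghc uses the GHC on PATH; --no-install-ghc forbids downloads.
  run: stack --system-ghc --no-install-ghc build --test
```

With this setup, Stack does not download its own copy of GHC into its root directory, which keeps a cache of ~/.stack smaller.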
There is some discussion on hashing *.cabal files and package.yaml
files. While it is not ideal because it captures unrelated changes
(including comment and whitespace changes), it at least captures more
changes than hashes of stack*.yaml.lock files.
There is some discussion of “manual overrides.” This is just a
revision string that is included in cache keys and can be bumped in
order to force creation of new caches. I prefer to use timestamps to
automate periodic creation of new caches, and one can use
gh cache delete --all
to remove problematic caches when
necessary.
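A timestamp in the cache key could be sketched as follows, here with a monthly granularity and the Cabal store as the cached path. The step ids (date, setup, cache), the output name month, the key layout, and the action version tag are assumptions; the fragment presumes an earlier haskell-actions/setup step with id setup and a generated plan.json.

```yaml
# Sketch: start a fresh cache lineage every month.
- name: Compute cache timestamp
  id: date
  shell: bash
  # Writes e.g. month=202501 to the step outputs.
  run: echo "month=$(date -u +%Y%m)" >> "$GITHUB_OUTPUT"

- name: Restore cached dependencies
  uses: actions/cache/restore@v4
  id: cache
  env:
    key: ${{ runner.os }}-ghc-${{ steps.setup.outputs.ghc-version }}-${{ steps.date.outputs.month }}
  with:
    path: ${{ steps.setup.outputs.cabal-store }}
    key: ${{ env.key }}-plan-${{ hashFiles('**/plan.json') }}
    # The fallback prefix includes the month, so old caches are never restored.
    restore-keys: ${{ env.key }}-
```

Since the fallback prefix contains the month, the first run of each month builds all dependencies from scratch and drops any stale ones accumulated in the previous lineage.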
--snapshot Option
I learned that the stack
--snapshot
option overrides the resolver configured in the
stack.yaml
file. This allows you to test many snapshots
without having to clutter your project with many
stack-*.yaml
files. Nice!
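In a workflow, this could be combined with a build matrix, roughly as sketched below. The snapshot names and the action version tag are assumptions, and the fragment presumes that Stack itself is installed by an earlier step and is recent enough to support the option.

```yaml
# Sketch of a job fragment that tests several snapshots with one stack.yaml.
strategy:
  fail-fast: false
  matrix:
    snapshot: [lts-21.25, lts-22.43]   # hypothetical snapshot names
steps:
  - uses: actions/checkout@v4
  # Assumes an earlier step has installed Stack.
  - name: Build and test
    run: stack build --test --snapshot ${{ matrix.snapshot }}
```

Each matrix entry overrides the resolver from stack.yaml on the command line, so no per-snapshot configuration file is needed as long as the snapshots do not require different extra-deps.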
I may be able to do this to clean up some of my projects, but many of
my projects require different configuration for different snapshots. For
example, some configuration requires extra-deps
that pin
specific versions of packages that are not in the snapshot. Also,
optparse-applicative
has different dependencies depending
on the version, and Cabal can handle this automatically while Stack
requires manual configuration.