Haskell Monorepo GitHub Actions (Part 3)
I improved my work-in-progress GitHub Actions workflow
yesterday, and I realized that I should think through various caching
examples in detail. This blog entry documents these thoughts as well as
something I learned about the stack --snapshot option.
Cabal
haskell-actions/setup Example
The model cabal workflow with caching example runs the following four steps before the project is built and tested.
- Configure the build generates a plan.json file that specifies the build plan, including the versions of all transitive dependencies.
- Restore cached dependencies loads an existing cache. It searches for key ${os}-ghc-${ghc-version}-cabal-${cabal-version}-plan-${hash}, where ${hash} is the hash of the plan.json file. If no exact match is found, it falls back to a cache whose key starts with the prefix ${os}-ghc-${ghc-version}-cabal-${cabal-version}.
- Install dependencies installs the project dependencies only if there was no exact match.
- Save cached dependencies saves a new cache with the above full key only if there was no exact match.
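The four steps above can be sketched as the following job steps. This is my own reconstruction from the description, not a verbatim copy of the documented workflow: the action version tags (@v2, @v4), the pinned GHC version, and the cabal configure flags are assumptions.

```yaml
# Sketch only: version tags, the GHC version, and configure flags are assumptions.
steps:
  - uses: actions/checkout@v4

  - name: Set up GHC
    id: setup
    uses: haskell-actions/setup@v2
    with:
      ghc-version: '9.6'   # illustrative version

  - name: Configure the build
    run: |
      cabal configure --enable-tests
      # A dry run writes dist-newstyle/cache/plan.json without building anything.
      cabal build all --dry-run

  - name: Restore cached dependencies
    id: cache
    uses: actions/cache/restore@v4
    env:
      key: ${{ runner.os }}-ghc-${{ steps.setup.outputs.ghc-version }}-cabal-${{ steps.setup.outputs.cabal-version }}
    with:
      path: ${{ steps.setup.outputs.cabal-store }}
      key: ${{ env.key }}-plan-${{ hashFiles('**/plan.json') }}
      # Prefix fallback used when there is no exact match.
      restore-keys: ${{ env.key }}-

  - name: Install dependencies
    # Skipped on an exact cache hit.
    if: steps.cache.outputs.cache-hit != 'true'
    run: cabal build all --only-dependencies

  - name: Save cached dependencies
    # Skipped on an exact cache hit, since that key already exists.
    if: steps.cache.outputs.cache-hit != 'true'
    uses: actions/cache/save@v4
    with:
      path: ${{ steps.setup.outputs.cabal-store }}
      key: ${{ steps.cache.outputs.cache-primary-key }}
```

The project build and test steps would follow the save step, which is why a failing test does not prevent the dependency cache from being saved.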
A single path is stored in the cache:
steps.setup.outputs.cabal-store, which is
~/.cabal/store on Linux. This directory contains the build
artifacts of the dependencies, including documentation. It does not
include the build artifacts of boot libraries, which are already
provided by the GHC install. It does not include the build artifacts of
the project itself.
Since the cache is saved before the project is built and tested, repeated runs of the workflow use the cache even if project tests fail.
The hash in cache keys captures any change in the generated
plan.json file. This includes:
- Changes of direct dependency versions caused by an update to a .cabal file or cabal*.project file
- Changes of transitive dependency versions caused by updates on Hackage that are within the configured dependency constraints, perhaps including Hackage revisions since plan.json includes hashes (not just versions)
Cache keys do not include a timestamp. When there is no exact match
and a cache for a different plan.json is loaded, that cache
may include older versions of dependencies that are no longer needed.
The third step above installs the new dependencies, and the fourth step
above saves the new cache, which includes those older dependencies. The
size of the cache increases over time.
This example works fine when testing a “monorepo” that includes
multiple Haskell packages. The plan.json file is created
using cabal build all, so it includes the plan for all of
the packages. No changes are needed to test a single configuration.
actions/cache Example
The Cabal
example uses a single actions/cache step instead of
separate actions/cache/restore and
actions/cache/save steps. When run before the project build
and test steps, it restores the cache at that point and saves the cache at
the end of the job.
Three paths are stored in the cache: ~/.cabal/packages,
~/.cabal/store, and dist-newstyle.
The ~/.cabal/packages directory contains the source
tarballs of packages as well as the Hackage index. A test confirms that
the source tarball is not needed when the package build artifacts are in
~/.cabal/store, so it does not look like this needs to be
cached. The Hackage index is large, and it does not seem
appropriate to cache it when the index is updated in the
haskell-actions/setup step (or manually). There is no point
in updating the index if it is then replaced by old data from a cache!
The dist-newstyle directory contains the build artifacts
for the project itself. During development, keeping this directory can
greatly speed up builds, since modules that have not been changed do not
need to be rebuilt. I wonder why the haskell-actions/setup
example does not cache this. Perhaps it is not cached to ensure that the
whole project is rebuilt during every test? Perhaps the large size
mitigates the benefits?
This example uses key ${os}-${ghc-version}-${hash},
where ${hash} is a hash of all *.cabal,
cabal.project, and cabal.project.freeze files.
If no exact matches are found, it searches for key
${os}-${ghc-version}. The hash captures any change
to those files. This includes:
- Changes of direct dependency versions
- Changes of package modules
- Unrelated configuration changes such as a change of description
- Comment and whitespace changes
It does not capture changes of transitive dependency
versions caused by updates on Hackage. The cached content is a function
of the project state as well as the state of Hackage, but this hash only
takes the project state into account. Note that this was not an issue
with the haskell-actions/setup example because the
plan.json file takes both into account.
Cache keys do not include a timestamp, so the size of the cache increases over time.
This example works for a repository with a single Haskell package,
but it does not work for a “monorepo” that includes multiple Haskell
packages. Hashing all .cabal files
(**/*.cabal) seems like an appropriate change to handle
this case. When testing multiple configurations, the appropriate
cabal*.project and cabal*.project.freeze files
need to be hashed.
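A minimal sketch of the adjusted cache step for a monorepo follows. It is not the actions/cache example itself: the matrix.ghc variable and the action version tag are assumptions, and ~/.cabal/packages is left out following the reasoning above.

```yaml
# Sketch only: matrix.ghc and the action version tag are assumptions.
- name: Cache Cabal store and project build artifacts
  uses: actions/cache@v4
  with:
    # ~/.cabal/packages is deliberately not cached here.
    path: |
      ~/.cabal/store
      dist-newstyle
    # One hashFiles call with several patterns yields a single combined hash.
    # '**/*.cabal' picks up the .cabal file of every package in the repository.
    key: ${{ runner.os }}-${{ matrix.ghc }}-${{ hashFiles('**/*.cabal', 'cabal*.project', 'cabal*.project.freeze') }}
    restore-keys: ${{ runner.os }}-${{ matrix.ghc }}-
```

The glob patterns hash every project file at once. When each configuration has its own project file, the patterns can be replaced with the specific file names for that configuration so that a change to one configuration does not invalidate the caches of the others.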
Stack
actions/cache Example
The Stack
example caches ~/.stack and .stack-work
separately.
The ~/.stack directory contains the Hackage index as
well as an SQLite database of the
index. In a test, the size of my ~/.stack after performing
stack update but before doing any builds is more than
1.8GB! Updating this database is relatively time-consuming, however, so
maybe it is worthwhile to cache this. After building the project, the
build artifacts of the dependencies, including documentation, are stored in
~/.stack/snapshots. This directory does not
include the build artifacts of boot libraries, which are already
provided by the GHC install. It does not include the build artifacts of
the project itself.
The ~/.stack cache uses key
${os}-stack-global-${hash1}-${hash2}, where
${hash1} is a hash of stack.yaml and
${hash2} is a hash of package.yaml. If no exact
matches are found, it searches for key ${os}-stack-global-.
The hash captures any change to those files. This includes:
- Changes of direct dependency versions
- Changes of transitive dependency versions, because these are specified by the resolver configured in stack.yaml
- Changes of package modules
- Unrelated configuration changes such as a change of description
- Comment and whitespace changes
The .stack-work directory contains the build artifacts
for the project itself. It is the equivalent of
dist-newstyle when using Cabal. It is interesting that the
haskell-actions/setup Cabal example does not cache the project build
artifacts while the Stack example does. Is there a reason?
The .stack-work cache uses key
${os}-stack-work-${hash1}-${hash2}-${hash3}, where
${hash1} is the hash of stack.yaml,
${hash2} is the hash of package.yaml, and
${hash3} is the hash of all .hs source files.
The hash captures any change to those files. This includes:
- Changes of direct dependency versions
- Changes of transitive dependency versions, because these are specified by the resolver configured in stack.yaml
- Changes of package modules
- Changes of .hs source code, including comment and whitespace changes
- Unrelated configuration changes such as a change of description
- Comment and whitespace changes
It does not capture changes of other types of files, which may be loaded into the source using Template Haskell.
Cache keys do not include a timestamp, so the size of the cache increases over time.
It is interesting that this example uses cache keys with separate hashes, while the Cabal example uses a single hash. Separate hashes allow developers to see which configuration differs when comparing cache keys, but I doubt that is very useful. A benefit of using a single hash is that cache keys are shorter. Is there a good reason to use separate hashes?
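The two Stack cache steps described above can be sketched as follows. This is a reconstruction from the key descriptions, so the action version tag and step names are assumptions, and the second step caches the .stack-work directory in the project root.

```yaml
# Sketch only: the action version tag and step names are assumptions.
- name: Cache ~/.stack
  uses: actions/cache@v4
  with:
    path: ~/.stack
    # Separate hashes: one per configuration file.
    key: ${{ runner.os }}-stack-global-${{ hashFiles('stack.yaml') }}-${{ hashFiles('package.yaml') }}
    restore-keys: ${{ runner.os }}-stack-global-

- name: Cache .stack-work
  uses: actions/cache@v4
  with:
    path: .stack-work
    key: ${{ runner.os }}-stack-work-${{ hashFiles('stack.yaml') }}-${{ hashFiles('package.yaml') }}-${{ hashFiles('**/*.hs') }}
    restore-keys: ${{ runner.os }}-stack-work-
```

For comparison, a single-hash variant of the first key would pass both files to one call, as in hashFiles('stack.yaml', 'package.yaml'), which gives a shorter key at the cost of not showing which file changed.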
This example works for a repository with a single Haskell package,
but it does not work for a “monorepo” that includes multiple Haskell
packages. Hashing all package.yaml files
(**/package.yaml) seems like an appropriate change to
handle this case. When testing multiple configurations, the appropriate
stack*.yaml files need to be hashed.
While Cabal creates a single dist-newstyle directory in
the project root, Stack creates a .stack-work directory in each
package subdirectory as well as in the project root. A
.stack-work in a package subdirectory contains the build
artifacts (dist) for that package, while the
.stack-work in the project root contains the installed
packages (install). Perhaps it is worthwhile to cache these
separately, and that is indeed what Simon
Michael does in the hledger
workflow (source).
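A minimal sketch of caching the two kinds of .stack-work directories separately is shown below. It is not taken from the hledger workflow: the key names are hypothetical, the action version tag is an assumption, and it assumes that every package lives in a direct subdirectory of the project root.

```yaml
# Sketch only: key names are hypothetical; packages are assumed to be
# direct subdirectories of the project root.
- name: Cache project-root .stack-work (installed packages)
  uses: actions/cache@v4
  with:
    path: .stack-work
    key: ${{ runner.os }}-stack-work-root-${{ hashFiles('stack.yaml', '**/package.yaml') }}
    restore-keys: ${{ runner.os }}-stack-work-root-

- name: Cache per-package .stack-work (build artifacts)
  uses: actions/cache@v4
  with:
    # Quoted because a YAML scalar cannot start with a bare asterisk.
    path: '*/.stack-work'
    key: ${{ runner.os }}-stack-work-pkgs-${{ hashFiles('stack.yaml', '**/package.yaml', '**/*.hs') }}
    restore-keys: ${{ runner.os }}-stack-work-pkgs-
```

Only the per-package key includes the hash of the .hs source files, so a source change creates a new build-artifact cache without creating a new cache for the project-root directory.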
haskell-actions/setup Example
The haskell-actions/setup
documentation does not yet provide a Stack example with caching. There
is a “Document how to cache stack” issue about creating an example, and
it links to a “Blessed recipe how to use stack on github actions, in
particular caching?” issue in the stack repo.
One idea is to use stack ls dependencies to get
dependency versions similar to the plan.json file used in
the Cabal example. The text output shows packages with version numbers,
so it does not capture Hackage revisions. The JSON output includes
additional details, but it still does not capture Hackage revisions.
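A sketch of this idea, under stated assumptions, is shown below. The stack-deps.json file name and the key prefix are hypothetical, and I have not verified that the JSON output is stable enough between runs to serve as a cache key. Note also that running Stack before restoring the cache may already populate ~/.stack.

```yaml
# Sketch only: stack-deps.json and the key prefix are hypothetical names.
- name: List dependencies
  run: stack ls dependencies json > stack-deps.json

- name: Restore cached dependencies
  id: cache
  uses: actions/cache/restore@v4
  with:
    path: ~/.stack
    # Hash of the dependency listing, analogous to hashing plan.json with Cabal.
    key: ${{ runner.os }}-stack-deps-${{ hashFiles('stack-deps.json') }}
    restore-keys: ${{ runner.os }}-stack-deps-
```

As noted above, this key changes when a dependency version changes but not when only a Hackage revision changes.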
Another idea is to use stack*.yaml.lock files to get
dependency versions. These files include hashes, so perhaps
they capture Hackage revisions… It seems that
stack*.yaml.lock files depend on what is (already) in
~/.stack/snapshots, however, making them problematic for
this purpose.
Another idea is to cache parts of ~/.stack separately.
This is less of an issue when Stack is prevented from installing
GHC, however.
There is some discussion on hashing *.cabal files and
package.yaml files. While it is not ideal because it
captures unrelated changes (including comment and whitespace changes),
it at least captures more changes than hashes of
stack*.yaml.lock files.
There is some discussion of “manual overrides.” This is just a
revision string that is included in cache keys and can be bumped in
order to force creation of new caches. I prefer to use timestamps to
automate periodic creation of new caches, and one can use
gh cache delete --all to remove problematic caches when
necessary.
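A minimal sketch of a timestamp in a cache key follows. The step id, the output name, and the monthly granularity are assumptions chosen for illustration, and the run step assumes a bash shell.

```yaml
# Sketch only: the step id, output name, and monthly granularity are assumptions.
- name: Compute cache timestamp
  id: stamp
  run: echo "month=$(date -u +%Y-%m)" >> "$GITHUB_OUTPUT"

- name: Cache ~/.stack
  uses: actions/cache@v4
  with:
    path: ~/.stack
    key: ${{ runner.os }}-stack-global-${{ steps.stamp.outputs.month }}-${{ hashFiles('stack.yaml', '**/package.yaml') }}
    # The timestamp is part of the fallback prefix, so the first run of a
    # new month finds no match and starts from an empty cache.
    restore-keys: ${{ runner.os }}-stack-global-${{ steps.stamp.outputs.month }}-
```

Because old caches are never restored after the month changes, dependencies that are no longer needed are dropped periodically instead of accumulating.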
--snapshot Option
I learned that the stack --snapshot
option overrides the resolver configured in the
stack.yaml file. This allows you to test many snapshots
without having to clutter your project with many
stack-*.yaml files. Nice!
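A sketch of a workflow that tests several snapshots with a single stack.yaml file is shown below. The workflow name, the snapshot names, and the action version tags are illustrative assumptions, and the --snapshot option requires a sufficiently recent Stack release.

```yaml
# Sketch only: the snapshot names and action version tags are illustrative.
name: test-snapshots
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        snapshot: [lts-21.25, lts-22.43]
    steps:
      - uses: actions/checkout@v4
      - uses: haskell-actions/setup@v2
        with:
          enable-stack: true
          # Let Stack install the GHC version that each snapshot requires.
          stack-no-global: true
      - name: Build and test against the snapshot
        # Overrides the resolver configured in stack.yaml.
        run: stack --snapshot ${{ matrix.snapshot }} test
```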
I may be able to do this to clean up some of my projects, but many of
my projects require different configuration for different snapshots. For
example, some configuration requires extra-deps that pin
specific versions of packages that are not in the snapshot. Also,
optparse-applicative has different dependencies depending
on the version, and Cabal can handle this automatically while Stack
requires manual configuration.