Haskell Monorepo GitHub Actions (Part 3)

I improved my work-in-progress GitHub Actions workflow yesterday, and I realized that I should think through various caching examples in detail. This blog entry documents these thoughts as well as something I learned about the stack --snapshot option.

Cabal

haskell-actions/setup Example

The model cabal workflow with caching example runs the following four steps before the project is built and tested.

  1. “Configure the build” generates a plan.json file that specifies the build plan, including the versions of all transitive dependencies.
  2. “Restore cached dependencies” loads an existing cache. It searches for key ${os}-ghc-${ghc-version}-cabal-${cabal-version}-plan-${hash}, where ${hash} is the hash of the plan.json file. If no exact match is found, it searches for key ${os}-ghc-${ghc-version}-cabal-${cabal-version}.
  3. “Install dependencies” installs the project dependencies only if there was no exact match.
  4. “Save cached dependencies” saves a new cache with the above full key only if there was no exact match.
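
Here is a minimal sketch of those four steps as a workflow fragment. It is paraphrased rather than copied from the README, so treat the action version tags and the cabal configure flags as assumptions. The step with id setup is the haskell-actions/setup step that runs earlier in the job.

```yaml
- name: Configure the build
  run: |
    cabal configure --enable-tests --enable-benchmarks --disable-documentation
    # The dry run writes dist-newstyle/cache/plan.json without building anything.
    cabal build all --dry-run

- name: Restore cached dependencies
  uses: actions/cache/restore@v4
  id: cache
  env:
    key: ${{ runner.os }}-ghc-${{ steps.setup.outputs.ghc-version }}-cabal-${{ steps.setup.outputs.cabal-version }}
  with:
    path: ${{ steps.setup.outputs.cabal-store }}
    key: ${{ env.key }}-plan-${{ hashFiles('**/plan.json') }}
    restore-keys: ${{ env.key }}-

- name: Install dependencies
  # Skipped on an exact cache hit, since the store is already up to date.
  if: steps.cache.outputs.cache-hit != 'true'
  run: cabal build all --only-dependencies

- name: Save cached dependencies
  uses: actions/cache/save@v4
  # Skipped on an exact cache hit, since that key already exists.
  if: steps.cache.outputs.cache-hit != 'true'
  with:
    path: ${{ steps.setup.outputs.cabal-store }}
    key: ${{ steps.cache.outputs.cache-primary-key }}
```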

A single path is stored in the cache: steps.setup.outputs.cabal-store, which is ~/.cabal/store on Linux. This directory contains the build artifacts of the dependencies, including documentation. It does not include the build artifacts of boot libraries, which are already provided by the GHC install, or the build artifacts of the project itself.

Since the cache is saved before the project is built and tested, repeated runs of the workflow use the cache even if project tests fail.

The hash in cache keys captures any change in the generated plan.json file. This includes:

  • Changes of direct dependency versions caused by an update to a .cabal file or cabal*.project file
  • Changes of transitive dependency versions caused by updates on Hackage that are within the configured dependency constraints, perhaps including Hackage revisions since plan.json includes hashes (not just versions)

Cache keys do not include a timestamp. When there is no exact match and a cache for a different plan.json is loaded, that cache may include older versions of dependencies that are no longer needed. The third step above installs the new dependencies, and the fourth step above saves the new cache, which includes those older dependencies. The size of the cache increases over time.

This example works fine when testing a “monorepo” that includes multiple Haskell packages. The plan.json file is created using cabal build all, so it includes the plan for all of the packages. No changes are needed to test a single configuration.

actions/cache Example

The Cabal example uses a single actions/cache step instead of separate actions/cache/restore and actions/cache/save steps. Placed before the project build and test steps, it restores the cache at that point and saves the cache at the end of the job.

Three paths are stored in the cache: ~/.cabal/packages, ~/.cabal/store, and dist-newstyle.

The ~/.cabal/packages directory contains the source tarballs of packages as well as the Hackage index. A test confirms that the source tarball is not needed when the package build artifacts are in ~/.cabal/store, so it does not look like this needs to be cached. The Hackage index is pretty large, and it does not seem appropriate to cache it when the index is already updated in the haskell-actions/setup step (or manually). There is no point in updating it if it is then replaced by old data from a cache!

The dist-newstyle directory contains the build artifacts for the project itself. During development, keeping this directory can greatly speed up builds, since modules that have not been changed do not need to be rebuilt. I wonder why the haskell-actions/setup example does not cache this. Perhaps it is not cached to ensure that the whole project is rebuilt during every test? Perhaps the large size mitigates the benefits?

This example uses key ${os}-${ghc-version}-${hash}, where ${hash} is a hash of all *.cabal, cabal.project, and cabal.project.freeze files. If no exact matches are found, it searches for key ${os}-${ghc-version}. The hash captures any change to those files. This includes:

  • Changes of direct dependency versions
  • Changes of package modules
  • Unrelated configuration changes such as a change of description
  • Comment and whitespace changes

It does not capture changes of transitive dependency versions caused by updates on Hackage. The cached content is a function of the project state as well as the state of Hackage, but this hash only takes the project state into account. Note that this was not an issue with the haskell-actions/setup example because the plan.json file takes both into account.

Cache keys do not include a timestamp, so the size of the cache increases over time.

This example works for a repository with a single Haskell package, but it does not work for a “monorepo” that includes multiple Haskell packages. Hashing all .cabal files (**/*.cabal) seems like an appropriate change to handle this case. When testing multiple configurations, the appropriate cabal*.project and cabal*.project.freeze files need to be hashed.
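
A minimal sketch of the step with that monorepo change applied might look like the following. The matrix.ghc variable and the @v4 version tag are assumptions, and the cabal.project file names would need to match whichever configuration is being tested.

```yaml
- name: Cache Cabal directories
  uses: actions/cache@v4
  with:
    path: |
      ~/.cabal/packages
      ~/.cabal/store
      dist-newstyle
    # matrix.ghc is an assumed matrix variable that holds the GHC version.
    # The **/*.cabal pattern picks up the .cabal file of every package.
    key: ${{ runner.os }}-${{ matrix.ghc }}-${{ hashFiles('**/*.cabal', 'cabal.project', 'cabal.project.freeze') }}
    restore-keys: ${{ runner.os }}-${{ matrix.ghc }}-
```

Patterns passed to hashFiles that match no file are ignored, so a missing cabal.project.freeze does not break the key.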

Stack

actions/cache Example

The Stack example caches ~/.stack and .stack-work separately.

The ~/.stack directory contains the Hackage index as well as an SQLite database of the index. In a test, the size of my ~/.stack after performing stack update but before doing any builds is more than 1.8GB! Updating this database is relatively time-consuming, however, so maybe it is worthwhile to cache this. After building the project, the build artifacts of the dependencies, including documentation, are stored in ~/.stack/snapshots. This does not include the build artifacts of boot libraries, which are already provided by the GHC install, or the build artifacts of the project itself.

The ~/.stack cache uses key ${os}-stack-global-${hash1}-${hash2}, where ${hash1} is a hash of stack.yaml and ${hash2} is a hash of package.yaml. If no exact matches are found, it searches for key ${os}-stack-global-. The hash captures any change to those files. This includes:

  • Changes of direct dependency versions
  • Changes of transitive dependency versions, because these are specified by the resolver configured in stack.yaml
  • Changes of package modules
  • Unrelated configuration changes such as a change of description
  • Comment and whitespace changes
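
As a sketch, the ~/.stack cache step described above corresponds to something like the following fragment. The step name and the @v4 version tag are assumptions.

```yaml
- name: Cache ~/.stack
  uses: actions/cache@v4
  with:
    path: ~/.stack
    # One hash per configuration file, joined into a single key.
    key: ${{ runner.os }}-stack-global-${{ hashFiles('stack.yaml') }}-${{ hashFiles('package.yaml') }}
    restore-keys: |
      ${{ runner.os }}-stack-global-
```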

The .stack-work directory contains the build artifacts for the project itself. It is the equivalent of dist-newstyle when using Cabal. It is interesting that the haskell-actions/setup Cabal example does not cache the equivalent directory while the Stack example does. Is there a reason?

The .stack-work cache uses key ${os}-stack-work-${hash1}-${hash2}-${hash3}, where ${hash1} is the hash of stack.yaml, ${hash2} is the hash of package.yaml, and ${hash3} is the hash of all .hs source files. The hash captures any change to those files. This includes:

  • Changes of direct dependency versions
  • Changes of transitive dependency versions, because these are specified by the resolver configured in stack.yaml
  • Changes of package modules
  • Changes of .hs source code, including comment and whitespace changes
  • Unrelated configuration changes such as a change of description
  • Comment and whitespace changes in the configuration files

It does not capture changes of other types of files, which may be loaded into the source using Template Haskell.

Cache keys do not include a timestamp, so the size of the cache increases over time.

It is interesting that this example uses cache keys with separate hashes, while the Cabal example uses a single hash. Separate hashes allow developers to see which configuration differs when comparing cache keys, but I doubt that is very useful. A benefit of using a single hash is that cache keys are shorter. Is there a good reason to use separate hashes?

This example works for a repository with a single Haskell package, but it does not work for a “monorepo” that includes multiple Haskell packages. Hashing all package.yaml files (**/package.yaml) seems like an appropriate change to handle this case. When testing multiple configurations, the appropriate stack*.yaml files need to be hashed.

While Cabal creates a single dist-newstyle directory in the project root, Stack creates a .stack-work in each package subdirectory as well as in the project root. A .stack-work in a package subdirectory contains the build artifacts (dist) for that package, while the .stack-work in the project root contains the installed packages (install). Perhaps it is worthwhile to cache these separately, and that is indeed what Simon Michael does in the hledger workflow (source).
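
To make the monorepo case concrete, here is a sketch that caches the root and per-package .stack-work directories in a single step and hashes every package.yaml file. It is not the hledger configuration, which caches the directories separately. The package-a and package-b directory names are hypothetical, and the @v4 version tag is an assumption.

```yaml
- name: Cache .stack-work directories
  uses: actions/cache@v4
  with:
    # package-a and package-b are hypothetical package subdirectories.
    path: |
      .stack-work
      package-a/.stack-work
      package-b/.stack-work
    # **/package.yaml covers the package.yaml of every package in the repository.
    key: ${{ runner.os }}-stack-work-${{ hashFiles('stack.yaml') }}-${{ hashFiles('**/package.yaml') }}-${{ hashFiles('**/*.hs') }}
    restore-keys: |
      ${{ runner.os }}-stack-work-
```

Splitting this into one step per directory would allow each cache to use a narrower key, at the cost of a longer workflow.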

haskell-actions/setup Example

The haskell-actions/setup documentation does not yet provide a Stack example with caching. There is a “Document how to cache stack” issue about creating an example, and it links to a “Blessed recipe how to use stack on github actions, in particular caching?” issue in the stack repo.

One idea is to use stack ls dependencies to get dependency versions similar to the plan.json file used in the Cabal example. The text output shows packages with version numbers, so it does not capture Hackage revisions. The JSON output includes additional details, but it still does not capture Hackage revisions.
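
An untested sketch of this idea follows. The stack-deps.json file name, the stack-deps key prefix, and the @v4 version tag are all assumptions, and I am assuming that the JSON output is stable between runs when nothing has changed.

```yaml
- name: List dependency versions
  # Writes the resolved dependency list to a file that can be hashed.
  run: stack ls dependencies json > stack-deps.json

- name: Restore cached dependencies
  uses: actions/cache/restore@v4
  with:
    path: ~/.stack
    # The hash changes when a dependency version changes, but not on a Hackage revision.
    key: ${{ runner.os }}-stack-deps-${{ hashFiles('stack-deps.json') }}
    restore-keys: |
      ${{ runner.os }}-stack-deps-
```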

Another idea is to use stack*.yaml.lock files to get dependency versions. These files include hashes, so perhaps they capture Hackage revisions… It seems that stack*.yaml.lock files depend on what is (already) in ~/.stack/snapshots, however, making them problematic for this purpose.

Another idea is to cache parts of ~/.stack separately. The size of the directory is not as big of an issue when Stack is prevented from installing GHC, however.

There is some discussion on hashing *.cabal files and package.yaml files. While it is not ideal because it captures unrelated changes (including comment and whitespace changes), it at least captures more changes than hashes of stack*.yaml.lock files.

There is some discussion of “manual overrides.” This is just a revision string that is included in cache keys and can be bumped in order to force creation of new caches. I prefer to use timestamps to automate periodic creation of new caches, and one can use gh cache delete --all to remove problematic caches when necessary.
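
A minimal sketch of the timestamp approach, assuming a monthly period is acceptable: one step computes the current month and the cache key includes it, so a fresh cache is created at the start of each month. The step id period, the output name month, and the @v4 version tag are assumptions.

```yaml
- name: Compute cache period
  id: period
  shell: bash
  # The value changes once a month, which forces a new cache each month.
  run: echo "month=$(date -u +%Y-%m)" >> "$GITHUB_OUTPUT"

- name: Cache ~/.stack
  uses: actions/cache@v4
  with:
    path: ~/.stack
    key: ${{ runner.os }}-stack-global-${{ steps.period.outputs.month }}-${{ hashFiles('stack.yaml') }}
    # The fallback also includes the month, so caches from older months are never restored.
    restore-keys: |
      ${{ runner.os }}-stack-global-${{ steps.period.outputs.month }}-
```

Because the fallback key includes the month, the first run of each month starts from an empty cache, which is what keeps the cache size from growing indefinitely.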

--snapshot Option

I learned that the stack --snapshot option overrides the resolver configured in the stack.yaml file. This allows you to test many snapshots without having to clutter your project with many stack-*.yaml files. Nice!
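
In a workflow, this could look something like the following job sketch. The snapshot names in the matrix are only placeholders, and the steps that install GHC and Stack are left out because they depend on the project.

```yaml
jobs:
  stack:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        # Placeholder list; substitute the snapshots that the project supports.
        snapshot:
          - lts-21.25
          - lts-22.43
    steps:
      - uses: actions/checkout@v4
      # GHC and Stack setup steps go here.
      - name: Build and test
        # Overrides the snapshot configured in stack.yaml for this run only.
        run: stack --snapshot ${{ matrix.snapshot }} test
```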

I may be able to do this to clean up some of my projects, but many of my projects require different configuration for different snapshots. For example, some configurations require extra-deps that pin specific versions of packages that are not in the snapshot. Also, optparse-applicative has different dependencies depending on the version, and Cabal can handle this automatically while Stack requires manual configuration.