Hackage Metadata (Part 2)
I wrote about Hackage Metadata earlier this year, summarizing my understanding of the metadata maintained by the Hackage package repository about Haskell packages. I would like to be able to process this information programmatically, and this blog entry logs my progress toward doing so.
What information do I need?
- Package versions - I need to know the released versions of each package that is available on Hackage.
- Package revisions - I need to know the available revisions of each version of each package. Since revisions are identified using sequential numbers, knowing the current revision for each package version is sufficient.
- Preferred and deprecated versions - I need to know the preferred/deprecated version ranges for each package.
- Package deprecation - I would like to know which packages have been deprecated, though this is not absolutely essential.
- Candidate packages - I would like to know which candidate packages are available, though it is not essential, and I am not sure if I will even make use of this information yet.
I would really like to have timestamps for all of the above information, though this is not absolutely essential.
Hackage Index Tarball
I investigated the content of a 01-index.tar Hackage
index tarball. This index contains the .cabal files of
released packages, organized by package name and version, as well as
some metadata (hashes, file sizes, etc.) about package files. When a
package has one or more revisions, the modified .cabal file
for the latest revision is provided, and the revision number is
specified in a custom field named x-revision.
$ grep revision bm/0.1.0.2/bm.cabal
x-revision: 2
Information about preferred/deprecated versions is included in a
preferred-versions file when applicable.
$ cat mtl/preferred-versions ; echo
mtl <2.1 || >2.1 && <2.3 || >2.3
To investigate deprecated packages, I looked at the highlighting-kate
package. None of its versions are deprecated, but the whole package is
deprecated in favor of the skylighting
package. I was unable to find any information in the index regarding
this deprecation, unfortunately. The index does not include information
about candidate packages either, as expected.
The Hackage index contains the information that I require. It does not contain package deprecation information or candidate package information, but perhaps I could get that information from another source.
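A preferred-versions line has the same shape as a dependency: a package name followed by a version range. The following is a minimal sketch of how such a line might be parsed and checked using the Cabal library, assuming a Cabal 3.x release where `simpleParsec` and `depVerRange` are available. The names `parsePreferred`, `preferredPackage`, and `isPreferred` are hypothetical helpers and do not come from any of the packages discussed here.

```haskell
import Distribution.Parsec (simpleParsec)
import Distribution.Types.Dependency (Dependency, depPkgName, depVerRange)
import Distribution.Types.PackageName (unPackageName)
import Distribution.Version (Version, mkVersion, withinRange)

-- | Parse one line of a preferred-versions file (hypothetical helper).
-- The line is treated as a Cabal dependency: name plus version range.
parsePreferred :: String -> Maybe Dependency
parsePreferred = simpleParsec

-- | Name of the package that the line applies to.
preferredPackage :: Dependency -> String
preferredPackage = unPackageName . depPkgName

-- | Check whether a version falls inside the preferred range.
isPreferred :: Dependency -> Version -> Bool
isPreferred dep version = withinRange version (depVerRange dep)

-- | The mtl line shown above, used as sample input.
mtlLine :: String
mtlLine = "mtl <2.1 || >2.1 && <2.3 || >2.3"
```

With the mtl line above, a version such as `mkVersion [2, 2, 2]` falls inside the range, while `mkVersion [2, 1]` and `mkVersion [2, 3]` are excluded by it.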
hackage-db
The first package that I looked at is hackage-db,
a library that makes it very easy to load a Hackage index tarball and
query the data using a Map
interface. I confirmed that I can query the package revision using the
following test code. The full
code is available on GitHub.
-- Imports assumed by this excerpt (module names from the containers,
-- Cabal, and hackage-db packages)
import qualified Data.List as List
import qualified Data.Map.Strict as Map
import Text.Read (readMaybe)
import Distribution.Hackage.DB (HackageDB)
import qualified Distribution.Hackage.DB as DB
import qualified Distribution.Types.GenericPackageDescription as GPD
import qualified Distribution.Types.PackageDescription as PD
import Distribution.Types.PackageName (PackageName)
import Distribution.Types.Version (Version)
newtype Revision = Revision Int
  deriving (Eq, Ord, Show)
defaultRevision :: Revision
defaultRevision = Revision 0
lookupRevision
  :: HackageDB
  -> PackageName
  -> Version
  -> Either String Revision
lookupRevision db packageName version = do
    packageData <- maybe (Left "package not found") Right $
      Map.lookup packageName db
    versionData <- maybe (Left "version not found") Right $
      Map.lookup version packageData
    let mRevisionString
          = List.lookup revisionField
          . PD.customFieldsPD
          . GPD.packageDescription
          $ DB.cabalFile versionData
    case mRevisionString of
      Just revisionString -> case readMaybe revisionString of
        Just revision -> pure $ Revision revision
        Nothing -> Left $ "unable to parse revision: " ++ revisionString
      Nothing -> pure defaultRevision
  where
    revisionField :: String
    revisionField = "x-revision"
The package data just has information about specific versions, so this API does not provide information about preferred/deprecated version ranges, unfortunately.
type PackageData = Map Version VersionData
Cabal and cabal-install
The hackage-db
package is implemented using the Cabal
library, so I took a look at that library next. It does not include
information about preferred/deprecated version ranges, but that is
expected.
The cabal command takes preferred/deprecated version
ranges into account when creating build plans, so I looked at the cabal-install
package next. I found the packagePreferences
field in the SourcePackageDb type. Unfortunately, the version of
cabal-install that I am testing with does not expose such
functionality in a library. The repository HEAD exposes a
library, but a comment indicates that doing so is temporary, so I
probably should not rely on it.
Pantry
I looked at the pantry
package next. Stack uses
Pantry to manage packages. Pantry stores package information in a SQLite database, so I
investigated the database to see what information it contains.
The package_name table indexes package names.
sqlite> .schema package_name
CREATE TABLE IF NOT EXISTS "package_name" (
  "id" INTEGER PRIMARY KEY,
  "name" VARCHAR NOT NULL,
  CONSTRAINT "unique_package_name" UNIQUE ("name")
);
The version table indexes version strings, unrelated to
packages.
sqlite> .schema version
CREATE TABLE IF NOT EXISTS "version" (
  "id" INTEGER PRIMARY KEY,
  "version" VARCHAR NOT NULL,
  CONSTRAINT "unique_version" UNIQUE ("version")
);
The hackage_cabal table contains package version
information, including the revision!
sqlite> .schema hackage_cabal
CREATE TABLE IF NOT EXISTS "hackage_cabal" (
  "id" INTEGER PRIMARY KEY,
  "name" INTEGER NOT NULL
    REFERENCES "package_name"
      ON DELETE RESTRICT
      ON UPDATE RESTRICT,
  "version" INTEGER NOT NULL
    REFERENCES "version"
      ON DELETE RESTRICT
      ON UPDATE RESTRICT,
  "revision" INTEGER NOT NULL,
  "cabal" INTEGER NOT NULL
    REFERENCES "blob"
      ON DELETE RESTRICT
      ON UPDATE RESTRICT,
  "tree" INTEGER NULL
    REFERENCES "tree"
      ON DELETE RESTRICT
      ON UPDATE RESTRICT,
  CONSTRAINT "unique_hackage" UNIQUE ("name","version","revision")
);
SELECT pn.name, v.version, hc.revision
  FROM hackage_cabal AS hc
  JOIN package_name AS pn
    ON hc.name = pn.id
  JOIN version AS v
    ON hc.version = v.id
  WHERE pn.name = 'bm';
| name | version | revision | 
|---|---|---|
| bm | 0.1.0.2 | 0 | 
| bm | 0.1.0.2 | 1 | 
| bm | 0.1.0.2 | 2 | 
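Because revisions are numbered sequentially, only the highest revision for each package version needs to be kept. The following minimal sketch folds rows like the ones above into a map of current revisions. The `Row` type and the `latestRevisions` function are hypothetical names used for illustration; they are not part of Pantry.

```haskell
import qualified Data.Map.Strict as Map

-- | One row of the query result: package name, version string, revision.
type Row = (String, String, Int)

-- | Keep only the highest revision seen for each (name, version) pair.
latestRevisions :: [Row] -> Map.Map (String, String) Int
latestRevisions rows =
  Map.fromListWith
    max
    [ ((name, version), revision)
    | (name, version, revision) <- rows
    ]

-- | The rows from the query result above, used as sample input.
bmRows :: [Row]
bmRows =
  [ ("bm", "0.1.0.2", 0)
  , ("bm", "0.1.0.2", 1)
  , ("bm", "0.1.0.2", 2)
  ]
```

For the sample rows, the resulting map has a single entry that maps `("bm", "0.1.0.2")` to revision 2.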
The preferred_versions table contains the preferred
version ranges.
sqlite> .schema preferred_versions
CREATE TABLE IF NOT EXISTS "preferred_versions" (
  "id" INTEGER PRIMARY KEY,
  "name" INTEGER NOT NULL REFERENCES "package_name"
    ON DELETE RESTRICT
    ON UPDATE RESTRICT,
  "preferred" VARCHAR NOT NULL,
  CONSTRAINT "unique_preferred" UNIQUE ("name")
);
SELECT pn.name, pv.preferred
  FROM preferred_versions AS pv
  JOIN package_name AS pn
    ON pv.name = pn.id
  WHERE pn.name = 'mtl';
| name | preferred | 
|---|---|
| mtl | mtl <2.1 || >2.1 && <2.3 || >2.3 | 
Hackage Server API
The Hackage Server API could provide a way to retrieve information that is not included in the Hackage index tarball. Indeed, a candidates endpoint is documented.
$ curl \
    -H 'Accept: application/json' \
    https://hackage.haskell.org/packages/candidates/ \
  > candidates.json
$ du -h candidates.json
1.4M    candidates.json
$ jq length candidates.json
5332
$ jq '.[0]' candidates.json
{
  "candidates": [
    {
      "sha256": "e1766e75168c967a60a1940a89fec96576f2eb75a4f375fe07a7fb7e59db839d",
      "version": "0.1.0.0"
    }
  ],
  "name": "2captcha-haskell"
}
I have noticed that people tend to forget to clean up their candidate packages, but the candidates data is even larger than I expected!
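To process the candidates data in Haskell, the JSON could be decoded with the aeson library. The following is a minimal sketch, assuming that every entry has the same shape as the first one shown above. The `Candidate` and `CandidatePackage` types and the `decodeCandidates` function are hypothetical names, not part of any Hackage client library.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (FromJSON (parseJSON), eitherDecode, withObject, (.:))
import Data.ByteString.Lazy (ByteString)

-- | One candidate release of a package (field names taken from the sample).
data Candidate = Candidate
  { candidateVersion :: String
  , candidateSha256 :: String
  }
  deriving (Eq, Show)

-- | A package together with all of its candidate releases.
data CandidatePackage = CandidatePackage
  { candidateName :: String
  , candidateReleases :: [Candidate]
  }
  deriving (Eq, Show)

instance FromJSON Candidate where
  parseJSON = withObject "Candidate" $ \o ->
    Candidate <$> o .: "version" <*> o .: "sha256"

instance FromJSON CandidatePackage where
  parseJSON = withObject "CandidatePackage" $ \o ->
    CandidatePackage <$> o .: "name" <*> o .: "candidates"

-- | Decode the whole response body, reporting a parse error on failure.
decodeCandidates :: ByteString -> Either String [CandidatePackage]
decodeCandidates = eitherDecode
```

The contents of candidates.json could be read with `Data.ByteString.Lazy.readFile` and passed to `decodeCandidates`. If some entries turn out to lack one of these keys, decoding returns a `Left` with an error message.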
A deprecated endpoint is also documented under the
versions feature.
$ curl \
    -H 'Accept: application/json' \
    https://hackage.haskell.org/packages/deprecated \
  > deprecated.json
$ du -h deprecated.json
76K     deprecated.json
$ jq length deprecated.json
1168
$ jq '.[0]' deprecated.json
{
  "deprecated-package": "2captcha",
  "in-favour-of": [
    "captcha-2captcha"
  ]
}
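The deprecation data could be loaded into a map keyed by package name, which makes it easy to check whether a given package is deprecated and what replaces it. The following is a minimal sketch using the aeson library, assuming that every entry has both keys shown in the sample above. The `Deprecation` type and the `deprecationMap` function are hypothetical names used for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (FromJSON (parseJSON), eitherDecode, withObject, (.:))
import Data.ByteString.Lazy (ByteString)
import qualified Data.Map.Strict as Map

-- | One entry of the deprecated endpoint (field names taken from the sample).
data Deprecation = Deprecation
  { deprecatedPackage :: String
  , inFavourOf :: [String]
  }
  deriving (Eq, Show)

instance FromJSON Deprecation where
  parseJSON = withObject "Deprecation" $ \o ->
    Deprecation <$> o .: "deprecated-package" <*> o .: "in-favour-of"

-- | Decode the response body and index replacements by deprecated package.
deprecationMap :: ByteString -> Either String (Map.Map String [String])
deprecationMap bytes = do
  entries <- eitherDecode bytes
  pure $ Map.fromList [(deprecatedPackage d, inFavourOf d) | d <- entries]
```

A lookup such as `Map.lookup "2captcha"` on the resulting map then gives the list of suggested replacements, if the package is deprecated.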