Locales in Haskell
There are a number of features that I have yet to implement in my website software because I do not need them for this website. One that I think should be included in the first public release, however, is decent localization support. In particular, the website software should be able to format dates and times according to the configured locale, for display on web pages as well as in RSS feed item descriptions.
I have long considered creating a package that defines a broad selection of locales, for use by the website software as well as any other Haskell software. I was away from my computer yesterday morning, and I sketched out some ideas for the package with pen and paper while waiting. When I got back to my computer, I did some research, and the situation in Haskell turned out to be more complicated than I expected. This blog entry serves as a summary of my current thoughts.
Date and Time Formatting
The time package has locale support, and the TimeLocale data type is used to specify the information required for formatting and parsing dates and times. The library only includes a default locale, using English. I do not know of any existing Haskell libraries that define a broad selection of locales that would be appropriate for localized software.
Each TimeLocale includes the following information:
- week day names, full (“Monday”) and abbreviated (“Mon”)
- month names, full (“October”) and abbreviated (“Oct”)
- AM/PM symbols
- common formatting strings
- known time zones, specifying locale-specific abbreviations (UTC+09:00 is known as “JST” in Japan and “KST” in Korea)
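To make the fields listed above concrete, here is a minimal sketch of a hand-written German TimeLocale. The name germanLocale is hypothetical, and the strings are typed in by hand for illustration; they were not generated from CLDR data, and the format strings are my own guesses at reasonable defaults.

```haskell
import Data.Time

-- Illustrative only: a hand-written German locale, not derived from CLDR.
-- Record update on defaultTimeLocale is used so that the sketch keeps
-- compiling even if a future version of time adds fields.
germanLocale :: TimeLocale
germanLocale = defaultTimeLocale
  { -- Week days start with Sunday, as in defaultTimeLocale.
    wDays =
      [ ("Sonntag", "So"), ("Montag", "Mo"), ("Dienstag", "Di")
      , ("Mittwoch", "Mi"), ("Donnerstag", "Do"), ("Freitag", "Fr")
      , ("Samstag", "Sa")
      ]
  , months =
      [ ("Januar", "Jan"), ("Februar", "Feb"), ("März", "Mär")
      , ("April", "Apr"), ("Mai", "Mai"), ("Juni", "Jun")
      , ("Juli", "Jul"), ("August", "Aug"), ("September", "Sep")
      , ("Oktober", "Okt"), ("November", "Nov"), ("Dezember", "Dez")
      ]
  , amPm = ("AM", "PM")
    -- Assumed formatting strings; a real locale would take these from CLDR.
  , dateTimeFmt = "%a %d %b %Y %H:%M:%S %Z"
  , dateFmt = "%d.%m.%Y"
  , timeFmt = "%H:%M:%S"
  , time12Fmt = "%I:%M:%S %p"
    -- Locale-specific abbreviations for Central European (Summer) Time.
  , knownTimeZones = [TimeZone 60 False "MEZ", TimeZone 120 True "MESZ"]
  }

-- Format a date with full week day and month names from the given locale.
longDate :: TimeLocale -> Day -> String
longDate locale = formatTime locale "%A, %d. %B %Y"
```

With this definition, `longDate germanLocale (fromGregorian 2021 10 11)` evaluates to "Montag, 11. Oktober 2021".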
The required translations can be found in the Unicode Common Locale Data Repository (CLDR). This data is re-distributed as part of the International Components for Unicode (ICU). The Haskell text-icu package provides bindings to the ICU library, but it does not include such locale information.
Common formatting strings are included in the CLDR as well. Based on the locales that I have investigated so far, I am hopeful that I can automatically select the formatting strings to be included in TimeLocale, but I cannot be certain that it is possible until I implement it.
The CLDR does not include sufficient information to define the known time zones. The time zones are listed, including the abbreviated names, but the UTC offsets are not specified. The timezone-series and timezone-olson libraries can be used to get this information, however.
Note that the time zones that are included in TimeLocale are generally only used when parsing a time string. When formatting a time string that includes time zone names, one generally uses a ZoneSeriesTime or ZonedTime, both of which include the specific time zone. It is clear why this is necessary when considering locales that include bordering time zones that have daylight saving time. Given only the UTC offset, it is impossible to distinguish between the standard time of one time zone and the summer time of a bordering time zone, so the time zone must be specified.
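The following sketch illustrates the formatting side of this. The same instant is rendered in two zones that share the UTC+09:00 offset, and the abbreviation comes from the TimeZone value carried by the ZonedTime, not from the locale. The names jst, kst, instant, and render are hypothetical and exist only for this example.

```haskell
import Data.Time

-- Two time zones with the same offset but different abbreviations.
jst, kst :: TimeZone
jst = TimeZone { timeZoneMinutes = 540, timeZoneSummerOnly = False, timeZoneName = "JST" }
kst = TimeZone { timeZoneMinutes = 540, timeZoneSummerOnly = False, timeZoneName = "KST" }

-- An arbitrary instant: midnight UTC on 2021-10-11.
instant :: UTCTime
instant = UTCTime (fromGregorian 2021 10 11) 0

-- %Z prints the name stored in the ZonedTime's own TimeZone; the
-- knownTimeZones list of the locale plays no role when formatting.
render :: TimeZone -> String
render tz = formatTime defaultTimeLocale "%Y-%m-%d %H:%M %Z" (utcToZonedTime tz instant)
```

Here `render jst` gives "2021-10-11 09:00 JST" and `render kst` gives "2021-10-11 09:00 KST", even though the offsets are identical.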
Locale Identifiers
There are various ways to specify a locale, according to different standards. At first, I worried that I might have to support three of them in the website software:
- A locale identifier is used to select the appropriate locale for date and time formatting. It is most natural for me to use POSIX locale syntax: a language code optionally followed by an underscore and a territory code. The language code is two or three letters (normalized to lowercase) as defined by the ISO 639-1 and ISO 639-2 standards. The territory code is either two letters (normalized to uppercase) as defined by the ISO 3166-1 alpha-2 standard, or a three-digit region code (UN M.49) as allowed in BCP47.
- The locale is (usually) specified in the lang attribute of the html element in HTML pages, where the BCP47 syntax is used. BCP47 identifiers (called “language tags”) consist of one or more subtags separated by hyphens. BCP47 can represent much more subtle differences of language than POSIX locales, but it is correspondingly more complicated.
- The locale is (usually) specified in the language element of RSS feeds. Here, either an RSS-specific language code or a language code as defined in RFC 1766 may be used, as was defined for HTML 4. (RSS is old.)
I have since realized that I can, and should, use only BCP47 in the website software. I doubt it will cause issues in the RSS feeds, and it works fine for locale selection. The library should support POSIX locale identifiers as well, however, since that is what is used on POSIX platforms. Thankfully, it is easy to convert a POSIX locale identifier to a BCP47 language tag.
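As a rough sketch of that conversion, assuming the common language_TERRITORY.codeset@modifier shape: drop the codeset and modifier, then replace the underscore with a hyphen. The function posixToBcp47 is a hypothetical helper, not part of any existing package, and it does not validate the result as a BCP47 tag.

```haskell
-- Hypothetical helper: convert a POSIX locale identifier such as
-- "ja_JP.UTF-8" to a BCP47-style language tag such as "ja-JP".
-- The "C" and "POSIX" locales have no language, so they map to Nothing.
posixToBcp47 :: String -> Maybe String
posixToBcp47 ident
  | base `elem` ["", "C", "POSIX"] = Nothing
  | otherwise = Just (map underscoreToHyphen base)
  where
    -- Drop the optional ".codeset" and "@modifier" parts.
    base = takeWhile (`notElem` ".@") ident
    underscoreToHyphen '_' = '-'
    underscoreToHyphen c = c
```

Note that this sketch simply discards modifiers. Some modifiers carry meaning that BCP47 expresses as a subtag (for example, sr_RS@latin corresponds to sr-Latn-RS), so a real implementation would need a mapping for those cases.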
Locale Selection
When using POSIX locale identifiers, locale selection is generally implemented in a straightforward manner. First, the identifier is looked up in a mapping from identifier to locale. If there is a match, the locale is selected. Otherwise, if the identifier includes a territory code, then a new identifier is created using just the language code, and that new identifier is looked up in the mapping. If there is a match, the locale is selected. Otherwise, the default locale is used.
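The fallback procedure described above can be sketched in a few lines. The function selectLocale is a hypothetical name, and the locale type is left polymorphic so that the sketch does not depend on any particular representation.

```haskell
import Control.Applicative ((<|>))
import qualified Data.Map.Strict as Map
import Data.Maybe (fromMaybe)

-- Look up the full identifier first, then the bare language code,
-- and finally fall back to the default locale.
selectLocale :: a -> Map.Map String a -> String -> a
selectLocale def table ident =
  fromMaybe def (Map.lookup ident table <|> Map.lookup language table)
  where
    -- Everything before the underscore; equal to ident when there is
    -- no territory code, in which case the second lookup is redundant.
    language = takeWhile (/= '_') ident
```

For example, with a table containing only "de" and "de_AT", the identifier "de_CH" selects the "de" entry, while "fr_FR" selects the default.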
BCP47 provides a much more powerful (and complicated) method of matching language tags. In addition, the CLDR includes language distance data that approximates the similarity of different languages, and this data can be used to select the most appropriate locale when the requested language is not supported.
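For comparison, here is a simplified sketch of the “lookup” scheme from RFC 4647 (the matching half of BCP47), which progressively truncates the requested tag from the end and also drops any single-character subtag left dangling. It works on plain strings, performs no validation, and ignores wildcards and default values; the names splitOn, truncations, and lookupTag are my own and do not come from any package discussed here.

```haskell
import Data.Char (toLower)
import Data.List (intercalate)
import Data.Maybe (listToMaybe)

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (a, [])       -> [a]
  (a, _ : rest) -> a : splitOn c rest

-- All progressively shorter forms of a tag, most specific first.
-- After removing the last subtag, a trailing singleton (such as "x")
-- is removed as well.
truncations :: [String] -> [[String]]
truncations [] = []
truncations subtags = subtags : truncations (dropSingleton (init subtags))
  where
    dropSingleton xs
      | not (null xs) && length (last xs) == 1 = init xs
      | otherwise = xs

-- Return the first supported tag that equals one of the truncations
-- of the requested tag, comparing case-insensitively.
lookupTag :: [String] -> String -> Maybe String
lookupTag supported range =
  listToMaybe
    [ tag
    | candidate <- candidates
    , tag <- supported
    , map toLower tag == candidate
    ]
  where
    candidates =
      map (intercalate "-") (truncations (splitOn '-' (map toLower range)))
```

With supported tags "de", "zh-Hant", and "en", a request for "zh-Hant-CN-x-private1-private2" is tried as zh-hant-cn-x-private1-private2, zh-hant-cn-x-private1, zh-hant-cn, and then zh-hant, which matches. Distance-based selection using CLDR data is a separate, more involved technique that this sketch does not attempt.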
Existing Haskell Packages
The bcp47 package provides an API for parsing BCP47 language tags as well as a data structure for performing lookups and matches. Unfortunately, it does not follow the BCP47 language specifications very well. Critically, it does not support all assigned language tags, including some very common ones, so I cannot use it in my library. In addition, there are many things about the implementation that I do not like:
- No effort is taken to keep the memory size down, which results in heavier executables. Even best practices are not followed.
- The library has dependencies that are only used to define instances! For example, aeson is required in order to define instances.
- The primary module in the library exports values used for testing, including “nonsense” tags!
- I suspect that the implementation of lookups and matches is not correct in some cases, even for the subset of language tags that are supported. I have not created a test case to confirm this suspicion, however, so I may be wrong.
- The tests are far from sufficient.
(I hope I do not sound like I am picking on the library. There is some nice code in the implementation as well!)
The iso639 package provides a data type with a constructor for each language code assigned in ISO 639-1 (two-letter codes). It does not support ISO 639-2 (three-letter codes), so it is not sufficient for use with POSIX locale identifiers, much less BCP47 language tags.
The country package provides an API for working with ISO 3166-1 country codes, and includes values for assigned countries. It does not support numeric region codes, however, so it is not sufficient for use with POSIX locale identifiers, much less BCP47 language tags.
The languagetag-bcp47 package provides an API for parsing and analyzing BCP47 language tags, according to the standards. The implementation looks really nice, but it is a recent project and is not finished yet. In a recent issue, somebody asked if there are plans to release the library on Hackage, and the author replied that he has no such plans yet since the library is not finished and he has not had time to work on it lately.
The README notes the following:
Note that matching based on language ranges (basic or extended) has not yet been added to the package. The remainder of the standard, however, is fully supported.
I initially assumed that matching language tags is not yet implemented at all, but I then discovered that lookup is already implemented. I wrote some quick test code to try it out, and it works fine.
I would like to note some interesting implementation details. The Subtag data type is implemented as a newtype of Word64! Each subtag is a maximum of 8 letters long, requiring only 56 bits when using a compact encoding of 7-bit ASCII. Remaining bits include an encoding of the subtag length as well as some flags, and care was taken to provide the same ordering as the Text representation. The MaybeSubtag data type provides Maybe functionality using a newtype of Subtag where the zero value represents Nothing. In the implementation of the Normal language tag, these two types are used with UNPACK pragmas to minimize memory usage. Very nice!
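To illustrate the general idea, here is my own simplified sketch of packing a subtag into a Word64. It is not the actual encoding used by languagetag-bcp47: the bit layout, the absence of flags, and the names PackedSubtag, packSubtag, and unpackSubtag are all assumptions made for this example.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, isAlphaNum, isAscii, ord, toLower)
import Data.Word (Word64)

-- Assumed layout: eight 7-bit characters in bits 63..8, first character
-- in the most significant position, and the length in the low 4 bits.
newtype PackedSubtag = PackedSubtag Word64
  deriving (Eq, Ord, Show)

-- Pack 1 to 8 ASCII alphanumeric characters, normalized to lowercase.
packSubtag :: String -> Maybe PackedSubtag
packSubtag s
  | n < 1 || n > 8 = Nothing
  | not (all valid s) = Nothing
  | otherwise =
      Just (PackedSubtag (foldr (.|.) (fromIntegral n) (zipWith place [0 ..] s)))
  where
    n = length s
    valid c = isAscii c && isAlphaNum c
    place :: Int -> Char -> Word64
    place i c = fromIntegral (ord (toLower c)) `shiftL` (57 - 7 * i)

-- Recover the (lowercase) characters using the stored length.
unpackSubtag :: PackedSubtag -> String
unpackSubtag (PackedSubtag w) =
  [ chr (fromIntegral ((w `shiftR` (57 - 7 * i)) .&. 0x7f))
  | i <- [0 .. n - 1]
  ]
  where
    n = fromIntegral (w .&. 0xf) :: Int
```

Because the first character occupies the most significant bits, the derived Ord instance on the Word64 agrees with the ordering of the lowercase strings. Since every valid subtag has a nonzero length, the value zero is never produced, which is what makes it available as a Nothing marker in a MaybeSubtag-style wrapper.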
If I were to use languagetag-bcp47 in my locales library, I would need to get the source from GitHub since the package is not on Hackage. To ensure that the library does not break if the repository disappears, I would use a fork of the repository instead of the upstream repository. If/when the package is eventually released to Hackage, I could start using Hackage instead. I would then like to remove the fork, but I would not be able to because doing so would break old versions of the locales library.
Alternatively, I could “vendor” the languagetag-bcp47 library by copying it to the locales library repository (until the library is released to Hackage), removing these concerns. In this case, I can remove the Registry that contains all of the registered tag information. (This module is quite large and results in slow compilation.)
Library Plans
For now, I am going to put off development of the locales library once again, in hopes that languagetag-bcp47 is finished and released by the time I need it. In the meantime, I will update the website software to use BCP47 language tags to specify the locale, using a single setting to load the locale as well as specify the language in HTML documents and RSS feeds.