Folding Soft Line Breaks

In FeedPipe Item Descriptions, I described my idea to support inline Markdown in item description metadata. The formatted text can be used in the RSS feed as well as in the HTML page content, and a non-formatted version can be used in the HTML metadata. I wrote a function that transforms the formatted text to plain text, but there was one complication: the handling of soft line breaks. The lines of a paragraph need to be joined, but doing so depends on the language of the content. I hoped that such functionality would be provided by the International Components for Unicode (ICU) library, but it is not.

I wrote a quick prototype of a function that folds fragments of text based on the Unicode block of neighboring characters. This prototype inserts a space between fragments unless both of the neighboring characters are in Unicode blocks for languages that do not separate words with spaces. This is an overly simplified heuristic, and I chose to err on the side of adding a space because languages without spaces just look poorly formatted when spaces are added, while languages with spaces can be very difficult to read without them. The prototype works as designed, but it requires classifying all Unicode blocks!

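{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.ICU.Char as TIC  -- from the text-icu package
import qualified Data.Text.Lazy as TL

-- Fold fragments, inserting a space between neighbors unless the characters
-- on both sides of the join are in blocks listed in noSpaceBlocks.
foldText :: [TL.Text] -> TL.Text
foldText = foldr go TL.empty  -- total: an empty list folds to empty text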
  where
    go :: TL.Text -> TL.Text -> TL.Text
    go tL tR = case (lastCharBlock tL, firstCharBlock tR) of
      (Just blockL, Just blockR)
        | blockL `elem` noSpaceBlocks &&
          blockR `elem` noSpaceBlocks -> tL <> tR
        | otherwise -> tL <> " " <> tR
      (Nothing, _r) -> tR
      (_l, Nothing) -> tL

    lastCharBlock, firstCharBlock :: TL.Text -> Maybe TIC.BlockCode
    lastCharBlock = fmap (TIC.blockCode . snd) . TL.unsnoc
    firstCharBlock = fmap (TIC.blockCode . fst) . TL.uncons

    noSpaceBlocks :: [TIC.BlockCode]
    noSpaceBlocks =
      [ TIC.CJKSymbolsAndPunctuation
      , TIC.Hiragana
      , TIC.Katakana
      -- TODO
      ]

I posted a message to the Tokyo Linux Users Group mailing list to see if anybody had any other ideas. The topic is not Linux related, but many members are interested in such language topics as well. I received quite a bit of feedback, which has been very valuable in thinking about the issue!

A TLUG member whom I do not know had an interesting idea that I had not thought of: join lines with a space only when the text contains a space. Some text, such as Japanese text that references an English book title containing spaces, would not be processed correctly, but such a description could be written on one (long) line to avoid the issue. I find the idea interesting because it is a simple heuristic that does not require use of ICU.
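The heuristic can be sketched in a few lines. This is only an illustration of the idea as I understand it, not code from FeedPipe: the name foldBySpace is hypothetical, and it assumes that "contains a space" means an ASCII space (U+0020) anywhere in any of the lines.

```haskell
import qualified Data.Text.Lazy as TL

-- Hypothetical helper: join the lines of a paragraph with a space only
-- when some line already contains an ASCII space; otherwise join the
-- lines directly.
foldBySpace :: [TL.Text] -> TL.Text
foldBySpace ts
  | any (TL.any (== ' ')) ts = TL.intercalate (TL.singleton ' ') ts
  | otherwise = TL.concat ts
```

Under these assumptions, lines of English text are joined with spaces and lines of Japanese text are joined directly, while Japanese text that quotes an English title with spaces gets a space inserted at every line break, which is the failure case described above.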

One thing I realized is that the problem that I am trying to solve can be quite difficult to understand, even among people who are familiar with a language that separates words with spaces (English) as well as a language that does not (Japanese). I think that simplicity of documentation should weigh highly when evaluating the implementation options.

Based on the feedback, I am currently considering an option that was not in my initial list, which makes use of the different types of YAML block scalar syntax.

When using a language that separates words with spaces, users can use a folded block scalar, which joins lines with a space in between.

description: >
  This is an
  English example.

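When this YAML is parsed, the value is This is an English example. with the line break inside the text replaced by a space. (With default chomping, a single trailing newline is kept; the >- indicator strips it.)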

When using a language that does not separate words with spaces, users can use a literal block scalar, which preserves the newlines between lines.

description: |
  これは日本語の
  例です。

When this YAML is parsed, the value is これは日本語の\n例です。 with a newline in the middle of the text. FeedPipe can fold lines by joining them without inserting a space (mconcat . lines). The value of the English example is unaffected apart from any trailing newline, since its interior line breaks were already folded into spaces, while the value of the Japanese example becomes これは日本語の例です。 as desired.
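A minimal sketch of that folding step, assuming the parsed value is available as lazy Text. The name joinLines is hypothetical and is not taken from FeedPipe.

```haskell
import qualified Data.Text.Lazy as TL

-- Hypothetical helper: split the parsed value on newline characters and
-- join the pieces without inserting anything between them. A trailing
-- newline, if present, is dropped as a side effect.
joinLines :: TL.Text -> TL.Text
joinLines = TL.concat . TL.lines
```

Because the folded (>) scalar has already turned its interior line breaks into spaces, the same function can be applied to every description without checking which block style was used.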

If there are any tricky situations, the text can always be written on one (long) line to avoid folding. The downside to this fallback is that long lines are difficult to edit.

This option is pretty straightforward to document. I do not have to mention Unicode or ICU at all!

Author

Travis Cardwell

Published

August 12, 2021