Folding Soft Line Breaks
In FeedPipe Item Descriptions, I described my idea to support inline Markdown in item description metadata. The formatted text can be used in the RSS feed as well as in the HTML page content, and a non-formatted version can be used in the HTML metadata. I wrote a function that transforms the formatted text to plain text, but there was one complication: the handling of soft line breaks. The lines of a paragraph need to be joined, but how to join them depends on the language of the content. I hoped that such functionality would be provided by the International Components for Unicode (ICU) library, but it is not.
I wrote a quick prototype of a function that folds fragments of text based on the Unicode block of neighboring characters. This prototype inserts a space between fragments unless both of the neighboring characters are in Unicode blocks for languages that do not separate words with spaces. This is an overly simplified heuristic, and I chose to err on the side of adding a space because languages without spaces merely look poorly formatted when spaces are added, while languages with spaces can be very difficult to read without them. The prototype works as designed, but it requires classifying all Unicode blocks!
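{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.ICU.Char as TIC
import qualified Data.Text.Lazy as TL

-- Fold fragments into one text. Note that foldr1 is partial: the list of
-- fragments must not be empty.
foldText :: [TL.Text] -> TL.Text
foldText = foldr1 go
  where
    -- Join two fragments, omitting the space only when the characters on
    -- both sides of the seam are in "no space" blocks.
    go :: TL.Text -> TL.Text -> TL.Text
    go tL tR = case (lastCharBlock tL, firstCharBlock tR) of
      (Just blockL, Just blockR)
        | blockL `elem` noSpaceBlocks &&
          blockR `elem` noSpaceBlocks -> tL <> tR
        | otherwise -> tL <> " " <> tR
      (Nothing, _r) -> tR
      (_l, Nothing) -> tL

-- Unicode block of the last/first character, or Nothing for empty text.
lastCharBlock, firstCharBlock :: TL.Text -> Maybe TIC.BlockCode
lastCharBlock = fmap (TIC.blockCode . snd) . TL.unsnoc
firstCharBlock = fmap (TIC.blockCode . fst) . TL.uncons

noSpaceBlocks :: [TIC.BlockCode]
noSpaceBlocks =
  [ TIC.CJKSymbolsAndPunctuation
  , TIC.Hiragana
  , TIC.Katakana
  -- TODO: classify the remaining blocks
  ]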
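The same heuristic can be tried out without ICU by checking code point ranges directly. The following is only a sketch for experimentation, not the prototype itself: the names isNoSpaceChar, noSpaceRanges, joinFragments, and foldFragments are hypothetical, the list of ranges is just as incomplete as the block list above, and the CJK Unified Ideographs range is an addition assumed for this sketch so that kanji are treated as "no space" characters.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (ord)
import qualified Data.Text.Lazy as TL

-- | Hypothetical stand-in for an ICU block lookup: True when the character
-- lies in one of the hard-coded code point ranges below.
isNoSpaceChar :: Char -> Bool
isNoSpaceChar c = any inRange noSpaceRanges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi

-- | Code point ranges treated as "no space" (incomplete by design).
noSpaceRanges :: [(Int, Int)]
noSpaceRanges =
  [ (0x3000, 0x303F) -- CJK Symbols and Punctuation
  , (0x3040, 0x309F) -- Hiragana
  , (0x30A0, 0x30FF) -- Katakana
  , (0x4E00, 0x9FFF) -- CJK Unified Ideographs (assumed for this sketch)
  ]

-- | Join two fragments, omitting the space only when the characters on both
-- sides of the seam are in "no space" ranges.
joinFragments :: TL.Text -> TL.Text -> TL.Text
joinFragments tL tR = case (TL.unsnoc tL, TL.uncons tR) of
  (Just (_, cL), Just (cR, _))
    | isNoSpaceChar cL && isNoSpaceChar cR -> tL <> tR
    | otherwise -> tL <> " " <> tR
  (Nothing, _) -> tR
  (_, Nothing) -> tL

-- | Fold a list of fragments; an empty list folds to empty text.
foldFragments :: [TL.Text] -> TL.Text
foldFragments = foldr joinFragments TL.empty
```

With these ranges, the fragments "これは日本語の" and "例です。" are joined without a space, while "This is an English" and "example." are joined with one. Mixed seams, such as Japanese text followed by an English word, get a space, which matches the choice to err on the side of adding space.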
I posted a message to the Tokyo Linux Users Group mailing list to see if anybody had any other ideas. The topic is not Linux related, but many members are interested in such language topics as well. I received quite a bit of feedback, which has been very valuable in thinking about the issue!
A TLUG member whom I do not know had an interesting idea that I had not thought of: join lines with a space only when the text contains a space. Some text, such as Japanese text that references the English title of a book (that includes spaces), would not be processed correctly, but such a description could be written on one (long) line to avoid the issue. I find the idea interesting because it is a simple heuristic that does not require use of ICU.
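A minimal sketch of that heuristic, as I understand it, might look like the following. The function name foldBySpaceHeuristic is hypothetical, and the sketch assumes that "contains a space" means an ASCII space (U+0020) anywhere in the lines of the paragraph.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.Lazy as TL

-- | Join the lines of a paragraph. A single space is inserted between lines
-- only when some line already contains an ASCII space; otherwise the lines
-- are concatenated directly. (Hypothetical name, not part of FeedPipe.)
foldBySpaceHeuristic :: [TL.Text] -> TL.Text
foldBySpaceHeuristic ls
  | any (TL.any (== ' ')) ls = TL.intercalate " " ls
  | otherwise = TL.concat ls
```

English lines such as "This is an English" and "example." are joined with a space, and Japanese lines such as "これは日本語の" and "例です。" are joined without one. The limitation mentioned above shows up as soon as a Japanese paragraph contains a single space: the lines "これは", "Hello World", and "の例です。" are all joined with spaces.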
One thing I realized is that the problem that I am trying to solve can be quite difficult to understand, even among people who are familiar with a language that separates words with spaces (English) as well as a language that does not (Japanese). I think that simplicity of documentation should weigh highly when evaluating the implementation options.
Based on the feedback, I am currently considering an option that was not in my initial list, which makes use of the different types of YAML block scalar syntax.
When using a language that separates words with spaces, users can use a folded block scalar, which joins lines with a space in between.
description: >
  This is an English
  example.
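When this YAML is parsed, the value is
This is an English example.
with no newline characters within the text. (By default, YAML keeps a single trailing newline, which the strip chomping indicator, written >-, removes.)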
When using a language that does not separate words with spaces, users can use a literal block scalar, which keeps the newlines between lines as they are.
description: |
  これは日本語の
  例です。
When this YAML is parsed, the value is
これは日本語の\n例です。
with a newline in the middle of the text. FeedPipe can fold lines by joining them without inserting a space (mconcat . lines). The value of the English example stays the same since it does not include newline characters within the text, while the value of the Japanese example becomes
これは日本語の例です。
as desired.
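The folding step can be sketched as follows, assuming the parsed description is available as lazy Text. The name foldLines is hypothetical; the function splits the value into lines and concatenates them, so no space is ever inserted.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.Lazy as TL

-- | Remove every newline by splitting the text into lines and concatenating
-- them. Text without newlines is returned unchanged. (Hypothetical name.)
foldLines :: TL.Text -> TL.Text
foldLines = mconcat . TL.lines
```

Applied to "これは日本語の\n例です。", this gives "これは日本語の例です。", and "This is an English example." passes through unchanged. A trailing newline, if the YAML parser leaves one, is dropped as well, because TL.lines does not produce an empty final line for it.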
If there are any tricky situations, the text can always be written on one (long) line to avoid folding. The downside to this fallback is that long lines are difficult to edit.
This option is pretty straightforward to document. I do not have to mention Unicode or ICU at all!