Editing Japanese PDFs (Part 2)
I wrote about my attempts to edit Japanese PDFs in the Editing Japanese PDFs blog entry yesterday, having the most luck with programmatically editing documents using MuPDF. This blog entry is an update on the topic.
Inkscape
Christian Horn let me know that he has had success editing Japanese PDFs using Inkscape! I tried opening one of my documents in Inkscape and was impressed to see that it loads!
When importing a PDF, the software provides the following font options.
- Draw missing fonts
- Substitute missing fonts
- Keep missing fonts’ names
- Delete missing font text
- Draw all text
- Delete all text
What Inkscape calls “draw” is what I know as “outline” in Illustrator. One cannot draw/outline fonts that are missing, so I assume that the substitute fonts are drawn/outlined.
The import dialog box provides a list of the fonts in the PDF. This is helpful!
One document that I worked with this week has the following fonts. I
got this information from the properties dialog of Evince (now called
“Document Viewer”), which uses the Poppler PDF rendering library. This
output seems the most reliable. I also use the `pdffonts` utility to
check fonts, but it decodes the font names in some Japanese documents
incorrectly (“mojibake”). The Evince output matches the `pdffonts`
output, without the decoding issues.
Name | Status | Substitute |
---|---|---|
MS-Mincho | | |
MS-Mincho | | |
MSMincho | embedded subset | |
MSMincho | embedded subset | |
MS-Mincho | embedded subset | |
MS-Mincho | embedded subset | |
Century | embedded subset | |
Century | embedded subset | |
KozGoPr6N-Medium | not embedded | Droid Sans |
Helvetica | standard 14 font | NimbusSans-Regular |
KozMinPr6N-Regular | not embedded | Droid Sans |
MSGothic | not embedded | Droid Sans |
MSMincho | not embedded | Droid Sans |
ZapfDingbats | not embedded | D050000L |
KozMinPr6N-Regular | not embedded | Droid Sans |
KozMinPr6N-Regular | embedded subset | |
KozMinPr6N-Regular | embedded subset | |
MS明朝 | embedded subset | |
MSMincho | embedded subset | |
MSMincho | embedded subset | |
MS-Mincho | embedded subset | |
MS明朝 | embedded subset | |
Century | embedded subset | |
This is a bit of a mess, which is not surprising given that the document metadata indicates that the document was created using Word 2007. (Unfortunately, Microsoft Office is very popular in Japan.) There are a lot of repeated fonts, which likely represent different subsets.
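The same kind of font list can also be gathered programmatically. The following is a minimal sketch using PyMuPDF's `Page.get_fonts`. It relies on the PDF convention that a subset font name starts with a six-letter tag and a plus sign (for example `ABCDEF+MS-Mincho`). It is an assumption that the names reported by PyMuPDF carry this tag, so the helper accepts both forms. The function names are made up for illustration.

```python
import re
from collections import Counter

# PDF subset fonts carry a tag of six uppercase letters plus "+",
# for example "ABCDEF+MS-Mincho".
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(basefont):
    """Return (tag, name); tag is None when the name has no subset tag."""
    match = SUBSET_TAG.match(basefont)
    if match:
        return match.group(1), match.group(2)
    return None, basefont


def count_font_names(basefonts):
    """Count how often each font name occurs, ignoring subset tags."""
    return Counter(split_subset_tag(name)[1] for name in basefonts)


def list_basefonts(path):
    """Collect the base font name of every font object in the PDF at `path`."""
    import pymupdf  # PyMuPDF; older releases use `import fitz` instead

    by_xref = {}
    with pymupdf.open(path) as doc:
        for page in doc:
            # Each entry is (xref, ext, type, basefont, name, encoding).
            for font in page.get_fonts():
                # Key by xref so a font used on several pages is listed once.
                by_xref[font[0]] = font[3]
    return list(by_xref.values())
```

Calling `count_font_names(list_basefonts("document.pdf"))` (with a hypothetical file name) would show how many separate font objects share each name, which is one way to check whether the repeats are different subsets.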
The Inkscape import dialog displays the following fonts. A substitute is listed even for embedded fonts, and I suspect that these are the fonts used to display text when not outlining.
Name | Substitute |
---|---|
MSMincho | Droid Sans |
KozMinPr6N-Regular | Droid Sans |
Droid Sans | |
KozMinPr6N-Regular | Droid Sans |
Century | C059 |
Droid Sans | |
MSMincho | Droid Sans |
MS-Mincho | Droid Sans |
MSMincho | Droid Sans |
MS明朝 | Droid Sans |
MS明朝 | Droid Sans |
Droid Sans | |
Century | C059 |
Droid Sans | |
Droid Sans | |
MS-Mincho | Droid Sans |
I decided to investigate one more document, which only uses embedded fonts. It was created using InDesign and is much cleaner. (Kudos to my local government for using InDesign!)
Name | Status | Substitute |
---|---|---|
GothicMB101Pro-Medium | embedded subset | |
RyuminPr6-Regular | embedded subset | |
RyuminPr6-Regular | embedded subset | |
GothicMB101Pro-Medium | embedded subset | |
RyuminPr6-Regular | embedded subset | |
RyuminPr6-Regular | embedded subset | |
DFKaiShoStd-W9 | embedded subset | |
GothicMB101Pro-DeBold | embedded subset | |
The Inkscape import dialog displays the following fonts.
Name | Substitute |
---|---|
RyuminPr6-Regular | Roboto |
GothicMB101Pro-Medium | Gentium Plus |
RyuminPr6-Regular | Roboto |
DFKaiShoStd-W9 | DejaVu Sans |
GothicMB101Pro-Medium | Gentium Plus |
RyuminPr6-Regular | Roboto |
RyuminPr6-Regular | Roboto |
GothicMB101Pro-Medium | Gentium Plus |
When exporting to PDF, Inkscape provides the following font options.
- Embed fonts
- Convert text to paths
- Omit text in PDF and create LaTeX file
When importing the document created with Word using the “draw missing fonts” option, it outlines all of the text. The document contains many empty groups, but it looks fine. Opening the exported file takes a long time since the text is outlined, and re-rendering while scrolling is also slow. The content looks good, however, and can be printed.
When importing the document created with Word using the “Keep missing fonts’ names” option, the text is not outlined and can be edited. The text is indeed displayed with the listed substitutes. The document contains many empty groups, but it looks fine. The fonts are changed in the exported file, however. I do not think that the “Keep missing fonts’ names” option is viable.
When importing multiple PDF pages, each page is loaded in a separate “artboard” (Illustrator terminology) with content in separate layers. I confirmed that the exported PDFs show the pages correctly.
To work with a document, I would do the following.
- Import using the “draw missing fonts” option.
- Lock the imported layers (one per page).
- For each imported layer, add a new layer below the layer, draw a white rectangle to use as the background of the page, and lock the layer.
- For each imported layer, add a new layer above the layer for added content.
When adding content, one can make use of the alignment features of Inkscape. This is really convenient, and Inkscape greatly outshines Acrobat Pro when it comes to alignment!
Overall, I am very impressed. Inkscape is definitely a great option for editing Japanese PDFs!
MuPDF
I prepared another document this morning using PyMuPDF. This
document has tables with similar text entered in each row. It was very
convenient to use a `for` loop to place the new content!
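As an illustration of that kind of loop, here is a minimal sketch that computes one rectangle per table row and places an HTML snippet in each with `Page.insert_htmlbox`. The helper names, the coordinates, and the assumption that all rows have the same height are hypothetical. A real document would need measurements taken from the page.

```python
def row_rects(x0, y0, width, row_height, count):
    """Return (x0, y0, x1, y1) boxes for `count` equally tall table rows."""
    return [
        (x0, y0 + i * row_height, x0 + width, y0 + (i + 1) * row_height)
        for i in range(count)
    ]


def fill_rows(page, texts, x0, y0, width, row_height):
    """Place one HTML snippet per table row on a PyMuPDF page."""
    import pymupdf  # PyMuPDF; older releases use `import fitz` instead

    rects = row_rects(x0, y0, width, row_height, len(texts))
    for rect, text in zip(rects, texts):
        # insert_htmlbox lays out the HTML inside the given rectangle.
        page.insert_htmlbox(pymupdf.Rect(*rect), text)
```

For example, `fill_rows(page, ["山田 太郎", "2024年4月1日"], 100, 200, 150, 20)` would fill two rows starting at the point (100, 200).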
I noticed, however, that the edited PDFs are quite large! The source documents are hundreds of kilobytes while the edited documents are tens of megabytes. This is not an issue when you just need to print the edited document, but it can be a problem when you need to share it. For example, some government sites limit the size of PDF files that can be uploaded.
The cause of this issue is that MuPDF is embedding large fonts in the
document, not subsets. My edited documents embed the Charis SIL
Regular and Droid Sans Fallback Regular fonts. I mentioned yesterday
that I have been unable to figure out how to configure usage of a
standard font when using `Page.insert_htmlbox`.
Note that I have had no issues with `Page.insert_text`,
but I prefer to use `Page.insert_htmlbox`
in order to take advantage of alignment and font selection (when
Japanese text contains half-width numbers for example). I really hope
that I am able to figure out how to configure usage of standard fonts
and avoid embedding large fonts.
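One thing that may be worth trying in the meantime is PyMuPDF's `Document.subset_fonts()`, which rebuilds embedded fonts as subsets, together with the `garbage` and `deflate` options of `Document.save`. The sketch below is untested against these documents. Whether it actually shrinks the Charis SIL and Droid Sans Fallback fonts embedded by `Page.insert_htmlbox` is an assumption, and the function names are made up for illustration.

```python
def save_compact(doc, out_path):
    """Subset embedded fonts where possible, then save with compression."""
    # Document.subset_fonts() rebuilds eligible embedded fonts as subsets.
    doc.subset_fonts()
    # garbage=4 removes unused objects and merges duplicates;
    # deflate=True compresses uncompressed streams.
    doc.save(out_path, garbage=4, deflate=True)


def shrink(src_path, out_path):
    """Open an edited PDF and write a compacted copy to `out_path`."""
    import pymupdf  # PyMuPDF; older releases use `import fitz` instead

    with pymupdf.open(src_path) as doc:
        save_compact(doc, out_path)
```

Comparing the file sizes before and after `shrink("edited.pdf", "edited-small.pdf")` (hypothetical file names) would show whether subsetting helps in this case.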