Editing Japanese PDFs
I occasionally need to prepare PDF documents by adding information to provided documents, almost always in Japanese. For many years, I have used Adobe Acrobat Pro for this, but I am currently experimenting with alternatives. I just submitted a number of documents today that I prepared using MuPDF.
Motivation
I have been an Adobe customer for quite some time, purchasing software in boxes long ago and now paying by subscription. I mostly use Illustrator and Photoshop, but having Acrobat Pro has been very convenient. I am thinking about terminating my subscription, however, for two reasons.
One reason is the price. I think that the subscription cost is fair for folks who make heavy use of the software. I currently do not use it enough to justify the cost, however.
Another reason is that the software requires Windows or macOS. I quite dislike (and distrust) Windows, so most of my experience with Adobe software has been on Mac Pro and MacBook Pro computers running macOS. Both my Mac Pro and MacBook Pro computers had critical hardware issues, however, and did not last long. I reluctantly switched to Windows 10, running on a ThinkPad. My experience running Windows 10 has been abysmal. Windows 10 will reach end of support in just over one year, and the system occasionally asks me to upgrade to Windows 11 even though it also notifies me that the system specs are not sufficient to run Windows 11. I would like to quit using Windows (again), but I am also very hesitant to invest in more Apple hardware after being burnt twice. Unfortunately, Adobe software does not run on Linux.
I am therefore thinking about terminating my Adobe subscription sometime next year. Doing so will cause a fair amount of inconvenience, limiting the projects that I can do to what is possible with open source software. I can always subscribe again if my need for Adobe software increases enough to justify the cost.
Challenges
I prepared three PDF documents this week. Two of them were created with “fillable” PDF forms, while one of them was not. Even Firefox is able to fill PDF forms these days, so I tried that out first. It worked well, but unfortunately PDF forms are often problematic.
The first issue I ran into was entering my name. The form field is not configured for non-Japanese names, a very common issue in Japan. It works fine with (a small number of) kanji, but I am required to enter my full name in English so that it matches my passport and other official identification documents. The name looks fine while filling in the form, but in the saved result the kerning is misconfigured and the letters overlap.
The second issue I ran into was entering my profession. Different professions have different tax rates and licensing requirements, so values should be from a list of specified professions. The form field is configured to use a large font. Most professions written in kanji fit within the allocated space, but loan words (such as “programmer,” 「プログラマー」 in Japanese) that are written in katakana are too long.
When I used Acrobat Pro to fill these PDF forms, the saved files had the same issues. Interestingly, however, the issues did not appear when printing from Acrobat Pro.
Because of such issues, I usually do not use “fillable” PDF forms when preparing Japanese documents with Acrobat Pro, even when they are available. I usually have to edit the document itself. Note that this is unavoidable when preparing the many documents that were not created with “fillable” PDF forms.
I looked into software that can edit PDF documents on Linux. I tried LibreOffice Draw, but it failed to handle the Japanese fonts. I tried Okular, but I was unable to edit the document. (Perhaps the document is “protected.”) Note that Adobe’s free online PDF editor is not able to do what I need.
MuPDF
After quickly giving up on graphical editors, I experimented with MuPDF. I would like to use Haskell for the task, but I was unable to find a viable library for using MuPDF in Haskell. I found an HsMuPDF project, but it is far from sufficient and is not maintained. Perhaps it will provide a good starting point if I decide to prepare my own bindings in the future.
For my initial experiment, I decided to use the PyMuPDF library in Python. I wrote a small library and have a separate file for each document. There is very little boilerplate. I ended up submitting the edited documents created using MuPDF, as they are even better than those created using Acrobat Pro!
My initial experiment is not fit for release, as it is just a quick hack, but here are some thoughts about it.
- I am a huge fan of plain text. When documents need to be revised over time, plain text source code is less error-prone than making changes in a GUI, and it can be managed using version control.
- Using a high-level programming language makes it possible to compute values as well as check assertions. Computed values are used directly, avoiding the need to worry about making typos while entering numbers into a GUI. (To be fair, I generally prepare files from which I copy-and-paste into Acrobat Pro in order to avoid such mistakes.)
- On a related note, I considered using TikZ and XeTeX but chose MuPDF and Python for two reasons. First, I sometimes need to edit existing content, not just overlay new content. Second, I find Python to be more ergonomic than TeX, making it easier to implement abstractions.
- The source code for each document contains Markdown comments that include my notes, links to references, and proofreading checklists. I convert them to Markdown with LiterateX, using a command like the following.
literatex --no-code -i doc.py -o doc.md
- The library implements a `PageContext` class that manages configuration and defines an API for all of the things that I might add to a page. I really want a `Reader` monad and `local`, but for now I implemented a context manager that at least allows me to scope configuration changes. For example, the following creates a context with a different font size configured, for use within the block.
with ctx.local(fontsize=8) as ctx8:
    ...
- The `PageContext` class has a `debug` setting. When set to `True`, all additions to the page are highlighted. This makes it very easy to see what has been added, and it also helps with placement of items during development.
- One major frustration with Acrobat Pro is maintaining consistency. For example, the size of fonts is not displayed, so you have to visually judge if different text has the same font size. This can be surprisingly difficult sometimes. With PyMuPDF, you specify font size in points, so it is trivial to confirm that different text has the same font size.
- Another major frustration with Acrobat Pro is alignment. I do not know of any way to exactly align items with that software. Trying to align text visually by dragging it with the mouse is slow and inexact. In my initial experiment, I can easily align added items with one another, which is a definite improvement. I am also able to center or right-align text within a bounding box, which is another definite improvement. I do not yet have a way to align new items with existing items, however. Perhaps I will implement a way to do that in a future iteration of the experiment.
- Placement is done by specifying coordinates. Perhaps I could use Emacs and `cart.el` to make this easier, but I am just determining coordinates via iterative updates for now.
- One document had a form select box that caused problems. I resolved that issue by simply deleting all of the form widgets from the page.
def delete_widgets(page):
    # Copy the widgets to a list first, since they are deleted while iterating.
    for widget in list(page.widgets()):
        page.delete_widget(widget)
That problematic form select box is used to select the era of a date. Upon deleting it, I discovered that it had been added on top of text that hard-coded the previous era. I added a feature to my library that allows content to “cover” existing content (using a white rectangle).
- The initial experiment uses `Page.insert_htmlbox` in order to do text alignment. The `PageContext` API provides simple options and generates the CSS. One issue that I ran into is that I am unable to configure usage of a built-in font. I wish that there were more documentation.
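The scoped configuration described above for `PageContext` can be sketched in a few lines. This is only a minimal sketch, not the actual library: the fields (`fontsize`, `fontname`, `debug`) and their default values are assumptions made for illustration, and only the `local` call mirrors the usage shown earlier. The context is immutable and `local` yields a modified copy, which is roughly the behavior that `local` provides for a `Reader` monad.

```python
from contextlib import contextmanager
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PageContext:
    """Hypothetical, simplified stand-in for the real PageContext class."""

    fontsize: float = 10.0       # assumed default, in points
    fontname: str = "sans-serif"  # assumed default
    debug: bool = False

    @contextmanager
    def local(self, **overrides):
        # Yield a modified copy. The original context is never mutated,
        # so the configuration change is scoped to the `with` block.
        yield replace(self, **overrides)


ctx = PageContext()
with ctx.local(fontsize=8) as ctx8:
    # Inside the block, use ctx8 for anything that needs the smaller font.
    small = ctx8.fontsize
# Outside the block, ctx still has its original configuration.
normal = ctx.fontsize
```

Because the dataclass is frozen, nested `local` calls compose without any risk of one block leaking settings into another.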
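Generating CSS from simple options for `Page.insert_htmlbox` might look something like the following. This is a hedged sketch under stated assumptions: `make_css` and `add_text` are hypothetical helper names, the option set (font size and horizontal alignment) is a guess at what "simple options" could mean, and the `page` argument is assumed to be a PyMuPDF page. Only `insert_htmlbox` and its `css` parameter come from PyMuPDF itself.

```python
from html import escape

ALIGNMENTS = ("left", "center", "right")


def make_css(fontsize, align="left"):
    """Generate CSS for a text box from simple options (hypothetical helper)."""
    if align not in ALIGNMENTS:
        raise ValueError(f"unknown alignment: {align!r}")
    return f"* {{ font-size: {fontsize}pt; text-align: {align}; margin: 0; }}"


def add_text(page, rect, text, fontsize=10, align="left"):
    """Insert text into the bounding box rect = (x0, y0, x1, y1)."""
    # Escape the text so that characters like "&" and "<" are safe in HTML.
    html = f"<p>{escape(text)}</p>"
    # Page.insert_htmlbox lays out the HTML inside rect using the given CSS.
    return page.insert_htmlbox(rect, html, css=make_css(fontsize, align))
```

Since the font size is written into the CSS in points, two calls with the same `fontsize` request the same size, and right-aligning or centering within a box is just a different `align` value.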
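The "cover" feature and the `debug` highlighting can both be approximated by drawing rectangles. The sketch below is an assumption about how that could be done, not the library's actual code: `cover` and `highlight` are hypothetical function names, the yellow color and 0.3 opacity are arbitrary choices, and `page` is assumed to be a PyMuPDF page, whose `draw_rect` method is the only real API used.

```python
WHITE = (1, 1, 1)
YELLOW = (1, 1, 0)


def cover(page, rect):
    """Hide existing content under rect with an opaque white rectangle."""
    page.draw_rect(rect, color=WHITE, fill=WHITE)


def highlight(page, rect, debug=False):
    """When debug is on, mark an added item with a translucent yellow box."""
    if debug:
        page.draw_rect(rect, color=YELLOW, fill=YELLOW, fill_opacity=0.3)
```

Note that a white rectangle only hides the old text visually. The covered content is still present in the file, which is fine for printed forms but is not a substitute for redaction.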
I am happy to have made so much progress after spending only a little time with MuPDF.