A Spectre is Haunting Unicode


In 1978 Japan's Ministry of International Trade and Industry (the predecessor of today's Ministry of Economy, Trade and Industry) established the encoding that would later be known as JIS X 0208, which still serves as an important reference for all Japanese encodings. However, after the JIS standard was released people noticed something strange - several of the added characters had no obvious sources, and nobody could tell what they meant, how they should be pronounced, or where they had come from. These came to be known as the ghost characters (幽霊文字).

Be careful what you write. via the NDL

For a long time the ghost characters remained an unexplained and mostly forgotten curiosity, but in 1997 an investigation was launched to discover where they had come from. Every character in the JIS standard was supposed to have a record of its source, but where such a record existed at all it wasn't very specific, typically just naming the document the character was taken from.

You'd think that listing the source would make tracking down the origins of the characters easy, but it's important to clarify what counts as a "source" - one of the more common sources for the ghost characters was the "Overview of National Administrative Districts" (国土行政区画総覧), a comprehensive list of place names in Japan. You might, as I initially did, imagine this to be a kind of atlas, an oversize book with at most a few hundred pages. It turns out the latest edition is a seven-volume set, with each volume running to roughly nine hundred pages. Imagine tracking down a single character without a page reference.

Despite the difficulty, the investigation into the ghost characters was successful in discovering their origins - mostly. By interviewing the catalogers involved in the creation of the standard, the investigators established that some characters were inadvertently invented as mistakes in the cataloging process. For example, 妛 was an error introduced while trying to record "山 over 女". "山 over 女" occurs in the name of a particular place and was thus suitable for inclusion in the JIS standard, but because they couldn't print it as one character yet, 山 and 女 were printed separately, cut out, and pasted onto a sheet of paper, and then copied. When reading the copy, the line where the two little pieces of paper met looked like a stroke and was added to the character by mistake. The original character (𡚴) was not added to JIS or Unicode until much later and doesn't display on most sites for me.

The core ghost characters: 妛挧暃椦槞蟐袮閠駲墸壥彁

In the end only one character had neither a clear source nor any historical precedent: 彁. The most likely explanation is that it was created as a misreading of the 彊 character, but no specific incident was uncovered.

Following the general adoption of the JIS standards these characters all made their way into Unicode, which has its own separate set of ghost characters introduced during CJK unification.
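If you want to see where the ghosts live on your own machine, here is a minimal Python sketch that looks up each of the core ghost characters. The helper names (`describe`, `try_encode`) are my own, and I'm using Python's bundled `euc_jp` codec as a rough stand-in for "is in a JIS character set" - that codec also covers JIS X 0212, so treat it as an approximate check rather than a proof of JIS X 0208 membership.

```python
import unicodedata

# The core ghost characters listed in this post.
GHOSTS = "妛挧暃椦槞蟐袮閠駲墸壥彁"
# The character 妛 was supposed to be ("山 over 女").
INTENDED = "𡚴"


def try_encode(ch, codec):
    """Return the encoded bytes as hex, or None if the codec lacks the character."""
    try:
        return ch.encode(codec).hex()
    except UnicodeEncodeError:
        return None


def describe(ch):
    """Collect the Unicode code point, Unicode name, and EUC-JP bytes for one character."""
    return {
        "char": ch,
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "euc_jp": try_encode(ch, "euc_jp"),
    }


if __name__ == "__main__":
    for ch in GHOSTS + INTENDED:
        info = describe(ch)
        print(info["char"], info["codepoint"], info["name"], info["euc_jp"])
```

Unicode gives CJK ideographs purely mechanical names built from their code points, so the output contains no hint of meaning or reading for any of them. The last line is the intended form of 妛, which sits outside the Basic Multilingual Plane, while all twelve ghosts sit comfortably inside it.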

To sum up - in 1978 a series of small mistakes created some characters out of nothing. The errors went undiscovered just long enough to be set in stone, and now these ghosts are, at least in potential, a part of every computer on the planet, lurking in the dark corners of character tables.

At this rate they'll presumably be with humanity forever.

References / related links: