A Field Guide to Japanese Mojibake

This post is part of a collection on Japanese Language Technology.

When you open a document with an encoding different from the one it was created with, the original text can't be displayed, and a garbled mess of corrupted characters is shown instead. These are called "mojibake" in Japanese, and the word has also been borrowed into English.

While mojibake aren't readable by humans, it turns out that different kinds of mojibake have different visual textures, and with a little experience you can guess the original encoding of a document just by looking at it. This page is a visual guide to how mojibake look in different encodings commonly used for Japanese.

Besides the visual samples, a brief explanation of how each encoding has been used historically is included. At the end there are some technical notes about how this article was made to break as intended in your browser.

Uncorrupted sample text:

吾輩は猫である。名前はまだない。
エンコードの設定を間違えると文字が化けてしまう。
東京タワーの高さは333mです。
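
Outside a browser, the same kind of corruption can be reproduced by encoding text one way and decoding the bytes another. The following is a minimal Python sketch, not the method used to build this page. The function name `mojibake` and the choice of `errors="replace"` are assumptions made for illustration, and Python's codecs may not match a browser's decoders byte for byte.

```python
SAMPLE = "吾輩は猫である。名前はまだない。"

# Python codec names for the four encodings covered in this guide.
CODECS = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]


def mojibake(text, written_as, read_as):
    """Encode text with one codec, then decode the bytes with another."""
    raw = text.encode(written_as)
    # Undecodable bytes become U+FFFD instead of raising an exception.
    return raw.decode(read_as, errors="replace")


if __name__ == "__main__":
    for written_as in CODECS:
        for read_as in CODECS:
            if written_as != read_as:
                print(f"{written_as} read as {read_as}:")
                print("  " + mojibake(SAMPLE, written_as, read_as))
```

Pure ASCII text survives most of these combinations unchanged, which is why the samples here lean on kana and kanji.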

UTF-8

UTF-8 is the most common encoding on the web today. Its adoption has been slower in Japan than in other areas, but in recent years it has finally become prevalent.

Served as SJIS:

This combination is pretty common, and DailyPortalZ has an entertaining article where they find the meaning of the obscure characters that appear here. 繧繝 ungen in particular refers to a pattern of colored stripes used in textiles.
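
Why 繧 and 繝 in particular show up so often can be checked directly. In UTF-8, every hiragana and katakana character is three bytes, beginning with 0xE3 followed by 0x81, 0x82, or 0x83, and an SJIS decoder reads those first two bytes as a single kanji. Here is a small Python sketch, using Python's `shift_jis` codec as a stand-in for a browser's decoder:

```python
# The two-byte prefixes shared by all UTF-8 hiragana and katakana.
PREFIXES = [b"\xe3\x81", b"\xe3\x82", b"\xe3\x83"]

for prefix in PREFIXES:
    # Each prefix is a complete two-byte kanji when read as Shift JIS.
    print(prefix.hex(), "->", prefix.decode("shift_jis"))

# Katakana from the sample text, written as UTF-8 and misread as SJIS.
garbled = "エンコー".encode("utf-8").decode("shift_jis")
print(garbled)
```

The third byte of each kana often falls in the range SJIS uses for half-width katakana, which is why the kanji tend to alternate with half-width kana in this kind of mojibake.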

The colored strip at the bottom of this picture of Yoshimitsu Ashikaga is an ungen border on a tatami mat. The ungen border on a tatami was originally reserved for the Emperor and people of the highest rank, and today is commonly seen on hinadan doll sets used at Hinamatsuri. via Wikipedia.

In the 2021 anime Urasekai Picnic, this kind of mojibake is used to provoke a sense of foreboding.

Served as EUC-JP:

Served as ISO-2022-JP:

Shift JIS

Historically one of the most popular encodings, and at one time the main choice for Japanese web pages, Shift JIS (or SJIS) has largely been replaced by UTF-8 in modern applications. However, you're still likely to encounter it in legacy applications; in particular, email for feature phones (ガラケー) uses SJIS.

Served as UTF-8:

Served as EUC-JP:

Served as ISO-2022-JP:

Characters in SJIS are one or two bytes. ASCII characters are mostly carried over, but SJIS is not ASCII safe: the second byte of a two-byte character has no special prefix and can fall in the ASCII range, so it's not unusual for misread SJIS text to dissolve into a pile of unreadable escape characters.
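
To make the missing prefix concrete: the second byte of an SJIS character can be an ordinary ASCII value, and the well-known example is 0x5C, the backslash. The sketch below compares SJIS with EUC-JP, where every byte of a Japanese character has the high bit set. The helper name `ascii_bytes` is made up for illustration.

```python
def ascii_bytes(text, codec):
    """Return the bytes of the encoded text that fall in the ASCII range."""
    return [f"0x{b:02X}" for b in text.encode(codec) if b < 0x80]


for ch in "表ソ能":
    print(
        ch,
        ch.encode("shift_jis").hex(),
        "SJIS:", ascii_bytes(ch, "shift_jis"),
        "EUC-JP:", ascii_bytes(ch, "euc_jp"),
    )
```

Any program that scans SJIS bytes without tracking character boundaries can mistake that trailing 0x5C for an escape character.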

Perhaps the most common source of mojibake with SJIS is a small number of popular but unusual characters that aren't in the encoding. The standout among these is 髙, a variant of 高. Called "hashigo-daka" (ladder "taka") because of the connected lines, it's a common character in names, and some people consistently use the variant to write their names. 﨑 tatsu-saki, a variant of the common 崎 character where the upper-right is 立 (stand) instead of 大 (big), is a similar but slightly less common case.
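
Python's bundled codecs make this gap easy to see. Its `shift_jis` codec follows the JIS standard character set and rejects both variant characters, while its `cp932` codec (Microsoft's variant of SJIS) accepts them. This is a short sketch, and the helper name `encodable` is made up for illustration:

```python
def encodable(ch, codec):
    """Report whether a character can be encoded with the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for ch in "高髙崎﨑":
    print(ch, {codec: encodable(ch, codec) for codec in ("shift_jis", "cp932")})
```

Text that round-trips fine on one system can therefore fail or turn into a placeholder on another, depending on which flavor of SJIS each side assumes.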

SJIS is also at least partly responsible for standardizing the confusion over the backslash and the yen sign, which continues to affect Japanese locales to this day.

There are also a number of platform specific variations to SJIS, which you can read more about at Wikipedia.

EUC-JP

Originally developed for Unix systems, EUC-JP went out of style in much the same way as SJIS. It shares some base standards with ISO-2022-JP (introduced below), while having a simpler encoding system. Its usage was similar to SJIS but generally not as widespread.

Served as UTF-8:

Served as SJIS:

Served as ISO-2022-JP:

The main interesting pattern here is that many half-width katakana turn up when EUC-JP is interpreted as SJIS. This is because SJIS encodes half-width katakana in single bytes using non-ASCII byte values.
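
The effect is easy to reproduce. In EUC-JP, hiragana are two bytes starting with 0xA4, and for many of them both bytes land in 0xA1 to 0xDF, the range SJIS reserves for single-byte half-width katakana and punctuation. Here is a minimal Python sketch, with a made-up helper `count_halfwidth`:

```python
def count_halfwidth(text):
    """Count characters in the half-width katakana block (U+FF61 to U+FF9F)."""
    return sum(1 for ch in text if "\uff61" <= ch <= "\uff9f")


raw = "はまだ".encode("euc_jp")  # three hiragana, six bytes
garbled = raw.decode("shift_jis")
print(raw.hex(" "), "->", garbled)
print(count_halfwidth(garbled), "half-width characters")
```

Each hiragana here splits into two half-width characters, so the garbled text is twice as long as the original.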

ISO-2022-JP

ISO-2022-JP is a general-purpose encoding, but it ended up not widely used outside email, where it may still be encountered sometimes. The main reason it's suitable for email is that it's ASCII safe. However, the biggest problem with the encoding is that it has many minor variants which may not be supported equally on all platforms, so mail that looks fine on one machine may be partly a mess on another. Commonly used characters that cause problems include "circle numbers" like ①.

Served as UTF-8:

Served as SJIS:

Served as EUC-JP:

Notice anything odd about the samples above? They're all the same! That's a side effect of ASCII safety. Because there are no characters that are interpreted as escapes by other encodings, ISO-2022-JP looks the same for the three encodings used here.
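
This can be confirmed with a few lines of Python. ISO-2022-JP switches character sets with escape sequences such as ESC $ B and otherwise uses only 7-bit bytes, so any ASCII-compatible decoder produces the same string. The sketch uses Python's codecs in place of a browser's:

```python
text = "吾輩は猫である。"
raw = text.encode("iso2022_jp")

# Every byte is in the 7-bit ASCII range, including the escape sequences.
print(raw)
print(all(b < 0x80 for b in raw))

# Misreading the bytes with the other three encodings gives identical results.
views = {codec: raw.decode(codec) for codec in ("utf-8", "shift_jis", "euc_jp")}
print(len(set(views.values())) == 1)
```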

Technical Notes

It was harder than I thought it would be to create native mojibake like you should see in this article. Most technical guidance is about avoiding mojibake, so intentionally creating them, especially with many types in one place, turned out to be uncharted territory.

It's not possible for one part of a normal webpage, like a div, to have a different encoding from the rest of the page. But iframes don't have that restriction, so I used them. However, that was only half a solution. There are different ways to specify the encoding of a document, but the only one I found to be effective was changing the encoding headers returned by the webserver. I was able to modify those roughly following instructions from the W3C, with some modifications.

In particular, the AddType option didn't work for me; I had to use the following style:

<Files "example.html">
  ForceType 'text/html; charset=UTF-8'
</Files>
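
For local experiments without Apache, the same idea (a Content-Type header that declares a charset regardless of how the file was actually saved) can be approximated with Python's standard library. This is only a sketch and not how this site is served. The file names in `CHARSETS`, the handler class name, and the port are all hypothetical.

```python
import os
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Hypothetical mapping from file name to the charset to declare for it.
CHARSETS = {
    "utf8-as-sjis.html": "Shift_JIS",
    "sjis-as-eucjp.html": "EUC-JP",
}


def content_type_for(path):
    """Build a Content-Type value, adding a forced charset when one is configured."""
    charset = CHARSETS.get(os.path.basename(path))
    return f"text/html; charset={charset}" if charset else "text/html"


class ForcedCharsetHandler(SimpleHTTPRequestHandler):
    def guess_type(self, path):
        # Only override HTML files; everything else keeps the default type.
        if str(path).endswith(".html"):
            return content_type_for(str(path))
        return super().guess_type(path)


def serve(port=8000):
    """Serve the current directory until interrupted."""
    HTTPServer(("localhost", port), ForcedCharsetHandler).serve_forever()
```

Calling `serve()` from the directory holding the sample files should then send each listed file with its mismatched charset, much as the ForceType directive does.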

If you've found this guide of use, feel free to let me know, or check out my upcoming book Introduction to Japanese Natural Language Processing, which has a short section on mojibake as part of a wealth of other topics. You might also enjoy my article about ghost characters. Ψ

2021-10-31T12:24:53+09:00