An Overview of Japanese Tokenizer Dictionaries

2019-10-15T17:39:46+09:00

There are a number of dictionaries available for tokenizing Japanese; as of 2019 you should generally use UniDic, though it's worth understanding what the differences between the dictionaries are. All of these work with MeCab.

JumanDic

JumanDic is one of the dictionaries included with MeCab. The bundled version is very old, and it's not entirely clear when it was taken from upstream. Juman itself is a tokenizer that predates MeCab; it used to be distributed commercially, but since around 2012 it has been available for free from Kyoto University. The last update there is from 2014, though the version of the dictionary in MeCab seems to be from 2012.

IPADic

IPADic is the other dictionary distributed with MeCab, and apart from UniDic it is the most popular dictionary. The IPA is the Information-technology Promotion Agency, a government-operated corporation. (Note: the odd capitalization is not an error.) The dictionary was originally constructed for ChaSen, a predecessor to MeCab by the same author, using a corpus and part-of-speech tagset developed by the IPA, with some modifications. The source paper for the IPA tagset is referred to as THiMCO97, though I haven't been able to track down that reference.

IPADic hasn't been updated since at least 2007, but some people prefer it to UniDic because the tokenization is more coarse-grained and often less surprising. For example, UniDic breaks 図書館 "library" into two words, 図書 "document" and 館 "building", while IPADic treats it as one word.
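If you want to check this kind of difference yourself, one way is to run the same text through MeCab once per dictionary and compare the surface forms. What follows is a minimal sketch, not a definitive recipe: it assumes the `mecab` command is on your PATH, that each dictionary is compiled as UTF-8, and that you pass in the dictionary directory yourself (the function names are made up for this example). It uses MeCab's `-d` option to pick the dictionary and reads only the part of each output line before the tab, since the feature columns differ between dictionaries.

```python
import shutil
import subprocess


def mecab_command(dicdir):
    """Build a MeCab command line that loads the dictionary in dicdir."""
    return ["mecab", "-d", dicdir]


def parse_surfaces(mecab_output):
    """Extract surface forms from MeCab's default output.

    Each token line looks like "surface<TAB>features"; a sentence ends
    with a line containing only "EOS".
    """
    surfaces = []
    for line in mecab_output.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, _features = line.partition("\t")
        surfaces.append(surface)
    return surfaces


def tokenize(text, dicdir):
    """Tokenize text with the mecab CLI using the given dictionary directory."""
    if shutil.which("mecab") is None:
        raise RuntimeError("mecab executable not found on PATH")
    result = subprocess.run(
        mecab_command(dicdir),
        input=text,
        capture_output=True,
        encoding="utf-8",  # assumes a UTF-8 dictionary
        check=True,
    )
    return parse_surfaces(result.stdout)


# Hand-written sample in MeCab's output shape; the feature column is a
# placeholder, not real dictionary output.
sample = "図書\t<features>\n館\t<features>\nEOS\n"
print(parse_surfaces(sample))
```

Calling `tokenize("図書館", dicdir)` with the directory of each installed dictionary in turn should let you see the split described above, though where those directories live depends on how you installed them.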

NAIST-jdic

NAIST-jdic was a project started to replace IPADic when it was discovered IPADic had potential licensing issues. Not much later the licensing issues were resolved, so NAIST-jdic never achieved wide usage, though you may still see it referred to in old documents. Judging by the releases on their OSDN site, the project was active from 2008 to 2011.

UniDic

UniDic is a dictionary developed and maintained by NINJAL, the National Institute for Japanese Language and Linguistics, who also supervise Universal Dependencies for Japanese. While the status of maintenance was poor for many years, in the last year or so it's gotten a proper homepage and regular, if small, updates.

UniDic also offers dictionaries for spoken and historical language, so you can use the same tools that work with modern written Japanese on those if you need to.

One issue with UniDic is that the size ballooned over the past few versions. v2.1.2, which was the latest version for many years, was 135MB zipped. v2.2.0 was 439MB zipped, and the most recent v2.3.0 is 2.2GB. It's not clear that the large increase in size translates to any overall improvement in tokenization quality, and NINJAL doesn't have release notes that clarify what changed.
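To put numbers on that growth, here's a quick back-of-the-envelope calculation using the zipped sizes quoted above. The only assumption is treating 2.2GB as 2200MB.

```python
# Zipped sizes in MB as quoted above; 2.2GB is taken as 2200MB (an assumption).
sizes_mb = {"v2.1.2": 135, "v2.2.0": 439, "v2.3.0": 2200}


def growth_factors(sizes):
    """Return (previous, current, ratio) for each consecutive pair of versions."""
    versions = list(sizes)
    return [(a, b, sizes[b] / sizes[a]) for a, b in zip(versions, versions[1:])]


for prev, cur, ratio in growth_factors(sizes_mb):
    print(f"{prev} -> {cur}: {ratio:.1f}x")

overall = sizes_mb["v2.3.0"] / sizes_mb["v2.1.2"]
print(f"overall: {overall:.1f}x")
```

That works out to a bit over 3x from v2.1.2 to v2.2.0, about 5x from v2.2.0 to v2.3.0, and roughly 16x overall.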

Another issue with UniDic is that the tokenization is usually more fine-grained than people expect and often seemingly inconsistent. I touched on the fine-grainedness in the IPADic section; an example of inconsistency is that いちご狩り "strawberry picking" is one word, while きのこ狩り "mushroom picking" is two words. UniDic has clearly defined rules that explain this behavior, so it's consistent in that respect, but the rationale for the rules is not clear to a typical user (myself included) and frankly this is a poor decision I hope they will change in the future.

Despite the recent challenges, since UniDic is the only dictionary actively and manually maintained, and since the maintainers oversee the Universal Dependencies project for Japanese, it's the best choice for a base dictionary at the time of writing.

Neologd

Neologd is a project to build a large tokenizer dictionary by using various web sources. The version using IPADic as a base is the most popular, though there is also a UniDic-based version. The name comes from "neologism dictionary."

Sometimes when people hear about Neologd they assume it will solve all their problems, but it has a number of issues (many helpfully listed in the README) that limit the situations where it's the right choice.

  • the increased dictionary size makes memory usage significant (at least 1.5GB)
  • very coarse-grained tokenization
  • entries are not manually verified, resulting in mixed quality
  • sources for new dictionary entries are not clear

Coarse-grained tokenization would be an important issue even if the quality in Neologd were perfect. Neologd tokenizes titles as a single word, so if a document contains "銀河鉄道の夜" and you search for "銀河鉄道", you won't get any hits with a typical tf-idf style index. This makes it unsuitable for use as the sole tokenization dictionary for a search engine. This has been noted in presentations by the creator but isn't emphasized in the README.
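To see why this happens, here's a toy inverted index in plain Python. It's only a sketch: the two "tokenizers" are hard-coded lookup tables standing in for MeCab, and the fine-grained segmentation shown is an illustrative assumption rather than the actual output of any particular dictionary. The point is just that an index matches on whole tokens, so a query token that never appears as a token in the document can't match.

```python
from collections import defaultdict


def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query, tokenize):
    """Return ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


# Stand-ins for real tokenizers (hypothetical segmentations).
COARSE = {"銀河鉄道の夜": ["銀河鉄道の夜"], "銀河鉄道": ["銀河鉄道"]}
FINE = {"銀河鉄道の夜": ["銀河", "鉄道", "の", "夜"], "銀河鉄道": ["銀河", "鉄道"]}


def coarse_tokenize(text):
    return COARSE[text]


def fine_tokenize(text):
    return FINE[text]


docs = {1: "銀河鉄道の夜"}

coarse_hits = search(build_index(docs, coarse_tokenize), "銀河鉄道", coarse_tokenize)
fine_hits = search(build_index(docs, fine_tokenize), "銀河鉄道", fine_tokenize)

print(coarse_hits)  # empty: the title was indexed as one token
print(fine_hits)    # contains document 1
```

With the coarse tokenizer the only token in the index is the full title, so the shorter query finds nothing; with the finer one, both query tokens are in the index and the document is returned.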

Neologd is useful if you're doing analysis of existing data, where coarse-grained tokenization is great for concept extraction, though you still have to watch out for quality issues.

Summary

Use UniDic - it's actively maintained and seems to be heading in a mostly good direction. Unlike tokenizers, where the long abandonment of MeCab has led to lots of new developments, there aren't many efforts to build new dictionaries from scratch. That said, there are a few efforts, usually tied to individual tokenizer projects, which I'll discuss in another post. Ψ