Fast Japanese Tokenization with a Single Pip Install

2020-04-15T18:57:18.731+0900
This post is part of a collection on Natural Language Processing.

I just released a version of fugashi with support for installing UniDic directly through pip. With this release you can have a fully functional and fast Japanese tokenizer running after just one command. Here's how you can install it:

pip install fugashi[unidic-lite]

Note that this will take up roughly 250MB on disk after installation. Since wheels are provided for Linux, macOS, and 64-bit Windows, you shouldn't need a C compiler or anything else to get this working. Special thanks for this release go to Aki Ariga for help testing on Windows.

There are other packages you can install through pip that give you a working tokenizer in one command, like Janome, but their ease of use comes at the cost of speed, sometimes by orders of magnitude.

To fit the dictionary under PyPI's 60MB package size limit, I had to use an old version of UniDic from 2013. If you want to use the latest UniDic instead, that's an option too; it just takes one extra step:

pip install fugashi[unidic]
python -m unidic download

That will download the latest version of UniDic I've packaged, currently 2.3.0, which takes up 1GB on disk. Since that's too large for PyPI, the dictionary is distributed via GitHub Release artifacts, in a style similar to spacy-models.

If you have an open source machine learning project (or any other project) and would like to add Japanese support, or if you have Japanese support that's hard to get working, please feel free to contact me about improving it.

There are a few more convenience features I'd like to add to fugashi, like a command-line mode for environments where fugashi is installed but MeCab isn't, but for the most part I think it's ready for a 1.0 release. Going forward I'm looking into ways to move away from MeCab entirely. Hopefully there'll be progress on that front before too long. Ψ