Fugashi, a Cython Wrapper for MeCab
In mid-October I released Fugashi, a Cython-based wrapper for MeCab. It's not the future of tokenization for Japanese, but it serves an important purpose now, and I think that in most cases it's the best choice for Japanese tokenization in Python today. This post explains why I wrote it and some of the decisions behind the API design.
So Long, mecab-python3
For many years my go-to Python wrapper for MeCab was mecab-python3. Unfortunately, it has an unresolved memory issue, recent changes released with no particular fanfare ignored user settings and caused confusion, and the package seems to be unmaintained. As a result, many projects that rely on mecab-python3 have pinned their version to 0.7, which was released in 2014. Even leaving aside maintenance issues, the only reason the code uses SWIG is that it's based on files distributed with MeCab that were last updated in 2013.
Poring over multiple versions of the SWIG documentation, I figured there had to be a better way, and that's when Fugashi development started.
The first thing on my mind in designing Fugashi was using it in spaCy. MeCab has a ton of features, but spaCy only requires UniDic support and node-based parsing. In fact, in my use of MeCab over the years I've rarely needed more than those features anyway, so I decided to support only the necessary parts of the MeCab API to start with.
That settled, there were a few issues with the MeCab API I wanted to fix. One was that node-based parsing returns a linked-list data structure. Working with this in C is completely unremarkable, but it's more awkward in Python, so it wasn't a hard decision to wrap that in a list.
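To make the idea concrete, here is a minimal pure-Python sketch of flattening a linked list of nodes into an ordinary list. The `Node` class and the `nodes_to_list` helper are hypothetical stand-ins for illustration, not MeCab's or Fugashi's actual types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parser node: a surface string plus a link."""
    surface: str
    next: Optional["Node"] = None


def nodes_to_list(head: Optional[Node]) -> List[Node]:
    """Walk the linked list once and collect every node into a Python list."""
    result = []
    node = head
    while node is not None:
        result.append(node)
        node = node.next
    return result


# Build a tiny three-node chain by hand, then flatten it
head = Node("麩", Node("菓子", Node("は")))
nodes = nodes_to_list(head)

# Once the nodes are in a list, ordinary Python idioms apply
surfaces = [n.surface for n in nodes]
```

With a plain list, indexing, slicing, `len()`, and comprehensions all work without any manual pointer chasing.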
Another change was wrapping the token features in a named tuple. MeCab dictionaries store token information as CSV data, and parsing returns the raw CSV string. MeCab provides a printf-like interface for formatting fields, but you still have to refer to them by number, which is tedious and error-prone. Since UniDic provides an official list of field names, it was easy to add those to Fugashi.
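As a rough sketch of the approach, the snippet below turns a raw CSV feature string into a named tuple. The field list is a shortened, illustrative subset, the sample feature string is made up, and `parse_features` is a hypothetical helper rather than Fugashi's actual code.

```python
import csv
from collections import namedtuple

# Illustrative subset of field names; a real UniDic release defines many more
FIELDS = ["pos1", "pos2", "pos3", "pos4", "cType", "cForm", "lForm", "lemma"]
Features = namedtuple("Features", FIELDS)


def parse_features(raw: str) -> Features:
    """Parse one raw CSV feature string into a named tuple."""
    # csv.reader copes with quoted fields that themselves contain commas
    row = next(csv.reader([raw]))
    # Pad rows that have fewer columns with None and ignore any extra columns
    row = (row + [None] * len(FIELDS))[:len(FIELDS)]
    return Features(*row)


# Made-up feature string in the same shape as the field list above
raw = "名詞,普通名詞,一般,*,*,*,カシ,菓子"
feat = parse_features(raw)

# Access by name instead of by position
lemma = feat.lemma
```

Since a named tuple is still a tuple, `feat[7]` keeps working for code that indexes by number, while `feat.lemma` is far easier to read.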
Smaller changes include a few convenience wrappers for things like telling whether a token is an unknown word (an unk), getting the whitespace between tokens, and getting the four-field UniDic part of speech tag as one item. None of these are very complicated, but they're the sort of thing that isn't clearly documented and that most people learn by trial and error over time. MeCab never made any convenience wrappers like this because it insisted on presenting a highly generic interface, but since I had a clear use case in mind it was easy to make friendly additions to the API.
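The sketch below shows roughly what such helpers can look like. The `Token` class and its attribute names are assumptions made for illustration and the sample tokens are made up; the only detail taken from MeCab itself is that its unknown-node status constant, MECAB_UNK_NODE, has the value 1.

```python
from dataclasses import dataclass
from typing import Tuple

UNK_STAT = 1  # value of MeCab's MECAB_UNK_NODE status constant


@dataclass
class Token:
    """Hypothetical token type, used only to illustrate the helpers."""
    surface: str
    features: Tuple[str, ...]
    stat: int = 0          # 0 is a normal (in-dictionary) node
    white_space: str = ""  # whitespace that preceded this token in the input

    @property
    def is_unk(self) -> bool:
        """True when the token was not found in the dictionary."""
        return self.stat == UNK_STAT

    @property
    def pos(self) -> str:
        """The first four feature fields joined into a single tag."""
        return ",".join(self.features[:4])


# Made-up tokens: one known word and one unknown word after a space
tokens = [
    Token("菓子", ("名詞", "普通名詞", "一般", "*", "*", "*")),
    Token("fugashi", ("名詞", "普通名詞", "一般", "*"), stat=UNK_STAT, white_space=" "),
]

# Keeping the whitespace makes it possible to rebuild the original text
text = "".join(t.white_space + t.surface for t in tokens)
unknown = [t.surface for t in tokens if t.is_unk]
```

Each helper is only a line or two, but having them named on the token saves every user from rediscovering the same details.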
As a result of these decisions, my changes to switch spaCy from using mecab-python3 to Fugashi actually reduced the total lines of code. In other applications I've found that the ability to put the results of tokenization right into list comprehensions has made my code clearer and easier to write, so I think this can be counted as a success.
When writing Fugashi I was only interested in supporting UniDic. I still think there's no need for IPADic support (see my article on tokenizer dictionaries for why), but I hadn't considered the possibility of supporting Korean. UniDic will probably remain the default, but if I can support Korean and add a generic dictionary interface without complicating the implementation too much, I might as well.
If you need Korean support now, must use an obsolete or unusual dictionary, or don't have a C compiler, I can recommend natto-py. It relies on the cffi project, which makes it easier to install but slower than Fugashi. In a basic tokenization benchmark I wrote, natto-py is roughly four times slower than Fugashi.
Ultimately, I think that the future of Japanese tokenization won't be using MeCab, and the sooner it comes the better. But at the time of writing MeCab is still the best tokenizer available for Japanese, and in Python Fugashi is typically the best wrapper you can use.
If you use Fugashi in your project, or need support for a MeCab feature not currently in Fugashi, I'd love to hear about it. Feel free to open an issue.