List of Japanese NLP tools
I haven’t tried all of these, so I don’t have comments on everything, but hopefully this list will be useful to someone.
Morphological analyzers/tokenizers
- Itadaki: a Japanese processing module for OpenOffice. I’ve done a tiny bit of work and issue documentation on a fork here, and someone forked that to work with a Japanese/German dictionary here.
- GoSen: A pure Java version of ChaSen that uses Sen as a base; it is part of Itadaki. See my previous post on where to download it from.
- MeCab: This page also contains a comparison of MeCab, ChaSen, JUMAN, and Kakasi.
- ChaSen
- JUMAN
- CaboCha: A dependency structure analyzer based on support vector machines; it builds on the output of a morphological analyzer such as MeCab.
- Gomoku
- Igo
- Kuromoji: Donated to Apache and used in Solr. Looks nice.
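Several of the analyzers above print one token per line, with the surface form, a tab, and a comma-separated feature list, followed by an `EOS` line. As a rough sketch of how that kind of output can be consumed, the snippet below parses a hard-coded sample. The sample text and the `Token` and `parse_analyzer_output` names are assumptions made for illustration; the exact fields depend on the analyzer and dictionary you use, so check the real output before relying on this.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    surface: str         # the token as it appears in the text
    pos: str             # first feature field, assumed to be the part of speech
    features: List[str]  # all comma-separated feature fields


# Hard-coded sample, assumed to follow a MeCab/IPADIC-style layout.
# Replace it with the actual output of whichever analyzer you run.
SAMPLE_OUTPUT = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ\n"
    "EOS\n"
)


def parse_analyzer_output(text: str) -> List[Token]:
    """Turn tab-separated analyzer output into a list of Token records."""
    tokens = []
    for line in text.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == "EOS":
            continue
        surface, _, feature_str = line.partition("\t")
        features = feature_str.split(",")
        tokens.append(Token(surface, features[0], features))
    return tokens


tokens = parse_analyzer_output(SAMPLE_OUTPUT)
for token in tokens:
    print(token.surface, token.pos)
```

Keeping the parsing separate from the analyzer call means the same function can be pointed at saved output from different tools when comparing them, as long as they share this line layout.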
Corpora
- Hypermedia Corpus
- TüBa-J/S: Japanese treebank from the University of Tübingen. Not as heavily annotated as I’d hoped. You have to send them an agreement to download it, but it’s free.
- GSK: Not free, but very cheap.
- LDC: Expensive unless your institution is a member.
Other lexical resources
- Kakasi: Gives readings for kanji compounds.
- WordNet: Still under development by NiCT. The sense numbers are cross-indexed with those in the English WordNet, so it could be useful for translation. Unlike the English WordNet, it has no verb frames.
- LCS Database: From Okayama University
- FrameNet: Unfortunately, it can only be browsed online.
- Chakoshi: Online collocation search engine.
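The WordNet entry above notes that sense numbers are cross-indexed with the English WordNet. A minimal sketch of why that helps with translation: if both languages key their word lists by the same sense identifier, looking up a Japanese word gives you English candidates for each of its senses. Everything in the snippet is toy data. The identifiers, word lists, and the `translation_candidates` function are made up for illustration and do not reflect the actual file format or API of either WordNet.

```python
from typing import Dict, List

# Toy data only: these sense IDs and word lists are invented for illustration.
JAPANESE_SENSES: Dict[str, List[str]] = {
    "sense-0001": ["犬", "イヌ"],
    "sense-0002": ["猫"],
}

ENGLISH_SENSES: Dict[str, List[str]] = {
    "sense-0001": ["dog", "domestic dog"],
    "sense-0002": ["cat"],
    "sense-0003": ["bank"],
}


def translation_candidates(japanese_word: str) -> List[str]:
    """Collect English words that share a sense ID with the Japanese word."""
    candidates: List[str] = []
    for sense_id, japanese_words in JAPANESE_SENSES.items():
        if japanese_word not in japanese_words:
            continue
        # The shared sense ID is what links the two languages.
        for english_word in ENGLISH_SENSES.get(sense_id, []):
            if english_word not in candidates:
                candidates.append(english_word)
    return candidates


print(translation_candidates("犬"))
```

A word with several senses would return candidates from each of them, so a real system would still need some form of sense disambiguation to pick the right one.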