This repository contains the source code for a script that generates a frequency dictionary for use with Yomichan. It uses data from the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and supports both short unit words (SUW) and long unit words (LUW). The generated dictionary does not contain part-of-speech information, as Yomichan does not currently support this.
- https://clrd.ninjal.ac.jp/bccwj/en/
- https://clrd.ninjal.ac.jp/bccwj/en/freq-list.html
- https://link.springer.com/article/10.1007/s10579-013-9261-0
This script uses a component from Yomichan's implementation, specifically the JapaneseUtil class from japanese-util.js. This file must be manually copied into the same directory as main.js in order for the script to work.
A Node.js script is used to generate the dictionary data:
node main.js path/to/bccwj-data.tsv ./output [long-unit-words] [min-frequency]
- [long-unit-words] (optional) - true if using the long unit words (LUW) list; false otherwise.
- [min-frequency] (optional) - Integer representing the minimum number of occurrences. Default is 0.
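The positional arguments described above could be interpreted roughly as in the following sketch. It is not taken from main.js: the function names `parseArgs` and `filterByFrequency`, the `frequency` field on each entry, and the choice of false as the default when [long-unit-words] is omitted are all assumptions made for illustration.

```javascript
// Hypothetical sketch only: main.js may parse its arguments differently.
// argv is expected to hold the values that follow "node main.js".
function parseArgs(argv) {
    // Assumption: long-unit-words falls back to 'false' when omitted.
    const [inputPath, outputDir, longUnitWords = 'false', minFrequency = '0'] = argv;
    if (!inputPath || !outputDir) {
        throw new Error('Usage: node main.js <input.tsv> <output-dir> [long-unit-words] [min-frequency]');
    }
    const min = Number.parseInt(minFrequency, 10);
    if (Number.isNaN(min) || min < 0) {
        throw new Error(`Invalid min-frequency: ${minFrequency}`);
    }
    return {
        inputPath,
        outputDir,
        longUnitWords: longUnitWords === 'true',
        minFrequency: min
    };
}

// Keeps only the entries that occur at least minFrequency times.
// The { term, frequency } entry shape is a placeholder, not the script's real data model.
function filterByFrequency(entries, minFrequency) {
    return entries.filter((entry) => entry.frequency >= minFrequency);
}

// In a real script the array would come from process.argv.slice(2).
const options = parseArgs(['path/to/bccwj-data.tsv', './output', 'true', '10']);
const sample = [
    {term: 'a', frequency: 3},
    {term: 'b', frequency: 10},
    {term: 'c', frequency: 250}
];
console.log(options, filterByFrequency(sample, options.minFrequency));
```

With a [min-frequency] of 0, the filter keeps every entry, which matches the documented default.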
The data can then be added to a .zip archive using any software. The example below uses the 7z command line executable to generate the archive:
7z a -tzip -mx=9 -mm=Deflate -mtc=off -mcu=on BCCWJ-SUW.zip ./output/*.json