Web crawler for creating personal copies of Japanese dictionaries

epistularum/jitenbot

 
 


jitenbot

Jitenbot is a program for scraping Japanese dictionary websites and compiling the scraped data into compact dictionary file formats.

Supported Dictionaries

Supported Output Formats

Examples

Jitenon Kokugo (web | yomichan)

jitenon_kokugo

Jitenon Yoji (web | yomichan)

yoji

Jitenon Kotowaza (web | yomichan)

kotowaza

Shinmeikai 8e (print | yomichan)

smk8

Daijirin 4e (print | yomichan)

daijirin2

Sanseidō 8e (print | yomichan)

sankoku8

Various (GoldenDict)

goldendict

Usage

usage: jitenbot [-h] [-p PAGE_DIR] [-m MEDIA_DIR] [-i MDICT_ICON]
                [--no-mdict-export] [--no-yomichan-export]
                [--validate-yomichan-terms]
                {jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2,sankoku8}

Convert Japanese dictionary files to new formats.

positional arguments:
  {jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2,sankoku8}
                        name of dictionary to convert

options:
  -h, --help            show this help message and exit
  -p PAGE_DIR, --page-dir PAGE_DIR
                        path to directory containing XML page files
  -m MEDIA_DIR, --media-dir MEDIA_DIR
                        path to directory containing media folders (gaiji,
                        graphics, audio, etc.)
  -i MDICT_ICON, --mdict-icon MDICT_ICON
                        path to icon file to be used with MDict
  --no-mdict-export     skip export of dictionary data to MDict format
  --no-yomichan-export  skip export of dictionary data to Yomichan format
  --validate-yomichan-terms
                        validate JSON structure of exported Yomichan
                        dictionary terms

See README.md for details regarding media directory structures
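The help text above can be approximated with Python's standard argparse module. The following is a minimal sketch reconstructed only from that help output, not taken from the project's source; the destination name `target` and the helper `build_parser` are assumptions made for illustration.

```python
import argparse

# Dictionary names copied from the usage text above.
TARGETS = [
    "jitenon-kokugo", "jitenon-yoji", "jitenon-kotowaza",
    "smk8", "daijirin2", "sankoku8",
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="jitenbot",
        description="Convert Japanese dictionary files to new formats.",
    )
    # "target" is a hypothetical name for the positional argument.
    parser.add_argument("target", choices=TARGETS,
                        help="name of dictionary to convert")
    parser.add_argument("-p", "--page-dir",
                        help="path to directory containing XML page files")
    parser.add_argument("-m", "--media-dir",
                        help="path to directory containing media folders")
    parser.add_argument("-i", "--mdict-icon",
                        help="path to icon file to be used with MDict")
    parser.add_argument("--no-mdict-export", action="store_true",
                        help="skip export of dictionary data to MDict format")
    parser.add_argument("--no-yomichan-export", action="store_true",
                        help="skip export of dictionary data to Yomichan format")
    parser.add_argument("--validate-yomichan-terms", action="store_true",
                        help="validate JSON structure of exported Yomichan terms")
    return parser


if __name__ == "__main__":
    # Example: convert smk8 from local folders, skipping the MDict export.
    args = build_parser().parse_args(
        ["smk8", "-p", "pages", "-m", "media", "--no-mdict-export"]
    )
    print(args.target, args.page_dir, args.media_dir, args.no_mdict_export)
```

Web targets presumably need only the dictionary name, since their pages are scraped rather than read from local folders; that reading is inferred from the sections below, not stated in the help text.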

Web Targets

Jitenbot will scrape the target website and save the pages to the user cache directory. As a courtesy to the website owners, jitenbot is configured to pause for 10 seconds between page requests. Consequently, a complete crawl of a target website may take several days.
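The crawl behavior described above (cache each page, wait between requests) can be sketched roughly as follows. This is an illustration under stated assumptions, not Jitenbot's actual implementation: the function names, the hashed cache file names, and the page count used in the estimate are all hypothetical.

```python
import hashlib
import time
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List


def fetch_url(url: str) -> bytes:
    """Download one page body using the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def crawl(
    urls: Iterable[str],
    cache_dir: Path,
    delay_seconds: float = 10.0,
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Save each URL to cache_dir, pausing after every real request."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Hypothetical naming scheme: one file per URL, keyed by its hash.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
        path = cache_dir / name
        if not path.exists():
            path.write_bytes(fetch(url))
            sleep(delay_seconds)  # cached pages cost no request and no wait
        saved.append(path)
    return saved


def estimated_days(page_count: int, delay_seconds: float = 10.0) -> float:
    """Lower bound on crawl time from the delay alone, in days."""
    return page_count * delay_seconds / 86400
```

With a 10-second delay, a hypothetical site of 50,000 pages would need about 5.8 days of waiting alone (`estimated_days(50_000)`), which is consistent with the "several days" figure above. Because pages already present in the cache are skipped, an interrupted crawl can be resumed without repeating requests.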

HTTP request headers (user agent string, etc.) may be customized by editing the config.json file created in the user config directory.
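As a rough illustration of how custom headers from a JSON config file could be applied to a request, consider the sketch below. The layout of Jitenbot's config.json is not documented here, so the key name `http-request-headers` and both helper functions are assumptions; check the generated file for the real structure.

```python
import json
import urllib.request
from pathlib import Path
from typing import Dict


def load_headers(config_path: Path,
                 key: str = "http-request-headers") -> Dict[str, str]:
    """Read a header mapping from a JSON config file.

    The key name is hypothetical; an absent key yields no custom headers.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    headers = config.get(key, {})
    return {str(name): str(value) for name, value in headers.items()}


def build_request(url: str, headers: Dict[str, str]) -> urllib.request.Request:
    """Create a request object carrying the custom headers."""
    return urllib.request.Request(url, headers=headers)
```

A config file containing `{"http-request-headers": {"User-Agent": "my-crawler/0.1"}}` would then produce requests that identify themselves with that user agent string.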

Monokakido Targets

These digital dictionaries are available for purchase through the Monokakido Dictionaries app on macOS/iOS. Ideally, Jitenbot would automatically fetch all the data it needs from this app's data directory [1] on your system. In its current state of development, however, Jitenbot requires you to find and assemble the necessary data yourself. The files must be organized into a particular folder structure (defined below) and then passed to Jitenbot via the corresponding command line arguments.

Some of the folders in the app's data directory [1] contain encoded files that must be decoded using golddranks' monokakido tool. These folders are indicated by a reference mark (※) in the notes below.

smk8 files

Since Yomichan does not support audio files from imported dictionaries, the audio/ directory may be omitted to reduce the size of the output ZIP file if desired.

.
├── media
│   ├── audio (※)
│   │   ├── 00001.aac
│   │   ├── 00002.aac
│   │   ├── 00003.aac
│   │   ├── ...
│   │   └── 82682.aac
│   ├── Audio.png
│   └── gaiji
│       ├── 1d110.svg
│       ├── 1d15d.svg
│       ├── 1d15e.svg
│       ├── ...
│       └── xbunnoa.svg
└── pages (※)
    ├── 0000000000.xml
    ├── 0000000001.xml
    ├── 0000000002.xml
    ├── ...
    └── 0000064581.xml

daijirin2 files

The graphics/ directory may be omitted to save space if desired.

.
├── media
│   ├── gaiji
│   │   ├── 1D10B.svg
│   │   ├── 1D110.svg
│   │   ├── 1D12A.svg
│   │   ├── ...
│   │   └── vectorOB.svg
│   └── graphics (※)
│       ├── 3djr_0002.png
│       ├── 3djr_0004.png
│       ├── 3djr_0005.png
│       ├── ...
│       └── 4djr_yahazu.png
└── pages (※)
    ├── 0000000001.xml
    ├── 0000000002.xml
    ├── 0000000003.xml
    ├── ...
    └── 0000182633.xml

sankoku8 files

.
├── media
│   ├── graphics
│   │   ├── 000chouchou.png
│   │   ├── ...
│   │   └── 888udatsu.png
│   ├── svg-accent
│   │   ├── アクセント.svg
│   │   └── 平板.svg
│   ├── svg-frac
│   │   ├── frac-1-2.svg
│   │   ├── ...
│   │   └── frac-a-b.svg
│   ├── svg-gaiji
│   │   ├── aiaigasa.svg
│   │   ├── ...
│   │   └── 異体字_西.svg
│   ├── svg-intonation
│   │   ├── 上昇下降.svg
│   │   ├── ...
│   │   └── 長.svg
│   ├── svg-logo
│   │   ├── denshi.svg
│   │   ├── ...
│   │   └── 重要語.svg
│   └── svg-special
│       └── 区切り線.svg
└── pages (※)
    ├── 0000000001.xml
    ├── ...
    └── 0000065457.xml
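
Before running Jitenbot on a Monokakido target, it can help to confirm that the assembled folder matches the layouts shown above. The following is a minimal sketch of such a check; `check_layout` is a hypothetical helper, not part of Jitenbot, and which media subfolders are strictly required is an assumption (the notes above say only that audio/ and graphics/ may be omitted).

```python
from pathlib import Path
from typing import List


def check_layout(root: Path, required_media: List[str]) -> List[str]:
    """Return a list of problems found under root; empty means it looks fine."""
    problems = []

    pages = root / "pages"
    if not pages.is_dir():
        problems.append("missing pages/ directory")
    elif not any(pages.glob("*.xml")):
        problems.append("pages/ contains no .xml files")

    media = root / "media"
    if not media.is_dir():
        problems.append("missing media/ directory")
    else:
        for name in required_media:
            if not (media / name).is_dir():
                problems.append(f"missing media/{name}/ directory")

    return problems


if __name__ == "__main__":
    # Example for smk8, treating gaiji/ as required and audio/ as optional.
    for problem in check_layout(Path("."), ["gaiji"]):
        print(problem)
```

If the check reports nothing, the pages and media folders can be passed to Jitenbot with the `-p` and `-m` options respectively.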

Attribution

Adobe-Japan1_sequences.txt is provided by The Adobe-Japan1-7 Character Collection.

The Yomichan term-bank schema definition dictionary-term-bank-v3-schema.json is provided by the Yomichan project.

Many thanks to epistularum for providing thoughtful feedback regarding the implementation of the MDict export functionality.

Footnotes

  1. /Library/Application Support/AppStoreContent/jp.monokakido.Dictionaries/Products/
