project.yml

title: "Parsing the _Jingdian Shiwen_"
description: |
  [![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://direct-phonology-jdsw-scriptsvisualize-0px83h.streamlit.app/)
  
  This project is an attempt to convert the annotations compiled by the Tang dynasty scholar [Lu Deming (陸德明)](https://en.wikipedia.org/wiki/Lu_Deming) in the [_Jingdian Shiwen_ (经典释文)](https://en.wikipedia.org/wiki/Jingdian_Shiwen) into a structured form that separates phonology, glosses, and references to secondary sources. A [spaCy](https://spacy.io/) pipeline is configured to parse and tag the annotations, and [prodigy](https://prodi.gy/) is used for guided annotation of the training data. The project is part of a broader effort to build a linguistic model of [Old Chinese (上古漢語)](https://en.wikipedia.org/wiki/Old_Chinese) that incorporates phonology.

  ## Data
  The _Jingdian Shiwen_ comprises Lu's annotations on most of the ["Thirteen Classics" (十三經)](https://en.wikipedia.org/wiki/Thirteen_Classics) of the Confucian tradition, as well as some Daoist texts. We use the edition of the _Jingdian Shiwen_ found in the [_Collectanea of the Four Categories_ (四部叢刊)](http://www.chinaknowledge.de/Literature/Poetry/sibucongkan.html), which includes high-quality lithographic reproductions of many ancient texts. The annotations given in the _Jingdian Shiwen_ are paired with the source texts to which they apply; for this we predominantly use the definitive (正文) editions published by the [Kanseki Repository](https://www.kanripo.org/).

  |work|title|source|_Jingdian Shiwen_ chapters (卷)|
  |-|-|-|-|
  |周易|[_Book of Changes_](https://en.wikipedia.org/wiki/I_Ching)|[KR1a0001](https://github.com/kanripo/KR1a0001)|2|
  |尚書|[_Book of Documents_](https://en.wikipedia.org/wiki/Book_of_Documents)|[KR1b0001](https://github.com/kanripo/KR1b0001)|3-4|
  |毛詩|[_Mao Commentary_](https://en.wikipedia.org/wiki/Mao_Commentary) on the [_Book of Odes_](https://en.wikipedia.org/wiki/Classic_of_Poetry)|[KR1c0001](https://github.com/kanripo/KR1c0001)|5-7|
  |周禮|[_Rites of Zhou_](https://en.wikipedia.org/wiki/Rites_of_Zhou)|[KR1d0001](https://github.com/kanripo/KR1d0001)|8-9|
  |儀禮|[_Etiquette and Ceremonial_](https://en.wikipedia.org/wiki/Etiquette_and_Ceremonial)|CH1e0873*|10|
  |禮記|[_Book of Rites_](https://en.wikipedia.org/wiki/Book_of_Rites)|[KR1d0052](https://github.com/kanripo/KR1d0052)|11-14|
  |春秋左傳|[_Commentary of Zuo_](https://en.wikipedia.org/wiki/Zuo_Zhuan) on the [_Spring and Autumn Annals_](https://en.wikipedia.org/wiki/Spring_and_Autumn_Annals)|[KR1e0001](https://github.com/kanripo/KR1e0001)|15-20|
  |春秋公羊傳|[_Commentary of Gongyang_](https://en.wikipedia.org/wiki/Gongyang_Zhuan) on the [_Spring and Autumn Annals_](https://en.wikipedia.org/wiki/Spring_and_Autumn_Annals)|CH1e0877*|21|
  |春秋穀梁傳|[_Commentary of Guliang_](https://en.wikipedia.org/wiki/Guliang_Zhuan) on the [_Spring and Autumn Annals_](https://en.wikipedia.org/wiki/Spring_and_Autumn_Annals)|[KR1e0008](https://github.com/kanripo/KR1e0008)|22|
  |孝經|[_Classic of Filial Piety_](https://en.wikipedia.org/wiki/Classic_of_Filial_Piety)|[KR1f0001](https://github.com/kanripo/KR1f0001)|23|
  |論語|[_Analects of Confucius_](https://en.wikipedia.org/wiki/Analects)|[KR1h0004](https://github.com/kanripo/KR1h0004)|24|
  |老子|[_Laozi_](https://en.wikipedia.org/wiki/Tao_Te_Ching)|[KR5c0057](https://github.com/kanripo/KR5c0057)|25|
  |莊子|[_Zhuangzi_](https://en.wikipedia.org/wiki/Zhuangzi_(book))|[KR5c0126](https://github.com/kanripo/KR5c0126)|26-28|

  *This data is sourced with permission from the [China Ancient Texts (CHANT) database](https://www.cuhk.edu.hk/ics/rccat/en/database.html).
  
  We omit chapter 1 of the _Jingdian Shiwen_, corresponding to the [_Erya_ (爾雅)](https://en.wikipedia.org/wiki/Erya). All digital sources have been preprocessed to remove punctuation, whitespace, and non-Chinese characters. Kanseki Repository data is generously licensed CC-BY.
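
  As an illustration of the preprocessing step described above, the following is a minimal sketch and not the project's actual script. It assumes that "Chinese characters" means code points in the CJK Unified Ideographs blocks listed in the comments; the name `strip_non_han` is hypothetical.

  ```python
  import re

  # Assumption: a "Chinese character" is any code point in the blocks below.
  # Everything else (punctuation, whitespace, Latin letters, digits) is removed.
  _NON_HAN = re.compile(
      "[^"
      "\u3400-\u4dbf"          # CJK Unified Ideographs Extension A
      "\u4e00-\u9fff"          # CJK Unified Ideographs
      "\uf900-\ufaff"          # CJK Compatibility Ideographs
      "\U00020000-\U0002ebef"  # Extensions B through F (supplementary planes)
      "]"
  )


  def strip_non_han(text: str) -> str:
      """Return `text` with every character outside the blocks above removed."""
      return _NON_HAN.sub("", text)
  ```

  Calling `strip_non_han` on a punctuated passage should return only its ideographs, in their original order. Whether rare or variant characters outside these blocks need to be preserved depends on the source edition.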
  
  After processing, the labeled output data is saved in JSON-lines (`.jsonl`) format, to be used for machine learning, natural language processing, and other computational applications.
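
  A JSON-lines file holds one JSON object per line, so it can be read with the standard library alone. The sketch below assumes each record follows Prodigy's span-task layout (a `text` string plus a `spans` list with `start`, `end`, and `label` keys); the project's final output schema may differ, and the helper names `read_jsonl` and `span_texts` are hypothetical.

  ```python
  import json
  from pathlib import Path
  from typing import Iterator


  def read_jsonl(path: Path) -> Iterator[dict]:
      """Yield one record per non-empty line of a JSON-lines file."""
      with path.open(encoding="utf-8") as f:
          for line in f:
              line = line.strip()
              if line:
                  yield json.loads(line)


  def span_texts(record: dict) -> list[tuple[str, str]]:
      """Return (label, covered text) pairs for a record.

      Assumes Prodigy-style character offsets in record["spans"].
      """
      text = record["text"]
      return [
          (span["label"], text[span["start"]:span["end"]])
          for span in record.get("spans", [])
      ]
  ```

  For example, `[span_texts(r) for r in read_jsonl(Path("data/spans.jsonl"))]` would list the labeled spans in the file written by the `export` command, provided that file uses the assumed layout.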

  ## Annotating
  To annotate training data, you need spaCy installed in your Python environment:
  ```sh
  pip install spacy
  ```
  You also need a copy of [prodigy](https://prodi.gy/). Once you have the appropriate wheel, install it with:
  ```sh
  # example: prodigy version 1.11.8 for python 3.10 on windows
  pip install prodigy-1.11.8-cp310-cp310-win_amd64.whl
  ```
  Then, fetch the project assets defined in this file:
  ```sh
  spacy project assets
  ```
  Install the Python dependencies needed for annotation:
  ```sh
  spacy project run install
  ```
  Then, choose a task (see "commands" below). Invoke it with e.g.:
  ```sh
  # annotate spans by correcting heuristic predictions
  spacy project run annotate-spans
  ```

# Variables can be referenced across the project.yml using ${vars.var_name}
vars:
  corpus: "annotations-large"
  embedding: "tok2vec" # tok2vec, trf
  suggester: "ngram" # ngram, span_finder
  transformer_model_name: "KoichiYasuoka/roberta-classical-chinese-base-char"
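  gpu_id: -1 # -1 runs on CPU; set to a GPU index (e.g. 0) to use a GPU
  # With the defaults above, the next two values resolve to
  # "configs/ngram/config_tok2vec.cfg" and "training/tok2vec_ngram/model-best".
  config_file: "configs/${vars.suggester}/config_${vars.embedding}.cfg"
  spancat_model: "training/${vars.embedding}_${vars.suggester}/model-best"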

# These are the directories that the project needs. The project CLI will make
# sure that they always exist.
directories: ["assets", "configs", "data", "metrics", "packages", "scripts", "training"]

# Assets that should be downloaded or available in the directory. Remote assets
# will be downloaded if not present locally.
assets:
  - dest: "assets/docs.csv"
    description: "Table mapping each chapter in a source text to its location in the _Jingdian Shiwen_"
    url: 
    
  - dest: "assets/variants.json"
    description: "Equivalency table for graphic variants of characters"
    url: 

  - dest: "assets/treebank"
    description: "Universal Dependencies treebank for Classical Chinese"
    git:
      repo: "https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto"
      branch: master
      path: ""

# Workflows are series of commands that are run in order and often depend on 
# each other.
# workflows: []

# Project commands, specified in a style similar to CI config files (e.g. Azure
# pipelines). The name is the command name that lets you trigger the command
# via "spacy project run [command] [path]". The help message is optional and
# shown when executing "spacy project run [optional command] [path] --help".
commands:
  - name: "install"
    help: "Install dependencies"
    script:
      - "python -m pip install -r requirements.txt"

  - name: "annotate-spans"
    help: "Annotate spans by correcting predictions based on heuristics"
    script:
      - "python -m prodigy jdsw.spans.correct spans assets/${vars.corpus}.jsonl -F scripts/recipes/spancat.py"

  - name: "export"
    help: "Export training data from prodigy's database for use with spaCy"
    script:
      - "python -m prodigy db-out spans data"
      - "python -m prodigy data-to-spacy data --lang zh --config ${vars.config_file} --spancat spans"
    deps:
      - "${vars.config_file}"
    outputs:
      - "data/spans.jsonl"
      - "data/dev.spacy"
      - "data/train.spacy"

  - name: "train"
    help: "Train a spaCy pipeline"
    script:
      - "python -m spacy train ${vars.config_file} --output training/${vars.embedding}_${vars.suggester}/ --paths.train data/train.spacy --paths.dev data/dev.spacy --gpu-id ${vars.gpu_id} --vars.transformer_model_name ${vars.transformer_model_name}"
    deps:
      - "${vars.config_file}"
      - "data/train.spacy"
      - "data/dev.spacy"
    outputs:
      - "${vars.spancat_model}"

  - name: "eval"
    help: "Evaluate the trained spaCy pipeline's accuracy and speed using test data"
    script:
      - "python -m spacy benchmark accuracy ${vars.spancat_model} data --gpu-id ${vars.gpu_id} --output metrics/accuracy.json"
      - "python -m spacy benchmark speed ${vars.spancat_model} data --gpu-id ${vars.gpu_id}"