LuoXiaoxi-cxq / Ancient-Chinese-Similar-Paragraphs


Find similar paragraphs in an ancient Chinese corpus. Also attempts to train semantically meaningful sentence embeddings for ancient Chinese.




Dependencies

Python 3.7

torch==1.10.0+cu113

transformers==4.28.1

pandas==1.3.4

numpy==1.21.4

tqdm==4.65.0

python_docx==0.8.11

scikit_learn==1.0.2

sentence_transformers==2.2.2

zhconv==1.4.3
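
To check that a local environment matches these pins, something along the following lines may help. It is a minimal sketch and not part of the repository: the helper names `normalize`, `parse_pins`, and `find_mismatches` are hypothetical, and the installed versions are passed in as a plain dict so that the comparison itself needs no third-party packages.

```python
import re
from typing import Dict, List, Tuple

# Pins copied from the dependency list above.
PINS = """
torch==1.10.0+cu113
transformers==4.28.1
pandas==1.3.4
numpy==1.21.4
tqdm==4.65.0
python_docx==0.8.11
scikit_learn==1.0.2
sentence_transformers==2.2.2
zhconv==1.4.3
"""


def normalize(name: str) -> str:
    """Normalise a package name in the PEP 503 way ('-', '_', '.' are equivalent)."""
    return re.sub(r"[-_.]+", "-", name.strip()).lower()


def parse_pins(text: str) -> Dict[str, str]:
    """Parse 'name==version' lines into {normalised_name: version}."""
    pins: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "==" not in line:
            continue  # skip blank lines and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[normalize(name)] = version.strip()
    return pins


def find_mismatches(
    pins: Dict[str, str], installed: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (name, wanted, found) for every pin that is missing or different."""
    installed = {normalize(k): v for k, v in installed.items()}
    mismatches = []
    for name, wanted in sorted(pins.items()):
        found = installed.get(name, "not installed")
        if found != wanted:
            mismatches.append((name, wanted, found))
    return mismatches
```

The `installed` dict could be filled from `importlib.metadata.version` on Python 3.8 or later; on Python 3.7, which is the version listed above, that module is not in the standard library, so `pkg_resources` from setuptools would be the older alternative.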

Project structure

./data/ contains all datasets used for training, evaluation, and clustering
./preprocess/
- make_ancient_modern_Chinese_parallel.py pre-processes the Ancient-Modern Chinese dataset
- make_traditional_simplified_character_parallel.py creates the Traditional-Simplified Character Parallel Corpus
- crawl_get_content.py crawls parallel sentence groups from ctext
- ctext_preprocess.py pre-processes the parallel sentence groups crawled from ctext
cluster.py uses the fine-tuned models to cluster ancient Chinese parallel sentences
eval.py evaluates the fine-tuned models on two metrics we defined
function.py defines helper functions
train.py fine-tunes the model on three datasets
plain_approach.py implements a plain algorithm to cluster ancient Chinese parallel sentences
./finetuned_model/ holds the fine-tuned models. To use them, or to run eval.py or evaluate.py, download the models from this pku disk link and put them in this directory.
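
The README does not say what the plain algorithm in plain_approach.py actually does. As a purely illustrative stand-in, the sketch below groups sentences by character-bigram Jaccard overlap using a greedy single pass. Every name here (`char_ngrams`, `jaccard`, `greedy_cluster`), the 0.3 threshold, and the sample sentences are assumptions made for the example, not the repository's code.

```python
from typing import List, Set

# Punctuation and whitespace that should not contribute to n-grams.
IGNORED = set("，。、；：？！ \t\n")


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of text, skipping IGNORED characters."""
    cleaned = "".join(ch for ch in text if ch not in IGNORED)
    if len(cleaned) < n:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(sentences: List[str], threshold: float = 0.3) -> List[List[int]]:
    """Group sentence indices greedily.

    Each sentence joins the first existing cluster whose seed (first member)
    has Jaccard similarity >= threshold; otherwise it starts a new cluster.
    """
    grams = [char_ngrams(s) for s in sentences]
    clusters: List[List[int]] = []
    for i, g in enumerate(grams):
        for cluster in clusters:
            if jaccard(g, grams[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical sample: two near-identical sentences and one unrelated sentence.
SAMPLE = [
    "學而時習之，不亦說乎",
    "學而時習之，不亦樂乎",
    "道可道，非常道",
]
```

With `SAMPLE`, the first two sentences share six of ten distinct bigrams and land in one cluster, while the third shares none and forms its own. An embedding-based variant could presumably keep the same greedy loop and swap the Jaccard score for cosine similarity between sentence embeddings.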
