- 2024/09/26: The WenMind Benchmark paper has been accepted by NeurIPS 2024.
WenMind is a comprehensive benchmark for evaluating Large Language Models (LLMs) on Chinese Classical Literature and Language Arts (CCLLA). It covers three sub-domains (Ancient Prose, Ancient Poetry, and Ancient Literary Culture) and comprises 4,875 question-answer pairs spanning 42 fine-grained tasks (see Figure 1), 3 question formats (Fill-in-the-Blank, Multiple-Choice, and Question-and-Answer), and 2 evaluation scenarios (domain-oriented and capability-oriented).
Figure 1: Overview of the WenMind Benchmark, which covers 3 sub-domains and 42 fine-grained tasks.
You can obtain the complete WenMind evaluation dataset from the WenMind Benchmark folder on GitHub.
{
"id": 2464,
"domain": "ancient literary culture",
"capability": "knowledge",
"question_format": "QA",
"coarse_grained_task_zh": "成语",
"coarse_grained_task_en": "idiom",
"fine_grained_task_zh": "成语解释",
"fine_grained_task_en": "idiom explanation",
"question": "解释下面成语的意思:\n暮去朝来",
"answer": "黄昏过去,清晨又到来。形容时光流逝。"
}
The following is an explanation of the various fields in the data samples:
- id: The unique identifier for the data sample, used to distinguish different samples.
- domain: The domain to which the data sample belongs, including ancient prose, ancient poetry and ancient literary culture.
- capability: The type of capability of the data sample, including knowledge, understanding and generation.
- question_format: The format of the question, indicating the type of question in the sample, including FB, MCQ and QA.
- coarse_grained_task_zh: The Chinese name of the coarse-grained task classification. Describes the coarse-grained task category of the sample, with a total of 26 categories.
- coarse_grained_task_en: The English name of the coarse-grained task classification. Corresponds to coarse_grained_task_zh, describing the coarse-grained task category of the sample, with a total of 26 categories.
- fine_grained_task_zh: The Chinese name of the fine-grained task classification. Describes the fine-grained task category of the sample, with a total of 42 categories.
- fine_grained_task_en: The English name of the fine-grained task classification. Corresponds to fine_grained_task_zh, describing the fine-grained task category of the sample, with a total of 42 categories.
- question: The actual content of the question. The question to be answered in the sample.
- answer: The answer to the corresponding question. Provides a detailed response to the question.
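The fields above lend themselves to simple filtering and counting. The following is a minimal sketch, assuming the dataset is a JSON array of records shaped like the sample shown earlier; the dataset file name is not assumed, so the sample record is embedded inline, and the helper names (`missing_fields`, `count_by`, `select`) are illustrative rather than part of the repository.

```python
import json
from collections import Counter

# One record in the format shown above. A real run would load the full list
# from the dataset file instead (its file name is not assumed here).
SAMPLE_JSON = """
[
  {
    "id": 2464,
    "domain": "ancient literary culture",
    "capability": "knowledge",
    "question_format": "QA",
    "coarse_grained_task_zh": "成语",
    "coarse_grained_task_en": "idiom",
    "fine_grained_task_zh": "成语解释",
    "fine_grained_task_en": "idiom explanation",
    "question": "解释下面成语的意思:\\n暮去朝来",
    "answer": "黄昏过去,清晨又到来。形容时光流逝。"
  }
]
"""

# The ten fields documented in the list above.
REQUIRED_FIELDS = {
    "id", "domain", "capability", "question_format",
    "coarse_grained_task_zh", "coarse_grained_task_en",
    "fine_grained_task_zh", "fine_grained_task_en",
    "question", "answer",
}


def missing_fields(sample):
    """Return the documented fields that are absent from one record."""
    return sorted(REQUIRED_FIELDS - set(sample))


def count_by(samples, field):
    """Count records per value of a field, e.g. per domain or capability."""
    return Counter(sample[field] for sample in samples)


def select(samples, **criteria):
    """Keep only the records whose fields equal all the given values."""
    return [
        sample for sample in samples
        if all(sample.get(key) == value for key, value in criteria.items())
    ]


samples = json.loads(SAMPLE_JSON)
print(missing_fields(samples[0]))
print(count_by(samples, "domain"))
print([s["id"] for s in select(samples, capability="knowledge", question_format="QA")])
```

Grouping by `domain` or by `capability` in this way mirrors the two evaluation scenarios (domain-oriented and capability-oriented) described above.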
- Task Description: Correct the word order of inverted sentences.
- Capability: Understanding
- Scale: 18
- Task Description: Supply the omitted information in an elliptical sentence.
- Capability: Understanding
- Scale: 32
- Task Description: Identify the inversion type of inverted sentences.
- Capability: Understanding
- Scale: 7
- Task Description: Identify the sentence's syntactic type.
- Capability: Understanding
- Scale: 43
- Task Description: Translate classical Chinese into modern Chinese.
- Capability: Understanding
- Scale: 200
- Task Description: Translate modern Chinese into classical Chinese.
- Capability: Understanding
- Scale: 200
- Task Description: Extract named entities from Classical Chinese sentences.
- Capability: Understanding
- Scale: 200
- Task Description: Add punctuation to Classical Chinese sentences.
- Capability: Understanding
- Scale: 200
- Task Description: Select theme categories based on Classical Chinese sentences.
- Capability: Understanding
- Scale: 200
- Task Description: Explain the words and phrases in Classical Chinese sentences.
- Capability: Understanding
- Scale: 100
- Task Description: Read Classical Chinese texts and answer related questions.
- Capability: Understanding
- Scale: 100
- Task Description: Explain the usage of function words in Classical Chinese sentences.
- Capability: Understanding
- Scale: 100
- Task Description: Identify whether a character is a homophone.
- Capability: Understanding
- Scale: 200
- Task Description: Distinguish between different meanings of the same character.
- Capability: Understanding
- Scale: 200
- Task Description: Write in Classical Chinese.
- Capability: Generation
- Scale: 100
- Task Description: Answer appreciation questions based on ancient poetry.
- Capability: Understanding
- Scale: 150
- Task Description: Conduct a free and detailed analysis of ancient poetry.
- Capability: Understanding
- Scale: 100
- Task Description: Compose a poem based on the theme.
- Capability: Generation
- Scale: 30
- Task Description: Compose a ci based on the theme.
- Capability: Generation
- Scale: 50
- Task Description: Compose a qu based on the theme.
- Capability: Generation
- Scale: 20
- Task Description: Provide the complete text of an ancient poem given its title and author.
- Capability: Knowledge
- Scale: 200
- Task Description: Provide the title and author of an ancient poem given its content.
- Capability: Knowledge
- Scale: 200
- Task Description: Write the next line of an ancient poem given the previous line.
- Capability: Knowledge
- Scale: 100
- Task Description: Write the previous line of an ancient poem given the next line.
- Capability: Knowledge
- Scale: 100
- Task Description: Provide ancient poetry sentences that meet the requirements.
- Capability: Knowledge
- Scale: 30
- Task Description: Judge the genre of ancient poetry.
- Capability: Knowledge
- Scale: 120
- Task Description: Translate ancient poetry into modern Chinese.
- Capability: Understanding
- Scale: 200
- Task Description: Judge the sentiment contained in ancient poetry.
- Capability: Understanding
- Scale: 200
- Task Description: Translate ancient poetry into English.
- Capability: Understanding
- Scale: 50
- Task Description: Provide a detailed introduction of the poet.
- Capability: Knowledge
- Scale: 110
- Task Description: Provide the meanings of the imagery.
- Capability: Knowledge
- Scale: 185
- Task Description: Write the second line of a couplet given the first line.
- Capability: Generation
- Scale: 100
- Task Description: Write a couplet based on the theme.
- Capability: Generation
- Scale: 100
- Task Description: Write HengPi based on the content of a couplet.
- Capability: Generation
- Scale: 100
- Task Description: Provide the synonym for the idiom.
- Capability: Knowledge
- Scale: 100
- Task Description: Provide the source of the idiom.
- Capability: Knowledge
- Scale: 100
- Task Description: Extract idioms from ancient Chinese sentences and provide their meanings.
- Capability: Knowledge
- Scale: 100
- Task Description: Provide the meaning of idioms.
- Capability: Knowledge
- Scale: 100
- Task Description: Guess the answer based on clues or clever hints.
- Capability: Knowledge
- Scale: 100
- Task Description: Complete the second half of the proverb based on the first half.
- Capability: Knowledge
- Scale: 100
- Task Description: Answer questions about ancient Chinese phonetics and rhymes.
- Capability: Knowledge
- Scale: 100
- Task Description: Answer questions about Sinology.
- Capability: Knowledge
- Scale: 130
The construction pipeline of WenMind includes data collection and data processing, as illustrated in Figure 2.
Figure 2: Construction pipeline of WenMind Benchmark.
Table 1 provides the statistics of the WenMind dataset.
Table 1: The statistics of the WenMind Benchmark. "Q" represents "Question" and "A" represents "Answer".
Domain | Tasks | #Q | Max. #Q | Min. #Q | Avg. Q Tokens | Avg. A Tokens |
---|---|---|---|---|---|---|
Ancient Prose | 15 | 1,900 | 200 | 7 | 107.51 | 62.12 |
Ancient Poetry | 16 | 1,845 | 200 | 20 | 73.42 | 94.93 |
Ancient Literary Culture | 11 | 1,130 | 100 | 100 | 26.68 | 14.26 |
Overall | 42 | 4,875 | 200 | 7 | 75.87 | 63.44 |
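As a quick arithmetic check, the "Overall" row of Table 1 appears consistent with weighting each domain's average token count by its number of questions. The sketch below recomputes it from the three domain rows; the weighting scheme is an inference from the numbers in the table, not something stated elsewhere in this document.

```python
# (domain, number of questions, avg. question tokens, avg. answer tokens),
# copied from the three domain rows of Table 1.
ROWS = [
    ("Ancient Prose", 1900, 107.51, 62.12),
    ("Ancient Poetry", 1845, 73.42, 94.93),
    ("Ancient Literary Culture", 1130, 26.68, 14.26),
]


def weighted_mean(values, weights):
    """Mean of `values`, each weighted by the matching entry of `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


counts = [row[1] for row in ROWS]
total_q = sum(counts)
overall_q = round(weighted_mean([row[2] for row in ROWS], counts), 2)
overall_a = round(weighted_mean([row[3] for row in ROWS], counts), 2)

print(total_q, overall_q, overall_a)  # 4875 75.87 63.44
```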
For open-source models, we perform inference locally, only requiring the model path and the output file path for the answers.
--model_path The path to the model, defaults to loading from huggingface
--output_path The file path for the model's answer output, defaults to {model_name}_result.json
e.g.
CUDA_VISIBLE_DEVICES=0,1 python Evaluation_Code/Inference/Test_Baichuan2-7B-Chat.py \
--model_path baichuan-inc/Baichuan2-7B-Chat \
--output_path Baichuan2-7B-Chat_result.json
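The two options above could be parsed as in the following sketch. It is not the code of the repository's inference scripts: the function name `parse_args`, the use of the Baichuan2 model ID as the default model path, and the way the model name is derived from the last path component are all assumptions made for illustration.

```python
import argparse
from pathlib import PurePosixPath


def parse_args(argv=None):
    """Parse --model_path and --output_path (illustrative sketch only)."""
    parser = argparse.ArgumentParser(description="Local inference arguments")
    parser.add_argument(
        "--model_path",
        default="baichuan-inc/Baichuan2-7B-Chat",  # assumed default: a Hugging Face model ID
        help="Local directory or Hugging Face model ID",
    )
    parser.add_argument(
        "--output_path",
        default=None,
        help="Answer file; defaults to {model_name}_result.json",
    )
    args = parser.parse_args(argv)
    if args.output_path is None:
        # Assumption: the model name is the last component of the model path.
        model_name = PurePosixPath(args.model_path.rstrip("/")).name
        args.output_path = f"{model_name}_result.json"
    return args


args = parse_args(["--model_path", "baichuan-inc/Baichuan2-7B-Chat"])
print(args.output_path)  # Baichuan2-7B-Chat_result.json
```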
For GPT-3.5 and GPT-4 models, provide two parameters: api_base and api_key.
For ERNIE-3.5 and ERNIE-4.0 models, provide two parameters: api_key and secret_key.
For Spark models, provide three parameters: api_key, secret_key, and appid.
Refer to the official documentation of each API model for details.
e.g.
python Test_ERNIE-3.5-8K-0329.py \
--API_KEY {api_key} \
--SECRET_KEY {secret_key} \
--output_path {output_path}
Step 1: Check whether the LLM response file is consistent with the format of the JSON/LLM_Response_Examples.json file.
Step 2: Open the Evaluation_Code/LLM_Scoring.py file, input the API_KEY and SECRET_KEY for the scoring model ERNIE-3.5, replace LLM_response_path with the storage path of the LLM response file, replace LLM_score_path with the path where the scoring results will be saved, and replace LLM_prompt_path with the storage path of JSON/Task_Score_Prompt.json.
Step 3: Run the following command to obtain the scoring results:
python Evaluation_Code/LLM_Scoring.py
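The format check in Step 1 can be partly automated. The sketch below assumes that both the response file and the example file are JSON arrays of objects and treats "consistent" as "every record has the same keys as the first example record"; the actual layout of JSON/LLM_Response_Examples.json is not described here, so the keys in the demo (`id`, `question`, `response`) are hypothetical.

```python
import json


def load_json(path):
    """Load a UTF-8 JSON file (assumed to contain a list of objects)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_format_mismatches(records, example_records):
    """Return (index, missing_keys, unexpected_keys) for each record whose
    keys differ from those of the first example record."""
    expected = set(example_records[0])
    problems = []
    for index, record in enumerate(records):
        keys = set(record)
        if keys != expected:
            problems.append((index, sorted(expected - keys), sorted(keys - expected)))
    return problems


# Hypothetical in-memory data standing in for the two files.
example = [{"id": 1, "question": "...", "response": "..."}]
responses = [
    {"id": 1, "question": "q1", "response": "r1"},
    {"id": 2, "question": "q2"},  # lacks the "response" key
]
print(find_format_mismatches(responses, example))  # [(1, ['response'], [])]
```

With real files, the two arguments would come from `load_json` applied to the response file and to JSON/LLM_Response_Examples.json; an empty result means the key sets match.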
Step 1: Check whether the scoring file is consistent with the format of the JSON/LLM_Score_Examples.json file.
Step 2: Open the Evaluation_Code/Calculate_Score.py file and replace LLM_score_path with the storage path of the scoring file.
Step 3: Run the following command to obtain the model's score:
python Evaluation_Code/Calculate_Score.py
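To illustrate how per-sample scores could be rolled up for the domain-oriented and capability-oriented scenarios, here is a minimal sketch. It is not necessarily what Calculate_Score.py does: it assumes each scored record keeps its `domain` and `capability` fields and carries a numeric field named `score` (a hypothetical name), it uses a plain arithmetic mean, and the records in the demo are made up.

```python
from collections import defaultdict


def mean_score_by(records, field, score_field="score"):
    """Average the (hypothetical) score field over records grouped by `field`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of scores, count]
    for record in records:
        bucket = totals[record[field]]
        bucket[0] += record[score_field]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Made-up records for illustration only; these are not benchmark results.
scored = [
    {"domain": "ancient prose", "capability": "understanding", "score": 80},
    {"domain": "ancient poetry", "capability": "generation", "score": 60},
    {"domain": "ancient poetry", "capability": "knowledge", "score": 90},
    {"domain": "ancient literary culture", "capability": "knowledge", "score": 70},
]

by_domain = mean_score_by(scored, "domain")
by_capability = mean_score_by(scored, "capability")
print(by_domain)
print(by_capability)
```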
Table 2: Results of all evaluated models on different domains and capabilities.
- SCUT-C2MChn
- WYWEB
- Daizhige
- ACLUE
- Websites-A Related to Ancient Poetry
- Websites-B Related to Ancient Poetry
- Sou Yun
- THU-FSPC
- Han Dian
This work is licensed under the MIT License.
The WenMind benchmark is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.