This is an evaluation harness for the benchmark described in T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step.
[Paper] [Project Page] [LeaderBoard] [HuggingFace]
Large language models (LLMs) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet how to evaluate and analyze the tool utilization capability of LLMs is still under-explored. In contrast to previous works that evaluate models holistically, we decompose tool utilization into multiple sub-processes: instruction following, planning, reasoning, retrieval, understanding, and review. Based on that, we introduce T-Eval to evaluate the tool-utilization capability step by step. T-Eval disentangles the tool utilization evaluation into several sub-domains along model capabilities, facilitating an understanding of both the holistic and the isolated competencies of LLMs. We conduct extensive experiments on T-Eval and in-depth analysis of various LLMs. T-Eval not only exhibits consistency with the outcome-oriented evaluation but also provides a more fine-grained analysis of the capabilities of LLMs, offering a new perspective on evaluating LLM tool-utilization ability.
- [2024.02.22] Release new data and a 1/5 subset (both Chinese and English), plus code for faster inference! The leaderboard will be updated soon! We also provide template examples for reference.
- [2024.01.08] Release ZH Leaderboard and ZH data, where the questions and answer formats are in Chinese.
- [2023.12.22] Paper available on ArXiv.
- [2023.12.21] Release the test scripts and data for T-Eval.
- Support Batch Inference. NOTE: Some models (ChatGLM, Qwen, InternV1) do not support batch inference.
- Change the role of function response from system to function.
- Merge consecutive same role conversations.
- Provide template configs for open-sourced models.
- Provide dev set for T-Eval, reducing the evaluation time.
- Optimize the inference pipeline of HuggingFace models provided by Lagent, making it 3x faster. (Please upgrade Lagent to v0.2)
- Support inference on Opencompass.
NOTE: These TODOs will be started after 2024.2.1. Thanks for your patience!
$ git clone https://github.com/open-compass/T-Eval.git
$ cd T-Eval
$ pip install -r requirements.txt
$ git clone https://github.com/InternLM/lagent.git
$ cd lagent && pip install -e .
We support both API-based models and HuggingFace models via Lagent.
We provide the test data for download via both Google Drive and HuggingFace Datasets:
- Google Drive
[EN data] (English format) [ZH data] (Chinese format)
T-Eval Data
- HuggingFace Datasets
You can also access the dataset through huggingface via this link.
from datasets import load_dataset
dataset = load_dataset("lovesnowbest/T-Eval")
After downloading, please put the data in the data folder directly:
- data/
- instruct_v2.json
- plan_json_v2.json
...
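Before running the test scripts, it can help to confirm that the files landed in the right place. The following is a minimal sketch, not part of T-Eval: the helper name missing_data_files is hypothetical, and it only checks the two file names shown in the layout above, since the remaining files are elided there.

```python
from pathlib import Path

# Only the two file names shown in the layout above; the rest are elided there.
KNOWN_FILES = ("instruct_v2.json", "plan_json_v2.json")


def missing_data_files(data_dir="data", expected=KNOWN_FILES):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All listed data files found.")
```

Run it from the T-Eval repository root so that the relative data path resolves correctly.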
- Set your OPENAI key in your environment.
export OPENAI_API_KEY=xxxxxxxxx
- Run the model with the following scripts
# test all data at once
sh test_all_en.sh api gpt-4-1106-preview gpt4
# test ZH dataset
sh test_all_zh.sh api gpt-4-1106-preview gpt4
# test for Instruct only
python test.py --model_type api --model_path gpt-4-1106-preview --resume --out_name instruct_gpt4.json --out_dir work_dirs/gpt4/ --dataset_path data/instruct_v2.json --eval instruct --prompt_type json
- Download the huggingface model to your local path.
- Modify the meta_template json according to your tested model.
- Run the model with the following scripts
# test all data at once
sh test_all_en.sh hf $HF_PATH $HF_MODEL_NAME $META_TEMPLATE
# test ZH dataset
sh test_all_zh.sh hf $HF_PATH $HF_MODEL_NAME $META_TEMPLATE
# test for Instruct only
python test.py --model_type hf --model_path $HF_PATH --resume --out_name instruct_$HF_MODEL_NAME.json --out_dir data/work_dirs/ --dataset_path data/instruct_v2.json --eval instruct --prompt_type json --model_display_name $HF_MODEL_NAME --meta_template $META_TEMPLATE
Once all samples have been tested, detailed evaluation results will be logged at $out_dir/$model_display_name/$model_display_name_-1.json (for the ZH dataset, there is a _zh suffix). To obtain your final score, please run the following command:
python teval/utils/convert_results.py --result_path $out_dir/$model_display_name/$model_display_name_-1.json
T-Eval adopts multi-conversation style evaluation to gauge the model. The format of our saved prompt is as follows:
[
{
"role": "system",
"content": "You have access to the following API:\n{'name': 'AirbnbSearch.search_property_by_place', 'description': 'This function takes various parameters to search properties on Airbnb.', 'required_parameters': [{'name': 'place', 'type': 'STRING', 'description': 'The name of the destination.'}], 'optional_parameters': [], 'return_data': [{'name': 'property', 'description': 'a list of at most 3 properties, containing id, name, and address.'}]}\nPlease generate the response in the following format:\ngoal: goal to call this action\n\nname: api name to call\n\nargs: JSON format api args in ONLY one line\n"
},
{
"role": "user",
"content": "Call the function AirbnbSearch.search_property_by_place with the parameter as follows: 'place' is 'Berlin'."
}
]
where role can be one of ['system', 'user', 'assistant'], and content must be a string. Before running inference with an LLM, we need to convert the conversation into a raw string via meta_template. Example meta_template configurations are provided at meta_template.py:
[
dict(role='system', begin='<|System|>:', end='\n'),
dict(role='user', begin='<|User|>:', end='\n'),
dict(
role='assistant',
begin='<|Bot|>:',
end='<eoa>\n',
generate=True)
]
You need to specify the begin and end tokens for your tested HuggingFace model in meta_template.py, then pass the meta_template argument to test.py, using the same name you set in meta_template.py. For OpenAI models, we will handle that for you.
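To make the conversion concrete, here is a minimal sketch of how a conversation could be flattened into a raw prompt string with a meta_template. It is an illustration only, not Lagent's actual implementation: the function name render_prompt is hypothetical, and treating generate=True as "append this role's begin token so the model continues from there" is an assumption.

```python
# Template copied from the example above.
META_TEMPLATE = [
    dict(role='system', begin='<|System|>:', end='\n'),
    dict(role='user', begin='<|User|>:', end='\n'),
    dict(role='assistant', begin='<|Bot|>:', end='<eoa>\n', generate=True),
]

ALLOWED_ROLES = {'system', 'user', 'assistant'}


def render_prompt(messages, meta_template=META_TEMPLATE):
    """Hypothetical helper: wrap each message in its role's begin/end tokens."""
    by_role = {item['role']: item for item in meta_template}
    parts = []
    for msg in messages:
        role, content = msg['role'], msg['content']
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        if not isinstance(content, str):
            raise TypeError("content must be a string")
        tmpl = by_role[role]
        parts.append(f"{tmpl['begin']}{content}{tmpl['end']}")
    # Assumption: open the turn of the role flagged generate=True,
    # so the model's completion becomes that role's reply.
    for item in meta_template:
        if item.get('generate'):
            parts.append(item['begin'])
            break
    return ''.join(parts)


if __name__ == "__main__":
    messages = [
        {"role": "system", "content": "You have access to the following API: ..."},
        {"role": "user", "content": "Call the function AirbnbSearch.search_property_by_place with the parameter as follows: 'place' is 'Berlin'."},
    ]
    print(render_prompt(messages))
```

Checking the role and content type up front mirrors the format constraints stated above, so malformed prompts fail early rather than producing a silently broken string.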
For more detailed and comprehensive benchmark results, please refer to the T-Eval official leaderboard!
You can submit your inference results (obtained by running test.py) to this email. We will run your predictions and update the results in our leaderboard. Please also provide the scale of your tested model. Your submission should be structured as follows:
$model_display_name/
instruct_$model_display_name/
query_0_1_0.json
query_0_1_1.json
...
plan_json_$model_display_name/
plan_str_$model_display_name/
...
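Before sending a submission, a quick overview of what each sub-folder contains can catch incomplete runs. The sketch below is an assumption-laden convenience, not an official T-Eval tool: the function name summarize_submission is hypothetical, and it simply counts the .json files in each immediate sub-folder without checking folder names or file contents.

```python
import sys
from pathlib import Path


def summarize_submission(root):
    """Map each immediate sub-folder of root to the number of .json files in it."""
    root = Path(root)
    return {
        sub.name: sum(1 for _ in sub.glob("*.json"))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize.py path/to/$model_display_name
    for name, count in summarize_submission(sys.argv[1]).items():
        print(f"{name}: {count} json files")
```

A sub-folder reporting far fewer files than the others is a hint that the corresponding run was interrupted and should be resumed.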
T-Eval is built with Lagent and OpenCompass. Thanks for their awesome work!
If you find this project useful in your research, please consider citing:
@article{chen2023t,
title={T-Eval: Evaluating the Tool Utilization Capability Step by Step},
author={Chen, Zehui and Du, Weihua and Zhang, Wenwei and Liu, Kuikun and Liu, Jiangning and Zheng, Miao and Zhuo, Jingming and Zhang, Songyang and Lin, Dahua and Chen, Kai and others},
journal={arXiv preprint arXiv:2312.14033},
year={2023}
}
This project is released under the Apache 2.0 license.