# Awesome LM Evaluation Methodologies

Frontier papers in the evaluation methodologies of language models.


## How To Use Me

1. On this page, press Ctrl+F (Windows) or Command+F (Mac).
2. Enter the keyword you want to search for.
3. Open the paper from its link.

## Evaluation Methodologies

| Author | Title | Venue | Link |
| --- | --- | --- | --- |
| Jiatong Li, et al. | PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations | NeurIPS 2024 | https://arxiv.org/abs/2405.19740 |
| Jingnan Zheng, et al. | ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation | NeurIPS 2024 | https://arxiv.org/abs/2405.14125 |
| Jinhao Duan, et al. | GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations | NeurIPS 2024 | https://arxiv.org/abs/2402.12348 |
| Felipe Maia Polo, et al. | Efficient multi-prompt evaluation of LLMs | NeurIPS 2024 | https://arxiv.org/abs/2405.17202 |
| Fan Lin, et al. | IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation | NeurIPS 2024 | https://arxiv.org/abs/2409.18892 |
| Jinjie Ni, et al. | MixEval: Fast and Dynamic Human Preference Approximation with LLM Benchmark Mixtures | NeurIPS 2024 | https://nips.cc/virtual/2024/poster/96545 |
| Percy Liang, et al. | Holistic Evaluation of Language Models | TMLR | https://arxiv.org/abs/2211.09110 |
| Felipe Maia Polo, et al. | tinyBenchmarks: evaluating LLMs with fewer examples | ICML 2024 | https://openreview.net/forum?id=qAml3FpfhG |
| Miltiadis Allamanis, et al. | Unsupervised Evaluation of Code LLMs with Round-Trip Correctness | ICML 2024 | https://icml.cc/virtual/2024/poster/33761 |
| Wei-Lin Chiang, et al. | Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference | ICML 2024 | https://arxiv.org/abs/2403.04132 |
| Yonatan Oren, et al. | Proving Test Set Contamination in Black-Box Language Models | ICLR 2024 | https://arxiv.org/abs/2310.17623 |
| Kaijie Zhu, et al. | DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks | ICLR 2024 | https://arxiv.org/abs/2309.17167 |
| Seonghyeon Ye, et al. | FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | ICLR 2024 | https://openreview.net/forum?id=CYmF38ysDa |
| Shahriar Golchin, et al. | Time Travel in LLMs: Tracing Data Contamination in Large Language Models | ICLR 2024 | https://openreview.net/forum?id=2Rwq6c3tvr |
| Gati Aher, et al. | Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies | ICML 2023 | https://proceedings.mlr.press/v202/aher23a/aher23a.pdf |

## Evaluation Benchmarks

| Author | Title | Venue | Link |
| --- | --- | --- | --- |
| Dan Hendrycks, et al. | Measuring Massive Multitask Language Understanding | ICLR 2021 | https://arxiv.org/abs/2009.03300 |
| Yuzhen Huang, et al. | C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | NeurIPS 2023 | https://arxiv.org/abs/2305.08322 |
| Zhexin Zhang, et al. | SafetyBench: Evaluating the Safety of Large Language Models | ACL 2024 | https://aclanthology.org/2024.acl-long.830/ |
| Haoran Li, et al. | PrivLM-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models | ACL 2024 | https://aclanthology.org/2024.acl-long.4/ |

## Survey Papers

| Author | Title | Venue | Link |
| --- | --- | --- | --- |
| Yupeng Chang, et al. | A Survey on Evaluation of Large Language Models | TIST | https://dl.acm.org/doi/full/10.1145/3641289 |
| Zishan Guo, et al. | Evaluating Large Language Models: A Comprehensive Survey | Preprint (arXiv) | https://arxiv.org/abs/2310.19736 |
| Zhuang Ziyu, et al. | Through the Lens of Core Competency: Survey on Evaluation of Large Language Models | CCL 2023 | https://aclanthology.org/2023.ccl-2.8/ |
| Isabel O. Gallegos, et al. | Bias and Fairness in Large Language Models: A Survey | CL 2024 | https://aclanthology.org/2024.cl-3.8/ |
