Linyi Li Shijie Geng *Zhenwen Li *Yibo He *Hao Yu
*Ziyue Hua Guanghan Ning Siwei Wang Tao Xie Hongxia Yang
Simon Fraser University Rutgers University Peking University
ByteDance Inc The Hong Kong Polytechnic University
(* denotes equal contribution)
<div class="is-size-5 publication-authors">
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="https://arxiv.org/abs/2404.07940" class="external-link button is-normal is-rounded is-dark" target="_blank">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>Paper</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/infi-coder/infibench-evaluation-harness/" class="external-link button is-normal is-rounded is-dark" target="_blank">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<!-- Slides Link. -->
<span class="link-block">
<a href="https://cs.sfu.ca/~linyi/res/pub/infibench_slides.pdf" class="external-link button is-normal is-rounded is-dark" target="_blank">
<span class="icon">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" style="fill:white"><!--!Font Awesome Free 6.7.1 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free Copyright 2024 Fonticons, Inc.--><path d="M187.7 153.7c-34 0-61.7 25.7-61.7 57.7 0 31.7 27.7 57.7 61.7 57.7s61.7-26 61.7-57.7c0-32-27.7-57.7-61.7-57.7zm143.4 0c-34 0-61.7 25.7-61.7 57.7 0 31.7 27.7 57.7 61.7 57.7 34.3 0 61.7-26 61.7-57.7 .1-32-27.4-57.7-61.7-57.7zm156.6 90l-6 4.3V49.7c0-27.4-20.6-49.7-46-49.7H76.6c-25.4 0-46 22.3-46 49.7V248c-2-1.4-4.3-2.9-6.3-4.3-15.1-10.6-25.1 4-16 17.7 18.3 22.6 53.1 50.3 106.3 72C58.3 525.1 252 555.7 248.9 457.5c0-.7 .3-56.6 .3-96.6 5.1 1.1 9.4 2.3 13.7 3.1 0 39.7 .3 92.8 .3 93.5-3.1 98.3 190.6 67.7 134.3-124 53.1-21.7 88-49.4 106.3-72 9.1-13.8-.9-28.3-16.1-17.8zm-30.5 19.2c-68.9 37.4-128.3 31.1-160.6 29.7-23.7-.9-32.6 9.1-33.7 24.9-10.3-7.7-18.6-15.5-20.3-17.1-5.1-5.4-13.7-8-27.1-7.7-31.7 1.1-89.7 7.4-157.4-28V72.3c0-34.9 8.9-45.7 40.6-45.7h317.7c30.3 0 40.9 12.9 40.9 45.7v190.6z"/></svg>
</span>
<span>Slides</span>
</a>
</span>
<!-- OpenReview Link. -->
<span class="link-block">
<a href="https://openreview.net/forum?id=E8EAeyTxOy#discussion" class="external-link button is-normal is-rounded is-dark" target="_blank">
<span class="icon">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" style="fill:white"><!--!Font Awesome Free 6.7.1 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free Copyright 2024 Fonticons, Inc.--><path d="M224 96a160 160 0 1 0 0 320 160 160 0 1 0 0-320zM448 256A224 224 0 1 1 0 256a224 224 0 1 1 448 0z"/></svg>
</span>
<span>OpenReview</span>
</a>
</span>
</div>
</div>
<div class="column has-text-centered">
<a href="https://neurips.cc/virtual/2024/poster/97797" target="_blank">Appeared at the NeurIPS 2024 Datasets and Benchmarks Track</a>
</div>
</div>
</div>
</div>
</div>
Large Language Models for code (code LLMs) have made tremendous progress in recent years. Alongside this rapid development, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs, with a particular focus on code generation tasks. However, these benchmarks do not cover the full range of capabilities expected of code LLMs, which extend beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, to our knowledge the first large-scale freeform question-answering (QA) benchmark for code, comprising 234 carefully selected high-quality Stack Overflow questions that span 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness, with domain experts carefully concretizing the criteria for each question. We conduct a systematic evaluation of over 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses showcase potential directions for further advancement of code LLMs. InfiBench is fully open source and continuously expanding to foster more scientific and systematic practices for code LLM evaluation.
InfiBench comprises 234 carefully picked high-quality Stack Overflow questions, covering 15 programming languages, and largely following the natural question distribution of Stack Overflow.
<img src="static/images/data_domain_stats.png">
<p>
<br>
We recruited five domain experts to create the benchmark and annotate the correctness evaluation criteria.
Specifically, the InfiBench framework integrates four types of model-free metrics for evaluating correctness: keyword matching, blank filling, unit testing, and dialogue similarity.
</p>
<img src="static/images/data_examples.png">
<p>
<br>
Below are the question type, metric type, and length statistics.
</p>
<center>
<img style="width: 350px" src="static/images/general_statistics.png">
</center>
</div>
</div>
</div>
<!--/ Example. -->
</div>
You are a professional assistant for programmers. By default, questions and answers are in Markdown format. You are chatting with programmers, so please answer as briefly as possible.
You are a professional assistant for programmers. By default, questions and answers are in Markdown format.
{system prompt}\n{content prompt}
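As a minimal sketch of the template above, the snippet below joins a system prompt and a question with a single newline. The function name `build_prompt` and the sample question are hypothetical and are not taken from the InfiBench evaluation harness.

```python
# Hypothetical helper illustrating the "{system prompt}\n{content prompt}" template.
# The names here are assumptions for illustration, not the harness's actual API.
SYSTEM_PROMPT = (
    "You are a professional assistant for programmers. "
    "By default, questions and answers are in Markdown format."
)


def build_prompt(system_prompt: str, content_prompt: str) -> str:
    """Join the system prompt and the question text with one newline."""
    return f"{system_prompt}\n{content_prompt}"


# Made-up question used only to show the resulting layout.
example = build_prompt(SYSTEM_PROMPT, "How do I reverse a list in Python?")
print(example)
```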
We adopt best@10 as the main evaluation metric: for each question, 10 responses are sampled and evaluated, the best score per question is recorded, and these best scores are summed. Throughout the evaluation, we set the sampling temperature T to 0.2 and the top-p cutoff threshold to 0.9. We leave the exploration of other hyperparameters as future work.
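The best@10 aggregation can be sketched as follows, assuming the per-response scores have already been computed. The function `best_at_k`, the dictionary layout, and the toy scores are illustrative assumptions rather than the harness's actual interface; only the values 10, 0.2, and 0.9 come from the text.

```python
from typing import Dict, List

# Sampling settings stated in the text; the key names are hypothetical.
SAMPLING = {"num_samples": 10, "temperature": 0.2, "top_p": 0.9}


def best_at_k(scores_per_question: Dict[str, List[float]], k: int = 10) -> float:
    """Sum, over questions, the best score among the first k sampled responses."""
    total = 0.0
    for question_id, scores in scores_per_question.items():
        if not scores:
            raise ValueError(f"no scored responses for question {question_id!r}")
        total += max(scores[:k])
    return total


# Toy data: two questions with three scored responses each (made-up numbers).
toy_scores = {
    "q1": [0.0, 0.5, 1.0],
    "q2": [0.25, 0.5, 0.25],
}
print(best_at_k(toy_scores, k=SAMPLING["num_samples"]))
```

With the toy data, the best scores are 1.0 and 0.5, so the summed score is 1.5 out of a possible 2.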
For score computation, we weight every question equally at one point. Since the question frequency largely follows the Stack Overflow distribution, this score can be interpreted as how well the model responds to Stack Overflow questions. Given 234 questions in the benchmark, the full score is 234, and by default we report the percentage score (the achieved score divided by the full score of 234). The one point for each question can be further decomposed into several scoring points within the question. For example, a question may contain four keywords with weights 2, 1, 1, and 1. Matching these keywords then contributes 0.4, 0.2, 0.2, and 0.2 points, respectively, to the final score.
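The weight decomposition and the percentage score can be illustrated with a short sketch. It assumes that keyword weights are normalized by their sum so that each question is worth exactly one point, which matches the 2, 1, 1, 1 example; the function names are hypothetical.

```python
from typing import List, Sequence

FULL_SCORE = 234  # number of questions in the benchmark, one point each


def keyword_points(weights: Sequence[float]) -> List[float]:
    """Normalize keyword weights so that they sum to one point per question."""
    total = sum(weights)
    return [w / total for w in weights]


def question_score(weights: Sequence[float], matched: Sequence[bool]) -> float:
    """Add up the points of the keywords that were matched in a response."""
    return sum(p for p, hit in zip(keyword_points(weights), matched) if hit)


def percentage_score(achieved: float, full_score: int = FULL_SCORE) -> float:
    """Convert an achieved score into a percentage of the full score."""
    return 100.0 * achieved / full_score


points = keyword_points([2, 1, 1, 1])
print(points)  # per-keyword points for the example in the text
print(question_score([2, 1, 1, 1], [True, False, True, False]))
print(percentage_score(117))
```

Here a response matching the first and third keywords earns roughly 0.6 points on that question, and an achieved score of 117 corresponds to 50% of the full score.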
Rank | Model Name | # Params. (in B) | Context Length | Full Set Score | Full Set Std | |
---|---|---|---|---|---|---|
{% for item in site.data.leaderboard.records %}
{{ item.rank }} | {{ item.title }} | {% if item.size != null %}{{ item.size }}{% else %}/{% endif %} | {{ item.ctx_length }} | {{ item.score }} | {{ item.score_std }} | {{ item.comment }} |
{% endfor %}
</div>
</div>
</div>
We currently support only Linux environments.
Reference Docker file (contributed by Kaixin Li).
- Convert or save your model weights in Hugging Face Transformers format.
- Clone our code repository.
- Follow the short tutorial to generate responses and evaluate on InfiBench!
You can also give us feedback in the Issues section of our code repository.
@inproceedings{
li2024infibench,
title={InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models},
author={Linyi Li and Shijie Geng and Zhenwen Li and Yibo He and Hao Yu and Ziyue Hua and Guanghan Ning and Siwei Wang and Tao Xie and Hongxia Yang},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=E8EAeyTxOy}
}
In addition to InfiBench, we recommend using a diverse set of benchmarks and leaderboards to build a comprehensive picture of LLM coding ability, such as:
- BigCodeBench
- Big Code Models Leaderboard
- Chatbot Arena Leaderboard
- CrossCodeEval
- ClassEval
- CRUXEval
- Code Lingua
- Evo-Eval
- EffiBench
- HumanEval.jl - Julia version HumanEval with EvalPlus test cases
- LiveCodeBench
- MHPP
- NaturalCodeBench
- RepoBench
- SWE-bench
- SWE-bench Verified
- TabbyML Leaderboard
- TestEval
- EvalPlus
This website is adapted from ds1000-code-gen.github.io and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This means you are free to borrow the source code of this website; we just ask that you link back to this page in the footer.