---
layout: mydefault
---
<title>InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs</title>
<script type="text/javascript" charset="utf8" src="https://code.jquery.com/jquery-3.6.0.slim.min.js"></script>
<script type="text/javascript" charset="utf8" src="https://cdn.datatables.net/1.11.3/js/jquery.dataTables.js"></script>
<script src="https://cdn.datatables.net/1.11.3/js/dataTables.bootstrap5.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
    <div class="navbar-item has-dropdown is-hoverable">
      <a class="navbar-link">
        More
      </a>
      <div class="navbar-dropdown">
        <a class="navbar-item" href="https://github.com/infi-coder">
          InfiCoder Organization
        </a>
      </div>
    </div>
  </div>

</div>

InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs


Linyi Li  Shijie Geng  *Zhenwen Li  *Yibo He  *Hao Yu 
*Ziyue Hua  Guanghan Ning  Siwei Wang  Tao Xie  Hongxia Yang
Simon Fraser University  Rutgers University  Peking University 
ByteDance Inc  The Hong Kong Polytechnic University
(* denotes equal contribution)
        <div class="is-size-5 publication-authors">
        </div>

        <div class="column has-text-centered">
          <div class="publication-links">
            <!-- PDF Link. -->
            <span class="link-block">
              <a href="https://arxiv.org/abs/2404.07940" class="external-link button is-normal is-rounded is-dark" target="_blank">
                <span class="icon">
                  <i class="ai ai-arxiv"></i>
                </span>
                <span>Paper</span>
              </a>
            </span>
            <!-- Code Link. -->
            <span class="link-block">
              <a href="https://github.com/infi-coder/infibench-evaluation-harness/" class="external-link button is-normal is-rounded is-dark" target="_blank">
                <span class="icon">
                  <i class="fab fa-github"></i>
                </span>
                <span>Code</span>
              </a>
            </span>
            <!-- Slides Link. -->
            <span class="link-block">
              <a href="https://cs.sfu.ca/~linyi/res/pub/infibench_slides.pdf" class="external-link button is-normal is-rounded is-dark" target="_blank">
                <span class="icon">
                  <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"  style="fill:white"><!--!Font Awesome Free 6.7.1 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free Copyright 2024 Fonticons, Inc.--><path d="M187.7 153.7c-34 0-61.7 25.7-61.7 57.7 0 31.7 27.7 57.7 61.7 57.7s61.7-26 61.7-57.7c0-32-27.7-57.7-61.7-57.7zm143.4 0c-34 0-61.7 25.7-61.7 57.7 0 31.7 27.7 57.7 61.7 57.7 34.3 0 61.7-26 61.7-57.7 .1-32-27.4-57.7-61.7-57.7zm156.6 90l-6 4.3V49.7c0-27.4-20.6-49.7-46-49.7H76.6c-25.4 0-46 22.3-46 49.7V248c-2-1.4-4.3-2.9-6.3-4.3-15.1-10.6-25.1 4-16 17.7 18.3 22.6 53.1 50.3 106.3 72C58.3 525.1 252 555.7 248.9 457.5c0-.7 .3-56.6 .3-96.6 5.1 1.1 9.4 2.3 13.7 3.1 0 39.7 .3 92.8 .3 93.5-3.1 98.3 190.6 67.7 134.3-124 53.1-21.7 88-49.4 106.3-72 9.1-13.8-.9-28.3-16.1-17.8zm-30.5 19.2c-68.9 37.4-128.3 31.1-160.6 29.7-23.7-.9-32.6 9.1-33.7 24.9-10.3-7.7-18.6-15.5-20.3-17.1-5.1-5.4-13.7-8-27.1-7.7-31.7 1.1-89.7 7.4-157.4-28V72.3c0-34.9 8.9-45.7 40.6-45.7h317.7c30.3 0 40.9 12.9 40.9 45.7v190.6z"/></svg>
                </span>
                <span>Slides</span>
              </a>
            </span>
            <!-- OpenReview Link. -->
            <span class="link-block">
              <a href="https://openreview.net/forum?id=E8EAeyTxOy#discussion" class="external-link button is-normal is-rounded is-dark" target="_blank">
                <span class="icon">
                  <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"  style="fill:white"><!--!Font Awesome Free 6.7.1 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free Copyright 2024 Fonticons, Inc.--><path d="M224 96a160 160 0 1 0 0 320 160 160 0 1 0 0-320zM448 256A224 224 0 1 1 0 256a224 224 0 1 1 448 0z"/></svg>
                </span>
                <span>OpenReview</span>
              </a>
            </span>
          </div>
        </div>

        <div class="column has-text-centered">
          <a href="https://neurips.cc/virtual/2024/poster/97797" target="_blank">Appear at NeurIPS 2024 Datasets and Benchmarks Track</a>
        </div>
      </div>
    </div>
  </div>
</div>

InfiBench is a comprehensive benchmark for code large language models (code LLMs) that evaluates their ability to answer freeform, real-world questions in the code domain.

Overview

Large language models for code (code LLMs) have witnessed tremendous progress in recent years. Alongside this rapid development, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure code LLM performance, with a particular focus on code generation tasks. However, they are insufficient to cover the full range of expected capabilities of code LLMs, which extend beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, to our knowledge the first large-scale freeform question-answering (QA) benchmark for code, comprising 234 carefully selected, high-quality Stack Overflow questions that span 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness, with domain experts carefully concretizing the criteria for each question. We conduct a systematic evaluation of more than 100 of the latest code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses showcase potential directions for further advancement of code LLMs. InfiBench is fully open source and continuously expanding to foster more scientific and systematic practices for code LLM evaluation.

Statistics and Examples

InfiBench comprises 234 carefully picked, high-quality Stack Overflow questions covering 15 programming languages, and it largely follows the natural question distribution of Stack Overflow.

        <img src="static/images/data_domain_stats.png">

        <p>
        <br>
          We recruited five domain experts to create the benchmark and annotate the correctness evaluation criteria.
          Specifically, the InfiBench framework integrates four types of model-free metrics for evaluating correctness: keyword matching, blank filling, unit testing, and dialogue similarity.
        </p>

        <img src="static/images/data_examples.png">

        <p>
        <br>
          Below are the question type, metric type, and length statistics.
        </p>

        <center>
          <img style="width: 350px" src="static/images/general_statistics.png">
        </center>

      </div>
    </div>
  </div>
  <!--/ Example. -->
</div>

Comparison

Existing benchmarks weigh heavily on code generation, unit-test-based evaluation, and a limited set of programming languages. InfiBench possesses much higher diversity to reflect real-world code LLM usage scenarios and is far from saturation.

Prompts and Evaluation Protocol

Each question contains a system prompt and a content prompt. For questions whose responses are mainly in natural language, the system prompt is
You are a professional assistant for programmers. By default, questions and answers are in Markdown format. You are chatting with programmers, so please answer as briefly as possible.
For other questions, the system prompt is
You are a professional assistant for programmers. By default, questions and answers are in Markdown format.
We then format the system prompt and content prompt following each model's default instruction template. If no instruction template is specified, we use the prompt format
{system prompt}\n{content prompt}
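
A minimal sketch of this prompt assembly, assuming a plain Python implementation; `build_prompt`, `natural_language`, and `chat_template` are illustrative names, not identifiers from the InfiBench harness.

```python
# Sketch of the prompt assembly described above. Names here
# (build_prompt, natural_language, chat_template) are illustrative only.

SYSTEM_PROMPT_NL = (
    "You are a professional assistant for programmers. "
    "By default, questions and answers are in Markdown format. "
    "You are chatting with programmers, so please answer as briefly as possible."
)
SYSTEM_PROMPT_DEFAULT = (
    "You are a professional assistant for programmers. "
    "By default, questions and answers are in Markdown format."
)

def build_prompt(content_prompt: str, natural_language: bool, chat_template=None) -> str:
    """Combine the per-question system prompt and content prompt."""
    system_prompt = SYSTEM_PROMPT_NL if natural_language else SYSTEM_PROMPT_DEFAULT
    if chat_template is not None:
        # Preferred path: format with the model's own instruction template.
        return chat_template(system_prompt, content_prompt)
    # Fallback when no instruction template is specified.
    return f"{system_prompt}\n{content_prompt}"
```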

We adopt best@10 as the main evaluation metric, where 10 responses are sampled and evaluated for each question, and the best score per question is recorded and summed up. Throughout the evaluation, we set the sampling temperature T to 0.2 and the top-p cutoff threshold to 0.9. We leave the exploration of other hyperparameters as future work.
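
As a sketch, the best@10 aggregation can be implemented as below; the score layout and the `best_at_k` helper are illustrative assumptions, not the harness's actual interface.

```python
# Hedged sketch of the best@10 aggregation described above.
# The score layout and function name are illustrative assumptions.

def best_at_k(per_question_scores: dict) -> float:
    """Sum the best score among the sampled responses of each question,
    then report it as a percentage of the full score (one point per question)."""
    total = sum(max(scores) for scores in per_question_scores.values())
    return 100.0 * total / len(per_question_scores)

# Example: two questions with 10 sampled responses each (k = 10),
# generated with temperature 0.2 and top-p 0.9.
scores = {
    "q1": [0.4, 0.6, 0.6, 1.0, 0.2, 0.4, 0.6, 0.8, 0.6, 0.4],
    "q2": [0.0, 0.2, 0.2, 0.2, 0.0, 0.4, 0.2, 0.2, 0.0, 0.2],
}
print(best_at_k(scores))  # (1.0 + 0.4) / 2 * 100 = 70.0
```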

For score computation, we treat each question equally, with one point each. Since the question frequency largely follows the Stack Overflow distribution, this score can be interpreted as how well the model responds to Stack Overflow questions. Given the 234 questions in the benchmark, the full score is 234, and by default we report the percentage score (the achieved score divided by the full score of 234). The one point for each question can be further decomposed into a few scoring points within the question. For example, a question may contain four keywords with weights 2, 1, 1, and 1. Matching each keyword then contributes 0.4, 0.2, 0.2, and 0.2 points, respectively, to the final score.
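
The keyword example above corresponds to the following sketch; the scoring function and the sample keywords are made up for illustration and are not the benchmark's actual on-disk format.

```python
# Hedged sketch of per-question weighted keyword scoring.
# The sample keywords and weights below are made up for illustration.

def keyword_score(response: str, keywords: dict) -> float:
    """Each matched keyword contributes weight / total_weight points,
    so a fully matched question is worth exactly one point."""
    total_weight = sum(keywords.values())
    matched_weight = sum(w for kw, w in keywords.items() if kw in response)
    return matched_weight / total_weight

# Weights 2, 1, 1, 1 -> scoring points 0.4, 0.2, 0.2, 0.2, as in the text.
keywords = {"virtualenv": 2, "pip": 1, "requirements.txt": 1, "python -m venv": 1}
print(keyword_score("Use pip inside a virtualenv.", keywords))  # 0.4 + 0.2 = 0.6
```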

Leaderboard


Each point corresponds to an open-source model, with error bars for models smaller than 30B. Each dotted segment corresponds to an MoE model. Proprietary models are shown as lines with uncertainty ranges.

Note: we set the maximum number of tokens to generate to 1024 (GPT-4 generates 662 tokens without the constraint, so setting the limit to 1024 provides some wiggle room).

For models with more than 30B parameters, we evaluate once due to resource limits; otherwise, we evaluate three times and report the mean and standard deviation.

The lock icon stands for proprietary models.


{% for item in site.data.leaderboard.records %}
  {% if item.link != null %} {% else %} {% endif %}
  {% if item.size != null %} {% else %} {% endif %}
  {% if item.score_std != null %} {% else %} {% endif %}
{% endfor %}

Rank | Model Name | # Params. (in B) | Context Length | Full Set Score | Full Set Std

{{ item.rank }} {% if item.locked %}{% endif %}{{ item.title }}{% if item.locked %}{% endif %}{{ item.title }} {{ item.size }}/{{ item.ctx_length }} {{ item.score }} {{ item.score_std }} {{ item.comment }}
    </div>
  </div>
</div>

Try the Benchmark!

Currently, we only support the Linux environment.

Reference Dockerfile (contributed by Kaixin Li).

  1. Convert or save your model weights in Hugging Face Transformers format (see the sketch after this list).
  2. Clone our code repository.
  3. Follow the short tutorial to generate responses and evaluate on InfiBench!
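
A minimal sketch of step 1, assuming the standard Hugging Face Transformers save path; the model name and output directory below are placeholders, not paths from our repository.

```python
# Hedged sketch for step 1: saving model weights in Hugging Face
# Transformers format. The model name and output path are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "my-org/my-code-llm"   # placeholder checkpoint
output_dir = "./my-code-llm-hf"     # placeholder output directory

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Save the weights and the tokenizer into one directory so that any
# standard Transformers loader can pick them up by path.
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```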

Feedback

<script src="https://giscus.app/client.js" data-repo="infi-coder/infibench" data-repo-id="R_kgDOKxa6cg" data-category-id="DIC_kwDOKxa6cs4CbSFx" data-mapping="pathname" data-strict="0" data-reactions-enabled="1" data-emit-metadata="0" data-input-position="bottom" data-theme="preferred_color_scheme" data-lang="en" crossorigin="anonymous" async> </script>

You can also give us feedback in the issue section of our code repository.


BibTeX

@inproceedings{li2024infibench,
  title={InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models},
  author={Linyi Li and Shijie Geng and Zhenwen Li and Yibo He and Hao Yu and Ziyue Hua and Guanghan Ning and Siwei Wang and Tao Xie and Hongxia Yang},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=E8EAeyTxOy}
}

More Leaderboards

In addition to InfiBench, we recommend using a diverse set of benchmarks and leaderboards to comprehensively understand LLM coding ability.

This website is adapted from ds1000-code-gen.github.io and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This means you are free to borrow the source code of this website; we just ask that you link back to this page in the footer.

<script>
  $(document).ready(function () {
    $('.mainTable').DataTable({
      ordering: true,
      order: [[4, 'desc']],
      columns: [
        { "type": "num" },
        { "type": "html" },
        { "type": "num" },
        { "type": "num-fmt" },
        { "type": "num-fmt" },
        { "type": "num-fmt" },
        { "type": "html", "orderable": false }
      ]
    });
  });
</script>