2x coding speed https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/

code improves reasoning

According to the post, Claude 2 now scores 71.2% on HumanEval, a significant upgrade from Claude 1.3 (56.0%). (Found in the model card: pass@1; see the pass@k sketch after the list below.)

For comparison:

  • GPT-4 claims 85.4 on HumanEval; in a recent paper https://arxiv.org/pdf/2303.11366.pdf GPT-4 was measured at 80.1 pass@1, and 91 pass@1 using their Reflexion technique. The paper also includes MBPP and LeetCode Hard benchmark comparisons

  • WizardCoder, a StarCoder fine-tune, is one of the top open models, scoring 57.3 pass@1; model card here: https://huggingface.co/WizardLM/WizardCoder-15B-V1.0

  • The best open model I know of at the moment is replit-code-instruct-glaive, a replit-code-3b fine-tune, which scores 63.5% pass@1. An independent developer, abacaj, has reproduced that result as part of code-eval, a repo for getting HumanEval results: https://github.com/abacaj/code-eval
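For reference, pass@k estimates the probability that at least one of k sampled completions per problem passes all unit tests; pass@1 is the k=1 case. A minimal sketch of the unbiased estimator from the HumanEval paper (Chen et al., 2021), assuming numpy:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn per problem
    c: samples that passed all unit tests
    k: sample budget being scored (k=1 for pass@1)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 130 pass -> pass@1 estimate of 0.65
assert abs(pass_at_k(200, 130, 1) - 0.65) < 1e-9
```

Reported scores average this per-problem estimate over all problems in the benchmark.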

Those interested in this area may also want to take a look at this repo https://github.com/my-other-github-account/llm-humaneval-ben... (which also ranks models with Eval+), the CanAiCode Leaderboard https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul..., and airate https://github.com/catid/supercharger/tree/main/airate

Also, as with all LLM evals, these numbers should be taken with a grain of salt:

Liu, Jiawei, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation.” arXiv, June 12, 2023. https://doi.org/10.48550/arXiv.2305.01210.

Data/Timeline

"Today, GitHub Copilot is behind an average of 46% of a developers’ code across all programming languages—and in Java, that number jumps to 61%."

Known Issues

code models

benchmarks

https://arxiv.org/pdf/2303.06689.pdf MBPP [Austin et al., 2021]: This benchmark, referred to as "Mostly Basic Programming Problems", contains nearly 1,000 crowd-sourced Python programming problems, covering programming fundamentals, standard-library functionality, and more. Each problem in the benchmark consists of an NL description, a code solution, and 3 automated test cases. A portion of the manually verified data is extracted as "MBPP-sanitized". For MBPP, which does not include function signatures, only the NL description is provided as input.

HumanEval [Chen et al., 2021]: This benchmark is a set of 164 handwritten programming problems, proposed by OpenAI. Each problem includes a function signature, an NL description, a function body, and several unit tests, averaging 7.7 tests per problem. For HumanEval, the function signature, NL description, and public test cases are provided as input. Furthermore, the authors utilize expanded versions of MBPP and HumanEval, which include over 100 additional test cases per task, to reinforce the validity of code evaluation [Dong et al., 2023]. These extended versions are referred to as MBPP-ET and HumanEval-ET.
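To make the scoring concrete: a HumanEval problem ships a prompt (signature + docstring), a test string that defines check(candidate), and an entry_point name; a sample passes if the model's completion, appended to the prompt, survives the tests. A minimal sketch (field names follow the released human-eval dataset; real harnesses such as bigcode-evaluation-harness run this in a sandboxed subprocess with timeouts, never bare exec on untrusted output):

```python
def sample_passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if one generated completion passes a HumanEval problem's tests."""
    # prompt = signature + docstring, completion = model-generated body,
    # test = source defining check(candidate), entry_point = function name
    program = f"{prompt}{completion}\n{test}\ncheck({entry_point})\n"
    try:
        exec(program, {})  # WARNING: only ever do this inside a sandbox
        return True
    except Exception:
        return False
```

Counting how many of n samples pass per problem gives the c fed into the pass_at_k estimator above.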

bigcode eval harness https://github.com/bigcode-project/bigcode-evaluation-harness/

products

(alessio's blogpost https://evcrevolution.com/p/evc-10-llm-for-developers)

sourcegraph list https://github.com/sourcegraph/awesome-code-ai

autogenerate PRs

commit msg generation
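No single canonical recipe here, but the core loop is small: read the staged diff, ask a chat model for a summary. A minimal sketch assuming the openai Python client (v1+); the prompt and model name are illustrative, not any particular product's implementation:

```python
import subprocess
from openai import OpenAI

def suggest_commit_message(model: str = "gpt-4") -> str:
    """Draft a one-line commit message from the staged diff."""
    diff = subprocess.run(
        ["git", "diff", "--staged"],
        capture_output=True, text=True, check=True,
    ).stdout
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Write a one-line conventional-commit message for the diff."},
            {"role": "user", "content": diff[:8000]},  # crude truncation for large diffs
        ],
    )
    return resp.choices[0].message.content.strip()
```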

Test generation

Codium - https://www.codium.ai/blog/codiumai-powered-by-testgpt-accounces-beta-and-raised-11m/ - video demo https://twitter.com/mathemagic1an/status/1638598693623582720
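The products above don't publish their pipelines; as a generic sketch of the idea (same assumed openai client as the commit-message sketch, illustrative prompt and model name, output needs human review):

```python
from pathlib import Path
from openai import OpenAI

def draft_pytest_tests(source_path: str, model: str = "gpt-4") -> str:
    """Ask a chat model to draft pytest tests for a source file."""
    source = Path(source_path).read_text()
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You write pytest unit tests covering happy paths and edge cases."},
            {"role": "user", "content": f"Write pytest tests for this module:\n\n{source}"},
        ],
    )
    return resp.choices[0].message.content

# e.g. Path("test_mymodule.py").write_text(draft_pytest_tests("mymodule.py"))
```

Tools in this space typically go further: they run the generated tests and regenerate or discard the ones that fail rather than accepting the first draft.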

GPT low code

alternative languages

function sdks

misc