
Aider benchmark, DeepSeek-6.7B-Instruct model hardly generates SEARCH/REPLACE blocks, leading to very low pass rates #192

Open
ytxmobile98 opened this issue Nov 29, 2024 · 5 comments


@ytxmobile98

ytxmobile98 commented Nov 29, 2024

Update 2024-12-11

I have found that when running the Aider benchmark with the DeepSeek-Coder-6.7B-Instruct model, most of the responses generated by the model did not include the SEARCH/REPLACE blocks that the benchmarking program uses to write the code into Python source files and run the unit tests. See this comment.


Original post on 2024-11-29

I got some extraordinarily low results when running the Aider benchmark with the DeepSeek-6.7B-Instruct model. When I inspected the output files, what astonished me most was that most of them did not contain valid solution code, but only the original function signature along with a pass statement. What steps did I miss to run the evaluations and get the expected results? Thanks.

My results

Edit mode: diff

{
  "pass_rate_1": 0.9,
  "pass_rate_2": 0.9,
  "percent_cases_well_performed": 100
}

Edit mode: whole

{
  "pass_rate_1": 1.5,
  "pass_rate_2": 1.5,
  "percent_cases_well_performed": 100
}

The output

When I inspected the outputs, I noticed that the majority of the code files were not edited to contain the correct solution, but were still left with the signature plus a simple pass statement. For example, the isogram test case produced the following isogram.py:

def is_isogram(string):
    pass
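
For comparison, a passing case would need the stub to be replaced with an actual implementation, e.g. something like this (my own sketch, not the exercise's reference solution):

def is_isogram(string):
    # an isogram has no repeating letters; case and non-letter characters are ignored
    letters = [c.lower() for c in string if c.isalpha()]
    return len(letters) == len(set(letters))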

The model's config file

Meanwhile, here is the model's config.json file:

{
  "_name_or_path": "/3fs-jd/prod/deepseek/shared/zhuqihao/public_model/deepseek-coder-7b-instruct2",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 32013,
  "eos_token_id": 32014,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 100000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.34.1",
  "use_cache": true,
  "vocab_size": 32256
}

The bash scripts

run.sh

# The model name matches a model directory on my test machine
# MODEL_NAME="Qwen2.5-Coder-7B-Instruct"
export MODEL_NAME="deepseek-coder-6___7b-instruct"
# export MODEL_NAME="DeepSeek-Coder-V2-Lite-Instruct"

# edit format (`whole` / `diff`)
# export EDIT_FORMAT=whole
export EDIT_FORMAT=diff

export CUDA_VISIBLE_DEVICES="2,3"
TP=2

EVAL_SCRIPT="./evaluate.sh"
MODEL_DIR="/data/models/${MODEL_NAME}/"
OUTPUT_DIR="./results/${MODEL_NAME}/${EDIT_FORMAT}"
bash "${EVAL_SCRIPT}" "${MODEL_DIR}" "${OUTPUT_DIR}" "${TP}"

evaluate.sh

MODEL_DIR=${1}
OUTPUT_DIR=${2}
TP=${3}
MODEL_DIR=${MODEL_DIR:-"./pretrained_models/"}
OUTPUT_DIR=${OUTPUT_DIR:-"./results/"}
mkdir -p ${OUTPUT_DIR}
TP=${TP:-2}
echo $TP

ROOT_DIR="."

bash test.sh "${MODEL_DIR}" ${TP} "${OUTPUT_DIR}/aider"

test.sh

export PATH=./aider/bin:$PATH

export HF_ENDPOINT=http://hf-mirror.com
export HF_HOME=""
export HF_DATASETS_OFFLINE=1
export HF_EVALUATE_OFFLINE=1

export OPENAI_API_BASE=http://0.0.0.0:8000/v1
export OPENAI_API_KEY=token-abc123

export MODEL=$1
export TP=$2
export OUTPUT_DIR=$3

export SERVED_MODEL_NAME=$(basename ${MODEL})
export API_MODEL_NAME=openai/${SERVED_MODEL_NAME}

# Edit format is `whole` or `diff`
# normally it should be passed from `run.sh`
if [ -z "$EDIT_FORMAT" ]; then
    EDIT_FORMAT=diff
fi

mkdir -p ${OUTPUT_DIR}

echo "Starting serving ${MODEL} as ${SERVED_MODEL_NAME}..."
vllm serve ${MODEL} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --tensor-parallel-size ${TP} \
    --trust-remote-code \
    --max-model-len 4096 \
    --dtype auto \
    --api-key token-abc123 \
    > ${OUTPUT_DIR}/vllm-server.txt 2>&1 &

sleep 5
jobs -l > ${OUTPUT_DIR}/jobs.txt
PID=$(awk '{print $2}' ${OUTPUT_DIR}/jobs.txt)
echo "PID: $PID"

echo "Waiting for the model to be served..."
while true; do
    if grep -q 'Uvicorn running on' "${OUTPUT_DIR}/vllm-server.txt"; then
        echo "Model is being served..."
        break
    else
        echo "Waiting for model to start..."
        sleep 1
    fi
done

echo "Benchmarking ${SERVED_MODEL_NAME}..."

python benchmark/benchmark.py ${SERVED_MODEL_NAME} \
    --new \
    --model ${API_MODEL_NAME} \
    --edit-format ${EDIT_FORMAT} \
    --threads 1 \
    > ${OUTPUT_DIR}/log.txt

# extract the required lines from log.txt and use awk to extract the corresponding values
pass_rate_1=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'pass_rate_1' | awk '{print $2}')
pass_rate_2=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'pass_rate_2' | awk '{print $2}')
percent_cases_well_formed=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'percent_cases_well_formed' | awk '{print $2}')

# create JSON-formatted content
json_content=$(cat <<EOF
{
  "pass_rate_1": $pass_rate_1,
  "pass_rate_2": $pass_rate_2,
  "percent_cases_well_formed": $percent_cases_well_formed
}
EOF
)

# write the JSON content to the results.json file
echo "$json_content" > ${OUTPUT_DIR}/results.json

PID=$(awk '{print $2}' ${OUTPUT_DIR}/jobs.txt)
kill ${PID}
@cyente
Collaborator

cyente commented Dec 2, 2024

It is the repo of Qwen2.5-Coder; maybe you should submit your issue to ds-coder?

@ytxmobile98
Author

It is the repo of Qwen2.5-Coder; maybe you should submit your issue to ds-coder?

@cyente What "ds-coder" are you referring to? Thanks.

@Hambaobao
Contributor

Hambaobao commented Dec 2, 2024

@ytxmobile98 I think you need to set the --max-model-len to a larger number, like 8192. BTW, you may check the log file to locate the issues.

@ytxmobile98 ytxmobile98 changed the title Unusual low results on DeepSeek-6.7B-Instruct model (most of the outputs have no valid solution code) Unusual low results on Aider, DeepSeek-6.7B-Instruct model (most of the outputs have no valid solution code) Dec 3, 2024
@ytxmobile98 ytxmobile98 changed the title Unusual low results on Aider, DeepSeek-6.7B-Instruct model (most of the outputs have no valid solution code) Unusual low results on Aider benchmark, DeepSeek-6.7B-Instruct model (most of the outputs have no valid solution code) Dec 3, 2024
@ytxmobile98
Author

@ytxmobile98 I think you need to set the --max-model-len to a larger number, like 8192. BTW, you may check the log file to locate the issues.

Looks like --max-model-len does not help that much. I ran Aider diff mode with the DeepSeek-6.7B-Instruct model and --max-model-len=8192, and the scores of both passes were 1.5, just a little higher than the 0.9 I got the first time.

@ytxmobile98
Author

ytxmobile98 commented Dec 11, 2024

Update 2024-12-11

@cyente @Hambaobao

I have done some further work in the past two days, testing the Qwen2.5-7B-Instruct model and the DeepSeek-Coder-6.7B-Instruct model, and found one key cause:

The benchmarking program relies on the SEARCH/REPLACE blocks to copy code from the chat history and paste it into the *.py files. While the output of the Qwen2.5 model mostly follows the expected format, the DeepSeek model seems to output content as if it were solving a regular coding problem rather than producing a diff.

Example: accumulate
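
For reference, a well-formed response in diff mode should contain a SEARCH/REPLACE block roughly like the following (my own sketch for the accumulate exercise, not the model's actual output; the exact stub may differ):

accumulate.py
<<<<<<< SEARCH
def accumulate(collection, operation):
    pass
=======
def accumulate(collection, operation):
    # apply the operation to every element, preserving order
    return [operation(item) for item in collection]
>>>>>>> REPLACE

When these markers are missing, the harness cannot apply the edit, so the stub files stay unchanged and the test cases fail.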

@ytxmobile98 ytxmobile98 changed the title Unusual low results on Aider benchmark, DeepSeek-6.7B-Instruct model (most of the outputs have no valid solution code) Aider benchmark, DeepSeek-6.7B-Instruct model hardly generates SEARCH/REPLACE blocks, leading to very low pass rates Dec 11, 2024