
Aider benchmark, DeepSeek-6.7B-Instruct model hardly generates SEARCH/REPLACE blocks, leading to very low pass rates #192

Open
ytxmobile98 opened this issue Nov 29, 2024 · 5 comments


@ytxmobile98

ytxmobile98 commented Nov 29, 2024

Update 2024-12-11

I have found that when running the Aider benchmark with the DeepSeek-Coder-6.7B-Instruct model, most of the responses generated by the model did not include the SEARCH/REPLACE blocks that the benchmarking program uses to write the code into Python source files and run the unit tests. See this comment.


Original post on 2024-11-29

I got some extraordinarily low results when running the Aider benchmark with the DeepSeek-6.7B-Instruct model. When I inspected the output files, what astonished me most was that most of them did not contain valid solution code, but only the original function signature along with a pass statement. What steps did I miss to run the evaluations and get the expected results? Thanks.

My results

Edit mode: diff

{
  "pass_rate_1": 0.9,
  "pass_rate_2": 0.9,
  "percent_cases_well_performed": 100
}

Edit mode: whole

{
  "pass_rate_1": 1.5,
  "pass_rate_2": 1.5,
  "percent_cases_well_performed": 100
}

The output

When I inspected the outputs, I noticed that the majority of the code files were not edited to contain the correct solution, but were still left with the signature plus a simple pass statement. For example, the isogram test case produced the following isogram.py:

def is_isogram(string):
    pass
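
For comparison, a passing case would need the stub to be replaced with an actual implementation, e.g. something like this (my own sketch, not the exercise's reference solution):

def is_isogram(string):
    # an isogram has no repeating letters; case and non-letter characters are ignored
    letters = [c.lower() for c in string if c.isalpha()]
    return len(letters) == len(set(letters))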

The model's config file

Meanwhile, here is the model's config.json file:

{
  "_name_or_path": "/3fs-jd/prod/deepseek/shared/zhuqihao/public_model/deepseek-coder-7b-instruct2",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 32013,
  "eos_token_id": 32014,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 100000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.34.1",
  "use_cache": true,
  "vocab_size": 32256
}

The bash scripts

run.sh

# The model name matches a model directory on my test machine
# MODEL_NAME="Qwen2.5-Coder-7B-Instruct"
export MODEL_NAME="deepseek-coder-6___7b-instruct"
# export MODEL_NAME="DeepSeek-Coder-V2-Lite-Instruct"

# edit format (`whole` / `diff`)
# export EDIT_FORMAT=whole
export EDIT_FORMAT=diff

export CUDA_VISIBLE_DEVICES="2,3"
TP=2

EVAL_SCRIPT="./evaluate.sh"
MODEL_DIR="/data/models/${MODEL_NAME}/"
OUTPUT_DIR="./results/${MODEL_NAME}/${EDIT_FORMAT}"
bash "${EVAL_SCRIPT}" "${MODEL_DIR}" "${OUTPUT_DIR}" "${TP}"

evaluate.sh

MODEL_DIR=${1}
OUTPUT_DIR=${2}
TP=${3}
MODEL_DIR=${MODEL_DIR:-"./pretrained_models/"}
OUTPUT_DIR=${OUTPUT_DIR:-"./results/"}
mkdir -p ${OUTPUT_DIR}
TP=${TP:-2}
echo $TP

ROOT_DIR="."

bash test.sh "${MODEL_DIR}" ${TP} "${OUTPUT_DIR}/aider"

test.sh

export PATH=./aider/bin:$PATH

export HF_ENDPOINT=http://hf-mirror.com
export HF_HOME=""
export HF_DATASETS_OFFLINE=1
export HF_EVALUATE_OFFLINE=1

export OPENAI_API_BASE=http://0.0.0.0:8000/v1
export OPENAI_API_KEY=token-abc123

export MODEL=$1
export TP=$2
export OUTPUT_DIR=$3

export SERVED_MODEL_NAME=$(basename ${MODEL})
export API_MODEL_NAME=openai/${SERVED_MODEL_NAME}

# Edit format is `whole` or `diff`
# normally it should be passed from `run.sh`
if [ -z "$EDIT_FORMAT" ]; then
    EDIT_FORMAT=diff
fi

mkdir -p ${OUTPUT_DIR}

echo "Starting serving ${MODEL} as ${SERVED_MODEL_NAME}..."
vllm serve ${MODEL} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --tensor-parallel-size ${TP} \
    --trust-remote-code \
    --max-model-len 4096 \
    --dtype auto \
    --api-key token-abc123 \
    > ${OUTPUT_DIR}/vllm-server.txt 2>&1 &

sleep 5
jobs -l > ${OUTPUT_DIR}/jobs.txt
PID=$(awk '{print $2}' ${OUTPUT_DIR}/jobs.txt)
echo "PID: $PID"

echo "Waiting for the model to be served..."
while true; do
    if grep -q 'Uvicorn running on' "${OUTPUT_DIR}/vllm-server.txt"; then
        echo "Model is being served..."
        break
    else
        echo "Waiting for model to start..."
        sleep 1
    fi
done

echo "Benchmarking ${SERVED_MODEL_NAME}..."

python benchmark/benchmark.py ${SERVED_MODEL_NAME} \
    --new \
    --model ${API_MODEL_NAME} \
    --edit-format ${EDIT_FORMAT} \
    --threads 1 \
    > ${OUTPUT_DIR}/log.txt

# extract the required lines from log.txt and use awk to extract the corresponding values
pass_rate_1=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'pass_rate_1' | awk '{print $2}')
pass_rate_2=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'pass_rate_2' | awk '{print $2}')
percent_cases_well_formed=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'percent_cases_well_formed' | awk '{print $2}')

# create JSON-formatted content
json_content=$(cat <<EOF
{
  "pass_rate_1": $pass_rate_1,
  "pass_rate_2": $pass_rate_2,
  "percent_cases_well_formed": $percent_cases_well_formed
}
EOF
)

# write the JSON content to the results.json file
echo "$json_content" > ${OUTPUT_DIR}/results.json

PID=$(awk '{print $2}' ${OUTPUT_DIR}/jobs.txt)
kill ${PID}
@cyente
Collaborator

cyente commented Dec 2, 2024

It is the repo of Qwen2.5-Coder; maybe you should submit your issue to ds-coder?

@ytxmobile98
Author

It is the repo of Qwen2.5-Coder; maybe you should submit your issue to ds-coder?

@cyente What "ds-coder" are you referring to? Thanks.

@Hambaobao
Contributor

Hambaobao commented Dec 2, 2024

@ytxmobile98 I think you need to set the --max-model-len to a larger number, like 8192. BTW, you may check the log file to locate the issues.

@ytxmobile98 ytxmobile98 changed the title Unusual low results on DeepSeek-6.7B-Instruct model (most of the outputs have no valid solution code) Unusual low results on Aider, DeepSeek-6.7B-Instruct model (most of the outputs have no valid solution code) Dec 3, 2024
@ytxmobile98 ytxmobile98 changed the title Unusual low results on Aider, DeepSeek-6.7B-Instruct model (most of the outputs have no valid solution code) Unusual low results on Aider benchmark, DeepSeek-6.7B-Instruct model (most of the outputs have no valid solution code) Dec 3, 2024
@ytxmobile98
Author

@ytxmobile98 I think you need to set the --max-model-len to a larger number, like 8192. BTW, you may check the log file to locate the issues.

Looks like --max-model-len does not help that much. I ran Aider diff mode with the DeepSeek-6.7B-Instruct model and --max-model-len=8192, and the scores of both passes were 1.5, just a little higher than the 0.9 I got the first time.

@ytxmobile98
Author

ytxmobile98 commented Dec 11, 2024

Update 2024-12-11

@cyente @Hambaobao

I have done some further work in the past two days, testing the Qwen2.5-7B-Instruct model and the DeepSeek-Coder-6.7B-Instruct model, and found one key cause:

The benchmarking program relies on the SEARCH/REPLACE blocks to copy code from the chat history and paste it into the *.py files. While the output of the Qwen2.5 model mostly follows the expected format, the DeepSeek model seems to output content as if it were solving a regular coding problem rather than producing a diff.

Example: accumulate
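
For reference, a well-formed response in diff mode should contain a SEARCH/REPLACE block roughly like the following (my own sketch for the accumulate exercise, not the model's actual output; the exact stub may differ):

accumulate.py
<<<<<<< SEARCH
def accumulate(collection, operation):
    pass
=======
def accumulate(collection, operation):
    # apply the operation to every element, preserving order
    return [operation(item) for item in collection]
>>>>>>> REPLACE

When these markers are missing, the harness cannot apply the edit, so the stub files stay unchanged and the test cases fail.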

@ytxmobile98 ytxmobile98 changed the title Unusual low results on Aider benchmark, DeepSeek-6.7B-Instruct model (most of the outputs have no valid solution code) Aider benchmark, DeepSeek-6.7B-Instruct model hardly generates SEARCH/REPLACE blocks, leading to very low pass rates Dec 11, 2024