Why do I have a lot of code> in generated Java code? What should I do to get rid of them? #142

Open
ytxmobile98 opened this issue Oct 25, 2024 · 3 comments

ytxmobile98 commented Oct 25, 2024

I was running a code completion evaluation with codefuseEval on the Qwen2.5-Coder base model. When I ran the Java evaluation, I saw a lot of <|fim_prefix|> and code> markers in the generated code. Following issue #99, I added the special tokens to the tokenizer as follows:

        tokenizer = AutoTokenizer.from_pretrained(
            path, trust_remote_code=True, use_fast=False, legacy=False)

        # Register the FIM and repo-level special tokens (see issue #99)
        add_special_tokens = ["<|file_sep|>", "<|fim_pad|>",
                              "<|fim_prefix|>", "<|fim_suffix|>",
                              "<|fim_middle|>", "<|repo_name|>"]
        tokenizer.add_special_tokens({"additional_special_tokens": add_special_tokens},
                                     replace_additional_special_tokens=False)
        # Use <|file_sep|> as the end-of-sequence token
        tokenizer.eos_token = "<|file_sep|>"
        tokenizer.eos_token_id = 151664

After adding the lines above to the evaluation code and rerunning the evaluation, the <|fim_prefix|> markers are gone, but the code> markers are still there.

What do I need to do to get rid of the code> markers?
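
One possible stopgap, sketched here only as an assumption and not as a confirmed fix from the maintainers: post-process each decoded completion and cut it at the first stray marker. The helper name `truncate_at_markers` and the `STOP_MARKERS` list are hypothetical; they are not part of codefuseEval or Qwen2.5-Coder.

```python
from typing import Iterable

# Hypothetical marker list: the special tokens registered above plus the
# stray "code>" text reported in this issue.
STOP_MARKERS = (
    "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>",
    "<|fim_pad|>", "<|file_sep|>", "<|repo_name|>", "code>",
)


def truncate_at_markers(text: str, markers: Iterable[str] = STOP_MARKERS) -> str:
    """Return `text` cut off at the earliest occurrence of any marker."""
    cut = len(text)
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


if __name__ == "__main__":
    sample = "return a + b;\n}\ncode>\nclass Foo {}"
    print(truncate_at_markers(sample))
```

This only hides the symptom. Note that matching the bare string `code>` would also truncate legitimate Javadoc that contains `<code>` or `</code>`, so the marker list may need adjusting for real evaluation data.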


code

ytxmobile98 changed the title from "Why do I have a lot of code> in generated Java code?" to "Why do I have a lot of code> in generated Java code? What should I do to get rid of them?" on Oct 25, 2024
@cyente
Collaborator

cyente commented Nov 1, 2024

It is weird, let me try the samples.

@ytxmobile98
Author

it is weird, let me try the samples

@cyente Did you see anything unusual when you tried the samples?
