I tried adding some special tokens to the vocabulary of a pretrained model (I made a PR for a minor code fix along the way). When I encode strings, these new tokens are sometimes broken into several tokens instead of being encoded as a single token.
How do I make sure my special tokens always map to the same ID?
Code to reproduce what I am seeing:

import tokenmonster

vocab = tokenmonster.load("englishcode-32000-consistent-v1")
vocab.modify(["<|im_start|>", "<|im_end|>", "<s>"], None, None, 0)
vocab.resize(32000, reset_token_ids=False)

# Tokenize some text
text = [
    "<s>Some text to turn into token IDs. Why is this happening?<|im_end|>",
    "<s>Some text to turn into token IDs. <|im_end|>",
    "<s>Some text to turn into token IDs....<|im_end|>",
]
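A minimal sketch of how I am checking what the special tokens encode to, assuming the tokenize/decode calls from the tokenmonster README (the inspection loop itself is just illustrative):

# Sketch: tokenize each string and decode the IDs one at a time
# to see how the special tokens are being split up.
specials = ["<|im_start|>", "<|im_end|>", "<s>"]

for s in text:
    ids = vocab.tokenize(s)
    # Decode each ID individually to see the pieces the string became.
    pieces = [vocab.decode([i]) for i in ids]
    print(list(ids), pieces)

for sp in specials:
    ids = vocab.tokenize(sp)
    # If the special token were handled atomically, this would be a single ID.
    print(sp, "->", list(ids))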
It's unclear what you're trying to do, what you expect to happen, and what is actually happening. Please provide the output you get and a description of what you expected.