I tried adding some special tokens to the vocabulary of a pretrained model (I made a PR for a minor code fix along the way). When I encode strings, these new tokens are sometimes broken into several tokens instead of being encoded as a single token.
How do I make sure my special tokens always map to the same ID?
Code to reproduce what I am seeing:

import tokenmonster

vocab = tokenmonster.load("englishcode-32000-consistent-v1")
vocab.modify(["<|im_start|>", "<|im_end|>", "<s>"], None, None, 0)
vocab.resize(32000, reset_token_ids=False)

# Tokenize some text
text = [
    "<s>Some text to turn into token IDs. Why is this happening?<|im_end|>",
    "<s>Some text to turn into token IDs. <|im_end|>",
    "<s>Some text to turn into token IDs....<|im_end|>",
]
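A minimal sketch of how I am checking what the special tokens encode to, assuming the tokenize/decode calls from the tokenmonster README (the inspection loop itself is just illustrative):

# Sketch: tokenize each string and decode the IDs one at a time
# to see how the special tokens are being split up.
specials = ["<|im_start|>", "<|im_end|>", "<s>"]

for s in text:
    ids = vocab.tokenize(s)
    # Decode each ID individually to see the pieces the string became.
    pieces = [vocab.decode([i]) for i in ids]
    print(list(ids), pieces)

for sp in specials:
    ids = vocab.tokenize(sp)
    # If the special token were handled atomically, this would be a single ID.
    print(sp, "->", list(ids))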
It's unclear what you're trying to do, what you expect to happen, and what is actually happening. Please provide the output you get and a description of what you expected.