Detokenizer fixes #8039
Initial detokenizer state:
|
Real initial state:
|
Add detokenizer checks New generator: ascii_lr_strip New generator: apostrophe Add more vocabs files
Brute force encoding and decoding tests (number of errors,
|
Improvements:
|
I gave it a test run using phi3-mini-instruct
Now I ran the same through llama.cpp tokenization:
Update: I mimicked the python tokenizer by adding that into llama.cpp:
right before
Update2
That would result in identical tokenization:
|
Useful when automating tests: - If you don't know the vocab type in advance. - Differentiate other loading errors.
Using exit() is throwing random exceptions
UNKNOWN and CONTROL are 'special pieces'. Remove space after UNKNOWN and CONTROL. Refactor llama_token_to_piece().
The models baichuan, falcon and mpt have tokenization errors, so detokenization fails too.
|
Not all special tokens, see the attributes:
{
"id": 32007,
"content": "<|end|>",
"single_word": false,
"lstrip": false,
"rstrip": true,
"normalized": false,
"special": true
},
You can see the Lines 5202 to 5217 in e112b61
I tried your example and got a different result!
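To make the effect of those attributes concrete, here is a minimal sketch of how I understand `lstrip`/`rstrip` on an added token: the token also consumes the whitespace on that side when the text is split. `SpecialToken` and `split_on_special` are hypothetical names for illustration only, not llama.cpp or Hugging Face code.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SpecialToken:
    """Toy model of an added token and its strip flags (hypothetical)."""
    content: str
    lstrip: bool = False
    rstrip: bool = False


def split_on_special(text: str, token: SpecialToken) -> List[str]:
    """Split text around a special token, dropping whitespace it strips."""
    left = r"\s*" if token.lstrip else ""
    right = r"\s*" if token.rstrip else ""
    pattern = left + re.escape(token.content) + right
    parts = []
    pos = 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            parts.append(text[pos:match.start()])
        parts.append(token.content)  # the stripped whitespace is not kept
        pos = match.end()
    if pos < len(text):
        parts.append(text[pos:])
    return parts


text = "Hi<|end|>  there"
with_rstrip = split_on_special(text, SpecialToken("<|end|>", rstrip=True))
without_rstrip = split_on_special(text, SpecialToken("<|end|>", rstrip=False))
```

With `rstrip` set, the two spaces after `<|end|>` never reach the tokenizer, so joining the parts back cannot reproduce the input. That is one way information gets lost before detokenization even starts.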
@jaime-m-p
I'll repeat the tests after fixing those issues and reverting my changes. Given your results that's promising. It's a little troublesome that such errors can very easily sneak into a model and it's very hard to notice them, even harder to fix them without blindly recreating the model from originals. |
Hi @jaime-m-p and @cmp-nct, really grateful you both are looking into this! I'm traveling without reliable access to a computer at the moment, but wanted to ask if these fixes now keep stability on retokenization with Phi-3 (i.e. the roundtrip of text -> tokens -> text -> tokens results in the same tokens). The constant whitespace insertion on each cycle was causing serious kv-cache reuse issues on our side and I'm really hopeful that this update resolves it! |
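The roundtrip stability asked about here can be checked with a small harness. This is only a sketch: `roundtrip_history`, `is_stable` and the toy tokenizer functions are hypothetical names, and the "buggy" detokenizer just imitates the growing leading space described above, it is not the actual Phi-3 behaviour.

```python
from typing import Callable, List


def roundtrip_history(text: str,
                      tokenize: Callable[[str], List[str]],
                      detokenize: Callable[[List[str]], str],
                      cycles: int = 3) -> List[List[str]]:
    """Run text -> tokens -> text repeatedly and record the tokens of each cycle."""
    history = []
    for _ in range(cycles):
        tokens = tokenize(text)
        history.append(tokens)
        text = detokenize(tokens)
    return history


def is_stable(history: List[List[str]]) -> bool:
    """True when every cycle produced the same tokens as the first one."""
    return all(tokens == history[0] for tokens in history[1:])


def toy_tokenize(text: str) -> List[str]:
    # Toy tokenizer: split on single spaces (empty pieces mark extra spaces).
    return text.split(" ")


def good_detokenize(tokens: List[str]) -> str:
    return " ".join(tokens)


def buggy_detokenize(tokens: List[str]) -> str:
    # Imitates the bug: a space is prepended on every cycle.
    return " " + " ".join(tokens)


good = roundtrip_history("hello world", toy_tokenize, good_detokenize)
bad = roundtrip_history("hello world", toy_tokenize, buggy_detokenize)
```

In the buggy case every cycle adds one more empty piece at the front, so the token at index 1 keeps changing, which is exactly what breaks kv-cache prefix reuse.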
Detokenize special tokens. Replace errors with '\uFFFD' when detokenizing to 'utf-8'. More edge cases. Better detokenization results check.
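For reference, replacing errors with '\uFFFD' on the Python side of the tests can be done with the standard `errors="replace"` decode mode. A minimal sketch, assuming the token pieces arrive as raw bytes (`pieces_to_text` is a hypothetical helper, not part of the test scripts):

```python
from typing import List


def pieces_to_text(pieces: List[bytes]) -> str:
    """Join raw token pieces and decode, mapping invalid UTF-8 to U+FFFD."""
    return b"".join(pieces).decode("utf-8", errors="replace")


# The euro sign (E2 82 AC) split across two byte-level pieces decodes cleanly.
complete = [b"\xe2\x82", b"\xac"]

# A sequence cut off in the middle of a character cannot be decoded.
truncated = [b"z", b"\xe2\x82"]

complete_text = pieces_to_text(complete)
truncated_text = pieces_to_text(truncated)
```

Decoding the joined bytes rather than each piece separately matters here: a multi-byte character split across byte tokens is only valid once the pieces are concatenated.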
Overall current tokenize and detokenize state. WPM models (bert-bge, jina-v2-en) are still broken, probably due to the unicode NFD normalization. BPE models qwen2, olmo and mpt are probably failing due to the missing unicode NFC normalization. All other BPE and SPM models seem to detokenize properly. Each cell shows the number of tokenization and detokenization errors (up to 10). An empty cell means 0 errors.
|
AutoTokenizer is not completing this roundtrip either for some models.
llama-bpe:
' \x00z \x07z \x0ez \x15z \x1cz z !z "z $z %z &z (z )z *z +z ,z -' # input text
'<|begin_of_text|> \x00z \x07z \x0ez \x15z \x1cz z!z "z $z %z &z (z )z *z +z,z -' # AutoTokenizer
'<|begin_of_text|> \x00z \x07z \x0ez \x15z \x1cz z!z "z $z %z &z (z )z *z +z,z -' # Llama.cpp
phi-3:
' \x00z \x07z \x0ez \x15z \x1cz z !z "z $z %z &z (z )z *z +z ,z -' # input text
'<s> \x00z \x07z \x0ez \x15z \x1cz z !z "z $z %z &z (z )z *z +z ,z -' # AutoTokenizer
'<s> \x00z \x07z \x0ez \x15z \x1cz z !z "z $z %z &z (z )z *z +z ,z -' # Llama.cpp
llama-bpe removes spaces before some punctuation characters. Re-tokenization is different. Probably a few models can achieve this, but information can be lost in tokenization (normalization, lstrip, rstrip, etc.).
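The space removal before punctuation seen with llama-bpe can be reproduced with a few string replacements. This is a simplified sketch of the kind of cleanup that `clean_up_tokenization_spaces` enables, assuming only the four punctuation rules below (the Hugging Face version also handles contractions, which are left out here); `clean_up_spaces` is a hypothetical name.

```python
def clean_up_spaces(text: str) -> str:
    """Drop the space before a few punctuation marks (simplified rule set)."""
    for src, dst in ((" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")):
        text = text.replace(src, dst)
    return text


original = 'z !z "z +z ,z -'
cleaned = clean_up_spaces(original)

# Applying the cleanup again changes nothing, but the original spacing
# cannot be recovered from the cleaned text.
cleaned_twice = clean_up_spaces(cleaned)
```

Because the cleanup is lossy, text -> tokens -> text cannot return the input for such strings, so a byte-exact roundtrip is not a realistic target for these vocabs.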
Hmm, great point. I think what I'm really hoping for is eventual stability on the second or third tokenize/detokenize cycles -- before your PR, Phi-3 had the problem of constantly changing the token_id at index 1 (due to growing spaces), which really caused issues. I think this set of changes is good enough to solve most of our problems :). |
No more improvements in detokenizing. All remaining detokenization problems are related to NFD/NFC normalization (MPT, OLMO, QWEN2 and all WPM models). BAICHUAN has other kinds of problems, maybe due to unexpected byte tokens in the vocab. Also tested the recently added GEMMA, VIKING and JAIS models. Each cell shows the number of tokenization and detokenization errors (up to 10). An empty cell means 0 errors.
|
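A short illustration of why NFD/NFC normalization breaks the comparison: the same visible character can be one code point or two, and a tokenizer that normalizes its input returns text in the normalized form, not the original one. The sketch uses only Python's standard `unicodedata` module; `survives_roundtrip` is a hypothetical helper.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single code point (NFC form)
decomposed = "e\u0301"   # 'e' followed by a combining acute accent (NFD form)

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)


def survives_roundtrip(text: str, form: str) -> bool:
    """True if text is already in the given normalization form."""
    return unicodedata.normalize(form, text) == text


composed_survives_nfc = survives_roundtrip(composed, "NFC")
composed_survives_nfd = survives_roundtrip(composed, "NFD")
```

Any brute-force input that is not already in the form the tokenizer applies will be reported as a detokenization error unless both sides normalize the same way before comparing.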
I will resume WPM and NFD/NFC normalizations later. I have been collecting more tokenization and vocab problems. NOTE: I know nothing about SWIFT, hope this 6d233bc is correct. |
style: spaces
Update bruteforce test: add more models
Update bruteforce test: header files location
Better leading space removal
'viking' detokenizer clean spaces
style : remove spaces
Update bruteforce test
Better leading space removal
Symmetric params for llama_tokenize() and llama_detokenize()
Update brute force test: Detokenize special tokens. Replace errors with '\uFFFD' when detokenizing to 'utf-8'. More edge cases. Better detokenization results check.
Bugfix: custom regexs splits undefined unicode codepoints
style: remove trailing whitespace
Do not remove space when decoding special tokens
Fix detokenizer(): UNKNOWN and CONTROL are 'special pieces'. Remove space after UNKNOWN and CONTROL. Refactor llama_token_to_piece().
tests: skip unicode surrogates and undefined
tests: gracefully exit threads. Using exit() is throwing random exceptions
tests: unexpected vocab type as test fail instead of error. Useful when automating tests: - If you don't know in advance the vocab type. - Differentiate other loading errors.
Add tokenizer flag: clean_up_tokenization_spaces
Remove previous space
Remove previous space
Fix add_space_prefix, set false by default
Update bruteforce random tests: Add detokenizer checks. New generator: ascii_lr_strip. New generator: apostrophe. Add more vocabs files.
Clean old known problematic codepoints
minor: confusing hexadecimal codepoint
Fix tokenizer tests
Using llama_tokenize() in tests
Using llama_tokenize() in tests
Add llama_detokenize()
Nice work
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Oh, sorry. I forgot to change commit title adding the module. |
* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()
* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Unexpected vocab type as test fail instead of error
    - Useful when automating tests:
      - If you don't know in advance the vocab type
      - Differentiate other loading errors
  - Skip unicode surrogates and undefined
  - Gracefully exit threads
    - Using exit() is throwing random exceptions
  - Clean old known problematic codepoints
  - Minor: confusing hexadecimal codepoint
* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check
* Fix add_space_prefix, set false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexs splits undefined unicode codepoints
* 'viking' detokenizer clean spaces
This PR tries to solve the most common problems with detokenization:
Related issues: #8023, #7938.