-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds tokenization task for embeddings #365
Conversation
TokenizationResults | ||
The token count | ||
""" | ||
result = self.model.tokenizer.encode_plus(text, return_offsets_mapping=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
encode_plus
is deprecated: https://huggingface.co/docs/transformers/v4.41.3/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode_plus
Can we use __call__
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes themselves mainly LGTM, but let's add a small unit test to make sure we have expected results/prevent possible future breakages?
5fd8f44
to
0822f3f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge caused an indexing miss. See inline
] | ||
tokens = [Token(start=i[0], end=i[1], text=text[i[0] : i[1]]) for i in mapping] | ||
|
||
return TokenizationResults(token_count=len(result.input_ids), results=tokens) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs index zero because we wrapped the text in a list for _get_tokenized().
return TokenizationResults(token_count=len(result.input_ids), results=tokens) | |
return TokenizationResults(token_count=len(result.input_ids[0), results=tokens) |
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com> Signed-off-by: Flávia Béo <119421251+flaviabeo@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Dismissing on maintainer availability, the requested change appears addressed.
This implements the
TokenizeTask
for theEmbeddings
module. Works for all the models of type SentenceTransformer.cc: @mynhardtburger @markstur