Added run_tokenizer method in caikit_nlp/modules/text_generation/text_generation_local.py #402
Conversation
Thanks for the contribution, @m-misiura! The DCO check is currently failing - could you resolve that with the steps here?
- 🎨 tox formatting
- 🚧 added a test to assert the length of the run_tokenizer output
- 🚧 made a more comprehensive test for the run tokenizer method

Signed-off-by: m-misiura <mmisiura@redhat.com>
cd36806 to 261e1a3
Many thanks for highlighting the lack of the sign-off @evaline-ju; I've rebased and added the sign-off accordingly.
A couple of questions:
@@ -590,7 +591,11 @@ def run_tokenizer(
             TokenizationResults
                 The token count
         """
-        raise NotImplementedError("Tokenization not implemented for local")
+        error.type_check("<NLP48137045E>", str, text=text)
+        tokenized_output = self.model.tokenizer(text)
Wondering if we may want to specifically include return_attention_mask instead of leaving it to the model default, as in other modules: https://github.com/caikit/caikit-nlp/blob/main/caikit_nlp/modules/text_embedding/embedding.py#L1062 ?
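As a minimal sketch of the distinction being discussed (assuming a standard Hugging Face AutoTokenizer; the tokenizer name below is only illustrative, not part of this PR):

from transformers import AutoTokenizer

# Illustrative tokenizer only; any HF tokenizer would behave similarly.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# With the default, whether "attention_mask" appears depends on the
# tokenizer's own return_outputs configuration.
default_output = tokenizer("This is a test sentence.")

# Passing return_attention_mask=True makes the presence of the key explicit,
# independent of the tokenizer's default.
explicit_output = tokenizer("This is a test sentence.", return_attention_mask=True)

print(default_output.keys())   # typically dict_keys(['input_ids', 'attention_mask'])
print(explicit_output.keys())  # 'attention_mask' is guaranteed to be present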
Very interesting question! The HF documentation mentions that, if left to the default, the tokenizer will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute.
For consistency and clarity with the other modules, it could be prudent to include return_attention_mask explicitly. In that case, the only potential concern would be whether setting return_attention_mask explicitly could interfere with the default behaviour of the tokeniser, but I think the likelihood of this is low.
Thus, I can change the method to:
def run_tokenizer(
    self,
    text: str,
) -> TokenizationResults:
    """Run tokenization task against the model

    Args:
        text: str
            Text to tokenize
    Returns:
        TokenizationResults
            The token count
    """
    error.type_check("<NLP48137045E>", str, text=text)
    tokenized_output = self.model.tokenizer(text, return_attention_mask=True)
    return TokenizationResults(
        token_count=len(tokenized_output["input_ids"]),
    )
Let me know how you would like me to proceed.
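For illustration, a minimal sketch of how the new method would be exercised; the import path, the bootstrap call, and the model name are assumptions for the example, not taken from this PR:

from caikit_nlp.modules.text_generation import TextGeneration

# Assumed setup: bootstrap the local text generation module from some
# HF-compatible model; "gpt2" is only a placeholder here.
model = TextGeneration.bootstrap("gpt2")

result = model.run_tokenizer("This is a test sentence.")

# TokenizationResults carries the count derived from the tokenizer's input_ids.
print(result.token_count)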
I slightly prefer the consistency with other modules but also don't hold my opinion particularly strongly. The use of this local text generation module has been fairly limited, hence the lack of a tokenization implementation until now. If the "default" would be preferred for your/your users' usage, we can also just keep it as is.
I've added return_attention_mask=True based on our conversation to ensure consistency with the other modules.
short_text = "This is a test sentence." | ||
short_result = model.run_tokenizer(short_text) | ||
assert isinstance(short_result, TokenizationResults) | ||
assert short_result.token_count > 0 |
Wondering if we could check a particular number here to make sure an expected "token count" is compared [even for this dummy model], rather than just non-zero
Checking for an expected number instead of or in addition to a non-zero condition seems like a good idea to increase the robustness of the test.
What are your thoughts on rewriting the test like this:
assert short_result.token_count == len(model.model.tokenizer.encode(short_text))
I think that works
Great -- changes to tests have been implemented accordingly and they seem to pass
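Putting the pieces from this thread together, the updated check might read roughly like the sketch below; the pytest fixture wiring and the TokenizationResults import path are assumptions, while the assertions follow the snippets quoted above:

from caikit.interfaces.nlp.data_model import TokenizationResults

def test_run_tokenizer_short_text(model):
    """Token count should match what the underlying tokenizer reports."""
    short_text = "This is a test sentence."
    short_result = model.run_tokenizer(short_text)
    assert isinstance(short_result, TokenizationResults)
    # Compare against the tokenizer itself rather than only asserting non-zero.
    assert short_result.token_count == len(model.model.tokenizer.encode(short_text))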
…d number instead of checking if length is non-zero; added `return_attention_mask=True` in the `run_tokenizer` method
Signed-off-by: m-misiura <mmisiura@redhat.com>
LGTM - thanks for the updates!
There is a method in text_generation_local.py which does not appear to be implemented: caikit-nlp/caikit_nlp/modules/text_generation/text_generation_local.py, line 593 in 56b7e18.
I have written a method which takes a string as input and returns the token count (based on the model tokenizer). The output follows the TokenizationResults data model.
This method was tested with tox; the tests include checking outputs in several cases. The implemented tests seem to pass.
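The only field of TokenizationResults exercised here is token_count; as a rough stand-in for the shape of the return value (not caikit's actual data model definition):

from dataclasses import dataclass

# Illustrative stand-in only; the real TokenizationResults is a caikit data
# model object, but run_tokenizer above only populates token_count.
@dataclass
class TokenizationResults:
    token_count: int = 0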