Added run_tokenizer method in caikit_nlp/modules/text_generation/text_generation_local.py #402
@@ -1,5 +1,6 @@
 """Tests for text-generation module
 """
 # Standard
 import os
 import platform
@@ -10,7 +11,7 @@
 import torch

 # First Party
-from caikit.interfaces.nlp.data_model import GeneratedTextResult
+from caikit.interfaces.nlp.data_model import GeneratedTextResult, TokenizationResults
 import caikit

 # Local
@@ -211,7 +212,26 @@ def test_zero_epoch_case(disable_wip):
     assert isinstance(model.model, HFAutoSeq2SeqLM)


-def test_run_tokenizer_not_implemented():
-    with pytest.raises(NotImplementedError):
-        model = TextGeneration.bootstrap(SEQ2SEQ_LM_MODEL)
-        model.run_tokenizer("This text doesn't matter")
+# ############################## Run Tokenizer ################################
+
+
+def test_run_tokenizer_edge_cases(disable_wip, set_cpu_device):
+    """Test tokenizer on edge cases like empty strings and long input."""
+    model = TextGeneration.bootstrap(CAUSAL_LM_MODEL)
+
+    # Edge case: Empty string
+    empty_result = model.run_tokenizer("")
+    assert isinstance(empty_result, TokenizationResults)
+    assert empty_result.token_count == 0
+
+    # Normal case: short sentence
+    short_text = "This is a test sentence."
+    short_result = model.run_tokenizer(short_text)
+    assert isinstance(short_result, TokenizationResults)
+    assert short_result.token_count > 0
Reviewer: Wondering if we could check a particular number here to make sure an expected "token count" is compared [even for this dummy model], rather than just non-zero.

Author: Checking for an expected number, instead of or in addition to a non-zero condition, seems like a good idea to increase the robustness of the test. What are your thoughts on re-writing the test along the lines of the sketch below?
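The snippet actually proposed in the comment above is not shown in the thread. The following is only an illustrative sketch of an exact-count check, reusing the CAUSAL_LM_MODEL constant and fixtures from the existing test file; the test name, the use of transformers.AutoTokenizer, and the assumption that run_tokenizer counts tokens the same way a plain tokenizer call does are all assumptions, not part of the PR:

```python
# Third Party
from transformers import AutoTokenizer


def test_run_tokenizer_expected_count(disable_wip, set_cpu_device):
    """Illustrative sketch (not the snippet from the review): pin an exact
    token count instead of only asserting a non-zero value."""
    model = TextGeneration.bootstrap(CAUSAL_LM_MODEL)
    text = "This is a test sentence."

    # Derive the expected count from the same underlying tokenizer so the
    # test does not hard-code a number tied to one particular vocabulary.
    hf_tokenizer = AutoTokenizer.from_pretrained(CAUSAL_LM_MODEL)
    expected_count = len(hf_tokenizer(text)["input_ids"])

    result = model.run_tokenizer(text)
    assert isinstance(result, TokenizationResults)
    assert result.token_count == expected_count
```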
Reviewer: I think that works.

Author: Great -- changes to tests have been implemented accordingly and they seem to pass.
Diff, continued:

+
+    # Edge case: Long input
+    long_text = "This is a test sentence. " * 1000
+    long_result = model.run_tokenizer(long_text)
+    assert isinstance(long_result, TokenizationResults)
+    assert long_result.token_count > 0
Reviewer: Wondering if we may want to specifically include return_attention_mask instead of leaving it to the model default, as is done in other modules: https://github.com/caikit/caikit-nlp/blob/main/caikit_nlp/modules/text_embedding/embedding.py#L1062 ?

Author: Very interesting question! The HF documentation mentions that, if left to the default, the tokenizer will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute. For consistency and clarity with the other modules, it could be prudent to include return_attention_mask explicitly. The only potential concern would be whether setting return_attention_mask explicitly could interfere with the tokenizer's default behaviour, but I think the likelihood of this is low. Thus, I can change the method along the lines of the sketch below. Let me know how you would like me to proceed.
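The exact change proposed in the comment is not shown in the thread. This is only a rough sketch of what passing return_attention_mask=True explicitly could look like; it assumes the module exposes its Hugging Face tokenizer as self.model.tokenizer, which may differ from the actual attribute name in text_generation_local.py:

```python
def run_tokenizer(self, text: str) -> TokenizationResults:
    """Run tokenization on the input text (hypothetical sketch only)."""
    # Request the attention mask explicitly, for consistency with the
    # text_embedding module, instead of relying on the tokenizer default.
    tokenized = self.model.tokenizer(text, return_attention_mask=True)
    return TokenizationResults(token_count=len(tokenized["input_ids"]))
```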
Reviewer: I slightly prefer the consistency with other modules, but I also don't hold my opinion particularly strongly. The use of this local text generation module has been fairly limited, hence the lack of a particular tokenization implementation previously. If the "default" would be preferred for your/your users' usage, we can also just keep it as is.

Author: I've added return_attention_mask=True based on our conversation to ensure consistency with the other modules.