Added run_tokenizer method in caikit_nlp/modules/text_generation/text_generation_local.py #402

Merged
Changes from 1 commit

caikit_nlp/modules/text_generation/text_generation_local.py
@@ -55,6 +55,7 @@

TRAINING_LOSS_LOG_FILENAME = "training_logs.jsonl"


# pylint: disable=too-many-lines,too-many-instance-attributes
@module(
id="f9181353-4ccf-4572-bd1e-f12bcda26792",
@@ -590,7 +591,11 @@ def run_tokenizer(
            TokenizationResults
                The token count
        """
-        raise NotImplementedError("Tokenization not implemented for local")
+        error.type_check("<NLP48137045E>", str, text=text)
+        tokenized_output = self.model.tokenizer(text)
Collaborator

Wondering if we may want to specifically include return_attention_mask instead of leaving it to the model default, as is done in other modules, e.g. https://github.com/caikit/caikit-nlp/blob/main/caikit_nlp/modules/text_embedding/embedding.py#L1062 ?

Contributor Author


Very interesting question! HF's documentation mentions that, if left to the default, the tokenizer "will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute."

For consistency and clarity with the other modules, it could be prudent to include return_attention_mask explicitly. In that case, the only potential concern would be whether setting return_attention_mask explicitly could interfere with the default behaviour of the tokenizer, but I think the likelihood of this is low.
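
As a quick sanity check on that concern, here is a minimal sketch (not part of this PR) of how the tokenizer's default can be inspected; it assumes a standard Hugging Face tokenizer, and "gpt2" is only an example checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any HF checkpoint works here

# When return_attention_mask is left unset, HF tokenizers typically fall back to
# whether "attention_mask" appears in model_input_names, so passing the flag
# explicitly only overrides that lookup rather than changing tokenization itself.
print(tokenizer.model_input_names)                      # e.g. ['input_ids', 'attention_mask']
print("attention_mask" in tokenizer.model_input_names)  # True for most models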

Thus, I can change the method to:

def run_tokenizer(
        self,
        text: str,
    ) -> TokenizationResults:
        """Run tokenization task against the model

        Args:
            text: str
                Text to tokenize
        Returns:
            TokenizationResults
                The token count
        """
        error.type_check("<NLP48137045E>", str, text=text)
        tokenized_output = self.model.tokenizer(text, return_attention_mask=True)
        return TokenizationResults(
            token_count=len(tokenized_output["input_ids"]),
        )

Let me know how you would like me to proceed.

Collaborator


I slightly prefer the consistency with other modules, but I also don't hold my opinion particularly strongly. The use of this local text generation module has been fairly limited, hence the lack of a tokenization implementation until now. If the "default" would be preferred for your/your users' usage, we can also just keep it as is.

Contributor Author


I've added return_attention_mask=True based on our conversation to ensure consistency with the other modules.
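
For reference, a rough usage sketch of the new method (not part of the diff; the checkpoint path below is a placeholder, and the import path is assumed from the module location named in the title):

from caikit_nlp.modules.text_generation import TextGeneration

model = TextGeneration.bootstrap("path/to/local/hf-model")  # placeholder checkpoint path
result = model.run_tokenizer("Hello world")
# TokenizationResults carries the number of input_ids produced by the model's tokenizer
print(result.token_count)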

+        return TokenizationResults(
+            token_count=len(tokenized_output["input_ids"]),
+        )

################################## Private Functions ######################################

30 changes: 25 additions & 5 deletions tests/modules/text_generation/test_text_generation_local.py
@@ -1,5 +1,6 @@
"""Tests for text-generation module
"""

# Standard
import os
import platform
@@ -10,7 +11,7 @@
import torch

# First Party
-from caikit.interfaces.nlp.data_model import GeneratedTextResult
+from caikit.interfaces.nlp.data_model import GeneratedTextResult, TokenizationResults
import caikit

# Local
@@ -211,7 +212,26 @@ def test_zero_epoch_case(disable_wip):
    assert isinstance(model.model, HFAutoSeq2SeqLM)


-def test_run_tokenizer_not_implemented():
-    with pytest.raises(NotImplementedError):
-        model = TextGeneration.bootstrap(SEQ2SEQ_LM_MODEL)
-        model.run_tokenizer("This text doesn't matter")
+# ############################## Run Tokenizer ################################
+
+
+def test_run_tokenizer_edge_cases(disable_wip, set_cpu_device):
+    """Test tokenizer on edge cases like empty strings and long input."""
+    model = TextGeneration.bootstrap(CAUSAL_LM_MODEL)
+
+    # Edge case: Empty string
+    empty_result = model.run_tokenizer("")
+    assert isinstance(empty_result, TokenizationResults)
+    assert empty_result.token_count == 0
+
+    # Normal case: short sentence
+    short_text = "This is a test sentence."
+    short_result = model.run_tokenizer(short_text)
+    assert isinstance(short_result, TokenizationResults)
+    assert short_result.token_count > 0
Collaborator


Wondering if we could check a particular number here to make sure an expected "token count" is compared [even for this dummy model], rather than just non-zero

Contributor Author


Checking for an expected number, instead of or in addition to a non-zero condition, seems like a good idea to increase the robustness of the test.

What are your thoughts on rewriting the test like this:

assert short_result.token_count == len(model.model.tokenizer.encode(short_text))

Collaborator


I think that works

Contributor Author


Great -- changes to tests have been implemented accordingly and they seem to pass
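
Roughly, the exact-count check now reads like this (sketched here as a standalone test with an illustrative name; not the verbatim committed code):

def test_run_tokenizer_expected_count(disable_wip, set_cpu_device):
    """Token count should match what the model's own tokenizer reports."""
    model = TextGeneration.bootstrap(CAUSAL_LM_MODEL)

    short_text = "This is a test sentence."
    short_result = model.run_tokenizer(short_text)
    assert isinstance(short_result, TokenizationResults)
    # Compare against the underlying HF tokenizer rather than just asserting non-zero
    assert short_result.token_count == len(model.model.tokenizer.encode(short_text))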


+    # Edge case: Long input
+    long_text = "This is a test sentence. " * 1000
+    long_result = model.run_tokenizer(long_text)
+    assert isinstance(long_result, TokenizationResults)
+    assert long_result.token_count > 0