
Added run_tokenizer method in caikit_nlp/modules/text_generation/text_generation_local.py #402

Merged

Conversation

@m-misiura (Contributor):

There is a method in text_generation_local.py which is not implemented:

raise NotImplementedError("Tokenization not implemented for local")

I have written a method that takes a string as input and returns the token count (based on the model tokenizer). The output follows the `TokenizationResults` data model.

This method was tested with tox; the tests include checking outputs in the case of:

  1. an empty string
  2. a short sentence
  3. a relatively long input

The implemented tests seem to pass; a rough sketch of them is included below.
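
A rough sketch of what those tests look like (the `model` pytest fixture and the `TokenizationResults` import path are illustrative assumptions, not the exact test code):

from caikit.interfaces.nlp.data_model import TokenizationResults

def test_run_tokenizer_empty_string(model):
    # Even an empty string should produce a valid TokenizationResults;
    # some tokenizers still emit special tokens, so only the type is asserted
    result = model.run_tokenizer("")
    assert isinstance(result, TokenizationResults)

def test_run_tokenizer_short_sentence(model):
    # A short sentence should yield a small, non-zero token count
    result = model.run_tokenizer("This is a test sentence.")
    assert isinstance(result, TokenizationResults)
    assert result.token_count > 0

def test_run_tokenizer_long_input(model):
    # A relatively long input should yield a larger count than a short one
    long_result = model.run_tokenizer("This is a test sentence. " * 100)
    short_result = model.run_tokenizer("This is a test sentence.")
    assert isinstance(long_result, TokenizationResults)
    assert long_result.token_count > short_result.token_count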

@evaline-ju (Collaborator) left a comment:

Thanks for the contribution, @m-misiura! The DCO check is currently failing - could you resolve that with the steps here?

- 🎨 tox formatting
- 🚧 added a test to assert the length of the run_tokenizer output
- 🚧 made a more comprehensive test for the run tokenizer method
Signed-off-by: m-misiura <mmisiura@redhat.com>
@m-misiura force-pushed the run_tokenier_in_text_generation_local branch from cd36806 to 261e1a3 (December 4, 2024 09:33)
@m-misiura (Contributor, Author):

Many thanks for highlighting the missing sign-off, @evaline-ju; I've rebased and added the sign-off accordingly

@evaline-ju (Collaborator) left a comment:

A couple of questions:

@@ -590,7 +591,11 @@ def run_tokenizer(
             TokenizationResults
                 The token count
         """
-        raise NotImplementedError("Tokenization not implemented for local")
+        error.type_check("<NLP48137045E>", str, text=text)
+        tokenized_output = self.model.tokenizer(text)
@evaline-ju (Collaborator):

Wondering if we may want to explicitly include return_attention_mask instead of leaving it to the model default, as in other modules: https://github.com/caikit/caikit-nlp/blob/main/caikit_nlp/modules/text_embedding/embedding.py#L1062 ?

@m-misiura (Contributor, Author):

Very interesting question! The HF documentation mentions that, "If left to the default, will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute."

For consistency and clarity with the other modules, it could be prudent to include return_attention_mask explicitly. The only potential concern is whether setting it explicitly could interfere with the tokenizer's default behaviour, but I think the likelihood of this is low.

Thus, I can change the method to:

def run_tokenizer(
    self,
    text: str,
) -> TokenizationResults:
    """Run tokenization task against the model

    Args:
        text: str
            Text to tokenize
    Returns:
        TokenizationResults
            The token count
    """
    error.type_check("<NLP48137045E>", str, text=text)
    tokenized_output = self.model.tokenizer(text, return_attention_mask=True)
    return TokenizationResults(
        token_count=len(tokenized_output["input_ids"]),
    )
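
For reference, a quick standalone check of what the tokenizer returns with the flag set (the gpt2 checkpoint here is just an illustrative choice):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
output = tokenizer("This is a test sentence.", return_attention_mask=True)
# With the flag set explicitly, the BatchEncoding is guaranteed to carry both keys
print(output["input_ids"])       # token ids for the input text
print(output["attention_mask"])  # one 1 per (non-padded) token
print(len(output["input_ids"]))  # the count run_tokenizer would report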

Let me know how you would like me to proceed

@evaline-ju (Collaborator):

I slightly prefer the consistency with other modules but also don't hold my opinion particularly strongly. The use of this local text generation module has been fairly limited, hence the previous lack of a tokenization implementation. If the "default" would be preferred for your/your users' usage, we can also just keep it as is.

@m-misiura (Contributor, Author):

I've added return_attention_mask=True based on our conversation to ensure consistency with the other modules

short_text = "This is a test sentence."
short_result = model.run_tokenizer(short_text)
assert isinstance(short_result, TokenizationResults)
assert short_result.token_count > 0
@evaline-ju (Collaborator):

Wondering if we could check a particular number here to make sure an expected "token count" is compared [even for this dummy model], rather than just non-zero

@m-misiura (Contributor, Author):

Checking for an expected number, instead of or in addition to a non-zero condition, seems like a good idea to increase the robustness of the test.

What are your thoughts on rewriting the test like this:

assert short_result.token_count == len(model.model.tokenizer.encode(short_text))
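
Put together, the short-sentence test would then read (same dummy model and imports as in the snippet above):

short_text = "This is a test sentence."
short_result = model.run_tokenizer(short_text)
assert isinstance(short_result, TokenizationResults)
# Compare against the tokenizer's own encoding of the same text, so the
# expected count tracks whatever vocabulary the dummy model actually uses
assert short_result.token_count == len(model.model.tokenizer.encode(short_text))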

@evaline-ju (Collaborator):

I think that works

@m-misiura (Contributor, Author):

Great -- changes to tests have been implemented accordingly and they seem to pass

…d number instead of checking if length is non-zero; added `return_attention_mask=True` in the `run_tokenizer` method

Signed-off-by: m-misiura <mmisiura@redhat.com>
@evaline-ju (Collaborator) left a comment:

LGTM - thanks for the updates!

@evaline-ju merged commit cd44077 into caikit:main on Dec 4, 2024
5 checks passed