
Added run_tokenizer method in caikit_nlp/modules/text_generation/text_generation_local.py #402

Merged

Conversation

@m-misiura (Contributor):

There is a method in text_generation_local.py which is not implemented:

raise NotImplementedError("Tokenization not implemented for local")

I have written a method that takes a string as input and returns the token count (based on the model tokenizer). The output follows the `TokenizationResults` data model.

This method was tested with tox; the tests include checking outputs in the case of:

  1. an empty string
  2. a short sentence
  3. a relatively long input

The implemented tests seem to pass; a rough sketch of them is included below.
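
A rough sketch of what those tests look like (the `model` pytest fixture and the `TokenizationResults` import path are illustrative assumptions, not the exact test code):

from caikit.interfaces.nlp.data_model import TokenizationResults

def test_run_tokenizer_empty_string(model):
    # Even an empty string should produce a valid TokenizationResults;
    # some tokenizers still emit special tokens, so only the type is asserted
    result = model.run_tokenizer("")
    assert isinstance(result, TokenizationResults)

def test_run_tokenizer_short_sentence(model):
    # A short sentence should yield a small, non-zero token count
    result = model.run_tokenizer("This is a test sentence.")
    assert isinstance(result, TokenizationResults)
    assert result.token_count > 0

def test_run_tokenizer_long_input(model):
    # A relatively long input should yield a larger count than a short one
    long_result = model.run_tokenizer("This is a test sentence. " * 100)
    short_result = model.run_tokenizer("This is a test sentence.")
    assert isinstance(long_result, TokenizationResults)
    assert long_result.token_count > short_result.token_count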

@evaline-ju (Collaborator) left a comment:

Thanks for the contribution, @m-misiura! The DCO check is currently failing - could you resolve that with the steps here?

- 🎨 tox formatting
- 🚧 added a test to assert the length of the run_tokenizer output
- 🚧 made a more comprehensive test for the run tokenizer method
Signed-off-by: m-misiura <mmisiura@redhat.com>
@m-misiura force-pushed the run_tokenier_in_text_generation_local branch from cd36806 to 261e1a3 (December 4, 2024 09:33)
@m-misiura (Contributor, Author):

Many thanks for highlighting the missing sign-off, @evaline-ju; I've rebased and added the sign-off accordingly

@evaline-ju (Collaborator) left a comment:

A couple of questions:

@@ -590,7 +591,11 @@ def run_tokenizer(
             TokenizationResults
                 The token count
         """
-        raise NotImplementedError("Tokenization not implemented for local")
+        error.type_check("<NLP48137045E>", str, text=text)
+        tokenized_output = self.model.tokenizer(text)
@evaline-ju (Collaborator):

Wondering if we may want to explicitly include return_attention_mask instead of leaving it to the model default, as in other modules: https://github.com/caikit/caikit-nlp/blob/main/caikit_nlp/modules/text_embedding/embedding.py#L1062 ?

@m-misiura (Contributor, Author):

Very interesting question! The HF documentation mentions that, "If left to the default, will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute."

For consistency and clarity with the other modules, it could be prudent to include return_attention_mask explicitly. The only potential concern is whether setting it explicitly could interfere with the tokenizer's default behaviour, but I think the likelihood of this is low.

Thus, I can change the method to:

def run_tokenizer(
    self,
    text: str,
) -> TokenizationResults:
    """Run tokenization task against the model

    Args:
        text: str
            Text to tokenize
    Returns:
        TokenizationResults
            The token count
    """
    error.type_check("<NLP48137045E>", str, text=text)
    tokenized_output = self.model.tokenizer(text, return_attention_mask=True)
    return TokenizationResults(
        token_count=len(tokenized_output["input_ids"]),
    )
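
For reference, a quick standalone check of what the tokenizer returns with the flag set (the gpt2 checkpoint here is just an illustrative choice):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
output = tokenizer("This is a test sentence.", return_attention_mask=True)
# With the flag set explicitly, the BatchEncoding is guaranteed to carry both keys
print(output["input_ids"])       # token ids for the input text
print(output["attention_mask"])  # one 1 per (non-padded) token
print(len(output["input_ids"]))  # the count run_tokenizer would report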

Let me know how you would like me to proceed

@evaline-ju (Collaborator):

I slightly prefer the consistency with other modules but also don't hold my opinion particularly strongly. The use of this local text generation module has been fairly limited, hence the previous lack of a tokenization implementation. If the "default" would be preferred for your/your users' usage, we can also just keep it as is.

@m-misiura (Contributor, Author):

I've added return_attention_mask=True based on our conversation to ensure consistency with the other modules

short_text = "This is a test sentence."
short_result = model.run_tokenizer(short_text)
assert isinstance(short_result, TokenizationResults)
assert short_result.token_count > 0
@evaline-ju (Collaborator):

Wondering if we could check a particular number here to make sure an expected "token count" is compared [even for this dummy model], rather than just non-zero

@m-misiura (Contributor, Author):

Checking for an expected number, instead of or in addition to a non-zero condition, seems like a good idea to increase the robustness of the test.

What are your thoughts on rewriting the test like this:

assert short_result.token_count == len(model.model.tokenizer.encode(short_text))
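
Put together, the short-sentence test would then read (same dummy model and imports as in the snippet above):

short_text = "This is a test sentence."
short_result = model.run_tokenizer(short_text)
assert isinstance(short_result, TokenizationResults)
# Compare against the tokenizer's own encoding of the same text, so the
# expected count tracks whatever vocabulary the dummy model actually uses
assert short_result.token_count == len(model.model.tokenizer.encode(short_text))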

@evaline-ju (Collaborator):

I think that works

@m-misiura (Contributor, Author):

Great -- changes to tests have been implemented accordingly and they seem to pass

…d number instead of checking if length is non-zero; added `return_attention_mask=True` in the `run_tokenizer` method

Signed-off-by: m-misiura <mmisiura@redhat.com>
@evaline-ju (Collaborator) left a comment:

LGTM - thanks for the updates!

@evaline-ju merged commit cd44077 into caikit:main on Dec 4, 2024
5 checks passed