Fix embedding use of tokenizer to avoid already borrowed #369

Merged: 2 commits into caikit:main on Jul 24, 2024

Conversation

markstur (Contributor):

With high concurrency, the fast tokenizer (Rust) hits "Already borrowed" exceptions when the truncation, padding, and max-length params are changed, so this PR does not change them. Instead, it detects where to truncate (or returns an error message) using the default settings plus some extra code, then truncates the texts and re-tokenizes when needed.

In addition, the error message that used to report the number of tokens over the limit (but not which sentence) now returns the index(es) of the sentence(s) found to be too long.
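A minimal sketch of that approach, assuming a Hugging Face fast tokenizer; the helper name and parameters here are illustrative, not the PR's actual code:

```python
from typing import List


def truncate_safely(tokenizer, texts: List[str], max_len: int, truncate: bool):
    """Tokenize with the tokenizer's *default* settings, then truncate
    offending texts by hand and re-tokenize, instead of mutating the
    shared truncation/padding/max-length state."""
    # First pass: default settings only, so concurrent calls never fight
    # over mutable state inside the Rust tokenizer.
    encodings = tokenizer(texts, return_offsets_mapping=True)

    too_long = [
        i for i, ids in enumerate(encodings["input_ids"]) if len(ids) > max_len
    ]
    if too_long and not truncate:
        # Report *which* texts are over the limit, not just by how much.
        raise ValueError(
            f"Text(s) at index(es) {too_long} exceed the limit of {max_len} tokens"
        )

    for i in too_long:
        # Find where the max_len-th token ends in the raw text, cut there,
        # and re-tokenize. (Special tokens map to (0, 0) offsets; a real
        # implementation has to account for them.)
        end_char = encodings["offset_mapping"][i][max_len - 1][1]
        texts[i] = texts[i][:end_char]

    return tokenizer(texts) if too_long else encodings
```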

@markstur markstur force-pushed the tokenizer_already_borrowed branch from f07a634 to 1c09a02 on July 22, 2024 05:28
Tokenizer is not completely thread-safe. This change
copies the tokenizer for each thread (per model) to
avoid future problems related to this.

It eats up some memory, but that is small compared to the models.

Note: For the embeddings server, we've been recommending 5 threads because more is slower anyway.

Signed-off-by: Mark Sturdevant <mark.sturdevant@ibm.com>
The review comments below are anchored on these diff lines:

    return self.tokenizer(

    # Keep copies of tokenizer per thread (in each wrapped model instance)
    thread_id = threading.get_ident()
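Roughly, the pattern those diff lines come from looks like the sketch below. The class shape, the `_tokenizers` dict name, and the `deepcopy` call are assumptions for illustration, not the PR's actual code:

```python
import threading
from copy import deepcopy


class EmbeddingModuleSketch:
    """Wrapped model; each instance keeps one tokenizer copy per thread."""

    def __init__(self, tokenizer):
        self._base_tokenizer = tokenizer
        self._tokenizers = {}  # thread id -> private tokenizer copy

    def _get_tokenizer(self):
        # Keep copies of tokenizer per thread (in each wrapped model instance)
        thread_id = threading.get_ident()
        if thread_id not in self._tokenizers:
            self._tokenizers[thread_id] = deepcopy(self._base_tokenizer)
        return self._tokenizers[thread_id]

    def _encode(self, texts, **kwargs):
        # Each thread only ever touches its own copy, so the Rust
        # tokenizer's internal state is never borrowed concurrently.
        return self._get_tokenizer()(texts, **kwargs)
```

Each model instance pays one tokenizer copy per thread, which is the trade-off the commit message describes: some extra memory, but small compared to the model weights.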
Contributor:

One other primitive to take a look at here is threading.local.

markstur (Contributor, Author):

I tried using threading.local instead of a map, but realized that each thread would need a map of its own to get the correct tokenizer per model. So I preferred a map keyed by thread in the model instance over a map keyed by model name (id?) in thread-local storage. I gave up experimenting with threading.local, but was hoping it had some advantage here that I did not get to. Is a map in thread-local storage better somehow for this?
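For comparison, a sketch of the threading.local variant being discussed; `model_id` and the function shape are hypothetical:

```python
import threading
from copy import deepcopy

_local = threading.local()


def get_tokenizer(model_id, base_tokenizer):
    # threading.local gives one namespace per thread, but with several
    # models loaded that namespace still has to hold a dict keyed by
    # model: a map per thread, instead of a map per model instance.
    if not hasattr(_local, "tokenizers"):
        _local.tokenizers = {}
    if model_id not in _local.tokenizers:
        _local.tokenizers[model_id] = deepcopy(base_tokenizer)
    return _local.tokenizers[model_id]
```

Either way a dict is involved; the choice is only whether it lives in the model instance (keyed by thread id) or in thread-local storage (keyed by model).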

Contributor:

Ah, makes sense!

@evaline-ju (Collaborator) left a comment:

LGTM

@evaline-ju evaline-ju merged commit 44d61a5 into caikit:main Jul 24, 2024
5 checks passed