Fix embedding use of tokenizer to avoid already borrowed #369
Conversation
Under high concurrency, the fast (Rust) tokenizer hits "Already borrowed" exceptions when the truncation and padding (and max length) params are changed, so do not change those. Instead, detect where to truncate (or return an error message) using the default settings plus some extra code, truncating the texts and re-tokenizing when needed. In addition, the error message, which used to report the number of tokens over the limit (but did not say which sentence), now returns the index(es) of the sentence(s) found to be too long.

Signed-off-by: Mark Sturdevant <mark.sturdevant@ibm.com>
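A minimal sketch of that detect-then-truncate flow, assuming a Hugging Face fast tokenizer; the model name, function name, and variables here are illustrative, not the PR's actual code:

```python
from transformers import AutoTokenizer

# Hypothetical model name; any fast (Rust-backed) tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
max_len = tokenizer.model_max_length

def tokenize_checked(texts, truncate=False):
    # First pass with default settings only, so the Rust tokenizer's shared
    # truncation/padding state is never mutated (the "Already borrowed" trigger).
    encodings = tokenizer(texts)
    too_long = [i for i, ids in enumerate(encodings["input_ids"]) if len(ids) > max_len]
    if not too_long:
        return encodings
    if not truncate:
        # Report which sentences were too long, not just how far over the limit.
        raise ValueError(f"Inputs at index(es) {too_long} exceed {max_len} tokens")
    # Cut the offending raw texts using character offsets, then re-tokenize.
    # (A real implementation must also account for special tokens.)
    truncated = list(texts)
    for i in too_long:
        end_char = encodings[i].offsets[max_len - 1][1]
        truncated[i] = texts[i][:end_char]
    return tokenizer(truncated)
```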
Force-pushed from f07a634 to 1c09a02
The tokenizer is not completely thread-safe. This change copies the tokenizer for each thread (per model) to avoid future problems related to this. It eats up some memory, but a small amount compared to the models. Note: for the embeddings server, we've been recommending 5 threads because more is slower anyway.

Signed-off-by: Mark Sturdevant <mark.sturdevant@ibm.com>
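A sketch of that per-thread copy pattern, assuming a Hugging Face tokenizer held inside each wrapped model instance; the class and attribute names are illustrative, not the PR's actual identifiers:

```python
import threading
from copy import deepcopy

class EmbeddingModule:
    """Hypothetical wrapper; the per-thread tokenizer map is the point."""

    def __init__(self, tokenizer):
        self._base_tokenizer = tokenizer
        self._tokenizers = {}  # thread id -> that thread's private tokenizer copy

    def _get_tokenizer(self):
        # Each thread works on its own copy, so no two threads ever touch the
        # Rust tokenizer's mutable state concurrently.
        thread_id = threading.get_ident()
        if thread_id not in self._tokenizers:
            self._tokenizers[thread_id] = deepcopy(self._base_tokenizer)
        return self._tokenizers[thread_id]

    def tokenize(self, texts):
        return self._get_tokenizer()(texts)
```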
```python
return self.tokenizer(
```

```python
# Keep copies of tokenizer per thread (in each wrapped model instance)
thread_id = threading.get_ident()
```
One other primitive to take a look at here is threading.local.
I tried using threading.local instead of a map, but realized that each thread would still need a map to get the correct tokenizer per model. So I preferred a map keyed by thread in the model instance over a map keyed by model name (id?) in thread-local storage. I gave up experimenting with threading.local, but was hoping it had some advantage here that I did not get to. Is a map in thread local better somehow for this?
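For comparison, a sketch of the threading.local variant discussed here; as noted, each thread-local object would still need its own per-model map (names are illustrative):

```python
import threading
from copy import deepcopy

_local = threading.local()  # attributes on this object are isolated per thread

def get_tokenizer(model_id, base_tokenizer):
    # Even with threading.local, each thread still needs a dict keyed by
    # model to find the right tokenizer -- a map either way.
    cache = getattr(_local, "tokenizers", None)
    if cache is None:
        cache = _local.tokenizers = {}
    if model_id not in cache:
        cache[model_id] = deepcopy(base_tokenizer)
    return cache[model_id]
```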
Ah, makes sense!
LGTM