-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for Bert #137
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, @jkrukowski!
Of course, let's add @ashvardanian as a co-author. Can you submit (or resubmit) your commit appending the following line to the commit description?
Co-authored-by: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com>
(I believe it should work)
Co-authored-by: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com>
597756c
to
65054fe
Compare
@pcuenca done, I think it worked, thanks! |
I'm glad it's coming. I have been working on the same thing and was about to submit a PR ;) |
preTokenizer1.preTokenize(text: " Hey, friend , 0 99 what's up? "), | ||
["Hey", ",", "friend", ",", "0", "99", "what", "\'", "s", "up", "?"] | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original Rust implementation also had this test case scenario:
XCTAssertEqual(
preTokenizer.preTokenize(text: "野口里佳 Noguchi Rika"),
["野", "口", "里", "佳", "Noguchi", "Rika"]
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that as well, but figured that the architecture here is a bit different and it should be handled by BertNormalizer
, is this assumption correct @pcuenca?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this is great as it is, we can always iterate if needed.
Merging now, thanks again @jkrukowski, @ashvardanian and @piotrkowalczuk, very much appreciated! 🙌 |
Glad it helped! Hopefully the other patches can also make it to the upstream if someone has the ambition to polish them 🤗 PS: Thanks for mentioning as a co-author! |
In this PR I took some changes from #89 (which seems to be quite old and I'm not sure if the author is willing to continue). I've tried to achieve the same with more focused set of changes and I've added some tests.
Side note: if this is good to merge maybe it's ok to tag @ashvardanian as a co-author?