
Added support for Bert #137

Merged 1 commit into huggingface:main from bert-pre-tokenizer on Oct 30, 2024

Conversation

@jkrukowski (Contributor)

In this PR I took some changes from #89 (which seems to be quite old, and I'm not sure the author is willing to continue). I've tried to achieve the same thing with a more focused set of changes, and I've added some tests.

Side note: if this is good to merge, maybe it would be OK to tag @ashvardanian as a co-author?

@pcuenca (Member) left a comment

Looks good to me, @jkrukowski!

Of course, let's add @ashvardanian as a co-author. Can you submit (or resubmit) your commit appending the following line to the commit description?

Co-authored-by: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com>

(I believe it should work)
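
(For reference, one common way to do this, assuming the branch is already pushed: run git commit --amend, append the Co-authored-by line as the last line of the commit message, then git push --force-with-lease.)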

@jkrukowski (Contributor, Author)

@pcuenca done, I think it worked, thanks!

@piotrkowalczuk

I'm glad it's coming. I have been working on the same thing and was about to submit a PR ;)

(Review thread on the test code:)

        XCTAssertEqual(
            preTokenizer1.preTokenize(text: " Hey, friend , 0 99 what's up? "),
            ["Hey", ",", "friend", ",", "0", "99", "what", "\'", "s", "up", "?"]
        )
    }

The original Rust implementation also had this test case:

        XCTAssertEqual(
            preTokenizer.preTokenize(text: "野口里佳 Noguchi Rika"),
            ["野", "口", "里", "佳", "Noguchi", "Rika"]
        )

@jkrukowski (Contributor, Author)

I noticed that as well, but figured that the architecture here is a bit different and that it should be handled by BertNormalizer. Is this assumption correct, @pcuenca?

@pcuenca (Member)

Yes, I think this is great as it is; we can always iterate if needed.
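
For illustration, here is a minimal sketch of the assumption being discussed (a hypothetical helper, not the actual swift-transformers BertNormalizer implementation): a Bert-style normalizer pads CJK ideographs with spaces, so a plain whitespace split in the pre-tokenizer then yields one token per ideograph.

        // Minimal sketch, assuming space-padding of CJK ideographs:
        // surround each ideograph with spaces so that a subsequent
        // whitespace split separates them into individual tokens.
        func padCJKCharacters(_ text: String) -> String {
            text.map { char -> String in
                guard let scalar = char.unicodeScalars.first else { return String(char) }
                // CJK Unified Ideographs block (one of several CJK ranges).
                let isCJK = (0x4E00...0x9FFF).contains(Int(scalar.value))
                return isCJK ? " \(char) " : String(char)
            }.joined()
        }

        // padCJKCharacters("野口里佳 Noguchi Rika").split(separator: " ")
        //   -> ["野", "口", "里", "佳", "Noguchi", "Rika"]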

@pcuenca (Member) commented on Oct 30, 2024

Merging now, thanks again @jkrukowski, @ashvardanian and @piotrkowalczuk, very much appreciated! 🙌

@pcuenca merged commit 2c68d53 into huggingface:main on Oct 30, 2024 (1 check passed).
@jkrukowski deleted the bert-pre-tokenizer branch on Oct 30, 2024.
@ashvardanian (Contributor) commented on Nov 1, 2024

Glad it helped! Hopefully the other patches can also make it upstream if someone has the ambition to polish them 🤗

PS: Thanks for mentioning me as a co-author!
