
Added support for Bert #137

Merged 1 commit into huggingface:main from bert-pre-tokenizer on Oct 30, 2024

Conversation

@jkrukowski (Contributor)

In this PR I took some changes from #89 (which seems to be quite old, and I'm not sure the author is willing to continue). I've tried to achieve the same thing with a more focused set of changes, and I've added some tests.

Side note: if this is good to merge, maybe it would be OK to tag @ashvardanian as a co-author?

@pcuenca (Member) left a comment

Looks good to me, @jkrukowski!

Of course, let's add @ashvardanian as a co-author. Can you submit (or resubmit) your commit appending the following line to the commit description?

Co-authored-by: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com>

(I believe it should work)
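
(For reference, one common way to do this, assuming the branch is already pushed: run git commit --amend, append the Co-authored-by line as the last line of the commit message, then git push --force-with-lease.)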

@jkrukowski (Contributor, Author)

@pcuenca done, I think it worked, thanks!

@piotrkowalczuk

I'm glad it's coming. I have been working on the same thing and was about to submit a PR ;)

(Review thread on the test code:)

        XCTAssertEqual(
            preTokenizer1.preTokenize(text: " Hey, friend , 0 99 what's up? "),
            ["Hey", ",", "friend", ",", "0", "99", "what", "\'", "s", "up", "?"]
        )
    }

The original Rust implementation also had this test case:

        XCTAssertEqual(
            preTokenizer.preTokenize(text: "野口里佳 Noguchi Rika"),
            ["野", "口", "里", "佳", "Noguchi", "Rika"]
        )

@jkrukowski (Contributor, Author)

I noticed that as well, but figured that the architecture here is a bit different and that it should be handled by BertNormalizer. Is this assumption correct, @pcuenca?

@pcuenca (Member)

Yes, I think this is great as it is; we can always iterate if needed.
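
For illustration, here is a minimal sketch of the assumption being discussed (a hypothetical helper, not the actual swift-transformers BertNormalizer implementation): a Bert-style normalizer pads CJK ideographs with spaces, so a plain whitespace split in the pre-tokenizer then yields one token per ideograph.

        // Minimal sketch, assuming space-padding of CJK ideographs:
        // surround each ideograph with spaces so that a subsequent
        // whitespace split separates them into individual tokens.
        func padCJKCharacters(_ text: String) -> String {
            text.map { char -> String in
                guard let scalar = char.unicodeScalars.first else { return String(char) }
                // CJK Unified Ideographs block (one of several CJK ranges).
                let isCJK = (0x4E00...0x9FFF).contains(Int(scalar.value))
                return isCJK ? " \(char) " : String(char)
            }.joined()
        }

        // padCJKCharacters("野口里佳 Noguchi Rika").split(separator: " ")
        //   -> ["野", "口", "里", "佳", "Noguchi", "Rika"]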

@pcuenca (Member) commented on Oct 30, 2024

Merging now, thanks again @jkrukowski, @ashvardanian and @piotrkowalczuk, very much appreciated! 🙌

@pcuenca merged commit 2c68d53 into huggingface:main on Oct 30, 2024 (1 check passed).
@jkrukowski deleted the bert-pre-tokenizer branch on Oct 30, 2024.
@ashvardanian (Contributor) commented on Nov 1, 2024

Glad it helped! Hopefully the other patches can also make it upstream if someone has the ambition to polish them 🤗

PS: Thanks for mentioning me as a co-author!
