
Supporting more BERT-like models #89

Closed
wants to merge 10 commits

Conversation

ashvardanian (Contributor)

Hi HF team!

I am extending our UForm repository of multimodal models to support Swift and mobile deployments, and along the way I've noticed that several classes for a broad range of BERT-like models are not yet supported by swift-transformers. So I've added a WordPieceDecoder class and aliases for BertPreTokenizer and BertProcessing.
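For reference, the decoding itself is just the usual WordPiece re-join: continuation pieces carry a "##" prefix that gets stripped and glued onto the previous token. A minimal sketch of the shape of such a decoder (illustrative only, not the exact code in this PR):

    struct WordPieceDecoder {
        let prefix = "##"   // continuation marker used by BERT-style vocabularies

        func decode(tokens: [String]) -> String {
            var output = ""
            for (index, token) in tokens.enumerated() {
                if token.hasPrefix(prefix) {
                    // Continuation piece: strip the marker and append without a space
                    output += String(token.dropFirst(prefix.count))
                } else {
                    if index > 0 { output += " " }
                    output += token
                }
            }
            return output
        }
    }

    // WordPieceDecoder().decode(tokens: ["un", "##believ", "##able"]) == "unbelievable"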

Moreover, as you are well aware, config.json and tokenizer.json come in all shapes and sizes. So I've added fallback mechanisms to handle different tuple orders in vocabulary listings.
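To make the second point concrete: Unigram-style vocabularies in tokenizer.json are lists of two-element entries, and some exports flip the (token, score) order. The fallback amounts to checking the element types instead of trusting the position; roughly (a sketch under that assumption, not the exact code in this PR):

    import Foundation

    // Accepts either ["<token>", -3.2] or [-3.2, "<token>"].
    func parseVocabEntry(_ entry: [Any]) -> (token: String, score: Float)? {
        guard entry.count == 2 else { return nil }
        if let token = entry[0] as? String, let score = entry[1] as? NSNumber {
            return (token, score.floatValue)
        }
        if let score = entry[0] as? NSNumber, let token = entry[1] as? String {
            return (token, score.floatValue)
        }
        return nil
    }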

The current main-dev branch of UForm is already using this functionality from my fork. I am looking into integrating more Hub functionality next. Please let me know what you think about this PR 🤗

@ashvardanian (Contributor, Author)

Another open problem that I've recently discovered is the way strings are compared in Swift. By default, the language compares strings using Unicode canonical equivalence, i.e. normalization-aware comparison. This is great for some applications, but horrible for tokenization, especially with multilingual models. I've solved that by introducing a LiteralString wrapper around String that uses literal comparison:

    import Foundation  // String.compare(_:options:) comes from the NSString bridge

    struct LiteralString: Hashable {
        let value: String

        // Compare code units literally, without Unicode normalization
        static func ==(lhs: LiteralString, rhs: LiteralString) -> Bool {
            return lhs.value.compare(rhs.value, options: .literal) == .orderedSame
        }

        // The default String hash is safe here: literal equality is stricter than
        // canonical equality, so values that compare equal always hash the same
        func hash(into hasher: inout Hasher) {
            hasher.combine(value)
        }
    }
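For illustration, this is the kind of collision it avoids: Swift's default == considers the precomposed "é" (U+00E9) equal to the decomposed "e" followed by U+0301, which can silently merge distinct vocabulary entries; the literal comparison keeps them apart:

    let composed = "\u{00E9}"       // "é" as a single scalar
    let decomposed = "e\u{0301}"    // "e" + combining acute accent
    composed == decomposed                                               // true (canonical equivalence)
    LiteralString(value: composed) == LiteralString(value: decomposed)   // false (literal comparison)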

I believe it should be applicable in other places as well. Let me know what you think, @pcuenca 🤗

@pcuenca (Member) left a comment:


Thanks a lot! Amazing contribution, this is in very good shape so I think we can merge soon.

Sources/Hub/Hub.swift (review thread resolved)
@@ -23,7 +23,7 @@ extension Decoder {

 enum DecoderType: String {
     case Sequence
-    // case WordPiece
+    case WordPiece
Member:

❤️

Comment on lines -25 to +32
-case TemplateProcessing
+case Template
 case ByteLevel
-case RobertaProcessing
+case Bert
+case Roberta
+
+static let BertProcessing = "Bert"
+static let RobertaProcessing = "Roberta"
+static let TemplateProcessing = "Template"
Member:

I'm not sure this is worth doing; I find the aliases and the suffix removal distracting for little benefit. My original approach was to simply use the same names that appear in the JSON, so someone reading both could easily match them.

Contributor (Author):

It's very hard to use the JSON as the ground truth, as the files come in all shapes and sizes.

The shorter names were needed for our models to work, but I've added the static variables for backward compatibility, to avoid breaking the library for other users.

Member:

I'd say the json names used in transformers should be quite stable right now. Why do your models require shorter names? (just curious)

Contributor (Author):

I guess no model "requires" specific names, and JSONs can always be changed, but it generally results in a snowballing set of changes that have to be applied on every platform...

In UForm I have identical tests that run the same models on the same data across all three languages and across the ONNX, PyTorch, and CoreML backends. You can check them here.

If a certain behavior is standard in the more popular ports of the library (Python and JS), I assume Hugging Face may want to provide the same behavior here to encourage adoption. A lot of people would probably appreciate the portability 🤗

Comment on lines -134 to +144

Member:

If we could keep the empty lines empty that'd be awesome. Otherwise no big deal, we can address all those issues in the style PR.

Comment on lines -62 to +73
"BertTokenizer" : BertTokenizer.self,
"CodeGenTokenizer" : CodeGenTokenizer.self,
"CodeLlamaTokenizer" : CodeLlamaTokenizer.self,
"FalconTokenizer" : FalconTokenizer.self,
"GemmaTokenizer" : GemmaTokenizer.self,
"GPT2Tokenizer" : GPT2Tokenizer.self,
"LlamaTokenizer" : LlamaTokenizer.self,
"T5Tokenizer" : T5Tokenizer.self,
"WhisperTokenizer" : WhisperTokenizer.self,
"CohereTokenizer" : CohereTokenizer.self,
"PreTrainedTokenizer": BPETokenizer.self
"Bert" : BertTokenizer.self,
"CodeGen" : CodeGenTokenizer.self,
"CodeLlama" : CodeLlamaTokenizer.self,
"Falcon" : FalconTokenizer.self,
"Gemma" : GemmaTokenizer.self,
"GPT2" : GPT2Tokenizer.self,
"Llama" : LlamaTokenizer.self,
"Unigram" : UnigramTokenizer.self,
"T5" : T5Tokenizer.self,
"Whisper" : WhisperTokenizer.self,
"Cohere" : CohereTokenizer.self,
"PreTrained": BPETokenizer.self
Member:

Hmmm, I think I'd rather keep the same names if possible. If I search in the project for "PreTrainedTokenizer" I'd like to see this entry.

if let tokenizerClass = TokenizerModel.knownTokenizers[tokenizerName] {
return try tokenizerClass.init(tokenizerConfig: tokenizerConfig, tokenizerData: tokenizerData, addedTokens: addedTokens)
} else {
// If the direct lookup fails, perform a case-insensitive scan over the keys
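GitHub truncates the hunk above; the fallback presumably boils down to something like the following (a sketch, not necessarily the exact code in the PR; the TokenizerError enum name is an assumption, though the unsupportedTokenizer case is quoted in a report further down this thread):

    let lowercased = tokenizerName.lowercased()
    if let match = TokenizerModel.knownTokenizers.first(where: { $0.key.lowercased() == lowercased }) {
        return try match.value.init(tokenizerConfig: tokenizerConfig, tokenizerData: tokenizerData, addedTokens: addedTokens)
    }
    throw TokenizerError.unsupportedTokenizer(tokenizerName)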
Member:

Nice! This is where we may want to drop the Tokenizer suffix, in my opinion.

Comment on lines +53 to +54
// Immediately mapping to `Float` values will result in exception,
// when precision loss is detected. So let's convert to `Double` first.
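For context on the two quoted lines: the failure described is presumably Swift's strict NSNumber bridging (SE-0170), where casting a JSON-decoded number straight to Float is rejected when the value cannot be represented exactly, while casting to Double and narrowing afterwards works. A small illustration (the literal value is made up):

    import Foundation

    let raw: Any = NSNumber(value: -13.5467138290405273)  // e.g. a Unigram score out of JSONSerialization
    let direct = raw as? Float                        // nil: lossy bridging is rejected
    let viaDouble = (raw as? Double).map(Float.init)  // Optional(-13.546714): narrow after the exact cast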
Member:

Awesome

Comment on lines +71 to +72
tokensToIds = Dictionary(uniqueKeysWithValues: vocab.map { $0.token }.enumerated().map { (LiteralString(value: $1), $0) })
bosTokenId = tokensToIds[LiteralString(value: bosToken!)] // May be nil
Member:

maybe worth considering a String extension that returns a LiteralString, for readability? (Just an idea)
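Something along these lines, presumably (a sketch of the suggestion; the property name is made up):

    extension String {
        var literal: LiteralString { LiteralString(value: self) }
    }

    // which would let the two lines above read:
    // tokensToIds = Dictionary(uniqueKeysWithValues: vocab.map { $0.token }.enumerated().map { ($1.literal, $0) })
    // bosTokenId = tokensToIds[bosToken!.literal]  // May be nil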

Contributor (Author):

Maybe 🤔

@ConfuseIous

Any chance of this being merged in soon? I'm trying to use a BERT model and this PR would be a huge help :)

@eemilk

eemilk commented Jul 17, 2024

I was trying out these changes since I have a DistilBERT model I wanted to test, but I'm getting unsupportedTokenizer("Distilbert") thrown on try await AutoTokenizer.from(pretrained: ...).
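(For completeness, the call in question is roughly the following; the checkpoint name is just an example of a DistilBERT model.)

    let tokenizer = try await AutoTokenizer.from(pretrained: "distilbert-base-uncased")
    // throws unsupportedTokenizer("Distilbert")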

Is there a chance support for DistilBERT is also added?

@pcuenca (Member)

pcuenca commented Dec 12, 2024

Superseded by #137

@pcuenca closed this Dec 12, 2024