
Supporting more BERT-like models #89

Closed
wants to merge 10 commits

Conversation

ashvardanian (Contributor)

Hi HF team!

I am extending our UForm repository of multimodal models to support Swift and mobile deployments, and along the way I've noticed that several classes for a broad range of BERT-like models are not yet supported by swift-transformers. So I've added a WordPieceDecoder class and aliases for BertPreTokenizer and BertProcessing.
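For reference, the decoding itself is just the usual WordPiece re-join: continuation pieces carry a "##" prefix that gets stripped and glued onto the previous token. A minimal sketch of the shape of such a decoder (illustrative only, not the exact code in this PR):

    struct WordPieceDecoder {
        let prefix = "##"   // continuation marker used by BERT-style vocabularies

        func decode(tokens: [String]) -> String {
            var output = ""
            for (index, token) in tokens.enumerated() {
                if token.hasPrefix(prefix) {
                    // Continuation piece: strip the marker and append without a space
                    output += String(token.dropFirst(prefix.count))
                } else {
                    if index > 0 { output += " " }
                    output += token
                }
            }
            return output
        }
    }

    // WordPieceDecoder().decode(tokens: ["un", "##believ", "##able"]) == "unbelievable"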

Moreover, as you are well aware, config.json and tokenizer.json come in all shapes and sizes. So I've added fallback mechanisms to handle different tuple orders in vocabulary listings.
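To make the second point concrete: Unigram-style vocabularies in tokenizer.json are lists of two-element entries, and some exports flip the (token, score) order. The fallback amounts to checking the element types instead of trusting the position; roughly (a sketch under that assumption, not the exact code in this PR):

    import Foundation

    // Accepts either ["<token>", -3.2] or [-3.2, "<token>"].
    func parseVocabEntry(_ entry: [Any]) -> (token: String, score: Float)? {
        guard entry.count == 2 else { return nil }
        if let token = entry[0] as? String, let score = entry[1] as? NSNumber {
            return (token, score.floatValue)
        }
        if let score = entry[0] as? NSNumber, let token = entry[1] as? String {
            return (token, score.floatValue)
        }
        return nil
    }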

The current main-dev branch of UForm is already using this functionality from my fork. I am looking into integrating more Hub functionality next. Please let me know what you think about this PR 🤗

@ashvardanian (Contributor, Author)

Another open problem that I've recently discovered is the way strings are compared in Swift. By default, the language compares strings using Unicode canonical equivalence, i.e. normalization-aware comparison. This is great for some applications, but horrible for tokenization, especially with multilingual models. I've solved that by introducing a LiteralString wrapper around String that uses literal comparison:

    import Foundation  // String.compare(_:options:) comes from the NSString bridge

    struct LiteralString: Hashable {
        let value: String

        // Compare code units literally, without Unicode normalization
        static func ==(lhs: LiteralString, rhs: LiteralString) -> Bool {
            return lhs.value.compare(rhs.value, options: .literal) == .orderedSame
        }

        // The default String hash is safe here: literal equality is stricter than
        // canonical equality, so values that compare equal always hash the same
        func hash(into hasher: inout Hasher) {
            hasher.combine(value)
        }
    }
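For illustration, this is the kind of collision it avoids: Swift's default == considers the precomposed "é" (U+00E9) equal to the decomposed "e" followed by U+0301, which can silently merge distinct vocabulary entries; the literal comparison keeps them apart:

    let composed = "\u{00E9}"       // "é" as a single scalar
    let decomposed = "e\u{0301}"    // "e" + combining acute accent
    composed == decomposed                                               // true (canonical equivalence)
    LiteralString(value: composed) == LiteralString(value: decomposed)   // false (literal comparison)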

I believe it should be applicable in other places as well. Let me know what you think, @pcuenca 🤗

@pcuenca (Member) left a comment:


Thanks a lot! Amazing contribution, this is in very good shape so I think we can merge soon.

Sources/Hub/Hub.swift (review thread resolved)
@@ -23,7 +23,7 @@ extension Decoder {

 enum DecoderType: String {
     case Sequence
-    // case WordPiece
+    case WordPiece
Member:

❤️

Comment on lines -25 to +32
-case TemplateProcessing
+case Template
 case ByteLevel
-case RobertaProcessing
+case Bert
+case Roberta
+
+static let BertProcessing = "Bert"
+static let RobertaProcessing = "Roberta"
+static let TemplateProcessing = "Template"
Member:

I'm not sure this is worth doing; I find the aliases and the suffix removal distracting for little benefit. My original approach was to simply use the same names that appear in the JSON, so someone reading both could easily match them.

Contributor (Author):

It's very hard to use the JSON as the ground truth, as the files come in all shapes and sizes.

The shorter names were needed for our models to work, but I've added the static variables for backward compatibility, to avoid breaking the library for other users.

Member:

I'd say the json names used in transformers should be quite stable right now. Why do your models require shorter names? (just curious)

Contributor (Author):

I guess no model "requires" specific names, and JSONs can always be changed, but it generally results in a snowballing set of changes that have to be applied on every platform...

In UForm I have identical tests that run the same models on the same data across all three languages and across the ONNX, PyTorch, and CoreML backends. You can check them here.

If a certain behavior is standard in the more popular ports of the library (Python and JS), I assume Hugging Face may want to provide the same behavior here to encourage adoption. A lot of people would probably appreciate the portability 🤗

Comment on lines -134 to +144

Member:

If we could keep the empty lines empty that'd be awesome. Otherwise no big deal, we can address all those issues in the style PR.

Comment on lines -62 to +73
"BertTokenizer" : BertTokenizer.self,
"CodeGenTokenizer" : CodeGenTokenizer.self,
"CodeLlamaTokenizer" : CodeLlamaTokenizer.self,
"FalconTokenizer" : FalconTokenizer.self,
"GemmaTokenizer" : GemmaTokenizer.self,
"GPT2Tokenizer" : GPT2Tokenizer.self,
"LlamaTokenizer" : LlamaTokenizer.self,
"T5Tokenizer" : T5Tokenizer.self,
"WhisperTokenizer" : WhisperTokenizer.self,
"CohereTokenizer" : CohereTokenizer.self,
"PreTrainedTokenizer": BPETokenizer.self
"Bert" : BertTokenizer.self,
"CodeGen" : CodeGenTokenizer.self,
"CodeLlama" : CodeLlamaTokenizer.self,
"Falcon" : FalconTokenizer.self,
"Gemma" : GemmaTokenizer.self,
"GPT2" : GPT2Tokenizer.self,
"Llama" : LlamaTokenizer.self,
"Unigram" : UnigramTokenizer.self,
"T5" : T5Tokenizer.self,
"Whisper" : WhisperTokenizer.self,
"Cohere" : CohereTokenizer.self,
"PreTrained": BPETokenizer.self
Member:

Hmmm, I think I'd rather keep the same names if possible. If I search in the project for "PreTrainedTokenizer" I'd like to see this entry.

if let tokenizerClass = TokenizerModel.knownTokenizers[tokenizerName] {
return try tokenizerClass.init(tokenizerConfig: tokenizerConfig, tokenizerData: tokenizerData, addedTokens: addedTokens)
} else {
// If the direct lookup fails, perform a case-insensitive scan over the keys
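GitHub truncates the hunk above; the fallback presumably boils down to something like the following (a sketch, not necessarily the exact code in the PR; the TokenizerError enum name is an assumption, though the unsupportedTokenizer case is quoted in a report further down this thread):

    let lowercased = tokenizerName.lowercased()
    if let match = TokenizerModel.knownTokenizers.first(where: { $0.key.lowercased() == lowercased }) {
        return try match.value.init(tokenizerConfig: tokenizerConfig, tokenizerData: tokenizerData, addedTokens: addedTokens)
    }
    throw TokenizerError.unsupportedTokenizer(tokenizerName)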
Member:

Nice! This is where we may want to drop the Tokenizer suffix, in my opinion.

Comment on lines +53 to +54
// Immediately mapping to `Float` values will result in exception,
// when precision loss is detected. So let's convert to `Double` first.
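For context on the two quoted lines: the failure described is presumably Swift's strict NSNumber bridging (SE-0170), where casting a JSON-decoded number straight to Float is rejected when the value cannot be represented exactly, while casting to Double and narrowing afterwards works. A small illustration (the literal value is made up):

    import Foundation

    let raw: Any = NSNumber(value: -13.5467138290405273)  // e.g. a Unigram score out of JSONSerialization
    let direct = raw as? Float                        // nil: lossy bridging is rejected
    let viaDouble = (raw as? Double).map(Float.init)  // Optional(-13.546714): narrow after the exact cast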
Member:

Awesome

Comment on lines +71 to +72
tokensToIds = Dictionary(uniqueKeysWithValues: vocab.map { $0.token }.enumerated().map { (LiteralString(value: $1), $0) })
bosTokenId = tokensToIds[LiteralString(value: bosToken!)] // May be nil
Member:

maybe worth considering a String extension that returns a LiteralString, for readability? (Just an idea)
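Something along these lines, presumably (a sketch of the suggestion; the property name is made up):

    extension String {
        var literal: LiteralString { LiteralString(value: self) }
    }

    // which would let the two lines above read:
    // tokensToIds = Dictionary(uniqueKeysWithValues: vocab.map { $0.token }.enumerated().map { ($1.literal, $0) })
    // bosTokenId = tokensToIds[bosToken!.literal]  // May be nil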

Contributor (Author):

Maybe 🤔

@ConfuseIous

Any chance of this being merged in soon? I'm trying to use a BERT model and this PR would be a huge help :)

@eemilk

eemilk commented Jul 17, 2024

I was trying out these changes since I have a DistilBERT model I wanted to test, but I'm getting unsupportedTokenizer("Distilbert") thrown on try await AutoTokenizer.from(pretrained: ...).
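(For completeness, the call in question is roughly the following; the checkpoint name is just an example of a DistilBERT model.)

    let tokenizer = try await AutoTokenizer.from(pretrained: "distilbert-base-uncased")
    // throws unsupportedTokenizer("Distilbert")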

Is there a chance support for DistilBERT is also added?

@pcuenca (Member)

pcuenca commented Dec 12, 2024

Superseded by #137

@pcuenca closed this Dec 12, 2024