30 Jan 19:43

098d90a

v2.3.1 - Patch for local models with Normalize modules

This release patches a niche bug when loading a Sentence Transformer model which:

is local
uses a Normalize module as specified in modules.json
does not contain the directory specified in the model configuration

This only occurs when a model with Normalize is downloaded from the Hugging Face hub and then later used locally.
See #2458 and #2459 for more details.

Release highlights

Don't require loading files for Normalize by @tomaarsen (#2460)

Full Changelog: v2.3.0...v2.3.1

Contributors

tomaarsen

29 Jan 08:32

tomaarsen

v2.3.0

1ec4902

v2.3.0 - Bug fixes, improved model loading & Cached MNRL

This release focuses on various bug fixes & improvements to keep up with adjacent works like transformers and huggingface_hub. These are the key changes in the release:

Pushing models to the Hugging Face Hub (#2376)

Prior to Sentence Transformers v2.3.0, saving models to the Hugging Face Hub may have resulted in various errors depending on the versions of the dependencies. Sentence Transformers v2.3.0 introduces a refactor to save_to_hub to resolve these issues.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
...
model.save_to_hub("tomaarsen/all-MiniLM-L6-v2-quora")

pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 90.9M/90.9M [00:06<00:00, 13.7MB/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.11s/it]

Model Loading

Efficient model loading (#2345)

Recently, transformers has shifted towards using safetensors files as its primary model file format. Additionally, various other file formats are commonly used, such as PyTorch (pytorch_model.bin), Rust (rust_model.ot), TensorFlow (tf_model.h5) and ONNX (model.onnx).

Prior to Sentence Transformers v2.3.0, almost all files of a repository would be downloaded, even if they were not strictly required. Since v2.3.0, only the strictly required files are downloaded. For example, when loading sentence-transformers/all-MiniLM-L6-v2, which has its model weights in three formats (pytorch_model.bin, rust_model.ot, tf_model.h5), only pytorch_model.bin will be downloaded. Additionally, when downloading intfloat/multilingual-e5-small with two formats (model.safetensors, pytorch_model.bin), only model.safetensors will be downloaded.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

Downloading modules.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 349/349 [00:00<?, ?B/s]
Downloading (…)ce_transformers.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<?, ?B/s]
Downloading README.md: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 10.6k/10.6k [00:00<?, ?B/s]
Downloading (…)nce_bert_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 53.0/53.0 [00:00<?, ?B/s]
Downloading config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 612/612 [00:00<?, ?B/s]
Downloading pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████| 90.9M/90.9M [00:06<00:00, 15.0MB/s]
Downloading tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 350/350 [00:00<?, ?B/s]
Downloading vocab.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 1.37MB/s]
Downloading tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 4.61MB/s]
Downloading (…)cial_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 112/112 [00:00<?, ?B/s]
Downloading 1_Pooling/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:00<?, ?B/s]
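
The selection behaviour described above can be sketched in plain Python. This is only an illustration: files_to_download, WEIGHT_PREFERENCE and OTHER_WEIGHT_FILES are hypothetical names, and the preference order (safetensors first, then the PyTorch file) is an assumption inferred from the two examples, not the library's actual implementation.

```python
# Hypothetical helper, NOT the library's real code.
# Assumption: safetensors is preferred over pytorch_model.bin, and the other
# weight formats listed in these notes are never needed for loading.
WEIGHT_PREFERENCE = ["model.safetensors", "pytorch_model.bin"]
OTHER_WEIGHT_FILES = {"rust_model.ot", "tf_model.h5", "model.onnx"}


def files_to_download(repo_files):
    """Keep every non-weight file plus the single most preferred weight file."""
    all_weights = set(WEIGHT_PREFERENCE) | OTHER_WEIGHT_FILES
    # First preferred weight format that the repository actually offers.
    chosen = next((name for name in WEIGHT_PREFERENCE if name in repo_files), None)
    return [name for name in repo_files if name not in all_weights or name == chosen]


minilm = ["config.json", "pytorch_model.bin", "rust_model.ot", "tf_model.h5"]
e5 = ["config.json", "model.safetensors", "pytorch_model.bin"]
print(files_to_download(minilm))  # ['config.json', 'pytorch_model.bin']
print(files_to_download(e5))      # ['config.json', 'model.safetensors']
```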

Note

This release updates the default cache location from ~/.cache/torch/sentence_transformers to the default cache location of transformers, i.e. ~/.cache/huggingface. You can still specify custom cache locations via the SENTENCE_TRANSFORMERS_HOME environment variable or the cache_folder argument.
Additionally, by supporting newer versions of various dependencies (e.g. huggingface_hub), the cache format changed. A consequence is that the old cached models cannot be used in v2.3.0 onwards, and those models need to be redownloaded. Once redownloaded, an airgapped machine can load the model like normal despite having no internet access.
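
As a rough sketch of how the cache location might be resolved: resolve_cache_folder is a hypothetical helper, and the precedence shown (cache_folder argument first, then SENTENCE_TRANSFORMERS_HOME, then the default) is an assumption rather than something these notes state.

```python
import os
from pathlib import Path


def resolve_cache_folder(cache_folder=None, environ=None):
    """Hypothetical helper: pick a cache directory.

    Assumed precedence: explicit argument, then the environment variable,
    then the default location named in the note above.
    """
    environ = os.environ if environ is None else environ
    if cache_folder is not None:
        return Path(cache_folder)
    env_value = environ.get("SENTENCE_TRANSFORMERS_HOME")
    if env_value:
        return Path(env_value)
    return Path.home() / ".cache" / "huggingface"


print(resolve_cache_folder("/data/models"))
print(resolve_cache_folder(environ={"SENTENCE_TRANSFORMERS_HOME": "/mnt/st_cache"}))
print(resolve_cache_folder(environ={}))
```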

Loading custom models (#2398)

This release brings models with custom code to Sentence Transformers through trust_remote_code, such as jinaai/jina-embeddings-v2-base-en.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])

print(cos_sim(embeddings[0], embeddings[1]))
# => tensor([[0.9341]])

Loading specific revisions (#2419)

If an embedding model is ever updated, it would invalidate all of the embeddings that you have created with the prior version of that model. We promise to never update the weights of any sentence-transformers/... model, but we cannot offer this guarantee for models by the community.

That is why this version introduces a revision keyword, allowing you to specify exactly which revision or branch you'd like to load:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", revision="982532469af0dff5df8e70b38075b0940e863662")
# or a branch:
model = SentenceTransformer("BAAI/bge-small-en-v1.5", revision="main")

Soft deprecation of `use_auth_token`, use `token` instead (#2376)

Following updates from transformers & huggingface_hub, Sentence Transformers now recommends that you use the token argument to provide your Hugging Face authentication token to download private models.

from sentence_transformers import SentenceTransformer

# new:
model = SentenceTransformer("tomaarsen/all-mpnet-base-v2", token="hf_...")
# old, still works, but throws a warning to upgrade to "token"
model = SentenceTransformer("tomaarsen/all-mpnet-base-v2", use_auth_token="hf_...")

Note

The recommended way to include your Hugging Face authentication token is to run huggingface-cli login & paste your User Access Token from your Hugging Face Settings. See these docs for more information. Then, you don't have to include the token argument at all; it'll be automatically read from your filesystem.

Device patch (#2351)

Prior to this release, SentenceTransformer.device would not always correspond to the device on which embeddings were computed, or on which a model was trained. This release brings a few fixes:

SentenceTransformer.device now always corresponds to the device that the model is on, and on which it will do its computations.
Models are now immediately moved to their specified device, rather than lazily whenever the model is being used.
SentenceTransformer.to(...), SentenceTransformer.cpu(), SentenceTransformer.cuda(), etc. will now work as expected, rather than being ignored.

Cached Multiple Negatives Ranking Loss (CMNRL) (#1759)

MultipleNegativesRankingLoss (MNRL) is a powerful loss function that is commonly applied to train embedding models. It uses in-batch negative sampling to produce a large number of negative pairs, giving the model a training signal to push the embeddings of these pairs apart. It is commonly shown that a larger batch size results in better performing models (Qu et al., 2021, Li et al., 2023), but a larger batch size requires more VRAM in practice.
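
To make the in-batch negatives idea concrete, here is a minimal pure-Python sketch of the loss on toy 2-dimensional vectors. It is not the library's implementation: mnrl_loss and cos_sim are illustrative names, and scaling the cosine similarities by 20 is an assumed value. For anchor i, positive i is treated as the correct class and every other positive in the batch acts as a negative.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positive i is the correct match for anchor i.

    scale=20.0 is an assumption made for this sketch.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cos_sim(anchor, p) for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # positive i is close to anchor i
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positive i is close to the other anchor
print(mnrl_loss(anchors, aligned) < mnrl_loss(anchors, swapped))  # True
```

With a batch of n pairs, every anchor is compared against n - 1 in-batch negatives, which is why larger batches give a stronger training signal.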

To counteract that, @kwang2049 has implemented a slightly modified GradCache technique that is able to separate the batch computation into mini-batches without any reduction in training quality. This allows the common practitioner to train with competitive batch sizes, e.g. 65536!
The downside is that training with Cached MNRL (CMNRL) is roughly 2 to 2.4 times slower than using normal MNRL.

CachedMultipleNegativesRankingLoss is a drop-in replacement for MultipleNegativesRankingLoss, but with a new mini_batch_size argument. I recommend trying out CMNRL with a large batch size and a fairly small mini_batch_size: the largest mini-batch size that will fit into memory.

from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer("distilbert-base-uncased")
train_examples = [
    InputExample(texts=['Anchor 1', 'Positive 1']),
    InputExample(texts=['Anchor 2', 'Positive 2']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1024)  # Here we can try much larger batch sizes!
train_loss = losses.CachedMultipleNegativesRankingLoss(model=model, mini_batch_size = 32)

model.fit([(t...

Contributors

mokha, lambdaofgod, and 16 other contributors

26 Jun 19:52

nreimers

v2.2.2

f38e91e

v2.2.2 - Bugfix huggingface_hub for Python 3.6

huggingface_hub dropped support in version 0.5.0 for Python 3.6

This release fixes the issue so that huggingface_hub with version 0.4.0 and Python 3.6 can still be used.

23 Jun 12:59

nreimers

v2.2.1

c9ce433

v2.2.1 - Update huggingface_hub & fixes

Version 0.8.1 of huggingface_hub introduces several changes that resulted in errors and warnings. This version of sentence-transformers fixes these issues.

Further, several improvements have been added / merged:

util.community_detection was improved: 1) it works in a batched mode to save memory, 2) overlapping clusters are no longer dropped; instead, their overlapping items are removed, 3) the parameter init_max_size was removed and replaced by a heuristic that estimates the maximum size of clusters
#1581 The training dataset names can be saved in the model card
#1426 Fix the text summarization example
#1487 Recursive sentence-transformers models are now possible
#1522 Private models can now be loaded
#1551 DataLoaders can now have workers
#1565 Models are only checked on the hub if they don't exist in the cache, which fixes errors caused by connectivity problems
#1591 Added an example of how to stream-encode larger datasets

10 Feb 13:12

nreimers

v2.2.0

f702594

v2.2.0 - T5 Encoder & Private models

T5

You can now use the encoder from T5 to learn text embeddings. You can use it like any other transformer model:

from sentence_transformers import SentenceTransformer, models
word_embedding_model = models.Transformer('t5-base', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

See T5-Benchmark results - the T5 encoder is not the best model for learning text embedding models. It requires quite a lot of training data and training steps. Other models perform much better, at least in the given experiment with 560k training triplets.

New Models

The models from the papers Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models and Large Dual Encoders Are Generalizable Retrievers have been added:

For benchmark results, see https://seb.sbert.net

Private Models

Thanks to #1406 you can now load private models from the hub:

model = SentenceTransformer("your-username/your-model", use_auth_token=True)

01 Oct 09:10

nreimers

v2.1.0

afee883

v2.1.0 - New Loss Functions

This is a smaller release with some new features

MarginMSELoss

MarginMSELoss is a great method to train embedding models with the help of a cross-encoder model. The details are explained here: MSMARCO - MarginMSE Training

You pass your training data in the format:

InputExample(texts=[query, positive, negative], label=cross_encoder.predict([query, positive]) - cross_encoder.predict([query, negative]))
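
For intuition, the objective can be sketched without any model: the student's score margin between positive and negative is pushed towards the cross-encoder's margin, using a mean squared error. The function name margin_mse and all numbers below are made up for illustration.

```python
def margin_mse(student_pos, student_neg, teacher_margins):
    """Mean squared error between student margins and teacher (cross-encoder) margins."""
    errors = [
        ((sp - sn) - tm) ** 2
        for sp, sn, tm in zip(student_pos, student_neg, teacher_margins)
    ]
    return sum(errors) / len(errors)


# teacher margin = cross_encoder(query, positive) - cross_encoder(query, negative)
teacher_margins = [4.0, 2.5]  # made-up values

# The student reproduces both margins exactly.
print(margin_mse([9.0, 7.5], [5.0, 5.0], teacher_margins))  # 0.0
# The student scores the first negative too high: margin 1.0 instead of 4.0.
print(margin_mse([9.0, 7.5], [8.0, 5.0], teacher_margins))  # 4.5
```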

MultipleNegativesSymmetricRankingLoss

MultipleNegativesRankingLoss computes the loss just in one way: Find the correct answer for a given question.

MultipleNegativesSymmetricRankingLoss also computes the loss in the other direction: Find the correct question for a given answer.
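
The two directions can be sketched on a toy score matrix, where scores[i][j] is the similarity between question i and answer j. This is an illustrative sketch with made-up numbers; cross_entropy_rows and symmetric_ranking_loss are hypothetical names, and averaging the two directions is the assumption made here.

```python
import math


def cross_entropy_rows(scores):
    """Mean cross-entropy over rows, treating the diagonal entry as the correct class."""
    total = 0.0
    for i, row in enumerate(scores):
        top = max(row)  # subtract the maximum for numerical stability
        total += top + math.log(sum(math.exp(s - top) for s in row)) - row[i]
    return total / len(scores)


def symmetric_ranking_loss(scores):
    """Average of the question->answer loss and the answer->question loss."""
    transposed = [list(col) for col in zip(*scores)]
    return (cross_entropy_rows(scores) + cross_entropy_rows(transposed)) / 2


# Rows are questions, columns are answers (made-up similarity scores).
scores = [[5.0, 4.0], [0.0, 3.0]]
forward = cross_entropy_rows(scores)  # find the answer for each question
backward = cross_entropy_rows([list(col) for col in zip(*scores)])  # find the question for each answer
print(forward, backward, symmetric_ranking_loss(scores))
```

In this toy matrix the backward direction is the harder one, so the symmetric loss picks up a signal that the one-directional loss would miss.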

Breaking Change: CLIPModel

The CLIPModel is now based on the transformers model.

You can still load it like this:

model = SentenceTransformer('clip-ViT-B-32')

Older SentenceTransformers versions are no longer able to load and use the 'clip-ViT-B-32' model.

Added files on the hub are automatically downloaded

PR #1116 checks if you have all files in your local cache or if there are added files on the hub. If this is the case, it will automatically download them.

`SentenceTransformer.encode()` can return all values

When you set output_value=None for the encode method, all values (token_ids, token_embeddings, sentence_embedding) will be returned.

24 Jun 16:16

nreimers

v2.0.0

3ddd7a7

v2.0.0 - Integration into Huggingface Model Hub

Models hosted on the hub

All pre-trained models are now hosted on the Huggingface Models hub.

Our pre-trained models can be found here: https://huggingface.co/sentence-transformers

You can also easily share your own sentence-transformer model on the hub and have other people easily access it. Simply upload the folder and have people load it via:

model = SentenceTransformer('[your_username]/[model_name]')

For more information, see: Sentence Transformers in the Hugging Face Hub

Breaking changes

There should be no breaking changes. Old models can still be loaded from disk. However, if you use one of the provided pre-trained models, it will be downloaded again in version 2 of sentence transformers as the cache path has slightly changed.

Find sentence-transformer models on the Hub

You can filter the hub for sentence-transformers models: https://huggingface.co/models?filter=sentence-transformers

Add the sentence-transformers tag to your model card so that others can find your model.

Widget & Inference API

A widget was added to sentence-transformers models on the hub that lets you interact with the model directly on its page:
https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2

Further, models can now be used with the Accelerated Inference API: Send your sentences to the API and get back the embeddings from the respective model.

Save Model to Hub

A new method was added to the SentenceTransformer class: save_to_hub.

Provide the model name and the model is saved on the hub.

Here you can find the explanation from transformers of how the hub works: Model sharing and uploading

Automatic Model Card

When you save a model with save or save_to_hub, a README.md (also known as model card) is automatically generated with basic information about the respective SentenceTransformer model.

New Models

Several new sentence embedding models have been added, which are much better than the previous models: Sentence Embedding Models
Some new models for semantic search based on MS MARCO have been added: MSMARCO Models
The training script for these MS MARCO models has been released as well: Train MS MARCO Bi-Encoder v3

24 Jun 14:20

nreimers

v1.2.1

8a59617

v1.2.1 - Forward compatibility with version 2

Final release of version 1: Makes v1 of sentence-transformers forward compatible with models from version 2 of sentence-transformers.

12 May 13:14

nreimers

v1.2.0

a208d64

v1.2.0 - Unsupervised Learning, New Training Examples, Improved Models

Unsupervised Sentence Embedding Learning

New methods have been integrated to train sentence embedding models without labeled data. See Unsupervised Learning for an overview of all existing methods.

New methods:

CT: Integration of Semantic Re-Tuning With Contrastive Tension (CT) to tune models without labeled data
CT_In-Batch_Negatives: A modification of CT using in-batch negatives
SimCSE: An unsupervised sentence embedding learning method by Gao et al.

Pre-Training Methods

MLM: An example script to run Masked-Language-Modeling (MLM). Running MLM on your custom data before supervised training can significantly improve performance. Further, MLM also works well for domain transfer: You first train on your custom data, and then train with e.g. NLI or STS data.

Training Examples

Paraphrase Data: In our paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation we have shown that training on paraphrase data is powerful. In that folder we provide collections of different paraphrase datasets and scripts to train on it.
NLI with MultipleNegativesRankingLoss: A dedicated example of how to use MultipleNegativesRankingLoss for training with NLI data, which leads to a significant performance boost.

New models

New NLI & STS models: Following the Paraphrase Data training example we published new models trained on NLI and NLI+STS data. Training code is available: training_nli_v2.py.

Model-Name	STSb-test performance
Previous best models
nli-bert-large	79.19
stsb-roberta-large	86.39
New v2 models
nli-mpnet-base-v2	86.53
stsb-mpnet-base-v2	88.57

New MS MARCO model for Semantic Search: Hofstätter et al. optimized the training procedure on the MS MARCO dataset. The resulting model is integrated as msmarco-distilbert-base-tas-b and improves the performance on the MS MARCO dataset from 33.13 to 34.43 MRR@10.

New Functions

SentenceTransformer.fit() Checkpoints: The fit() method now allows saving checkpoints during training at a fixed number of steps. More info
Pooling-mode as string: You can now pass the pooling-mode to models.Pooling() as string:
```
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
```
Valid values are mean/max/cls.
NoDuplicatesDataLoader: When using the MultipleNegativesRankingLoss, one should avoid having duplicate sentences in the same batch. This data loader simplifies this task and ensures that no duplicate entries are in the same batch.
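
As a rough illustration of what the three pooling modes mentioned above compute, here is a pure-Python sketch over a toy list of token vectors. The pool function is hypothetical and simplified: it ignores padding and attention masks, and it assumes the first token is the [CLS] token.

```python
def pool(token_embeddings, pooling_mode="mean"):
    """Collapse a list of per-token vectors into a single sentence vector."""
    if pooling_mode == "cls":
        # Assumption: the first token is the [CLS] token.
        return list(token_embeddings[0])
    columns = list(zip(*token_embeddings))  # one tuple per embedding dimension
    if pooling_mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if pooling_mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"Unknown pooling_mode: {pooling_mode!r}")


tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]  # three tokens, two dimensions
print(pool(tokens, "mean"))  # [2.0, 2.0]
print(pool(tokens, "max"))   # [3.0, 4.0]
print(pool(tokens, "cls"))   # [1.0, 4.0]
```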

21 Apr 13:12

nreimers

v1.1.0

abdfbf0

Unsupervised Sentence Embedding Learning

This release integrates methods that allow learning sentence embeddings without labeled data:

TSDAE: TSDAE uses a denoising auto-encoder to learn sentence embeddings. The method has been presented in our recent paper and achieves state-of-the-art performance for several tasks.
GenQ: GenQ uses a pre-trained T5 system to generate queries for a given passage. It was presented in our recent BEIR paper and works well for domain adaptation for semantic search (https://www.sbert.net/examples/applications/semantic-search/README.html).

New Models - SentenceTransformer

MSMARCO Dot-Product Models: We trained models using the dot-product instead of cosine similarity as similarity function. As shown in our recent BEIR paper, models with cosine-similarity prefer the retrieval of short documents, while models with dot-product prefer retrieval of longer documents. Now you can choose what is most suitable for your task.
MSMARCO MiniLM Models: We uploaded some models based on MiniLM. They use just 384 dimensions, are faster than previous models and achieve nearly the same performance.
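
The difference between the two similarity functions can be seen on toy vectors: cosine similarity ignores vector length, while the dot product grows with it. The vectors below are made up and only illustrate that mechanical difference; they are not taken from the BEIR experiments.

```python
import math


def dot(u, v):
    """Dot product of two plain Python vectors."""
    return sum(a * b for a, b in zip(u, v))


def cos_sim(u, v):
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


query = [1.0, 1.0]
doc_a = [1.0, 1.0]  # same direction as the query, small norm
doc_b = [3.0, 3.0]  # same direction as the query, larger norm

# The dot product ranks doc_b higher; cosine similarity treats both the same.
print(dot(query, doc_a), dot(query, doc_b))
print(math.isclose(cos_sim(query, doc_a), cos_sim(query, doc_b)))
```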

New Models - CrossEncoder

MSMARCO Re-ranking-Models v2: We trained new, significantly faster and significantly better CrossEncoder re-ranking models on the MSMARCO dataset. They outperform BERT-large models in terms of accuracy while being 18 times faster. Training code is available.

New Features

You can now pass a default_activation_function to the CrossEncoder class, which is applied on top of the output logits generated by the class.
You can now pre-process images for the CLIP Model. Soon I will release a tutorial on how to fine-tune the CLIP Model with your data.

Releases: UKPLab/sentence-transformers