19 Oct 14:23

nreimers

3d12b0c

v0.3.8 - CrossEncoder, Data Augmentation, new Models

Added support for training and using CrossEncoder
Data Augmentation method AugSBERT added
New models trained on large-scale paraphrase data: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1. They perform much better on an internal benchmark than previous models.
New model for Information Retrieval trained on MS Marco: distilroberta-base-msmarco-v1
Improved the MultipleNegativesRankingLoss loss function: the similarity function can now be changed and defaults to cosine similarity (it was dot product before). In addition, similarity scores can be multiplied by a scaling factor. This allows the loss to be used as NTXentLoss / InfoNCE loss.
New MegaBatchMarginLoss, inspired by the ParaNMT paper.
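
As a rough illustration of the MultipleNegativesRankingLoss change, the sketch below computes the loss in plain Python: cosine similarity between each anchor and every candidate in the batch, multiplied by a scaling factor, followed by cross-entropy with the matching pair as the target. This is not the library's implementation. The function names and the scale value of 20.0 are assumptions chosen for illustration.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and every other positives[j] acts as an in-batch negative.
    The default scale is an illustrative assumption."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.2, 0.8]]

loss_aligned = mnr_loss(anchors, positives)        # pairs line up
loss_swapped = mnr_loss(anchors, positives[::-1])  # pairs deliberately mismatched
```

With correctly aligned pairs the loss is close to zero, while swapping the positives makes it large. A larger scale sharpens the softmax over candidates, which is the role a temperature plays in InfoNCE-style losses.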

Smaller changes:

Updated InformationRetrievalEvaluator so that it works with large corpora (millions of entries). Removed the query_chunk_size parameter from the evaluator
The SentenceTransformer.encode method now detaches tensors from the compute graph
SentenceTransformer.fit() method - The parameter output_path_ignore_not_empty is deprecated. The method no longer checks that the target folder is empty

29 Sep 20:17

nreimers

v0.3.7

a37ba6a

v0.3.7 - Upgrade transformers, Model Distillation Example, Multi-Input to Transformers Model

Upgraded the transformers dependency: transformers 3.1.0, 3.2.0 and 3.3.1 are working
Added example code for model distillation: Sentence Embeddings models can be drastically reduced to e.g. only 2-4 layers while keeping 98+% of their performance. Code can be found in examples/training/distillation
Transformer models can now accept two inputs, e.g. ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT.

Minor changes:

Tokenization in the multi-processes encoding setup now happens in the child processes, not in the parent process.
Added models.Normalize() to allow the normalization of embeddings to unit length

11 Sep 08:06

nreimers

v0.3.6

18c057c

v0.3.6 - Update transformers to v3.1.0

Huggingface Transformers version 3.1.0 introduced a breaking change relative to the previous version 3.0.2.

This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.

01 Sep 13:09

nreimers

v0.3.5

073dd37

v0.3.5 - Automatic Mixed Precision & Bugfixes

The old FP16 training code in model.fit() was replaced by PyTorch 1.6.0 automatic mixed precision (AMP). Setting model.fit(use_amp=True) enables AMP. On suitable GPUs, this leads to a significant speed-up while requiring less memory.
Performance improvements in paraphrase mining & semantic search by replacing np.argpartition with torch.topk
If a sentence-transformers model is not found, the library falls back to the huggingface transformers repository and creates the model with mean pooling.
Pinned huggingface transformers to version 3.0.2. The next release will be compatible with huggingface transformers 3.1.0
Several bugfixes: downloading of files, multi-GPU encoding
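
The top-k selection behind semantic search can be sketched without any dependencies. The snippet below uses heapq.nlargest from the Python standard library as a stand-in; it is neither np.argpartition nor torch.topk, and the top_k helper name is hypothetical. Like torch.topk with its default settings, it returns the hits already sorted from best to worst, whereas np.argpartition returns the top entries in arbitrary order and needs a separate sort.

```python
import heapq


def top_k(scores, k):
    """Return the k highest (score, index) pairs, best first."""
    return heapq.nlargest(k, ((score, idx) for idx, score in enumerate(scores)))


# Similarity of one query against five corpus entries (made-up values).
scores = [0.12, 0.87, 0.45, 0.91, 0.30]
hits = top_k(scores, k=2)
```

Each hit pairs a similarity score with the position of the corpus entry it belongs to, so the matching sentences can be looked up by index.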

24 Aug 16:24

nreimers

v0.3.4

e6759fa

v0.3.4 - Improved Documentation, Improved Tokenization Speed, Multi-GPU encoding

The documentation is substantially improved and can be found at: www.SBERT.net - Feedback welcome
The dataset that holds training InputExamples (dataset.SentencesDataset) now uses lazy tokenization, i.e., examples are tokenized only when they are needed for a batch. If you set num_workers to a positive integer in your DataLoader, tokenization will happen in a background worker. This substantially reduces the start-up time for training.
model.encode() also uses a PyTorch Dataset + DataLoader. If you set num_workers to a positive integer, tokenization will happen in the background, leading to faster encoding for large corpora.
Added functions and an example for multi-GPU encoding - This method can be used to encode a corpus with multiple GPUs in parallel. No multi-GPU support for training yet.
Removed parallel_tokenization parameters from encode & SentencesDatasets - No longer needed with lazy tokenization and DataLoader worker threads.
Smaller bugfixes
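
The idea of lazy tokenization can be sketched with a small map-style dataset that tokenizes inside __getitem__ rather than in the constructor. This is an illustration only, not the code of SentencesDataset: the class name, the whitespace tokenizer and the calls counter are all hypothetical.

```python
class LazyTokenizedDataset:
    """Stores raw sentences and tokenizes each one only when it is requested."""

    def __init__(self, sentences, tokenize):
        self.sentences = sentences
        self.tokenize = tokenize
        self.calls = 0  # counts tokenizer invocations, for illustration only

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        # Tokenization happens here, at access time, not at construction time.
        self.calls += 1
        return self.tokenize(self.sentences[idx])


def whitespace_tokenize(text):
    """Toy tokenizer standing in for a real subword tokenizer."""
    return text.lower().split()


dataset = LazyTokenizedDataset(
    ["Hello world", "Lazy tokenization defers work"], whitespace_tokenize
)
calls_after_init = dataset.calls  # nothing has been tokenized yet
first = dataset[0]                # tokenizes only the first sentence
```

Because construction does no tokenization, start-up cost stays flat regardless of corpus size. An object with __len__ and __getitem__ follows the map-style dataset protocol that a PyTorch DataLoader consumes, so with num_workers set to a positive integer the __getitem__ calls would run in the DataLoader's workers.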

Breaking changes:

Renamed evaluation.BinaryEmbeddingSimilarityEvaluator to evaluation.BinaryClassificationEvaluator

06 Aug 08:16

nreimers

v0.3.3

f4377b2

v0.3.3 - Multi-Process Tokenization and Information Retrieval Improvements

New Functions

Multi-process tokenization (Linux only) for the model encode function. Significant speed-up when encoding large sets
Tokenization of datasets for training can now run in parallel (Linux Only)
New example for Quora Duplicate Questions Retrieval: See examples-folder
Many small improvements for training better models for Information Retrieval
Fixed LabelSampler (can be used to get batches with certain number of matching labels. Used for BatchHardTripletLoss). Moved it to DatasetFolder
Added new Evaluators for ParaphraseMining and InformationRetrieval
evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measures the accuracy
model.encode - When the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
New function: util.paraphrase_mining to perform paraphrase mining in a corpus. For an example see examples/training_quora_duplicate_questions/
New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/
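
To make the threshold search for binary evaluation concrete, here is a minimal sketch that tries every midpoint between neighbouring similarity scores and keeps the one with the highest accuracy. It is an assumption about how such a search can be done, not the evaluator's actual code; the function name and the sample data are made up.

```python
def best_accuracy_threshold(scores, labels):
    """Return (accuracy, threshold); a pair counts as positive if score > threshold.

    Assumes at least two distinct scores. Labels are 1 (match) or 0 (no match).
    """
    ordered = sorted(set(scores))
    # Candidate thresholds: midpoints between neighbouring distinct scores.
    candidates = [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]
    best_acc, best_thr = -1.0, None
    for thr in candidates:
        correct = sum((s > thr) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:  # strict comparison keeps the lowest best threshold
            best_acc, best_thr = acc, thr
    return best_acc, best_thr


scores = [0.9, 0.8, 0.75, 0.4, 0.3, 0.2]  # cosine similarities of sentence pairs
labels = [1, 1, 0, 1, 0, 0]               # gold labels for the same pairs
accuracy, threshold = best_accuracy_threshold(scores, labels)
```

Searching for the threshold on the data itself avoids fixing the cut at the median score, which would only be appropriate if positives and negatives were equally frequent.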

Breaking Changes

The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentences and scores are passed directly. Old code that uses the previous evaluators needs to be changed; it can use the class method from_input_examples(). See examples/training_transformers/training_nli.py for how to use the new evaluators.

23 Jul 15:03

nreimers

v0.3.2

ec5d73b

v0.3.2 - Lazy tokenization for Parallel Sentence Training & Improved Semantic Search

This is a minor release. There should be no breaking changes.

ParallelSentencesDataset: Datasets are tokenized on-the-fly, saving some start-up time
util.pytorch_cos_sim: New method to compute cosine similarity with PyTorch. About 100 times faster than scipy cdist. The semantic_search.py example has been updated accordingly.
SentenceTransformer.encode: New parameter convert_to_tensor. If set to True, encode returns one large PyTorch tensor with your embeddings
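
For readers who want to see what a pairwise cosine similarity computation amounts to, the sketch below normalizes each row to unit length and then takes dot products between all row pairs. It uses plain Python lists rather than PyTorch tensors, so it shows the arithmetic only and none of the speed. The helper names are hypothetical.

```python
import math


def _unit_rows(matrix):
    """Scale every row to unit length (all-zero rows are left unchanged)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        out.append([x / norm for x in row])
    return out


def cos_sim_matrix(a, b):
    """Return a len(a) x len(b) matrix with cos_sim(a[i], b[j]) at [i][j]."""
    a_unit, b_unit = _unit_rows(a), _unit_rows(b)
    return [
        [sum(x * y for x, y in zip(row_a, row_b)) for row_b in b_unit]
        for row_a in a_unit
    ]


queries = [[1.0, 0.0], [1.0, 1.0]]
corpus = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
sims = cos_sim_matrix(queries, corpus)
```

On tensors, the same two steps become a row-wise normalization followed by a single matrix multiplication, which is where a large speed-up over a per-pair loop would come from.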

22 Jul 13:54

nreimers

v0.3.1

631a687

v0.3.1 - Updates on Multilingual Training

This is a minor update that changes some classes for training & evaluating multilingual sentence embedding methods.

The examples for training multi-lingual sentence embeddings models have been significantly extended. See docs/training/multilingual-models.md for details. An automatic script that downloads suitable data and extends sentence embeddings to multiple languages has been added.

The following classes/files have been changed:

datasets/ParallelSentencesDataset.py: The dataset with parallel sentences is encoded on-the-fly, reducing the start-up time for extending a sentence embedding model to new languages. An embedding cache can be configured to store previously computed sentence embeddings during training.

New evaluation files:

evaluation/MSEEvaluator.py - breaking change. Now, this class expects lists of strings with parallel (translated) sentences. The old class has been renamed to MSEEvaluatorFromDataLoader.py
evaluation/EmbeddingSimilarityEvaluatorFromList.py - Semantic Textual Similarity data can be passed as lists of strings & scores
evaluation/MSEEvaluatorFromDataFrame.py - MSE Evaluation of teacher and student embeddings based on data in a data frame
evaluation/MSEEvaluatorFromDataLoader.py - MSE Evaluation if data is passed as a data loader

Bugfixes:

model.encode() failed to sort sentences by length. This function has been fixed to boost encoding speed by reducing overhead of padding tokens.
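
Why sorting by length helps can be shown with a small sketch: sentences of similar length end up in the same batch, so each batch is padded only to its own longest member, and the results are written back in the original input order. This is not the code of model.encode(). The function names, the descending sort order and the fake encoder are assumptions for illustration.

```python
def encode_sorted_by_length(sentences, encode_batch, batch_size=2):
    """Encode in length-sorted batches and return results in input order."""
    # Indices of the sentences, longest first.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True)
    results = [None] * len(sentences)
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        outputs = encode_batch([sentences[i] for i in batch_idx])
        # Undo the sorting: place each output at its original position.
        for i, out in zip(batch_idx, outputs):
            results[i] = out
    return results


def fake_encode_batch(batch):
    """Stand-in for a model: reports the length each sentence would be padded to."""
    padded_len = max(len(s.split()) for s in batch)
    return [(s, padded_len) for s in batch]


sentences = ["a b c d e f", "a", "a b c d e", "a b"]
encoded = encode_sorted_by_length(sentences, fake_encode_batch)
```

In this example the two long sentences share a batch padded to six tokens and the two short ones share a batch padded to two. Batching in input order would instead pad the one-token sentence to six.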

09 Jul 15:11

nreimers

v0.3.0

7222990

v0.3.0 - Transformers Updated to Version 3

This release updates HuggingFace transformers to v3.0.2. Transformers made some breaking changes to the tokenization API. This (and future) versions will not be compatible with HuggingFace transformers v2.

There are no known breaking changes for existing models or existing code. Models trained with version 2 can be loaded without issues.

New Loss Functions

Thanks to PR #299 and #176, several new loss functions were added: different triplet loss functions and ContrastiveLoss

16 Apr 14:12

nreimers

v0.2.6

eb39d01

v0.2.6 - Transformers Update - AutoModel - WKPooling

This release updates huggingface/transformers to the release v2.8.0.

New Features

models.Transformer: The Transformer model can now load any huggingface transformers model, like BERT, RoBERTa, XLNet, XLM-R, ELECTRA... It is based on the AutoModel from HuggingFace. You no longer need the architecture-specific models (like models.BERT, models.RoBERTa). It also works with the community models.
Multilingual Training: Code is released for making mono-lingual sentence embeddings models multi-lingual. See training_multilingual.py for an example. More documentation and details will follow soon.
WKPooling: Adding a PyTorch implementation of SBERT-WK. Note, due to an inefficient implementation of QR decomposition in PyTorch, WKPooling can only be run on the CPU, which makes it about 40 times slower than mean pooling. For some models WKPooling improves the performance, for others it does not.
WeightedLayerPooling: A new pooling layer that uses representations from all transformer layers and learns a weighted sum of them. So far no improvement compared to only averaging the last layer.
New pre-trained models released. Every available model is documented in a Google Spreadsheet for an easier overview.

Minor changes

Clean-up of the examples folder.
Model and tokenizer arguments can now be passed to the according transformers models.
Previous versions had issues with RoBERTa and XLM-RoBERTa: the wrong special characters were added. This is fixed now, and the library relies on huggingface transformers for the correct addition of special characters to the input sentences.

Breaking changes

STSDataReader: The default parameter values have been changed, so that it expects the sentences in the first two columns and the score in the third column. If you want to load the STS benchmark dataset, you can use the STSBenchmarkDataReader.

Releases: UKPLab/sentence-transformers
