Update dependency sentence-transformers to v2.7.0 #129
This PR contains the following updates: `sentence-transformers` `==2.2.2` -> `==2.7.0`
Release Notes
UKPLab/sentence-transformers (sentence-transformers)
v2.7.0: CachedGISTEmbedLoss, easy Matryoshka inference & evaluation, CrossEncoder, Intel Gaudi2
This release introduces a promising new loss function, easier inference for Matryoshka models, new functionality for CrossEncoders, and inference on Intel Gaudi2, alongside much more.
Install this version with `pip install sentence-transformers==2.7.0`
New loss function: CachedGISTEmbedLoss (#2592)
For a number of years, `MultipleNegativesRankingLoss` (also known as SimCSE loss, InfoNCE, or in-batch negatives loss) has been the state of the art in embedding model training. Notably, this loss function performs better with a larger batch size. Recently, various improvements have been introduced:

- `CachedMultipleNegativesRankingLoss` allows you to pick much higher batch sizes (e.g. 65536) with constant memory.
- `GISTEmbedLoss` takes a guide model to guide the in-batch negative sample selection. This prevents false negatives, resulting in a stronger training signal.

Now, @JacksonCakes has combined these two approaches to produce the best of both worlds: `CachedGISTEmbedLoss`. This loss function allows for high batch sizes with constant memory usage, while also using a guide model to assist with the in-batch negative sample selection.

As can be seen in our Loss Overview, this loss should be used with `(anchor, positive)` pairs or `(anchor, positive, negative)` triplets, much like `MultipleNegativesRankingLoss`, `CachedMultipleNegativesRankingLoss`, and `GISTEmbedLoss`. In short, any example using those loss functions can be updated to use `CachedGISTEmbedLoss`! Feel free to experiment, e.g. with this training script.

Automatic Matryoshka model truncation (#2573)
Sentence Transformers v2.4.0 introduced Matryoshka models: models whose embeddings are still useful after truncation. Since then, many useful Matryoshka models have been trained.
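Conceptually, using a Matryoshka model at a smaller dimension just means slicing off the leading dimensions of each embedding and rescaling to unit length. A minimal NumPy sketch of that idea (illustrative only, not the library's implementation):

```python
import numpy as np

def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` dimensions of each embedding and rescale
    each row to unit length, which is what Matryoshka-style truncation
    amounts to. Illustrative sketch, not the library's implementation."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

rng = np.random.default_rng(42)
full = rng.normal(size=(2, 768)).astype(np.float32)  # pretend model output
small = truncate_and_renormalize(full, 64)
print(small.shape)  # (2, 64)
```

The renormalization step matters: cosine similarity assumes unit-length vectors, and truncation changes the norm.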
As of this release, truncation for these Matryoshka embedding models can be done automatically via a new `truncate_dim` constructor argument:

Extra information:
Model truncation in all evaluators (#2582)
Alongside easier inference with Matryoshka models, evaluating them is now also much easier. You can also pass `truncate_dim` to any Evaluator. This way you can easily check the performance of any Sentence Transformer model at various truncated dimensions (even if the model was not trained with `MatryoshkaLoss`!)

Here are some example training scripts that use this new `truncate_dim` option to assist with training Matryoshka models:

CrossEncoder improvements
This release improves the support for CrossEncoder reranker models.
push_to_hub (#2524)
You can now push trained CrossEncoder models to the 🤗 Hugging Face Hub with `CrossEncoder.push_to_hub`!

trust_remote_code for custom models (#2595)
You can now load custom CrossEncoder models from the Hugging Face Hub, i.e. models whose custom modelling code requires `trust_remote_code` to load.
Inference on Intel Gaudi2 (#2557)
From this release onwards, you can perform inference on Intel Gaudi2 accelerators. No modifications are needed, as the library automatically detects the `hpu` device and configures the model accordingly. Thanks to Intel Habana for the support here.

All changes
- [docs] Add simple Makefile for building docs by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2566
- [examples] Add Matryoshka evaluation plot by @kddubey in https://github.com/UKPLab/sentence-transformers/pull/2564
- Add push_to_hub to CrossEncoder by @imvladikon in https://github.com/UKPLab/sentence-transformers/pull/2524
- [requirements] Set minimum transformers version to 4.34.0 for is_nltk_available by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2574
- [docs] Update link: retrieve_rerank_simple_wikipedia.py -> .ipynb by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2580
- [feat] Add truncation support by @kddubey in https://github.com/UKPLab/sentence-transformers/pull/2573
- [examples] Add model upload for training_nli_v3 with GISTEmbedLoss by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2584
- [fix] Matryoshka training always patch original forward, and check matryoshka_dims by @kddubey in https://github.com/UKPLab/sentence-transformers/pull/2593
- [docs] Fix search bar on sbert.net by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2597
- [clip] Prevent warning with padding when tokenizing for CLIP by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2599

New Contributors
I especially want to thank @JacksonCakes for their excellent `CachedGISTEmbedLoss` PR and @kddubey for their wonderful PRs surrounding Matryoshka models and general repository housekeeping.

Full Changelog: UKPLab/sentence-transformers@v2.6.1...v2.7.0
v2.6.1: Fix Quantized Semantic Search rescoring
This is a patch release to fix a bug in `semantic_search_faiss` and `semantic_search_usearch` that caused the scores to not correspond to the returned corpus indices. Additionally, you can now evaluate embedding models after quantizing their embeddings.

Precision support in EmbeddingSimilarityEvaluator
You can now pass `precision` to the `EmbeddingSimilarityEvaluator` to evaluate the performance after quantization:

All changes
Full Changelog: UKPLab/sentence-transformers@v2.6.0...v2.6.1
v2.6.0: Embedding Quantization, GISTEmbedLoss
This release brings embedding quantization, a way to heavily speed up retrieval & other tasks, and a new powerful loss function: GISTEmbedLoss.
Install this version with `pip install sentence-transformers==2.6.0`
Embedding Quantization
Embeddings may be challenging to scale up, which leads to expensive solutions and high latencies. However, there is a new approach to counter this problem: reducing the size of each individual value in the embedding, i.e. quantization. Experiments on quantization have shown that we can maintain a large amount of performance while significantly speeding up computation and saving on memory, storage, and costs.
Specifically, binary quantization may retain 96% of the retrieval performance while speeding up retrieval by 25x and reducing memory & disk space by 32x. Do not underestimate this approach! Read more about embedding quantization in our extensive blogpost.
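To make the 32x storage figure concrete, here is a hedged NumPy sketch of binary quantization: each float32 value is thresholded at zero and the resulting bits are packed into bytes. This is only an illustration of the idea, not the library's own quantization utilities:

```python
import numpy as np

def binary_quantize(embeddings: np.ndarray) -> np.ndarray:
    """Map each float32 dimension to one bit (positive -> 1, else 0) and
    pack 8 bits per byte: 4 bytes per value become 1/8 byte, i.e. 32x less."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 1024)).astype(np.float32)  # 4 embeddings, 1024-dim
packed = binary_quantize(emb)
print(emb.nbytes, "->", packed.nbytes)  # 16384 -> 512 bytes
```

Retrieval over such packed embeddings is then typically done with fast Hamming-distance comparisons rather than cosine similarity.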
Binary and Scalar Quantization
Two forms of quantization exist at this time: binary and scalar (int8). These quantize embedding values from `float32` into `binary` and `int8`, respectively. For binary quantization, you can use the following snippet:

References:
GISTEmbedLoss
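The guiding mechanism described in this section can be sketched in a few lines of NumPy: a candidate in-batch negative is discarded whenever the guide model finds it at least as similar to the anchor as the anchor's own positive. This is a hypothetical simplification, not GISTEmbedLoss's actual code:

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def guided_in_batch_negatives_mask(guide_anchors, guide_positives):
    """Boolean mask over (i, j): True where positive_j may serve as an
    in-batch negative for anchor_i. A candidate is dropped whenever the
    guide model scores it at least as similar to anchor_i as anchor_i's
    own positive. Hypothetical sketch, not GISTEmbedLoss's actual code."""
    sims = normalize(guide_anchors) @ normalize(guide_positives).T
    own_positive = np.diag(sims)[:, None]
    mask = sims < own_positive       # keep only clearly-dissimilar candidates
    np.fill_diagonal(mask, False)    # (i, i) is the positive pair itself
    return mask
```

The surviving `True` entries are the pairs that the in-batch negatives loss is allowed to push apart.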
GISTEmbedLoss, as introduced in Solatorio (2024), is a guided variant of the more standard in-batch negatives (`MultipleNegativesRankingLoss`) loss. Both loss functions are provided with a list of (anchor, positive) pairs, but while `MultipleNegativesRankingLoss` uses `anchor_i` and `positive_i` as a positive pair and all `positive_j` with `i != j` as negative pairs, `GISTEmbedLoss` uses a second model to guide the in-batch negative sample selection.

This can be very useful, because it is plausible that `anchor_i` and `positive_j` are actually quite semantically similar. In this case, `GISTEmbedLoss` would not consider them a negative pair, while `MultipleNegativesRankingLoss` would. When finetuning MPNet-base on the AllNLI dataset, these are the Spearman correlations based on cosine similarity using the STS Benchmark dev set (higher is better):

The blue line is `MultipleNegativesRankingLoss`, whereas the grey line is `GISTEmbedLoss` with the small `all-MiniLM-L6-v2` as the guide model. Note that `all-MiniLM-L6-v2` by itself does not reach 88 Spearman correlation on this dataset, so this is really the effect of two models (`mpnet-base` and `all-MiniLM-L6-v2`) reaching a performance that they could not reach separately.

Soft save_to_hub Deprecation
Most codebases that allow for pushing models to the Hugging Face Hub adopt a `push_to_hub` method instead of a `save_to_hub` method, and Sentence Transformers now follows that convention. The `push_to_hub` method is now the recommended approach, although `save_to_hub` will continue to exist for the time being: it simply calls `push_to_hub` internally.

All changes
- [feat] Add 'get_config_dict' method to GISTEmbedLoss for better model cards by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2543
- [deprecation] Deprecate save_to_hub in favor of push_to_hub; add safe_serialization support to push_to_hub by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2544
- [docs] Update return docstring of encode_multi_process by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2548
- [feat] Add binary & scalar embedding quantization support to Sentence Transformers by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2549

New Contributors
Full Changelog: UKPLab/sentence-transformers@v2.5.1...v2.6.0
v2.5.1: Fix CrossEncoder.rank bug with default top_k
This is a patch release to fix a bug in `CrossEncoder.rank` that caused the last value to be discarded when using the default `top_k=-1`.

`CrossEncoder.rank` patch: previously, the lowest-scoring document would be removed from the output.
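The bug boils down to Python slicing semantics: slicing with `top_k=-1` drops the final element instead of keeping everything. A hypothetical simplification of the issue (not the actual library code):

```python
def rank_buggy(scores, top_k=-1):
    """Sort document indices by score, then take the top_k best.
    With the default top_k=-1, ranking[:-1] silently drops the last item."""
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranking[:top_k]

def rank_fixed(scores, top_k=-1):
    """Treat top_k=-1 as 'return everything' instead of slicing with it."""
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranking if top_k == -1 else ranking[:top_k]

scores = [0.2, 0.9, 0.5]
print(rank_buggy(scores))  # [1, 2] -- document 0 is lost
print(rank_fixed(scores))  # [1, 2, 0]
```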
All changes
- [examples] Update model repo_id in 2dMatryoshka example by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2515
- [feat] Add get_config_dict to new Matryoshka2dLoss & AdaptiveLayerLoss by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2516
- [chore] Update to ruff 0.3.0; update ruff.toml by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2517
- [example] Don't always normalize the embeddings in clustering example by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2520
- Fix CrossEncoder.rank default top_k by @xenova in https://github.com/UKPLab/sentence-transformers/pull/2518

New Contributors
Full Changelog: UKPLab/sentence-transformers@v2.5.0...v2.5.1
v2.5.0: 2D Matryoshka & Adaptive Layer models, CrossEncoder (re)ranking
This release brings two new loss functions, a new way to (re)rank with CrossEncoder models, and more fixes.
Install this version with `pip install sentence-transformers==2.5.0`
2D Matryoshka & Adaptive Layer models (#2506)
Embedding models are often encoder models with numerous layers, such as 12 (e.g. all-mpnet-base-v2) or 6 (e.g. all-MiniLM-L6-v2). To get embeddings, every single one of these layers must be traversed. 2D Matryoshka Sentence Embeddings (2DMSE) revisits this concept by proposing an approach to train embedding models that will perform well when only using a selection of all layers. This results in faster inference speeds at relatively low performance costs.
For example, using Sentence Transformers, you can train an Adaptive Layer model that can be sped up by 2x at a 15% reduction in performance, or 5x on GPU & 10x on CPU for a 20% reduction in performance. The 2DMSE paper highlights scenarios where this is superior to using a smaller model.
Training
Training with Adaptive Layer support is quite elementary: rather than applying some loss function only on the last layer, we also apply that same loss function on the pooled embeddings from previous layers. Additionally, we employ a KL-divergence loss that aims to make the embeddings of the non-last layers match those of the last layer. This can be seen as a fascinating approach to knowledge distillation, but with the last layer as the teacher model and the prior layers as the student models.
For example, with the 12-layer microsoft/mpnet-base, it will now be trained such that the model produces meaningful embeddings after each of the 12 layers.
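The recipe above can be sketched as a NumPy toy: the same base loss at every layer, plus a KL term pulling each layer's in-batch similarity distribution toward the last layer's. Names and structure are illustrative, not the `AdaptiveLayerLoss` implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q), row-wise, between similarity distributions
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def adaptive_layer_loss(layer_embeddings, base_loss):
    """layer_embeddings: one (batch, dim) pooled output per layer.
    The base loss is applied at every layer, and a KL term pulls each
    layer's in-batch similarity distribution toward the last layer's
    (the 'teacher' layer)."""
    teacher = softmax(layer_embeddings[-1] @ layer_embeddings[-1].T)
    total = 0.0
    for emb in layer_embeddings:
        student = softmax(emb @ emb.T)
        total += base_loss(emb) + np.mean(kl_div(teacher, student))
    return total
```

When every layer already matches the last one, the KL terms vanish and only the per-layer base losses remain.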
AdaptiveLayerLoss

Additionally, this can be combined with `MatryoshkaLoss` such that the resulting model can be reduced both in the number of layers and in the size of the output dimensions. See also the Matryoshka Embeddings documentation for more information on reducing output dimensions. In Sentence Transformers, the combination of these two losses is called `Matryoshka2dLoss`, and a shorthand is provided for simpler training.

Matryoshka2dLoss
Results
Let's look at the performance that we may be able to expect from an Adaptive Layer embedding model versus a regular embedding model. For this experiment, I have trained two models: one with `AdaptiveLayerLoss` on top of `MultipleNegativesRankingLoss`, and one with just `MultipleNegativesRankingLoss`. I also use microsoft/mpnet-base as the base model. Both of these models were trained on the AllNLI dataset, which is a concatenation of the SNLI and MultiNLI datasets. I have evaluated these models on the STSBenchmark test set using multiple different embedding dimensions. The results are plotted in the following figure:
The first figure shows that the Adaptive Layer model stays much more performant when reducing the number of layers in the model. This is also clearly shown in the second figure, which displays that 80% of the performance is preserved when the number of layers is reduced all the way to 1.
Lastly, the third figure shows the expected speedup ratio for GPU & CPU devices in my tests. As you can see, removing half of the layers results in roughly a 2x speedup, at a cost of ~15% performance on STSB (~86 -> ~75 Spearman correlation). When removing even more layers, the performance benefit gets larger for CPUs, and between 5x and 10x speedups are very feasible with a 20% loss in performance.
Inference
After a model has been trained using the Adaptive Layer loss, you can truncate the model layers to your desired layer count. Note that this requires doing a bit of surgery on the model itself, and each model is structured differently, so the steps differ slightly depending on the model.
First of all, we will load the model & access the underlying `transformers` model like so:

This output will differ depending on the model. We will look for the repeated layers in the encoder. For this MPNet model, the layers are stored under `model[0].auto_model.encoder.layer`. We can then slice the model to keep only the first few layers, speeding up the model. Then we can run inference with it using `SentenceTransformer.encode`.

As you can see, the similarity between the related sentences is much higher than with the unrelated sentence, despite only using 3 layers. Feel free to copy this script locally, modify `new_num_layers`, and observe the difference in similarities.

Extra information:
Example training scripts:
CrossEncoder (re)rank (#2514)
CrossEncoder models are often even better than biencoder (`SentenceTransformer`) models, as a CrossEncoder can compare two texts using the attention mechanism, unlike biencoders. However, they are also more computationally expensive. They are commonly used for reranking the top retrieval results of a biencoder model. As of this release, that is now more convenient!

We now support a `rank` method, which allows you to rank a bunch of documents given a query:

Extra information: `CrossEncoder.rank`

All changes
- Semantic Textual Similarity example by @alvarobartt in https://github.com/UKPLab/sentence-transformers/pull/2511
- [loss] Add AdaptiveLayerLoss; 2d Matryoshka loss modifiers by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/2506

New Contributors
I especially want to thank @SeanLee97 and @fkdosilovic for their valuable contributions in this release.
Full Changelog: UKPLab/sentence-transformers@v2.4.0...v2.5.0
v2.4.0: Matryoshka models, SOTA loss functions, prompt templates, INSTRUCTOR support
This release introduces numerous notable features that are well worth learning about!
Install this version with `pip install sentence-transformers==2.4.0`
MatryoshkaLoss (#2485)
Dense embedding models typically produce embeddings with a fixed size, such as 768 or 1024. All further computations (clustering, classification, semantic search, retrieval, reranking, etc.) must then be done on these full embeddings. Matryoshka Representation Learning revisits this idea, and proposes a solution to train embedding models whose embeddings are still useful after truncation to much smaller sizes. This allows for considerably faster (bulk) processing.
Training
Training using Matryoshka Representation Learning (MRL) is quite elementary: rather than applying some loss function on only the full-size embeddings, we also apply that same loss function on truncated portions of the embeddings. For example, if a model has an embedding dimension of 768 by default, it can now be trained on 768, 512, 256, 128, 64 and 32. Each of these losses will be added together, optionally with some weight:
MatryoshkaLoss
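The summed-truncated-loss recipe above can be sketched like this (a NumPy toy with a placeholder pairwise loss; the library's `MatryoshkaLoss` wraps a real training loss instead):

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def toy_loss(a, b):
    # placeholder base loss: 1 minus the mean cosine similarity of pairs
    return 1.0 - np.mean(np.sum(normalize(a) * normalize(b), axis=-1))

def matryoshka_loss(anchors, positives,
                    dims=(768, 512, 256, 128, 64, 32), weights=None):
    """Apply the same base loss on each truncated prefix of the
    embeddings and sum the results, optionally weighted per dimension."""
    weights = weights or [1.0] * len(dims)
    return sum(w * toy_loss(anchors[:, :d], positives[:, :d])
               for w, d in zip(weights, dims))

rng = np.random.default_rng(0)
a, p = rng.normal(size=(8, 768)), rng.normal(size=(8, 768))
print(round(matryoshka_loss(a, p), 4))
```

Because every truncation level contributes to the total, the model is pushed to pack the most important information into the leading dimensions.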
Inference
After a model has been trained using a Matryoshka loss, you can run inference with it using `SentenceTransformer.encode`. You must then truncate the resulting embeddings to the desired dimension, and it is recommended to renormalize them.

As you can see, the similarity between the search query and the correct document is much higher than that of an unrelated document, despite the very small Matryoshka dimension applied. Feel free to copy this script locally, modify `matryoshka_dim`, and observe the difference in similarities.

Note: despite the embeddings being smaller, training and inference of a Matryoshka model are not faster, not more memory-efficient, and not smaller. Only the processing and storage of the resulting embeddings will be faster and cheaper.
Extra information:
Example training scripts:
CoSENTLoss (#2454)
CoSENTLoss was introduced by Jianlin Su, 2022 as a drop-in replacement for CosineSimilarityLoss. Experiments have shown that it produces a stronger learning signal than `CosineSimilarityLoss`.

You can update training_stsbenchmark.py by replacing `CosineSimilarityLoss` with `CoSENTLoss` and observe the improved performance.

AnglELoss (#2471)
AnglELoss was introduced in Li and Li, 2023. It is an adaptation of CoSENTLoss, and also acts as a strong drop-in replacement for `CosineSimilarityLoss`. Compared to `CoSENTLoss`, `AnglELoss` uses a different similarity function which aims to avoid vanishing gradients.

Like `CoSENTLoss`, you can use it just like `CosineSimilarityLoss`. You can update training_stsbenchmark.py by replacing `CosineSimilarityLoss` with `AnglELoss` and observe the improved performance.

Prompt Templates (#2477)
Some models require using specific text prompts to achieve optimal performance. For example, with intfloat/multilingual-e5-large you should prefix all queries with `query: ` and all passages with `passage: `. Another example is BAAI/bge-large-en-v1.5, which performs best for retrieval when the input texts are prefixed with `Represent this sentence for searching relevant passages: `.

Sentence Transformer models can now be initialized with `prompts` and `default_prompt_name` parameters:

- `prompts` is an optional argument that accepts a dictionary mapping prompt names to prompt texts. The prompt will be prepended to the input text during inference.
- `default_prompt_name` is an optional argument that determines the default prompt to be used. It has to correspond with a prompt name from `prompts`. If `None`, then no prompt is used by default.

Both of these parameters can also be specified in the `config_sentence_transformers.json` file of a saved model. That way, you won't have to specify these options manually when loading; when you save a Sentence Transformer model, these options are automatically saved as well.

During inference, prompts can be applied in a few different ways. All of these scenarios result in identical texts being embedded:
- Using the `prompt` option in `SentenceTransformer.encode`.
- Using the `prompt_name` option in `SentenceTransformer.encode`, relying on the prompts loaded from a) initialization or b) the model config.
- If neither `prompt` nor `prompt_name` is specified in `SentenceTransformer.encode`, then the prompt specified by `default_prompt_name` will be applied. If it is `None`, then no prompt will be applied.

Instructor support (#2477)
Some INSTRUCTOR models, such as hkunlp/instructor-large, are natively supported in Sentence Transformers. These models are special, as they are trained with instructions in mind. Notably, the primary difference between normal Sentence Transformer models and Instructor models is that the latter do not include the instructions themselves in the pooling step.
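That pooling difference can be sketched like this (a hypothetical helper, not the library's Pooling module): the instruction tokens are simply excluded before mean pooling.

```python
import numpy as np

def mean_pool_excluding_prompt(token_embeddings: np.ndarray,
                               prompt_length: int) -> np.ndarray:
    """Mean-pool token embeddings while skipping the first `prompt_length`
    instruction tokens -- the key trait of Instructor-style pooling.
    Hypothetical sketch, not the library's Pooling implementation."""
    return token_embeddings[prompt_length:].mean(axis=0)

tokens = np.array([[9.0, 9.0],   # instruction token, excluded from pooling
                   [1.0, 3.0],
                   [3.0, 5.0]])
print(mean_pool_excluding_prompt(tokens, prompt_length=1))  # [2. 4.]
```

A regular Sentence Transformer model would average all three rows, instruction included; an Instructor-style model only averages the content tokens.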
The following models work out of the box:
You can use these models like so:
Information Retrieval usage
All other Instructor models either 1) will not load, as they refer to `InstructorEmbedding` in their `modules.json`, or 2) require calling `model.set_pooling_include_prompt(include_prompt=False)` after loading.

Configuration
📅 Schedule: Branch creation - "* 0-4 * * 3" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.