🏆

Top-performing methodologies in the MTEB

Author

Woomyoung Park / CDO & Head of Research

MTEB(Massive Text Embedding Benchmark)

What is MTEB?

•

A large-scale benchmark created to measure the performance of text embedding models in various embedding tasks

•

Number of datasets, languages, scores, and models as of October 10, 2023

◦

Total Datasets: 129

◦

Total Languages: 113

◦

Total Scores: 14667

◦

Total Models: 126

An overview of tasks and datasets in MTEB(https://arxiv.org/pdf/2210.07316)

References :

https://github.com/embeddings-benchmark/mteb

https://huggingface.co/spaces/mteb/leaderboard

https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB (Models specialized for Chinese

Top-performing methodologies within MTEB

1. FlagEmbedding

•

BAAI General Embedding

•

https://github.com/FlagOpen/FlagEmbedding 

(1) Datasets and Training Process

•

Pre-training dataset: Pile, Wikipedia, MS MARCO, etc.

•

Fine-tuning datasets: Wikipedia, CC-NET etc.

•

Dataset composition: Questions, positive answers, negative answers

(2) Training Details:

•

Trained using custom base model, RetroMAE

•

Pre-training:

◦

Batch size: 720

◦

Optimizer: AdamW

◦

Learning Rate (LR): 2e-5

•

Fine-tuning:

◦

Batch size: 32,784

◦

Optimizer: AdamW

◦

Learning Rate (LR): 1e-5

◦

Temperature setting: 0.01

•

Feature: Utilizes "hard negative" learning method during fine-tuning

•

Ratio: 1 positive answer, 1 hard negative answer, 65,566 negative answers

(3) Characteristics

•

For retrieval tasks, a special instruction, "Represent this sentence for searching relevant passages:" was added to perform query encoding, and sentences were made relatively longer to mitigate the length difference between short queries and long documents

•

GTE argues that during fine-tuning, because there are hard negative sentences, the batch size need not be very large

•

Negative answers imply responses that are close to being irrelevant

2. GTE(Towards General Text Embeddings with Multi-stage Contrastive Learning)

Datasets:

•

Pre-training: 788 million response pairs

◦

Response pairs generated in a manner similar to E5's CCPairs data creation method

◦

Filtering methodology undisclosed

•

Fine-tuning: 3 million response pairs

◦

Various datasets used including MS MARCO, NQ, TriviaQA, HotpotQA, Web Questions, SNLI, MNLI, FEVER, Quora, MEDI, BERRI, etc.

Training:

•

Transformer Encoder

•

MiniLM, bert-base, and bert-large used as backbone models.

•

Vanilla dual-encoder structure with mean pooling used on the output layer

•

Pre-training 

•

“Improved” InfoNCE loss

•

Loss calculation:

◦

Typically, the loss is calculated between query and document pairs.

◦

GTE also adds comparisons between queries and between documents in the loss calculation.

•

Pre-training:

◦

Contrastive training with only in-batch negatives.

◦

Due to the absence of hard negatives, used a very large batch size (16,384).

◦

To compensate, limited the max length to 128 tokens.

•

Fine-tuning:

◦

Because hard negative responses are available (random negatives are used for data without hard negatives), a very large batch size isn't necessary. Thus, batch size is set to 128.

◦

Maximum length was set to 512 tokens.

•

Observation:

◦

Simply fine-tuning the backbone model may result in worse performance, possibly due to limitations in data scale.

•

Performance comparison based on the number of training data samples, batch size, and number of model parameters.

Scaling analysis of various factors during contrastive pre-training and fine-tuning.

Model performance is measured by the average performance on MTEB.

•

Characteristics

◦

Uses an approach very similar to E5 with the same model structure, the same backbone models, and similar training methodologies

◦

There are slight differences in the method of generating datasets for pre-training

◦

More data is used for fine-tuning and no teacher model is used

3. E5

•

EmbEddings from bidirEctional Encoder rEpresentations

•

MS

•

https://arxiv.org/abs/2212.03533

•

Pre-training

•

Constructed a dataset called CCPairs”

•

Pairs are generated in forms such as (query, passage), (post, comment), (question, upvoted answer), (entity name+section title, passage), (title, abstract), (title, passage)

•

Initial Filtering: Heuristic-based filtering is applied to data from Reddit and Common Crawl, resulting in 130 million pairs.

•

Consistency-based filtering: A method where the model is first trained on initially collected data, then the training data is reintroduced to select top-k passages based on queries. Only data with actual mapped passages within k=2 are used. (The key is how much to train and iterate). 27 billion pairs

◦

Neural networks tend to fit clean data first and then overfit to noises in data later on

•

Fine-tuning:

◦

Datasets : MS MARCO, NQ, NLI. 

◦

By treating contradictions as hard negatives, the model  for NLI is being trained to distinguish between very similar but fundamentally different sentence pairs. 

◦

msmarco와 nq는 teacher 모델 (cross-encoder) 에서 추출. Teacher 모델로 SimLM 사용

•

Training

◦

Transformer Encoder

◦

MiniLM, bert-base, and bert-large are used as backbone models.

◦

Uses a Bi-encoder structure with average pooling for the output layer

◦

Pre-training

▪

InfoNCE

▪

Uses in-batch negatives. No explicit hard negatives

▪

Query / passage encoder uses parameter sharing. Instead, prefixes are added (query: , passage: )

◦

Fine-tuning

▪

Uses InfoNCE loss with added distillation loss from the teacher model

•

Discussion points

◦

Slightly better than InstructOR. State-of-the-art at the time

◦

Can reference methods for collecting training data, filtering methods, negative data mapping methods, training techniques, objective functions, and other factors

4. InstructOR

•

The University of Hong Kong, U of Washington, Meta, AllenAI

•

https://arxiv.org/abs/2212.09741

•

Dataset

◦

Modeling only through fine-tuning

◦

The authors directly constructed the MEDI dataset

▪

A total of 330 different datasets

▪

300 from Super-NaturalInstructions

◦

30 types of Embedding data :Sentence Transformers(https://huggingface.co/datasets/sentence-transformers/embedding-training-data), KILT, MedMCQA

◦

Uses existing embedding model (Sentence-T5) to perform positive / negative mapping

◦

Adds instructions during tuning to enable a single model to operate on various tasks

▪

Example-retrieval) Represent the Wikipedia question for re- trieving supporting documents:(query), Represent the Wikipedia document for retrieval:(document)

•

학습

Training

◦

Single encoder

◦

Uses GTR model as backbone: Initialized with T5 model, then pre-trained on web corpus, fine-tuned on information retrieval datasets

◦

Obtains embedding vector by mean pooling the representation of the last hidden layer

◦

Characterized by having only a fine-tuning stage

◦

InfoNCE

•

Observations

◦

Adding instructions tends to improve performance

◦

When trained on super-NI data, the performance gap between best and worst cases is much smaller. In other words, performance becomes more robust across different instructions

◦

Performance improves as instructions are added more precisely (!)

•

Performance improves with model size (scaling laws)

•

Shows superior performance in unseen domains

Discussion points

•

GTR is a form of T5 pre-training + fine-tuning using contrastive loss on search data

◦

A model that improves performance by collecting additional data for tuning and using prompts tailored to the purpose

Other methodologies worth referencing

•

Sentence-T5

◦

https://arxiv.org/abs/2108.08877

•

GTR

◦

https://arxiv.org/abs/2112.07899

•

SGPT

◦

https://arxiv.org/abs/2202.08904