Let’s Build Bg2Vec: Apply LLM2Vec on BgGPT

Martin Boyanov
May 20, 2024

With the advent of RAG and text-generation solutions, it has become essential to have a model that can extract meaningful representations of passages of text.

However, the latest and greatest text generators are trained to predict the next token (Next Token Prediction, NTP). While this is great for generating text, the representations such models produce are generally not well suited for similarity search.

This does not stop people from trying to use them in such a way. We can think of the following easy ways to encode documents from the token representations:

* Use the representation of the last token, since it has seen (via attention) the entire document.
* Apply a pooling operation, such as mean or max, over the output representations (both options are sketched below).
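
To make both options concrete, here is a minimal sketch using PyTorch and Hugging Face transformers. The checkpoint name is only an illustrative choice, and the pooling code is a generic pattern rather than anything taken from LLM2Vec itself.

```python
# Minimal sketch: last-token vs. mean pooling over a causal LM's hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "INSAIT-Institute/BgGPT-7B-Instruct-v0.2"  # illustrative; any decoder-only LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token            # many LLM tokenizers lack a pad token
model = AutoModel.from_pretrained(model_name)
model.eval()

texts = ["Първото изречение.", "Едно по-дълго второ изречение."]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state            # (batch, seq_len, dim)

mask = batch["attention_mask"].unsqueeze(-1).float()      # (batch, seq_len, 1)

# Option 1: representation of the last non-padding token (it attended to the whole text).
last_idx = batch["attention_mask"].sum(dim=1) - 1
last_token_emb = hidden[torch.arange(hidden.size(0)), last_idx]

# Option 2: mean pooling over the non-padding positions.
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```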

The authors of the LLM2Vec paper suggest a technique to augment any GPT model to do a better job of encoding the meaning of a passage. The steps are as follows:

1. Allow bidirectional attention, so that every token can attend to the entire document. Of course, the model was never trained this way, so on its own this step actually lowers the quality of the model.
2. Fine-tune the model using the Masked Next Token Prediction (MNTP) objective. This is very similar to the Masked Language Modeling (MLM) objective: we mask out a portion of the document and try to recover it. The difference is that we train on the shifted positions, so that the model still predicts the next token.
3. Further fine-tune the resulting model using SimCSE. This is a form of metric learning: we generate two representations of a given passage using different dropout masks and randomly sample a negative item. Then we train the model to assign higher similarity to the positive pair and lower similarity to the negative samples.
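
As a rough illustration of the SimCSE step, here is a sketch of the unsupervised contrastive loss. It assumes a hypothetical `encode` helper that returns pooled sentence embeddings and that dropout is active (the model is in training mode), and it uses in-batch negatives, which is one common realization of the "randomly sampled negative" described above.

```python
import torch
import torch.nn.functional as F

def simcse_loss(encode, sentences, temperature=0.05):
    """Unsupervised SimCSE: two dropout-perturbed views of each sentence form
    the positive pair; the other sentences in the batch act as negatives."""
    z1 = encode(sentences)   # (batch, dim) - first pass, one dropout mask
    z2 = encode(sentences)   # (batch, dim) - second pass, a different dropout mask

    # Pairwise cosine similarities between the two sets of views.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature

    # For sentence i the positive is z2[i]; everything else in row i is a negative.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```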

To highlight the differences between these training objectives, consider the following figures.

In traditional next-token prediction, we predict the next token from the activations of the previous token. Note that the representation of the previous token is influenced by all earlier tokens via the attention mechanism. However, it is influenced only by the past: attention is allowed only over previous tokens. This is also known as causal attention.
In contrast, masked language models mask out a portion of the input and attempt to recover it. The attention is now bidirectional: tokens can be influenced both by the future and by the past. The masked-out tokens are recovered using the activations at their corresponding positions.
Finally, MNTP is a mixture of the two approaches: we mask out a portion of the input tokens as in MLM, but we try to recover them using the activations of the previous token, as in NTP. This is a crucial step, because it retains most of the structure of the generative problem the model was originally trained on, so the fine-tuning process only needs to teach the model to leverage the bidirectional attention mechanism.
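
To make the "shifted positions" concrete, here is a sketch of how the MNTP inputs and labels could be constructed. The uniform random masking and the helper name are simplifications of my own, not the actual LLM2Vec implementation.

```python
import torch

def mntp_inputs_and_labels(input_ids, mask_token_id, mask_prob=0.2):
    """Mask tokens like MLM, but supervise the *previous* position like NTP."""
    input_ids = input_ids.clone()
    labels = torch.full_like(input_ids, -100)      # -100 is ignored by the loss

    # Choose positions to mask; position 0 has no previous token, so skip it.
    masked = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    masked[:, 0] = False

    # Like MLM: hide the chosen tokens in the input...
    original = input_ids[masked]
    input_ids[masked] = mask_token_id

    # ...but like NTP: place each hidden token's label one position earlier,
    # so the model predicts it from the representation of the preceding token.
    prev_positions = masked.roll(-1, dims=1)
    labels[prev_positions] = original
    return input_ids, labels
```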

This will be a series of blog posts in which we attempt to apply the LLM2Vec technique to BgGPT. We will use a dump of the Bulgarian Wikipedia as fine-tuning data (a loading sketch follows the plan below). The plan is as follows:

1. Part 1 — Overview & Objectives. (this post)
2. Part 2 — Preparing the training data
3. Part 3 — Masked Next Token Prediction training
4. Part 4 — SimCSE training
5. Part 5 — Evaluation of the resulting text encoder
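
As a small preview of Part 2, here is a hedged sketch of how the Bulgarian Wikipedia dump might be loaded from the Hugging Face hub. The dataset name and snapshot date are assumptions on my part and may differ from what the series ends up using.

```python
from datasets import load_dataset

# Assumption: the "wikimedia/wikipedia" dataset with a Bulgarian snapshot;
# the snapshot date is illustrative.
wiki_bg = load_dataset("wikimedia/wikipedia", "20231101.bg", split="train")
print(len(wiki_bg), wiki_bg[0]["title"])
```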

Stay tuned for the next parts in the series.

Warning:
This is a learning experience for me, so the final model might not end up actually being better. Still, we will encounter many challenges along the way and learn how to solve them.

References:

1. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders https://arxiv.org/abs/2404.05961
2. BgGPT: https://bggpt.ai/
