Let’s Build Bg2Vec: Apply LLM2Vec on BgGPT
With the rise of RAG and other text generation solutions, it has become essential to have a model that can extract meaningful representations of passages of text.
However, the latest and greatest text generators are trained to predict the next token (Next Token Prediction, or NTP). While this is great for generating text, the representations they produce are generally not well suited for similarity search.
That does not stop people from trying to use them this way. Two easy ways to encode a document from its token representations come to mind (a short code sketch follows the list):
* Use the representation of the last token, since it has seen (via attention) the entire document.
* Apply some kind of pooling operation, such as mean or max, over the output representations.
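As a rough illustration, here is a minimal Python sketch of both strategies using Hugging Face transformers. The checkpoint name is only a placeholder (not necessarily the exact BgGPT checkpoint we will use later in the series), `embed` is a made-up helper, and the last-token indexing assumes right padding:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint name; swap in the actual BgGPT model you want to probe.
model_name = "INSAIT-Institute/BgGPT-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "right"               # the last-token trick below assumes right padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # decoder tokenizers often lack a pad token
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def embed(texts, strategy="mean"):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state       # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)        # 1 for real tokens, 0 for padding
    if strategy == "last":
        # index of the last non-padding token in each sequence
        last = batch["attention_mask"].sum(dim=1) - 1
        return hidden[torch.arange(hidden.size(0)), last]
    # mean pooling over the non-padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)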
The authors of the LLM2Vec paper propose a technique for turning any decoder-only (GPT-style) model into a better encoder of passage meaning. The steps are as follows:
1. Allow bidirectional attention, so that every token can attend to the entire document. Of course, the model was not trained this way, so on its own this step actually lowers its quality (see the mask sketch after the list).
2. Fine-tune the model with the Masked Next Token Prediction (MNTP) objective. This is very similar to the Masked Language Modeling (MLM) objective: we mask out a portion of the document and try to recover it. The difference is that the loss is computed at the shifted positions, so the model still predicts the next token (see the MNTP sketch after the list).
3. Further fine-tune the resulting model with (unsupervised) SimCSE. This is a form of metric learning: we produce two representations of a given passage using different dropout masks to form a positive pair, and treat randomly sampled (in-batch) passages as negatives. The model is then trained to predict higher similarity for the positive pair and lower similarity for the negatives (see the SimCSE sketch after the list).
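To make step 1 concrete, the only thing that changes is the attention mask. In practice LLM2Vec patches the decoder's attention layers so the causal mask is replaced; the rough PyTorch sketch below just contrasts the two mask shapes (padding handling omitted):

```python
import torch

seq_len = 5
# causal mask: token i may attend only to positions <= i
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# bidirectional mask: every token may attend to the whole sequence
bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal.int())         # lower-triangular matrix of ones
print(bidirectional.int())  # all ones
```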
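Step 2 is easiest to see in the label construction. Here is a minimal sketch (my own simplification, not the authors' code): tokens are masked as in MLM, but the loss for a masked token at position i is taken from the logits at position i-1, preserving the usual next-token shift. `mask_token_id` is whatever placeholder token you pick, since decoder tokenizers typically have no dedicated mask token, and the masking ratio is illustrative:

```python
import torch
import torch.nn.functional as F

def mntp_loss(model, input_ids, mask_token_id, mask_prob=0.15):
    labels = input_ids.clone()
    # choose positions to mask; skip position 0, which has no preceding token
    to_mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    to_mask[:, 0] = False
    masked = input_ids.masked_fill(to_mask, mask_token_id)
    labels[~to_mask] = -100                      # compute loss only on masked positions

    logits = model(input_ids=masked).logits      # (batch, seq_len, vocab)
    # keep the next-token shift: logits at position i-1 predict the token masked at i
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```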
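And a rough sketch of step 3, the unsupervised SimCSE objective: the same batch is encoded twice with dropout active, the two views of each passage form the positive pair, and the other passages in the batch serve as the negatives. `encode` is assumed to be a function returning one pooled embedding per passage (for example the mean pooling from the earlier sketch, with the model in train mode so dropout is on), and the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def simcse_loss(encode, texts, temperature=0.05):
    z1 = encode(texts)                       # first view, dropout mask A
    z2 = encode(texts)                       # second view, dropout mask B
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temperature            # (batch, batch) cosine similarities
    # the diagonal holds the positive pairs; everything else acts as a negative
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)
```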
To highlight the difference between the original and the modified model, consider the following figures.
This will be a series of blog posts in which we attempt to apply the LLM2Vec technique to BgGPT, using a dump of the Bulgarian Wikipedia as fine-tuning data. The plan is as follows:
1. Part 1 — Overview & Objectives. (this post)
2. Part 2 — Preparing the training data
3. Part 3 — Masked Next Token Prediction training
4. Part 4 — SimCSE training
5. Part 5 — Evaluation of the resulting text encoder
Stay tuned for the next parts in the series.
Warning:
This is a learning experience for me, so the final model might not actually end up being better. Still, we will run into many challenges along the way and learn how to solve them.
References:
1. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. https://arxiv.org/abs/2404.05961
2. BgGPT. https://bggpt.ai/