Train your own AI: Fine-tune a large language model for sentence similarity


“Fine-tuning” means adapting an existing machine learning model to a specific task or use case. In this post I’m going to walk you through how you can fine-tune a large language model for sentence similarity using some hand-annotated training data. The example is from the psychology domain.

You need training data consisting of pairs of sentences, together with a “ground truth” score saying how similar you want each pair to be once your custom sentence similarity model is trained.
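For example, a couple of rows in that format might look like this (the pairs and scores here are made up for illustration, not taken from the real dataset):

import pandas as pd

# Hypothetical example rows: each pair of questionnaire items gets a
# human-annotated similarity score between 0 and 1.
example = pd.DataFrame({
    "sentence1": ["Feeling nervous, anxious or on edge?",
                  "Little interest or pleasure in doing things"],
    "sentence2": ["Felt nervous or anxious?",
                  "Do you believe in telepathy (mind-reading)?"],
    "score": [0.9, 0.0],
})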

We are going to train this model for Harmony, a web-based tool which allows researchers in social-science fields such as psychology to compare questionnaire items and understand the semantic similarity between those items as a percentage value. Harmony’s original underlying model was a standard multilingual LLM from HuggingFace, but this general model had some shortcomings, so we decided to fine-tune a model for the psychology domain.

This notebook is adapted from the scripts for the Harmony/DOXA AI competition/AI hackathon (fine-tune an LLM for the psychology domain), but it’s general and can be applied to other domains. Credit to Jeremy Lo Ying Ping for the training code.

Getting started: installing and importing dependencies and downloading the training data

First I’m going to install and import all the libraries which we’re going to use: Pandas, scikit-learn and the HuggingFace libraries.

In the Terminal or Console we can use Pip to install the libraries:

pip install pandas==2.2.2 scikit-learn transformers==4.43.1 "sentence-transformers[train]==3.0.1"

We can also download the training data with wget from the console. (Or alternatively you can download it manually if you don’t have wget.)

wget https://naturallanguageprocessing.com/harmony-matching-training-data.csv.zip

Now we switch to the Python interpreter and we can import the libraries:

import pandas as pd

Let’s load our training data from the zipped CSV file we just downloaded:

df = pd.read_csv("harmony-matching-training-data.csv.zip")

Now that we’ve got our training data loaded, we can have a look inside:

df
      sentence1                                           sentence2                                           score
0     Do you believe in telepathy (mind-reading)?        I believe that there are secret signs in the w...  0.15
1     Irritable behavior, angry outbursts, or acting...  Felt “on edge”?                                     0.62
2     I have some eccentric (odd) habits.                I often have difficulty following what someone...  0.00
3     Do you often feel nervous when you are in a gr...  Been easily annoyed by different things?           0.00
4     Do you believe in telepathy (mind-reading)?        Most of the time I find it is very difficult t...  0.26
...   ...                                                 ...                                                ...
2346  Little interest or pleasure in doing things        At times I have wondered if my body was really...  0.00
2347  Feeling down, depressed, or hopeless?              I find that I am very often confused about wha...  0.00
2348  Not being able to stop or control worrying?        If given the choice, I would much rather be wi...  0.16
2349  Feeling nervous, anxious or on edge?               Have had changes in appetite or sleep?             0.16
2350  Do you believe in clairvoyance (psychic forces...  Felt hopeless?                                     0.20

2351 rows × 3 columns

Splitting into train and test sets

Now we’re going to split the dataframe of training data into a train set and a test set. This is not strictly necessary to get the fine-tuning to run, but it’s good practice in machine learning to always split off a test set so that you have something to evaluate on once you’ve trained the model.

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df)

We also need to reset the indices in our two dataframes so that they are a contiguous sequence of integers again; otherwise, when we convert to HuggingFace format below, Dataset.from_pandas will carry the old dataframe index along as an extra column.

df_train.reset_index(inplace=True)
df_test.reset_index(inplace=True)
df_train.drop(columns=["index"], inplace=True)
df_test.drop(columns=["index"], inplace=True)
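As an aside, pandas can do both steps in one call: reset_index(drop=True) discards the old index instead of keeping it as a column:

df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)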

Let’s check the sizes of our training and test sets. By default scikit-learn puts 25% of the data into the test set and 75% into the training set (although this behaviour may change in future, so I would not rely on it; see the explicit alternative below).

len(df_train), len(df_test)
(1763, 588)
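If you would rather not rely on that default, you can make the split explicit and reproducible by passing the proportion and a seed yourself (the values here are just examples):

df_train, df_test = train_test_split(df, test_size=0.25, random_state=42)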

Converting from Pandas to HuggingFace dataset format

Now let’s convert the Pandas dataframes into HuggingFace Datasets, which is what we will need for the training:

from datasets import Dataset
dataset_train = Dataset.from_pandas(df_train)

Let’s take a look at our training dataset now that it’s in HuggingFace format:

dataset_train
Dataset({
    features: ['sentence1', 'sentence2', 'score'],
    num_rows: 1763
})

We can do the same with the test dataset:

dataset_test = Dataset.from_pandas(df_test)

Initialising the model that we will fine-tune

Although there were a number of mental health models already on the HuggingFace Hub, we are going to start with a generalist model which produces embeddings in 768 dimensions: the sentence transformer all-mpnet-base-v2, which we will then fine-tune.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")

Let’s check the vital statistics of our model:

model
    SentenceTransformer(
      (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
      (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    )
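We can also confirm the embedding dimension programmatically:

model.get_sentence_embedding_dimension()
    768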

Let’s import the training libraries. We’re going to use the cosine similarity loss function, which trains the model so that the cosine similarity between the embeddings of each sentence pair matches the annotated score.

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import CosineSimilarityLoss
loss = CosineSimilarityLoss(model)

Setting up the training

Let’s define our training hyperparameters. For the purposes of this demonstration we will train for only 3 epochs, but in practice you would set this much higher, monitor your training and test set losses, and stop training when the test loss stops improving (see the sketch below for wiring up the test set). You can also experiment with other parameters such as the learning rate and batch size.

trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(
        output_dir="checkpoints",
        num_train_epochs=3,
        per_device_eval_batch_size=16,
    ),
    train_dataset=dataset_train,
    loss=loss,
)
trainer
    <sentence_transformers.trainer.SentenceTransformerTrainer at 0x7a081c90f810>
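If you also want to monitor the loss on the test set during training, as recommended above, you can pass the test dataset to the trainer as well. A sketch, assuming the argument names of the transformers/sentence-transformers versions pinned earlier:

trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(
        output_dir="checkpoints",
        num_train_epochs=3,
        per_device_eval_batch_size=16,
        eval_strategy="epoch",  # evaluate at the end of every epoch
    ),
    train_dataset=dataset_train,
    eval_dataset=dataset_test,  # loss will also be reported on this set
    loss=loss,
)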

Running the training

The training takes about 10 minutes on my laptop without a GPU. You should see a progress bar in your Jupyter or Colab notebook:

trainer.train()
[663/663 09:12, Epoch 3/3]
Step  Training Loss
500   0.059000

TrainOutput(global_step=663, training_loss=0.05342011286302569, metrics={'train_runtime': 553.4312, 'train_samples_per_second': 9.557, 'train_steps_per_second': 1.198, 'total_flos': 0.0, 'train_loss': 0.05342011286302569, 'epoch': 3.0})
model
    SentenceTransformer(
      (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
      (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    )
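At this point it’s worth saving the fine-tuned model to disk so you can reload it later without retraining (the path is just an example):

model.save("finetuned-all-mpnet-base-v2")
# Later, reload with: model = SentenceTransformer("finetuned-all-mpnet-base-v2")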

Training has finished. Let’s encode a sentence with our model!

Now we can try out the model by encoding a single sentence:

model.encode(["I feel nervous, anxious or afraid"])
    array([[-9.93097015e-03, -7.88239613e-02,  6.90429413e-04,
            -5.47722541e-03, -7.27352547e-03,  5.51959611e-02,
             .... [abbreviated] ...
             2.22559460e-02, -4.00685295e-02, -5.54708019e-02,
             9.85449180e-03, -2.92167068e-02, -2.18849741e-02,
            -3.62964422e-02, -1.33359190e-02, -3.98904458e-02]], dtype=float32)
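The result is one 768-dimensional vector per input sentence, which we can confirm from the shape:

model.encode(["I feel nervous, anxious or afraid"]).shape
    (1, 768)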

Embed the entire dataframe

We can take the two text columns in our dataframe and embed them all in one go:

sentence_1_embeddings = model.encode(df_test.sentence1)
sentence_2_embeddings = model.encode(df_test.sentence2)
sentence_1_embeddings.shape
    (588, 768)

Calculate the cosine similarity

Now we’re going to use a cosine similarity function. I’ve copied this from Harmony’s matching function.

from numpy import dot, matmul, ndarray, matrix
from numpy.linalg import norm
import numpy as np

def cosine_similarity(vec1: ndarray, vec2: ndarray) -> ndarray:
    # Dot product of every row of vec1 with every row of vec2
    dp = dot(vec1, vec2.T)
    # Row norms of each embedding matrix; matmul(m1.T, m2) then gives
    # the grid of norm products matching dp's shape
    m1 = matrix(norm(vec1, axis=1))
    m2 = matrix(norm(vec2.T, axis=0))

    return np.asarray(dp / matmul(m1.T, m2))

similarity_matrix = cosine_similarity(sentence_1_embeddings, sentence_2_embeddings)
similarity_matrix.shape
    (588, 588)
df_test["y_pred"] = [similarity_matrix[i,i] for i in range(len(similarity_matrix))]
df_test
      sentence1                                           sentence2                                           score  y_pred
0     I sometimes jump quickly from one topic to ano...  People find my conversations to be confusing o...  0.70   0.700950
1     Trouble concentrating on things, such as readi...  Felt nervous or anxious?                            0.00   0.289797
2     Loss of interest in activities that you used t...  My thoughts and behaviors are almost always di...  0.70   0.197288
3     Avoiding external reminders of the stressful e...  Some people can make me aware of them just by ...  0.00   0.114830
4     I sometimes jump quickly from one topic to ano...  I have trouble following conversations with ot...  0.25   0.433558
...   ...                                                 ...                                                ...    ...
583   I often ramble on too much when speaking.          I have had the momentary feeling that someone'...  0.01   0.049073
584   I sometimes jump quickly from one topic to ano...  Avoiding external reminders of the experience ...  0.40   0.114422
585   Being so restless that it is hard to sit still?    I believe that dreams have magical properties.      0.37   0.066781
586   Trouble relaxing?                                   Throughout my life, very few things have been ...  0.00   0.244570
587   I sometimes forget what I am trying to say.        Some people can make me aware of them just by ...  0.00   0.031787

588 rows × 4 columns
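As an aside, sentence-transformers ships its own pairwise cosine similarity helper, and NumPy can pull out the diagonal directly, so an equivalent (and shorter) formulation would be:

from sentence_transformers import util
import numpy as np

# util.cos_sim returns a torch tensor of pairwise cosine similarities
similarity_matrix = util.cos_sim(sentence_1_embeddings, sentence_2_embeddings).numpy()
df_test["y_pred"] = np.diag(similarity_matrix)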

df_test["residual"] = df_test.y_pred - df_test.score
df_test
      sentence1                                           sentence2                                           score  y_pred    residual
0     I sometimes jump quickly from one topic to ano...  People find my conversations to be confusing o...  0.70   0.700950  0.000950
1     Trouble concentrating on things, such as readi...  Felt nervous or anxious?                            0.00   0.289797  0.289797
2     Loss of interest in activities that you used t...  My thoughts and behaviors are almost always di...  0.70   0.197288  -0.502712
3     Avoiding external reminders of the stressful e...  Some people can make me aware of them just by ...  0.00   0.114830  0.114830
4     I sometimes jump quickly from one topic to ano...  I have trouble following conversations with ot...  0.25   0.433558  0.183558
...   ...                                                 ...                                                ...    ...       ...
583   I often ramble on too much when speaking.          I have had the momentary feeling that someone'...  0.01   0.049073  0.039073
584   I sometimes jump quickly from one topic to ano...  Avoiding external reminders of the experience ...  0.40   0.114422  -0.285578
585   Being so restless that it is hard to sit still?    I believe that dreams have magical properties.      0.37   0.066781  -0.303219
586   Trouble relaxing?                                   Throughout my life, very few things have been ...  0.00   0.244570  0.244570
587   I sometimes forget what I am trying to say.        Some people can make me aware of them just by ...  0.00   0.031787  0.031787

588 rows × 5 columns

Finally, let’s compute the mean squared error and the mean absolute error of the residuals on the test set:

np.mean(df_test.residual * df_test.residual)
    np.float64(0.06922512204080025)
np.mean(np.abs(df_test.residual))
    np.float64(0.20813965150724928)
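A metric often reported for sentence similarity models is the Spearman rank correlation between the predicted and annotated scores; if you have SciPy installed, it’s a one-liner:

from scipy.stats import spearmanr

# Correlation between the annotated scores and the model's predictions
spearmanr(df_test.score, df_test.y_pred)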
