Train your own AI: Fine-tune a large language model for sentence similarity


“Fine-tuning” means adapting an existing machine learning model to a specific task or use case. In this post I’m going to walk you through how you can fine-tune a large language model for sentence similarity using some hand-annotated training data. The example is from the psychology domain.

You need training data consisting of pairs of sentences, together with a “ground truth” score saying how similar you want each pair to be once your custom sentence similarity model is trained.
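For example, a couple of rows in that format might look like this (the pairs and scores here are made up for illustration, not taken from the real dataset):

import pandas as pd

# Hypothetical example rows: each pair of questionnaire items gets a
# human-annotated similarity score between 0 and 1.
example = pd.DataFrame({
    "sentence1": ["Feeling nervous, anxious or on edge?",
                  "Little interest or pleasure in doing things"],
    "sentence2": ["Felt nervous or anxious?",
                  "Do you believe in telepathy (mind-reading)?"],
    "score": [0.9, 0.0],
})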

We are going to train this model for Harmony, a web-based tool which allows researchers in social-science fields such as psychology to compare questionnaire items and understand the semantic similarity between those items as a percentage value. Harmony’s original underlying model was a standard multilingual LLM from HuggingFace, but this general model had some shortcomings, so we decided to fine-tune a model for the psychology domain.

This notebook is adapted from the scripts for the Harmony/DOXA AI competition/AI hackathon (fine-tune an LLM for the psychology domain), but it’s general and can be applied to other domains. Credit to Jeremy Lo Ying Ping for the training code.

Getting started: installing and importing dependencies and downloading the training data

First I’m going to install and import all the libraries which we’re going to use: Pandas, scikit-learn and the HuggingFace libraries.

In the Terminal or Console we can use Pip to install the libraries:

pip install pandas==2.2.2 scikit-learn transformers==4.43.1 "sentence-transformers[train]==3.0.1"

We can also download the training data with wget from the console. (Or alternatively you can download it manually if you don’t have wget.)

wget https://naturallanguageprocessing.com/harmony-matching-training-data.csv.zip

Now we switch to the Python interpreter and we can import the libraries:

import pandas as pd

Let’s load our training data from the zipped CSV file we just downloaded:

df = pd.read_csv("harmony-matching-training-data.csv.zip")

Now that we’ve got our training data loaded, we can have a look inside:

df
      sentence1                                           sentence2                                           score
0     Do you believe in telepathy (mind-reading)?        I believe that there are secret signs in the w...  0.15
1     Irritable behavior, angry outbursts, or acting...  Felt “on edge”?                                     0.62
2     I have some eccentric (odd) habits.                I often have difficulty following what someone...  0.00
3     Do you often feel nervous when you are in a gr...  Been easily annoyed by different things?           0.00
4     Do you believe in telepathy (mind-reading)?        Most of the time I find it is very difficult t...  0.26
...   ...                                                 ...                                                ...
2346  Little interest or pleasure in doing things        At times I have wondered if my body was really...  0.00
2347  Feeling down, depressed, or hopeless?              I find that I am very often confused about wha...  0.00
2348  Not being able to stop or control worrying?        If given the choice, I would much rather be wi...  0.16
2349  Feeling nervous, anxious or on edge?               Have had changes in appetite or sleep?             0.16
2350  Do you believe in clairvoyance (psychic forces...  Felt hopeless?                                     0.20

2351 rows × 3 columns

Splitting into train and test sets

Now we’re going to split the dataframe of training data into a train set and a test set. This is not strictly necessary to get the fine-tuning to run, but it’s good practice in machine learning to always split off a test set so that you have something to evaluate on once you’ve trained the model.

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df)

We also need to reset the indices in our two dataframes so that they are a contiguous sequence of integers again; otherwise, when we convert to HuggingFace format below, Dataset.from_pandas will carry the old dataframe index along as an extra column.

df_train.reset_index(inplace=True)
df_test.reset_index(inplace=True)
df_train.drop(columns=["index"], inplace=True)
df_test.drop(columns=["index"], inplace=True)
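As an aside, pandas can do both steps in one call: reset_index(drop=True) discards the old index instead of keeping it as a column:

df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)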

Let’s check the sizes of our training and test sets. By default scikit-learn puts 25% of the data into the test set and 75% into the training set (although this behaviour may change in future, so I would not rely on it; see the explicit alternative below).

len(df_train), len(df_test)
(1763, 588)
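If you would rather not rely on that default, you can make the split explicit and reproducible by passing the proportion and a seed yourself (the values here are just examples):

df_train, df_test = train_test_split(df, test_size=0.25, random_state=42)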

Converting from Pandas to HuggingFace dataset format

Now let’s convert the Pandas dataframes into HuggingFace Datasets, which is what we will need for the training:

from datasets import Dataset
dataset_train = Dataset.from_pandas(df_train)

Let’s take a look at our training dataset now that it’s in HuggingFace format:

dataset_train
Dataset({
    features: ['sentence1', 'sentence2', 'score'],
    num_rows: 1763
})

We can do the same with the test dataset:

dataset_test = Dataset.from_pandas(df_test)

Initialising the model that we will fine-tune

Although there were a number of mental health models already on the HuggingFace Hub, we are going to start with a generalist model which produces embeddings in 768 dimensions: the sentence transformer all-mpnet-base-v2, which we will then fine-tune.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")

Let’s check the vital statistics of our model:

model
    SentenceTransformer(
      (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
      (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    )
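We can also confirm the embedding dimension programmatically:

model.get_sentence_embedding_dimension()
    768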

Let’s import the training libraries. We’re going to use the cosine similarity loss function, which trains the model so that the cosine similarity between the embeddings of each sentence pair matches the annotated score.

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import CosineSimilarityLoss
loss = CosineSimilarityLoss(model)

Setting up the training

Let’s define our training hyperparameters. For the purposes of this demonstration we will train for only 3 epochs, but in practice you would set this much higher, monitor your training and test set losses, and stop training when the test loss stops improving (see the sketch below for wiring up the test set). You can also experiment with other parameters such as the learning rate and batch size.

trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(
        output_dir="checkpoints",
        num_train_epochs=3,
        per_device_eval_batch_size=16,
    ),
    train_dataset=dataset_train,
    loss=loss,
)
trainer
    <sentence_transformers.trainer.SentenceTransformerTrainer at 0x7a081c90f810>
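If you also want to monitor the loss on the test set during training, as recommended above, you can pass the test dataset to the trainer as well. A sketch, assuming the argument names of the transformers/sentence-transformers versions pinned earlier:

trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(
        output_dir="checkpoints",
        num_train_epochs=3,
        per_device_eval_batch_size=16,
        eval_strategy="epoch",  # evaluate at the end of every epoch
    ),
    train_dataset=dataset_train,
    eval_dataset=dataset_test,  # loss will also be reported on this set
    loss=loss,
)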

Running the training

The training takes about 10 minutes on my laptop without a GPU. You should see a progress bar in your Jupyter or Colab notebook:

trainer.train()
[663/663 09:12, Epoch 3/3]
Step  Training Loss
500   0.059000

TrainOutput(global_step=663, training_loss=0.05342011286302569, metrics={'train_runtime': 553.4312, 'train_samples_per_second': 9.557, 'train_steps_per_second': 1.198, 'total_flos': 0.0, 'train_loss': 0.05342011286302569, 'epoch': 3.0})
model
    SentenceTransformer(
      (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
      (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    )
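At this point it’s worth saving the fine-tuned model to disk so you can reload it later without retraining (the path is just an example):

model.save("finetuned-all-mpnet-base-v2")
# Later, reload with: model = SentenceTransformer("finetuned-all-mpnet-base-v2")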

Training has finished. Let’s encode a sentence with our model!

Now we can try out the model by encoding a single sentence:

model.encode(["I feel nervous, anxious or afraid"])
    array([[-9.93097015e-03, -7.88239613e-02,  6.90429413e-04,
            -5.47722541e-03, -7.27352547e-03,  5.51959611e-02,
             .... [abbreviated] ...
             2.22559460e-02, -4.00685295e-02, -5.54708019e-02,
             9.85449180e-03, -2.92167068e-02, -2.18849741e-02,
            -3.62964422e-02, -1.33359190e-02, -3.98904458e-02]], dtype=float32)
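The result is one 768-dimensional vector per input sentence, which we can confirm from the shape:

model.encode(["I feel nervous, anxious or afraid"]).shape
    (1, 768)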

Embed the entire dataframe

We can take the two text columns in our dataframe and embed them all in one go:

sentence_1_embeddings = model.encode(df_test.sentence1)
sentence_2_embeddings = model.encode(df_test.sentence2)
sentence_1_embeddings.shape
    (588, 768)

Calculate the cosine similarity

Now we’re going to use a cosine similarity function. I’ve copied this from Harmony’s matching function.

from numpy import dot, matmul, ndarray, matrix
from numpy.linalg import norm
import numpy as np

def cosine_similarity(vec1: ndarray, vec2: ndarray) -> ndarray:
    # Dot product of every row of vec1 with every row of vec2
    dp = dot(vec1, vec2.T)
    # Row norms of each embedding matrix; matmul(m1.T, m2) then gives
    # the grid of norm products matching dp's shape
    m1 = matrix(norm(vec1, axis=1))
    m2 = matrix(norm(vec2.T, axis=0))

    return np.asarray(dp / matmul(m1.T, m2))

similarity_matrix = cosine_similarity(sentence_1_embeddings, sentence_2_embeddings)
similarity_matrix.shape
    (588, 588)
df_test["y_pred"] = [similarity_matrix[i,i] for i in range(len(similarity_matrix))]
df_test
      sentence1                                           sentence2                                           score  y_pred
0     I sometimes jump quickly from one topic to ano...  People find my conversations to be confusing o...  0.70   0.700950
1     Trouble concentrating on things, such as readi...  Felt nervous or anxious?                            0.00   0.289797
2     Loss of interest in activities that you used t...  My thoughts and behaviors are almost always di...  0.70   0.197288
3     Avoiding external reminders of the stressful e...  Some people can make me aware of them just by ...  0.00   0.114830
4     I sometimes jump quickly from one topic to ano...  I have trouble following conversations with ot...  0.25   0.433558
...   ...                                                 ...                                                ...    ...
583   I often ramble on too much when speaking.          I have had the momentary feeling that someone'...  0.01   0.049073
584   I sometimes jump quickly from one topic to ano...  Avoiding external reminders of the experience ...  0.40   0.114422
585   Being so restless that it is hard to sit still?    I believe that dreams have magical properties.      0.37   0.066781
586   Trouble relaxing?                                   Throughout my life, very few things have been ...  0.00   0.244570
587   I sometimes forget what I am trying to say.        Some people can make me aware of them just by ...  0.00   0.031787

588 rows × 4 columns
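As an aside, sentence-transformers ships its own pairwise cosine similarity helper, and NumPy can pull out the diagonal directly, so an equivalent (and shorter) formulation would be:

from sentence_transformers import util
import numpy as np

# util.cos_sim returns a torch tensor of pairwise cosine similarities
similarity_matrix = util.cos_sim(sentence_1_embeddings, sentence_2_embeddings).numpy()
df_test["y_pred"] = np.diag(similarity_matrix)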

df_test["residual"] = df_test.y_pred - df_test.score
df_test
      sentence1                                           sentence2                                           score  y_pred    residual
0     I sometimes jump quickly from one topic to ano...  People find my conversations to be confusing o...  0.70   0.700950  0.000950
1     Trouble concentrating on things, such as readi...  Felt nervous or anxious?                            0.00   0.289797  0.289797
2     Loss of interest in activities that you used t...  My thoughts and behaviors are almost always di...  0.70   0.197288  -0.502712
3     Avoiding external reminders of the stressful e...  Some people can make me aware of them just by ...  0.00   0.114830  0.114830
4     I sometimes jump quickly from one topic to ano...  I have trouble following conversations with ot...  0.25   0.433558  0.183558
...   ...                                                 ...                                                ...    ...       ...
583   I often ramble on too much when speaking.          I have had the momentary feeling that someone'...  0.01   0.049073  0.039073
584   I sometimes jump quickly from one topic to ano...  Avoiding external reminders of the experience ...  0.40   0.114422  -0.285578
585   Being so restless that it is hard to sit still?    I believe that dreams have magical properties.      0.37   0.066781  -0.303219
586   Trouble relaxing?                                   Throughout my life, very few things have been ...  0.00   0.244570  0.244570
587   I sometimes forget what I am trying to say.        Some people can make me aware of them just by ...  0.00   0.031787  0.031787

588 rows × 5 columns

Finally, let’s compute the mean squared error and the mean absolute error of the residuals on the test set:

np.mean(df_test.residual * df_test.residual)
    np.float64(0.06922512204080025)
np.mean(np.abs(df_test.residual))
    np.float64(0.20813965150724928)
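A metric often reported for sentence similarity models is the Spearman rank correlation between the predicted and annotated scores; if you have SciPy installed, it’s a one-liner:

from scipy.stats import spearmanr

# Correlation between the annotated scores and the model's predictions
spearmanr(df_test.score, df_test.y_pred)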
