“Fine-tuning” means adapting an existing machine learning model to a specific task or use case. In this post I’m going to walk you through how you can fine-tune a large language model for sentence similarity using some hand-annotated data. This example is in the psychology domain.
You need training data consisting of pairs of sentences, together with a “ground truth” score for how similar you want each pair of sentences to be once you have trained your custom sentence similarity model.
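Concretely, the CSV file we will use below has three columns: sentence1, sentence2 and score (a similarity value between 0 and 1). As a small illustration, here is one of the annotated pairs from the dataset loaded into a tiny dataframe:

import pandas as pd

# One annotated pair from the dataset used in this post: two questionnaire
# items and a human similarity judgement between 0 and 1
example = pd.DataFrame({
    "sentence1": ["Feeling nervous, anxious or on edge?"],
    "sentence2": ["Have had changes in appetite or sleep?"],
    "score": [0.16],
})
print(example)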
We are going to train this model for Harmony, a web-based tool which allows researchers in social science fields such as psychology to compare questionnaire items and see the semantic similarity between those items as a percentage value. Harmony originally used a standard multilingual model from HuggingFace under the hood, but this general-purpose model had some shortcomings, so we decided to fine-tune a model for the psychology domain.
This notebook is adapted from the scripts for the Harmony/DOXA AI competition/AI hackathon (fine-tuning an LLM for the psychology domain), but the approach is general and can be applied to other domains. Credit to Jeremy Lo Ying Ping for the training code.
First I’m going to install and import all the libraries which we’re going to use: Pandas, and the HuggingFace libraries.
In the Terminal or Console we can use Pip to install the libraries:
pip install pandas==2.2.2 transformers==4.43.1 sentence-transformers[train]==3.0.1
We can also download the training data with wget from the console. (Or alternatively you can download it manually if you don’t have wget.)
wget https://naturallanguageprocessing.com/harmony-matching-training-data.csv.zip
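If you don’t have wget, you can also fetch the file from Python using only the standard library. A small sketch, using the same URL as above:

from urllib.request import urlretrieve

# Download the zipped training data into the current directory
urlretrieve(
    "https://naturallanguageprocessing.com/harmony-matching-training-data.csv.zip",
    "harmony-matching-training-data.csv.zip",
)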
Now we switch to the Python interpreter and we can import the libraries:
import pandas as pd
Let’s load our training data from the zipped CSV file we just downloaded:
df = pd.read_csv("harmony-matching-training-data.csv.zip")
Now that we’ve got our training data loaded, we can have a look inside:
df
 | sentence1 | sentence2 | score |
---|---|---|---|
0 | Do you believe in telepathy (mind-reading)? | I believe that there are secret signs in the w... | 0.15 |
1 | Irritable behavior, angry outbursts, or acting... | Felt “on edge”? | 0.62 |
2 | I have some eccentric (odd) habits. | I often have difficulty following what someone... | 0.00 |
3 | Do you often feel nervous when you are in a gr... | Been easily annoyed by different things? | 0.00 |
4 | Do you believe in telepathy (mind-reading)? | Most of the time I find it is very difficult t... | 0.26 |
... | ... | ... | ... |
2346 | Little interest or pleasure in doing things | At times I have wondered if my body was really... | 0.00 |
2347 | Feeling down, depressed, or hopeless? | I find that I am very often confused about wha... | 0.00 |
2348 | Not being able to stop or control worrying? | If given the choice, I would much rather be wi... | 0.16 |
2349 | Feeling nervous, anxious or on edge? | Have had changes in appetite or sleep? | 0.16 |
2350 | Do you believe in clairvoyance (psychic forces... | Felt hopeless? | 0.20 |
2351 rows × 3 columns
Now we’re going to split the dataframe of training data into a train set and a test set. This is not strictly necessary to get the fine-tuning to run, but it’s good practice in machine learning to always split off a test set so that you have something to evaluate on once you’ve trained the model.
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df)
In order for the conversion to HuggingFace datasets to work, we just need to reset the indices in our two dataframes so that they are a sequence of integers again.
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)
Let’s check the size of our training and test set. By default Scikit-learn puts 25% of our data into the test set and 75% into the training set (although this behaviour may change in future so I would not rely on it).
len(df_train), len(df_test)
(1763, 588)
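If you want the split to be reproducible, or want to control its size explicitly, you can pass test_size and random_state to train_test_split. This was not done in the run above, so the exact rows in each split will differ from run to run:

# Explicit 25% test split with a fixed seed, so the split is reproducible
df_train, df_test = train_test_split(df, test_size=0.25, random_state=42)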
Now let’s import everything from the Pandas dataframes into HuggingFace Datasets, which is what we will need for the training:
from datasets import Dataset
dataset_train = Dataset.from_pandas(df_train)
Let’s take a look at our training dataset now that it’s in HuggingFace format:
dataset_train
Dataset({
    features: ['sentence1', 'sentence2', 'score'],
    num_rows: 1763
})
We can do the same with the test dataset:
dataset_test = Dataset.from_pandas(df_test)
Although there were a number of mental health models already on the HuggingFace Hub, we are going to start with a general-purpose model which produces embeddings in 768 dimensions: the sentence transformer all-mpnet-base-v2, which we will fine-tune.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")
Let’s check the vital statistics of our model:
model
SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
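Before fine-tuning anything, it can be useful to see how the off-the-shelf model scores one of our pairs, so we have a baseline to compare against later. This quick check is not part of the original training script:

from sentence_transformers import util

# Encode one pair from the dataset and compute its cosine similarity
embeddings = model.encode([
    "Feeling nervous, anxious or on edge?",
    "Have had changes in appetite or sleep?",
])
print(util.cos_sim(embeddings[0], embeddings[1]))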
Let’s import the training libraries. We’re going to use the cosine similarity loss function.
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import CosineSimilarityLoss
loss = CosineSimilarityLoss(model)
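Roughly speaking, CosineSimilarityLoss embeds both sentences in a pair, takes the cosine similarity of the two embeddings, and penalises the squared difference between that value and the annotated score. A toy sketch of the idea, with made-up embedding values rather than the library internals:

import torch

emb1 = torch.tensor([[0.1, 0.3, 0.5]])  # embedding of sentence1 (toy values)
emb2 = torch.tensor([[0.2, 0.1, 0.4]])  # embedding of sentence2 (toy values)
gold = torch.tensor([0.62])             # hand-annotated similarity score

predicted = torch.nn.functional.cosine_similarity(emb1, emb2)  # shape (1,)
loss_value = torch.nn.functional.mse_loss(predicted, gold)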
Let’s define our training hyperparameters. For the purposes of this demonstration we will train for only 3 epochs, but in practice you would set this much higher, monitor your training and test set loss, and stop training when the test loss stops improving. You can also experiment with other hyperparameters such as the learning rate and batch size.
trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(
        output_dir="checkpoints",
        num_train_epochs=3,
        per_device_eval_batch_size=16,
    ),
    train_dataset=dataset_train,
    loss=loss,
)
trainer
<sentence_transformers.trainer.SentenceTransformerTrainer at 0x7a081c90f810>
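As mentioned above, in a longer run you would also want to watch the loss on the held-out test set. The trainer supports this if you pass an eval dataset and evaluation settings, roughly like the sketch below (this is not what was run for the results that follow, and the epoch count is hypothetical):

args_with_eval = SentenceTransformerTrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=20,                 # hypothetical longer run
    eval_strategy="epoch",               # evaluate on the held-out set every epoch
    per_device_eval_batch_size=16,
)
trainer_with_eval = SentenceTransformerTrainer(
    model=model,
    args=args_with_eval,
    train_dataset=dataset_train,
    eval_dataset=dataset_test,
    loss=loss,
)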
The training takes about 10 minutes on my laptop without a GPU. You should see a progress bar in your Jupyter or Colab notebook:
trainer.train()
Step | Training Loss |
---|---|
500 | 0.059000 |
TrainOutput(global_step=663, training_loss=0.05342011286302569, metrics={'train_runtime': 553.4312, 'train_samples_per_second': 9.557, 'train_steps_per_second': 1.198, 'total_flos': 0.0, 'train_loss': 0.05342011286302569, 'epoch': 3.0})
model
SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
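If you want to keep the fine-tuned model for later use, for example to load it into Harmony or another application, you can save it to disk and reload it. The directory name here is just an example:

# Save the fine-tuned model to a local directory and load it back
model.save("mpnet-harmony-finetuned")
reloaded_model = SentenceTransformer("mpnet-harmony-finetuned")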
Now we can test the encoding by encoding a single sentence:
model.encode(["I feel nervous, anxious or afraid"])
array([[-9.93097015e-03, -7.88239613e-02, 6.90429413e-04, -5.47722541e-03, -7.27352547e-03, 5.51959611e-02, .... [abbreviated] ... 2.22559460e-02, -4.00685295e-02, -5.54708019e-02, 9.85449180e-03, -2.92167068e-02, -2.18849741e-02, -3.62964422e-02, -1.33359190e-02, -3.98904458e-02]], dtype=float32)
We can take both text columns of our test dataframe and embed them:
sentence_1_embeddings = model.encode(df_test.sentence1)
sentence_2_embeddings = model.encode(df_test.sentence2)
sentence_1_embeddings.shape
(588, 768)
Now we’re going to use a cosine similarity function. I’ve copy-pasted this from the Harmony matching function.
from numpy import dot, matmul, ndarray, matrix
from numpy.linalg import norm
import numpy as np
def cosine_similarity(vec1: ndarray, vec2: ndarray) -> ndarray:
    # Dot products between every row of vec1 and every row of vec2
    dp = dot(vec1, vec2.T)
    # Norms of each embedding, arranged so we can take an outer product
    m1 = matrix(norm(vec1, axis=1))
    m2 = matrix(norm(vec2.T, axis=0))
    # Divide each dot product by the product of the two norms
    return np.asarray(dp / matmul(m1.T, m2))
similarity_matrix = cosine_similarity(sentence_1_embeddings, sentence_2_embeddings)
similarity_matrix.shape
(588, 588)
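As an aside, sentence-transformers ships a helper that computes the same matrix, if you would rather not maintain your own cosine similarity function; it returns a torch tensor rather than a NumPy array:

from sentence_transformers import util

# Same (588, 588) matrix of pairwise cosine similarities, as a torch tensor
similarity_matrix_torch = util.cos_sim(sentence_1_embeddings, sentence_2_embeddings)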
df_test["y_pred"] = [similarity_matrix[i,i] for i in range(len(similarity_matrix))]
df_test
 | sentence1 | sentence2 | score | y_pred |
---|---|---|---|---|
0 | I sometimes jump quickly from one topic to ano... | People find my conversations to be confusing o... | 0.70 | 0.700950 |
1 | Trouble concentrating on things, such as readi... | Felt nervous or anxious? | 0.00 | 0.289797 |
2 | Loss of interest in activities that you used t... | My thoughts and behaviors are almost always di... | 0.70 | 0.197288 |
3 | Avoiding external reminders of the stressful e... | Some people can make me aware of them just by ... | 0.00 | 0.114830 |
4 | I sometimes jump quickly from one topic to ano... | I have trouble following conversations with ot... | 0.25 | 0.433558 |
... | ... | ... | ... | ... |
583 | I often ramble on too much when speaking. | I have had the momentary feeling that someone'... | 0.01 | 0.049073 |
584 | I sometimes jump quickly from one topic to ano... | Avoiding external reminders of the experience ... | 0.40 | 0.114422 |
585 | Being so restless that it is hard to sit still? | I believe that dreams have magical properties. | 0.37 | 0.066781 |
586 | Trouble relaxing? | Throughout my life, very few things have been ... | 0.00 | 0.244570 |
587 | I sometimes forget what I am trying to say. | Some people can make me aware of them just by ... | 0.00 | 0.031787 |
588 rows × 4 columns
df_test["residual"] = df_test.y_pred - df_test.score
df_test
 | sentence1 | sentence2 | score | y_pred | residual |
---|---|---|---|---|---|
0 | I sometimes jump quickly from one topic to ano... | People find my conversations to be confusing o... | 0.70 | 0.700950 | 0.000950 |
1 | Trouble concentrating on things, such as readi... | Felt nervous or anxious? | 0.00 | 0.289797 | 0.289797 |
2 | Loss of interest in activities that you used t... | My thoughts and behaviors are almost always di... | 0.70 | 0.197288 | -0.502712 |
3 | Avoiding external reminders of the stressful e... | Some people can make me aware of them just by ... | 0.00 | 0.114830 | 0.114830 |
4 | I sometimes jump quickly from one topic to ano... | I have trouble following conversations with ot... | 0.25 | 0.433558 | 0.183558 |
... | ... | ... | ... | ... | ... |
583 | I often ramble on too much when speaking. | I have had the momentary feeling that someone'... | 0.01 | 0.049073 | 0.039073 |
584 | I sometimes jump quickly from one topic to ano... | Avoiding external reminders of the experience ... | 0.40 | 0.114422 | -0.285578 |
585 | Being so restless that it is hard to sit still? | I believe that dreams have magical properties. | 0.37 | 0.066781 | -0.303219 |
586 | Trouble relaxing? | Throughout my life, very few things have been ... | 0.00 | 0.244570 | 0.244570 |
587 | I sometimes forget what I am trying to say. | Some people can make me aware of them just by ... | 0.00 | 0.031787 | 0.031787 |
588 rows × 5 columns
Finally, we can summarise how well the fine-tuned model matches the human annotations. The mean squared error of the predictions on the test set:
np.mean(df_test.residual * df_test.residual)
np.float64(0.06922512204080025)
And the mean absolute error:
np.mean(np.abs(df_test.residual))
np.float64(0.20813965150724928)
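Mean squared and mean absolute error are easy to interpret, but for similarity tasks it is also common to report the correlation between the predicted and annotated scores. A quick sketch using SciPy (assuming SciPy is installed, which it will be if scikit-learn is):

from scipy.stats import pearsonr, spearmanr

# Correlation between predicted similarities and the human-annotated scores
print(pearsonr(df_test.y_pred, df_test.score))
print(spearmanr(df_test.y_pred, df_test.score))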