“Fine-tuning” means adapting an existing machine learning model to a specific task or use case. In this post I’m going to walk you through how you can fine-tune a large language model for sentence similarity using some hand-annotated training data. This example is in the psychology domain.
You need training data consisting of pairs of sentences, together with a “ground truth” score for how similar you want each pair to come out when you train your custom sentence similarity model.
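For illustration, here’s what a couple of rows in that format look like as a Pandas dataframe (the two example pairs are taken from the dataset we’ll download below):

import pandas as pd
# Each row: a pair of questionnaire items plus a human-annotated
# similarity score between 0 and 1
example = pd.DataFrame({
    "sentence1": ["Feeling nervous, anxious or on edge?",
                  "Being so restless that it is hard to sit still?"],
    "sentence2": ["Have had changes in appetite or sleep?",
                  "I believe that dreams have magical properties."],
    "score": [0.16, 0.37],
})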
We are going to train this model for Harmony, a web-based tool which allows researchers in social science fields such as psychology to compare questionnaire items and see the semantic similarity between those items as a percentage value. Harmony originally used a standard multilingual LLM from HuggingFace as its underlying model, but this general-purpose model had some shortcomings, so we decided to fine-tune a psychology domain model.
This notebook is adapted from the scripts for the Harmony/DOXA AI competition (fine-tune an LLM for the psychology domain), but it’s general and can be applied to other domains. Credit to Jeremy Lo Ying Ping for the training code.
First I’m going to install and import all the libraries which we’re going to use: Pandas, and the HuggingFace libraries.
# !pip install pandas==2.2.2 transformers==4.43.1 sentence-transformers[train]==3.0.1
import pandas as pd
Now we can download the training data with wget.
!wget https://naturallanguageprocessing.com/harmony-matching-training-data.csv.zip
Now that we’ve got our training data, we can have a look inside:
df = pd.read_csv("harmony-matching-training-data.csv.zip")
df
| | sentence1 | sentence2 | score |
|---|---|---|---|
| 0 | Do you believe in telepathy (mind-reading)? | I believe that there are secret signs in the w... | 0.15 |
| 1 | Irritable behavior, angry outbursts, or acting... | Felt “on edge”? | 0.62 |
| 2 | I have some eccentric (odd) habits. | I often have difficulty following what someone... | 0.00 |
| 3 | Do you often feel nervous when you are in a gr... | Been easily annoyed by different things? | 0.00 |
| 4 | Do you believe in telepathy (mind-reading)? | Most of the time I find it is very difficult t... | 0.26 |
| ... | ... | ... | ... |
| 2346 | Little interest or pleasure in doing things | At times I have wondered if my body was really... | 0.00 |
| 2347 | Feeling down, depressed, or hopeless? | I find that I am very often confused about wha... | 0.00 |
| 2348 | Not being able to stop or control worrying? | If given the choice, I would much rather be wi... | 0.16 |
| 2349 | Feeling nervous, anxious or on edge? | Have had changes in appetite or sleep? | 0.16 |
| 2350 | Do you believe in clairvoyance (psychic forces... | Felt hopeless? | 0.20 |
2351 rows × 3 columns
Now we’re going to split this into a train set and a test set.
from sklearn.model_selection import train_test_split
# Hold out a test set (train_test_split defaults to a 75%/25% split)
df_train, df_test = train_test_split(df)
# Re-number the rows of each split from zero, discarding the old index
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
len(df_train), len(df_test)
(1763, 588)
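Note that train_test_split shuffles randomly, so your split will differ from mine. If you want a reproducible split, you can pass a seed (a sketch; I haven’t fixed one here):

# Fixing random_state makes the split reproducible across runs
df_train, df_test = train_test_split(df, test_size=0.25, random_state=42)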
Now let’s convert our Pandas dataframes into HuggingFace Datasets, which is what we’ll need for the training:
from datasets import Dataset
dataset_train = Dataset.from_pandas(df_train)
dataset_train
Dataset({
features: ['sentence1', 'sentence2', 'score'],
num_rows: 1763
})
dataset_test = Dataset.from_pandas(df_test)
We’re going to start from the Sentence Transformers model all-mpnet-base-v2 and fine-tune it.
Although there are a number of mental health models on the HuggingFace Hub, we are going to start with a generalist model, which produces embeddings in 768 dimensions.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")
model
SentenceTransformer(
(0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
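Before fine-tuning, we can sanity-check what the off-the-shelf model thinks about one of our annotated pairs (a quick sketch; in sentence-transformers 3.x, similarity() computes cosine similarity by default):

# Embed one pair from the dataset and compare the two embeddings
embeddings = model.encode([
    "Feeling nervous, anxious or on edge?",
    "Have had changes in appetite or sleep?",
])
model.similarity(embeddings[0:1], embeddings[1:2])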
Let’s import the training libraries. We’re going to use the cosine similarity loss function.
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import CosineSimilarityLoss
loss = CosineSimilarityLoss(model)
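Conceptually, CosineSimilarityLoss embeds both sentences, takes the cosine similarity of the two embeddings, and penalises its squared distance from the annotated score. Here is a simplified sketch of the idea (not the library’s actual implementation):

import torch
import torch.nn.functional as F

def cosine_similarity_loss_sketch(emb1: torch.Tensor, emb2: torch.Tensor,
                                  gold: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between each pair of sentence embeddings...
    predicted = F.cosine_similarity(emb1, emb2, dim=-1)
    # ...compared to the human-annotated score with mean squared error
    return F.mse_loss(predicted, gold)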
Let’s define our training hyperparameters:
trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(
        output_dir="checkpoints",
        num_train_epochs=3,
        per_device_eval_batch_size=16,
    ),
    train_dataset=dataset_train,
    loss=loss,
)
trainer
<sentence_transformers.trainer.SentenceTransformerTrainer at 0x7a081c90f810>
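I haven’t passed an eval dataset to the trainer here. If you want evaluation metrics during or after training, sentence-transformers ships an evaluator that reports how well the cosine similarities correlate with the gold scores; a sketch using our held-out split:

from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=df_test.sentence1.tolist(),
    sentences2=df_test.sentence2.tolist(),
    scores=df_test.score.tolist(),
    name="harmony-test",  # hypothetical label, used only in the output
)
evaluator(model)  # can also be passed to SentenceTransformerTrainer(evaluator=...)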
The training took about 10 minutes on my laptop without a GPU.
trainer.train()
| Step | Training Loss |
|---|---|
| 500 | 0.059000 |
TrainOutput(global_step=663, training_loss=0.05342011286302569, metrics={'train_runtime': 553.4312, 'train_samples_per_second': 9.557, 'train_steps_per_second': 1.198, 'total_flos': 0.0, 'train_loss': 0.05342011286302569, 'epoch': 3.0})
model
SentenceTransformer(
(0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
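If you want to reuse the fine-tuned model later, you can write it to disk and load it back like any other Sentence Transformers model (the folder name here is just an example):

# Save the fine-tuned weights, tokenizer and config
model.save("harmony-mpnet-finetuned")
# Reload it later
reloaded_model = SentenceTransformer("harmony-mpnet-finetuned")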
Now we can test the encoding by encoding a single sentence:
model.encode(["I feel nervous, anxious or afraid"])
array([[-9.93097015e-03, -7.88239613e-02, 6.90429413e-04,
-5.47722541e-03, -7.27352547e-03, 5.51959611e-02,
.... [abbreviated] ...
2.22559460e-02, -4.00685295e-02, -5.54708019e-02,
9.85449180e-03, -2.92167068e-02, -2.18849741e-02,
-3.62964422e-02, -1.33359190e-02, -3.98904458e-02]], dtype=float32)
Next, we can embed both of the text columns in our test dataframe:
sentence_1_embeddings = model.encode(df_test.sentence1)
sentence_2_embeddings = model.encode(df_test.sentence2)
sentence_1_embeddings.shape
(588, 768)
Now we’re going to use a cosine similarity function. I’ve just copied and pasted this from the Harmony matching function.
from numpy import dot, matmul, ndarray, matrix
from numpy.linalg import norm
import numpy as np
def cosine_similarity(vec1: ndarray, vec2: ndarray) -> ndarray:
    # Dot product between every row of vec1 and every row of vec2
    dp = dot(vec1, vec2.T)
    # Norms of each embedding vector
    m1 = matrix(norm(vec1, axis=1))
    m2 = matrix(norm(vec2.T, axis=0))
    # Divide the dot products by the product of the norms
    return np.asarray(dp / matmul(m1.T, m2))
similarity_matrix = cosine_similarity(sentence_1_embeddings, sentence_2_embeddings)
similarity_matrix.shape
(588, 588)
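As an aside: because the model ends with a Normalize() module, the embeddings are already unit-length, and sentence-transformers provides a built-in helper that computes the same matrix (a sketch, not what Harmony uses internally):

from sentence_transformers import util
# Returns a torch tensor of pairwise cosine similarities, shape (588, 588)
similarities = util.cos_sim(sentence_1_embeddings, sentence_2_embeddings)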
df_test["y_pred"] = [similarity_matrix[i,i] for i in range(len(similarity_matrix))]
df_test
| | sentence1 | sentence2 | score | y_pred |
|---|---|---|---|---|
| 0 | I sometimes jump quickly from one topic to ano... | People find my conversations to be confusing o... | 0.70 | 0.700950 |
| 1 | Trouble concentrating on things, such as readi... | Felt nervous or anxious? | 0.00 | 0.289797 |
| 2 | Loss of interest in activities that you used t... | My thoughts and behaviors are almost always di... | 0.70 | 0.197288 |
| 3 | Avoiding external reminders of the stressful e... | Some people can make me aware of them just by ... | 0.00 | 0.114830 |
| 4 | I sometimes jump quickly from one topic to ano... | I have trouble following conversations with ot... | 0.25 | 0.433558 |
| ... | ... | ... | ... | ... |
| 583 | I often ramble on too much when speaking. | I have had the momentary feeling that someone'... | 0.01 | 0.049073 |
| 584 | I sometimes jump quickly from one topic to ano... | Avoiding external reminders of the experience ... | 0.40 | 0.114422 |
| 585 | Being so restless that it is hard to sit still? | I believe that dreams have magical properties. | 0.37 | 0.066781 |
| 586 | Trouble relaxing? | Throughout my life, very few things have been ... | 0.00 | 0.244570 |
| 587 | I sometimes forget what I am trying to say. | Some people can make me aware of them just by ... | 0.00 | 0.031787 |
588 rows × 4 columns
df_test["residual"] = df_test.y_pred - df_test.score
df_test
| | sentence1 | sentence2 | score | y_pred | residual |
|---|---|---|---|---|---|
| 0 | I sometimes jump quickly from one topic to ano... | People find my conversations to be confusing o... | 0.70 | 0.700950 | 0.000950 |
| 1 | Trouble concentrating on things, such as readi... | Felt nervous or anxious? | 0.00 | 0.289797 | 0.289797 |
| 2 | Loss of interest in activities that you used t... | My thoughts and behaviors are almost always di... | 0.70 | 0.197288 | -0.502712 |
| 3 | Avoiding external reminders of the stressful e... | Some people can make me aware of them just by ... | 0.00 | 0.114830 | 0.114830 |
| 4 | I sometimes jump quickly from one topic to ano... | I have trouble following conversations with ot... | 0.25 | 0.433558 | 0.183558 |
| ... | ... | ... | ... | ... | ... |
| 583 | I often ramble on too much when speaking. | I have had the momentary feeling that someone'... | 0.01 | 0.049073 | 0.039073 |
| 584 | I sometimes jump quickly from one topic to ano... | Avoiding external reminders of the experience ... | 0.40 | 0.114422 | -0.285578 |
| 585 | Being so restless that it is hard to sit still? | I believe that dreams have magical properties. | 0.37 | 0.066781 | -0.303219 |
| 586 | Trouble relaxing? | Throughout my life, very few things have been ... | 0.00 | 0.244570 | 0.244570 |
| 587 | I sometimes forget what I am trying to say. | Some people can make me aware of them just by ... | 0.00 | 0.031787 | 0.031787 |
588 rows × 5 columns
The mean squared error of the predictions against the human-annotated scores:
np.mean(df_test.residual * df_test.residual)
np.float64(0.06922512204080025)
And the mean absolute error:
np.mean(np.abs(df_test.residual))
np.float64(0.20813965150724928)
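Since semantic textual similarity models are usually benchmarked by how well their predictions correlate with human judgements, it’s also worth computing Pearson and Spearman correlations (a sketch using scipy, which isn’t used elsewhere in this post):

from scipy.stats import pearsonr, spearmanr
# Correlation between predicted similarities and the annotated scores
pearsonr(df_test.score, df_test.y_pred)
spearmanr(df_test.score, df_test.y_pred)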