Welcome to the comprehensive blog on solving Natural Language Processing (NLP) tasks using Python. This guide covers a wide range of practical NLP tasks and provides clear solutions in Python. You will find code snippets throughout the blog to help you kickstart your NLP projects right away.
This blog is aimed at anyone building Natural Language Processing applications in Python.
Python is the language of choice for NLP, thanks to its extensive community support and a plethora of machine learning libraries such as TensorFlow, scikit-learn, PyTorch, pandas, spaCy, and NLTK. Its simplicity makes it an ideal entry point into the field of data science.
The blog explores the full spectrum of NLP tasks, providing practical insights and code samples to tackle challenges effectively.
NLP powers a wide range of applications, many of which we use every day, such as search engines, voice assistants, machine translation, spam filtering, and chatbots.
There are several fundamental tasks that appear frequently while solving various kinds of NLP projects. This section explores these fundamental tasks in brief.
Natural Language Processing is challenging for several reasons, most notably the ambiguity of human language, its heavy dependence on context and world knowledge, creative usage such as idioms, sarcasm, and slang, and the sheer diversity of languages, dialects, and domains.
While building NLP projects we usually have to pre-process the text before feeding the data to machine learning models. Text pre-processing is a crucial step, as it ensures the data is clean and consistent. Below are the main steps involved in text pre-processing:
Text encoding: It is common to have strings containing emojis, symbols or other graphic characters. To process such text reliably, the characters need to be converted into a binary representation using an encoding scheme such as UTF-8.
text = "I like to play 🏓"
encoded_text = text.encode("utf-8")
print(encoded_text)
OUTPUT: b'I like to play \xf0\x9f\x8f\x93'
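If you later need the original string back, the same encoding scheme decodes the bytes; a tiny round-trip continuing the snippet above:
# decode() reverses encode(): the UTF-8 bytes become the original string again
decoded_text = encoded_text.decode("utf-8")
print(decoded_text)
OUTPUT: I like to play 🏓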
Spelling correction: While dealing with language it is common for the data to be noisy and contain spelling mistakes. This is especially prevalent in social media data and web search queries. Below we show how you can leverage symspellpy to correct spelling mistakes and get a cleaner version of your text.
import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

input_term = "hello wrld I am lerning naturl languge procesing in pthon"
# max edit distance per lookup (per single word, not per whole input string)
suggestions = sym_spell.lookup_compound(input_term, max_edit_distance=2)
# display suggestion term, edit distance, and term frequency
for suggestion in suggestions:
    print(suggestion)
OUTPUT: hello world i am learning natural language processing in python, 8, 0
Sentence segmentation: While building NLP systems, typically text is broken down into sentences and tokens (words). Sentence segmentation breaks down text into sentences based on full stops and question marks. We can use the NLTK library for this purpose.
from nltk.tokenize import sent_tokenize
text = """While building NLP systems, typically text is broken down into sentences and tokens (words).
Sentence segmentation breaks down text into sentences based on full stops and question marks.
We can use the nltk library for this purpose. It may be easy to split words based on punctuations.
However, we will face issues while splitting words that contain Dr. or Mr. or Mrs.
Let us see how sentence segmentation can handle these cases.
"""
sentences = sent_tokenize(text)
for sentence in sentences:
    print(sentence)
OUTPUT: While building NLP systems, typically text is broken down into sentences and tokens (words).
Sentence segmentation breaks down text into sentences based on full stops and question marks.
We can use the nltk library for this purpose.
It may be easy to split words based on punctuations.
However, we will face issues while splitting words that contain Dr. or Mr. or Mrs.
Let us see how sentence segmentation can handle these cases.
Word tokenization: Just as sentence segmentation breaks text into sentences, word tokenization breaks text into individual words (tokens). We can use the NLTK library for this purpose.
from nltk.tokenize import sent_tokenize, word_tokenize
text = """While building NLP systems, typically text is broken down into sentences and tokens (words).
Sentence segmentation breaks down text into sentences based on full stops and question marks.
We can use the nltk library for this purpose. It may be easy to split words based on punctuations.
However, we will face issues while splitting words that contain Dr. or Mr. or Mrs.
Let us see how sentence segmentation can handle these cases.
"""
sentences = sent_tokenize(text)
for sentence in sentences:
    print(word_tokenize(sentence))
OUTPUT:
['While', 'building', 'NLP', 'systems', ',', 'typically', 'text', 'is', 'broken', 'down', 'into', 'sentences', 'and', 'tokens', '(', 'words', ')', '.']
['Sentence', 'segmentation', 'breaks', 'down', 'text', 'into', 'sentences', 'based', 'on', 'full', 'stops', 'and', 'question', 'marks', '.']
['We', 'can', 'use', 'the', 'nltk', 'library', 'for', 'this', 'purpose', '.']
['It', 'may', 'be', 'easy', 'to', 'split', 'words', 'based', 'on', 'punctuations', '.']
['However', ',', 'we', 'will', 'face', 'issues', 'while', 'splitting', 'words', 'that', 'contain', 'Dr.', 'or', 'Mr.', 'or', 'Mrs.', 'Let', 'us', 'see', 'how', 'sentence', 'segmentation', 'can', 'handle', 'these', 'cases', '.']
Stop-words removal: In most NLP applications we need to extract useful information from a piece of text. Frequently used words such as a, an, the, of, in, etc., do not add much value to the meaning of a sentence. Such words are known as stop-words and are typically removed in the preprocessing step.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stopwords_list = set(stopwords.words('english'))
text = "hello this sentence consists of multiple stopwords"
cleaned_text = " ".join(token for token in word_tokenize(text) if token not in stopwords_list)
print(cleaned_text)
OUTPUT: hello sentence consists multiple stopwords
Stemming and lemmatization: Both techniques reduce words to a common base form. Stemming chops off word endings using simple heuristic rules, while lemmatization uses a vocabulary and the word's part of speech to return its dictionary form (lemma).
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
print(stemmer.stem("dollars"), lemmatizer.lemmatize("happening", pos="v"))
OUTPUT: dollar happen
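To get a feel for how the two differ, here is a small sketch (the word list is chosen purely for illustration; the lemmatizer needs the NLTK WordNet data, e.g. via nltk.download('wordnet')) comparing both on a few words:
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# the stemmer applies suffix-stripping rules; the lemmatizer looks words up in WordNet
for word in ["studies", "running", "feet", "better"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))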
Lowercasing: Lowercasing text helps in normalizing the text data. By converting all text to lowercase, you ensure that the same word in different cases (e.g., “apple” and “Apple”) is treated as identical. This reduces the vocabulary size and helps NLP models focus on the meaning of words rather than their capitalization.
text = "HELLO THIS SENTENCE IS IN UPPERCASE"
print(text.lower())
OUTPUT: hello this sentence is in uppercase
Punctuation removal: Punctuation marks, such as periods, commas, exclamation points, and question marks, add complexity to text. By removing them, you simplify the text and reduce the dimensionality, making it easier for NLP models to process and analyze.
import re
# note: the hyphen is placed at the end of the character class so it is treated literally
PUNCTUATION_REGEX = r'[,#<>"!=&.@|\[\]\':)(;?-]'
text = "this piece of text contains..... multiple punctuations!!!???@@@"
print(re.sub(PUNCTUATION_REGEX, "", text))
OUTPUT: this piece of text contains multiple punctuations
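A simpler alternative, sketched below, is Python's built-in str.translate together with string.punctuation, which strips every ASCII punctuation character:
import string
text = "this piece of text contains..... multiple punctuations!!!???@@@"
# str.maketrans maps every character in string.punctuation to None, so translate() drops them
print(text.translate(str.maketrans("", "", string.punctuation)))
OUTPUT: this piece of text contains multiple punctuations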
Part of Speech (POS) tagging: In tasks like named entity recognition and information extraction, POS tags can help identify relevant entities and relationships. For example, recognizing noun phrases (e.g., “New York City”) is easier with POS tagging.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
text = "Joe Biden is the president of United States of America"
print(pos_tag(word_tokenize(text)))
OUTPUT: [('Joe', 'NNP'), ('Biden', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('president', 'NN'), ('of', 'IN'), ('United', 'NNP'), ('States', 'NNPS'), ('of', 'IN'), ('America', 'NNP')]
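To illustrate the noun-phrase point, here is a rough sketch (the chunk grammar is our own illustrative pattern, not a standard one) that groups consecutive proper nouns using NLTK's RegexpParser:
from nltk import RegexpParser
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
text = "Joe Biden is the president of United States of America"
# group one or more consecutive proper nouns (NNP/NNPS) into an NP chunk
chunk_parser = RegexpParser("NP: {<NNP.*>+}")
tree = chunk_parser.parse(pos_tag(word_tokenize(text)))
for subtree in tree.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))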
In order for machine learning algorithms to process text, the data needs to be converted into numerical form. This conversion of raw text into numbers is called text representation.
Basic vectorization: The simplest representations give each vocabulary word its own dimension. One-hot encoding marks which vocabulary entry each word in the input corresponds to, bag of words counts how often each vocabulary term appears in a document, and TF-IDF re-weights those counts so that terms appearing in many documents contribute less.
def get_one_hot_encoding(text: str, vocab: dict):
    one_hot_encoded = list()
    for word in text.split():
        temp = [0] * len(vocab)
        if word in vocab:
            # vocabulary ids start at 1, so shift by one for the list index
            temp[vocab[word] - 1] = 1
        one_hot_encoded.append(temp)
    return one_hot_encoded
text = "hello this is ryan"
vocab = {"hello": 1, "this": 2, "is": 3, "ryan": 4, "greg": 5, "harry": 6, "john": 7}
print(get_one_hot_encoding(text, vocab))
OUTPUT: [[1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0]]
Bag of words with scikit-learn's CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer()
corpus = ["hello my name is ryan",
"we are studying natural language processing in python",
"trying out bag of words text representation",
"repeating some some words here"]
bow_rep = count_vec.fit_transform(corpus)
print(bow_rep[3].toarray())
OUTPUT: array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1]])
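Each column of the bag-of-words vector corresponds to one vocabulary term; a quick sketch to see that mapping (get_feature_names_out requires scikit-learn 1.0 or newer):
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["hello my name is ryan",
          "we are studying natural language processing in python",
          "trying out bag of words text representation",
          "repeating some some words here"]
count_vec = CountVectorizer()
bow_rep = count_vec.fit_transform(corpus)
# the i-th entry of a document vector counts the i-th term listed here
print(count_vec.get_feature_names_out())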
TF-IDF with scikit-learn's TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
corpus = ["hello my name is ryan",
"we are studying natural language processing in python",
"trying out bag of words text representation",
"repeating some some words here"]
tfidf_rep = tfidf_vec.fit_transform(corpus)
print(tfidf_rep[1].toarray())
OUTPUT: array([[0.35355339, 0. , 0. , 0. , 0.35355339,
0. , 0.35355339, 0. , 0. , 0.35355339,
0. , 0. , 0.35355339, 0.35355339, 0. ,
0. , 0. , 0. , 0.35355339, 0. ,
0. , 0.35355339, 0. ]])
Distributed representations: Instead of sparse counts, words and sentences are mapped to dense vectors (embeddings) in which semantically similar items end up close together. Below we look at pretrained word2vec vectors via gensim, token vectors via spaCy, and sentence embeddings via sentence-transformers.
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")
vec_king = wv['king']
print(vec_king)
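Once the vectors are loaded, similarity queries show why distributed representations are useful; a short sketch reusing the same pretrained model (note that it is a large download):
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
# words whose vectors are closest to the vector for "king"
print(wv.most_similar('king', topn=5))
# cosine similarity between two related words
print(wv.similarity('car', 'truck'))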
spaCy also exposes a vector for every token in a processed document:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
print(doc[0].vector)
For sentence-level embeddings, the sentence-transformers library provides pretrained models:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
# Sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of strings.',
             'The quick brown fox jumps over the lazy dog.']
# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)
# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
Text classification: One of the most common NLP tasks is assigning a predefined label to a piece of text. Below we train a simple Naive Bayes classifier on the 20 Newsgroups dataset using bag-of-words features.
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
# Load the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
X = newsgroups.data
y = newsgroups.target
# obtain train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
count_vec = CountVectorizer()
X_train_vec = count_vec.fit_transform(X_train)
X_test_vec = count_vec.transform(X_test)
# train the model
nb_model = MultinomialNB()
nb_model.fit(X_train_vec, y_train)
# obtain predictions
y_pred = nb_model.predict(X_test_vec)
print(classification_report(y_test, y_pred))
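To classify brand-new text, the same vectorizer must be applied before prediction; one convenient option, sketched here (the example sentence is made up), is to bundle both steps in a scikit-learn Pipeline:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
# the pipeline vectorizes incoming text with the fitted vocabulary, then classifies it
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(newsgroups.data, newsgroups.target)
prediction = text_clf.predict(["The rocket launch was delayed due to bad weather."])
print(newsgroups.target_names[prediction[0]])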
Zero-shot classification: Pretrained language models can also assign labels they were never explicitly trained on, which is useful when no labeled data is available.
from transformers import pipeline
# Load the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# Define candidate labels or classes
labels = ["Sports", "Politics", "Entertainment", "Technology"]
# Input text for classification
text = "Apple announced the launch of a new product yesterday."
# Perform zero-shot classification
result = classifier(text, labels)
# Print the results
print(result)
OUTPUT: {'sequence': 'Apple announced the launch of a new product yesterday.',
'labels': ['Technology', 'Entertainment', 'Sports', 'Politics'],
'scores': [0.8737325072288513, 0.06840094178915024, 0.03353743255138397, 0.024329133331775665]}
Information extraction: The task of extracting structured, relevant information from a piece of text. Below we provide practical examples of two fundamental sub-tasks of information extraction, named-entity recognition and keyphrase extraction, and show how they can be solved in Python.
import spacy
nlp = spacy.load("en_core_web_sm")
text = """In the bustling city of New York, the Statue of Liberty, a symbol of freedom,
stands tall on Liberty Island, welcoming tourists from around the world.
Nearby, the New York Stock Exchange, the epicenter of global finance, buzzes with traders buying and selling shares.
Meanwhile, Central Park, a serene oasis in the heart of Manhattan, offers a peaceful escape for residents and visitors alike."""
doc = nlp(text)
for entity in doc.ents:
    print(entity.text, entity.label_)
OUTPUT: New York GPE
the Statue of Liberty GPE
Liberty Island LOC
the New York Stock Exchange ORG
Central Park GPE
Manhattan GPE
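spaCy can also highlight the recognized entities inline, which is handy when exploring a document in a Jupyter notebook; a small sketch (outside a notebook, use displacy.serve instead of render):
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("In the bustling city of New York, the Statue of Liberty stands tall on Liberty Island.")
# renders the text with each entity span colour-coded by its label
displacy.render(doc, style="ent", jupyter=True)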
Keyphrase extraction with KeyBERT, which ranks candidate phrases by the similarity of their BERT embeddings to the document embedding:
from keybert import KeyBERT
doc = """
Learning natural language processing (NLP) in Python is an exciting journey into the world of human language and machine understanding.
Python offers a rich ecosystem of libraries and tools that make NLP accessible to both beginners and seasoned developers.
Starting with fundamental concepts like tokenization, stemming, and lemmatization, you'll progress to more advanced topics such as sentiment analysis,
part-of-speech tagging, and named entity recognition.
NLP libraries like NLTK, spaCy, and the Transformers library by Hugging Face provide pre-trained models
and a wealth of resources to accelerate your learning. You'll explore techniques for text classification, machine translation,
chatbots, and more, while gaining a deeper understanding of the complexities of human language.
NLP in Python opens the door to countless applications, from social media sentiment analysis to language translation services,
making it a valuable skill for those eager to delve into the realm of artificial intelligence and linguistics.
"""
kw_model = KeyBERT()
# extract two- and three-word keyphrases without removing stop words
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(2, 3), stop_words=None)
print(keywords)
OUTPUT: [('nlp in python', 0.7352),
('nlp libraries like', 0.647),
('nlp libraries', 0.6246),
('libraries like nltk', 0.6026),
('human language nlp', 0.5924)]
Information retrieval: The task of retrieving relevant information from a collection of documents in response to an input query, and more broadly of making sense of large text corpora. Below we provide practical examples of two related tasks, topic modeling and text summarization, and show how they can be solved in Python.
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
# Sample documents
documents = [
"Machine learning is a subset of artificial intelligence.",
"Natural language processing is a field within AI.",
"LDA is a popular topic modeling technique.",
"Gensim is a library for NLP tasks.",
"Artificial intelligence is transforming industries.",
]
# Tokenize the documents
tokenized_docs = [doc.split() for doc in documents]
# Create a dictionary of terms
dictionary = corpora.Dictionary(tokenized_docs)
# Create a bag of words (BoW) representation of the documents
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
# Train an LDA model
lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)
# Classify a new document
new_doc = "AI and machine learning are revolutionizing industries."
new_doc_bow = dictionary.doc2bow(new_doc.split())
topic_distribution = lda_model.get_document_topics(new_doc_bow)
print("Topic distribution for the new document:", topic_distribution)
OUTPUT: Topic distribution for the new document: [(0, 0.17182167), (1, 0.8281783)]
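To see what the two topics actually contain, the highest-weighted words per topic can be printed; this short sketch continues from the lda_model trained above:
# top words that characterise each of the two learned topics
for topic_id, topic_terms in lda_model.print_topics(num_topics=2, num_words=5):
    print(topic_id, topic_terms)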
Text summarization: A pretrained sequence-to-sequence model can condense a long passage into a short summary.
from transformers import pipeline
text = """
In recent years, advances in artificial intelligence (AI) and machine learning have transformed various industries.
Businesses are leveraging these technologies for automation, data analysis, and decision-making.
In the healthcare sector, AI is being used to improve diagnostic accuracy and streamline patient care.
Meanwhile, e-commerce companies are utilizing machine learning to personalize recommendations and enhance the shopping experience.
Furthermore, AI-driven chatbots are revolutionizing customer support by providing instant assistance.
As the field of AI continues to evolve, it's essential for professionals to stay updated with the latest trends and developments in this ever-changing landscape.
"""
# facebook/bart-large-cnn is a summarization checkpoint (bart-large-mnli is an NLI model used for zero-shot classification, not summarization)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
print(summarizer(text, max_length=40, min_length=10, do_sample=False))
OUTPUT: [{'summary_text': " As the field of AI continues to evolve, it's essential for professionals to stay updated with the latest trends and developments in this ever-changing landscape"}]
This blog offers an overview of Natural Language Processing in Python, covering several fundamental concepts and techniques, including text preprocessing, tokenization, stemming and lemmatization, part-of-speech tagging, named-entity recognition, text classification, and topic modeling. It is a good starting point for getting into the field of NLP and a natural stepping stone toward more advanced topics.
Keep in mind that Natural Language Processing (NLP) is an expansive and rapidly evolving domain. The secret to becoming proficient in NLP lies in the pursuit of knowledge and a willingness to explore fresh concepts and approaches. Stay curious and keep learning.