Natural Language Processing in Python

Welcome to the comprehensive blog on solving Natural Language Processing (NLP) tasks using Python. This guide covers a wide range of practical NLP tasks and provides clear solutions in Python. You will find code snippets throughout the blog to help you kickstart your NLP projects right away.

Who is This Blog For?

This blog caters to individuals engaged in the development of Natural Language Processing applications using Python.

Why Choose Python for NLP?

Python is the language of choice for NLP, thanks to its extensive community support and a plethora of machine learning libraries such as TensorFlow, scikit-learn, PyTorch, pandas, spaCy, and NLTK. Its simplicity makes it an ideal entry point into the field of data science.

What’s Covered in This Blog?

The blog explores the full spectrum of NLP tasks, providing practical insights and code samples to tackle challenges effectively.

TL;DR

The blog explores using Python for Natural Language Processing (NLP) tasks. It’s for those building NLP applications. Python’s community support, ML libraries (like TensorFlow, scikit-learn, PyTorch, pandas, spaCy, NLTK), and ease of learning make it ideal for NLP. The blog covers a wide range of practical NLP tasks, offering code snippets for quick starts.

NLP in the real world

NLP is used in a wide range of applications with some of them being used in our daily lives such as:

Search engines: Google, Bing
Email applications: like Gmail, Outlook; etc
Voice based assistants: like Siri, Alexa, Google Assistant
ChatGPT: Using natural language understanding and text generation for answering questions

Different tasks associated with NLP

There are several fundamental tasks that appear frequently while solving various kinds NLP projects. This section explores these fundamental tasks in brief.

Language modeling: Task of predicting what the next word in a sentence will be based on the previous words. The objective of this task is to learn the probability of a sequence of words appearing in a given language. Language modeling is used in applications such as spelling correction, natural language understanding, machine translation.
Text classification: Task of assigning a label/class to a piece of text based on the content. This is the most popular application of Natural Language Processing. Some examples being: sentiment analysis, spam identification.
Information extraction: Task of extracting relevant information from a piece of text.
Information retrieval: Task of retrieving information based on an input query. Applications include search.
Topic modeling: Task of identifying the words from a large collection of text that best represents the collection.
Machine translation: Task of translating text from one language to another.

Why is NLP challenging?

Natural Language Processing is challenging for several reasons. The most popular reasons being:

Ambiguity: One of the biggest challenges in solving Natural Language processing is dealing with ambiguity. Most human languages are inherently ambiguous. For example: The word apple could refer to a fruit or the electronics company apple.
Diversity of languages: There are a ton of languages in the world and there is no direct mapping between the vocabularies of any 2 languages. A solution that works for one language may not work for another language.
Code mixing: Code mixing is where two or more languages are used in a conversation. This is common in multilingual environments. This kind of language is quite popular in social media and poses a challenge while build NLP solutions.

A primer for Text Pre-processing

While building NLP projects we usually have to pre-process the text before feeding the data to Machine Learning models. Text pre-processing is a crucial step as it ensures data is of high quality and accuracy. Below we have listed down the various steps that are involved in text pre-processing:

Text encoding: It is common to have strings containing emojis, symbols or other graphic characters. To process these symbols and special characters they need to be converted into a binary representation. In order to do so, we need to convert them to an encoding scheme.
```
text = "I like to play 🏓"
encoded_text = text.encode("utf-8")
print(encoded_text)

OUTPUT: b'I like to play \xf0\x9f\x8f\x93'
```

Spelling correction: While dealing with language it is common for the data to be noisy and contain some spelling mistakes. This can be prevalent in social media data and search queries on the web. Below we show how you can leverage Symspellpy to deal with spelling mistakes and get a cleaner version of your text

import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)

sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

input_term = (
"hello wrld I am lerning naturl languge procesing in pthon"
)

# max edit distance per lookup (per single word, not per whole input string)
suggestions = sym_spell.lookup_compound(input_term, max_edit_distance=2)
# display suggestion term, edit distance, and term frequency
for suggestion in suggestions:
    print(suggestion)

OUTPUT: hello world i am learning natural language processing in python, 8, 0

Sentence segmentation: While building NLP systems, typically text is broken down into sentences and tokens (words). Sentence segmentation breaks down text into sentences based on full stops and question marks. We can use the NLTK library for this purpose.

from nltk.tokenize import sent_tokenize

text = """While building NLP systems, typically text is broken down into sentences and tokens (words).
Sentence segmentation breaks down text into sentences based on full stops and question marks.
We can use the nltk library for this purpose. It may be easy to split words based on punctutations.
However, we will face issues while splitting words that contain Dr. or Mr. or Mrs.
Let us see how sentence segmentation can handle these cases.
"""

sentences = sent_tokenize(text)
for sentence in sentences:
    print(sentence)

OUTPUT: While building NLP systems, typically text is broken down into sentences and tokens (words).
        Sentence segmentation breaks down text into sentences based on full stops and question marks.
        We can use the nltk library for this purpose.
        It may be easy to split words based on punctutations.
        However, we will face issues while splitting words that contain Dr. or Mr. or Mrs.
        Let us see how sentence segmentation can handle these cases

Word tokenization: Just like sentence tokenization is breaking down text into sentences, word tokenization is breaking down text into words. We can use the NLTK library for this purpose.

    from nltk.tokenize import sent_tokenize, word_tokenize
text = """While building NLP systems, typically text is broken down into sentences and tokens (words).
Sentence segmentation breaks down text into sentences based on full stops and question marks.
We can use the nltk library for this purpose. It may be easy to split words based on punctutations.
However, we will face issues while splitting words that contain Dr. or Mr. or Mrs.
Let us see how sentence segmentation can handle these cases.
"""

sentences = sent_tokenize(text)
for sentence in sentences:
    word_tokenize(sentence)

OUTPUT:
['While', 'building', 'NLP', 'systems', ',', 'typically', 'text', 'is', 'broken', 'down', 'into', 'sentences', 'and', 'tokens', '(', 'words', ')', '.']
['Sentence', 'segmentation', 'breaks', 'down', 'text', 'into', 'sentences', 'based', 'on', 'full', 'stops', 'and', 'question', 'marks', '.']
['We', 'can', 'use', 'the', 'nltk', 'library', 'for', 'this', 'purpose', '.']
['It', 'may', 'be', 'easy', 'to', 'split', 'words', 'based', 'on', 'punctutations', '.']
['However', ',', 'we', 'will', 'face', 'issues', 'while', 'splitting', 'words', 'that', 'contain', 'Dr.', 'or', 'Mr.', 'or', 'Mrs.', 'Let', 'us', 'see', 'how', 'sentence', 'segmentation', 'can', 'handle', 'these', 'cases', '.']

Stop-words removal: In most nlp applications we need to extract useful information from a piece of text. Frequently used words such as a, an, the, of, in, etc; do not add much value to the meaning of the sentence. Such words are known as stop-words and are typically removed in the preprocessing step.

from nltk.corpus import stopwords

stopwords_list = set(stopwords.words('english'))
text = "hello this sentence consists of multiple stopwords"
cleaned_text = " ".join(token for token in word_tokenize(text) if token not in stopwords_list)
print(cleaned_text)

OUTPUT: hello sentence consists multiple stopwords

Stemming and lemmatization:
- Stemming refers to the process of removing suffixes and reducing the word to its base form such that all different variants of that word can be reporesented by the same form.
- Lemmatization refers to mapping all the different forms of a word to its base word or lemma.
```
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

print(stemmer.stem("dollars"), lemmatizer.lemmatize("happening", pos="v"))

OUTPUT: dollar,happen
```
Lowercasing: Lowercasing text helps in normalizing the text data. By converting all text to lowercase, you ensure that the same word in different cases (e.g., “apple” and “Apple”) is treated as identical. This reduces the vocabulary size and helps NLP models focus on the meaning of words rather than their capitalization.
```
text = "HELLO THIS SENTENCE IS IN UPPERCASE"
print(text.lower())

OUTPUT: hello this sentence is in uppercase
```
Punctuation removal: Punctuation marks, such as periods, commas, exclamation points, and question marks, add complexity to text. By removing them, you simplify the text and reduce the dimensionality, making it easier for NLP models to process and analyze.
```
import re

PUNCTUATION_REGEX = r'[,#<>"!=&.@|\[\]\':)(-;?]'
text = "this piece of text contains..... multiple punctuations!!!???@@@"
print( re.sub(PUNCTUATION_REGEX, "", text))

OUTPUT: this piece of text contains multiple punctuations
```

Part of Speech (POS) tagging: In tasks like named entity recognition and information extraction, POS tags can help identify relevant entities and relationships. For example, recognizing noun phrases (e.g., “New York City”) is easier with POS tagging.

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

text = "Joe Biden is the president of United States of America"
print(pos_tag(word_tokenize(text)))

OUTPUT: [('Joe', 'NNP'), ('Biden', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('president', 'NN'), ('of', 'IN'), ('United', 'NNP'), ('States', 'NNPS'), ('of', 'IN'), ('America', 'NNP')]

Text Representation

In order for Machine learning algorithms to process data, they need to be in numerical form. This conversion of raw text to its numerical form is called Text Representation.

Basic vectorization:

One hot encoding: In one hot encoding, each word w in the corpus vocabulary is given a unique integer ID, that is between 1 and |V|, where V is the set of the corpus vocabulary. Each word is represented by a V-dimensional binary vector of 0s and 1s. Below we present a function that can be used to get one hot embeddings for a piece of text.

def get_one_hot_encoding(text: str,
                         vocab: list):
    one_hot_encoded = list()
    for word in text.split():
        temp = [0] * len(vocab)
        if word in vocab:
            temp[vocab[word] - 1] = 1
        one_hot_encoded.append(temp)
    return one_hot_encoded

text = "hello this is ryan"
vocab = {"hello": 1, "this": 2, "is": 3, "ryan": 4, "greg": 5, "harry": 6, "john": 7}
print(get_one_hot_encoding(text, vocab))

OUTPUT: [[1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0]]

Bag of Words (BOW): In BOW we represent the text as a bag (collection) of words while ignoring the order and context. The intuition being, that it assumes that the text belonging to a given class in a dataset is characterized by a unique set of words. In BOW representation the i component of the vector is simply the number of times the word w occurs in the document.

from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer()
corpus = ["hello my name is ryan",
        "we are studying natural language processing in python",
        "trying out bag of words text representation",
        "repeating some some words here"]

bow_rep = count_vec.fit_transform(corpus)
print(bow_rep[3].toarray())

OUTPUT: array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1]])

Term Frequency-Inverse Document Frequency (TF-IDF): In the above two approaches all the words in the text are treated with equal importance. There is no concept of some words in the text being more important than others. TF-IDF or term frequency-inverse document frequency addresses this issue. The intuition behind TF-IDF is as follows: if a word appears many times in a document but does not occur much in the rest of the documents, then the word has great importance in the document in which it appeared.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
corpus = ["hello my name is ryan",
        "we are studying natural language processing in python",
        "trying out bag of words text representation",
        "repeating some some words here"]

tfidf_rep = tfidf_vec.fit_transform(corpus)
print(tfidf_rep[1].toarray())

OUTPUT: array([[0.35355339, 0.        , 0.        , 0.        , 0.35355339,
    0.        , 0.35355339, 0.        , 0.        , 0.35355339,
    0.        , 0.        , 0.35355339, 0.35355339, 0.        ,
    0.        , 0.        , 0.        , 0.35355339, 0.        ,
    0.        , 0.35355339, 0.        ]])

Distributed Representations:

The drawbacks of basic vectorization are as follows:
- They are discrete representations: Language is treated as atomic units. This hampers the ability to capture relationships between words
- The feature vectors are sparse and high dimensional: The dimensionality increases with the size of vocabulary, with most values being zero for any vector. This hampers the learning capability and makes them computationally inefficient
- They cannot handle Out of Vocabulary (OOV) words

Word embeddings: Word embeddings use neural network architectures to create dense, low dimensional representataions of words and text. These word embeddings rely on [distributional hypothesis] (https://arxiv.org/abs/1301.3781) that words that occur in similar context have similar meanings.

import gensim.downloader as api

wv = api.load('word2vec-google-news-300')
for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")

vec_king = wv['king']
print(vec_king)

Document embeddings: Word embeddings give us compact and dense representations for words in our vocabulary. However, in many NLP applications, we need to deal with sentences/documents. So we need a way to represent larger units of text. One approach can be to break the text into words, take the word embeddings and combine them (sum or average) to form the representation for the text. We will use spaCy library to extraction document vectors.

import spaCy

nlp = spaCy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
print(doc[0].vector)

Sentence embeddings: Sentence embeddings generated by models like Sentence-BERT (SBERT) represent sentences or text snippets as fixed-length vectors. SBERT is a variation of the popular BERT (Bidirectional Encoder Representations from Transformers) model fine-tuned specifically for sentence-level tasks. Using transformers to extract embeddings from the sentence. Below we show you an example of extracting sentence embeddings using the SBERT library.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

#Our sentences we like to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)

NLP Tasks

Text classification: Is the task of assigning a category to a given piece of text from a larger set of possible categories. This is one of the most popular task in NLP and has a wide range of applications. Lets build a simple text classification model. Below we illustrate an example of building a text classification using (scikit-learn)[https://scikit-learn.org/stable/index.html].

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

# Load the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

X = newsgroups.data
y = newsgroups.target

# obtain train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
count_vec = CountVectorizer()
X_train_vec = count_vec.fit_transform(X_train)
X_test_vec = count_vec.transform(X_test)

# train the model
nb_model = MultinomialNB()
nb_model.fit(X_train_vec, y_train)

# obtain predictions
y_pred = nb_model.predict(X_test_vec)

print(classification_report(y_test, y_pred))

Zero shot classification: Zero-shot text classification is a machine learning task where a model is trained to categorize text into predefined classes or categories without having seen any examples from these categories during its training. In traditional text classification, models are trained on labeled data for each category, which helps them learn the characteristics of each class. In zero-shot classification, the model is expected to generalize and classify text into categories it hasn’t seen before based on a brief description or a set of class labels provided during the task. Below we provide an example of solving a Zero shot classification problem using the (Hugging Face library)[https://huggingface.co/]

from transformers import pipeline

# Load the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define candidate labels or classes
labels = ["Sports", "Politics", "Entertainment", "Technology"]

# Input text for classification
text = "Apple announced the launch of a new product yesterday."

# Perform zero-shot classification
result = classifier(text, labels)

# Print the results
print(result)

OUTPUT: {'sequence': 'Apple announced the launch of a new product yesterday.',
 'labels': ['Technology', 'Entertainment', 'Sports', 'Politics'],
 'scores': [0.8737325072288513, 0.06840094178915024, 0.03353743255138397, 0.024329133331775665]}

Information extraction: Task of extracting relevant information from a piece of text. Below we provide practical examples of two fundamental tasks associated with Information extraction and how they can be solved in Python.

Named Entity Recognition (NER): Refers to the task of identifying the entities in a document. Entities are typically names of persons, locations and organziation. NER is an important step for several NLP applications involving information extraction. NLP libraries such as (spaCy)[https://spacy.io/] and (AllenNLP)[https://github.com/allenai/allennlp] are some well known libraries that can be used to incorporate a pre-trained NER model into a product. The code snippet below illustrates using NER from spaCy.

import spaCy

nlp = spaCy.load("en_core_web_sm")

text = """In the bustling city of New York, the Statue of Liberty, a symbol of freedom,
stands tall on Liberty Island, welcoming tourists from around the world.
Nearby, the New York Stock Exchange, the epicenter of global finance, buzzes with traders buying and selling shares. 
Meanwhile, Central Park, a serene oasis in the heart of Manhattan, offers a peaceful escape for residents and visitors alike."""

doc = nlp(text)
for entity in doc.ents:
    print(entity.text, entity.label_)

OUTPUT: New York GPE
        the Statue of Liberty GPE
        Liberty Island LOC
        the New York Stock Exchange ORG
        Central Park GPE
        Manhattan GPE

Keyphrase extraction: This is an IE task which extracts the important words and phrases that captures the gist of the text from a given text document. It is useful for several NLP tasks such as search/information retrieval/ text classification, document tagging; etc. NLP libraries such as texacy and (KeyBERT)[https://github.com/MaartenGr/KeyBERT] are some of the popular libraries used for Keyphrase extraction. The code snippet below illustrates Keyphrase extraction using KeyBERT.

from keybert import KeyBERT

doc = """
        Learning natural language processing (NLP) in Python is an exciting journey into the world of human language and machine understanding.
        Python offers a rich ecosystem of libraries and tools that make NLP accessible to both beginners and seasoned developers.
        Starting with fundamental concepts like tokenization, stemming, and lemmatization, you'll progress to more advanced topics such as sentiment analysis,
        part-of-speech tagging, and named entity recognition.
        NLP libraries like NLTK, spaCy, and the Transformers library by Hugging Face provide pre-trained models
        and a wealth of resources to accelerate your learning. You'll explore techniques for text classification, machine translation,
        chatbots, and more, while gaining a deeper understanding of the complexities of human language. 
        NLP in Python opens the door to countless applications, from social media sentiment analysis to language translation services,
        making it a valuable skill for those eager to delve into the realm of artificial intelligence and linguistics.
    """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

kw_model.extract_keywords(doc, keyphrase_ngram_range=(2, 3), stop_words=None)

OUTPUT: [('nlp in python', 0.7352),
        ('nlp libraries like', 0.647),
        ('nlp libraries', 0.6246),
        ('libraries like nltk', 0.6026),
        ('human language nlp', 0.5924)]

Information retrieval Information retrieval is the task of retrieving information based on an input query. It is also about making sense from a wide range of text corpora. Below we provide practical examples of two fundamental tasks associated with Information retrieval and how they can be solved in Python.

Topic modeling: Topic modeling tries to identify the key words (topics) present in a text corpus without prior knowledge about it. Some popular topic modeling algorithms are Latent Dirichlet allocation (LDA), Latent semantic analysis (LSA) and probabilistic latent semantic analysis (PLSA). In practice, the technique that is most commonly used is LDA. Topics are a mixture of keywords with probability distribution and documents are made up of a mixture of topic. NLP libraries such as gensim are some of the popular libraries used for Keyphrase extraction. The code snippet below illustrates Topic Modeling using gensim.

import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel

# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing is a field within AI.",
    "LDA is a popular topic modeling technique.",
    "Gensim is a library for NLP tasks.",
    "Artificial intelligence is transforming industries.",
]

# Tokenize the documents
tokenized_docs = [doc.split() for doc in documents]

# Create a dictionary of terms
dictionary = corpora.Dictionary(tokenized_docs)

# Create a bag of words (BoW) representation of the documents
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Train an LDA model
lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)

# Classify a new document
new_doc = "AI and machine learning are revolutionizing industries."
new_doc_bow = dictionary.doc2bow(new_doc.split())
topic_distribution = lda_model.get_document_topics(new_doc_bow)
print("Topic distribution for the new document:", topic_distribution)

OUTPUT: Topic distribution for the new document: [(0, 0.17182167), (1, 0.8281783)]

Text summarization: Refers to the task of creating a summary of longer piece of text. The goal is to create a summary that captures that key ideas in the text. The code snippet below illustrates Text summarization using Hugging Face.

from transformers import pipeline

text = """
In recent years, advances in artificial intelligence (AI) and machine learning have transformed various industries.
Businesses are leveraging these technologies for automation, data analysis, and decision-making.
In the healthcare sector, AI is being used to improve diagnostic accuracy and streamline patient care.
Meanwhile, e-commerce companies are utilizing machine learning to personalize recommendations and enhance the shopping experience.
Furthermore, AI-driven chatbots are revolutionizing customer support by providing instant assistance.
As the field of AI continues to evolve, it's essential for professionals to stay updated with the latest trends and developments in this ever-changing landscape.
"""

summarizer = pipeline("summarization", model="facebook/bart-large-mnli")
print(summarizer(text, max_length=40, min_length=10, do_sample=False))

OUTPUT: [{'summary_text': " Okawaru.As the field of AI continues to evolve, it's essential for professionals to stay updated with the latest trends and developments in this ever-changing landscape"}]

Final comments

This blog offers an overview of Natural language processing in Python, covering several fundamental concepts and techniques, including text preprocessing, tokenization, stemming and lemmatization, part-of-speech tagging, named-entity recognition, text classification and topic modeling. This blog is a great starting point to get into the field of NLP and can act as a natural transition to some advanced topics of NLP such as:

Deep Learning in NLP: Building chatbots, NER, classification systems
Finetuning Pre-training models
Building Machine translation systems

Keep in mind that Natural Language Processing (NLP) is an expansive and rapidly evolving domain. The secret to becoming proficient in NLP lies in the pursuit of knowledge and a willingness to explore fresh concepts and approaches. Stay curious and keep learning.