Let us Extract some Topics from Text Data — Part IV: BERTopic


Learn more about the family member of BERT for topic modelling


Introduction

Topic modeling is a type of Natural Language Processing (NLP) task that utilizes unsupervised learning methods to extract the main topics from the text data we deal with. The word “unsupervised” here means that there is no training data with associated topic labels. Instead, the algorithms try to discover the underlying patterns, in this case the topics, directly from the data itself.

There are various kinds of algorithms that are widely used for topic modeling. In my previous three articles, I introduced three of them: LDA, GSDMM, and NMF.

In this article, I explain in depth what BERTopic is and how you can use it for your topic modeling project! Let us dive straight into it!

What is BERTopic

Before we figure out what BERTopic is and what it does, we need to know what BERT is, because BERTopic is derived from BERT. BERT is an acronym for Bidirectional Encoder Representations from Transformers, a transformer-based machine learning model developed in 2018. It has been pre-trained on massive amounts of corpus data and thus performs very well on various NLP tasks. You can check out the original paper for BERT here. BERTopic was devised as a family member of BERT specifically for the purpose of topic modeling.

BERTopic operates via the following few steps.

[Step 1] Documents are represented as embeddings using sentence transformers. You can learn more about sentence transformers here. The default model used for this step is a BERT-based sentence transformer (hence the name BERTopic).

[Step 2] Step 2 is the dimensionality reduction process. The UMAP (Uniform Manifold Approximation and Projection) algorithm is used as the default. Of course, other options, including Principal Component Analysis (PCA), may be used instead depending on what your objective and data are.

[Step 3] Step 3 is the clustering process. This step is the part that actually calculates the similarities among different documents to determine whether they fall under the same topic or not. The HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm is used as the default. It is a density-based clustering algorithm and hence often performs better than algorithms such as K-Means clustering for topic modeling.

[Step 4] After that, the c-TF-IDF algorithm retrieves the most relevant words for each topic. c-TF-IDF is similar to TF-IDF, as the name suggests, but it differs in that it measures the term frequency within each cluster instead of within each document. The mathematical formulation for c-TF-IDF is as follows.

Source: BERTopic Github Website
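Written out, the c-TF-IDF weight of a term t in a topic cluster (class) c, as given in the BERTopic documentation, is roughly:

W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{f_t}\right)

where tf_{t,c} is the frequency of term t in class c, f_t is the frequency of term t across all classes, and A is the average number of words per class.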

[Step 5] The optional final step is to optimize the terms using Maximal Marginal Relevance (MMR). Using this algorithm is beneficial because it improves the coherence of the terms within each topic and the quality of the topic representation.

Below is a nice infographic describing the aforementioned steps.

Source: BERTopic Github Site
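To make the modular pipeline concrete, here is a minimal sketch showing how each step maps onto a component that can be passed to BERTopic. The models and hyperparameter values below are illustrative choices, not the library defaults.

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Step 1: embed documents with a sentence-transformer model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2: reduce the dimensionality of the embeddings
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Step 3: cluster the reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', prediction_data=True)

# Step 4: tokenization used for the c-TF-IDF topic representation
vectorizer_model = CountVectorizer(stop_words="english")

model = BERTopic(embedding_model=embedding_model,
                 umap_model=umap_model,
                 hdbscan_model=hdbscan_model,
                 vectorizer_model=vectorizer_model)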

Implementation

We will be using the 20 Newsgroups data that is available to everyone via the sklearn package. It uses the Apache 2.0 License. Just a reminder that I have used this dataset in my previous three articles on topic modeling for the sake of consistency across the different tutorials.

We first install the BERTopic package.

!pip install bertopic

We then import the relevant packages we need.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from scipy import linalg
import gensim
from tqdm import tqdm
import re
import matplotlib.pyplot as plt
from bertopic import BERTopic

%matplotlib inline
np.set_printoptions(suppress=True)

Usually, the process of cleaning and pre-processing text is necessary to ensure maximum performance, but for this tutorial we skip this part to focus more on the implementation of BERTopic itself. In fact, seeing the relatively decent quality of the topic modeling outputs without any text cleaning attests to the power and usefulness of BERTopic. Refer to my previous article on NMF for topic modeling if you want to find out more about the text cleaning step.

From the sklearn package, we read in the 20 Newsgroups data. Then, we simply instantiate the BERTopic object while specifying the language as English. You can specify the number of topics you want through the nr_topics parameter. Setting it to "auto" means you want the model to automatically determine the appropriate number of topics.

docs = fetch_20newsgroups(subset='all',
                          remove=('headers', 'footers', 'quotes'))['data']

# Instantiate the model
model = BERTopic(language="english", nr_topics = 'auto')

# Fit and Transform
topics, probs = model.fit_transform(docs)

Note, however, that you can tune various parts of the model for your purposes.

Language

You can specify the language or let the model infer the language using the language parameter.

# Specify the language to be English explicitly
model = BERTopic(language="english")

# Let the model infer
model = BERTopic(language="multilingual")

Embeddings

You can also tune the embeddings part of BERTopic via the embedding_model parameter. By default, a BERT-based sentence-transformer model is used, but feel free to use other embedding models. The list of options can be found in this documentation.

model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")
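You can also pass in a sentence-transformers model object directly instead of a model name string. A minimal sketch, assuming the sentence-transformers package is installed:

from sentence_transformers import SentenceTransformer

# Load a sentence-transformer model and hand it to BERTopic
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
model = BERTopic(embedding_model=sentence_model)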

Topic Representation

Instead of using the default topic representation, you can tune it yourself or use a custom CountVectorizer instead.

# Update topic representation by increasing n-gram range and 
# removing english stopwords

model.update_topics(docs, topics, n_gram_range=(1, 3), stop_words="english")

# Use Custom CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 3), stop_words="english")
model.update_topics(docs, topics, vectorizer=cv)

Dimensionality Reduction

You can tune the hyperparameters of the UMAP algorithm before passing it to the BERTopic model. In addition, instead of using the default UMAP algorithm for dimensionality reduction, you can use other algorithms such as PCA.

from umap import UMAP

umap_model = UMAP(n_neighbors=10,
                  n_components=7,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=42)

model = BERTopic(umap_model=umap_model, language="english")

Refer to this nice article that actually uses PCA instead of UMAP for BERTopic.
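As a rough sketch, swapping in PCA could look like the following. BERTopic accepts any estimator with fit and transform methods through the umap_model argument, so the sklearn implementation should work here:

from sklearn.decomposition import PCA

# Use PCA in place of UMAP for the dimensionality reduction step
pca_model = PCA(n_components=5)
model = BERTopic(umap_model=pca_model, language="english")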

Number of Topics

You can either manually specify the number of topics or set the nr_topics argument to "auto" to let the model automatically determine the appropriate number of topics. However, you may face a situation where you have already trained the topic model but find the number of topics to be too large. In that case, you can reduce the number of topics after training using the reduce_topics function.

# Specify number of topics manually
model = BERTopic(nr_topics=20)

# Automatic Topic Reduction
model = BERTopic(nr_topics="auto")

# Reduce the number of topics after training the model
new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics=5)

Get overall info about topics

You can use the get_topic_info function to get overall information about the topics.

freq = model.get_topic_info() 
freq
Source: From the Author

Topic -1 refers to the cluster of outlier documents that could not be assigned to any other topic, so we can ignore it. We can see how many documents fall into each topic, along with some keywords that represent each topic under the Name column.
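If you want to leave the outliers out entirely when inspecting the results, a small sketch like this filters them from the frequency table (the Topic column is part of the DataFrame that get_topic_info returns):

# Drop the -1 outlier topic before looking at the topic counts
freq = model.get_topic_info()
freq = freq[freq["Topic"] != -1]
freq.head()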

Keywords and c-TF-IDF scores for a specified topic

You can also access the keywords and their c-TF-IDF scores for a specified topic using the get_topic function. For example, we access the keywords and their scores for topic 3 in the following code.

# We access index 4 for topic 3 because we skip index 0 which is topic -1
model.get_topic(freq.iloc[4]["Topic"])
Source: From the Author

We can see from the keywords above that topic 3 is mainly about crimes related to shootings and guns.

Find topics that are similar to specified term

# Return the top 3 topics that are semantically most similar
# to an input query term

similar_topics, similarity = model.find_topics("religion", top_n=3)

print("Most Similar Topic Info: \n{}".format(model.get_topic(similar_topics[0])))
print("Similarity Score: {}".format(similarity[0]))

print("\n Most Similar Topic Info: \n{}".format(model.get_topic(similar_topics[1])))
print("Similarity Score: {}".format(similarity[1]))

print("\n Most Similar Topic Info: \n{}".format(model.get_topic(similar_topics[2])))
print("Similarity Score: {}".format(similarity[2]))

The output is:

Most Similar Topic Info: 
[('prophecies', 0.035076938400189765), ('prophecy', 0.028848478747937348), ('that', 0.02108670502531178), ('god', 0.02051417591444672), ('lord', 0.020264581769842555), ('of', 0.016688896522909655), ('the', 0.016135781453880685), ('to', 0.015035130690705624), ('scripture', 0.014930014538798414), ('we', 0.014849027662146918)]
Similarity Score: 0.5521519602007433

Most Similar Topic Info:
[('bank', 0.09616363061888215), ('islamic', 0.08725362721875433), ('bcci', 0.07506873356081414), ('banks', 0.04599130160033494), ('islam', 0.03498368962676642), ('interest', 0.03153905791196487), ('an', 0.02707799288051472), ('risk', 0.026608617086657786), ('investor', 0.023625991363580155), ('loans', 0.023098071864865885)]
Similarity Score: 0.5518991782725108

Most Similar Topic Info:
[('moral', 0.04437134134615366), ('objective', 0.04058577723387244), ('morality', 0.03933015749038743), ('is', 0.023387671936210788), ('that', 0.021184421900981805), ('what', 0.017148832156794584), ('you', 0.017133130253694097), ('not', 0.01467957486207896), ('immoral', 0.014518771930711771), ('of', 0.014256652246875072)]
Similarity Score: 0.5515153930656871

As you can see from the example above, we can specify a term and find the topics that are most relevant to that input. In this case, we used the term “religion”. The top three topics that were most semantically similar to it were mainly about prophecies, Islamic banking, and morality.

BERTopic also has some nice visualization functions as part of the package.

# Intertopic Distance Map
model.visualize_topics()
Source: From the Author

The visualize_topics function displays the intertopic distance map, which we have seen before in our LDA tutorial. It is basically a visualization that shows bubbles of different topics and how similar they are to one another. The closer two bubbles are to each other, the more semantically similar the two topics are. Topic 2, for instance, which is the bigger red bubble in the lower right-hand corner, is similar to topic 10, the smaller bubble inside of topic 2. Topic 10 contains keywords including vitamin and infection, while topic 2 contains keywords such as cancer and drugs, and we can see that those two topics are closely related to each other.
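These visualization functions return Plotly figure objects, so if you want to keep a chart outside of the notebook, a sketch like the following should save it as an interactive HTML file:

# Save the intertopic distance map as a standalone HTML file
fig = model.visualize_topics()
fig.write_html("intertopic_distance_map.html")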

# Topic Word Scores in Bar Chart
model.visualize_barchart()
Source: From the Author

The visualize_barchart function allows us to create bar charts that show the scores of the top keywords in each topic. The only downside is that, by default, it only displays the first 8 topics (this can be adjusted with the top_n_topics parameter).

# Similarity Matrix
model.visualize_heatmap()
Source: From the Author

The visualize_heatmap function returns a similarity matrix of pairwise topic similarities. It might not be too informative if you have too many topics, since there would be an excessive amount of information in one plot.

# Probability distribution of the sixth document (index 5) across different topics
model.visualize_distribution(probs[5])

The visualize_distribution function returns a horizontal bar plot that shows the probability of a given document falling under each topic. In the example above, it shows those probabilities for the sixth document. Note that you need to set the calculate_probabilities parameter of the BERTopic object to True before training; its default value is False, and in that case the visualize_distribution function will return an error.
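In other words, the model would need to be instantiated and trained along these lines for visualize_distribution to work:

# calculate_probabilities defaults to False, so enable it before training
model = BERTopic(language="english", nr_topics='auto', calculate_probabilities=True)
topics, probs = model.fit_transform(docs)

# Probability distribution of the sixth document (index 5) across topics
model.visualize_distribution(probs[5])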

# Hierarchical Topics
hierarchical_topics = model.hierarchical_topics(docs)
model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
Source: From the Author

If you want to understand the hierarchical relationships among different topics, the visualize_hierarchy function will do the trick.

topics_over_time = model.topics_over_time(docs, topics, timestamp, nr_bins=20)

model.visualize_topics_over_time(topics_over_time, top_n_topics=20)

The dataset we used did not contain any time elements but if your dataset does, you can also make use of the topics_over_time function to visualize the frequency of topics over time in the form of a line graph.
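Purely as an illustration of the expected input, the timestamp argument above would be a list with one timestamp per document; the dates below are made up, since 20 Newsgroups has none:

import random
from datetime import datetime, timedelta

# Hypothetical timestamps, one per document, only to show the expected shape
timestamp = [datetime(2020, 1, 1) + timedelta(days=random.randint(0, 364))
             for _ in docs]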

Conclusion

In this article, I introduced the BERTopic algorithm, which is derived from BERT and comprises multiple steps that make use of various algorithms, from c-TF-IDF and UMAP to HDBSCAN and MMR. To fully understand how this model works, you would need to understand other machine learning and NLP concepts such as dimensionality reduction and text embeddings. Nevertheless, BERTopic provides us with a powerful and easy way to perform topic modeling even without text cleaning or pre-processing. That does not mean text cleaning is unnecessary; in most cases, text cleaning and pre-processing will be crucial for ensuring the quality of the topics being modeled. To learn more about BERTopic, please refer to the main documentation.

If you found this post helpful, consider supporting me by signing up on medium via the following link : )

joshnjuny.medium.com

You will have access to so many useful and interesting articles and posts from not only me but also other authors!

About the Author

Data Scientist. 1st Year PhD student in Informatics at UC Irvine. Main research interest is applying SOTA ML/DL/NLP methods on health and medical related big data to extract interesting insights that will inform patients, doctors and policy makers.

Former research area specialist at the Criminal Justice Administrative Records System (CJARS) economics lab at the University of Michigan, working on statistical report generation, automated data quality review, building data pipelines, and data standardization & harmonization. Former Data Science Intern at Spotify Inc. (NYC).

He loves sports, working out, cooking good Asian food, watching k-dramas, making and performing music, and most importantly worshiping Jesus Christ, our Lord. Check out his website!

Source: Towards Data Science