In the previous tutorial, we explained how to apply LDA Topic Modelling with Gensim. Today, we will provide an example of Topic Modelling with Non-Negative Matrix Factorization (NMF) using Python. For more information about NMF, have a look at the post NMF for Dimensionality Reduction and Recommender Systems in Python.
Again we will work with the ABC News dataset and we will create 10 topics. Let’s start coding by loading the data and the required libraries!
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Skip malformed rows; on pandas older than 1.3 use error_bad_lines=False instead
documents = pd.read_csv('news-data.csv', on_bad_lines='skip')
documents.head()
Note that the dataset contains 1,103,663 documents.
NMF for Topic Modelling
The idea is to build the TF-IDF matrix from the documents. It has M rows, where M is the number of documents (1,103,663 in our case), and N columns, where N is the number of unigrams; let’s call them “words”. From this matrix we then try to generate two other matrices (matrix decomposition): the Features matrix, with M rows and 10 columns, where 10 is the number of topics, and the Components matrix, with 10 rows (topics) and N columns (words). The product of Features and Components approximates the TF-IDF matrix.
So keep in mind that:
- The Components Matrix represents topics
- The Features Matrix combines topics into documents
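To make the shapes concrete, here is a minimal sketch on a toy corpus. The documents, the variable names and the choice of 2 topics are assumptions made for illustration only; they are not part of the ABC News example.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the news headlines
toy_docs = [
    "police investigate crash",
    "man charged over crash",
    "council plans new water funding",
    "govt urged to boost health funding",
]

toy_vect = TfidfVectorizer()
toy_X = toy_vect.fit_transform(toy_docs)        # TF-IDF: M rows x N columns

# K = 2 topics here (an arbitrary choice for the toy data)
toy_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
toy_features = toy_model.fit_transform(toy_X)   # Features: M rows x K columns
toy_components = toy_model.components_          # Components: K rows x N columns

# The product of the two factors approximates the TF-IDF matrix
approx = toy_features @ toy_components
print(toy_X.shape, toy_features.shape, toy_components.shape)
print("largest absolute error:", np.abs(toy_X.toarray() - approx).max())
```

Both factors contain only non-negative values, which is what lets us read each row of Components as a weighted list of words and each row of Features as a weighted mix of topics.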
Build the TF-IDF Matrix
The first step is to create the TF-IDF matrix as follows.
# use tfidf by removing tokens that don't appear in at least 50 documents
vect = TfidfVectorizer(min_df=50, stop_words='english')

# Fit and transform
X = vect.fit_transform(documents.headline_text)
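The effect of min_df is easier to see on a small example. The sketch below uses three made-up headlines and min_df=2 (both are assumptions for illustration) and compares the pruned vocabulary with the full one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical headlines, used only to show how min_df prunes the vocabulary
toy_docs = [
    "police probe crash",
    "police investigate fatal crash",
    "police seek witness",
]

# min_df=2 keeps only tokens that occur in at least 2 documents
pruned = TfidfVectorizer(min_df=2).fit(toy_docs)
full = TfidfVectorizer().fit(toy_docs)

print(sorted(pruned.vocabulary_))   # tokens that survive the threshold
print(len(full.vocabulary_))        # vocabulary size without any threshold
```

Only "crash" and "police" appear in two or more of these headlines, so they are the only columns left after pruning. On the real dataset, min_df=50 plays the same role at a larger scale.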
Build the NMF Model
At this point, we will build the NMF model which will generate the Feature and the Component matrices.
# Create an NMF instance: model
# the 10 components will be the topics
model = NMF(n_components=10, random_state=5)

# Fit the model to TF-IDF
model.fit(X)

# Transform the TF-IDF: nmf_features
nmf_features = model.transform(X)
It is important to check the dimensions of the 3 matrices:
TF-IDF Dimensions:
X.shape
(1103663, 11213)
Features Dimensions:
nmf_features.shape
(1103663, 10)
Components Dimensions:
model.components_.shape
(10, 11213)
We should add the column names to the Components matrix, since these are the tokens (words) from the TF-IDF. Note that the default tokenizer keeps any token of two or more alphanumeric characters, which is why you will see some numbers. We could also have removed them; both approaches are valid.
# Create a DataFrame: components_df
# (scikit-learn older than 1.0 uses vect.get_feature_names() instead)
components_df = pd.DataFrame(model.components_, columns=vect.get_feature_names_out())
components_df
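If you prefer to drop the numeric tokens, one option is to pass a custom token_pattern to the vectorizer. The pattern and the sample headlines below are an assumption for illustration, not what was used to produce the results in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical headlines that contain years
sample = ["country hour 2014 podcast", "country hour 2015 nsw"]

# Default pattern: any run of two or more word characters, digits included
default_vect = TfidfVectorizer().fit(sample)

# Assumed alternative: two or more letters, so tokens with digits are skipped
letters_only = TfidfVectorizer(token_pattern=r"(?u)\b[^\W\d_]{2,}\b").fit(sample)

print(sorted(default_vect.vocabulary_))
print(sorted(letters_only.vocabulary_))
```

With the second vectorizer the years no longer appear as columns, so they cannot end up among the top words of a topic.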
Get the Words of the Highest Value for each Topic
We have created the 10 topics using NMF. Let’s have a look at the 10 most important words for each topic.
for topic in range(components_df.shape[0]):
    tmp = components_df.iloc[topic]
    print(f'For topic {topic+1} the words with the highest value are:')
    print(tmp.nlargest(10))
    print('\n')
Output:
For topic 1 the words with the highest value are:
man 8.396817
charged 3.117248
murder 1.367349
jailed 0.891809
missing 0.880894
stabbing 0.727232
guilty 0.637090
arrested 0.600157
death 0.587411
sydney 0.532504
Name: 0, dtype: float64
For topic 2 the words with the highest value are:
interview 7.471284
extended 0.393083
michael 0.383856
david 0.226665
john 0.222362
james 0.211161
nrl 0.202279
smith 0.179707
ben 0.172380
andrew 0.169546
Name: 1, dtype: float64
For topic 3 the words with the highest value are:
police 6.886678
probe 0.814901
investigate 0.795326
missing 0.679711
search 0.637454
death 0.497555
hunt 0.420141
officer 0.329780
seek 0.313185
shooting 0.300862
Name: 2, dtype: float64
For topic 4 the words with the highest value are:
new 8.436840
zealand 0.571164
laws 0.402800
year 0.368542
home 0.265268
york 0.242442
centre 0.234176
hospital 0.233369
deal 0.220119
opens 0.190645
Name: 3, dtype: float64
For topic 5 the words with the highest value are:
rural 6.200143
national 2.804131
qld 2.662080
news 2.342136
nsw 1.211054
podcast 0.768686
reporter 0.525871
sa 0.477365
health 0.234591
north 0.219903
Name: 4, dtype: float64
For topic 6 the words with the highest value are:
abc 6.010566
news 2.209571
weather 2.192629
business 1.925450
sport 1.640363
market 1.310986
entertainment 0.926865
analysis 0.924629
talks 0.227601
speaks 0.194431
Name: 5, dtype: float64
For topic 7 the words with the highest value are:
crash 5.455308
car 2.480800
dies 1.841566
killed 1.837934
fatal 1.451367
woman 1.178756
road 0.986225
driver 0.875721
plane 0.778262
injured 0.651799
Name: 6, dtype: float64
For topic 8 the words with the highest value are:
court 7.931729
accused 2.308421
face 2.058921
murder 1.515629
faces 1.053453
charges 1.052557
told 0.869612
case 0.766514
high 0.689719
sex 0.600428
Name: 7, dtype: float64
For topic 9 the words with the highest value are:
country 3.850864
hour 3.468195
wa 1.608666
nsw 1.492966
2014 1.090257
2015 1.075701
podcast 1.054856
vic 0.755764
tas 0.658372
2013 0.643352
Name: 8, dtype: float64
For topic 10 the words with the highest value are:
govt 2.101926
council 2.040503
says 1.651489
plan 1.185716
water 1.081075
health 0.794900
urged 0.715581
australia 0.661426
report 0.548353
funding 0.529309
Name: 9, dtype: float64
As we can see, the topics appear to be meaningful. For example, Topic 3 seems to be about missing persons and investigations (police, probe, investigate, missing, search, seek etc.).
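If you only need the words and not their weights, a more compact listing can be built with NumPy. The helper name top_words and the small demo matrix are hypothetical and serve only to show the idea.

```python
import numpy as np

def top_words(components, feature_names, n_top=10):
    """Return the n_top highest-weighted words for each topic row."""
    feature_names = np.asarray(feature_names)
    # argsort is ascending, so reverse each row and keep the first n_top columns
    order = np.argsort(components, axis=1)[:, ::-1][:, :n_top]
    return [feature_names[row].tolist() for row in order]

# Hypothetical 2-topic, 4-word components matrix for illustration
demo_components = np.array([[0.1, 3.0, 0.0, 1.2],
                            [2.5, 0.0, 0.7, 0.1]])
demo_words = ["court", "police", "rural", "probe"]

print(top_words(demo_components, demo_words, n_top=2))
```

On the fitted model, the same helper could be called with model.components_ and the feature names of the vectorizer.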
Get the Topic of a Document
Since we have defined the topics, we will show how you can get the topic of each document. Let’s say that we want to get the topic of the document at index 55:
‘funds to help restore cossack’
my_document = documents.headline_text[55]
my_document
Output:
'funds to help restore cossack'
We will need to work with the Features matrix. So let’s get the row at index 55:
pd.DataFrame(nmf_features).loc[55]
Output:
0 0.000000
1 0.000000
2 0.001271
3 0.000000
4 0.000000
5 0.000000
6 0.000000
7 0.000000
8 0.000000
9 0.011652
Name: 55, dtype: float64
We look for the topic with the maximum value, which is the one at index 9, i.e. Topic 10 (note that we numbered the topics from 1 instead of 0). If we look at the most important words of Topic 10, we will see that they include “funding”!
Note that if we wanted to get the index in one step, we could have typed:
pd.DataFrame(nmf_features).loc[55].idxmax()
Output:
9
Finally, if we want to see the number of documents for each topic we can easily get it by typing:
pd.DataFrame(nmf_features).idxmax(axis=1).value_counts()
Output:
9 766109
6 57739
2 55393
7 47209
0 41849
8 39497
3 30593
1 23834
5 21533
4 19907
dtype: int64
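One caveat when counting documents this way: a row whose values are all zero still gets an index from idxmax (the first column, 0). This may happen when every token of a headline was removed by min_df or the stop-word list. The sketch below, on a made-up features matrix, flags such rows with -1 instead; the matrix and the -1 label are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical features matrix: the last row has no weight on any topic
demo_features = np.array([[0.00, 0.02, 0.00],
                          [0.01, 0.00, 0.30],
                          [0.00, 0.00, 0.00]])

# Index of the largest value per row (ties, including all-zero rows, give 0)
dominant = pd.Series(demo_features.argmax(axis=1))

# Rows with at least one positive topic weight
has_signal = demo_features.sum(axis=1) > 0

# Mark rows without any topic weight as -1 instead of silently using topic 0
dominant = dominant.where(has_signal, -1)
print(dominant.value_counts().sort_index())
```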
How to Predict the Topic of a New Document
Let’s say that we want to assign a topic to a new, unseen document. We need to take the document, transform it with the fitted TF-IDF vectorizer and then transform the result with the fitted NMF model. Let’s take an actual new headline from ABC News.
my_news = """15-year-old girl stabbed to death in grocery store during fight with 4 younger girls Authorities said they gathered lots of evidence from videos on social media"""

# Transform the TF-IDF
X = vect.transform([my_news])

# Transform the TF-IDF: nmf_features
nmf_features = model.transform(X)
pd.DataFrame(nmf_features)
And if we want to get the index of the topic with the highest score:
pd.DataFrame(nmf_features).idxmax(axis=1)
Output:
0 9
dtype: int64
As expected, this document was classified as Topic 10 (with index 9).
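The same two transform steps work with models that were saved to disk, as long as both the fitted vectorizer and the fitted NMF model are stored. Here is a minimal sketch using joblib on a toy corpus; the file name, the helper predict_topic and the training headlines are all hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tiny corpus so the sketch runs on its own
train_docs = [
    "police investigate fatal crash",
    "man charged over fatal crash",
    "council approves water funding plan",
    "govt urged to increase health funding",
]
toy_vect = TfidfVectorizer(stop_words="english")
toy_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
toy_model.fit(toy_vect.fit_transform(train_docs))

# Persist both fitted objects together; the file name is an assumption
path = os.path.join(tempfile.mkdtemp(), "nmf_topics.joblib")
joblib.dump({"vect": toy_vect, "model": toy_model}, path)

def predict_topic(text, artifacts):
    """Return (topic_index, topic_scores) for one unseen document."""
    tfidf = artifacts["vect"].transform([text])
    scores = artifacts["model"].transform(tfidf)[0]
    return int(scores.argmax()), scores

# Later, possibly in another session: load and predict
loaded = joblib.load(path)
topic, scores = predict_topic("crash kills man", loaded)
print(topic, scores)
```

The keywords of the predicted topic can then be read from the corresponding row of the loaded model's components_ matrix.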
The Takeaway
We provided a walk-through example of Topic Modelling using NMF. We should stress that the number of topics is arbitrary and that it is difficult to find the optimum one. In our example, we can see that some topics could be merged, which implies that it would be better to choose fewer topics. Finally, keep in mind that Matrix Factorization is a very powerful tool that has many applications in Data Science.
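One rough way to compare candidate numbers of topics is to look at the reconstruction error of the fitted model for several values of n_components. The sketch below does this on a made-up corpus; the documents and the candidate values are assumptions, and the error is only a heuristic, since it generally keeps shrinking as topics are added.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; on real data this would be the headline column
docs = [
    "police investigate fatal crash",
    "man charged over fatal crash",
    "court hears murder case",
    "council approves water funding plan",
    "govt urged to increase health funding",
    "rural news podcast for nsw",
]
X_toy = TfidfVectorizer().fit_transform(docs)

# Fit one model per candidate number of topics and record its error
errors = {}
for k in (1, 2, 3, 4):
    candidate = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=1000)
    candidate.fit(X_toy)
    errors[k] = candidate.reconstruction_err_

for k, err in errors.items():
    print(f"{k} topics: reconstruction error {err:.4f}")
```

Rather than picking the smallest error, look for the point where adding more topics stops reducing it noticeably, and check that the resulting topics are still distinct.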
6 thoughts on “Topic Modelling with NMF in Python”
This is a wonderful tutorial! Thank you for making this.
Thank you!
How can I get topics from a new document and their keywords using a saved model?
You have to load the required models (TF-IDF and NMF).
Great articles. But I think this part is incorrect “Finally, if we want to see the number of documents for each topic we can easily get it by typing pd.DataFrame(nmf_features).idxmax()”
this line of code will return the index (of the documents) that have the highest score for the certain topics.
Yes, you are right, I fixed it. Many thanks Lam