LDA Topic Modelling with Gensim

We will provide an example of how you can use Gensim’s LDA (Latent Dirichlet Allocation) model to model topics in the ABC News dataset.

Let’s load the data and the required libraries:

import pandas as pd
import gensim
from sklearn.feature_extraction.text import CountVectorizer


# Skip malformed rows (on_bad_lines replaces the deprecated error_bad_lines)
documents = pd.read_csv('news-data.csv', on_bad_lines='skip')

documents.head()
 
# Use CountVectorizer to keep tokens of three or more letters, remove stop words,
# drop tokens that appear in fewer than 20 documents,
# and drop tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english',
                       token_pattern=r'(?u)\b\w\w\w+\b')
# Fit and transform
X = vect.fit_transform(documents.headline_text)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
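
As a quick sanity check, we can peek at a few entries of the mapping (the exact IDs depend on the fitted vocabulary):

# Show a few (id, word) pairs from the mapping
list(id_map.items())[:5]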


# Use the gensim.models.LdaMulticore constructor to estimate
# LDA model parameters on the corpus, and save to the variable `ldamodel`

ldamodel = gensim.models.LdaMulticore(corpus=corpus, id2word=id_map, passes=2,
                                      random_state=5, num_topics=10, workers=2)
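
Since training on the full dataset takes some time, it can be handy to persist the fitted model. A minimal sketch, assuming we can write to the working directory (the file name lda_abc.model is arbitrary):

# Save the fitted model to disk (the file name is arbitrary)
ldamodel.save('lda_abc.model')

# Reload it later without retraining
ldamodel = gensim.models.LdaMulticore.load('lda_abc.model')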

Explore the Topics

For each topic, we will explore the words occurring in that topic and their relative weights:

for idx, topic in ldamodel.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")
Topic: 0 
Words: 0.034*"australia" + 0.025*"day" + 0.024*"world" + 0.015*"cup" + 0.014*"hour" + 0.014*"wins" + 0.013*"final" + 0.011*"2015" + 0.011*"test" + 0.010*"victorian"


Topic: 1 
Words: 0.033*"trump" + 0.017*"donald" + 0.014*"water" + 0.014*"market" + 0.013*"open" + 0.009*"australian" + 0.009*"share" + 0.008*"river" + 0.008*"rugby" + 0.008*"takes"


Topic: 2 
Words: 0.013*"tasmanian" + 0.013*"child" + 0.013*"house" + 0.012*"say" + 0.011*"live" + 0.010*"new" + 0.009*"report" + 0.009*"says" + 0.009*"abuse" + 0.009*"public"


Topic: 3 
Words: 0.015*"rural" + 0.013*"state" + 0.012*"national" + 0.012*"nsw" + 0.011*"says" + 0.011*"minister" + 0.010*"budget" + 0.010*"school" + 0.009*"nrl" + 0.009*"indigenous"


Topic: 4 
Words: 0.020*"north" + 0.018*"coast" + 0.017*"year" + 0.016*"new" + 0.012*"gold" + 0.012*"afl" + 0.011*"league" + 0.009*"royal" + 0.009*"west" + 0.009*"island"


Topic: 5 
Words: 0.021*"country" + 0.015*"people" + 0.014*"trial" + 0.011*"john" + 0.011*"young" + 0.010*"future" + 0.009*"men" + 0.009*"rio" + 0.009*"party" + 0.008*"sea"


Topic: 6 
Words: 0.042*"man" + 0.024*"court" + 0.018*"police" + 0.015*"charged" + 0.015*"murder" + 0.014*"woman" + 0.013*"interview" + 0.013*"crash" + 0.013*"car" + 0.013*"years"


Topic: 7 
Words: 0.038*"police" + 0.035*"sydney" + 0.016*"tasmania" + 0.014*"missing" + 0.012*"death" + 0.012*"perth" + 0.011*"community" + 0.011*"work" + 0.010*"hobart" + 0.010*"search"


Topic: 8 
Words: 0.032*"government" + 0.025*"election" + 0.013*"turnbull" + 0.012*"2016" + 0.011*"says" + 0.011*"killed" + 0.011*"news" + 0.010*"war" + 0.009*"drum" + 0.008*"png"


Topic: 9 
Words: 0.026*"queensland" + 0.019*"south" + 0.019*"canberra" + 0.015*"china" + 0.014*"australia" + 0.012*"power" + 0.010*"record" + 0.008*"korea" + 0.008*"victoria" + 0.008*"media"

We can see the key words of each topic. For example, Topic 6 contains words such as “court”, “police” and “murder”, while Topic 1 contains words such as “donald” and “trump”.
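
If you prefer the words and weights as Python objects rather than formatted strings, show_topic returns a list of (word, probability) tuples; for example, for Topic 6:

# Top 10 words of topic 6 as (word, probability) tuples
ldamodel.show_topic(6, topn=10)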

Topic Distribution

Let’s say that we want to get the probability that a document belongs to each topic. Let’s take an arbitrary document from our data:

my_document = documents.headline_text[17]
my_document
 
'british combat troops arriving daily in kuwait'

def topic_distribution(string_input):
    string_input = [string_input]
    # Fit and transform
    X = vect.transform(string_input)

    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

    output = list(ldamodel[corpus])[0]

    return output
 


topic_distribution(my_document)
 
[(0, 0.016683402),
 (1, 0.1833612),
 (2, 0.01668264),
 (3, 0.01668264),
 (4, 0.18311548),
 (5, 0.01668264),
 (6, 0.016683547),
 (7, 0.01668264),
 (8, 0.5167432),
 (9, 0.016682643)]

As we can see, this document is most likely to belong to topic 8, with a probability of about 52%. This makes sense: the headline is war-related (it contains the word “troops”), and topic 8 includes words such as “war” and “killed”. Let’s recall topic 8:
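
A single topic can be printed directly with print_topic, which returns the formatted word-weight string:

# Print the word-weight string of topic 8 only
print("Topic: 8 \nWords: {}".format(ldamodel.print_topic(8, topn=10)))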

Topic: 8
Words: 0.032*"government" + 0.025*"election" + 0.013*"turnbull" + 0.012*"2016" + 0.011*"says" + 0.011*"killed" + 0.011*"news" + 0.010*"war" + 0.009*"drum" + 0.008*"png"

Topic Prediction

Let’s say that we want to assign the most likely topic to each document, which is essentially the argmax of the distribution above.

def topic_prediction(my_document):
    string_input = [my_document]
    X = vect.transform(string_input)
    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    output = list(ldamodel[corpus])[0]
    topics = sorted(output,key=lambda x:x[1],reverse=True)
    return topics[0][0]

topic_prediction(my_document)

Output:

8

As expected, it returned 8, which is the most likely topic.
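
The same function works for any headline, not just the ones we inspected. As a sketch, you could label every document in the dataframe like this (applying the model row by row is slow on the full dataset):

# Assign the most likely topic to every headline (row by row; slow on large data)
documents['topic'] = documents['headline_text'].apply(topic_prediction)
documents.head()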

In Closing

That was an example of topic modelling with LDA. We could have used TF-IDF instead of bag-of-words, and we could also have applied lemmatization and/or stemming. There are many different approaches; our goal was to provide a walk-through example, so feel free to try different ones.
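
For instance, here is a minimal sketch of the TF-IDF variant, reusing the count matrix from above with gensim’s TfidfModel (the parameter choices simply mirror the model above):

from gensim.models import TfidfModel

# Re-weight the bag-of-words corpus with TF-IDF
tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# Fit LDA on the TF-IDF-weighted corpus instead of raw counts
ldamodel_tfidf = gensim.models.LdaMulticore(corpus=corpus_tfidf, id2word=id_map,
                                            passes=2, random_state=5,
                                            num_topics=10, workers=2)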
