Predictive Hacks

LDA Topic Modelling with Gensim


We will provide an example of how you can use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in the ABC News dataset.

Let’s load the data and the required libraries:

import pandas as pd
import gensim
from sklearn.feature_extraction.text import CountVectorizer


# Note: error_bad_lines was deprecated and removed in recent pandas versions;
# on_bad_lines='skip' is the current equivalent.
documents = pd.read_csv('news-data.csv', on_bad_lines='skip')

documents.head()
 
# Use CountVectorizer to find tokens of three or more letters, remove stop words,
# remove tokens that don't appear in at least 20 documents, and
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english',
                       token_pattern=r'(?u)\b\w\w\w+\b')
# Fit and transform
X = vect.fit_transform(documents.headline_text)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
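The inversion above is needed because scikit-learn's `vocabulary_` maps each word to its column index, while gensim's `id2word` parameter expects the reverse mapping, from index to word. A minimal sketch of the same inversion on a hypothetical two-headline mini-corpus (the headlines are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, just to show the direction of the mapping
toy_vect = CountVectorizer()
toy_vect.fit(["police charge man", "man faces court"])

# vocabulary_ is word -> column index; flip it to index -> word
toy_id_map = dict((v, k) for k, v in toy_vect.vocabulary_.items())

# each column index now maps back to its token,
# e.g. toy_id_map[toy_vect.vocabulary_["man"]] == "man"
```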


# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `ldamodel`

ldamodel = gensim.models.LdaMulticore(corpus=corpus, id2word=id_map, passes=2,
                                      random_state=5, num_topics=10, workers=2)

Explore the Topics

For each topic, we will explore the words occurring in it and their relative weights:

for idx, topic in ldamodel.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")
Topic: 0 
Words: 0.034*"australia" + 0.025*"day" + 0.024*"world" + 0.015*"cup" + 0.014*"hour" + 0.014*"wins" + 0.013*"final" + 0.011*"2015" + 0.011*"test" + 0.010*"victorian"


Topic: 1 
Words: 0.033*"trump" + 0.017*"donald" + 0.014*"water" + 0.014*"market" + 0.013*"open" + 0.009*"australian" + 0.009*"share" + 0.008*"river" + 0.008*"rugby" + 0.008*"takes"


Topic: 2 
Words: 0.013*"tasmanian" + 0.013*"child" + 0.013*"house" + 0.012*"say" + 0.011*"live" + 0.010*"new" + 0.009*"report" + 0.009*"says" + 0.009*"abuse" + 0.009*"public"


Topic: 3 
Words: 0.015*"rural" + 0.013*"state" + 0.012*"national" + 0.012*"nsw" + 0.011*"says" + 0.011*"minister" + 0.010*"budget" + 0.010*"school" + 0.009*"nrl" + 0.009*"indigenous"


Topic: 4 
Words: 0.020*"north" + 0.018*"coast" + 0.017*"year" + 0.016*"new" + 0.012*"gold" + 0.012*"afl" + 0.011*"league" + 0.009*"royal" + 0.009*"west" + 0.009*"island"


Topic: 5 
Words: 0.021*"country" + 0.015*"people" + 0.014*"trial" + 0.011*"john" + 0.011*"young" + 0.010*"future" + 0.009*"men" + 0.009*"rio" + 0.009*"party" + 0.008*"sea"


Topic: 6 
Words: 0.042*"man" + 0.024*"court" + 0.018*"police" + 0.015*"charged" + 0.015*"murder" + 0.014*"woman" + 0.013*"interview" + 0.013*"crash" + 0.013*"car" + 0.013*"years"


Topic: 7 
Words: 0.038*"police" + 0.035*"sydney" + 0.016*"tasmania" + 0.014*"missing" + 0.012*"death" + 0.012*"perth" + 0.011*"community" + 0.011*"work" + 0.010*"hobart" + 0.010*"search"


Topic: 8 
Words: 0.032*"government" + 0.025*"election" + 0.013*"turnbull" + 0.012*"2016" + 0.011*"says" + 0.011*"killed" + 0.011*"news" + 0.010*"war" + 0.009*"drum" + 0.008*"png"


Topic: 9 
Words: 0.026*"queensland" + 0.019*"south" + 0.019*"canberra" + 0.015*"china" + 0.014*"australia" + 0.012*"power" + 0.010*"record" + 0.008*"korea" + 0.008*"victoria" + 0.008*"media"

We can see the key words of each topic. For example, Topic 6 contains words such as "court", "police" and "murder", while Topic 1 contains words such as "donald" and "trump".
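Note that `print_topics` returns each topic as a formatted string of `weight*"word"` terms. If you prefer structured pairs, you can parse such a string yourself; a minimal sketch, using the beginning of the Topic 6 line shown above:

```python
# First three terms of the Topic 6 string from the output above
topic_6 = '0.042*"man" + 0.024*"court" + 0.018*"police"'

# Split into weight/word pairs
pairs = []
for term in topic_6.split(" + "):
    weight, word = term.split("*")
    pairs.append((word.strip('"'), float(weight)))

# pairs -> [("man", 0.042), ("court", 0.024), ("police", 0.018)]
```

(Gensim can also give you the pairs directly via the model's `show_topic` method, without any string parsing.)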

Topic Distribution

Let’s say that we want to get the probability that a document belongs to each topic. Let’s take an arbitrary document from our data:

my_document = documents.headline_text[17]
my_document
 
'british combat troops arriving daily in kuwait'
def topic_distribution(string_input):
    string_input = [string_input]
    # Fit and transform
    X = vect.transform(string_input)

    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

    output = list(ldamodel[corpus])[0]

    return output
 


topic_distribution(my_document)
 
[(0, 0.016683402),
 (1, 0.1833612),
 (2, 0.01668264),
 (3, 0.01668264),
 (4, 0.18311548),
 (5, 0.01668264),
 (6, 0.016683547),
 (7, 0.01668264),
 (8, 0.5167432),
 (9, 0.016682643)]
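The output is simply a list of `(topic_id, probability)` pairs forming a distribution over the ten topics. Using the exact pairs printed above, we can check that the probabilities sum to one and read off the most likely topic directly:

```python
# Topic distribution for the document, copied from the output above
dist = [(0, 0.016683402), (1, 0.1833612), (2, 0.01668264), (3, 0.01668264),
        (4, 0.18311548), (5, 0.01668264), (6, 0.016683547), (7, 0.01668264),
        (8, 0.5167432), (9, 0.016682643)]

total = sum(p for _, p in dist)   # ~1.0: a proper probability distribution
best_topic, best_prob = max(dist, key=lambda pair: pair[1])
# best_topic -> 8
```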

As we can see, this document is most likely to belong to topic 8, with roughly 52% probability. It makes sense because the document is related to "war": it contains the word "troops", and topic 8 includes "war" among its key words. Let’s recall topic 8:

Topic: 8
Words: 0.032*"government" + 0.025*"election" + 0.013*"turnbull" + 0.012*"2016" + 0.011*"says" + 0.011*"killed" + 0.011*"news" + 0.010*"war" + 0.009*"drum" + 0.008*"png"

Topic Prediction

Let’s say that we want to assign the most likely topic to each document, which is essentially the argmax of the distribution above.

def topic_prediction(my_document):
    string_input = [my_document]
    X = vect.transform(string_input)
    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    output = list(ldamodel[corpus])[0]
    topics = sorted(output, key=lambda x: x[1], reverse=True)
    return topics[0][0]

topic_prediction(my_document)
 

Output:

8

As expected, it returned 8, which is the most likely topic.

In Closing

That was an example of topic modelling with LDA. We could have used TF-IDF weights instead of raw bag-of-words counts, and we could also have applied lemmatization and/or stemming. There are many different approaches; our goal was to provide a walk-through example, so feel free to try others.
