We will walk through an example of how to use Gensim's LDA (Latent Dirichlet Allocation) model to discover topics in the ABC News dataset.
Let’s load the data and the required libraries:
import pandas as pd
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Note: in pandas >= 2.0, error_bad_lines was removed; use on_bad_lines='skip' instead
documents = pd.read_csv('news-data.csv', error_bad_lines=False)
documents.head()

# Use CountVectorizer to find tokens of three or more letters, remove stop words,
# remove tokens that don't appear in at least 20 documents,
# and remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english',
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')

# Fit and transform
X = vect.fit_transform(documents.headline_text)

# Convert sparse matrix to gensim corpus
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (to be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

# Use the gensim.models.LdaMulticore constructor to estimate
# LDA model parameters on the corpus, and save to the variable `ldamodel`
ldamodel = gensim.models.LdaMulticore(corpus=corpus, id2word=id_map, passes=2,
                                      random_state=5, num_topics=10, workers=2)
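Training on the full set of headlines can take a while, so it may be worth persisting the fitted model. A minimal sketch using Gensim's save/load methods (the filename 'abc_lda.model' is just a placeholder):

# Save the trained model to disk ('abc_lda.model' is a hypothetical filename)
ldamodel.save('abc_lda.model')

# Reload it later without retraining
ldamodel = gensim.models.LdaMulticore.load('abc_lda.model')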
Explore the Topics
For each topic, we will explore the words occurring in that topic and their relative weights:
for idx, topic in ldamodel.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")
Topic: 0
Words: 0.034*"australia" + 0.025*"day" + 0.024*"world" + 0.015*"cup" + 0.014*"hour" + 0.014*"wins" + 0.013*"final" + 0.011*"2015" + 0.011*"test" + 0.010*"victorian"
Topic: 1
Words: 0.033*"trump" + 0.017*"donald" + 0.014*"water" + 0.014*"market" + 0.013*"open" + 0.009*"australian" + 0.009*"share" + 0.008*"river" + 0.008*"rugby" + 0.008*"takes"
Topic: 2
Words: 0.013*"tasmanian" + 0.013*"child" + 0.013*"house" + 0.012*"say" + 0.011*"live" + 0.010*"new" + 0.009*"report" + 0.009*"says" + 0.009*"abuse" + 0.009*"public"
Topic: 3
Words: 0.015*"rural" + 0.013*"state" + 0.012*"national" + 0.012*"nsw" + 0.011*"says" + 0.011*"minister" + 0.010*"budget" + 0.010*"school" + 0.009*"nrl" + 0.009*"indigenous"
Topic: 4
Words: 0.020*"north" + 0.018*"coast" + 0.017*"year" + 0.016*"new" + 0.012*"gold" + 0.012*"afl" + 0.011*"league" + 0.009*"royal" + 0.009*"west" + 0.009*"island"
Topic: 5
Words: 0.021*"country" + 0.015*"people" + 0.014*"trial" + 0.011*"john" + 0.011*"young" + 0.010*"future" + 0.009*"men" + 0.009*"rio" + 0.009*"party" + 0.008*"sea"
Topic: 6
Words: 0.042*"man" + 0.024*"court" + 0.018*"police" + 0.015*"charged" + 0.015*"murder" + 0.014*"woman" + 0.013*"interview" + 0.013*"crash" + 0.013*"car" + 0.013*"years"
Topic: 7
Words: 0.038*"police" + 0.035*"sydney" + 0.016*"tasmania" + 0.014*"missing" + 0.012*"death" + 0.012*"perth" + 0.011*"community" + 0.011*"work" + 0.010*"hobart" + 0.010*"search"
Topic: 8
Words: 0.032*"government" + 0.025*"election" + 0.013*"turnbull" + 0.012*"2016" + 0.011*"says" + 0.011*"killed" + 0.011*"news" + 0.010*"war" + 0.009*"drum" + 0.008*"png"
Topic: 9
Words: 0.026*"queensland" + 0.019*"south" + 0.019*"canberra" + 0.015*"china" + 0.014*"australia" + 0.012*"power" + 0.010*"record" + 0.008*"korea" + 0.008*"victoria" + 0.008*"media"
We can see the key words of each topic. For example, Topic 6 contains words such as "court", "police", and "murder", while Topic 1 contains words such as "donald" and "trump".
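If you want the top words of a single topic as (word, weight) pairs rather than a formatted string, Gensim's show_topic method is handy. A minimal sketch for Topic 6:

# Top 10 (word, weight) pairs for topic 6
for word, weight in ldamodel.show_topic(6, topn=10):
    print("{}: {:.3f}".format(word, weight))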
Topic Distribution
Let's say that we want to get the probability that a document belongs to each topic. Let's take an arbitrary document from our data:
my_document = documents.headline_text[17]
my_document
'british combat troops arriving daily in kuwait'
def topic_distribution(string_input):
    string_input = [string_input]
    # Transform with the already-fitted vectorizer
    X = vect.transform(string_input)
    # Convert sparse matrix to gensim corpus
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    output = list(ldamodel[corpus])[0]
    return output

topic_distribution(my_document)
[(0, 0.016683402),
(1, 0.1833612),
(2, 0.01668264),
(3, 0.01668264),
(4, 0.18311548),
(5, 0.01668264),
(6, 0.016683547),
(7, 0.01668264),
(8, 0.5167432),
(9, 0.016682643)]
As we can see, this document most likely belongs to topic 8, with a probability of about 52%; the remaining topics receive small, nearly uniform probabilities from the Dirichlet prior. That makes sense: the document contains the word "troops", and topic 8 includes war-related terms such as "killed" and "war". Let's recall topic 8:
Topic: 8
Words: 0.032*"government" + 0.025*"election" + 0.013*"turnbull" + 0.012*"2016" + 0.011*"says" + 0.011*"killed" + 0.011*"news" + 0.010*"war" + 0.009*"drum" + 0.008*"png"
Topic Prediction
Let's say that we want to assign the most likely topic to each document, which is simply the argmax of the distribution above.
def topic_prediction(my_document):
    string_input = [my_document]
    X = vect.transform(string_input)
    # Convert sparse matrix to gensim corpus
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    output = list(ldamodel[corpus])[0]
    # Sort topics by probability and return the ID of the most likely one
    topics = sorted(output, key=lambda x: x[1], reverse=True)
    return topics[0][0]

topic_prediction(my_document)
Output:
8
As expected, it returned 8, which is the most likely topic.
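The same helper can label the whole dataset. A sketch, assuming you want the predicted topic as a new column (this runs inference per headline, so it can be slow on the full corpus):

# Assign the most likely topic to every headline
documents['topic'] = documents.headline_text.apply(topic_prediction)
documents.head()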
In Closing
That was an example of topic modelling with LDA. We could have used TF-IDF weights instead of a bag of words (see the sketch below), and we could have applied lemmatization and/or stemming. There are many different approaches; our goal here was to provide a walk-through example, so feel free to experiment with them.
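As one illustration, here is a sketch of the TF-IDF variant, keeping the rest of the pipeline unchanged. Note that LDA formally assumes integer word counts, so results obtained with real-valued TF-IDF weights should be interpreted with care:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same token filtering as before, but with TF-IDF weights
tfidf_vect = TfidfVectorizer(min_df=20, max_df=0.2, stop_words='english',
                             token_pattern='(?u)\\b\\w\\w\\w+\\b')
X_tfidf = tfidf_vect.fit_transform(documents.headline_text)

corpus_tfidf = gensim.matutils.Sparse2Corpus(X_tfidf, documents_columns=False)
id_map_tfidf = dict((v, k) for k, v in tfidf_vect.vocabulary_.items())

ldamodel_tfidf = gensim.models.LdaMulticore(corpus=corpus_tfidf, id2word=id_map_tfidf,
                                            passes=2, random_state=5,
                                            num_topics=10, workers=2)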