We will provide an example of how you can use Gensim's LDA (Latent Dirichlet Allocation) model to extract topics from the ABC News headlines dataset.
Let’s load the data and the required libraries:
import pandas as pd
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Note: error_bad_lines is deprecated in recent pandas versions;
# use on_bad_lines='skip' instead
documents = pd.read_csv('news-data.csv', error_bad_lines=False)
documents.head()
# Use CountVectorizer to keep tokens of three or more letters,
# remove English stop words,
# drop tokens that appear in fewer than 20 documents,
# and drop tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english',
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')

# Fit and transform
X = vect.fit_transform(documents.headline_text)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (to be used in the id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

# Estimate LDA model parameters on the corpus and save to `ldamodel`
# (LdaMulticore is the parallelized version of gensim.models.ldamodel.LdaModel)
ldamodel = gensim.models.LdaMulticore(corpus=corpus, id2word=id_map, passes=2,
                                      random_state=5, num_topics=10, workers=2)
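Before training, it can help to sanity-check the vectorized corpus. This quick optional check is not part of the original walk-through, but it confirms the document count and vocabulary size line up:

# Optional sanity check on the vectorized corpus
print(X.shape)      # (number of headlines, vocabulary size)
print(len(id_map))  # same vocabulary size, as a word-ID-to-word mapping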
Explore the Topics
For each topic, we will explore the words occurring in that topic and their relative weights:
for idx, topic in ldamodel.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")
Topic: 0
Words: 0.034*"australia" + 0.025*"day" + 0.024*"world" + 0.015*"cup" + 0.014*"hour" + 0.014*"wins" + 0.013*"final" + 0.011*"2015" + 0.011*"test" + 0.010*"victorian"
Topic: 1
Words: 0.033*"trump" + 0.017*"donald" + 0.014*"water" + 0.014*"market" + 0.013*"open" + 0.009*"australian" + 0.009*"share" + 0.008*"river" + 0.008*"rugby" + 0.008*"takes"
Topic: 2
Words: 0.013*"tasmanian" + 0.013*"child" + 0.013*"house" + 0.012*"say" + 0.011*"live" + 0.010*"new" + 0.009*"report" + 0.009*"says" + 0.009*"abuse" + 0.009*"public"
Topic: 3
Words: 0.015*"rural" + 0.013*"state" + 0.012*"national" + 0.012*"nsw" + 0.011*"says" + 0.011*"minister" + 0.010*"budget" + 0.010*"school" + 0.009*"nrl" + 0.009*"indigenous"
Topic: 4
Words: 0.020*"north" + 0.018*"coast" + 0.017*"year" + 0.016*"new" + 0.012*"gold" + 0.012*"afl" + 0.011*"league" + 0.009*"royal" + 0.009*"west" + 0.009*"island"
Topic: 5
Words: 0.021*"country" + 0.015*"people" + 0.014*"trial" + 0.011*"john" + 0.011*"young" + 0.010*"future" + 0.009*"men" + 0.009*"rio" + 0.009*"party" + 0.008*"sea"
Topic: 6
Words: 0.042*"man" + 0.024*"court" + 0.018*"police" + 0.015*"charged" + 0.015*"murder" + 0.014*"woman" + 0.013*"interview" + 0.013*"crash" + 0.013*"car" + 0.013*"years"
Topic: 7
Words: 0.038*"police" + 0.035*"sydney" + 0.016*"tasmania" + 0.014*"missing" + 0.012*"death" + 0.012*"perth" + 0.011*"community" + 0.011*"work" + 0.010*"hobart" + 0.010*"search"
Topic: 8
Words: 0.032*"government" + 0.025*"election" + 0.013*"turnbull" + 0.012*"2016" + 0.011*"says" + 0.011*"killed" + 0.011*"news" + 0.010*"war" + 0.009*"drum" + 0.008*"png"
Topic: 9
Words: 0.026*"queensland" + 0.019*"south" + 0.019*"canberra" + 0.015*"china" + 0.014*"australia" + 0.012*"power" + 0.010*"record" + 0.008*"korea" + 0.008*"victoria" + 0.008*"media"
We can see the key words of each topic. For example, Topic 6 contains words such as "court", "police", and "murder", while Topic 1 contains words such as "donald" and "trump".
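If you want to inspect a single topic programmatically rather than print them all, gensim's show_topic method returns the (word, weight) pairs for one topic. A minimal sketch, using the crime-related topic 6 from the output above:

# Top 10 (word, weight) pairs for topic 6
ldamodel.show_topic(6, topn=10)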
Topic Distribution
Let's say that we want to get the probability that a document belongs to each topic. Let's take an arbitrary document from our data:
my_document = documents.headline_text[17]
my_document
'british combat troops arriving daily in kuwait'
def topic_distribution(string_input):
    string_input = [string_input]
    # Transform only (the vectorizer was already fitted above)
    X = vect.transform(string_input)
    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    output = list(ldamodel[corpus])[0]
    return output

topic_distribution(my_document)
[(0, 0.016683402),
(1, 0.1833612),
(2, 0.01668264),
(3, 0.01668264),
(4, 0.18311548),
(5, 0.01668264),
(6, 0.016683547),
(7, 0.01668264),
(8, 0.5167432),
(9, 0.016682643)]
As we can see, this document is most likely to belong to topic 8, with about 52% probability. This makes sense because the document contains the word "troops" and topic 8 includes war-related words such as "killed" and "war". Let's recall topic 8:
Topic: 8
Words: 0.032*"government" + 0.025*"election" + 0.013*"turnbull" + 0.012*"2016" + 0.011*"says" + 0.011*"killed" + 0.011*"news" + 0.010*"war" + 0.009*"drum" + 0.008*"png"
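This output can be reproduced for a single topic with gensim's print_topic method, which formats a topic's top words as a string. A small sketch using the model trained above:

# Formatted top-10 words of topic 8
ldamodel.print_topic(8, topn=10)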
Topic Prediction
Let's say that we want to assign the most likely topic to each document, which is essentially the argmax of the distribution above.
def topic_prediction(my_document):
    string_input = [my_document]
    X = vect.transform(string_input)
    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    output = list(ldamodel[corpus])[0]
    # Sort the (topic, probability) pairs by probability, highest first
    topics = sorted(output, key=lambda x: x[1], reverse=True)
    return topics[0][0]

topic_prediction(my_document)
Output:
8
As expected, it returned 8, which is the most likely topic.
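If you want to label the whole dataset this way, you could apply the function to every headline. This is a hypothetical one-liner, not part of the original walk-through; note that it transforms one document at a time, so it can be slow on a large corpus:

# Assign the most likely topic to every headline (can be slow)
documents['topic'] = documents.headline_text.apply(topic_prediction)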
In Closing
That was an example of topic modelling with LDA. We could have used a TF-IDF representation instead of bag-of-words, and we could also have applied lemmatization and/or stemming before vectorizing. There are many different approaches; our goal was to provide a walk-through example, so feel free to try different approaches.
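As a starting point for the TF-IDF variant mentioned above, gensim's TfidfModel can re-weight the bag-of-words corpus before training. A minimal sketch, reusing the corpus and id_map built earlier; the hyperparameters are simply copied from the example above:

# Re-weight the bag-of-words corpus with TF-IDF
tfidf = gensim.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# Train LDA on the TF-IDF-weighted corpus
ldamodel_tfidf = gensim.models.LdaMulticore(corpus=corpus_tfidf, id2word=id_map,
                                            passes=2, random_state=5,
                                            num_topics=10, workers=2)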
4 thoughts on “LDA Topic Modelling with Gensim”
In the Topic Prediction part, use output = list(ldamodel[corpus]) without the [0] index.
Thank you. My code was throwing an error in the topics = sorted(output, key=lambda x: x[1], reverse=True) part with [0] in the line you mentioned. The error was: TypeError: '<' not supported between instances of 'int' and 'tuple'. But now I have a different issue: even though I'm getting an output, it is similar to the one shown in the "Topic Distribution" part of the article above. This is my output: [(0, 0.60980225), (1, 0.055161662), (2, 0.02830643), (3, 0.3067296)]. My model has 4 topics. Could you tell me how I can directly get topic number 0 as my output, without the probabilities/weights of the respective topics? Thank you in advance 🙂
Should I write output = list(ldamodel[corpus])[0][0]?
Hi Roma, thanks for reading our posts. We cannot provide any help when we do not have a reproducible example.