Introduction to LDA Topic Modeling and Python Implementation

Time: 2023-09-21

I. Introduction to the LDA Topic Model

The LDA topic model is mainly used to infer the topic distribution of documents: for each document in a collection, it gives the document's topics as a probability distribution, which can then be used for topic clustering or text classification.

The LDA topic model does not care about the order of words in a document and usually represents each document with bag-of-words features. An introduction to the bag-of-words model can be found in this article: Text Vectorized Representation – Bag of Words Model (Zhihu).

To understand the LDA model, we first need to understand its generative model: how does LDA think an article comes into being?

The LDA model holds that a topic can be represented by a distribution over words, and an article can be represented by a distribution over topics.

Take two topics, for example: food and beauty. LDA says these two topics can each be represented by a word distribution:

Food: {bread: 0.4, hot pot: 0.5, eyebrow pencil: 0.03, blush: 0.07}
Beauty: {eyebrow pencil: 0.4, blush: 0.5, bread: 0.03, hot pot: 0.07}

Similarly, LDA holds that an article can be represented by a topic distribution, like this:

The Beauty Diary {Beauty: 0.8, Food: 0.1, Other: 0.1}

Culinary Explorations {Food: 0.8, Beauty: 0.1, Other: 0.1}

So to generate an article, you pick a topic with a probability given by the article's topic distribution, then pick a word with a probability given by that topic's word distribution, and keep repeating these two steps until the article is complete.
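To make the two-step process concrete, here is a toy sketch of this generative procedure using the example distributions above (the "Other" topic is ignored for simplicity, so this is illustrative only, not the real LDA inference):

import random

# word distributions of the two example topics
topics = {
    'food':   {'bread': 0.4, 'hot pot': 0.5, 'eyebrow pencil': 0.03, 'blush': 0.07},
    'beauty': {'eyebrow pencil': 0.4, 'blush': 0.5, 'bread': 0.03, 'hot pot': 0.07},
}
# topic distribution of a 'Beauty Diary'-like article
# (weights need not sum to 1 for random.choices, so 'Other' is simply dropped)
doc_topics = {'beauty': 0.8, 'food': 0.1}

article = []
for _ in range(10):  # generate a 10-word toy article
    # step 1: pick a topic according to the article's topic distribution
    topic = random.choices(list(doc_topics), weights=doc_topics.values(), k=1)[0]
    # step 2: pick a word according to that topic's word distribution
    words = topics[topic]
    article.append(random.choices(list(words), weights=words.values(), k=1)[0])
print(article)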

In the LDA model, a document is generated as follows:

1. For each document, draw its topic distribution θ from a Dirichlet prior with parameter α.
2. For each word position in the document, draw a topic z from the multinomial distribution θ.
3. Draw the word w from the chosen topic's word distribution φ, where each topic's φ is itself drawn from a Dirichlet prior with parameter β.

Here, the Beta distribution is the conjugate prior of the binomial distribution, while the Dirichlet distribution is the conjugate prior of the multinomial distribution.


If we want to generate a document, the probability that each word in it occurs is:

p(word | document) = Σ over topics of p(word | topic) × p(topic | document)
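For example, with the toy numbers above (and ignoring the unspecified "Other" topic), the probability that "The Beauty Diary" emits the word blush is roughly 0.8 × 0.5 + 0.1 × 0.07 = 0.407.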

A more detailed mathematical derivation can be found in: Understanding LDA Topic Models in Layman's Terms.

Modeling goes in the opposite direction: given an article, infer its hidden topic distribution. In other words, humans write all kinds of articles according to the document generation process above and hand them to the computer; the computer sees only the finished article and, from the sequence of words it contains, must work backwards to the article's topics and the probability of each one, i.e. the topic distribution.

As for how exactly the LDA topic model is implemented under the hood, we don't have to dig into the details: there are many packages that can run an LDA topic analysis directly, and we can simply use them. (Yes, I'm a happy package caller.)

II. Python implementation

Before performing the LDA topic analysis in Python, I segmented the documents and removed stop words (see my previous post for details: Segmentation of a single micro blog file with python – jieba segmentation (add reserved words and stop words)).

The input file used below already contains the segmented words.
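For reference, each line of the input file is expected to hold one document, with its tokens separated by spaces; hypothetical content would look like:

blush lipstick skincare foundation
bread noodles dessert hotpot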

1. Import algorithm packages

import gensim
from gensim import corpora
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # ignore warnings to keep the output clean

from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

2. Load data

The document is first read into a two-dimensional list, where each sub-list holds the tokens of one tweet:

PATH = "E:/data/output.csv"

with open(PATH, encoding='utf-8', errors='ignore') as f:
    lines = f.read().split('\n')  # read the whole file and split it into lines
data_set = [line.split() for line in lines]  # each sub-list holds one document's tokens
print(data_set)

Next, build a dictionary and convert the corpus into a vectorized (bag-of-words) representation:

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # each document becomes a list of (word_id, count) pairs
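To see what this bag-of-words representation looks like, here is a tiny sketch with made-up tokens (the exact ids depend on insertion order):

from gensim import corpora

toy_docs = [['blush', 'lipstick', 'blush'], ['bread', 'noodles']]
toy_dict = corpora.Dictionary(toy_docs)               # maps each token to an integer id
print(toy_dict.token2id)                              # e.g. {'blush': 0, 'lipstick': 1, 'bread': 2, 'noodles': 3}
print(toy_dict.doc2bow(['blush', 'blush', 'bread']))  # e.g. [(0, 2), (2, 1)]: (word_id, count) pairs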

3. Construct the LDA model

num_topics = 10
ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30, random_state=1)  # fit a 10-topic model
print(ldamodel.print_topics(num_topics=num_topics, num_words=15))  # print the top 15 words per topic

The construction above fixes the number of topics in advance. In practice we usually evaluate the model with quantitative metrics, and those same metrics can be used to choose the optimal number of topics. The two common metrics for LDA topic models are perplexity and coherence: the lower the perplexity, or the higher the coherence, the better the model. Some studies suggest that perplexity is not a reliable indicator, so I generally use coherence to evaluate the model and select the number of topics, but the code below implements both.

# Calculate perplexity
def perplexity(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=15))
    print(ldamodel.log_perplexity(corpus))  # gensim returns a log-scale per-word bound
    return ldamodel.log_perplexity(corpus)

# Calculate coherence
def coherence(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30, random_state=1)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=10))
    ldacm = CoherenceModel(model=ldamodel, texts=data_set, dictionary=dictionary, coherence='c_v')
    print(ldacm.get_coherence())
    return ldacm.get_coherence()

4. Plot the topic-coherence curve and select the optimal number of topics

x = range(1, 15)
# z = [perplexity(i) for i in x]  # pick this line instead if you want to use perplexity
y = [coherence(i) for i in x]
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can display Chinese characters
matplotlib.rcParams['axes.unicode_minus'] = False
plt.plot(x, y)
plt.xlabel('Number of topics')
plt.ylabel('Coherence')
plt.title('Topic-coherence curve')
plt.show()
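Rather than only eyeballing the chart, the best value can also be read off programmatically; a small addition using the x and y computed above:

best_num = x[y.index(max(y))]  # number of topics with the highest coherence
print('Best number of topics:', best_num)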

This eventually gives both the word distribution of each topic and a curve like the following:

[Figure: topic-coherence curve]

5. Output and visualization of results

Based on the topic evaluation above, 5 looks like a good number of topics. Next we rerun the model with the number of topics set to 5 and output the most likely topic for each document:

from gensim.models import LdaModel
import pandas as pd
from gensim.corpora import Dictionary
from gensim import corpora, models
import csv

# Prepare data
PATH = "E:/data/output1.csv"

with open(PATH, encoding='utf-8', errors='ignore') as f:
    lines = f.read().split('\n')  # read the whole file and split it into lines
data_set = [line.split() for line in lines]  # each sub-list holds one document's tokens

dictionary = corpora.Dictionary(data_set) # Build dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=30, random_state=1)
topic_list=lda.print_topics()
print(topic_list)

for i in lda.get_document_topics(corpus)[:]:
    listj = []
    for j in i:
        listj.append(j[1])        # collect each topic's probability for this document
    bz = listj.index(max(listj))  # position of the most likely topic
    print(i[bz][0])               # print that topic's id
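For reference, a more compact way to get the same per-document dominant topic, using max() with a key function (equivalent to the loop above):

for doc_topics in lda.get_document_topics(corpus):
    print(max(doc_topics, key=lambda t: t[1])[0])  # topic id with the highest probability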

We can also visualize the LDA model results with pyLDAvis:

import pyLDAvis.gensim
pyLDAvis.enable_notebook()
data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'E:/data/3topic.html')
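One caveat: in pyLDAvis 3.x the gensim helper module was renamed, so on newer versions the equivalent calls look like this instead:

import pyLDAvis.gensim_models
data = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'E:/data/3topic.html')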

The result looks roughly like this:

[Figure: pyLDAvis interactive visualization]

Each circle on the left represents a topic; the bars on the right show how much each word contributes to the selected topic.

The complete code is listed below:

import gensim
from gensim import corpora
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # ignore warnings to keep the output clean

from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel



# Prepare data
PATH = "E:/data/output.csv"

with open(PATH, encoding='utf-8', errors='ignore') as f:
    lines = f.read().split('\n')  # read the whole file and split it into lines
data_set = [line.split() for line in lines]  # each sub-list holds one document's tokens
print(data_set)


dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # vectorize each document as (word_id, count) pairs

# Calculate perplexity
def perplexity(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=15))
    print(ldamodel.log_perplexity(corpus))  # gensim returns a log-scale per-word bound
    return ldamodel.log_perplexity(corpus)

# Calculate coherence
def coherence(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30, random_state=1)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=10))
    ldacm = CoherenceModel(model=ldamodel, texts=data_set, dictionary=dictionary, coherence='c_v')
    print(ldacm.get_coherence())
    return ldacm.get_coherence()

# Plot the topic-coherence line graph
x = range(1, 15)
# z = [perplexity(i) for i in x]  # use this line instead to evaluate with perplexity
y = [coherence(i) for i in x]
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can display Chinese characters
matplotlib.rcParams['axes.unicode_minus'] = False
plt.plot(x, y)
plt.xlabel('Number of topics')
plt.ylabel('Coherence')
plt.title('Topic-coherence curve')
plt.show()
from gensim.models import LdaModel
import pandas as pd
from gensim.corpora import Dictionary
from gensim import corpora, models
import csv

# Prepare data
PATH = "E:/data/output1.csv"

with open(PATH, encoding='utf-8', errors='ignore') as f:
    lines = f.read().split('\n')  # read the whole file and split it into lines
data_set = [line.split() for line in lines]  # each sub-list holds one document's tokens

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # vectorize each document as (word_id, count) pairs

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=30, random_state=1)
topic_list=lda.print_topics()
print(topic_list)

result_list = []
for i in lda.get_document_topics(corpus)[:]:
    listj = []
    for j in i:
        listj.append(j[1])        # collect each topic's probability for this document
    bz = listj.index(max(listj))  # position of the most likely topic
    result_list.append(i[bz][0])  # store that topic's id
print(result_list)
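Since csv is already imported, the dominant topics can also be written to a file; a minimal sketch, assuming a hypothetical output path:

with open('E:/data/doc_topics.csv', 'w', newline='', encoding='utf-8') as f:  # hypothetical path
    writer = csv.writer(f)
    writer.writerow(['doc_id', 'topic_id'])  # header row
    writer.writerows(enumerate(result_list))  # one row per document: (index, dominant topic id)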
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'E:/data/topic.html')

Take it if you need it.

Also, follow me; more practical data analytics articles are on the way~

(1) Set the tabBar: uni.setTabberItem({}); wx.setTabberItem({}); indexnumberisWhich item of the tabBar, counting from the left, is indexed from 0.textstringnoButton text on tabiconPathstringnoImage PathselectedIconPathstringnoImage path when selectedpagePathstringnoPage absolute pathvisiblebooleannotab Whether to display uni.setTabBarItem({ index: 0, text: ‘text’, iconPath: ‘/path/to/iconPath’, selectedIconPath: ‘/path/to/selectedIconPath’, pagePath: ‘pages/home/home’ }) wx.setTabBarItem({ index: 0, text: ‘text’, iconPath: ‘/path/to/iconPath’, selectedIconPath: ‘/path/to/selectedIconPath’, pagePath: ‘pages/home/home’ }) […]