How to perform topic modeling using machine learning in Python?

September 01, 2024

In the first example, we perform topic modeling with Latent Dirichlet Allocation (LDA), using scikit-learn's CountVectorizer to turn the documents into word counts.

# Step 1: Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

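# Step 2: Create a CountVectorizer object and fit_transform the data
# data must be a list of document strings. The small corpus below is only a
# placeholder for illustration; replace it with your own documents.
data = [
    "The cat sat on the mat and watched the birds.",
    "Dogs and cats are popular household pets.",
    "The stock market fell sharply after the announcement.",
    "Investors worry about inflation and market volatility.",
    "The team won the football match in extra time.",
    "The coach praised the players after the final game.",
]
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)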

# Step 3: Initialize and fit the LDA model
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X)

# Step 4: Print the top words for each topic
n_top_words = 5
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    # argsort is ascending, so reverse it and keep the highest-weighted words
    top_indices = topic.argsort()[::-1][:n_top_words]
    top_words = [feature_names[i] for i in top_indices]
    print(f"Topic {topic_idx}:", top_words)
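
Beyond listing the top words, a fitted LDA model can also report how strongly each document belongs to each topic. The sketch below is a minimal illustration of that idea, not part of the original example: the docs list is a hypothetical sample corpus, and n_components=2 is an assumption chosen only because the corpus is tiny.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample corpus, used only for illustration.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "the stock market fell sharply today",
    "investors worry about market volatility",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Two topics are assumed here because the sample corpus is so small.
lda = LatentDirichletAllocation(n_components=2, random_state=42)

# fit_transform returns one row per document; each row is a
# probability distribution over the topics.
doc_topic = lda.fit_transform(X)

# The dominant topic of a document is the column with the highest probability.
dominant_topic = doc_topic.argmax(axis=1)

for doc, topic_idx in zip(docs, dominant_topic):
    print(f"Topic {topic_idx}: {doc}")
```

On a corpus this small the topic assignments are not very meaningful; with real data the same calls apply unchanged.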

In the second example, we perform topic modeling with Non-negative Matrix Factorization (NMF), this time on TF-IDF features instead of raw word counts.

# Step 1: Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Step 2: Create a TfidfVectorizer object and fit_transform the data
# (data is the same list of document strings used in the first example)
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(data)

# Step 3: Initialize and fit the NMF model
nmf = NMF(n_components=5, random_state=42)
nmf.fit(tfidf)

# Step 4: Print the top words for each topic
n_top_words = 5
feature_names = tfidf_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(nmf.components_):
    # argsort is ascending, so reverse it and keep the highest-weighted words
    top_indices = topic.argsort()[::-1][:n_top_words]
    top_words = [feature_names[i] for i in top_indices]
    print(f"Topic {topic_idx}:", top_words)
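
It may help to see what NMF actually factorizes. The sketch below, under the assumption of a hypothetical four-document corpus and two topics, splits the TF-IDF matrix into a document-topic matrix W and a topic-term matrix H, then measures how closely their product approximates the original matrix.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample corpus, used only for illustration.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "the stock market fell sharply today",
    "investors worry about market volatility",
]

tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vectorizer.fit_transform(docs)

# Two topics are assumed here because the sample corpus is so small.
nmf = NMF(n_components=2, random_state=42)

W = nmf.fit_transform(tfidf)  # document-topic weights, shape (n_docs, n_topics)
H = nmf.components_           # topic-term weights, shape (n_topics, n_terms)

# Frobenius norm of the difference between the data and its approximation W @ H.
error = np.linalg.norm(tfidf.toarray() - W @ H)

print("Document-topic weights:")
print(W.round(3))
print(f"Reconstruction error: {error:.3f}")
```

Unlike the LDA output, the rows of W are non-negative weights rather than probabilities, so they do not need to sum to 1.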

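These examples show how to perform topic modeling with LDA and NMF in Python. Both follow the same steps: convert the documents into a numeric matrix (word counts for LDA, TF-IDF weights for NMF), fit the model with a chosen number of topics, and read the top words for each topic from the model's components_ attribute.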
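
Since both examples extract the top words in the same way, that step could be factored into one helper. The function below is a hypothetical sketch (the name top_words_per_topic and the toy matrix are not from the original examples); it accepts any topic-word matrix, such as lda.components_ or nmf.components_.

```python
import numpy as np


def top_words_per_topic(components, feature_names, n_top_words=5):
    """Return the highest-weighted words for each row of a topic-word matrix."""
    feature_names = np.asarray(feature_names)
    topics = []
    for topic in components:
        # argsort is ascending, so reverse it and keep the first n_top_words.
        top_indices = np.argsort(topic)[::-1][:n_top_words]
        topics.append(feature_names[top_indices].tolist())
    return topics


# Toy topic-word matrix standing in for lda.components_ or nmf.components_.
toy_components = np.array([
    [0.1, 2.0, 0.3, 1.5],
    [3.0, 0.2, 0.9, 0.1],
])
toy_vocabulary = ["apple", "banana", "cherry", "date"]

result = top_words_per_topic(toy_components, toy_vocabulary, n_top_words=2)
print(result)  # [['banana', 'date'], ['apple', 'cherry']]
```

With a fitted model, the call would look like top_words_per_topic(lda.components_, vectorizer.get_feature_names_out()).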
