NLTK

Main Topic Identification

Introduction

Unstructured text data play a crucial role in people's daily lives. However, the sheer volume of data generated makes manual labeling infeasible, so a machine learning model is needed to perform this task.

Main Topic Identification helps us cluster data with similar words to assign labels and categorize the data. This can be achieved using various techniques, with one of the most popular being Non-Negative Matrix Factorization (NMF).

NMF factorizes a sparse representation of the documents, here a TF-IDF matrix built from the text, into "k" features (topics). Afterward, we identify the top-10 words for each of the "k" features. After inspecting the resulting clusters by hand, we assign a label to each cluster and then apply these labels to all documents.

Data information:
  1. The data used come from the following Kaggle dataset.
  2. The labels available in the dataset were ignored because the purpose of this project was unsupervised learning.
Data treatment and modeling
  • Data was loaded and analyzed to determine the types and the presence of null values.
  • The text was cleaned to avoid inconsistencies in the results.
  • All texts were converted to lowercase, and all numbers were removed using regular expressions.
  • The text was tokenized, stop words were removed, and words with a length of less than 3 characters were eliminated.
  • The text was lemmatized to reduce redundancy by grouping words with the same root.
  • Finally, the text was vectorized using the TF-IDF method.
  • After data preprocessing, the NMF model was trained with k topics. Numerous tests were conducted to find the most suitable "k" clusters.
  • This process was done for the titles and abstracts separately.
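The cleaning steps listed above can be sketched as a single function. This is a simplified illustration under stated assumptions, not the project's code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the regex tokenizer are placeholders. In practice, a full stop-word list and a real lemmatizer such as NLTK's `WordNetLemmatizer().lemmatize` (which needs the WordNet corpus downloaded) could be passed in instead.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"the", "and", "for", "with", "are", "was", "this", "that", "from"}


def preprocess(text, stop_words=STOP_WORDS, lemmatize=None, min_len=3):
    """Lowercase, drop numbers, tokenize, filter tokens, and optionally lemmatize."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)              # remove all numbers
    tokens = re.findall(r"[a-z]+", text)          # simple regex tokenizer (assumption)
    tokens = [t for t in tokens if t not in stop_words and len(t) >= min_len]
    if lemmatize is not None:                     # any callable mapping word -> lemma
        tokens = [lemmatize(t) for t in tokens]
    return " ".join(tokens)


sample = "The 3 Models of Deep Learning in 2020 are an RL agent"
cleaned = preprocess(sample)
# Crude stand-in for a lemmatizer, used only to show where one plugs in.
cleaned_lemmatized = preprocess(sample, lemmatize=lambda w: w.rstrip("s"))
print(cleaned)
print(cleaned_lemmatized)
```

The cleaned strings are what would then be passed to the TF-IDF vectorizer, once for the titles and once for the abstracts.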
Workflow used
Results

    The results show that the optimal number of clusters for the titles was three.

    The three topics of the titles identified are math, deep learning, and reinforcement learning.

    For the abstracts, the optimal number of clusters was four.

    The topics identified for the abstracts were machine learning, data, math, and physics.

Public code is available in the following GitHub repo.

Sebastián Sarasti

© Sebastián Sarasti Zambonino. All Rights Reserved.

Edited by Sebastián Sarasti and Angel Bastidas