Discovering Topic Modeling with NMF

Stella Cherotich
6 min read · Sep 12, 2023



Picture this: you’re lost in a digital forest of texts, drowning in a sea of words, desperately seeking insights. Or perhaps you’ve been tasked with extracting insights from a plethora of customer reviews, but you aren’t sure where to begin. Don’t worry! In this blog post, I will walk you through how Non-Negative Matrix Factorisation (NMF) can be used to carry out topic modeling.

With the vast amounts of text data available to us today, understanding the underlying themes and patterns, and turning those insights into data-informed decisions, has become more crucial than ever in the world of data science.

Understanding Topic Modeling

Before we dive into NMF’s applications, let’s grasp the essence of topic modeling. At its core, topic modeling is a statistical technique used to identify and extract underlying themes from a collection of text documents. It’s like a magnifying glass for your text data, revealing recurring themes, hidden patterns, and overarching ideas within the vast sea of words.

Topic modeling finds relevance across various domains, from content organization and sentiment analysis to market research and academic exploration. While there are other methods like Latent Dirichlet Allocation (LDA) for this purpose, NMF offers unique advantages.

Benefits of NMF Over Other Methods

So, what sets Non-Negative Matrix Factorization apart? Here are some key benefits:

  1. Interpretability: NMF enforces non-negativity, resulting in topics that are easier to understand, as negative values often hold less meaning in real-world contexts.
  2. Simplicity: NMF’s concept of representing topics as combinations of non-negative terms is intuitive and straightforward, making it user-friendly.
  3. Versatility: NMF adapts to various data types, including text and images, while some other methods are more specialized.
  4. Localised Topics: NMF excels at discovering specific topics within documents, making it suitable for fine-grained analysis where precise topic identification is essential.
  5. Sparse Data Handling: NMF effectively manages datasets with many zero entries, common in text analysis, without losing meaningful information.
  6. Efficiency: NMF is often cheaper to fit than probabilistic methods such as LDA. You still have to choose the number of topics up front (as we’ll see in Step 1 below), but the lower cost makes it practical to experiment with several topic counts during exploratory analysis.

Non-Negative Matrix Factorisation (NMF)

I’ve talked about NMF quite a bit already, but what is it really? Non-Negative Matrix Factorization (NMF) is the art of simplifying complex data, especially in text analysis. It’s like breaking down a big puzzle into smaller, non-negative pieces. In the text world, it takes a document-term matrix and splits it into two smaller matrices: W, which tells you how strongly each document expresses each topic, and H, which tells you how strongly each word belongs to each topic, all while ensuring that every number is non-negative.

Now, why the non-negativity rule? It’s there to keep things simple and interpretable. Imagine mixing paint — you start with primary colours, and it’s easy to understand how they blend to create new shades. NMF works similarly, finding additive combinations that reveal topics in a straightforward way.

Mathematically, NMF uses iterative algorithms to adjust W and H until they best recreate the original data. Think of it as assembling a jigsaw puzzle. This process uncovers hidden topics in your text data, making NMF a valuable tool for text analysis, as we’ll see in real-world examples ahead.
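To make the factorisation concrete, here is a minimal sketch using scikit-learn’s NMF on a tiny toy matrix. The matrix and its shapes are invented purely for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# A toy "document-term" matrix V: 4 documents x 6 terms,
# holding counts of how often each term appears in each document.
V = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 2],
], dtype=float)

# Factorise V ~= W @ H with k = 2 topics, all entries non-negative.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(V)   # (4 docs x 2 topics): topic weights per document
H = model.components_        # (2 topics x 6 terms): term weights per topic

print(W.shape, H.shape)                    # (4, 2) (2, 6)
print(bool((W >= 0).all() and (H >= 0).all()))  # True: non-negativity holds
```

Multiplying W and H back together approximately recreates V, which is exactly the “jigsaw puzzle” reassembly described above.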

Preparing Your Text Data

In data science, you’re likely familiar with “Garbage In, Garbage Out.” This underscores the importance of data preprocessing. Without cleaning and structuring your text data, the magic of topic modeling could be lost.

While the steps that you’d carry out would depend on the nature of your text data, one of the first steps in data preprocessing is tokenization. This process breaks down the text into individual words or tokens, much like dividing a sentence into its constituent words. These tokens are the building blocks of our analysis.
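As a sketch, tokenization can be as simple as lowercasing the text and pulling out word-like chunks with a regular expression (real projects often use NLTK or spaCy tokenizers instead):

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

print(tokenize("The flight was delayed, but the crew were helpful!"))
# ['the', 'flight', 'was', 'delayed', 'but', 'the', 'crew', 'were', 'helpful']
```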

Next, we often perform stop word removal, which involves eliminating common words like “the,” “and,” or “is” that don’t carry much meaning. Removing these words helps us focus on the more important ones that define the topics.
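Stop word removal is then just a set-membership filter. The tiny stop-word list below is illustrative; in practice you would use a fuller list, such as scikit-learn’s built-in "english" one:

```python
# A deliberately small stop-word list for illustration only.
STOP_WORDS = {"the", "and", "is", "was", "but", "a", "an", "of", "to", "were"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["the", "flight", "was", "delayed", "but", "the", "crew", "were", "helpful"]
print(remove_stop_words(tokens))  # ['flight', 'delayed', 'crew', 'helpful']
```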

To further refine our text, we can also employ stemming or lemmatization. These techniques reduce words to their root forms, so variations like “running” and “ran” become “run.” This simplifies the data, ensuring that NMF doesn’t treat similar words as separate entities.
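The sketch below is a deliberately crude stemmer, just to show the idea of collapsing word variants; note that mapping an irregular form like “ran” to “run” actually requires a lemmatiser (e.g. spaCy), while a stemmer such as NLTK’s PorterStemmer handles regular suffixes:

```python
def crude_stem(word):
    """A toy stemmer: strip common suffixes, then collapse a doubled
    final consonant ('running' -> 'runn' -> 'run'). Real projects
    should use NLTK's PorterStemmer or spaCy lemmatisation instead."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 3 and word[-1] == word[-2]:
        word = word[:-1]
    return word

print([crude_stem(w) for w in ["running", "delayed", "flights"]])
# ['run', 'delay', 'flight']
```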

The crown jewel of data preprocessing in topic modeling is the creation of a document-term matrix. This matrix represents the frequency of words in documents, turning the text into a numerical form that NMF can work with. Each row in the matrix represents a document, each column a word, and the values indicate how often each word appears in each document. It’s like converting our words into a language that NMF can understand.

Implementing NMF For Topic Modeling

Now that we have cleaned our data and laid the groundwork, we can finally get to the fun part: implementing NMF.

Step 1: Choosing the Number of Topics (k)

One of the key decisions you’ll face is determining how many topics you want NMF to uncover, denoted as “k.” This decision is both an art and a science. Too few topics may oversimplify the underlying themes, while too many can lead to confusion. It’s often helpful to start with a range of values for “k” and iteratively refine it based on the quality and interpretability of the topics. Think of it as adjusting the lens on a camera until the picture is sharp and clear.
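One simple way to explore that range is to fit NMF for several candidate values of k and compare the reconstruction error; lower error means a closer fit, but the “best” k is usually judged together with how interpretable the topics are. The random matrix below stands in for a real document-term matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
dtm = rng.random((20, 30))  # stand-in for a real 20-doc x 30-term matrix

# Fit NMF for a range of topic counts and record how well each
# factorisation reconstructs the original matrix.
errors = []
for k in (2, 4, 6, 8):
    model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
    model.fit(dtm)
    errors.append(model.reconstruction_err_)
    print(k, round(model.reconstruction_err_, 3))
```

The error shrinks as k grows, so rather than chasing the minimum, look for the point where adding topics stops buying a meaningful improvement.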

Step 2: Fitting NMF to Your Document-Term Matrix

Python libraries like scikit-learn make the implementation of NMF remarkably accessible. You’ll initiate the process by feeding your document-term matrix into scikit-learn’s NMF module. Here, NMF performs its iterative magic, adjusting the matrices W and H until they best approximate the original data. It’s akin to finding the perfect combination of puzzle pieces that, when assembled, recreate the complete picture of your text corpus.

Step 3: Visualising the Topics

With NMF at work, it’s time to visualise the fruits of its labour. Visualisation is where the magic truly comes to life. You can represent the resulting topics as word clouds or bar charts, where the size or height of words indicates their importance within each topic. These visual representations provide an immediate and intuitive understanding of what each topic is about, making it easier for you to interpret and communicate the findings.

Practical Use Cases

The true potential of NMF shines brightly when we delve into its practical applications spanning diverse domains. My personal journey with NMF began during my data science internship at British Airways, where I honed my skills. However, it was when I sought to apply these newfound skills to a project closer to my heart that the remarkable utility of NMF came to the forefront.

My project revolved around conducting sentiment analysis on customer reviews for Kenya Airways. This real-world application allowed me to witness firsthand how NMF could unveil valuable insights from a sea of textual data. It enabled me to understand the sentiments of passengers and gather actionable feedback. This experience showcased the versatility of NMF, transcending industries and continents, and it exemplified its ability to empower decision-makers with actionable intelligence derived from the depths of text data. Furthermore, in the world of sentiment analysis, NMF plays a crucial role in understanding public opinion. By analysing customer reviews or social media posts, businesses can gauge sentiment about their products or services, providing insights to improve customer satisfaction.

Even in the realm of news and media, NMF shines as a tool for text classification. News agencies use it to automatically categorise articles into sections like “Politics,” “Technology,” or “Sports.” NMF discerns the underlying themes and assigns articles to the appropriate category, streamlining content management.

Conclusion

In summary, our exploration of topic modeling using Non-Negative Matrix Factorization (NMF) has demonstrated its practical utility in extracting valuable insights from text data. We’ve emphasized the crucial role of data preprocessing as the foundation, much like preparing a canvas before the artist’s work begins. NMF, acting as the artist’s brush, shapes these insights, revealing the essence of our data.

NMF’s strengths in topic modeling lie in its simplicity, interpretability, and versatility across various fields, from e-commerce to healthcare. It serves as a reliable tool for extracting valuable information from text.

I encourage you to embark on your own NMF journey. Dive deeper, explore advanced techniques, and apply NMF to your text analysis projects. If you seek further insights or wish to connect, please don’t hesitate to reach out to me on LinkedIn. With NMF as your ally, the realm of hidden knowledge within text awaits your exploration. Happy hacking!


Stella Cherotich

I like staring into numbers until they whisper their secrets 🤭 Connect with me on LinkedIn - www.linkedin.com/in/stella-cherotich/