are used in businesses to categorize brand-related text datasets (such as product and website reviews, surveys, and social media comments) and to track how customer satisfaction metrics change over time.
There is a myriad of recent topic models to choose from: the widely used BERTopic by Maarten Grootendorst (2022), the recent FASTopic presented at last year's NeurIPS (Xiaobao Wu et al., 2024), the Dynamic Topic Model by Blei and Lafferty (2006), or a fresh semi-supervised Seeded Poisson Factorization model (Prostmaier et al., 2025).
For a business use case, training topic models on customer texts, we often get results that are not identical and sometimes even conflicting. In business, imperfections cost money, so engineers should put into production the model that provides the best solution and solves the problem most effectively. At the same pace that new topic models appear on the market, methods for evaluating their quality with new metrics also evolve.
This hands-on tutorial focuses on bigram topic models, which provide more relevant information and identify key qualities and problems for business decisions better than single-word models (“delivery” vs. “poor delivery”, “stomach” vs. “sensitive stomach”, etc.). On the one hand, bigram models are more detailed; on the other, many evaluation metrics were not originally designed to evaluate them. To provide more background in this area, we will explore in detail:
- How to evaluate the quality of bigram topic models
- How to prepare an email classification pipeline in Python.
Our example use case will show how bigram topic models (BERTopic and FASTopic) help prioritize email communication with customers on certain topics and reduce response times.
1. What are topic model quality indicators?
The evaluation task should target the ideal state:
An ideal topic model should produce topics where the words or bigrams (two consecutive words) in each topic are highly semantically related, and the topics are distinct from one another.
In practice, this means that the words predicted for each topic appear semantically similar to a human judge, and there is little duplication of words between topics.
It is standard practice to calculate a set of metrics for each trained model and compare the models' performance, so we can make a qualified decision about which model to put into production or use for a business decision.
- Coherence metrics evaluate how well the words discovered by a topic model make sense to humans (i.e., have similar semantics within each topic).
- Topic diversity measures how different the discovered topics are from one another.
Bigram topic models work well with these metrics:
- NPMI (Normalized Point-wise Mutual Information) uses probabilities estimated in a reference corpus to calculate a score in [-1, 1] for each word (or bigram) predicted by the model. Read [1] for more details.
The reference corpus can be either internal (the training set) or external (e.g., an external email dataset). A large, external, and comparable corpus is a better choice because it can help reduce bias in training sets. Because this metric works with word frequencies, the training set and the reference corpus should be preprocessed the same way (i.e., if we remove numbers and stopwords in the training set, we should also do it in the reference corpus). The aggregate model score is the average of the word scores across topics.
- SC (Semantic Coherence) does not need a reference corpus. It uses the same dataset that was used to train the topic model. Read more in [2].
Let's say we have the top 4 words for one topic, “apple”, “banana”, “juice”, “smoothie”, predicted by a topic model. SC then looks at all combinations of words in the training set going from left to right, starting with the first word: {apple, banana}, {apple, juice}, {apple, smoothie}, then the second word: {banana, juice}, {banana, smoothie}, then the last word: {juice, smoothie}. For each pair, it counts the number of documents that contain both words, divided by the number of documents that contain the first word (see the toy sketch after this list). The overall SC score for a model is the mean of all topic-level scores.
- PUV (Proportion of Unique Words) calculates the share of unique words (or bigrams) across the topics in the model. PUV = 1 means that each topic in the model contains unique bigrams. Values close to 1 indicate a well-shaped, high-quality model with little word overlap between topics [3].
The higher the SC and NPMI scores, the more coherent the model is (the bigrams predicted by the topic model for each topic are semantically similar). The closer PUV is to 1, the easier the model is to interpret and use, because bigrams do not overlap between topics.
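To make the SC counting concrete, here is a toy check of the definition above in Python. The three tiny documents are invented purely for illustration:

```python
# Toy corpus (invented) and the example topic's top 4 words from above.
docs = [
    "apple banana smoothie",
    "banana juice",
    "apple juice smoothie",
]
top_words = ["apple", "banana", "juice", "smoothie"]

def docs_with(word: str) -> set[int]:
    """Indices of documents that contain the word."""
    return {i for i, doc in enumerate(docs) if word in doc.split()}

# Left-to-right word pairs: {apple, banana}, {apple, juice}, ... {juice, smoothie}.
pairs = [(w1, w2) for i, w1 in enumerate(top_words) for w2 in top_words[i + 1:]]
# Documents containing both words, divided by documents containing the first word.
scores = [len(docs_with(w1) & docs_with(w2)) / len(docs_with(w1)) for w1, w2 in pairs]
print(f"Topic-level SC: {sum(scores) / len(scores):.2f}")  # 0.58 for this toy corpus
```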
2. How can we prioritize email communication with topic models?
A large share of customer communication, not only in e-commerce businesses, is now handled by chatbots and personal client sections. Yet it is still common to communicate with customers by email. Many email providers offer developers broad flexibility in their APIs to customize the email platform (e.g., MailChimp, SendGrid, Brevo). Here, topic models make mailing more flexible and effective.
In this use case, the pipeline takes the incoming emails as input and uses the trained topic classifier to categorize the incoming email content. The output is the classified topic that the Customer Care (CC) department sees next to each email. The main objective is to allow the CC staff to prioritize the categories of emails and reduce the response time to the most sensitive requests (those that directly affect margin-related KPIs or OKRs).
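As a rough illustration, the classification step of such a pipeline could look like the sketch below. `classify_email` and the `TOPIC_LABELS` mapping are hypothetical names; the trained BERTopic model comes from the training step described in Section 3.

```python
from bertopic import BERTopic

# Hypothetical label mapping maintained by the Customer Care team.
TOPIC_LABELS = {0: "Time delays", 1: "Latency issues", -1: "Unclassified"}

def classify_email(model: BERTopic, email_body: str) -> str:
    """Predict a topic for one incoming email and map it to a readable label."""
    topics, _ = model.transform([email_body])
    return TOPIC_LABELS.get(topics[0], "Unclassified")
```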

3. Data and model set-ups
We will train FASTopic and BERTopic to classify emails into 8 and 10 topics and evaluate the quality of all model specifications. Read my previous TDS tutorials on topic modeling with these cutting-edge topic models.
As a training set, we use a synthetically generated Customer Care Email dataset available on Kaggle with a GPL-3 license. The prefiltered data covers 692 incoming emails and looks like this:

3.1. Data preprocessing
Cleaning text in the right order is essential for topic models to work in practice because it minimizes the bias introduced by each cleaning operation.
Numbers are typically removed first, followed by emojis, unless we need them for special situations such as extracting sentiment. Stopwords for one or more languages are removed afterward, followed by punctuation, so that stopwords don't split into two tokens (“we've” -> “we” + “ve”). Additional tokens (company and people's names, etc.) are removed in the next step, on the clean data, before lemmatization, which unifies tokens with the same semantics.

“Delivery” and “deliveries”, “box” and “Boxes”, or “Price” and “prices” share the same word root, but without lemmatization, topic models would model them as separate tokens. That's why customer emails should be lemmatized in the last step of preprocessing.
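Putting these steps together, a minimal preprocessing function could look like the following sketch. It uses NLTK; the regexes and the custom-token list are illustrative assumptions, not the exact cleaning used in the tutorial repo.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
CUSTOM_TOKENS = {"acme", "john"}  # illustrative: company and people's names
LEMMATIZER = WordNetLemmatizer()
PUNCT = re.escape(string.punctuation)

def preprocess(text: str) -> str:
    text = re.sub(r"\d+", "", text)                                   # 1. numbers
    text = re.sub(rf"[^\w\s{PUNCT}]", "", text)                       # 2. emojis/symbols
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]  # 3. stopwords
    tokens = [t.strip(string.punctuation) for t in tokens]            # 4. punctuation
    tokens = [t for t in tokens if t and t not in CUSTOM_TOKENS]      # 5. extra tokens
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)          # 6. lemmatize
```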
Text preprocessing is model-specific:
- FASTopic works with clean data on input; some cleaning (stopwords) can be done during training. The simplest and most effective approach is to use the Washer, a no-code app for text data preprocessing in text mining projects.
- BERTopic: the documentation recommends that “removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context to create accurate embeddings”. For this reason, cleaning operations should be included in the model training.
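Following that advice, stopword removal and bigram extraction can be pushed into BERTopic's topic-representation step via a `CountVectorizer`, so the embeddings still see the full context. A minimal sketch:

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(
    ngram_range=(2, 2),    # represent topics with bigrams
    stop_words="english",  # remove stopwords after embedding, not before
)
topic_model = BERTopic(vectorizer_model=vectorizer_model, nr_topics=8)
```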
3.2. Model compilation and training
You can check the full code for FASTopic and BERTopic training with bigram preprocessing and cleaning in this repo. My previous TDS tutorials [4] and [5] explain all steps in detail.
We train both models to classify 8 topics in the customer email data. A simple inspection of the topic distribution shows that FASTopic distributes incoming emails quite evenly across topics. BERTopic classifies emails unevenly, keeping outliers (uncategorized emails) in T-1 and a large share of incoming emails in T0.
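For orientation, here is a condensed training sketch under stated assumptions: `docs` holds the preprocessed email bodies, `topic_model` is the BERTopic instance built above, and the `fastopic` package follows its documented API (see the repo for the full code).

```python
import pandas as pd
from fastopic import FASTopic

# FASTopic: train and get the topic-word lists plus the document-topic matrix.
fastopic_model = FASTopic(8)  # 8 topics
topic_top_words, doc_topic_dist = fastopic_model.fit_transform(docs)

# BERTopic: assigns one topic per document (T-1 collects outliers).
topics, probs = topic_model.fit_transform(docs)

# Inspect how evenly each model spreads the incoming emails across topics.
print(pd.Series(topics).value_counts(normalize=True))  # BERTopic
print(pd.DataFrame(doc_topic_dist).idxmax(axis=1).value_counts(normalize=True))  # FASTopic
```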

Here are the predicted bigrams for both models with topic labels:


Because the email corpus is a synthetic LLM-generated dataset, the naive labelling of the topics for both models shows topics that are:
- Similar: Time Delays, Latency Issues, User Permissions, Deployment Issues, Compilation Errors,
- Differing: Unclassified (BERTopic classifies outliers into T-1), Improvement Suggestions, Authorization Errors, Performance Complaints (FASTopic), Cloud Management, Asynchronous Requests, General Requests (BERTopic)
For business purposes, topics should be labelled by the company's insiders who know the customer base and the business priorities.
4. Model evaluation
If three out of eight classified topics are labeled differently, which model should be deployed? Let's now evaluate the coherence and diversity of the trained BERTopic and FASTopic 8-topic models.
4.1. NPMI
We need a reference corpus to calculate an NPMI for each model. The Customer IT Support Ticket Dataset from Kaggle, distributed under the Attribution 4.0 International license, provides data comparable to our training set. The data is filtered to 11,923 English email bodies.
1. Calculate an NPMI for each bigram in the reference corpus (see the sketch after this list).
2. Merge the bigrams predicted by FASTopic and BERTopic with their NPMI scores from the reference corpus. The fewer NaNs in the table, the more accurate the metric.

3. Average the NPMIs within and across topics to get a single score for each model.
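A minimal sketch of steps 1 and 2, assuming `reference_docs` holds the preprocessed reference-corpus email bodies and that NPMI is estimated from corpus-wide bigram and unigram frequencies:

```python
import math
from collections import Counter

import pandas as pd

# Count unigram and adjacent-bigram frequencies in the reference corpus.
unigrams, bigrams = Counter(), Counter()
for doc in reference_docs:
    tokens = doc.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def npmi(w1: str, w2: str) -> float:
    """NPMI = log(P(w1, w2) / (P(w1) * P(w2))) / -log(P(w1, w2)), in [-1, 1]."""
    p_joint = bigrams[(w1, w2)] / n_bi
    p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p_joint / (p1 * p2)) / -math.log(p_joint)

reference_npmi = pd.DataFrame(
    [{"bigram": f"{w1} {w2}", "npmi": npmi(w1, w2)} for (w1, w2) in bigrams]
)

# Step 2: left-join the model's predicted bigrams (a hypothetical `predicted`
# DataFrame with a "bigram" column) onto the reference scores; NaNs mark
# bigrams unseen in the reference corpus.
# scored = predicted.merge(reference_npmi, on="bigram", how="left")
```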
4.2. SC
With SC, we examine the context and semantic similarity of the bigrams predicted by a topic model by calculating their position in the corpus relative to other tokens. To do so, we:
1. Create a document-term matrix (DTM) with counts of how many times each bigram appears in each document.
2. Calculate topic-level SC scores by searching the DTM for co-occurrences of the bigrams predicted by the topic models.
3. Average the topic-level SC scores into a model-level SC score (see the sketch below).
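A minimal sketch of these steps, following the SC definition from Section 1 and assuming `docs` holds the training emails and `topics_bigrams` is a hypothetical dict mapping each topic to its top predicted bigrams:

```python
from itertools import combinations

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Document-bigram presence matrix (True where a bigram occurs in a document).
vectorizer = CountVectorizer(ngram_range=(2, 2))
dtm = (vectorizer.fit_transform(docs) > 0).toarray()
vocab = {term: i for i, term in enumerate(vectorizer.get_feature_names_out())}

def topic_sc(top_bigrams: list[str]) -> float:
    """Mean of D(first, second) / D(first) over left-to-right bigram pairs."""
    scores = []
    for first, second in combinations(top_bigrams, 2):
        if first in vocab and second in vocab:
            d_first = dtm[:, vocab[first]].sum()
            d_both = (dtm[:, vocab[first]] & dtm[:, vocab[second]]).sum()
            scores.append(d_both / d_first)
    return float(np.mean(scores)) if scores else np.nan

model_sc = np.nanmean([topic_sc(bigrams) for bigrams in topics_bigrams.values()])
```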
4.3. PUV
The topic diversity metric PUV checks for duplicate bigrams between topics in a model.
1. Join the bigrams into single tokens by replacing spaces with underscores in the FASTopic and BERTopic tables of predicted bigrams.

2. Calculate topic diversity as the count of distinct tokens divided by the total count of tokens in the tables, for both models (see the sketch below).
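A minimal PUV sketch, reusing the hypothetical `topics_bigrams` dict from the SC step:

```python
def puv(topics_bigrams: dict) -> float:
    """Share of unique bigram tokens across all topics of one model."""
    tokens = [
        bigram.replace(" ", "_")  # "poor delivery" -> "poor_delivery"
        for bigrams in topics_bigrams.values()
        for bigram in bigrams
    ]
    return len(set(tokens)) / len(tokens)  # 1.0 = no bigram shared between topics
```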
4.4. Model comparison
Let's now summarize the coherence and diversity evaluation in Image 9. The BERTopic models are more coherent but less diverse than FASTopic. The differences are not very large, but BERTopic suffers from an uneven distribution of incoming emails in the pipeline (see the charts in Image 5). Around 32% of classified emails fall into T0, and 15% into T-1, which covers the unclassified outliers. The models are trained with a minimum of 20 tokens per topic. Increasing this parameter makes the model unable to train, probably because of the small data size.
For this reason, FASTopic is the better choice for topic modelling in email classification with small training datasets.

The last step is to deploy the model with topic labels to the email platform to classify incoming emails:

Summary
Coherence and diversity metrics compare models with similar training setups: the same dataset and the same cleaning strategy. We cannot compare their absolute values with the results of other training sessions, but they help us decide on the best model for our specific use case. They offer a relative comparison of various model specifications and help us decide which model should be deployed in the pipeline. Topic model evaluation should always be the last step before model deployment in business practice.
How does customer care benefit from the topic modelling exercise? After the topic model is put into production, the pipeline sends a classified topic for each email to the email platform that Customer Care uses for communicating with customers. With limited staff, it is now possible to prioritize and respond faster to the most sensitive business requests (such as “time delays” and “latency issues”) and to change priorities dynamically.
Data and complete code for this tutorial are available here.
Petr Korab is a Python Engineer and Founder of Text Mining Stories with over eight years of experience in Business Intelligence and NLP.
Acknowledgments: I thank Tomáš Horský (Lentiamo, Prague), Martin Feldkircher, and Viktoriya Teliha (Vienna School of International Studies) for useful comments and suggestions.
References
[1] Blei, D. M., Lafferty, J. D. 2006. Dynamic Topic Models. In Proceedings of the 23rd International Conference on Machine Learning (pp. 113–120).
[2] Dieng, A. B., Ruiz, F. J. R., Blei, D. M. 2020. Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics, 8: 439–453.
[3] Grootendorst, M. 2022. BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv preprint: 2203.05794.
[4] Korab, P. Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code. Towards Data Science. 22.1.2025. Available from: link.
[5] Korab, P. Topic Modelling with BERTopic in Python. Towards Data Science. 4.1.2024. Available from: link.
[6] Wu, X., Nguyen, T., Ce Zhang, D., Yang Wang, W., Luu, A. T. 2024. FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. arXiv preprint: 2405.17978.
[7] Mimno, D., Wallach, H. M., Talley, E., Leenders, M., McCallum, A. 2011. Optimizing Semantic Coherence in Topic Models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.
[8] Prostmaier, B., Vávra, J., Grün, B., Hofmarcher, P. 2025. Seeded Poisson Factorization: Leveraging Domain Knowledge to Fit Topic Models. arXiv preprint: 2503.02741.