Topic modeling remains an essential tool in the AI and NLP toolbox. While large language models (LLMs) handle text exceptionally well, extracting high-level topics from huge datasets still requires dedicated topic modeling techniques. A typical workflow consists of four core steps: embedding, dimensionality reduction, clustering, and topic representation.
One of the most popular frameworks today is BERTopic, which simplifies every stage with modular components and an intuitive API. In this post, I'll walk through practical adjustments you can make to improve clustering results and boost interpretability, based on hands-on experiments using the open-source 20 Newsgroups dataset, which is distributed under the Creative Commons Attribution 4.0 International license.
Project Overview
We'll start with the default settings recommended in BERTopic's documentation and progressively update specific configurations to highlight their effects. Along the way, I'll explain the purpose of each module and how to make informed decisions when customizing them.
Dataset Preparation
We load a sample of 500 news documents.
import random
from datasets import load_dataset

dataset = load_dataset("SetFit/20_newsgroups")
random.seed(42)
text_label = list(zip(dataset["train"]["text"], dataset["train"]["label_text"]))
text_label_500 = random.sample(text_label, 500)
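As a quick sanity check (not part of the original pipeline), you can peek at how the newsgroup labels are distributed in the 500-document sample:

from collections import Counter

# Most common newsgroup labels in our random sample
print(Counter(label for _, label in text_label_500).most_common(5))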
Because the data originates from informal Usenet discussions, we apply cleaning steps to strip headers, remove clutter, and preserve only informative sentences.
This preprocessing ensures higher-quality embeddings and a smoother downstream clustering process.
import re

def clean_for_embedding(text, max_sentences=5):
    # Drop quoted reply lines and common Usenet header fields
    lines = text.split("\n")
    lines = [line for line in lines if not line.strip().startswith(">")]
    lines = [
        line for line in lines
        if not re.match(r"^\s*(from|subject|organization|lines|writes|article)\s*:", line, re.IGNORECASE)
    ]
    text = " ".join(lines)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"[!?]{3,}", "", text)
    # Keep only reasonably long, non-shouting sentences
    sentence_split = re.split(r"(?<=[.!?]) +", text)
    sentence_split = [
        s for s in sentence_split
        if len(s.strip()) > 15 and not s.strip().isupper()
    ]
    return " ".join(sentence_split[:max_sentences])

texts_clean = [clean_for_embedding(text) for text, _ in text_label_500]
labels = [label for _, label in text_label_500]
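To verify the cleaning behaves as intended, a small check comparing one raw document against its cleaned counterpart can help (index 0 is an arbitrary choice):

# Compare a raw document with its cleaned version
raw_text, _ = text_label_500[0]
print(raw_text[:300])
print("-" * 40)
print(texts_clean[0][:300])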
Initial BERTopic Pipeline
Using BERTopic's modular design, we configure each component: SentenceTransformer for embeddings, UMAP for dimensionality reduction, HDBSCAN for clustering, and CountVectorizer + KeyBERT for topic representation.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")
# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()
# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
    embedding_model=embedding_model,            # Step 1 - Extract embeddings
    umap_model=umap_model,                      # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,                # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,          # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                  # Step 5 - Extract topic words
    representation_model=representation_model   # Step 6 - (Optional) Fine-tune topic representations
)

topics, probs = topic_model.fit_transform(texts_clean)
This setup yields only a few broad topics with noisy representations. This result highlights the need for fine-tuning to achieve more coherent results.
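You can see this for yourself by inspecting the fitted model; topic -1 is HDBSCAN's outlier bucket, and only a handful of broad topics remain:

# Overview of discovered topics; Topic -1 collects outlier documents
print(topic_model.get_topic_info().head())
# Top terms of the largest real topic
print(topic_model.get_topic(0))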
Parameter Tuning for Granular Topics
n_neighbors from the UMAP module
UMAP is the dimensionality reduction module that compresses the original embeddings into lower-dimensional dense vectors. Adjusting UMAP's n_neighbors controls how locally or globally the data is interpreted during dimensionality reduction. Lowering this value uncovers finer-grained clusters and improves topic distinctiveness.
umap_model_new = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model.umap_model = umap_model_new
topics, probs = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()

min_cluster_size and cluster_selection_method from the HDBSCAN module
HDBSCAN is the default clustering module in BERTopic. Lowering HDBSCAN's min_cluster_size and switching cluster_selection_method from "eom" to "leaf" further sharpens topic resolution. These settings help uncover smaller, more focused themes and balance the distribution across clusters.
hdbscan_model_leaf = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='leaf', prediction_data=True)
topic_model.hdbscan_model = hdbscan_model_leaf
topics, _ = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()
The number of clusters increases to 30 after setting cluster_selection_method to "leaf" and min_cluster_size to 5.
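A quick way to confirm the topic count and outlier share after refitting (a small helper, not from the original post):

info = topic_model.get_topic_info()
n_topics = (info.Topic != -1).sum()
n_outliers = info.loc[info.Topic == -1, "Count"].sum()  # 0 if no outlier row exists
print(f"{n_topics} topics, {n_outliers} outlier documents")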

Controlling Randomness for Reproducibility
UMAP is inherently non-deterministic, meaning it can produce different results on each run unless you explicitly set a fixed random_state. This detail is often omitted in example code, so be sure to include it to ensure reproducibility.
Similarly, if you're using a third-party embedding API (like OpenAI), be cautious: some APIs introduce slight variations on repeated calls. For reproducible outputs, cache embeddings and feed them directly into BERTopic.
from bertopic.backend import BaseEmbedder
import numpy as np

class CustomEmbedder(BaseEmbedder):
    """Lightweight wrapper to call NVIDIA's embedding endpoint via the OpenAI SDK."""
    def __init__(self, embedding_model, client):
        super().__init__()
        self.embedding_model = embedding_model
        self.client = client

    def encode(self, documents):  # type: ignore[override]
        response = self.client.embeddings.create(
            input=documents,
            model=self.embedding_model,
            encoding_format="float",
            extra_body={"input_type": "passage", "truncate": "NONE"},
        )
        embeddings = np.array([embed.embedding for embed in response.data])
        return embeddings

# The model name and client are placeholders for whichever endpoint you use
topic_model.embedding_model = CustomEmbedder(embedding_model="<embedding-model-name>", client=client)
# `embeddings` holds the precomputed array (see the caching sketch below)
topics, probs = topic_model.fit_transform(texts_clean, embeddings=embeddings)
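Here is a minimal caching sketch, assuming you embed locally with the SentenceTransformer from earlier (the cache filename is hypothetical): compute the embeddings once, persist them with NumPy, and pass the stored array to fit_transform so repeated runs see identical inputs.

from pathlib import Path
import numpy as np

cache_path = Path("embeddings_20ng.npy")  # hypothetical cache location
if cache_path.exists():
    embeddings = np.load(cache_path)
else:
    embeddings = embedding_model.encode(texts_clean, show_progress_bar=True)
    np.save(cache_path, embeddings)

topics, probs = topic_model.fit_transform(texts_clean, embeddings=embeddings)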
Every dataset domain may require different clustering settings for optimal results. To streamline experimentation, consider defining evaluation criteria and automating the tuning process. For this tutorial, we'll use the cluster configuration that sets n_neighbors to 5, min_cluster_size to 5, and cluster_selection_method to "eom", a combination that strikes a balance between granularity and coherence.
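One way to automate this is sketched below, under the assumption that a low outlier share and a sensible topic count are your criteria; the parameter grid is illustrative, and you can swap in topic coherence or any domain-specific metric instead.

from itertools import product

results = []
for n_neighbors, min_cluster_size in product([5, 10, 15], [5, 15]):
    umap_candidate = UMAP(n_neighbors=n_neighbors, n_components=5, min_dist=0.0,
                          metric="cosine", random_state=42)
    hdbscan_candidate = HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean",
                                cluster_selection_method="eom", prediction_data=True)
    candidate = BERTopic(embedding_model=embedding_model, umap_model=umap_candidate,
                         hdbscan_model=hdbscan_candidate, vectorizer_model=vectorizer_model)
    topics_c, _ = candidate.fit_transform(texts_clean)
    # Score each configuration by topic count and share of outlier documents
    n_topics = len(set(topics_c)) - (1 if -1 in topics_c else 0)
    outlier_share = topics_c.count(-1) / len(topics_c)
    results.append((n_neighbors, min_cluster_size, n_topics, round(outlier_share, 3)))

# Rank configurations by outlier share (lower is better)
for row in sorted(results, key=lambda r: r[3]):
    print(row)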
Improving Topic Representations
Representation plays a crucial role in making clusters interpretable. By default, BERTopic generates unigram-based representations, which often lack sufficient context. In the next section, we'll explore several techniques to enrich these representations and improve topic interpretability.
Ngram
n-gram range
In BERTopic, CountVectorizer is the default tool for converting text data into bag-of-words representations. Instead of relying on generic unigrams, switch to bigrams or trigrams using ngram_range in CountVectorizer. This simple change adds much-needed context.
Since we're only updating the representation, BERTopic offers the update_topics function to avoid redoing the modeling from scratch.
topic_model.update_topics(texts_clean, vectorizer_model=CountVectorizer(stop_words="english", ngram_range=(2, 3)))
topic_model.get_topic_info()
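To judge the effect on a single topic, you can inspect its updated representation; each term should now be a bigram or trigram (topic 0 is an arbitrary example):

# Top n-gram terms and their c-TF-IDF scores for topic 0
print(topic_model.get_topic(0))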

Custom Tokenizer
Some bigrams are still hard to interpret, e.g. 486dx 50, ac uk, dxf doc, ... For greater control, implement a custom tokenizer that filters n-grams based on part-of-speech patterns. This removes meaningless combinations and elevates the quality of your topic keywords.
import spacy
from typing import List

class ImprovedTokenizer:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
        # Keep only the most meaningful syntactic bigram patterns
        self.MEANINGFUL_BIGRAMS = {
            ("ADJ", "NOUN"),
            ("NOUN", "NOUN"),
            ("VERB", "NOUN"),
        }

    def __call__(self, text: str, max_tokens=200) -> List[str]:
        doc = self.nlp(text[:3000])  # truncate long docs for speed
        tokens = [(t.text, t.lemma_.lower(), t.pos_) for t in doc if t.is_alpha]
        bigrams = []
        for i in range(len(tokens) - 1):
            word1, lemma1, pos1 = tokens[i]
            word2, lemma2, pos2 = tokens[i + 1]
            if (pos1, pos2) in self.MEANINGFUL_BIGRAMS:
                # Lowercased lemmas normalize surface variation
                bigrams.append(f"{lemma1} {lemma2}")
        return bigrams

topic_model.update_topics(docs=texts_clean, vectorizer_model=CountVectorizer(tokenizer=ImprovedTokenizer()))
topic_model.get_topic_info()

LLM
Finally, you can integrate LLMs to generate coherent titles or summaries for each topic. BERTopic supports OpenAI integration directly or through custom prompting. These LLM-based summaries greatly improve explainability.
import os
import openai
from bertopic.representation import OpenAI

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# chat=True is needed for chat-completion models such as gpt-4o-mini
topic_model.update_topics(texts_clean, representation_model=OpenAI(client, model="gpt-4o-mini", chat=True, delay_in_seconds=5))
topic_model.get_topic_info()
The representations are now all meaningful sentences.
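BERTopic's OpenAI representation also accepts a custom prompt, in which [KEYWORDS] and [DOCUMENTS] are placeholders that BERTopic fills in per topic; the prompt wording below is my own sketch:

# [KEYWORDS] and [DOCUMENTS] are replaced per topic by BERTopic
custom_prompt = """
I have a topic described by the following keywords: [KEYWORDS]
The topic contains these representative documents: [DOCUMENTS]
Return a short, human-readable label for this topic.
"""
representation_llm = OpenAI(client, model="gpt-4o-mini", prompt=custom_prompt, chat=True, delay_in_seconds=5)
topic_model.update_topics(texts_clean, representation_model=representation_llm)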

You can also write your own function to obtain LLM-generated titles and write them back to the topic model object using the set_topic_labels function. Please refer to the example code snippet below.
import os
import openai
from tqdm import tqdm
from typing import Dict, List, Tuple

def generate_topic_titles_with_llm(
    topic_model,
    docs: List[str],
    api_key: str,
    model: str = "gpt-4o"
) -> Dict[int, Tuple[str, str]]:
    client = openai.OpenAI(api_key=api_key)
    topic_info = topic_model.get_topic_info()
    topic_repr = {}
    topics = topic_info[topic_info.Topic != -1].Topic.tolist()
    for topic in tqdm(topics, desc="Generating titles"):
        # Use the first document assigned to this topic as its representative
        indices = [i for i, t in enumerate(topic_model.topics_) if t == topic]
        if not indices:
            continue
        top_doc = docs[indices[0]]
        prompt = f"""You are a helpful summarizer for topic clustering.
Given the following text that represents a topic, generate:
1. A short **title** for the topic (2-6 words)
2. A one or two sentence **summary** of the topic.
Text:
\"\"\"
{top_doc}
\"\"\"
"""
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant for summarizing topics."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.5
            )
            output = response.choices[0].message.content.strip()
            lines = output.split("\n")
            title = lines[0].replace("Title:", "").strip()
            summary = lines[1].replace("Summary:", "").strip() if len(lines) > 1 else ""
            topic_repr[topic] = (title, summary)
        except Exception as e:
            print(f"Error with topic {topic}: {e}")
            topic_repr[topic] = ("[Error]", str(e))
    return topic_repr

topic_repr = generate_topic_titles_with_llm(topic_model, texts_clean, os.environ["OPENAI_API_KEY"])
# set_topic_labels expects one string per topic, so keep only the generated title
topic_repr_dict = {
    topic: topic_repr.get(topic, ("Topic", ""))[0]
    for topic in topic_model.get_topic_info()["Topic"]
}
topic_model.set_topic_labels(topic_repr_dict)
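Once set, the custom labels live alongside the default names; assuming a recent BERTopic version, they appear in the CustomName column of get_topic_info() and can be switched on in visualizations:

# Custom labels appear in the CustomName column
print(topic_model.get_topic_info()[["Topic", "Name", "CustomName"]].head())
# Most visualizations accept custom_labels=True to display them
topic_model.visualize_barchart(custom_labels=True)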
Conclusion
This guide outlined actionable strategies to boost topic modeling results using BERTopic. By understanding the role of each module and tuning parameters for your specific domain, you can achieve more focused, stable, and interpretable topics.
Representation matters just as much as clustering. Whether it's through n-grams, syntactic filtering, or LLMs, investing in better representations makes your topics easier to understand and more useful in practice.
BERTopic also offers advanced modeling techniques beyond the basics covered here. In a future post, we'll explore these capabilities in depth. Stay tuned!