Topic Modeling

Use BERTopic to analyze comments, separate them into clusters and assign topic labels
Author

Sonny Bhatia

Published

March 12, 2024

We use the BERTopic library for topic modeling, leveraging transformer embeddings and c-TF-IDF statistics to divide the comments into separate topics. Comments are grouped based on their embeddings using UMAP and HDBSCAN: UMAP reduces the dimensionality of the text embedding vectors, which allows HDBSCAN to better identify high-density clusters in the data. Each cluster is then assigned a set of representative keywords based on its most significant words. Once labeled, the topics are assigned titles for identification and visualization. The topics are also organized into a hierarchy, which enables us to model relationships between different topics. The diagram and code sketch below summarize this pipeline.

graph TD
    Vectors[Embedding Vectors]
    Clusters[Hierarchical Clusters]
    BagOfWords["Bag of Words (Per-Cluster)"]
    Keywords[Significant Keywords]
    Labels[Topic Labels]
    Representation[Topic Representation]

    Statements -- Transformers --> Vectors
    Vectors -- UMAP + HDBSCAN --> Clusters
    Clusters -- Vectorizer --> BagOfWords
    BagOfWords -- c-TF-IDF --> Keywords
    Keywords -- MMR + POS Filtering --> Representation
    Representation -- LLM --> Labels
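
In code, this pipeline corresponds to the BERTopic components assembled over the rest of this notebook. The sketch below is only a preview: the parameter values shown are placeholders, and the actual values are selected and justified in the sections that follow.

# Minimal sketch of the pipeline in the diagram above (placeholder parameters).
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance, PartOfSpeech
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

sketch_model = BERTopic(
    umap_model=UMAP(n_components=32, metric='cosine'),        # embedding vectors -> reduced space
    hdbscan_model=HDBSCAN(min_cluster_size=10),                # reduced space -> hierarchical clusters
    vectorizer_model=CountVectorizer(stop_words='english'),    # clusters -> per-cluster bag of words
    ctfidf_model=ClassTfidfTransformer(),                      # bag of words -> significant keywords
    representation_model=[PartOfSpeech(), MaximalMarginalRelevance(diversity=0.3)],  # MMR + POS filtering
)
# topic_assignments, probabilities = sketch_model.fit_transform(documents, embeddings)
# Topic labels are then generated by a language model from each topic's representation.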

Import Packages and Setup Environment

Code
import os
from pprint import pprint
from tqdm.notebook import tqdm

import polars as pl
from argmap.dataModel import Summary, Comments, Topics, HierarchicalTopics

from dotenv import load_dotenv
load_dotenv()

# this allows categorical data from various sources to be combined and handled gracefully; performance cost is acceptable
pl.enable_string_cache()

EMBED_MODEL_ID = os.getenv("EMBED_MODEL_ID")
Code
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance, PartOfSpeech
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import numpy as np

from umap import UMAP
from hdbscan import HDBSCAN

import torch

# prefer GPU for spacy if available
if torch.cuda.is_available():
    spacy.prefer_gpu()
    print("CUDA available; spacy will prefer GPU")
CUDA available; spacy will prefer GPU

Verify GPU Availability

Code
from argmap.helpers import printCUDAMemory, getTorchDeviceVersion

print(getTorchDeviceVersion())

printCUDAMemory()
Device: Orin
Python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:26:55) [GCC 12.3.0]
PyTorch: 2.2.0a0+6a974be
CUDA: 12.2
CUDNN: 8904

CUDA Memory: 51.1 GB free, 0.0 GB allocated, 61.4 GB total

Load Language Model

Code
import os

from argmap.helpers import loadLanguageModel

CUDA_MINIMUM_MEMORY_GB = os.getenv("CUDA_MINIMUM_MEMORY_GB")
MODEL_ID = os.getenv("MODEL_ID")
MODEL_REVISION = os.getenv("MODEL_REVISION") or None

if MODEL_ID is None:
    raise Exception("MODEL_ID environment variable is required.")

if 'languageModel' not in globals():
    languageModel = loadLanguageModel(MODEL_ID, MODEL_REVISION, CUDA_MINIMUM_MEMORY_GB)
Initializing language model: TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ gptq-4bit-32g-actorder_True...
Language model initialized.
CUDA Memory: 51.5 GB free, 26.2 GB allocated, 79.1 GB total

Run Model to Generate Titles

Code
from tqdm.notebook import tqdm

import guidance
from guidance import user, assistant, instruction, gen, select
import re

from argmap.guidance import generate_phrase


@guidance
def generate_topic_headings_titles(lm, summary, topics, generate_titles=False, temperature=0, progress_bar=None):

    if progress_bar is not None:
        lm.echo = False

    # avoid repeating anything from conversation title
    taboo_words = re.split(r'\W+', summary.topic)

    with instruction():
        lm += f"""\
Assign a detailed title and a short heading to best represent each given topic.
Start with a noun or adjective.
Avoid repetitive words or phrases such as "Enhancing" or "Improving".
Avoid using these words: {', '.join(taboo_words)}

KEYWORDS: [a set of keywords that describe the topic]
STATEMENTS: [a set of statements that best represent the topic]
TITLE: [a descriptive sentence that represents the topic and starts with a noun]
HEADING: [terse phrase]
"""

    topic_titles = []
    topic_headings = []

    for topic, keywords, docs in topics.df.select('Topic', 'Representation', 'Representative_Docs').iter_rows():
        if topic == -1:
            topic_titles.append("Outliers")
            topic_headings.append("Outliers")
            progress_bar.update() if progress_bar is not None else None
            continue

        with user():
            lm_topic = lm + f"""
            # Topic {topic}
            KEYWORDS: {', '.join(keywords)}
            STATEMENTS: {'; '.join(docs)}
            """
        with assistant():
            if generate_titles:
                lm_topic += f"TITLE: " + generate_phrase('title', temperature, 50) + '\n'
            lm_topic += f"HEADING: " + generate_phrase('heading', temperature, 12) + '\n'

        if generate_titles:
            topic_titles.append(lm_topic['title'])

        topic_headings.append(lm_topic['heading'])

        progress_bar.update() if progress_bar is not None else None

    if generate_titles:
        topics.addColumns(pl.Series('Title', topic_titles))

    topics.addColumns(pl.Series('Heading', topic_headings))

    return lm

Statements and Embeddings

We load our dataset and the embeddings we computed during data ingestion. The embedding model is selected via the EMBED_MODEL_ID environment variable; the run shown here uses the WhereIsAI/UAE-Large-V1 model.

Code
from IPython.display import display_markdown

DATASET = "american-assembly.bowling-green"

comments = Comments(DATASET).load_from_parquet()
summary = Summary(DATASET)

comments_df = (
    comments.df
    .filter(pl.col('moderated') >= 0)
    .sort('agreeability')
    .select('commentId', 'commentText', embedding=f'embedding-{EMBED_MODEL_ID}')
)

documents = comments_df.get_column('commentText').to_list()
embeddings = comments_df.get_column('embedding').to_numpy()

# Convert list of numpy arrays to 2D numpy array
embeddings = np.array([np.array(embedding) for embedding in embeddings])

display_markdown(f"""
### Dataset: {DATASET}
#### {summary.topic}
#### {summary.get('conversation-description')}
#### Full Report: [{summary.url}]({summary.url})
#### Embedding Model: {EMBED_MODEL_ID}
#### Dimensions: {len(embeddings[0])}
""", raw=True)

Dataset: american-assembly.bowling-green

Improving Bowling Green / Warren County

What do you believe should change in Bowling Green/Warren County in order to make it a better place to live, work and spend time?

Full Report: https://pol.is/9wtchdmmun

Embedding Model: WhereIsAI/UAE-Large-V1

Dimensions: 1024

Dimensionality Reduction

We use Uniform Manifold Approximation and Projection (UMAP) to reduce the dimensionality of the text embedding vectors. Embeddings from the intfloat/e5-mistral-7b-instruct model have 4096 dimensions, while the WhereIsAI/UAE-Large-V1 model outputs 1024 dimensions. We employ a density-based clustering algorithm, and at such high dimensionality it is nearly impossible to identify high-density regions. Therefore we project these embeddings into a lower-dimensional space of fewer than 100 dimensions, as recommended in the official UMAP documentation. This significantly improves the efficiency of the clustering algorithm and allows it to better identify high-density clusters in the data.

In the following experiments, we attempt to optimize the following UMAP parameters:

  • n_neighbors
  • n_components
  • min_dist
  • metric
Code
import matplotlib.pyplot as plt
import umap.plot
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"

umap.plot.output_notebook()


def plot_umap_diagnostic(embeddings, diagnostic_type='pca', topics=None, ax=None, **kwargs):
    args = kwargs | {'n_components': 2}
    mapper = UMAP(**args).fit(embeddings)
    # labels = np.array(topics) if topics is not None else None
    # umap.plot.connectivity(mapper, ax=ax, show_points=True, labels=labels)
    umap.plot.diagnostic(mapper, diagnostic_type=diagnostic_type, ax=ax)

n_neighbors

This parameter constrains the size of local neighborhoods, balancing local detail against the global structure of the data. Higher values favor global structure. Clustering algorithms typically depend on the local relationships between points, so a lower value is preferred. We experiment with values \([2, 4, 8, 16, 32, 64]\). At n_neighbors = 2 and 4, the graphs fail to reveal any meaningful clusters. We select n_neighbors = 8, where the graph begins to take a more meaningful shape.

Code
# list powers of 2 up to quarter of the total number of points
n_neighbors_values = [2**i for i in range(1, int(np.log2(comments_df.height / 4)))]

num_plots = len(n_neighbors_values)
num_columns = 3
num_rows = num_plots // num_columns + (num_plots % num_columns > 0)

fig, axs = plt.subplots(num_rows, num_columns, figsize=(18, num_rows * 6))
axs = axs.flatten()  # Flatten the array to easily iterate over it

for i, n in enumerate(n_neighbors_values):
    plot_umap_diagnostic(embeddings, diagnostic_type='pca', ax=axs[i], n_neighbors=n, densmap=True, random_state=42)
    axs[i].set_title(f'n_neighbors: {n}')

plt.tight_layout()
plt.show()

min_dist

This parameter specifies how closely points can be packed together in the low-dimensional space. Higher values typically result in more evenly distributed points at the expense of local structure. Lower values allow clumping, which is less useful for visualization but more useful for clustering. We experiment with values \([0.0, 0.1, 0.25, 0.5, 0.8, 0.99]\) to demonstrate the effect of this parameter. We select min_dist = 0 to best preserve local structure for clustering.

Code
min_dist_values = [0.0, 0.1, 0.25, 0.5, 0.8, 0.99]

num_plots = len(min_dist_values)
num_columns = 3
num_rows = num_plots // num_columns + (num_plots % num_columns > 0)

fig, axs = plt.subplots(num_rows, num_columns, figsize=(18, num_rows * 6))
axs = axs.flatten()  # Flatten the array to easily iterate over it

for i, min_dist in enumerate(min_dist_values):
    plot_umap_diagnostic(embeddings, diagnostic_type='pca', ax=axs[i], min_dist=min_dist, n_neighbors=8, random_state=42)
    axs[i].set_title(f'min_dist: {min_dist}')

plt.tight_layout()
plt.show()

n_components

This parameter specifies the number of dimensions in the reduced space. Since the diagnostic plots are limited to two dimensions, in this experiment we first reduce the data to the selected n_components value and then reduce it further to 2D for visualization. We experiment with values \([8, 16, 32, 64, 128, 256]\). The UMAP documentation recommends setting this anywhere from \(10\) to \(100\); an excessively high value would defeat the purpose of dimensionality reduction. Either \(16\) or \(32\) is a reasonable choice. We set it to \(32\) to retain accuracy while still allowing HDBSCAN to identify high-density clusters.

Code
n_components_values = [8, 16, 32, 64, 128, 256]

num_plots = len(n_components_values)
num_columns = 3
num_rows = num_plots // num_columns + (num_plots % num_columns > 0)

fig, axs = plt.subplots(num_rows, num_columns, figsize=(18, num_rows * 6))
axs = axs.flatten()

for i, n in enumerate(n_components_values):
    reduced_embeddings = UMAP(n_components=n, n_neighbors=8, min_dist=0).fit(embeddings).embedding_
    plot_umap_diagnostic(reduced_embeddings, diagnostic_type='pca', ax=axs[i], n_neighbors=8, min_dist=0, densmap=True, random_state=42)
    axs[i].set_title(f'n_components: {n}')

plt.tight_layout()
plt.show()

metric

Finally, we test various distance metrics to determine the best one for our data. We experiment with the euclidean, manhattan, and cosine metrics. All three appear to perform similarly. However, since we are working in a high-dimensional space and our vectors may not be normalized, the cosine metric is best suited to capturing semantic similarity. We select the cosine metric for our data.

Code
metrics_values = ['euclidean', 'manhattan', 'cosine']

num_plots = len(metrics_values)
num_columns = 3
num_rows = num_plots // num_columns + (num_plots % num_columns > 0)

fig, axs = plt.subplots(num_rows, num_columns, figsize=(18, num_rows * 6))
axs = axs.flatten()

for i, metric in enumerate(metrics_values):
    plot_umap_diagnostic(embeddings, diagnostic_type='pca', ax=axs[i], metric=metric, n_neighbors=8, min_dist=0, densmap=True, random_state=42)
    axs[i].set_title(f'metric: {metric}')

plt.tight_layout()
plt.show()

Selected Parameters

  • n_neighbors: 8
  • min_dist: 0.0
  • n_components: 32
  • metric: cosine

Additionally, setting densmap=True significantly improves the clustering results. densMAP “computes estimates of the local density and uses those estimates as a regularizer in the optimization of the low dimensional representation”, which helps HDBSCAN identify clusters.

With these parameters, we generate a final projection of our embeddings dataset. The first figure shows the 2D projection of the embeddings resulting from the UMAP model. The second figure shows a network connectivity graph that highlights the distance between various points in a 2D space. The edge thickness represents the distance between the points. This graph shows a high-density core and several outliers, which is consistent with a scale-free network and is ideal for clustering.

Reference:

  • densMAP: Density-Preserving Data Visualization Unveils Dynamic Patterns of Single-Cell Transcriptomic Variability
Code
from sklearn.decomposition import PCA
import umap.plot

umap_params = dict(
    n_neighbors=8,
    min_dist=0.0,
    n_components=32,  # this is the new dimensionality
    metric='cosine',
    random_state=42,
    densmap=True,
)

umap_model = UMAP(**umap_params)
Code
def plot_umap_result(embeddings, umap_model):
    reduced_embeddings = umap_model.fit_transform(embeddings)

    # use PCA to represent the original embeddings in RGB color space while generally preserving global structure
    pca_result = PCA(n_components=3).fit_transform(embeddings)
    pca_color = (pca_result - np.min(pca_result, axis=0)) / np.ptp(pca_result, axis=0)

    # project the embeddings into a 2D space for visualization
    mapper = UMAP(**(umap_params | {'n_components': 2})).fit(reduced_embeddings)
    umap.plot.points(mapper, values=pca_color, width=1200, height=1200)
    umap.plot.connectivity(mapper, show_points=True, edge_bundling='hammer', width=1200, height=1200)

plot_umap_result(embeddings, umap_model)

Clustering

We use HDBSCAN to cluster the dimensionality-reduced text embedding vectors. HDBSCAN is a density-based clustering algorithm particularly well suited to data with noise and varying densities. We tune the clustering parameters by trial and error to optimize the following outputs:

  • Number of clusters / topics
  • Number of outliers

Future work will include comparing against OPTICS and DBSCAN and using the silhouette score to evaluate the quality of the clusters, as sketched below.
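
A minimal sketch of such a comparison is shown below. It assumes the UMAP-reduced embeddings are materialized as reduced_embeddings (in this notebook, BERTopic runs the UMAP step internally), and the DBSCAN and OPTICS parameter values are placeholders.

# Hedged sketch: compare density-based clusterers on the UMAP-reduced embeddings
# using the silhouette score, excluding points labeled as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.metrics import silhouette_score
from hdbscan import HDBSCAN

def evaluate_clusterer(clusterer, X):
    labels = clusterer.fit_predict(X)
    mask = labels != -1                          # ignore noise points when scoring
    n_clusters = len(set(labels[mask]))
    n_noise = int((labels == -1).sum())
    score = silhouette_score(X[mask], labels[mask]) if n_clusters > 1 else float('nan')
    return n_clusters, n_noise, score

reduced_embeddings = umap_model.fit_transform(embeddings)   # assumes umap_model, embeddings from above

for name, clusterer in [
    ('HDBSCAN', HDBSCAN(min_cluster_size=12, min_samples=1)),
    ('DBSCAN', DBSCAN(eps=0.5, min_samples=5)),              # eps is a placeholder and needs tuning
    ('OPTICS', OPTICS(min_samples=5)),
]:
    n_clusters, n_noise, score = evaluate_clusterer(clusterer, reduced_embeddings)
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points, silhouette={score:.3f}")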

Code
hdbscan_params = dict(
    min_cluster_size=comments_df.height // 50, # 2% of the statement count
    min_samples=1,  # a higher default value makes clustering more conservative
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True,
    gen_min_span_tree=True,
)

hdbscan_model = HDBSCAN(**hdbscan_params)

pprint(hdbscan_params)
{'cluster_selection_method': 'eom',
 'gen_min_span_tree': True,
 'metric': 'euclidean',
 'min_cluster_size': 12,
 'min_samples': 1,
 'prediction_data': True}

Vectorizer and c-TF-IDF

We use a count-based vectorizer to convert the text into a matrix of token counts, considering n-grams of up to two words. We then apply a class-based term frequency-inverse document frequency (c-TF-IDF) transformation to the count matrix.
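
For reference, the class-based TF-IDF weight that BERTopic assigns to a term \(t\) in cluster \(c\) is, per the BERTopic documentation,

\[
W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{f_t}\right)
\]

where \(\mathrm{tf}_{t,c}\) is the frequency of term \(t\) in cluster \(c\), \(f_t\) is the frequency of term \(t\) across all clusters, and \(A\) is the average number of words per cluster. The reduce_frequent_words=True option used below additionally dampens the weight of very frequent terms.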

Stop Words

The vectorizer uses a list of stop words to filter out common words that are unlikely to be useful for clustering. Stop words are high-frequency words that provide grammatical structure but add little meaning to any specific topic. We treat them as noise and remove them from the text before clustering.

We use Spacy’s English stop word list. Additionally, we add custom stop words taken from the discussion summary, since these apply to the entire discussion and are likely to appear in every topic without adding specific semantic value. For example, for the currently selected dataset the discussion summary is “Improving Bowling Green / Warren County”, and all of these words tend to appear in many statements. We add them to our stop word list to minimize noise and focus on signal.

This technique improves the signal-to-noise ratio (SNR) and allows us to focus on the most meaningful words in each statement.

Code
import re

summary_stop_words = re.split(r'\W+', summary.topic.lower())

summary_stop_words
['improving', 'bowling', 'green', 'warren', 'county']
Code
# TODO: consider using number of votes to prioritize more agreeable statements
vectorizer_model = CountVectorizer(stop_words=(list(STOP_WORDS) + summary_stop_words), ngram_range=(1, 2))

# TODO try bm25_weighting=True to reduce frequent words
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

Topic Representation

We use Spacy to apply a part-of-speech filter so that only nouns and adjective-noun pairs are included in the representative keywords. We then employ Maximal Marginal Relevance (MMR) to select the most representative keywords for each topic while maintaining diversity.
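
For reference, MMR iteratively picks the next keyword \(k\) from the candidate set \(C\) that balances relevance to the topic \(T\) against similarity to the keywords \(S\) already selected:

\[
\operatorname{MMR} = \underset{k \in C \setminus S}{\arg\max} \Big[ \lambda \, \operatorname{sim}(k, T) - (1 - \lambda) \max_{s \in S} \operatorname{sim}(k, s) \Big]
\]

In BERTopic, the diversity parameter plays roughly the role of \(1 - \lambda\), so diversity = 0.3 keeps keywords closely tied to the topic while penalizing near-duplicates.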

Code
pos_patterns = [
            [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
            [{'POS': 'NOUN'}]
]

representation_model = [
    PartOfSpeech("en_core_web_lg", pos_patterns=pos_patterns),
    MaximalMarginalRelevance(diversity=0.3),
]

Creating Topic Model

Future work involves identifying appropriate validation metrics, e.g. cluster size, silhouette score, largest topic, number of outliers, etc., and using a grid search such as GridSearchCV to optimize the HDBSCAN hyperparameters; a rough sketch follows the references below. Additionally, clusters larger than a certain size may be further divided to improve granularity.

References:

  • https://towardsdatascience.com/tuning-with-hdbscan-149865ac2970
  • https://github.com/MaartenGr/BERTopic/issues/611
  • https://gist.github.com/drob-xx/1d8408f24f4091c72e8d03dc7a451be7
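
As a starting point for that tuning, a hedged sketch of a small parameter sweep scored by HDBSCAN’s DBCV-based relative_validity_ is shown below. It assumes the UMAP-reduced embeddings are available as reduced_embeddings (the notebook itself lets BERTopic run UMAP internally), and the candidate parameter values are illustrative.

# Hedged sketch: exhaustive sweep over a few HDBSCAN hyperparameters, scored by
# relative_validity_ (a DBCV estimate; requires gen_min_span_tree=True).
from itertools import product
import numpy as np
from hdbscan import HDBSCAN

param_grid = {
    'min_cluster_size': [8, 12, 16, 24],          # illustrative candidates
    'min_samples': [1, 2, 4],
    'cluster_selection_method': ['eom', 'leaf'],
}

reduced_embeddings = umap_model.fit_transform(embeddings)   # assumes umap_model, embeddings from above

best_score, best_params = -np.inf, None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    clusterer = HDBSCAN(metric='euclidean', gen_min_span_tree=True, **params)
    clusterer.fit(reduced_embeddings)
    if clusterer.relative_validity_ > best_score:
        best_score, best_params = clusterer.relative_validity_, params

print(f"Best relative validity: {best_score:.3f} with {best_params}")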
Code
from sklearn.metrics import silhouette_score

topic_model = BERTopic(
    umap_model=umap_model,                   # Reduce dimensionality
    hdbscan_model=hdbscan_model,                # Cluster reduced embeddings
    vectorizer_model=vectorizer_model,       # Tokenize topics
    ctfidf_model=ctfidf_model,               # Extract topic words
    representation_model=representation_model,  # Fine-tune topic representations
    # calculate_probabilities=True,
    nr_topics="auto",
)

topic_assignments, probabilities = topic_model.fit_transform(documents, embeddings)

print(f"""
Number of comments: {len(documents)}
Number of topics: {len(topic_model.get_topic_freq()) - 1}
Outliers: {topic_assignments.count(-1)}
Largest Cluster: {topic_assignments.count(0)}
Silhouette Score: {silhouette_score(embeddings, topic_assignments, metric='cosine')}
Relative Validity: {topic_model.hdbscan_model.relative_validity_}
""")

Number of comments: 607
Number of topics: 14
Outliers: 229
Largest Cluster: 70
Silhouette Score: 0.008188310079276562
Relative Validity: 0.16012470901699025

List of generated topics

Code
for topic, words in topic_model.get_topics().items():
    if topic == -1:
        continue
    keywords = ', '.join([word[0] for word in words])
    print(f'Topic {topic}: {keywords}')
Topic 0: tax, vote, taxes, wage, officials, homelessness, city, laws, limits, homeless people
Topic 1: traffic, road, lane, flow, bypass, roads, turn, lights, congestion, cemetery
Topic 2: ordinance, fairness, city, noise, lawns, separation, code, food, officials, improvement
Topic 3: parking, campus, students, garage, college, miles, food, police, downtown, lots
Topic 4: planning, zoning, parks, small businesses, development, neighborhoods, single family, family, apartments, city
Topic 5: drug, opioid, doctors, deaths, pain, crisis, dealers, problem, dependency, opioids
Topic 6: school, schools, arts, choice, kids, districts, public school, charter, education, parents
Topic 7: complex, university, facilities, sports, programs, community, psych, educational establishments, sport, cultural activities
Topic 8: sidewalks, trails, bike, walking, bicycle, connectivity, future sidewalks, ramps, paved trail, biking
Topic 9: water, fire, lines, residents, expensive insurance, maps, new roads, idea, southern end, bills
Topic 10: internet, fiber, cable, competition, rates, option, companies, service, providers, gouging
Topic 11: marijuana, smoking, revenue, cannabis, region, meth, substantial tax, feet, resolution, massive leadership
Topic 12: recycling, litter, trash, bins, services, yards, equipment, cans, contract, dumping
Topic 13: buses, transportation, transit, public transportation, system, train, service, trolley, unmanned trolley, accessibility

Generate Topic Headings using Language Model

Here we call the guidance function defined above to generate the topic headings. We explore these prompting techniques with various language models in the next notebook.

Code
topics = Topics(
    DATASET,
    df=pl.from_pandas(topic_model.get_topic_info())
)

progress_bar = tqdm(
    total=topics.df.height,
    desc='Topic Titles',
    unit="topics",
    smoothing=0.1,
)

languageModel + generate_topic_headings_titles(
    summary,
    topics,
    progress_bar=progress_bar
)

progress_bar.close()

topic_model.set_topic_labels(
    topics.df.get_column('Heading').to_list()
)
Code
from IPython.display import display_markdown

output = "### Generated Headings:\n- " + \
    "\n- ".join(topics.df.get_column('Heading').to_list())
display_markdown(output, raw=True)

Generated Headings:

  • Outliers
  • Addressing Homelessness and Taxation
  • Traffic Management
  • Opposing Fairness Ordinance
  • College Campus Improvements
  • Improving City Planning and Zoning
  • Opioid Crisis and Doctors
  • Improving Arts Education in Public Schools
  • University-Based Community Enrichment
  • Enhanced Pedestrian Infrastructure
  • Water and Fire Protection in Tuckertown
  • Expanding Internet Options
  • Legalized Marijuana Benefits
  • Improving Waste Management
  • Improved Public Transit

Outlier Analysis

Comment Distribution

The following graph shows the comment embeddings projected into 2D space, colored by topic. Several outliers are visible in gray, indicating that they have not been assigned a topic.

Code
topic_model.visualize_documents(
    documents,
    embeddings=embeddings,
    title="",
    custom_labels=True
)

Topic Distribution

The following distribution shows a reasonably balanced number of comments per topic, with the exception of the first topic. There is also a significant number of outliers with the label \(-1\). In the following steps, we assign these outliers to the most similar topic.

Code
import matplotlib.pyplot as plt

plt.figure(figsize=(18, 6))
plt.xticks(range(int(min(topic_assignments)), int(max(topic_assignments)) + 1))
plt.hist(topic_assignments, bins=len(topic_assignments), width=0.5)
plt.show()

Assign outliers to closest topic

We first assign outlier comments to the nearest topic based on the similarity between their embeddings and the topic embeddings. If any outliers remain that do not appear to belong to a topic, we assign them using the c-TF-IDF based probability distributions calculated by BERTopic. This ensures that all comments are considered in our further analysis rather than being silently discarded.

Future work will include further analysis of the outliers to isolate noise and improve the quality of the topics.

Code
outliers = topic_assignments.count(-1)

print(f"Outliers: {outliers}")

if topic_assignments.count(-1) > 0:
    print("Assigning outliers to topics using embeddings...")
    topic_assignments = topic_model.reduce_outliers(
        documents,
        topic_assignments,
        strategy='embeddings',
        embeddings=embeddings
    )
    print("Outliers remaining:", topic_assignments.count(-1))

if topic_assignments.count(-1) > 0:
    print("Assigning outliers to topics using c-TF-IDF based probability distributions...")
    topic_assignments = topic_model.reduce_outliers(
        documents,
        topic_assignments,
        strategy='distributions'
    )
    print("Outliers remaining:", topic_assignments.count(-1))

plt.figure(figsize=(18, 6))
plt.xticks(range(int(min(topic_assignments)), int(max(topic_assignments)) + 1))
plt.hist(topic_assignments, bins=len(topic_assignments), width=0.5)
plt.show()
Outliers: 229
Assigning outliers to topics using embeddings...
Outliers remaining: 0

Update Topic Model with Outlier Assignments

Code
topic_model.update_topics(documents, topics=topic_assignments, ctfidf_model=ctfidf_model,
                         vectorizer_model=vectorizer_model, representation_model=representation_model)

print(f'\nNumber of topics: {len(topic_model.get_topic_freq())}\n')

for topic, words in topic_model.get_topics().items():
    keywords = ', '.join([word[0] for word in words])
    print(f'Topic {topic}: {keywords}')

# for topic, keywords in enumerate(topic_model.get_topic_info()['Representation']):
#    print(f"Topic {topic}: {', '.join(keywords)}")
2024-03-14 23:19:02,051 - BERTopic - WARNING: Using a custom list of topic assignments may lead to errors if topic reduction techniques are used afterwards. Make sure that manually assigning topics is the last step in the pipeline.Note that topic embeddings will also be created through weightedc-TF-IDF embeddings instead of centroid embeddings.

Number of topics: 14

Topic 0: homeless, pay, tax, taxes, city, people, vote, temp, officials, wage
Topic 1: traffic, road, flow, bypass, roads, lane, drivers, lights, intersection, turn
Topic 2: ordinance, fairness, city, noise, separation, lawns, code, food, officials, improvement
Topic 3: parking, campus, food, restaurants, students, garage, lots, apartment, space, college
Topic 4: planning, zoning, neighborhoods, parks, development, apartments, family, single family, neighbors, activities
Topic 5: drug, opioid, doctors, pain, deaths, crisis, treatment, dealers, problem, dependency
Topic 6: schools, school, education, students, parents, public school, public schools, kids, funding, arts
Topic 7: museum, complex, community, university, sports, concerts, youth, programs, art, activities
Topic 8: sidewalks, bike, trails, walking, trail, pedestrian, river, shade, trees, cars
Topic 9: water, fire, lines, counties, river, tourism, residents, waste, septic tanks, features
Topic 10: internet, cable, fiber, competition, rates, tv, choices, option, companies, service
Topic 11: marijuana, smoking, revenue, cannabis, region, meth, substantial tax, feet, resolution, massive leadership
Topic 12: recycling, litter, trash, bins, services, yards, equipment, cans, contract, dumping
Topic 13: buses, transportation, public transportation, transit, system, service, train, accessible transportation, rural areas, daily transport

Store Topic Assignments in Comment Dataframe

Code
comments_df = comments_df.with_columns(topicId=pl.Series(topic_assignments))
comments.join_in_place(comments_df.select('commentId', 'topicId'), 'commentId', dropColumns='topicId').save_to_parquet()
print(f"Saved to {comments.filename}")
Saved to ./data-polis/american-assembly.bowling-green/comments.parquet

Re-generate topic headings based on updated model

Code
topics = Topics(
    DATASET,
    df=pl.from_pandas(topic_model.get_topic_info())
)

progress_bar = tqdm(
    total=topics.df.height,
    desc='Topic Titles',
    unit="topics",
    smoothing=0.1,
)

languageModel + generate_topic_headings_titles(
    summary,
    topics,
    generate_titles=True,
    progress_bar=progress_bar
)

topics.save_to_parquet()
progress_bar.close()

topic_model.set_topic_labels(
    topics.df.get_column('Heading').to_list()
)
Code
from IPython.display import display_markdown

output = "### Updated Headings:\n- " + \
    "\n- ".join(topics.df.get_column('Heading').to_list())
display_markdown(output, raw=True)

Updated Headings:

  • Addressing Homelessness with Occupational Taxes
  • Improving Traffic Flow
  • No to Fairness Ordinance
  • College Campus Improvements
  • Improving City Planning and Zoning
  • Opioid Epidemic and Healthcare
  • Public School Funding and Accountability
  • Community Enrichment Hub
  • Pedestrian-Friendly Infrastructure
  • Enhancing Community Safety and Tourism
  • Internet Choices and Competition
  • Marijuana Legalization
  • Enhancing Recycling and Waste Management
  • Improved Transit Services

Generate Hierarchical Topic Tree

Code
hierarchical_topics = topic_model.hierarchical_topics(documents)
hTopics = HierarchicalTopics(DATASET, df=hierarchical_topics)
hTopics.save_to_parquet()
print(f"Saved to {hTopics.filename}")
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:02<00:00,  5.73it/s]
Saved to ./data-polis/american-assembly.bowling-green/hierarchicaltopics.parquet

Visual and Interactive Analysis of Topics

Comment Distribution without Outliers

Code
topic_model.visualize_documents(
    documents,
    embeddings=embeddings,
    title="",
    custom_labels=True
)

Inter-topic Distance Map

Code
topic_model.visualize_topics(
    custom_labels=True,
    title='',
    width=1200,
)

Hierarchical Clustering

Code
# print(topic_model.get_topic_tree(hierarchical_topics)) # print a text representation of the topic tree
# hierarchical_topics = topic_model.hierarchical_topics(documents)


topic_model.visualize_hierarchy(
    custom_labels=True,
    orientation='left',
    title='',
    width=1000,
)

Topic Correlation Heatmap

Code
topic_model.visualize_heatmap(
    custom_labels=True,
    title='',
    width=1200,
    )

Hierarchical Organization of Comments

Code
topic_model.visualize_hierarchical_documents(
    documents,
    hierarchical_topics,
    embeddings=embeddings,
    custom_labels=True,
    title='',
    width=1200,
)
 
