import sys
import os
import polars as pl
from argmap.dataModel import Summary, Comments
from dotenv import load_dotenv
load_dotenv()
os.getenv('EMBED_MODEL_ID')
'WhereIsAI/UAE-Large-V1'
Sonny Bhatia
March 1, 2024
Polis datasets are publicly available at github.com/compdemocracy/openData. We download these datasets and read the CSV files with the Polars DataFrame library. With the data loaded in our Python environment, we use Sentence Transformers to compute an embedding for each comment and store it alongside the original data in the DataFrame. We then save the DataFrame to a Parquet file for further analysis.
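The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the argmap implementation: the function names `embedding_column_name` and `embed_csv_to_parquet` are hypothetical, and the text column name `comment-body` is an assumption about the layout of the Polis comments CSV.

```python
def embedding_column_name(model_id: str) -> str:
    """Name the embedding column after the model that produced it."""
    return f"embedding-{model_id}"


def embed_csv_to_parquet(csv_path, parquet_path, model_id,
                         text_column="comment-body"):
    """Read comments from CSV, embed each one, and save to Parquet.

    Hypothetical helper: `text_column` is an assumed column name.
    """
    # Imported lazily so the small helper above stays usable without
    # the heavier dependencies installed.
    import polars as pl
    from sentence_transformers import SentenceTransformer

    df = pl.read_csv(csv_path)
    model = SentenceTransformer(model_id)

    # encode() returns one vector per input text (a NumPy array by default).
    vectors = model.encode(df[text_column].to_list(), show_progress_bar=True)

    # Store the vectors as a list column next to the original data.
    df = df.with_columns(
        pl.Series(embedding_column_name(model_id), vectors.tolist())
    )
    df.write_parquet(parquet_path)
    return df
```

Naming the column after the model ID lets embeddings from several models sit side by side in the same Parquet file.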
Here we consider several embedding models from the HuggingFace Massive Text Embedding Benchmark (MTEB) Leaderboard. The following models are considered:
2024-03-19 23:37:31.029977 Initializing embedding model: WhereIsAI/UAE-Large-V1...
No sentence-transformers model found with name WhereIsAI/UAE-Large-V1. Creating a new one with MEAN pooling.
CUDA Memory: 50.3 GB free, 1.2 GB allocated, 61.4 GB total
CUDA Memory: 50.3 GB free, 1.2 GB allocated, 61.4 GB total
2024-03-19 23:37:33.498794 Embedding model initialized.
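The model ID in this notebook comes from the EMBED_MODEL_ID environment variable. If that variable is unset, `os.getenv` returns `None`, which would silently produce a column named `embedding-None`. A small guard along these lines may help when switching between candidate models; `resolve_embed_model_id` is a hypothetical helper, not part of argmap.

```python
import os
from typing import Optional


def resolve_embed_model_id(default: Optional[str] = None) -> str:
    """Return the embedding model ID from EMBED_MODEL_ID, or fail loudly.

    Hypothetical helper: `default` is used only when the variable is unset.
    """
    model_id = os.getenv("EMBED_MODEL_ID", default)
    if not model_id:
        raise RuntimeError(
            "EMBED_MODEL_ID is not set; define it in the environment or .env file."
        )
    return model_id
```

Calling `resolve_embed_model_id()` after `load_dotenv()` would then either return the configured ID or stop the run before any embeddings are computed.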
from argmap.dataModel import Summary, Comments
DATASETS = [
    "american-assembly.bowling-green",
    "march-on.operation-marchin-orders",
    "scoop-hivemind.biodiversity",
    "scoop-hivemind.freshwater",
    "scoop-hivemind.taxes",
    "scoop-hivemind.ubi",
    "scoop-hivemind.affordable-housing",
    "ssis.land-bank-farmland.2rumnecbeh.2021-08-01",
]
EMBED_MODEL_ID = os.getenv('EMBED_MODEL_ID')

# Note: `calculate_embeddings` and `model` are defined outside this excerpt.
for dataset in DATASETS:
    summary = Summary(dataset)
    comments = Comments(dataset)

    # Prefer the previously saved Parquet file; fall back to the original CSV.
    if os.path.exists(comments.filename):
        comments.load_from_parquet()
        print(f"{dataset}: Loaded {comments.df.height} comments from Parquet DataFrame.")
    else:
        comments.load_from_csv()
        print(f"{dataset}: Loaded {comments.df.height} comments from original dataset CSV.")

    print(f"Topic: {summary.get('topic')}")

    # Compute one embedding per comment and store it in a column named after the model.
    embeddings = calculate_embeddings(comments, model, show_progress_bar=True)
    comments.addColumns(pl.Series(embeddings).alias(f'embedding-{EMBED_MODEL_ID}'))
    comments.save_to_parquet()
    print(f"{dataset}: Saved {comments.df.height} comments with embeddings to Parquet DataFrame.")
    print()
american-assembly.bowling-green: Loaded 896 comments from Parquet DataFrame.
Topic: Improving Bowling Green / Warren County
american-assembly.bowling-green: Saved 896 comments with embeddings to Parquet DataFrame.
march-on.operation-marchin-orders: Loaded 2162 comments from Parquet DataFrame.
Topic: Operation Marching Orders
march-on.operation-marchin-orders: Saved 2162 comments with embeddings to Parquet DataFrame.
scoop-hivemind.biodiversity: Loaded 316 comments from Parquet DataFrame.
Topic: Protecting and Restoring NZ's Biodiversity
scoop-hivemind.biodiversity: Saved 316 comments with embeddings to Parquet DataFrame.
scoop-hivemind.freshwater: Loaded 80 comments from Parquet DataFrame.
Topic: HiveMind - Freshwater Quality in NZ
scoop-hivemind.freshwater: Saved 80 comments with embeddings to Parquet DataFrame.
scoop-hivemind.taxes: Loaded 148 comments from Parquet DataFrame.
Topic: Tax HiveMind Window
scoop-hivemind.taxes: Saved 148 comments with embeddings to Parquet DataFrame.
scoop-hivemind.ubi: Loaded 71 comments from Parquet DataFrame.
Topic: A Universal Basic Income for Aotearoa NZ?
scoop-hivemind.ubi: Saved 71 comments with embeddings to Parquet DataFrame.
scoop-hivemind.affordable-housing: Loaded 165 comments from Parquet DataFrame.
Topic: ScoopNZ Hivemind on affordable housing
scoop-hivemind.affordable-housing: Saved 165 comments with embeddings to Parquet DataFrame.
ssis.land-bank-farmland.2rumnecbeh.2021-08-01: Loaded 297 comments from Parquet DataFrame.
Topic: JOIN THE DISCUSSION BELOW: Land use and conservation in the San Juan Islands
ssis.land-bank-farmland.2rumnecbeh.2021-08-01: Saved 297 comments with embeddings to Parquet DataFrame.