ArgMap


Data Ingestion

Read Polis data, calculate embeddings, and store them in a Polars DataFrame
Author: Sonny Bhatia

Published: March 1, 2024

Polis datasets are publicly available at github.com/compdemocracy/openData. We download these datasets and read their CSV files with the Polars DataFrame library. Once the data is available in our Python environment, we use Sentence Transformers to compute an embedding for each comment and store it alongside the original data in the DataFrame. Finally, we save the DataFrame to a Parquet file for further analysis.

Import Packages and Set Up Environment

Code
import sys
import os
import polars as pl

from argmap.dataModel import Summary, Comments

from dotenv import load_dotenv
load_dotenv()

os.getenv('EMBED_MODEL_ID')
'WhereIsAI/UAE-Large-V1'

Initialize Embedding Model

Here we consider several embedding models based on the HuggingFace Massive Text Embedding Benchmark (MTEB) leaderboard:

  • intfloat/e5-mistral-7b-instruct - 4096 dimensions, requires 14.5 GB RAM
  • WhereIsAI/UAE-Large-V1 - 1024 dimensions, requires 1.5 GB RAM
  • Salesforce/SFR-Embedding-Mistral - consistently ranks near the top of the leaderboard, untested
  • OpenAI/text-embedding-3-large - hosted by OpenAI, not open source
  • OpenAI/text-embedding-ada-002 - hosted by OpenAI, not open source
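Dimensionality also drives storage cost once a vector is kept per comment. As a rough back-of-the-envelope (assuming fp32, four bytes per value; `embedding_storage_bytes` is a hypothetical helper, not part of argmap):

```python
def embedding_storage_bytes(n_comments: int, dims: int, bytes_per_value: int = 4) -> int:
    """Approximate fp32 storage for one embedding column."""
    return n_comments * dims * bytes_per_value

# Largest dataset here has 2162 comments:
# 4096 dims (e5-mistral-7b-instruct) vs 1024 dims (UAE-Large-V1)
print(embedding_storage_bytes(2162, 4096))  # 35422208 bytes, ~34 MB
print(embedding_storage_bytes(2162, 1024))  # 8855552 bytes, ~8.4 MB
```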

Verify GPU Availability

Code
from argmap.helpers import printTorchDeviceVersion

printTorchDeviceVersion()
Device: Orin
Python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:26:55) [GCC 12.3.0]
PyTorch: 2.3.0a0+ebedce2
CUDA: 12.2
CUDNN: 8904
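`printTorchDeviceVersion` is an argmap helper; a minimal equivalent using only PyTorch's public API might look like the sketch below (the actual helper may report more detail):

```python
import sys
import torch

# Report interpreter and framework versions, plus CUDA details when a GPU is present.
print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA: {torch.version.cuda}")
    print(f"cuDNN: {torch.backends.cudnn.version()}")
```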

Load Model

Code
from sentence_transformers import SentenceTransformer
from argmap.helpers import ensureCUDAMemory, printCUDAMemory, loadEmbeddingModel

if os.getenv("EMBED_MODEL_ID") is None:
    print("EMBED_MODEL_ID environment variable is required.")
    sys.exit(3)

model = loadEmbeddingModel()
2024-03-19 23:37:31.029977 Initializing embedding model: WhereIsAI/UAE-Large-V1...
No sentence-transformers model found with name WhereIsAI/UAE-Large-V1. Creating a new one with MEAN pooling.
CUDA Memory: 50.3 GB free, 1.2 GB allocated, 61.4 GB total
CUDA Memory: 50.3 GB free, 1.2 GB allocated, 61.4 GB total
2024-03-19 23:37:33.498794 Embedding model initialized.

Calculate and Store Embeddings

Code
def calculate_embeddings(comments, model, show_progress_bar=False):
    # Encode every comment in the dataset into a dense embedding vector
    documents = comments.df.get_column('commentText').to_list()
    embeddings = model.encode(documents, show_progress_bar=show_progress_bar)
    return embeddings
Code
from argmap.dataModel import Summary, Comments

DATASETS = [
    "american-assembly.bowling-green",
    "march-on.operation-marchin-orders",
    "scoop-hivemind.biodiversity",
    "scoop-hivemind.freshwater",
    "scoop-hivemind.taxes",
    "scoop-hivemind.ubi",
    "scoop-hivemind.affordable-housing",
    "ssis.land-bank-farmland.2rumnecbeh.2021-08-01",
]

EMBED_MODEL_ID = os.getenv('EMBED_MODEL_ID')

for dataset in DATASETS:

    summary = Summary(dataset)
    comments = Comments(dataset)

    if os.path.exists(comments.filename):
        comments.load_from_parquet()
        print(f"{dataset}: Loaded {comments.df.height} comments from Parquet DataFrame.")
    else:
        comments.load_from_csv()
        print(f"{dataset}: Loaded {comments.df.height} comments from original dataset CSV.")

    print(f"Topic: {summary.get('topic')}")

    embeddings = calculate_embeddings(comments, model, show_progress_bar=True)
    comments.addColumns(pl.Series(embeddings).alias(f'embedding-{EMBED_MODEL_ID}'))
    comments.save_to_parquet()
    print(f"{dataset}: Saved {comments.df.height} comments with embeddings to Parquet DataFrame.")
    print()
american-assembly.bowling-green: Loaded 896 comments from Parquet DataFrame.
Topic: Improving Bowling Green / Warren County
american-assembly.bowling-green: Saved 896 comments with embeddings to Parquet DataFrame.

march-on.operation-marchin-orders: Loaded 2162 comments from Parquet DataFrame.
Topic: Operation Marching Orders
march-on.operation-marchin-orders: Saved 2162 comments with embeddings to Parquet DataFrame.

scoop-hivemind.biodiversity: Loaded 316 comments from Parquet DataFrame.
Topic: Protecting and Restoring NZ's Biodiversity
scoop-hivemind.biodiversity: Saved 316 comments with embeddings to Parquet DataFrame.

scoop-hivemind.freshwater: Loaded 80 comments from Parquet DataFrame.
Topic: HiveMind - Freshwater Quality in NZ
scoop-hivemind.freshwater: Saved 80 comments with embeddings to Parquet DataFrame.

scoop-hivemind.taxes: Loaded 148 comments from Parquet DataFrame.
Topic: Tax HiveMind Window
scoop-hivemind.taxes: Saved 148 comments with embeddings to Parquet DataFrame.

scoop-hivemind.ubi: Loaded 71 comments from Parquet DataFrame.
Topic: A Universal Basic Income for Aotearoa NZ?
scoop-hivemind.ubi: Saved 71 comments with embeddings to Parquet DataFrame.

scoop-hivemind.affordable-housing: Loaded 165 comments from Parquet DataFrame.
Topic: ScoopNZ Hivemind on affordable housing
scoop-hivemind.affordable-housing: Saved 165 comments with embeddings to Parquet DataFrame.

ssis.land-bank-farmland.2rumnecbeh.2021-08-01: Loaded 297 comments from Parquet DataFrame.
Topic: JOIN THE DISCUSSION BELOW: Land use and conservation in the San Juan Islands
ssis.land-bank-farmland.2rumnecbeh.2021-08-01: Saved 297 comments with embeddings to Parquet DataFrame.

© 2024 Aaditya Bhatia
