Using LLMs to Structure and Visualize Policy Discourse

Aaditya (Sonny) Bhatia

United States Military Academy, West Point, NY

Advisor: Dr. Gita Sukthankar

University of Central Florida, Orlando, FL

Sunday, August 4, 2024

Context

Policy decisions pose a complex, wicked problem ¹ ²

Effectiveness determined by solving it; single attempt
Measuring impact will shift problem
Public discourse helps shape solutions; crucial for policy-making

Determining public opinion

Surveys and polls -> Social Media -> Discussion Platforms
People willing to express freely
Digital platforms provide a wealth of data
Unstructured, vast, and complex

First, let’s understand the problem space.

What are wicked problems and why do they matter?

A problem that is difficult or impossible to perfectly solve because of incomplete, contradictory, and changing requirements that are often difficult to recognize.
An example would be building a highway to connect two ends of a city.
You can go through the city or around it
and each option has its own unique implications.
There is no way to find out which solution is better without trying it out
and the solution will change the problem itself.

Most policy decisions are wicked problems

Public discourse helps shape solutions to wicked problems.
Historically, public opinion was determined through surveys and polls.
Social media gave researchers a new avenue, but one limited by each platform’s constraints.
Specific discussion platforms provide a wealth of data, but it is often unstructured, vast, and complex.

Problem Statement

How can LLMs enable us to

ingest massive streams of unstructured information
incorporate diverse perspectives, and
distill them into actionable insights, that
align with public opinion?

What are the inherent risks associated with the deployment of LLMs?

Related Works

gIBIS

Networked decision support system (J. Conklin and Begeman 1988)
Structured conversation using
- Issues
- Positions
- Arguments
Helped identify underlying assumptions
Promoted divergent and convergent thinking
Limited by structure, learning curve, scalability
Provided the basis for several other tools

Issue Based Information System, or IBIS
is a discussion facilitation technique that relies on a shared display
great for clarifying wicked problems
it’s the most commonly used approach for argumentation
and provides the basis for several argument mapping tools and discussion platforms
gIBIS = graphical IBIS, a UI developed as a networked decision support system
Allowed multiple users to hold structured discussions and establish a common operating picture
Three types of nodes (point at picture)
- Issues represent questions or problems
- Positions - potential answers to those issues
- Arguments - that support or refute those answers
evolved into powerful method for research thinking and design deliberation
Helps identify underlying assumptions
Promotes divergent thinking by encouraging exploration
Enables convergent thinking by allowing the group to reach consensus

Polis

“Real-time system for gathering, analyzing and understanding” public opinion (Small 2021)
Developed as an open source platform for public discourse
Published several case studies
Participants post short messages and vote on others
Polis algorithm ensures exposure to diverse opinions
\(\vec{comments} \times \vec{votes} =\) opinion matrix
- fed into statistical models
- understand where people agree or disagree

Polis is a real-time system for analyzing public opinion
It’s been developed as an open-source platform and used by several governments and organizations across the world
When an organization hosts a discussion
- they post a discussion topic and question for the public
- participants respond to the discussion using short comments - 140 characters, like a tweet
- these comments are expected to either introduce an issue or suggest a solution
- cannot reply
- vote on other comments to indicate how they feel
Polis’s underlying mechanism ensures that people are exposed to diverse opinions
On the right, you can see a screenshot from a live report from a town hall meeting in Bowling Green, Kentucky
(describe it maybe)
This voting activity results in an opinion matrix
- this is fed into statistical models
- specifically understand where people tend to agree or disagree
We use Polis because it provides a rich dataset of public opinion that we can structure and analyze
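
A minimal sketch of how an opinion matrix could be assembled from the votes table, assuming one row per participant and one column per comment. The function name `build_opinion_matrix`, the tuple layout, and the vote encoding (+1 agree, -1 disagree, 0 pass) are assumptions for illustration, not Polis's own code.

```python
def build_opinion_matrix(votes):
    """Build a participant x comment matrix from (voter_id, comment_id, vote) rows.

    Assumed encoding: +1 agree, -1 disagree, 0 pass.
    Comments a participant never voted on are stored as None.
    """
    voters = sorted({voter for voter, _, _ in votes})
    comments = sorted({comment for _, comment, _ in votes})
    row = {voter: i for i, voter in enumerate(voters)}
    col = {comment: j for j, comment in enumerate(comments)}

    matrix = [[None] * len(comments) for _ in voters]
    for voter, comment, vote in votes:
        # If a participant voted twice on a comment, the later row wins.
        matrix[row[voter]][col[comment]] = vote
    return voters, comments, matrix


# Toy data, made up for illustration.
votes = [
    ("p1", "c1", 1), ("p1", "c2", -1),
    ("p2", "c1", 1), ("p2", "c2", 1),
    ("p3", "c2", -1),
]
voters, comments, matrix = build_opinion_matrix(votes)
```

Each row can then be treated as one participant's opinion vector when looking for groups that agree or disagree.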

D-Agree Crowd-Scale Discussion

Automated agent to facilitate online discussion (Ito, Hadfi, and Suzuki 2022)
IBIS-based discussion representation
Extracts and analyzes discussion structures from online discussions
Posts facilitation messages to incentivize participants and grow IBIS tree
Best results when agent augmented human facilitators (Hadfi and Ito 2022)
Results
- Use of the agent produced more ideas for any given issue
- Agent had 1.4 times more replies and 6 times shorter response intervals
- Increased user satisfaction and sense of accomplishment

Methodology

Data

Summary Statistics: conversation topic, number of participants, total comments, total votes
Comments: author, comment text, moderated, agree votes, disagree votes
Votes: voter ID, comment ID, timestamp, vote
Participant-Vote Matrix: participant ID, group ID, n-votes, n-agree, n-disagree, comment ID…
Stats History: votes, comments, visitors, voters, commenters

Summary of datasets used in the study
Dataset	Participants	Comments	Accepted
american-assembly.bowling-green	2031	896	607
scoop-hivemind.biodiversity	536	314	154
scoop-hivemind.taxes	334	148	91
scoop-hivemind.affordable-housing	381	165	119
scoop-hivemind.freshwater	117	80	51
scoop-hivemind.ubi	234	78	71

Embeddings

Calculated at comment level using Sentence Transformers library
Models considered
- intfloat/e5-mistral-7b-instruct
- WhereIsAI/UAE-Large-V1
- OpenAI/text-embedding-ada-002
- OpenAI/text-embedding-3-large

Language Model Selection Criteria
- Open weights
- Clustering performance on HuggingFace MTEB
- Memory footprint

Once we get the data, we calculate its embeddings.
Embeddings are numerical vectors that represent the semantic meaning of a word or sentence.
Transformer embeddings are contextual: each token’s vector depends on the context in which it appears.
These are used by language models to infer the next word in a sentence.
We calculated embeddings at the comment level, which was done by averaging the embeddings of all the words in a comment.
A comment could be 2-3 sentences, limited to 140 characters, and should contain only one idea.
This ensures that the embeddings capture the meaning of the entire comment.
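
The averaging step can be sketched in a few lines. This is a toy illustration with made-up three-dimensional vectors; in practice the embedding library does the pooling internally, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one comment vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Two toy token vectors (values made up for illustration).
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
comment_vector = mean_pool(tokens)
```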

Models

We considered several models, including a Mistral-based model from Microsoft Research, UAE-Large, and OpenAI’s models.
We selected these models based on their performance on the HuggingFace MTEB leaderboard, memory footprint, and availability of open weights.
Ultimately, we selected UAE-Large for our study due to its superior performance with an extremely low memory footprint.

Text Generation

Models Considered
Guidance
- Python-based framework developed by Microsoft Research
- Constrain generation using regular expressions, context-free grammars
- Interleave control and generation seamlessly

lm += f"""\
The following is a character profile for an RPG game in JSON format.
```json
{{
    "id": "{id}",
    "description": "{description}",
    "name": "{gen('name', stop='"')}",
    "age": {gen('age', regex='[0-9]+', stop=',')},
    "armor": "{select(options=['leather', 'chainmail', 'plate'], name='armor')}",
    "weapon": "{select(options=valid_weapons, name='weapon')}",
    "class": "{gen('class', stop='"')}",
    "mantra": "{gen('mantra', stop='"')}",
    "strength": {gen('strength', regex='[0-9]+', stop=',')},
    "items": ["{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}"]
}}```"""

Example output produced by Guidance. The green highlighted text is generated by the LLM, while the rest is programmatically inserted into the context. Inference is significantly faster since the model produces fewer tokens. Output format is strictly enforced using stop criteria, regular expressions, and fixed options.

Text Generation Models

We considered the three models listed here
and ended up using a quantized version of Mixtral
Mistral performed quite well but Mixtral showed better reasoning capabilities

Guidance

To control the language model’s output, we used a framework called Guidance
Traditionally, you would prompt the language model, receive and parse the response, and hope that it is in the correct format
Guidance lets you specify exactly what you want using regular expressions, context-free grammars, or just a simple list of options
You can also pause generation, insert text into the context, and then ask the model to generate more
Also, since the model has to come up with fewer tokens, the process is much much faster
In the example below, we ask the model to generate a structured JSON object
The output is on the right.
The highlighted text in the output is generated by the language model, while the rest is inserted from the template string.
(next) For the name, we tell the model to stop generating when it reaches a double quote
(next) For the age, we only allow numbers and stop when it tries to produce a comma
(next) For armor, we provide a list of options and ask the model to select one
This way, the output format is perfectly enforced in a very efficient way
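
For contrast, here is a rough sketch of the "prompt, parse, and hope" approach: the same constraints (digits only for age, a fixed list for armor) checked after the fact on a raw model response. It does not use Guidance; `validate_profile` and the sample strings are hypothetical.

```python
import json
import re

ARMOR_OPTIONS = ["leather", "chainmail", "plate"]


def validate_profile(raw):
    """Return a list of format problems found in a JSON character profile."""
    try:
        profile = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(profile, dict):
        return ["top level is not a JSON object"]

    problems = []
    # Same constraint as the regex '[0-9]+' in the template.
    if not re.fullmatch(r"[0-9]+", str(profile.get("age", ""))):
        problems.append("age is not a non-negative integer")
    # Same constraint as select(options=[...]) in the template.
    if profile.get("armor") not in ARMOR_OPTIONS:
        problems.append("armor is not one of the allowed options")
    return problems


# Made-up responses standing in for unconstrained model output.
good = '{"name": "Aria", "age": 27, "armor": "plate"}'
bad = '{"name": "Aria", "age": "twenty", "armor": "mithril"}'
```

With constrained generation these checks are unnecessary, because a response like `bad` cannot be produced in the first place.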

Topic Modeling

Topic Outlier Assignment

Each bar represents a topic cluster, with y-axis representing statement count; Topic `-1` is reserved for outliers that do not initially belong to a cluster.

Upon reassigning outliers, each topic has a larger number of statements; we reassign all outliers, ensuring that no statement is discarded as noise.
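
One way outlier reassignment could work is to move each `-1` statement to the topic whose centroid is closest in embedding space. The slides do not state which strategy was used, so nearest-centroid cosine similarity is an assumption here, and all names and vectors are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]


def reassign_outliers(embeddings, topics):
    """Replace every -1 label with the topic of the most similar centroid.

    Assumes at least one statement already belongs to a real topic.
    """
    clusters = {}
    for vec, topic in zip(embeddings, topics):
        if topic != -1:
            clusters.setdefault(topic, []).append(vec)
    centroids = {topic: centroid(vecs) for topic, vecs in clusters.items()}

    return [
        topic if topic != -1
        else max(centroids, key=lambda t: cosine(vec, centroids[t]))
        for vec, topic in zip(embeddings, topics)
    ]


# Toy 2-D embeddings; the last statement starts as an outlier.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.3]]
topics = [0, 0, 1, 1, -1]
new_topics = reassign_outliers(embeddings, topics)
```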

Statement Distribution

Statement Distribution after outlier reassignment

Insight Generation

Actionable insights that urge specific actions to address issues
Problems and solutions proposed by participants
LLM synthesizes insights from comments within each topic
Advocate for specific insights urging actions to address issues
Filter these insights to derive actionable items

An argument is a statement that advocates for a specific position or action.
the goal was to identify the most important issues and potential solutions
our key idea is to generate a lot of arguments, score them based on acceptability, and then select the best ones
I will walk you through the diagram at the bottom
first, we filtered the comments based on their agreeability
- agreeability is the proportion of agree votes to total votes
- this threshold lets us control what proportion of statements are used to form high-quality arguments
Filtering carries the risk of excluding comments and opinions.
- We do not want to exclude any valid opinions
- but also eliminate the ones that are just very unpopular
We feed these comments and topic description into the language model
- and ask it to identify areas of improvement
for each area of improvement
- we generate a list of problems discussed by the participants
- and potential solutions proposed by the participants
we use the model only to interpret these problems and solutions
- we do not want the model to come up with new solutions
- to minimize hallucinations, we kept our instructions very simple and clear
once we get a list of actionable insights, we select the top ones from each topic based on their acceptability
- and use them to generate the argument map

Insight Scoring

Goal: Quantify acceptance of each generated insight
Task: Identify comments that support each insight
Count the individuals that voted positively on supporting comments
Calculate an “acceptance” factor to indicate the degree of consensus
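
A minimal sketch of the scoring idea, counting people rather than votes. The slides do not give the exact formula, so the denominator used here (everyone who voted on at least one supporting comment) is an assumption, as are the function name and the +1/-1/0 vote encoding.

```python
def acceptance(supporting_comments, votes):
    """Share of distinct voters on the supporting comments who agreed with at least one.

    votes: iterable of (voter_id, comment_id, vote), vote assumed +1 / -1 / 0.
    """
    supporting = set(supporting_comments)
    voters, supporters = set(), set()
    for voter, comment, vote in votes:
        if comment in supporting:
            voters.add(voter)
            if vote > 0:
                supporters.add(voter)  # counted once, however many agree votes they cast
    return len(supporters) / len(voters) if voters else 0.0


# Toy data: c1 and c2 support the insight, c3 does not.
votes = [
    ("p1", "c1", 1), ("p1", "c2", 1),
    ("p2", "c1", -1),
    ("p3", "c2", 1),
    ("p4", "c3", 1),
]
score = acceptance(["c1", "c2"], votes)
```

Here p1 agrees twice but counts once, and p4 is ignored because c3 is not a supporting comment, so two of three relevant voters support the insight.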

Potential Biases in Insight Generation

Some comments, especially those posted earlier, may receive more votes than others
- Use the ratio of agreement votes to total votes
Certain topics are more popular and have more comments than others
- Generate a balanced number of insights for each topic
Certain controversial topics are heavily downvoted
- Comments: Filter by quantiles instead of fixed thresholds within each topic
- Insights: Select fixed number of “best insights” from each topic
Some people vote more than others
- Count the individuals that support an insight over hard vote count

When generating insights, we consider these potential biases
Some comments have more votes than others
- We normalize this by calculating the agreeability factor
- which is the ratio of agree votes to total votes
Some topics are more popular or controversial, and have more comments
- We normalize this by generating a balanced number of insights for each topic
Some topics are heavily downvoted for various reasons, so the votes across topics are not comparable
- For example, a topic in our dataset about legalizing weed was heavily downvoted overall, but still had valid opinions
- Had we compared them against other topics, those comments would never have surfaced
- To control for that, we use quantiles instead of fixed thresholds within each topic.
- In our case, we use the 0.1 quantile to trim only the worst 10% of comments within the topic.
Some people vote more than others, which can skew our insight support calculation
- If someone agreed with a comment, they are likely to agree with other supporting positions.
- We count the individuals that voted on at least one supporting comment instead of the total number of votes
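
The per-topic quantile filter can be sketched as follows. The comment records, topic names, and vote counts are made up; "total votes" is assumed to mean agree plus disagree votes, and the quantile uses simple linear interpolation, which may differ from the method used in the study.

```python
def agreeability(agree, disagree):
    """Ratio of agree votes to total (agree + disagree) votes."""
    total = agree + disagree
    return agree / total if total else 0.0


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def filter_by_topic_quantile(comments, q=0.1):
    """Keep comments at or above the q-quantile of agreeability within their own topic."""
    by_topic = {}
    for comment in comments:
        by_topic.setdefault(comment["topic"], []).append(comment)

    kept = []
    for topic_comments in by_topic.values():
        scores = [agreeability(c["agree"], c["disagree"]) for c in topic_comments]
        cutoff = quantile(scores, q)
        kept.extend(c for c, s in zip(topic_comments, scores) if s >= cutoff)
    return kept


# Toy data: the "cannabis" topic is downvoted across the board.
comments = [
    {"id": "p1", "topic": "parks", "agree": 9, "disagree": 1},
    {"id": "p2", "topic": "parks", "agree": 8, "disagree": 2},
    {"id": "p3", "topic": "parks", "agree": 2, "disagree": 8},
    {"id": "c1", "topic": "cannabis", "agree": 3, "disagree": 7},
    {"id": "c2", "topic": "cannabis", "agree": 2, "disagree": 8},
    {"id": "c3", "topic": "cannabis", "agree": 1, "disagree": 19},
]
kept_ids = [c["id"] for c in filter_by_topic_quantile(comments)]
```

A fixed cutoff of 0.5 would have removed every "cannabis" comment; the per-topic quantile drops only the weakest comment in each topic.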

Argument Mapping

Used Argdown syntax to generate argument maps
Developed a grammar generator to convert data into Argdown format
Generated argument maps for each topic to visualize the structure of the debate

===
sourceHighlighter:
    removeFrontMatter: true
webComponent:
    withoutMaximize: true
    height: 500px
===

# Argdown Syntax Example

[Statement]: Argdown is a simple syntax for defining argumentative structures, inspired by Markdown.
  + Writing a list of **pros & cons** in Argdown is as simple as writing a twitter message.
  + But you can also **logically reconstruct** more complex relations.
  + You can export Argdown as a graph and create **argument maps** of whole debates.
  - Not a tool for creating visualizations, but for **structuring arguments**.

<Argument>: Argdown is an excellent tool and should be used by the city of Bowling Green, KY.

[Statement]
  +> <Argument>
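
A rough sketch of what converting data into Argdown text might look like, using only the statement and support/attack syntax from the example above. This is not the grammar generator from the study; `to_argdown`, its tuple layout, and the sample strings are hypothetical.

```python
def to_argdown(title, description, insights):
    """Render one statement with supporting (+) or attacking (-) arguments.

    insights: list of (name, text, supports) tuples.
    """
    lines = [f"[{title}]: {description}"]
    for name, text, supports in insights:
        marker = "+" if supports else "-"
        lines.append(f"  {marker} <{name}>: {text}")
    return "\n".join(lines)


# Made-up content for illustration.
argdown_text = to_argdown(
    "Community Enrichment",
    "The city should invest in community spaces.",
    [
        ("More parks", "Residents asked for more green space.", True),
        ("Budget limits", "Funding is already stretched.", False),
    ],
)
```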

Argument Generation and Scoring

Opioid Epidemic and Healthcare

Argument Generation and Scoring

Community Enrichment

Conclusion

Chaining simple tasks for complex reasoning
Discovering topics in a large dataset and generating valuable new insights
Risk of hallucinations and incorrect output
LLMs’ limitations in processing complex instructions and sentences
- Complex instructions
- Relationship modeling based on double and triple negatives
Reliability and bias
- Critical need for ethical and inclusive technology deployment

Future Research Directions

Semantic extraction and reasoning during discourse
Exploring connections across topics
Generalizing techniques to platforms like Kialo, Hacker News

Overall, I found that chaining these simple tasks can help us create complex reasoning pipelines
We were able to discover topics in a large dataset and generate valuable insights
There is a risk of hallucinations and incorrect output
Hallucinations were mitigated by using different chain-of-thought techniques
and most importantly, by keeping our instructions very simple
use of guidance to control the generation of text really helped as well
incorrect information is a risk with any machine learning model
the best we can do is to pair these models with humans to catch these errors early
Overall, this helps enable public discourse
Which can be a powerful tool for enhancing democratic processes
depending on how it is used
But we also need to be careful about how we deploy these technologies

In this paper:

We developed a novel argument generation framework that synthesizes arguments from user-generated comments
and we introduced a scoring mechanism to quantify the acceptance of each synthesized argument
A key innovation is the way we produce argument maps.
These distill complex relationships between topics and insights and make it easier for policy-makers to understand what the public is saying
We recognized limitations in current LLMs’ ability to fully comprehend the nuanced dynamics of online debates
We acknowledged challenges in maintaining an unbiased moderation process, highlighting the delicate balance between automated and human moderation

Questions?

Thank you!
