
Visualizing anti-trans legislation using RStudio

Bianca Thompson
Westminster University

While I was working with Kenan İnce on their open source active learning book, Quantitative Reasoning for Social Justice, we discussed their dream chapter on using text data visualization to understand legislation about trans people in the US as a source for trans rights activism. İnce was trained as a poet and topologist and identified as non-binary (they/them), and I am trained as a number theorist and identify as a queer, biracial Latina (she/they). The nature of the project, demanding expertise in data analysis and familiarity with many different experiences of gender, made us both feel out of our depth. Regardless, we pushed on because this chapter is key to bringing culturally relevant topics into our classroom. Living in Salt Lake City, UT, with its vibrant queer communities, our goal has always been to create and defend habitable spaces in Utah for LGBTQ+ community members.

İnce passed away before we could finish this chapter. They were an activist poet-scholar committed to challenging legislation intent on erasing Utah’s queer communities. To quote from Heartstopper, “Trans people aren’t a debate. We’re human beings.” To honor İnce’s memory and legacy, their collaborators and I are working to finish Quantitative Reasoning for Social Justice.

[Image: colorful rainbow flags and arches in front of a downtown Salt Lake City clock tower. A Pride celebration in Salt Lake City. (Public domain photo by the Bureau of Land Management.)]

Since 2023, Utah’s legislature has introduced six anti-trans bills: HB132 (failed), HB209, HB257, SB0016, SB0039, and SB0093 (see Track Trans Legislation for details). The five bills that passed have been harmful to my students, especially recent graduates, who experienced acceptance on campus but face harassment in their new careers. I was inspired by Visualizing Text Data: Techniques and Applications by Geeks for Geeks, which argues that text data visualization simplifies complex data and supports comprehension, pattern recognition, communication, exploratory data analysis, and decision making. This pushed me to take the analysis beyond an in-class activity and follow in İnce’s footsteps to develop tools for activism.

So, why am I using math to study this? The shortest of these bills is around 10 pages and the longest is over 300, all written in dense legislative language. To be an informed citizen, I’d need to find time to analyze over 400 pages from the past year alone. And that’s just to keep up with the anti-trans legislation! There are more bills going through Utah’s House and Senate about issues that impact my friends and family. If I can use mathematics to do what the Geeks for Geeks article suggests, then I can be more efficient with my time. Let’s go on this journey together and see what we can learn from doing a text analysis.

Throughout the following discussion, I’ll provide code for RStudio, an application you can download that makes working with the statistical programming language R easier.
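
If you don’t already have the packages used below, a one-time setup like this will fetch them (my addition; the rest of the post assumes they’re installed):

install.packages(c("tidytext", "stringr", "stopwords", "dplyr",
                   "tidyverse", "igraph", "ggraph", "forcats"))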

My questions are:

  • What major language themes appear in these bills, and how do those themes line up with the bill types used by the Track Trans Legislation website: youth athletics, ID updates, healthcare, nondiscrimination protections, public facilities, school/education, and any other topic missed by these broad categories?
  • Can I use these text analyses to understand the content of the bills without reading 400 pages?

Using RStudio to clean the data

To start, I pasted each bill’s text into its own .txt file. Then I loaded my libraries:

library(tidytext)  # additional text mining tools, like unnest_tokens()
library(stringr)   # text cleaning and regular expressions
library(stopwords) # identifies stop words
library(dplyr)     # allows me to filter
library(tidyverse) # where the graph commands live
library(igraph)    # builds networks from the bigram counts
library(ggraph)    # visualizes the networks
library(forcats)   # refines some of the graphs

There are lots of ways to clean the text once it’s loaded in, but we can use a mix of regular expressions and built-in commands to make a clean_text_function().

clean_text_function <- function(text) {
  # Clean each line of text:
  cleaned_text <- text %>%
    str_to_lower() %>%                           # convert text to lowercase
    str_replace_all("[^[:alnum:]\\s]", " ") %>%  # remove non-alphanumeric characters (except spaces)
    str_replace_all("\\d+", " ") %>%             # delete all numbers
    str_squish()                                 # remove extra spaces

  return(cleaned_text)
}
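
Here’s a quick sanity check of what the cleaner actually does (the sample strings are my own, not from the bills):

clean_text_function(c("Section 2. K12 Schools;", "(a) A student MAY..."))
#> [1] "section k schools" "a a student may"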

This clean_text_function() has no finesse. If you’re familiar with regular expressions (regex), you can make one more tailored to your .txt files. I noticed that this function deletes things like the “12” in “k12” and leaves in the labels of list items, like “a”, “j”, and “ii”. When I tried to refine my code, it deleted too much, so I figured better the enemy I know.
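
For anyone who wants to try, here is a sketch of one possible refinement (my assumption, not the version used in this analysis): it strips only standalone numbers, so “k12” survives, and drops single-letter and short roman-numeral list labels. Note the trade-off the author describes: the roman-numeral rule will also eat any real token made only of the letters i, v, and x.

clean_text_tailored <- function(text) {
  text %>%
    str_to_lower() %>%
    str_replace_all("[^[:alnum:]\\s]", " ") %>%   # drop punctuation first
    str_replace_all("\\b\\d+\\b", " ") %>%        # drop standalone numbers only, so "k12" survives
    str_replace_all("\\b[a-z]\\b", " ") %>%       # drop single-letter list labels like "a" or "j"
    str_replace_all("\\b[ivx]{2,4}\\b", " ") %>%  # drop short roman-numeral labels like "ii" or "iv"
    str_squish()
}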

Now that we have defined the cleaning function, we can load our .txt files, clean them, and turn them into data frames:

bill_lines_0257 <- read_lines("HB0257.txt")
bill_lines_0093 <- read_lines("SB0093.txt")
bill_lines_0016 <- read_lines("SB0016.txt")
bill_lines_0039 <- read_lines("SB0039.txt")
bill_lines_0209 <- read_lines("HB0209.txt")
bill_lines_0132_fail <- read_lines("HB0132-failed.txt") 

clean_line_0257_df <- data.frame(text = clean_text_function(bill_lines_0257), stringsAsFactors = FALSE)
clean_line_0093_df <- data.frame(text = clean_text_function(bill_lines_0093), stringsAsFactors = FALSE)
clean_line_0016_df <- data.frame(text = clean_text_function(bill_lines_0016), stringsAsFactors = FALSE)
clean_line_0039_df <- data.frame(text = clean_text_function(bill_lines_0039), stringsAsFactors = FALSE)
clean_line_0209_df <- data.frame(text = clean_text_function(bill_lines_0209), stringsAsFactors = FALSE)
clean_line_0132_fail_df <- data.frame(text = clean_text_function(bill_lines_0132_fail), stringsAsFactors = FALSE)

combined_bills_df <- bind_rows(
  clean_line_0257_df %>% mutate(book = "HB0257"),
  clean_line_0093_df %>% mutate(book = "SB0093"),
  clean_line_0016_df %>% mutate(book = "SB0016"),
  clean_line_0039_df %>% mutate(book = "SB0039"),
  clean_line_0209_df %>% mutate(book = "HB0209"),
  clean_line_0132_fail_df %>% mutate(book = "HB0132_fail")
)

Using R to visualize the data in bigrams

A bigram is a pair of consecutive words. Using the unnest_tokens() function, we can break the text into a data frame that keeps track of bigrams (or any $n$-gram, but I’m looking at bigrams). We can then visualize all six bills’ top 10 bigrams at once. The separate() function breaks each bigram into two categorical variables. We use drop_na() because the cleaning step leaves blank strings that would otherwise be counted as words. The filter() command, in conjunction with stop_words, removes words like “the”, “a”, and “an”. Since we want to know how many of each bigram we have, we use the count() function. We then use unite() to rejoin the bigrams we separated earlier to clean out stop words, and group_by() so we can study the bigrams in each piece of legislation. To visualize, we build a bar graph with ggplot(); the remaining commands just tidy up the visualization.

bigram_data <- combined_bills_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  drop_na() %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(book, word1, word2, sort = TRUE) %>%
  unite("bigram", c(word1, word2), sep = " ") %>%
  group_by(book) %>%
  top_n(10)

bigram_data %>%
  ggplot(aes(reorder_within(bigram, n, book), n, fill = book)) +
  geom_bar(stat = "identity", alpha = .9, show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~ book, ncol = 2, scales = "free") +
  coord_flip()

Fun fact: when I printed the bar graph in R, all the text overlaid it, so I needed to launch the image in its own window to read it. We can also see that the cleaning of the data was not good enough, because “section section” appears more than once; “section” is just a header word used throughout the documents. I assume SB0093 shows more than 10 bigrams because several had the same frequency (top_n() keeps ties), and it’s clear SB0039 is the longest document. Even so, the image shows that HB132, SB0039, and SB0016 all feature “health care” and other related health words. SB0093 also talks about “health care”, but “birth certificate” appears often in that document. HB209 looks focused on K12 schools and “extracurricular activity”. It’s a little unclear what HB257 is meant to focus on, since its data wasn’t cleaned that well, but “privacy space” and “sex designated” are near the top of the list.
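
Both problems have straightforward fixes. Here’s a sketch (my addition, not part of the original analysis) that treats the header word “section” as a custom stop word and uses slice_max() with with_ties = FALSE to cap each bill at exactly 10 bigrams:

custom_stops <- c(stop_words$word, "section")  # add the boilerplate header word

bigram_data_v2 <- combined_bills_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  drop_na() %>%
  filter(!word1 %in% custom_stops,
         !word2 %in% custom_stops) %>%
  count(book, word1, word2, sort = TRUE) %>%
  unite("bigram", c(word1, word2), sep = " ") %>%
  group_by(book) %>%
  slice_max(n, n = 10, with_ties = FALSE)  # exactly 10 rows per bill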

Let’s view some of these bigrams as networks. I want to look more at HB257 and SB0039, since it wasn’t clear what those documents were about. First we’ll make a pair of functions, count_bigrams() and visualize_bigrams(), that build the bigram lists and then visualize them as networks. Darker arrows mean the connection appeared more often. The commands in the functions will look familiar, since we did something similar earlier with the full data frame.

count_bigrams <- function(dataset) {
  dataset %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word,
           !word2 %in% stop_words$word) %>%
    count(word1, word2, sort = TRUE)
}

visualize_bigrams <- function(bigrams) {
  set.seed(2016)
  a <- grid::arrow(type = "closed", length = unit(.15, "inches"))  # arrow style for the directed edges

  bigrams %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a) +
    geom_node_point(color = "lightblue", size = 3) +
    geom_node_text(aes(label = name), repel = TRUE) +
    theme_void()
}

Here $n$ is the number of times the bigram appears in the document, so we’re filtering out the bigrams that appear five times or fewer.

HB257_bigrams <- clean_line_0257_df %>%  # start from the cleaned HB257 data frame
  count_bigrams() %>%
  drop_na()

HB257_bigrams %>%
  filter(n > 5,
         !str_detect(word1, "\\d"),
         !str_detect(word2, "\\d")) %>%
  visualize_bigrams()

[Image: a network diagram showing relationships between words such as “privacy” and “space” or “government” and “entity”.]

This created a disconnected graph of related words. A connection between two words means they form a bigram, and the arrow points from word 1 to word 2. Following the darker arrows and the networks they connect leads to words like “sex”, “designat(ed/ion)”, “privacy”, “changing”, and “space”. This leads me to suspect the bill is about bathroom spaces, since there is a path from “sex” to “space” and “changing”.
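
We can check that impression computationally rather than by eye. This sketch (my addition; HB257_graph is a name I’m introducing) rebuilds the filtered graph and asks igraph for a path between “sex” and “space”, ignoring edge direction:

HB257_graph <- HB257_bigrams %>%
  filter(n > 5,
         !str_detect(word1, "\\d"),
         !str_detect(word2, "\\d")) %>%
  graph_from_data_frame()

# List the vertices along a shortest undirected path from "sex" to "space":
shortest_paths(HB257_graph, from = "sex", to = "space", mode = "all")$vpath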

For SB0039, I changed the command a little, since this is the bill that was over 300 pages; leaving the cutoff on $n$ small produced a giant, unreadable network.

SB0039_bigrams <- clean_line_0039_df %>%  # start from the cleaned SB0039 data frame
  count_bigrams() %>%
  drop_na()

Here’s an example of the content of the bigrams table: word1 is the first word in the bigram, word2 is the second, and n is the frequency with which that bigram appeared in the document.

word1     word2       n
utah      chapter     2088
medicaid  program     878
care      facility    833
nursing   care        754
health    care        577
medicaid  expansion   465
managed   care        348
health    insurance   329
medical   assistance  328
medicaid  waiver      321

SB0039_bigrams %>%
  filter(n > 100,
         !str_detect(word1, "\\d"),
         !str_detect(word2, "\\d")) %>%
  visualize_bigrams()

[Image: a network diagram showing relationships between words such as “government” and “behavioral” or “care” and “facility”.]

There are not as many dark arrows here. The grey ones seem to flow from “nursing” to “program”. This connected network talks a lot about healthcare, facilities, and programs, but I’m still unsure of the overall content of this bill.
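
One way to go beyond eyeballing arrow darkness (my addition, using names I’m introducing) is to ask igraph which words act as hubs, by summing the bigram counts on every edge touching each word:

SB0039_graph <- SB0039_bigrams %>%
  filter(n > 100,
         !str_detect(word1, "\\d"),
         !str_detect(word2, "\\d")) %>%
  graph_from_data_frame()

# Weighted degree: total bigram frequency on the edges at each word
sort(strength(SB0039_graph, weights = E(SB0039_graph)$n), decreasing = TRUE)[1:10]

If words like “care” and “medicaid” top that list, it would reinforce the healthcare reading suggested by the network.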

Conclusion

After pulling the bigrams and their networks to analyze their contents, my goal was to find the major themes of the legislation. Some bills, like SB0016 and HB0132, are clearly about healthcare, which matches their categorization on Track Trans Legislation. But others, like SB0093, SB0039, and HB0257, were more difficult to categorize using bigrams and their networks alone. HB209 is coded as “youth athletics”, and maybe the bigrams containing “extracurricular activities” should have clued me in, but to be more certain in future conclusions I’d need to look at more youth athletics bills. Further, for the healthcare bills, I could identify that they were about health care and transgender access to health care, but I couldn’t tell what restrictions were being put in place.

When İnce and I first proposed this project, it felt incredibly intimidating, since I predominantly use SageMath for all my math needs and very rarely venture into other tools. After doing this exploration, I feel confident using these tools, but I realized that I need questions informed by the community impacted by these bills, so that this data analysis is done not for this community but with this community. As someone working to analyze data better using the CARE Principles for Indigenous Data Governance and data feminism, I need my practices to be informed by the community.

