David Austin
Grand Valley State University
In 2009, crypto miner James Howells of southern Wales mistakenly threw away a hard drive containing 8,000 bitcoins. That’s over $100 million even in today’s sinking crypto landscape. Thirteen years later, Howells has a plan, backed by venture capital, to search the 110,000 tons of garbage in the local landfill using robotic dogs and artificial intelligence.
This is a fairly common problem. Not searching landfills for cryptocurrency, of course, but searching for something in a vast sea of possibilities. That’s what we’ll be doing in this month’s column.
Let’s begin by breaking a code. More specifically, suppose we have a message encoded with a substitution cipher where each character in the message is substituted with another. To keep things simple, we’ll just imagine that the letters of the alphabet are permuted so that whenever we see an “a”, say, we write “m” instead. Here’s our message:
ziatwxafbssuladwafb uanubeablwdfunaywwmabywxdakfnzgdwsfunan wyzlatwxarbtanururyunadfbdafuawlkuafbeabagoblawnadfuagoblaf beakfnzgdwsfunanwyzlazaewldamlwoaofzkfableadfbdafuaxgueadwa kbjjadfzgagoblaswwfaadfbdaobgabajwlhadzruabhwableaofulaouag bzeahwweytuaouadwwmadfualbruaozdfaxgabgaouaezeldadfzlmadfua goblaowxjeaobldazdabltarwnuaoujjaofulaueobneayubnagbzeadfbd afuaowxjeajzmuablauqkzdzlhalbruabjjadwafzrgujiakfnzgdwsfuna nwyzlagbzeabdawlkuaozdfwxdagdwsszlhadwadfzlmadfbdafuaobgaoz llzudfuswwfableafuaobgagwabgazafb uauqsjbzlueadfuaswwfasbnd azaozjjalwoauqsjbzladfuanugdawiazd
Our task is to find the permutation that decodes the message. Of course, we could generate every permutation and look for one that generates something readable just as Mr. Howells could turn over every piece of rubbish in the landfill. However, since there are 27 characters, the letters in the alphabet along with a space, there are
$$
27! = 10,888,869,450,418,352,160,768,000,000 \approx 10^{28}
$$ permutations. That’s more than the number of seconds in the lifetime of the universe. We clearly need a better strategy to search through the permutations.
We know that letters in an English text occur at different frequencies. For instance, in Project Gutenberg’s translation of War and Peace, about 18% of the very large number of characters are spaces, 10% are “e”, 7% are “t” and so forth. In our encoded message, about 20% of the characters are “a” so it seems reasonable to suppose that “a” represents a space, about 8% are “u”, which probably represents an “e”, and so forth. This generates a permutation, which when applied to our message gives:
hk pow ntccei ao ntxe detl tioaned mooy tmowa gndhraocned d omhi pow ftp defefmed anta ne oige ntl t rsti od ane rsti n tl gndhraocned domhi h loia yios snhgn til anta ne wrel ao gtuu anhr rsti coon anta str t uoib ahfe tbo til snei se r thl boolmpe se aooy ane itfe shan wr tr se lhlia anhiy ane rsti sowul stia ha tip fode seuu snei elstdl metd rthl anta ne sowul uhye ti evghahib itfe tuu ao nhfreuk gndhraocned d omhi rthl ta oige shanowa raocchib ao anhiy anta ne str shi iheanecoon til ne str ro tr h ntxe evcuthiel ane coon ctda h shuu ios evcuthi ane dera ok ha
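This frequency-matching starting guess is easy to compute. Here is a sketch in Python (the helper and its names are my own; the reference frequencies would in practice come from a long text such as War and Peace):

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def frequency_match_key(ciphertext, reference_text):
    """Pair the most common symbol of the ciphertext with the most
    common symbol of the reference text, the second most common with
    the second, and so on, giving a starting guess at the decoding key."""
    def ranked(text):
        counts = Counter(c for c in text if c in ALPHABET)
        seen = [c for c, _ in counts.most_common()]
        # symbols that never appear are appended in a fixed order
        return seen + [c for c in ALPHABET if c not in seen]
    return dict(zip(ranked(ciphertext), ranked(reference_text)))
```

Applying the resulting key symbol by symbol to the encoded message produces the partially readable text above.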
Hmmm. Rather than using just the frequency of single characters, perhaps we could look at the frequency of bigrams, pairs of consecutive characters. To that end, let $T(x,y)$ be the probability that $x$ is followed by $y$ in English text. For instance, we expect $T(\text{q}, \text{u})$ to be relatively large while $T(\text{q}, \text{b})$ should be relatively small, if not zero.
Suppose that our encoded message consists of the sequence of characters $c_i$ and that $\pi$ is the permutation of the symbols that decodes the message. If $c_ic_{i+1}$ is a bigram in the message, then $\pi(c_i)\pi(c_{i+1})$ is a bigram in a piece of English text, which means that $T(\pi(c_i), \pi(c_{i+1}))$ should be relatively high. This allows us to measure how likely a permutation is as a decoding key by defining the plausibility of a permutation to be
$$
P(\pi) = \prod_{i} T(\pi(c_i),\pi(c_{i+1})).
$$ Permutations with higher plausibilities are more likely to decode the message.
Now our task is to search through the $27!$ permutations looking for ones that are highly plausible. The following algorithm will generate a sequence of highly plausible permutations $\pi_n$: begin with some permutation $\pi_0$; at each step, form a candidate $\pi'$ by swapping the images of two randomly chosen symbols of $\pi_n$; if $P(\pi') \geq P(\pi_n)$, accept the candidate and set $\pi_{n+1} = \pi'$; otherwise, accept the candidate with probability $P(\pi')/P(\pi_n)$ and, if it is rejected, set $\pi_{n+1} = \pi_n$.
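Here is one way such an algorithm might be sketched in Python. The bigram table `log_T` (log probabilities, say estimated from a long reference text) is assumed to be given, and log-plausibilities are used so that the long product defining $P(\pi)$ doesn't underflow:

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def log_plausibility(perm, message, log_T):
    """log P(pi): the sum of log T(pi(c_i), pi(c_{i+1})) over the
    consecutive pairs of the message.  perm maps cipher symbols to
    plain symbols; log_T[(x, y)] is the log probability that x is
    followed by y in English text."""
    decoded = [perm[c] for c in message]
    return sum(log_T.get((a, b), -20.0)   # floor for unseen bigrams
               for a, b in zip(decoded, decoded[1:]))

def metropolis_decode(message, log_T, steps=10000, seed=0):
    rng = random.Random(seed)
    perm = dict(zip(ALPHABET, ALPHABET))  # pi_0: any starting guess
    current = log_plausibility(perm, message, log_T)
    for _ in range(steps):
        a, b = rng.sample(ALPHABET, 2)    # propose a transposition
        perm[a], perm[b] = perm[b], perm[a]
        proposed = log_plausibility(perm, message, log_T)
        # accept with probability min(1, P(new)/P(old))
        if rng.random() < math.exp(min(0.0, proposed - current)):
            current = proposed
        else:
            perm[a], perm[b] = perm[b], perm[a]   # reject: undo swap
    return perm
```

In practice one would start from the frequency-matching guess rather than the identity, and keep track of the most plausible permutation seen along the way.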
When I ran this algorithm, the permutation $\pi_{1000}$ applied to the encoded message produced:
ik mof tappen so tage wead anostew voou avofs ctwirsoptew w ovin mof bam webebvew stas te once tad a rhan ow ste rhan t ad ctwirsoptew wovin i dons unoh htict and stas te fred so call stir rhan poot stas har a lony sibe ayo and hten he r aid yoodvme he soou ste nabe hist fr ar he didns stinu ste rhan hofld hans is anm bowe hell hten edhawd veaw raid stas te hofld liue an excisiny nabe all so tibrelk ctwirsoptew w ovin raid as once histofs rsoppiny so stinu stas te har hin niestepoot and te har ro ar i tage explained ste poot paws i hill noh explain ste wers ok is
and $\pi_{2500}$ gave:
if you happen to have read another book about christopher r obin you may remember that he once had a swan or the swan h ad christopher robin i dont know which and that he used to call this swan pooh that was a long time ago and when we s aid goodbye we took the name with us as we didnt think the swan would want it any more well when edward bear said that he would like an exciting name all to himself christopher r obin said at once without stopping to think that he was win niethepooh and he was so as i have explained the pooh part i will now explain the rest of it
These results look pretty reasonable, and the whole thing, including scanning War and Peace, took about five seconds on my laptop. It’s remarkable that we can find the decoding permutation, out of the set of $10^{28}$ permutations, with so little effort.
This is an example of what Persi Diaconis has called the Markov Chain Monte Carlo revolution. In his 2009 survey article, Diaconis describes how a psychologist came to the Stanford statistical consulting service with a collection of coded messages exchanged between incarcerated people in the California prison system. Guessing that the messages were encoded with a substitution cipher, Stanford student Marc Coram applied this algorithm to decode them.
There are a few things to be said before we investigate more deeply. First, the algorithm only worked about half of the time; I’ve just chosen to present here one of the times that it did work. Other times, it seemed to get stuck with a permutation that produced gibberish. The odds of success went up when the initial permutation was chosen to be the one that matched symbols based on the frequency with which they appeared in the message rather than choosing a permutation at random.
Even when the message was successfully decoded, continuing to run the algorithm for more iterations frequently gave something like:
if you happen to have read another mook amout christopher romin
Let’s look more carefully at the algorithm to see what’s going on. At each step, we generate a new permutation. If that permutation is more plausible, we accept it. However, there is still a chance that a less plausible permutation will be accepted. So while the algorithm generally tends to increase the plausibility of the permutations it produces, there’s a way for it to escape a local maximum.
There is an alternative point of view that’s helpful and that we’ll develop more fully. Rather than searching for the most plausible permutation, the algorithm is actually sampling from the permutations in such a way that more plausible permutations are more likely to appear. More specifically, the probability of obtaining a permutation $\pi$ is proportional to its plausibility so that the sampling distribution is
$$
s(\pi) = \frac{P(\pi)}{Z}
$$ where $Z$ is the normalizing constant
$$
Z = \sum_\pi P(\pi).
$$
This alternative point of view will become more clear after we make a digression through the world of Markov chains.
Let’s begin with a network consisting of nodes joined by edges and suppose we traverse the network at random by moving along its edges. In particular, if we are at a node labeled $x$, the probability that we move along an edge to node $y$ will be denoted $K(x,y)$. If there are $m$ nodes in our network, $K$ is an $m\times m$ matrix we refer to as the transition matrix.
For example, the nodes in the network could be the 27 characters in a piece of English text with $K(x,y)$ being the probability that character $x$ is followed by character $y$.
Since each step takes us from $x$ to some node of the network, we have $\sum_y K(x,y) = 1$ so that the sum across any row of $K$ is 1. We say that $K$ is row stochastic and note that the row $K(x,-)$ represents the distribution describing our location one step after beginning at $x$.
Notice that $K(x,y)K(y,z)$ is the probability that we begin at $x$, pass through $y$, and then move to $z$. Therefore,
$$
K^2(x,z) = \sum_y K(x,y)K(y,z)
$$ represents the probability that we begin at $x$ and end up at $z$ after two steps. Likewise, $K^n(x,z)$ represents the probability that we begin at $x$ and end up at $z$ after $n$ steps.
As we randomly walk around the network, suppose that the distribution of locations at step $n$ is given by the row vector $s_n$; that is, $s_n(x)$ gives the probability that we are at node $x$ at that step. The sum $\sum_x s_n(x)K(x,y)$ gives the probability that we are at node $y$ after the next step. In other words, the product $s_nK$ provides the new distribution: $s_{n+1} = s_nK$.
As a simple example, consider the network with transition matrix
$$
K=\begin{bmatrix} 5/6 & 1/6 \\ 1/2 & 1/2 \end{bmatrix}.
$$
This means that if we’re at node $x$, there is a $5/6$ chance we stay at $x$ and a $1/6$ chance we move to $y$. On the other hand, if we are at $y$, we move to $x$ or stay at $y$ with equal probability.
If we begin at $x$, the initial distribution is described by the row vector $s_0 = \begin{bmatrix}1 & 0\end{bmatrix}$. After one step, the distribution is $s_1=s_0K = \begin{bmatrix}5/6 & 1/6 \end{bmatrix}$. Here’s how the first few steps proceed:
$$
\begin{array}{c|c}
{\bf n} & {\bf s_n} \\
\hline
0 & \begin{bmatrix}1.000 & 0.000 \end{bmatrix} \\
1 & \begin{bmatrix}0.833 & 0.167 \end{bmatrix} \\
2 & \begin{bmatrix}0.778 & 0.222 \end{bmatrix} \\
3 & \begin{bmatrix}0.759 & 0.241 \end{bmatrix} \\
4 & \begin{bmatrix}0.753 & 0.247 \end{bmatrix} \\
\end{array}
$$
If we continue, something remarkable happens. The distributions $s_n$ converge to $s_n\to\overline{s} = \begin{bmatrix} 0.75 & 0.25 \end{bmatrix}$. This means that after a long time, we spend three-quarters of our time at node $x$ and one-quarter of our time at node $y$.
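This convergence is easy to check numerically by iterating $s_{n+1} = s_nK$ by hand (a minimal sketch):

```python
# Transition matrix of the two-node example.
K = [[5/6, 1/6],
     [1/2, 1/2]]

s = [1.0, 0.0]                 # s_0: start at node x
for n in range(50):            # iterate s_{n+1} = s_n K
    s = [s[0]*K[0][0] + s[1]*K[1][0],
         s[0]*K[0][1] + s[1]*K[1][1]]
# s is now essentially the stationary distribution [0.75, 0.25]
```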
Since $s_{n+1} = s_nK$, this says that $$
\overline{s} = \overline{s}K
$$ and we call $\overline{s}$ a stationary distribution. In fact, the powers of $K$ converge $$
K^n \to \begin{bmatrix}
0.75 & 0.25 \\
0.75 & 0.25 \\
\end{bmatrix}
$$ to a rank one matrix so that no matter the initial distribution $s_0$, the distributions $s_n = s_0K^n$ converge to the same stationary distribution $\overline{s}=\begin{bmatrix} 0.75 & 0.25 \end{bmatrix}$.
This is an example of the beautiful Perron-Frobenius theorem: under a mild assumption on the row stochastic matrix $K$, any initial distribution $s_0$ will converge to a unique stationary distribution $\overline{s}$. In fact, this theorem is the basis for Google’s PageRank algorithm, as explained in an earlier Feature Column.
The sequence of distributions $s_n$ is called a Markov chain, and we say that the Markov chain converges to the unique stationary distribution.
But what does this have to do with the decoding problem we began with? Let’s recast the sequence of permutations $\pi_n$ that we created in the language of Markov chains.
Let’s view the set of permutations $S_n$ as a network where each permutation represents a node and an edge connects two nodes when the permutations differ by a transposition. We want to sample permutations from $S_n$ in such a way that we are more likely to sample highly plausible permutations. With this in mind, we view the plausibility as defining a distribution on $S_n$. That is, the probability of choosing a given permutation is $s(\pi) = P(\pi)/Z$ where $Z$ is a normalizing constant
$$
Z = \sum_{\pi} P(\pi).
$$ Of course, $S_n$ is so large that evaluating $Z$ is not feasible, but we’ll see that this presents no problem.
If we can find a transition matrix $K$ whose unique stationary distribution is $\overline{s} = s$, then a Markov chain will produce a sequence of permutations that is a sample from the distribution $s(\pi)$.
In the discussion of the Perron-Frobenius theorem, we began with the transition matrix $K$ and found the resulting stationary vector. Now we’d like to invert this process: if we are given a distribution $s$, can we find a transition matrix $K$ whose stationary distribution $\overline{s} = s$?
The Metropolis-Hastings algorithm tells us how to create the transition matrix $K$. We begin with any row stochastic transition matrix $J(x,y)$. Given $x$ and $y$, we define the acceptance ratio
$$
A(x,y) = \frac{s(y)J(y,x)}{s(x)J(x,y)}.
$$ This enables us to define
$$
K(x,y) = \begin{cases}
J(x,y) & \text{ if } x\neq y, A(x,y) \geq 1 \\
J(x,y)A(x,y) & \text{ if } x\neq y, A(x,y) \lt 1 \\
\end{cases}.
$$ The diagonal term $K(x,x)$ is chosen to enforce the condition that $\sum_y K(x,y) = 1$ so that $K$ is row stochastic.
What is the stationary distribution associated to $K$? Well, notice that $A(y,x) = 1/A(x,y)$. If the acceptance ratio $A(x,y) \lt 1$, then $A(y,x) \gt 1$ so that
$$
s(x)K(x,y) = s(x)J(x,y)A(x,y) = s(y)J(y,x) = s(y)K(y,x).
$$ Therefore
$$
\sum_x s(x)K(x,y) = \sum_x s(y)K(y,x) = s(y)\sum_xK(y,x) =
s(y).
$$ In other words, $sK = s$ so $s$ is the unique stationary distribution for $K$: $\overline{s} = s$.
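The construction can be verified numerically on a small state space. Here is a sketch for an arbitrary target distribution on three states with a uniform (symmetric) proposal; the particular numbers are made up for illustration:

```python
s = [0.5, 0.3, 0.2]                     # target distribution on 3 states
m = len(s)
J = [[1.0 / m] * m for _ in range(m)]   # uniform, symmetric proposal

# Build the Metropolis-Hastings transition matrix K.
K = [[0.0] * m for _ in range(m)]
for x in range(m):
    for y in range(m):
        if x != y:
            A = (s[y] * J[y][x]) / (s[x] * J[x][y])   # acceptance ratio
            K[x][y] = J[x][y] * min(1.0, A)
    K[x][x] = 1.0 - sum(K[x])           # diagonal keeps rows summing to 1

# Check stationarity: sK should equal s.
sK = [sum(s[x] * K[x][y] for x in range(m)) for y in range(m)]
```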
When we apply this to our decoding problem, we choose the initial transition matrix $J(x,y)$ so that from the permutation $x$, we uniformly choose any permutation $y$ with $y=\sigma x$ for a transposition $\sigma$. In this case, $J$ is symmetric so that $J(x,y) = J(y,x)$ and the acceptance ratio is
$$
A(x,y) = \frac{s(y)J(y,x)}{s(x)J(x,y)} = \frac{s(y)}{s(x)} =
\frac{P(y)/Z}{P(x)/Z} = \frac{P(y)}{P(x)}.
$$ Notice that we can find the ratio $s(y)/s(x)$ without knowing the normalizing constant $Z$.
Now we see that the sequence of permutations $\pi_n$ that we created to decode the message is actually a Markov chain given by the transition matrix $K$ whose stationary distribution is proportional to the plausibility. Since the Markov chain samples from this distribution, we find that the sequence of permutations favors permutations with a high plausibility, which makes it likely that we encounter the right permutation for decoding the message.
This method is known as Markov Chain Monte Carlo sampling, and it plays an important role in the mathematical analysis of political redistricting plans. As is well known, American state legislatures redraw maps of districts for political representation every 10 years using data from the latest Census. A typical state is geographically divided into small building blocks, such as Census blocks or tracts, and each building block is assigned to a Congressional district.
For instance, Michigan has 14 Congressional districts and 2813 Census tracts. A redistricting plan that assigns a Congressional district to each Census tract would be a function, $R:{\cal C}\to\{1,2,3,\ldots,14\}$, where $\cal C$ is the set of Census tracts. The number of such functions is $14^{2813}$, whose size we understate by calling it astronomically large. Most of these plans don’t satisfy legal requirements established by the state, but this just points to the fact that there are lots of possible redistricting plans.
Here are the requirements that must be satisfied by a Michigan redistricting plan. The populations of the districts should be approximately equal, and each district should be contiguous, meaning that one could walk over land between any two points in a district. Districts are also required to be “compact,” meaning the area of each district is a significant fraction of the area of the smallest circle that encloses the district.
When a particular party controls the state legislature, they are naturally motivated to draw districts that ensure their party will gain as many representatives as possible no matter the expressed preference of voters. For instance, they may put many voters from the opposing party into a small number of districts while spreading their own voters around just enough to ensure a majority in a large number of districts. This practice is called gerrymandering.
For example, Republicans won 60 of the 99 seats in the Wisconsin State Assembly in the 2012 elections after the Republican-controlled legislature created a new redistricting plan following the 2010 Census. That is, Republicans won about 60% of the seats in spite of winning only 50.05% of the votes. How can we assess whether this is the result of a gerrymander or simply due to constraints on the redistricting plans?
There are many mathematical approaches to this problem, but a promising recent approach is to generate a large ensemble of redistricting plans and, for each plan, determine how many seats each party would win had that plan been in place. Now the problem starts to sound familiar: we want to generate a representative sample from an extremely large set of possibilities. Markov Chain Monte Carlo sampling does just that!
There’s a network that provides a useful mathematical model of a redistricting plan. Each geographic building block is a node, and two nodes are joined by an edge if the corresponding building blocks abut one another. Nodes with the same color in the figure below belong to the same district in a particular redistricting plan.
We say that an edge joining two nodes in different districts is a conflicted edge.
To facilitate our sampling strategy, we will now build a network of redistricting plans with each plan forming a node. An edge joins two plans if one is obtained from the other by changing the district of one endpoint of a conflicted edge. For instance, here’s a new plan obtained by changing one endpoint of the conflicted edge from yellow to green. These two plans are connected by an edge in our network of redistricting plans.
We would like to draw samples from the huge set of redistricting plans by performing a random walk on this network. If we simply follow an edge at random, we have the transition matrix $J(R, R’)$ where $R$ and $R’$ are redistricting plans joined by an edge. If $c(R)$ is the number of conflicted edges, we have
$$
J(R,R’) = \frac{1}{2c(R)}.
$$
But now we would like to sample redistricting plans that satisfy the additional requirements of equal population, contiguity, and so forth. For each requirement, there is a measure of how well a redistricting plan satisfies that requirement. For instance, $M_{\text{pop}}(R)$ will measure how well the equal population requirement is met. These measures are weighted and combined into a total scoring function
$$
M(R) = w_{\text{pop}}M_{\text{pop}}(R) + w_{\text{compact}}M_{\text{compact}}(R) + \ldots.
$$ The better the redistricting plan $R$ satisfies the requirements, the lower the scoring function $M(R)$.
Finally, the function $e^{-M(R)}$ defines a distribution
$$
s(R) = \frac{e^{-M(R)}}{Z}
$$ where $Z = \sum_R e^{-M(R)}$ is a normalizing constant that, as before, needn’t concern us. Sampling from this distribution means we are more likely to obtain redistricting plans that satisfy the requirements.
We’re now in a position to apply the Metropolis-Hastings algorithm to obtain the transition matrix $K(R, R’)$ whose unique stationary distribution is $s(R)$. Taking a random walk using this transition matrix produces a sample whose redistricting plans are likely to satisfy the legal requirements for a redistricting plan.
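As a toy illustration of such a chain (entirely my own construction: six blocks on a path, two districts, population balance as the only score term, and no contiguity or compactness checks):

```python
import math
import random

# Six blocks arranged on a path; edges of the block-adjacency graph.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
POP = [10, 12, 8, 11, 9, 10]          # block populations (made up)

def conflicted(plan):
    """Edges whose endpoints lie in different districts."""
    return [(u, v) for u, v in EDGES if plan[u] != plan[v]]

def score(plan):
    """M(R): squared deviation of district populations from equal."""
    pops = [0, 0]
    for block, d in enumerate(plan):
        pops[d] += POP[block]
    target = sum(POP) / 2
    return sum((p - target) ** 2 for p in pops) / 100.0

def step(plan, rng):
    """One Metropolis-Hastings step on the network of plans."""
    conf = conflicted(plan)
    u, v = rng.choice(conf)           # pick a conflicted edge...
    block = rng.choice([u, v])        # ...and an endpoint to flip
    new = list(plan)
    new[block] = 1 - new[block]
    new_conf = conflicted(new)
    if not new_conf:                  # every block in one district: reject
        return plan
    # acceptance ratio: e^{M(R)-M(R')} times J(R',R)/J(R,R') = c(R)/c(R')
    A = math.exp(score(plan) - score(new)) * len(conf) / len(new_conf)
    return new if rng.random() < min(1.0, A) else plan

rng = random.Random(1)
plan = [0, 0, 0, 1, 1, 1]
for _ in range(1000):
    plan = step(plan, rng)
```

A real implementation, of course, works on a state's full adjacency graph and includes all the score terms and contiguity constraints.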
To study the 2012 Wisconsin State Assembly election, Herschlag, Ravier, and Mattingly drew an ensemble of 19,184 redistricting plans in this way, and, for each plan in the ensemble, determined the number of seats that would have been won by the two parties had that plan been in place. The results are summarized in the following histogram, which shows the frequency of seats won by Republicans.
The plan in place resulted in 60 Republican seats, which is shaded red in the histogram. This result appears to be an outlier, which is evidence that the redistricting plan drawn by Republicans in 2010 is the result of an egregious partisan gerrymander.
Herschlag, Ravier, and Mattingly summarize this result, combined with others from their analysis, by writing:
The Wisconsin redistricting seems to create a firewall which resists Republicans falling below 50 seats. The effect is striking around the mark of 60 seats where the number of Republican seats remains constant, despite the fraction of votes dropping from 51% to 48%.
In addition to the Wisconsin study described here, these techniques have been used to assess North Carolina’s Congressional redistricting plan, which saw Republicans capturing 10 of 13 seats in the 2016 election in spite of a 47-53 percent split in votes given to the two major parties. Fewer than one percent of the 24,000 maps generated gave Republicans ten seats.
While mathematicians have created other metrics to detect gerrymanders, the approach using Markov Chain Monte Carlo sampling described here offers several advantages. First, it can be adapted to the unique requirements of any state by modifying the scoring function $M(R)$.
It’s also relatively easy to communicate the results to a non-technical audience, which is important for explaining them in court. Indeed, a group of mathematicians and legal experts filed an amicus brief with the Supreme Court that was cited in Justice Elena Kagan’s dissent in a recent gerrymandering case regarding North Carolina’s redistricting map.
This column has focused on two applications of Markov Chain Monte Carlo sampling. Diaconis’ survey article, cited in the references, describes many more uses in areas such as group theory, chemistry, and theoretical computer science.
Persi Diaconis. The Markov chain Monte Carlo revolution. Bulletin of the American Mathematical Society, 46 (2009), 179-205.
Gregory Herschlag, Robert Ravier, and Jonathan C. Mattingly. Evaluating Partisan Gerrymandering in Wisconsin. 2017.
Gregory Herschlag, Han Sung Kang, Justin Luo, Christy Vaughn Graves, Sachet Bangia, Robert Ravier, and Jonathan C. Mattingly. Quantifying gerrymandering in North Carolina. Statistics and Public Policy, 7(1) (2020), 30–38.
Moon Duchin. Gerrymandering Metrics: How to measure? What’s the baseline? 2018.
Sachet Bangia, Christy Vaughn Graves, Gregory Herschlag, Han Sung Kang, Justin Luo, Jonathan C. Mattingly, Robert Ravier. Redistricting: Drawing the Line. 2017.
MGGG Redistricting Lab. The Metric Geometry and Gerrymandering Group is a great resource for learning about new developments in this area.
Noah Giansiracusa
Bentley University
Artificial intelligence (AI) breakthroughs make the news headlines with increasing frequency these days. At least for the time being, AI is synonymous with deep learning, which means machine learning based on neural networks (don't worry if you don't know what neural networks are—you're not going to need them in this post). One area of deep learning that has generated a lot of interest, and a lot of cool results, is graph neural networks (GNNs). This technique lets us feed a neural network data that naturally lives on a graph, rather than in a vector space like Euclidean space. A big reason for the popularity of this technique is that much of our modern internet-centric lives takes place in graphs. Social media platforms connect users into massive graphs, with accounts as vertices and friendships as edges (following another user corresponds to a directed edge in a directed graph), while search engines like Google view the web as a directed graph with webpages as vertices and hyperlinks as edges.
AirBnB provides an interesting additional example. At the beginning of 2021, the Chief Technology Officer at AirBnB predicted that GNNs would soon be big business for the company, and indeed just a few months ago an engineer at AirBnB explained in a blog post some of the ways they now use GNNs and their reasons for doing so. This engineer starts his post with the following bird's-eye view of why graphs are important for modern data—which I'll quote here since it perfectly sets the stage for us:
Many real-world machine learning problems can be framed as graph problems. On online platforms, users often share assets (e.g. photos) and interact with each other (e.g. messages, bookings, reviews). These connections between users naturally form edges that can be used to create a graph. However, in many cases, machine learning practitioners do not leverage these connections when building machine learning models, and instead treat nodes (in this case, users) as completely independent entities. While this does simplify things, leaving out information around a node’s connections may reduce model performance by ignoring where this node is in the context of the overall graph.
In this Feature Column we're going to explore how to shoehorn this missing graph-theoretic "context" of each node back into a simple Euclidean format that is amenable to standard machine learning and statistical analysis. This is a more traditional approach to working with graph data, pre-dating GNNs. The basic idea is to cook up various metrics that transform the discrete geometry of graphs into numbers attached to each vertex. This is a fun setting to see some graph theory in action, and you don't need to know any machine learning beforehand—I'll start out with a quick gentle review of all you need.
The three main tasks in machine learning are regression, classification, and clustering.
For regression, you have a collection of variables called features and one additional variable, necessarily numerical (meaning $\mathbb{R}$-valued) called the target variable; by considering the training data where the values of both the features and target are known, you fit a model that attempts to predict the value of the target on actual data where the features but not the target are known. For instance, predicting a college student's income after graduation based on their GPA and the college they are attending is a regression task. Suppose all the features are numerical—for instance, we could represent each college with its US News ranking (ignore for a moment how problematic those rankings are). Then a common approach is linear regression, which is when you find a hyperplane in the Euclidean space coordinatized by the features and the target that best fits the training data (i.e., minimizes the "vertical" distance from the training points down to the hyperplane).
Classification is very similar; the only difference is that the target variable is categorical rather than numerical—which in math terms just means that it takes values in a finite set rather than in $\mathbb{R}$. When this target set has size two (which is often $\{\mathrm{True}, \mathrm{False}\}$, or $\{\mathrm{yes}, \mathrm{no}\}$, or $\{0, 1\}$) this is called binary classification. For instance, predicting which students will be employed within a year of graduation can be framed as a binary classification task.
Clustering is slightly different because there is no target, only features, and you'd like to partition your data into a small number of subsets in some natural way based on these features. There is no right or wrong answer here—clustering tends to be more of an exploratory activity. For example, you could try clustering college students based on their GPA, SAT score, financial aid amount, number of honors classes taken, and number of intramural sports played, then see if the clusters have human-interpretable descriptions that might be helpful in understanding how students divide into cohorts.
I want to provide you with the details of one regression and classification method, to give you something concrete to have in mind when we turn to graphs. Let's do the k-Nearest Neighbors (k-NN) algorithm. This algorithm is a bit funny because it doesn't actually fit a model to training data in the usual sense—to predict the value of the target variable for each new data point, the algorithm looks directly back at the training data and makes a calculation based on it. Start by fixing an integer $k \ge 1$ (smaller values of k provide a localized, granular look at the data, whereas larger values provide a smoothed, aggregate view). Given a data point P with known feature values but unknown target value, the algorithm first finds the k nearest training points $Q_1, \ldots, Q_k$—meaning the training points $Q_i$ whose distance in the Euclidean space of feature values to P is smallest. Then if the task is regression, the predicted target value for P is the average of the target values of $Q_1, \ldots, Q_k$, whereas if the task is classification then the classes of these $Q_i$ are treated as votes and the predicted class for $P$ is whichever class receives the most votes. (Needless to say, there are plenty of variants such as weighting the average/vote by distance to P, changing the average from a mean to a median, or changing the metric from Euclidean to something else.)
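A minimal sketch of k-NN, handling both tasks (the students and their GPA/SAT numbers are made up for illustration):

```python
import math
from collections import Counter

def knn_predict(train, point, k, task="classification"):
    """Predict the target for `point` from labeled training data.
    `train` is a list of (features, target) pairs, with features
    given as tuples of numbers; `point` is a feature tuple."""
    dist = lambda q: math.dist(q, point)          # Euclidean distance
    nearest = sorted(train, key=lambda ft: dist(ft[0]))[:k]
    targets = [t for _, t in nearest]
    if task == "regression":
        return sum(targets) / k                    # average the targets
    return Counter(targets).most_common(1)[0][0]   # majority vote

# (GPA, SAT) -> employment status, one year after graduation
train = [((3.9, 1400), "employed"), ((2.1, 1000), "unemployed"),
         ((3.5, 1300), "employed"), ((2.4, 950), "unemployed")]
prediction = knn_predict(train, (3.7, 1350), k=3)
# two of the three nearest students are employed, so: "employed"
```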
That's all you need to know about machine learning in the usual Euclidean setting where data points live in $\mathbb{R}^n$. Before turning to data points that live in a graph, we need to discuss some graph theory.
First, some terminology. Two vertices in a graph are neighbors if they are connected by an edge. Two edges are adjacent if they have a vertex in common. A path is a sequence of adjacent edges. The distance between two vertices is the length of the shortest path between them, where length here just means the number of edges in the path.
A useful example to keep in mind is a social media platform like Facebook, where the vertices represent users and the edges represent "friendship" between them. (This is an undirected graph; platforms like Twitter and Instagram in which accounts follow each other asymmetrically form directed graphs. Everything in this article could be done for directed graphs with minor modification, but I'll stick to the undirected case for simplicity.) In this example, your neighbors are your Facebook friends, and a user has distance 2 from you if you're not friends but you have a friend in common. The closed ball of radius 6 centered at you (that is, all accounts of distance at most 6 from you) consists of all Facebook users you can reach through at most 6 degrees of separation on the platform.
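Graph distances of this kind are computed by breadth-first search. Here is a sketch on a tiny made-up friendship graph, represented as a dictionary mapping each vertex to the set of its neighbors:

```python
from collections import deque

def distances_from(graph, start):
    """Breadth-first search: shortest path length (in edges) from
    `start` to every reachable vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

graph = {"you": {"ann", "bob"}, "ann": {"you", "cat"},
         "bob": {"you"}, "cat": {"ann"}}
d = distances_from(graph, "you")
# the closed ball of radius 2 centered at "you":
ball = {v for v, n in d.items() if n <= 2}
```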
Next, we need some ways of quantifying the structural role vertices play in a graph. There are a bunch of these, many of which capture various notions of centrality in the graph; I'll just provide a few here.
Starting with the simplest, we have the degree of a vertex, which in a graph without loops or multiple edges is just the number of neighbors. In Facebook, your degree is your number of friends. (In a directed graph the degree splits as the sum of the in-degree and the out-degree, which on Twitter count the number of followers and the number of accounts followed.)
The closeness of a vertex captures whether it lies near the center or the periphery of the graph. It is defined as the reciprocal of the sum of distances between this vertex and each other vertex in the graph. A vertex near the center will have a relatively modest distance to the other vertices, whereas a more peripheral vertex will have a modest distance to some vertices but a large distance to the vertices on the “opposite” side of the graph. This means that the sum of distances for a central vertex is smaller than the sum of distances for a peripheral vertex; reciprocating this sum flips this around so that the closeness score is greater for a central vertex than a peripheral vertex.
The betweenness of a vertex, roughly speaking, captures centrality in terms of the number of paths in the graph that pass through the vertex. More precisely, it is the sum over all pairs of other vertices in the graph of the fraction of shortest paths between the pair of vertices that pass through the vertex in question. That's a mouthful, so let's unpack it with a couple simple examples. Consider the following two graphs:
In graph (a), the betweenness of V1 is 0 because no shortest paths between the remaining vertices pass through V1. The same is true of V2 and V3. The betweenness of V4, however, is 2: between V1 and V2 there is a unique shortest path and it passes through V4, and similarly between V1 and V3 there is a unique shortest path and it also passes through V4. For the graph in (b), by symmetry it suffices to compute the betweenness of a single vertex. The betweenness of V1 is 0.5, because between V2 and V3 there are 2 shortest paths, exactly one of which passes through V1.
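Here is a Python sketch that reproduces these two computations. I'm assuming the graphs pictured are: (a) a vertex V4 adjacent to V1, V2, and V3, plus an edge between V2 and V3; and (b) a 4-cycle V2-V1-V3-V4-V2; these are the readings consistent with the betweenness values above.

```python
from collections import deque

def bfs_counts(graph, source):
    """Distances from source, plus the number of distinct shortest paths."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma

def betweenness(graph, v):
    """Sum over pairs {s, t} (both distinct from v) of the fraction of
    shortest s-t paths passing through v.  Assumes a connected graph."""
    dist_v, sigma_v = bfs_counts(graph, v)
    others = sorted(u for u in graph if u != v)
    total = 0.0
    for i, s in enumerate(others):
        dist_s, sigma_s = bfs_counts(graph, s)
        for t in others[i + 1:]:
            # A shortest s-t path passes through v iff d(s,v) + d(v,t) = d(s,t);
            # there are then sigma(s,v) * sigma(v,t) such paths.
            if dist_s[v] + dist_v[t] == dist_s[t]:
                total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total

# Graph (a): V4 joined to V1, V2, V3, plus an edge between V2 and V3.
a = {1: [4], 2: [3, 4], 3: [2, 4], 4: [1, 2, 3]}
# Graph (b): the 4-cycle V2-V1-V3-V4-V2.
b = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
print(betweenness(a, 1), betweenness(a, 4))  # 0.0 2.0
print(betweenness(b, 1))                     # 0.5
```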
The following figure shows a randomly generated graph on 20 vertices where in (a) the size of each vertex corresponds to its closeness score and in (b) it corresponds to the betweenness score. Note that closeness does indeed reflect how central or peripheral each vertex is; betweenness is harder to interpret directly, but roughly speaking it helps identify important bridges in the graph.
Another useful pair of measures of vertex importance/centrality in a graph is the eigenvector centrality score and the PageRank score. I'll leave it to the interested reader to look these up. They both have nice interpretations in terms of eigenvectors related to the adjacency matrix and in terms of random walks on the graph.
Suppose we have data in the usual form for machine learning—so there are features for clustering, or if one is doing regression/classification then there is additionally a target variable—but suppose in addition that the data points form the vertices of a graph. An easy yet remarkably effective way to incorporate this graph structure (that is, to not ignore where each vertex is "in the context of the overall graph," in the words of the AirBnB engineer) is simply to append a few additional features given by the vertex metrics discussed earlier: degree, closeness, betweenness, eigenvector centrality, PageRank (and there are plenty others beyond these as well).
For instance, one could perform clustering in this manner, and this would cluster the vertices based on both their graph-theoretic properties as well as the original non-graph-theoretic feature values. Concretely, if one added closeness as a single additional graph-theoretic feature, then the resulting clustering is more likely to put peripheral vertices together in the same clusters and it is more likely to put vertices near the center of the graph together in the same clusters.
The following figure shows the same 20-vertex random graph pictured earlier, now with vertices colored by a clustering algorithm (k-means, for $k=3$) that uses two graph-theoretic features: closeness and betweenness. We obtain one cluster comprising the two isolated vertices, one cluster comprising the two very central vertices, and one cluster comprising everything else.
If one is predicting the starting income of college students upon graduation, one could use a regression method with traditional features as discussed above but include additional features such as the eigenvector centrality of each student in the network formed by connecting students whenever they took at least one class together.
So far we've augmented traditional machine learning tasks by incorporating graph-theoretic features. Our last topic is a machine learning task without counterpart in the traditional non-graph-theoretic world: edge prediction. Given a graph (possibly with a collection of feature values for each vertex), we'd like to predict which edge is most likely to form next, when the graph is considered as a somewhat dynamic process in which the vertex set is held constant but the edges form over time. In the context of Facebook, this is predicting which two users who are not yet Facebook friends are most likely to become friends—and once Facebook makes this prediction, it can use it as a suggestion. We don't know the method Facebook actually uses for this (my guess is that it at least involves GNNs), but I can explain a very natural approach that is widely used in the data science community.
We first need one additional background ingredient from machine learning. Rather than directly predicting the class of a data point, most classifiers first compute the propensity scores, which up to normalization are essentially the estimated probability of each class—then the predicted class is whichever class has the highest propensity score. For example, in k-NN I said the prediction is given by counting the number of neighbors in each class and taking the most prevalent class; these class counts are the propensity scores for k-NN classification. Concretely, for 10-NN if a data point has 5 red neighbors and 3 green neighbors and 2 blue neighbors, then the propensity scores are 0.5 for red, 0.3 for green, and 0.2 for blue (and of course the prediction itself is then red). For binary classification one usually just reports a single propensity score between 0 and 1, since the propensity score for the other class is just the complementary probability.
Returning to the edge prediction task, consider a graph with n vertices and imagine a matrix with n choose 2 rows indexed by the pairs of vertices in the graph. The columns for this matrix are features associated to pairs of vertices—which could be something like the mean (or min, or max) of the closeness (or betweenness, or eigenvector centrality, or...) score for the two vertices in the pair, and if there are non-graph-theoretic features associated with the vertices one could also draw from these, and one could also use the distance between the two vertices in the pair as a feature. Create an additional column, playing the role of the target variable, that is a 1 if the vertex pair are neighbors (that is, joined by an edge) and a 0 otherwise. Train a binary classifier on this data, and the vertex pair with the highest propensity score among those that are not neighbors is the pair most inclined to become neighbors—that is, this is the next edge most likely to form, based on the features used. This reveals the edges that don't exist yet but seem like they should, based on the structure of the graph (and the extrinsic non-graph data, if one also uses that).
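Here is a sketch of the feature-matrix construction in Python, with two illustrative features per pair (mean closeness and pairwise distance); a real pipeline would add more features and feed the rows to an off-the-shelf binary classifier such as logistic regression:

```python
from collections import deque
from itertools import combinations

def bfs_distances(graph, source):
    """Shortest-path distances from source via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def edge_prediction_table(graph):
    """One row per vertex pair: (mean closeness, distance) as features,
    plus a 0/1 target saying whether the pair is already an edge.
    Assumes a connected graph."""
    dist = {v: bfs_distances(graph, v) for v in graph}
    close = {v: 1 / sum(d for u, d in dist[v].items() if u != v) for v in graph}
    rows = []
    for u, v in combinations(sorted(graph), 2):
        features = ((close[u] + close[v]) / 2, dist[u][v])
        target = 1 if v in graph[u] else 0
        rows.append((features, target))
    return rows

# Path graph 1-2-3-4: three edges, three non-edges.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
table = edge_prediction_table(path)
print(len(table))             # 6, i.e. "4 choose 2" rows
print([t for _, t in table])  # the 0/1 target column
```

With the table in hand, train any binary classifier on these (features, target) rows and rank the non-edge rows by propensity score.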
If one has snapshots of the graph's evolution across time, one can train this binary classifier on the graph at time $t$ then compare the predicted edges to the actual edges that exist at some later time $t' > t$, to get a sense of how accurate these edge predictions are.
In broad outlines, here's the path we took in this article and where we ended up. The distance between vertices in a graph—generalizing the popular "degrees of separation" games played with Kevin Bacon's movie roles and Paul Erdős' collaborations—allows one to quantify various graph-theoretic roles the vertices play, via notions like betweenness and closeness. These quantifications can then serve as features in clustering, regression, and classification tasks, which helps the machine learning algorithms involved incorporate the graph structure on the data points. By considering vertex pairs as data points and using the average closeness, betweenness, etc., across each pair (and/or the distance between the pair), we can predict which missing edges "should" exist in the graph. When the graph is a social media network, these missing edges can be framed as algorithmic friend/follower suggestions.
And when the graph is of mathematical collaborations (mathematicians as vertices and edges joining pairs that have co-authored a paper together), this can suggest to you who your next collaborator should be: just find the mathematician you haven't published with yet whose propensity score is highest!
Ursula Whitcher
Mathematical Reviews (AMS)
In Helsinki this summer, Ukrainian mathematician Maryna Viazovska was awarded a Fields Medal "for the proof that the $E_8$ lattice provides the densest packing of identical spheres in 8 dimensions, and further contributions to related extremal problems and interpolation problems in Fourier analysis."
Finding the most efficient way to pack identical spheres is an extremely challenging problem! Even in three dimensions, this was an open question for hundreds of years. In the seventeenth century, Johannes Kepler asserted that the familiar stacked pyramid was the best possible option, but the first person to provide a complete proof of this fact was Thomas Hales, in 1998. (Bill Casselman illustrated some of Hales' arguments using two-dimensional examples in the December 2000 Feature Column, Packing pennies in the plane.)
In 2016, Viazovska found an elegant way to show that a particular method of packing spheres in 8-dimensional space was the best. She then teamed up with four other mathematicians inspired by her arguments, Henry Cohn, Abhinav Kumar, Stephen D. Miller, and Danylo Radchenko, to prove a similar result in 24 dimensions. Six years later, 8 and 24 are still the only dimensions higher than 3 where the best way to pack identical spheres is known! The best way to pack spheres in these dimensions turns out to be extremely symmetrical. This might not be true in every dimension—maybe sometimes it's better to squeeze extra higher-dimensional spheres into unexpected corners!
What is the $E_8$ lattice that appears in Viazovska's proof? What makes it special? How do you use it to pack spheres? Let's explore these questions and check out some beautiful visualizations of $E_8$.
Before we define $E_8$, we should explore the more general concept of a lattice. It's important to be specific here, because the word "lattice" is used for multiple distinct mathematical concepts! The set of points in the plane with integer coordinates is an easy-to-visualize example of the lattices we'll be talking about today.
There are lots of different ways to describe the points in the plane with integer coordinates. One place to start is to focus on the points $(0,1)$ and $(1,0)$. If we add these points together, we get a new point, $(1,1)$, that also has integer coordinates. If we keep on adding or subtracting $(0,1)$ and $(1,0)$, we will eventually reach every point in the plane with integer coordinates!
We can connect the origin $(0,0)$, the starting points $(0,1)$ and $(1,0)$, and the first sum $(1,1)$ to make a square. A square is a special kind of parallelogram, and this square is called the fundamental parallelogram for the lattice of points with integer coordinates.
Now, what if we reversed the procedure? Instead of starting with an infinite set of scattered points and making a parallelogram, we could start with a parallelogram and create an infinite set. Our procedure is to place one vertex of the parallelogram at the origin, then repeatedly add or subtract the points corresponding to the two vertices of the parallelogram adjacent to the origin. In particular, the sum of these two vertices is the remaining vertex of the parallelogram.
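Here is a tiny Python sketch of that reverse procedure in two dimensions (the parallelogram coordinates are my own illustrative choices): repeatedly adding and subtracting the two generating vertices sweeps out a window of the infinite lattice.

```python
from itertools import product

def lattice_points(v1, v2, coeffs=range(-2, 3)):
    """A finite window of the lattice generated by the parallelogram with
    vertices at the origin, v1, v2, and v1 + v2: all points a*v1 + b*v2
    for small integer coefficients a and b."""
    return {
        (a * v1[0] + b * v2[0], a * v1[1] + b * v2[1])
        for a, b in product(coeffs, repeat=2)
    }

# The unit square generates the integer lattice.
print((2, -2) in lattice_points((1, 0), (0, 1)))  # True
# A slanted parallelogram generates a different, sparser lattice.
print((1, 1) in lattice_points((2, 0), (1, 1)))   # True
print((1, 0) in lattice_points((2, 0), (1, 1)))   # False
```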
In three dimensions, we could generalize this procedure using the corners of a box. Or more broadly, since we don't need the edges to meet at right angles, we could use the three-dimensional generalization of a parallelogram: a parallelepiped. (Here's an etymological puzzle I learned from John Conway: why don't we pronounce "parallelepiped" as "parallel-epi-ped," emphasizing the Greek root for "on"?)
To build a lattice in $n$ dimensions, we just need the $n$-dimensional generalization of a parallelogram and parallelepiped. Such a shape is called a parallelotope. Alternatively, we could simplify a bit by focusing on the $n$ vertices of our fundamental parallelotope that are connected by a line segment to the origin. (Linear algebra enthusiasts will recognize that we are specifying a basis of $\mathbb{R}^n$.)
We are ready to describe the $E_8$ lattice! Concretely, it is the eight-dimensional lattice determined by the eight following fundamental parallelotope vertices:
Repeatedly adding and subtracting these points creates the entire infinite $E_8$ lattice.
You might notice that the points of $E_8$ have either integer coordinates or half-integer coordinates, but never a combination of integers and half-integers. Another way to describe $E_8$ is that it consists of the points in eight dimensions with only integer or only half-integer coordinates where the sum of the coordinates is an even number.
Maryna Viazovska showed that the most efficient sphere packing in eight dimensions places a sphere center at each of the points in the $E_8$ lattice. What is the radius of these spheres? You might notice that each of the eight fundamental parallelotope vertices we used to build $E_8$ is a distance of $\sqrt{2}$ from the origin. This turns out to be the smallest possible distance between any pair of points in the $E_8$ lattice, so if we give each sphere a radius of $\sqrt{2}/2$, the spheres will touch without overlapping. (Of course, if we want to pack identical eight-dimensional spheres that have some other radius, we can scale the entire picture up or down.)
The eight special points we chose are not the only points in $E_8$ with minimum distance from the origin! We can quickly construct eight more by subtracting our chosen points from the origin. But that is only the beginning. There are 240 points in $E_8$ at a distance of $\sqrt{2}$ from the origin. These special points are called roots. We can immediately see that in the $E_8$ lattice packing, a sphere centered at the origin will touch 240 other spheres. Because we can move any point of $E_8$ to the origin by subtracting its coordinates from every other lattice point, we conclude that every sphere in the $E_8$ lattice packing touches 240 other spheres.
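As a sanity check on these counts, here is a short Python sketch that builds the two standard families of minimal vectors, assuming the coordinate description of $E_8$ given above (all-integer or all-half-integer coordinates with an even coordinate sum), and confirms there are exactly 240 of them, all of squared length 2.

```python
from fractions import Fraction
from itertools import combinations, product

half = Fraction(1, 2)
roots = set()

# Integer-coordinate roots: exactly two nonzero entries, each +1 or -1.
for i, j in combinations(range(8), 2):
    for si, sj in product((1, -1), repeat=2):
        v = [0] * 8
        v[i], v[j] = si, sj
        roots.add(tuple(v))

# Half-integer roots: every entry +1/2 or -1/2, with an even coordinate sum.
for signs in product((half, -half), repeat=8):
    if sum(signs) % 2 == 0:
        roots.add(signs)

print(len(roots))  # 240
assert all(sum(x * x for x in v) == 2 for v in roots)  # all at distance sqrt(2)
```

The first family contributes 112 roots and the second 128, for 240 in total.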
We can use the 240 roots of $E_8$ to construct a polytope—a higher-dimensional polyhedron—that has 240 vertices. Formally, this polytope is known as the $4_{21}$ polytope, based on a classification by the lawyer and amateur mathematician Thorold Gosset. Because it lives in eight dimensions, we can't visualize the $4_{21}$ polytope directly. However, we can create images describing which vertices are connected by edges to which other vertices, in a process analogous to the diagrams of a cube you might draw on a sheet of paper. This helps us visualize the structure of the 240 spheres surrounding the sphere at the origin in the $E_8$ lattice packing.
Here is a projection of the $E_8$ root polytope into three dimensions:
The mathematician José Luis Rodríguez Blancas led a project to visualize the $4_{21}$ polytope using colored thread. In this visualization, the 240 roots are divided into eight different "crowns," each containing 30 vertices.
Another way to understand the eight fundamental parallelotope vertices connected to the origin is to look at the angle described by the origin and each pair of vertices. Let's think of the vertices as vectors, each with its head at the vertex and its tail at the origin. If we measure in radians, the angle $\theta$ between any pair of vectors $\mathbf{v}$ and $\mathbf{w}$ satisfies the equation
$$ \cos \theta = \frac{\mathbf{v} \cdot \mathbf{w}}{||\mathbf{v}|| ||\mathbf{w}||}.$$
In our eight-dimensional case, $(v_1, \dots, v_8) \cdot (w_1, \dots, w_8) = v_1 w_1 + \dots + v_8 w_8$ and the length $||(v_1,\dots,v_8)||$ is given by $\sqrt{v_1^2+\dots+v_8^2}$. Because each of our eight special vertices is a distance of $\sqrt{2}$ from the origin, we know $||\mathbf{v}|| ||\mathbf{w}|| = 2$ for any pair of special vertices, so
$$ \cos \theta = \frac{1}{2} \mathbf{v} \cdot \mathbf{w}.$$
Thus, we can determine the angles between pairs of our special vertices by calculating their dot products. Here's a matrix with every possible pair of dot products:
$$\begin{pmatrix}
2 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 2 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 2 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 2 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 2 & -1 & -1 & 0 \\
0 & 0 & 0 & 0 & -1 & 2 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 2 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 2
\end{pmatrix}$$
We have 2s on the diagonal, as expected. Distinct vertices have dot product either 0 or -1, so the corresponding angles are either right angles or $2 \pi/3$ ($120^\circ$)—a strikingly symmetrical arrangement!
A subtler fact about this matrix of dot products is that it has determinant 1. When a lattice's dot product matrix has this property, we say that the lattice is unimodular. In any dimension, the lattice of all points with integer coordinates is unimodular, since the corresponding dot product matrix is the identity matrix. But every point in the $E_8$ lattice has an even dot product with itself, and even unimodular lattices are much rarer. In fact, $E_8$ is the smallest even unimodular lattice—as long as we assume that our lattices can be embedded in $\mathbb{R}^n$! (If you're willing to admit imaginary lengths, your options for building unimodular objects expand.)
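Exact rational arithmetic makes this determinant easy to check. Here is a sketch in Python using the matrix of dot products above:

```python
from fractions import Fraction

# The matrix of dot products of the eight special vertices, as above.
G = [
    [ 2, -1,  0,  0,  0,  0,  0,  0],
    [-1,  2, -1,  0,  0,  0,  0,  0],
    [ 0, -1,  2, -1,  0,  0,  0,  0],
    [ 0,  0, -1,  2, -1,  0,  0,  0],
    [ 0,  0,  0, -1,  2, -1, -1,  0],
    [ 0,  0,  0,  0, -1,  2,  0,  0],
    [ 0,  0,  0,  0, -1,  0,  2, -1],
    [ 0,  0,  0,  0,  0,  0, -1,  2],
]

def determinant(m):
    """Determinant by exact Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in m]
    n, det = len(m), Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return det

print(determinant(G))  # 1
```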
We can make a sixteen-dimensional even unimodular lattice by taking pairs of lattice points in $E_8$. There's a second even unimodular lattice in dimension sixteen, as well. But there are no other geometrically feasible options between 8 and 16. The next even unimodular lattices appear in 24 dimensions. That's why Viazovska and her collaborators guessed there should be a way to extend her techniques from 8 to 24-dimensional sphere packings—skipping everything in between!
Courtney Gibbons
Hamilton College
My interest in applied algebra was a long time coming. I’m not exactly a fan of living in reality, so the idea of taking something so lovely (to me) as algebra and applying it to things like disease modeling or phylogenetic trees seemed too, well, real. That’s not to say I didn’t realize how important these applications are, but those weren’t applications that captured my interest or imagination—at least, not right away!
Around 2010, I came across a paper by Elizabeth Arnold, Stephen Lucas, and Laura Taalman that described how to use algebra (particularly Groebner bases) to solve Sudoku (see reference [1]). The principles are the same as in many “real world” applied algebra settings, and thanks to this paper, I started to get interested in using algebra to help better understand the world we live in. At the same time, the 2010 Mathematics Subject Classification added “applications of commutative algebra (e.g., to statistics, control theory, optimization, etc.)” (13P25) to its list.
Today’s blog post will walk through some of the commutative algebra and algebraic geometry involved in solving a 4 by 4 Sudoku puzzle (called “shidoku”). The interested reader can play along by downloading this Macaulay2 file, shidoku.m2, and running computations on the University of Melbourne’s Macaulay2 web server: https://www.unimelb-macaulay2.cloud.edu.au/#home.
First off, let’s review the rules of the game. In a 4 by 4 grid, there are sixteen cells arranged into four rows, four columns, and four 2 by 2 cages that each take up a corner of the square.
Each cell can be populated with the numbers 1, 2, 3, or 4 in such a way that each row, column, and cage contains all four numbers exactly once.
Those rules describe everything you need to play on an empty board, but a board usually already has some cells filled in. Give it a whirl:
When you’ve finished, play it again—but get rid of the blue 4 and see how many solutions you can find. When you finish that one, put the blue 4 back, add an additional 4 to the fourth row and second column, and see what happens.
Regarding the variations, I’m pretty sure the etiquette of puzzle creation insists that a “good” puzzle has a unique solution—but bear with me! I promise I’m breaking the rules of etiquette for a good reason! Anyway, given how quickly you can mentally solve these puzzles, the natural question is: why bother with algebra? Aside from the obvious (and somewhat cheeky) answer, why NOT?, I often find it useful to understand mathematical ideas by seeing them applied to a situation I know pretty well. In this case, it’s solving a small logic puzzle.
The first step in applying algebra to a puzzle is figuring out how to model the game and its rules. If we take a big-picture look at what we’re doing, we’re trying to find a very special point in sixteen-dimensional space that satisfies the rules of the game. So, we could imagine setting up a bunch of polynomials in sixteen variables, one variable for each cell, that describe the rules and the clues so that a (hopefully unique!) zero of all the polynomials represents a (hopefully unique!) solution to our puzzle.
In other words, we’re going to start with the polynomial ring $\mathbb{C}[x_{11}, x_{12}, \ldots, x_{44}]$, and we’re going to make an ideal $I$ that represents our puzzle. Solutions to our puzzle will belong to the set $V(I) = \{(a_{11},a_{12},\ldots,a_{44}) \, : \, a_{ij} \in \mathbb{C} \text{ and } f(a_{11},a_{12},\ldots,a_{44}) = 0 \, \forall \, f \in I\}$.
As a reminder—or a teaser—an ideal $I$ in a (commutative) ring $R$ is a nonempty set that is closed under addition, subtraction, and the “multiplicative absorption” property (less poetically called “scalar multiplication” by some): for all $a$ in $I$ and $r \in R$, $ar \in I$. Its partner concept from geometry is that of a variety in an affine (or projective) space, which is a set of points that are simultaneous solutions to a set of polynomial equations. For instance, the set of polynomial equations could be the polynomials belonging to an ideal in a polynomial ring, in which case we call the variety $V(I)$.
Here’s a little example in a more familiar setting. Consider the polynomials $x+y$ and $x^3 - y^2$. They generate an ideal, $J = \langle x+y, x^3-y^2 \rangle$, which includes all linear combinations (with “scalars” from $\mathbb{R}[x,y]$) of the generators. Setting both generators equal to zero, we plot a line and a cusp, and their simultaneous solutions are the points $(1,-1)$ and $(0,0)$ in $\mathbb{R}^2$. (Substituting $y = -x$ into the cusp gives $x^3 - x^2 = x^2(x-1)$, whose roots are $x = 0$ and $x = 1$.) That means $V(J) = \{(1,-1),(0,0)\}$.
You can read more about ideals and varieties, their interactions, and the algorithms that make them nice to work with in [2].
So, back to our games. Let’s build the ideal $I$ that represents the game. The clues on the board are easiest to encode because if we know that the cell in row one, column three must be 3, we must ensure that points in $V(I)$ satisfy $a_{13} = 3$. Since we insist (by definition of $V(I)$) that the polynomial $x_{13} - a_{13} = 0$ when we evaluate it at any point in $V(I)$, this means $x_{13} - 3$ is in our set of clues. You can work out the rest of the clues’ polynomials similarly.
The rules of the game are a little more complicated to model, but let’s muddle through. We know that we want each cell to be one of the numbers 1, 2, 3, or 4, which means for each cell, we have a polynomial $(x_{ij} - 1)(x_{ij} - 2)(x_{ij} - 3)(x_{ij} - 4)$. I suppose I could have tried fiddling with my coefficient ring or solution space to work modulo 4, but some of the algebra I want to use requires that we are working over an algebraically closed field. So I’ve chosen to work over the complex numbers (although if you’re playing with the code, you’ll notice I just picked a “big enough” finite field so that Macaulay2 wouldn’t gripe at me!).
We also know that we want each row (respectively, column; doubly respectively, cage) to contain the numbers 1, 2, 3, and 4 exactly once. In the case of the first row, this means solutions satisfy $x_{11} + x_{12} + x_{13} + x_{14} - 10$ and $x_{11}x_{12}x_{13}x_{14} - 24$. Alas, $1 + 1 + 4 + 4 = 10$ satisfies the addition rule if we leave off the multiplication rule, and $2\cdot 2\cdot 2 \cdot 3 = 24$ satisfies the multiplication rule if we leave off the addition rule. If we leave off the 1-through-4 rules for each cell, we get fun complex solutions like $a_{11} = 4+\sqrt{-8}, a_{12} = 4-\sqrt{-8}, a_{13} = 1, a_{14} = 1$; it’s an exercise for the reader to make sure these satisfy the addition and multiplication rules.
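Here is a quick Python check of these observations (a toy sketch, separate from the Macaulay2 computation), showing that neither row rule is sufficient on its own:

```python
def satisfies_row_rules(cells):
    """Check both polynomial row rules for a list of four cell values:
    the values must sum to 10 and multiply to 24."""
    total, prod = sum(cells), 1
    for c in cells:
        prod *= c
    return total == 10 and prod == 24

print(satisfies_row_rules([1, 2, 3, 4]))  # True: a genuine shidoku row
print(satisfies_row_rules([1, 1, 4, 4]))  # False: sum is 10 but product is 16
print(satisfies_row_rules([2, 2, 2, 3]))  # False: product is 24 but sum is 9
```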
This set completely describes our game board: two rules for each row, two for each column, two for each cage, and a rule for each cell. Putting all these polynomials together along with the clues, and then closing them up under addition, subtraction, and multiplication by elements of $R$, we find the ideal $I$ we’re interested in. And the points in $V(I)$ are precisely the solutions to our game.
The connection between our ideal $I$ and our variety $V(I)$ is via one of the best theorems out there: Hilbert’s Nullstellensatz! One form of the Nullstellensatz states that, over an algebraically closed field, there’s a one-to-one correspondence between maximal ideals and varieties consisting of a single point.
Let’s think about this in the two-variable example. The variety $V(J)$ can be expressed as a union of simpler varieties, $V(J) = \{(1,-1)\} \cup \{(0,0)\} = V(x-1,y+1) \cup V(x,y)$. In terms of the Nullstellensatz, the point $(1,-1)$ corresponds to the maximal ideal $\langle x-1,y+1\rangle$ and the point $(0,0)$ corresponds to the maximal ideal $\langle x,y \rangle$. We don’t quite have that $J$ is the intersection of these ideals, but it does satisfy $J \subseteq \langle x-1, y+1 \rangle \cap \langle x, y\rangle$. (Why not equality in general? If you think about varieties as sets of points, the points themselves don’t carry information about whether they’re single or double or triple roots of a polynomial. By working with ideals, we can factor polynomials and recover multiplicity information about roots that varieties aren’t fine enough to catch.)
Back to our game! If we’re looking for points that solve our game on the algebraic geometry side, we harness the power of the Nullstellensatz to look for all the maximal ideals that contain our ideal $I$ on the algebra side. And there’s an app for that! By app, I mean a technique called “primary decomposition” that can be done computationally. Emmy Noether played a big role in establishing the generality and usefulness of primary decompositions. She proved that in a Noetherian ring, every ideal can be decomposed as a finite intersection of primary ideals—giving us another kind of factorization. The curious reader will find primary decomposition covered in commutative algebra classics like [3] and [4].
If your Sudoku board doesn’t have a unique solution, calculating the primary decomposition of $I$ will let you find all the solutions—and if the board has incompatible clues, you’ll learn that through primary decomposition, too.
If you’re curious about the number of solutions to the empty board, you (by which I mean your computer or U. Melbourne’s) can calculate the primary decomposition of the game board with no clues to find that there are (spoiler!) 288 solutions to the blank board.
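You can also get this count without any commutative algebra, by brute-force backtracking, which makes a nice cross-check on the primary decomposition. Here is a sketch in Python:

```python
def count_shidoku(clues=None):
    """Count completions of a 4x4 shidoku board by backtracking.
    clues: dict {(row, col): value}; an empty dict means a blank board."""
    board = dict(clues or {})
    cells = [(r, c) for r in range(4) for c in range(4)]

    def ok(r, c, v):
        # v must not repeat in the row, column, or 2x2 cage of (r, c).
        for (rr, cc), val in board.items():
            if val == v and (
                rr == r or cc == c or (rr // 2, cc // 2) == (r // 2, c // 2)
            ):
                return False
        return True

    def solve(i):
        if i == len(cells):
            return 1
        r, c = cells[i]
        if (r, c) in board:
            return solve(i + 1)
        n = 0
        for v in (1, 2, 3, 4):
            if ok(r, c, v):
                board[(r, c)] = v
                n += solve(i + 1)
                del board[(r, c)]
        return n

    return solve(0)

print(count_shidoku())  # 288
```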
These tools—ideals, varieties, primary decomposition (among others)—form your starter kit for solving lots of real-world problems. In fact, these tools are so powerful that the 2020 Mathematics Subject Classification newly includes “statistics on algebraic and topological structures” covering algebraic statistics (62R01), statistical aspects of big data/data science (62R07), and topological data analysis (62R40), as well as new classes in 14Q for computational aspects of algebraic geometry. Applied algebra is a bustling and booming research area! For a charming introduction to algebraic statistics applied to computational biology, pick up [5].
References:
[1] Arnold, Elizabeth; Lucas, Stephen; Taalman, Laura. Groebner Basis Representations of Sudoku. The College Mathematics Journal, Vol. 41, No. 2, March, 2010.
[2] Cox, David A.; Little, John; O’Shea, Donald. Ideals, Varieties, and Algorithms: an introduction to computational algebraic geometry and commutative algebra. Fourth edition. Undergraduate Texts in Mathematics. Springer, 2015.
[3] Atiyah, M. F.; Macdonald, I. G. Introduction to Commutative Algebra. Addison-Wesley Publishing Co., Reading, Mass.-London-Don Mills, Ont. 1969.
[4] Matsumura, Hideyuki. Commutative Ring Theory. Translated from the Japanese by M. Reid. Second edition. Cambridge Studies in Advanced Mathematics, 8. Cambridge University Press, Cambridge, 1989.
[5] Algebraic Statistics for Computational Biology. Edited by Lior Pachter and Bernd Sturmfels. Cambridge University Press, New York, 2005.
–
Author’s Note: I wrote the first draft of this blog post before the US Supreme Court decision that gutted abortion rights (kicking off a season of decisions and opinions that also upended climate regulation, tribal law, and more). I hope readers found Sara Stoudt’s column last month timely! I also wish all readers of this blog the privilege and comfort of finding solace in mathematics, pure or applied, during turbulent times.
If your eyebrows quirked a bit at seeing “abortion” in an AMS feature column, I urge you to read Karen Saxe’s excellent piece over at the MAA’s Math Values Master Blog, Mathematicians’ Case for Preserving the Right to Abortion. As a mathematician and soon-to-be first-time mom, I have a redoubled commitment to (and new appreciation for) a person’s right to choose when and if to carry a pregnancy to term.
Sara Stoudt
Bucknell University
Can data help support or refute claims of wrongdoing? Take a case about claimed hiring discrimination. What information would you want to know about a company’s hiring practices before you made a decision? Maybe you would want to compare the demographics of the application pool to the demographics of those actually employed by the company to see if there are any discrepancies. If you found any, you might investigate further to see if those discrepancies are too large to have happened just by chance.
Statistical ideas abound in Kimberlé Crenshaw’s 1989 paper discussing antidiscrimination court cases. This paper also first defined the term “intersectionality.” Intersectionality provides a framework to explain that different elements of a person’s identity combine to create privilege or pathways to discrimination. If we try to think about this statistically, it means that for a response variable of “how people treat you” there might be an interaction effect of, for example, race and sex, as well as an additive effect of those characteristics individually. For example, a Black man may be treated differently than a Black woman. Although this term is often used in social discourse, did you know it has its origins in a legal setting?
Kimberlé Crenshaw in 2018. Photo by the Heinrich Böll Foundation, CC-BY SA 2.0.
We will look at the first two cases discussed in Crenshaw’s work, and then a case brought up by Ajele and McGill in a further study of intersectionality in the law, to make some connections to statistical ideas. Full disclosure before moving forward: I do not have any legal training, just an interest in how statistics is used in the courtroom, so this is my interpretation of these court summaries. If you have extra insight to share about the legal process, please reach out!
While this post was in press, the United States Supreme Court made a decision to overturn Roe v. Wade. We acknowledge that reading about court case implications is particularly heavy at this time. We frame some of the statistical questions that these cases bring up as intellectual exercises to emphasize the way that small, seemingly abstract decisions can have huge impacts on millions of people’s lives.
In this case, a group of Black women alleged that General Motors’ system of using seniority as a factor in determining who was laid off during a recession continued the effects of past discrimination against Black women. Importantly, the court would not allow the class of “Black woman” to be protected but required the plaintiffs to argue a sex discrimination case or a race discrimination case, but not both. As Arehart points out, the use of “or” in the Civil Rights Act (protects against discrimination based on race, color, religion, sex, or nationality) has led courts to interpret this as a plaintiff needing to choose one characteristic to focus on in their case.
There are some details that led the plaintiffs in this case to choose to pursue a sex discrimination claim. It was revealed during the case that General Motors did not hire Black women before the Civil Rights Act of 1964, so when everyone hired after 1970 was laid off, that meant that Black women were more likely to have less seniority. However, because white women were hired before 1964, there was a large enough pool of women who were not laid off that the court decided there was not enough evidence to support sex discrimination in this policy.
What does this have to do with statistics? We can formulate this situation as an example of Simpson’s Paradox. When employee outcomes were examined overall, there was no evidence of discrimination between men and women. However, if employee outcomes were further broken down by race, there would have been a very clear discrepancy between the Black women and the white women.
To look at it visually, there is a very narrow pathway towards remaining at the company for Black women (in purple) while there seems to be a reasonable pathway towards remaining at the company for all women (in red).
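To make the paradox concrete, here is a minimal sketch in Python. The retention counts are invented for illustration, not taken from the case: aggregated over race, women and men are retained at nearly the same rate, while Black women fare far worse than white women.

```python
# Hypothetical layoff data illustrating Simpson's Paradox.
# Each entry: (number retained, group size). Invented numbers, not case data.
groups = {
    ("Black", "woman"): (2, 20),     # hired late, so little seniority
    ("white", "woman"): (75, 80),
    ("Black", "man"):   (40, 50),
    ("white", "man"):   (90, 120),
}

def retention_rate(keys):
    kept = sum(groups[k][0] for k in keys)
    total = sum(groups[k][1] for k in keys)
    return kept / total

women = [k for k in groups if k[1] == "woman"]
men = [k for k in groups if k[1] == "man"]

# Aggregated over race, women and men look nearly identical...
print(retention_rate(women), retention_rate(men))

# ...but broken down by race, the disparity among women is stark.
print(retention_rate([("Black", "woman")]),
      retention_rate([("white", "woman")]))
```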
What if a plaintiff was allowed to combine identities in a claim? In the second case, Moore, the plaintiff, a Black woman, alleged discrimination based on race and sex. The court then determined that because the claim was made as a Black woman, the plaintiff could not represent all Black workers nor all female workers. This limited the pool of workers that could be used in the statistics supporting the discrimination claim. The plaintiff could not use data for all female workers to make an argument, nor could they use data for all Black workers. Instead, they were left with the small number of Black women as their data pool with which to make an argument.
By limiting the pool of people eligible to be included in an analysis, the power to detect a real discrimination effect decreases. Consider a null hypothesis that the company is not discriminating based on race and sex. The power to reject that hypothesis when it is actually false is related to the sample size of each group, making a small group size a limiting factor. The court’s decision effectively raised the probability of a false negative, i.e., falsely concluding that there was no discrimination when there actually was.
If we consider a simplified framing of this question and determine the difference between the proportion of Black women promoted and the proportion of non-Black women promoted, we can use the pwr R package to investigate the power to detect a difference in proportions with unequal sample sizes. Take this investigation by Seongyong Park. They find that if the proportions of those who were fired in two groups are 0.15 and 0.30 (one group is twice as likely to be fired as the other) and both groups have an equal number of people in them, the power to detect the difference is about 0.86. However, if one group is 10 times as large as the other, the power drops to 0.69. Go ahead and use this code to investigate other situations! What would it take for the power to drop to 0.5?
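Park’s code uses the pwr package in R; for readers without R, here is a rough Python sketch of the same normal-approximation calculation, using Cohen’s effect size $h$ for two proportions. The group sizes below are hypothetical choices for illustration, not Park’s exact settings.

```python
from math import asin, sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test,
    via Cohen's effect size h and the effective sample size."""
    h = abs(2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)))
    n_eff = n1 * n2 / (n1 + n2)               # reduces to n/2 for equal groups
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(h * sqrt(n_eff) - z_alpha)

# Balanced groups of 140 (hypothetical sizes): power is around 0.86...
print(power_two_proportions(0.15, 0.30, 140, 140))

# ...but the same 280 people split very unevenly lose much of that power.
print(power_two_proportions(0.15, 0.30, 255, 25))
```

The unbalanced split keeps the total sample size fixed; the loss of power comes entirely from the small minority group, which is exactly the limiting factor described above.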
Is there another way for a plaintiff to combine identities in a claim and still face a statistical challenge? In the case discussed by Ajele and McGill, a Black woman alleged that she was discriminated against due to her race and sex. The court did evaluate both race and sex claims, but it did so separately. The court found no evidence of race discrimination alone nor evidence of sex discrimination alone, but the interaction was not investigated.
What makes this case particularly interesting? I picked this case because of a footnote about the statistician expert witness. From the case overview:
Dr. Jane Harworth, Ph.D., an expert in statistical analysis, examined applicant flow and hiring data for 1975-1983 and performed a binomial distribution analysis. When the raw data involved small pools, she utilized the Fisher’s Exact Test, a more precise version of the Student T Test. Dr. Harworth testified that there is no statistical support for the allegations of the existence of a non-neutral policy or of a pattern or practice of discrimination against blacks or females. Her analysis showed that the actual numbers of black or female hirees were within the range of two standard deviations. She further observed that the success rate of blacks and females exceeded whites and males, respectively.
This is an interesting example of how statistics are explained to the court. Note the translation of Fisher’s Exact Test as “a more precise version of the Student T test” and the decision to focus on plus or minus two standard deviations. However, there is nothing technically preventing a Fisher’s Exact Test from being used to compare Black women to everyone else.
Time for another exercise for the reader! I don’t love the binary distinctions in these court case scenarios, so let’s pick a different set of categories to work with. Consider a population of 100 people who can prefer summer or winter and who can prefer vanilla or chocolate. I’m considering this information when determining who to be friends with. Can you design a situation where it does not look like I discriminate based on season preference nor flavor preference, yet it does look like I prefer to befriend a particular combination of season and flavor preference? Here’s a hint: what if the recession happened a little earlier in the General Motors example, such that anyone hired after 1964 was laid off? It might be useful to sketch a 2×2 table.
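If you want to check a candidate design mechanically, here is one possible configuration (a pure interaction, different from the hint’s setup, with invented numbers: 25 people per combination). Neither season nor flavor alone shows a preference, but particular combinations do.

```python
# Each entry: (number befriended, group size). Invented numbers.
groups = {
    ("summer", "vanilla"):   (20, 25),
    ("summer", "chocolate"): (5, 25),
    ("winter", "vanilla"):   (5, 25),
    ("winter", "chocolate"): (20, 25),
}

def friend_rate(pred):
    f = sum(v[0] for k, v in groups.items() if pred(k))
    n = sum(v[1] for k, v in groups.items() if pred(k))
    return f / n

# Marginally, season and flavor each look perfectly even (0.5 everywhere)...
print(friend_rate(lambda k: k[0] == "summer"),
      friend_rate(lambda k: k[0] == "winter"))
print(friend_rate(lambda k: k[1] == "vanilla"),
      friend_rate(lambda k: k[1] == "chocolate"))

# ...yet I clearly favor two of the four combinations.
print(friend_rate(lambda k: k == ("summer", "vanilla")))
```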
Decisions made about how to measure discrimination involve statistical decisions behind the scenes. In fact, Crenshaw points this out in a footnote:
“A central issue in a disparate impact case is whether the impact proved is statistically significant. A related issue is how the protected group is defined. In many cases a Black female plaintiff would prefer to use statistics which include white women and/or Black men to indicate the policy in question does in fact disparately affect the protected class. If, as in Moore, the plaintiff may use only statistics involving Black women, there may not be enough Black women employees to create a statistically significant sample.”
Thinking about statistical decisions in context as well as the implications of court precedent or practice in terms of statistical concepts can help us both refine our practice of statistics and consider the consequences of our work. Real people are impacted by data-driven decisions; we must recognize and bear the responsibility of that.
References and Resources
William Casselman
University of British Columbia
The game Wordle, which is found currently on the New York Times official Wordle site, can be played by anybody with internet access. It has become extremely popular in recent times, particularly among mathematicians and programmers. One programmer comments facetiously, “At the current rate, I estimate that by the end of 2022 ninety-nine percent of all new software releases will be Wordle clones.”
The point for us is that it raises intriguing questions about the nature of information, and offers good motivation for understanding the corresponding mathematical theory introduced by Claude Shannon in the 1940s.
When you visit the official Wordle web site, you will be faced with an empty grid that looks like this:
Every day a secret new word of five letters is chosen by the site, and you are supposed to guess what it is by typing candidates into successive rows, responding to hints offered by the site regarding them. For example, today (May 22, 2022) I began with the word slate, and the site colored it like this:
What this means is that the secret word of the day has no letters “S”, “L”, “A”, or “T”, but does have an “E” in some location other than the last. I next entered the word homer and got in response:
This means that the secret word does not contain at all either an “H” or an “R”; contains an “O” in the second place and an “E” in the fourth place; and contains an “M” somewhere other than the third place. With my next choice I was lucky:
This is a pretty good session—the average game is predicted to require about three and a half tries. So three is better than average, and most of the time you should get the answer in at most four.
I must now tell you the precise rules for coloring proposed answers. If neither the secret word nor your proposal has repeated letters, the rules are very simple: (1) a square is colored green if your letter and that of the secret word are the same at that location; (2) a square is colored yellow if the secret word does contain the corresponding letter somewhere, but not at this particular location; (3) the square is colored gray if the corresponding letter in your guess does not appear at all in the secret word.
But if there are repeated letters, there is some potential ambiguity that has to be dealt with. First of all, all exact coincidences are colored green, and as this is done they are removed from your guess and, internally, from the secret word. Under consideration there are now two words of length possibly less than five, for which there are no exact coincidences. The remaining guess is now scanned left to right. A location is colored yellow if its letter occurs somewhere in the reduced secret word, and that occurrence is then removed from the secret word; a location is colored gray if its letter now occurs nowhere in the secret word.
For example, if the secret word is decoy and your guess is odder, since there is only one “D” in decoy it will be colored
Scanning left to right is an arbitrary choice, and scanning in the opposite direction would give a different coloring. One consequence of this rule and some other elementary reasoning is that some colorings, such as
can never occur.
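The two-pass rule just described translates directly into a short program. A sketch in Python (the names are my own):

```python
def color(guess, secret):
    """Color a Wordle guess: greens first, then a left-to-right scan
    for yellows, consuming secret letters as they are matched."""
    result = ["gray"] * len(guess)
    secret_left = list(secret)
    # Pass 1: exact coincidences are green and removed from consideration.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            result[i] = "green"
            secret_left[i] = None
    # Pass 2: scan the remaining guess left to right for yellows.
    for i, g in enumerate(guess):
        if result[i] != "green" and g in secret_left:
            result[i] = "yellow"
            secret_left[secret_left.index(g)] = None
    return result

# The decoy/odder example: only the first "d" is colored yellow.
print(color("odder", "decoy"))
```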
Incidentally, the NYT Wordle site contains a convenient keyboard displayed underneath the grid, which illustrates graphically how the coloring is to be interpreted.
There are a few more somewhat arbitrary things you should know about. The principal one is that the Wordle grid will accept only a proper subset of English words of five letters, and many of the ones it does accept will probably be unfamiliar to you, such as aahed, aalii, and aarti, as well as zygal, zygon, and zymic. If you submit a word that will not be accepted, the Wordle display will complain by shuddering. The current list of words that will not cause a shudder has 12,957 words on it, and is itself divided into two subsets—the list of the approximately 2,300 words which are in fact possible answers (i.e., made up of words which should be familiar) and the rest (compiled from a huge list of all possible words in English text), which can help you to find answers even though they are not themselves among the possible answers. (You might see on the Internet mentions of 12,972 acceptable submissions. This was the original number. The game changed a bit when the NYT took it over, eliminating a few of the accepted words that might offend some people.)
The Internet is full of advice on how best to play, but I’ll say very little about that. What is interesting to a mathematician is that many of the proposed strategies use the notions introduced by Claude Shannon to solve problems of communication and elucidate the notion of redundancy.
In fact, redundancy in English is what Wordle is all about. What do I mean by redundancy? Most succinctly, that not all sequences of five letters of the alphabet are English words. (One constraint is that the ones that do occur must be pronounceable.) In other words, certain sequences of letters cannot appear. For example, in standard English (although not in Wordle’s list of acceptable words, alas) a “q” is always followed by a “u”, so the pairs “qa”, “qb”, … are forbidden. The “u” following a “q” is therefore redundant. Another way of saying this is that “u” after “q” conveys no new information—it will not help to distinguish one word from another.
For an example more relevant to Wordle, suppose you find yourself facing the array
Just as “u” always comes after “q”, here the last letter must be either “d”, “e”, or “p”, making possible answers shard, share, or sharp. So here the final letter carries information—it picks out one of three possibilities—but not a lot, since the possible options are very limited.
These two examples illustrate the general fact: information is the resolution of uncertainty. The more uncertainty there is, the more information is conveyed by choosing one of the options. The mathematical theory of information initiated by Shannon makes this quantitative.
To understand the relation between information and mathematics, I’ll look now at a simpler guessing game, a variant of ‘twenty questions’. In this game, somebody chooses a random number $n$ such that $0 \le n \lt 2^{3} = 8$ (i.e., in the half-open range $[0, 8)$), and asks you to guess what it is. Any question you ask will be answered with only a ‘yes’ or ‘no’. How many questions might you have to ask?
The simplest strategy is to start with “Is it 0?” and continue with possibly 7 more. But this is certainly unnecessarily inefficient. The best way to phrase the optimal procedure is to express numbers in the range $[0, 8)$ in base $2$ notation. Thus $5$ is expressed as $101$, since $$ 5 = 1 + 0 \cdot 2 + 1 \cdot 4 . $$ The coefficients in such an expression are called bits. With this in mind, you have only to ask three questions: is the $i$-th bit equal to $0$? for $i = 0$, $1$, and $2$. Whether the answer is ‘yes’ or ‘no’, you gain one bit of information.
The drawback to this procedure is that you will never get by with fewer than three questions, whereas with a naive strategy you might be lucky and get it in one. Sure, but you are more likely to be unlucky! In the naive scheme the average number of guesses is $(1 + 2 + \cdots + 7 + 8)/8 = 4.5$, whereas in the other it is $3$. When the number of choices is $2^{3}$ the difference in the average number of questions asked is small, but if $2^{3}$ is replaced by $2^{20}$ the difference is huge.
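A few lines of Python (my own sketch) confirm this arithmetic: the bit-by-bit strategy always takes exactly three questions, while asking about each number in turn averages $4.5$.

```python
def guess_by_bits(secret, num_bits=3):
    """Recover a number in [0, 2**num_bits) with one yes/no question per bit."""
    value, questions = 0, 0
    for i in range(num_bits):
        questions += 1               # "Is the i-th bit equal to 1?"
        if (secret >> i) & 1:
            value |= 1 << i
    return value, questions

# Bit-by-bit: always exactly 3 questions, and the number is recovered.
assert all(guess_by_bits(n) == (n, 3) for n in range(8))

# Naive: ask "Is it 0?", "Is it 1?", ... counting even the final "yes".
naive_counts = [n + 1 for n in range(8)]
print(sum(naive_counts) / len(naive_counts))
```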
Very generally, a game with $2^{n}$ possible items may be played by asking only $n$ questions with ‘yes’ or ‘no’ answers and receiving in return $n$ bits of information. Another way to put this: a string of zeroes and ones of length $n$, chosen randomly, conveys $n$ bits of information. As a simple variant of this, suppose each item of the string is a number in the range $[0, 2^{k})$. Then a string of such numbers of length $n$ is equivalent to one of $nk$ bits, and hence conveys $nk$ bits of information.
But now Shannon posed the question: suppose the bits are not chosen randomly? Then less information will be conveyed, in general. Can we measure how much?
For example, suppose the string is of length $2n$, and may therefore be considered a string of $n$ numbers in the range $[0, 4)$. Suppose further that each digit $k$ is constrained to the range $[0, 3)$ (i.e. with $0 \le k \le 2$), so that in effect the string is the expression of a number in base $3$. I’ll call it a $3$-string. Instead of $4^{n}$ possible strings, there are only $3^{n}$, so that fewer than $2n$ bits of information can be conveyed. Assuming that the individual ‘digits’ are chosen randomly, how much information is conveyed by such a $3$-string?
The most fruitful answer tells what happens as $n$ becomes larger and larger. Long strings of integers in the range $[0, 3)$ can be compressed, and more efficiently compressed as $n$ grows. A single integer $0 \le k \le 2$ requires two bits to be expressed, and a pair of them, that is, a number in the range $[0, 9)$, requires $4$ bits, or twice as many. But $3^{3} = 27 \lt 2^{5} = 32$, so a string of $3$ requires only $5$ bits instead of $6$. In general, if $$ 2^{m-1} \lt 3^{n} \lt 2^{m} $$ then more than $m-1$ bits are required to specify every $3$-string of length $n$, but $m$ bits suffice. We can find a formula for $m$, in fact, since extracting $n$-th roots gives us $$ 2^{m/n - 1/n} \lt 3 \lt 2^{m/n} . $$ Since we can write $3 = 2^{\log_{2} 3}$, this is equivalent to $$ { m\over n } - { 1 \over n } \lt \log_{2} 3 \lt { m \over n } , $$ so that for large $n$ we see that $m$ is approximately $n \log_{2} 3$. What Shannon says is that a random $3$-string of length $n$ carries $n \log_{2} 3$ bits of information. Since there are $3^{n}$ such strings, if these are chosen randomly each one has probability $p = 1/3^{n}$ of occurring. Shannon’s way of putting this becomes in general the recipe:
an event of probability $p$ conveys $$ \log_{2} { 1 \over p } $$ bits of information. Note that since $p \le 1$, we know that $\log_{2} \frac{1}{p} \ge 0$, so this is always non-negative. When $p = 1$ the event always takes place. There are no surprises. Sure enough, $\log_{2} 1 = 0$, so Shannon’s rule says that no information is conveyed. The rarer an event, the more surprising it is, and the more information it conveys: “Dog bites man” is nothing new, but “Man bites dog” is a different story.
Let’s look at one simple example. Suppose we are again playing ‘three questions’. You feel lucky and can’t resist blurting out, “Is the number 6?” If the answer is ‘yes’, you have acquired, as we have already seen, three bits of information. But how much information does a ‘no’ give you? All we can say immediately is that it isn’t much, because it has only reduced the number of possibilities from $8$ to $7$. Now, if we assume the number was chosen at random, the probability of getting a ‘yes’ here is $p = 1/8$, so the probability of getting a ‘no’ is $1 - p = 7/8$. Shannon assigns to it $\log_{2} 8/7 \sim 0.193$ bits of information.
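Checking these numbers, and the average information one such question yields (the secret number is assumed uniform on the eight possibilities):

```python
from math import log2

p_yes = 1 / 8
info_yes = log2(1 / p_yes)        # 3 bits: a 'yes' pins the number down
info_no = log2(1 / (1 - p_yes))   # log2(8/7), about 0.193 bits

# The average information per question is well under one bit,
# which is why "Is it 6?" is usually a poor question.
expected = p_yes * info_yes + (1 - p_yes) * info_no
print(info_yes, round(info_no, 3), round(expected, 3))
```

This average is exactly the entropy of the yes/no answer, anticipating the definition that follows.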
For us, a random event is one with a finite number of outcomes. Suppose the $i$-th outcome has probability $p_{i}$. If the event takes place a large number of times, what is the average amount of information seen? The $i$-th outcome occurs with probability $p_{i}$ and conveys $\log_{2} (1/p_{i})$ bits of information, so the expected average is
$$ \sum_{i} p_{i} \log_{2} { 1 \over p_{i} } . $$
Even the case $p_{i} =0$ is allowed, since
$$ \lim_{p\rightarrow 0} p \cdot \log_{2} { 1 \over p } = 0 . $$
This average is what Shannon calls the entropy of the event, measured in bits. If the event has two outcomes, with probabilities $p$ and $1-p$, the entropy is
$$ (1-p)\log_{2} { 1 \over 1-p } + p \log_{2} { 1 \over p } . $$
Its graph looks like this:
If $p=0$ or $p=1$ there is no uncertainty involved, and no information conveyed. The maximum possible entropy occurs when $p=1-p = 1/2$. This is when the maximum uncertainty is present, and in general entropy is a measure of overall uncertainty.
This last remark remains true when any number of outcomes are involved:
That is to say, whenever $(p_{i})$ is any sequence of $n$ numbers $p_{i} \ge 0$ with $\sum p_{i} = 1$ then
$$ \sum_{i} p_{i} \log_{2} { 1 \over p_{i} } \le \log_{2} n . $$
This is immediately evident when $n=2$, since the graph of $y = \log x$ is concave downwards.
In general it can then be derived by mathematical induction on $n$.
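These properties are easy to verify numerically. A small sketch (the function name is mine):

```python
from math import log2

def entropy(ps):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(p * log2(1 / p) for p in ps if p > 0)

# The binary entropy is largest at p = 1/2 and vanishes at the endpoints.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.1, 0.9]))   # about 0.469
print(entropy([0.0, 1.0]))   # 0: no uncertainty

# In general entropy is at most log2(n), attained by the uniform distribution.
n = 5
print(entropy([1 / n] * n), log2(n))
```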
What does this have to do with Wordle?
When you enter a candidate word into the Wordle display, the game replies by coloring your entry with one of three colors—it is giving you information about your proposal. But Wordle differs from twenty questions in that the reply is not a single ‘yes’ or ‘no’ but one of many possible colorings, and you can use this richer information to make your next guess. How much information is the game giving you? How can you best use this information to make your next submission?
The secret word is effectively a choice of one of the $2,309$ valid answers. Each of these presumably has equal probability. But the coloring reduces this number considerably—the true answer has to be one of those compatible with the coloring. For example, a few days ago I chose slate as my initial guess, and got a coloring
I can scan through all possible answers and check which ones would give me this coloring. It happens that my choice was very bad, since there were $164$ words that do this. We can see exactly how bad this is by making a graph like the following:
This graph was constructed in the following way: I scanned through all of the 2,309 possible answers and computed the coloring each would give. I used this to list, for each coloring, all of the compatible answers. I made a list of the sizes for each coloring, and then sorted the list by magnitude. For example, the largest count was 221, corresponding to all grays. But the second highest was the one I got, at 164 (marked on the graph). As you can see, there was a very large reservoir of things I might have hoped for.
Could I have made a better choice of first guess? As should be clear from the above, each possible first guess gives me a graph like the one shown. What I would like to see is a graph with a somewhat uniform height, for which the likelihood of narrowing my choices down is large. I display below a few graphs for other first guesses.
It turns out that it is impossible to get a uniform height, but some choices do much better than others. The point is that the uniformity is maximized if the entropy of a certain probability distribution is maximized. Every choice of a starting word assigns a coloring to every possible answer. These colorings partition the set of possible answers, and if $n_{i}$ answers give rise to the same coloring $i$ then $\sum_{i} n_{i} = 2309 = n$. I set $p_{i} = n_{i}/n$. It is the entropy of this probability distribution that is displayed in the graphs above, and you can see that a choice is better if the entropy is relatively large.
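Here is a self-contained sketch of this entropy computation on a toy answer list (the real game uses the full list of 2,309 answers; the coloring function restates the rules from earlier in the column):

```python
from collections import Counter
from math import log2

def color(guess, secret):
    """Wordle coloring: greens first, then a left-to-right scan for yellows."""
    result = ["."] * len(guess)
    left = list(secret)
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            result[i] = "G"
            left[i] = None
    for i, g in enumerate(guess):
        if result[i] != "G" and g in left:
            result[i] = "Y"
            left[left.index(g)] = None
    return "".join(result)

def guess_entropy(guess, answers):
    """Entropy of the partition of the possible answers by coloring."""
    counts = Counter(color(guess, a) for a in answers)
    n = len(answers)
    return sum((c / n) * log2(n / c) for c in counts.values())

# A toy answer list; larger entropy means the guess splits the list better.
answers = ["shard", "share", "sharp", "crane", "slate", "blimp", "vouch"]
for g in ["slate", "crane", "vouch"]:
    print(g, round(guess_entropy(g, answers), 3))
```

A guess whose letters appear in none of the answers gives every answer the same all-gray coloring, a one-part partition, and hence entropy zero: it conveys no information at all.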
The idea of using entropy to find optimal ways of playing Wordle has proliferated on the Internet. Many have used entropy to make a best first guess—among these are crane, which is the one used by the official NYT advisors, and slate. Very few of these add something new, and most seem to be taking their ideas from the video of Grant Sanderson that I mention below. (His seems to be the original entropy-based investigation.)
I don’t want to add to this literature, but I do want to discuss the question of best second choices, about which less is said. It is a relatively simple calculation to list all possible answers that will color a first guess in a given way. For example, as I have already mentioned, my choice of slate above came up with 164 possibilities. This is a severe reduction of the original 2,309. But one of the quirks of Wordle is that choosing your second guess from this list might not be best. For example, if you get the coloring
you know that the secret word must be sharp, shard, or share. The obvious thing to then do is try these, one by one. However, that might take three tries, while a careful choice of some totally different word—for example pedal—will give it to you in two tries by eliminating two of the three possibilities.
Absolutely optimal strategies for Wordle are now known and posted on the Internet. But these miss the real point—I’d like to see more theoretical discussion of exactly what Wordle’s colorings tell you.
I have ‘borrowed’ several ideas from this well known and impressive YouTube video.
Ursula Whitcher
Mathematical Reviews (AMS)
Mathematicians and physicists both love symmetry, but depending on who you’re talking to, the implications of a simple statement such as “This theory admits a symmetry” can be very different. In this column, I’ll describe one attempt at interdisciplinary translation using an entire rainbow of colors. Our subject is Adinkras. Here is an example:
Using Adinkras, we can move from physics questions to algebra questions to combinatorics questions. I’ll sketch ways to think about Adinkras from all three points of view, on the way to providing a formal definition. But to start with, let me introduce you to the inventors of mathematical Adinkras—and the symbols’ namesake.
The physicists Michael Faux and Sylvester James Gates, Jr.—Jim Gates, for short—first described Adinkras in a 2004 paper.
In their introduction, Faux and Gates wrote:
The use of symbols to connote ideas which defy simple verbalization is perhaps one of the oldest of human traditions. The Asante people of West Africa have long been accustomed to using simple yet elegant motifs known as Adinkra symbols, to serve just this purpose. With a nod to this tradition, we christen our graphical symbols as “Adinkras.”
(Some traditions say the name “Adinkra” derives from the name of an early nineteenth-century ruler of Gyaman, a West African kingdom founded by the Bono people.)
As an example, here are two Adinkra symbols used in early twentieth-century Ghana: Nkyimkyim, the twisted pattern, and Aya, the fern, which can also mean “I am not afraid of you.”
Faux and Gates were motivated by the desire to understand a physical concept known as supersymmetry. This concept arose as a theoretical attempt to organize the vast quantities of information involved in particle physics.
In the Standard Model of particle physics, the fundamental components of the universe are the following types of particles:
Some of these particles make up familiar types of energy and matter. For example, up and down quarks combine to make protons and neutrons, and thus atomic nuclei. Photons are the particles that make up beams of light. Other components of the Standard Model are harder to detect: the IceCube detector at the South Pole hunts for faint flashes of light due to rare neutrino interactions, and confirming the Higgs boson’s existence entailed detailed analysis of particle decay within the particle accelerator known as the Large Hadron Collider.
The fundamental particles in the Standard Model can be divided into two types, bosons and fermions. Bosons transmit fundamental forces—for example, photons carry electromagnetic energy—while we’ve already noted that fermions can combine to make up matter. Other distinctions are more technical. Every boson has an intrinsic integer amount of angular momentum—spin—when measured in units of the reduced Planck constant $\hbar$, while fermions have half-integer spin ($\frac{1}{2}$, $\frac{3}{2}$, etc.) Furthermore, bosons can cluster: identical bosons can share the same quantum state. In contrast, identical fermions must occupy distinct quantum states, a principle sometimes known as the Pauli exclusion principle.
Supersymmetry is a theoretical physical symmetry that exchanges bosons and fermions. In particular, in a supersymmetric theory, every type of boson must have a fermion partner, and vice versa.
Despite years of searching, experimental physicists have not found evidence of superpartners for any of the particles in the Standard Model. That doesn’t mean understanding supersymmetry isn’t useful! After all, a real-world tennis ball’s path will deviate from a perfect parabola due to air resistance, but parabolas are not meaningless. Indeed, the equations underlying quantum supersymmetry have proved useful in understanding optical and condensed-matter systems at larger scales.
Studying supersymmetry in physically realistic situations requires a tremendous amount of physical and mathematical sophistication. We’re going to simplify as much as possible: all the way down to zero spatial dimensions! In other words, from now on we’ll assume that all of our bosons and fermions live on a single point.
In physics at the scales where human beings usually operate, once you’ve restricted to a single point, nothing much can happen. But in quantum physics, particles can have intrinsic properties without obviously moving. For example, an electron can act like a single point with angular momentum, but a single point doesn’t have anywhere to spin. We’ll represent the intrinsic information in our model mathematically by writing our bosons and fermions as functions of a time parameter $t$. In physicist’s language, this is a one-dimensional theory: that one dimension is time.
Let’s suppose we have $m$ fundamental bosons, $\{\phi_1, \dots, \phi_m\}$, and $m$ fundamental fermions, $\{\psi_1, \dots, \psi_m\}$. We also assume that we have $N$ supersymmetry operators $\{Q_1, \dots, Q_N\}$. Each supersymmetry operator transforms bosons to fermions, and vice versa. We write this in a notation similar to function notation. For example, $Q_1 \phi_2$ should be a fermion (perhaps one of our fundamental fermions, or perhaps a transformation of a fundamental fermion, such as $-\psi_3$).
In quantum physics, the order in which you do things matters: for example, the outcome of two measurements may depend on which one you try first. That means that $Q_I Q_J$ and $Q_J Q_I$ could be different. We can measure this using the anticommutator, which we write with curly brackets:
\[\{A, B\} = AB + BA.\]
With this notation in hand, we’re ready to describe the rules that the supersymmetry operators have to follow. In order to do so, we’ll have to use a little bit of calculus notation, namely, the time derivative $\frac{d}{dt}$. If this notation is unfamiliar to you, don’t worry. You can think of $\frac{d}{dt}$ as telling you something about the way a boson or fermion changes over a small period of time. But we’ll replace all our derivatives by diagrams when we start building Adinkras!
Here are the two rules for supersymmetry operators: first, each operator squares to a time derivative, $Q_I Q_I = i \frac{d}{dt}$; second, distinct operators anticommute, $Q_I Q_J = -Q_J Q_I$ for $I \neq J$. Both rules are captured by the single anticommutator equation $\{Q_I, Q_J\} = 2 i \, \delta_{IJ} \frac{d}{dt}$, where $\delta_{IJ}$ is $1$ when $I = J$ and $0$ otherwise.
We’re interested in understanding how many genuinely different solutions to the supersymmetry operator rules we can find. Even for small examples, there may be multiple possibilities. Each one is called a representation of the supersymmetry algebra. For example, suppose we have one fundamental boson $\phi_1$, one fundamental fermion $\psi_1$, and one supersymmetry operator $Q_1$. We need to specify $Q_1 \phi_1$ and $Q_1 \psi_1$. One possibility is:
$$ Q_1 \phi_1 = \psi_1 $$
$$ Q_1 \psi_1 = i \frac{d}{dt} \phi_1.$$
Another possibility is:
$$ Q_1 \phi_1 = \frac{d}{dt} \psi_1 $$
$$ Q_1 \psi_1 = i \phi_1.$$
You can check that in either case, $Q_1 Q_1 \phi_1 = i \frac{d}{dt} \phi_1$ and $Q_1 Q_1 \psi_1 = i \frac{d}{dt} \psi_1$.
To organize our possible supersymmetry operator representations, we use graphs. These are graphs in the sense of graph theory, with vertices (dots) connected by edges (line segments). We use $m$ open vertices and $m$ closed vertices. Each open vertex represents a fundamental boson $\phi_j$ together with its derivatives and constant multiples; each closed vertex represents a fundamental fermion together with its derivatives and multiples. We pick $N$ possible edge colors, one for each of the $N$ supersymmetry operators. We think of acting on a boson or fermion by a supersymmetry operator as traveling along the edge of the appropriate color. Since we can act on any boson or fermion by any of the operators, we need an edge of each possible color attached to every boson vertex and every fermion vertex. Here is an example of the resulting structure.
This example has a property we haven’t discussed yet, corresponding to the supersymmetry algebra rule $Q_I Q_J = -Q_J Q_I$. Applying two different supersymmetry operators in different orders yields the same result, up to a sign change. That means that if we pick a vertex and a pair of edge colors, it shouldn’t matter which edge we follow first: for example, blue then green should have the same result as green then blue.
We can summarize the requirements we’ve described so far using the notion of a chromotopology. A chromotopology is a finite simple graph (that is, it has a finite number of vertices and edges, and none of the edges starts and ends in the same place) with the following properties: the graph is bipartite, with the open (boson) vertices and the closed (fermion) vertices forming the two parts; every edge is assigned one of $N$ colors, and every vertex has exactly one edge of each color; and the graph has the quadrilateral property, meaning that the edges of any two fixed colors, taken together, form a disjoint union of 4-cycles.
The quadrilateral property is a different way of describing the requirement that following a pair of edge colors in either order has the same result: instead of considering two possible paths away from the same vertex, we notice that following one path away from a vertex and the other path back again brings us back to the start (so blue-green-blue-green will return us to the vertex where we started).
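The $N$-cube, with vertices encoded as bit strings and each edge color flipping one bit, is the basic example of a chromotopology, and the quadrilateral property can be checked mechanically. A sketch (the bit-flip encoding is a standard model of the cube, used here as an assumption):

```python
from itertools import combinations

N = 3                      # number of colors
vertices = range(2 ** N)   # vertices of the N-cube, encoded as bit strings

def follow(v, color):
    """Travel along the edge of the given color: flip one bit."""
    return v ^ (1 << color)

def is_boson(v):
    """Bit parity splits the vertices into the two parts of the bipartition."""
    return bin(v).count("1") % 2 == 0

for v in vertices:
    # Every edge joins the boson part to the fermion part.
    assert is_boson(follow(v, 0)) != is_boson(v)
    for a, b in combinations(range(N), 2):
        # Two colors, in either order, lead to the same vertex,
        # so following a-b-a-b closes up into a 4-cycle.
        assert follow(follow(v, a), b) == follow(follow(v, b), a)
        assert follow(follow(follow(follow(v, a), b), a), b) == v
print("quadrilateral property holds on the N-cube")
```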
To make an Adinkra, we start with a chromotopology and add two additional structures. The first is a height assignment. To make a height assignment, we place our bosons and fermions on different levels, and require that every time we follow an edge, we go up or down one level. For example, when $N=1$, there are two possible height assignments, one with the boson on the lowest level and one with the fermion on the lowest level. You might notice that the example Adinkra and chromotopology we’ve already seen are drawn with consistent height assignments.
In the conversion between Adinkras and supersymmetry representations, the height assignment tells us when to take derivatives. We use the convention that going up a level does not use a time derivative, but going back down does. Thus, in the one-dimensional example, our choice of height specifies whether we take a derivative when going from fermion to boson or from boson to fermion.
Our final structure is an odd dashing. We know that in a chromotopology, picking a vertex and a pair of colors gives us a 4-cycle of edges that will bring us back where we started. To make an odd dashing, for each of these 4-cycles, we dash an odd number of edges—either just one, or three of the four. For example, here is a possible dashing of our example $N=3$ chromotopology—making it into an Adinkra.
Here is a different odd dashing of the same chromotopology—and thus a different Adinkra!
The dashings tell us where to place minus signs in our supersymmetry representations. Because there are an odd number of dashed edges in each two-color 4-cycle, when we flip the order in which we follow a pair of colors, we will also change the number of dashed edges we encounter. This guarantees that the minus sign condition in the supersymmetry algebra rule $Q_I Q_J = -Q_J Q_I$ will be satisfied.
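To see this parity bookkeeping concretely, here is a minimal Python sketch (not from the column itself): it uses the $N=3$ cube as the chromotopology, with vertices as 3-bit strings, and one standard choice of odd dashing. The particular dashing rule (dash the color-$i$ edge at $v$ when the lower-order bits of $v$ have odd parity) is an assumption for illustration, not the dashing drawn in the figures.

```python
from itertools import combinations

N = 3  # number of edge colors; vertices of the N-cube are N-bit integers

def dashed(v, i):
    # Illustrative "standard" odd dashing: the color-i edge at vertex v is
    # dashed iff the parity of bits 0..i-1 of v is odd.
    return bin(v & ((1 << i) - 1)).count("1") % 2 == 1

# Check that every two-color 4-cycle carries an odd number of dashed edges.
for v in range(2**N):
    for i, j in combinations(range(N), 2):
        cycle = [dashed(v, i), dashed(v ^ (1 << i), j),
                 dashed(v ^ (1 << j), i), dashed(v, j)]
        assert sum(cycle) % 2 == 1
print("every two-color 4-cycle has an odd number of dashes")
```

Flipping the order of the two colors in any such cycle changes which dashed edges are traversed, which is exactly how the sign condition $Q_I Q_J = -Q_J Q_I$ is realized.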
Adinkras help us generate a multitude of supersymmetry algebra representations. They also give us new ways to visualize relationships between distinct representations: for example, we can classify supersymmetry algebra representations that correspond to distinct dashings of the same chromotopology and height assignment, or move from representation to representation by raising and lowering vertices to different levels. The result is a powerful tool for exploration and discovery, in both physics and mathematics.
Joe Malkevitch
York College (CUNY)
When looking at a body of mathematical ideas, one might look for the “atoms” or parts, so as to see the whole by having insight into its parts. If in some future state of the Earth there were no automobiles, and some humans came across a well-preserved car from the 1970s with no prior knowledge of what such a thing was, how might they interpret what they were looking at? An archaeologist at that time might try to understand its parts as a way to think through what the whole thing was good for. Perhaps these people might decide it was a small moveable house?
In an earlier Feature Column essay I looked at how by studying the primes 2, 3, 5, 7, … we get insight into big integers such as 1111113. There I also looked at partitions of positive integers—for example, $5 = 4 + 1$ and $5 = 3 + 1 + 1$ are but two of the partitions of 5. Words connoting or related to decomposition in English include decomposition, dissection, factoring, irreducible, and so on. It is not uncommon in mathematics to use technical vocabulary that suggests the ideas a word has in ordinary, non-mathematical usage. For example, consider the word "irreducible": it suggests something that cannot be broken up into parts.
Before addressing the issue of geometric decompositions in earnest, as a teaser recall one of the most important and well-known theorems in mathematics: the Pythagorean Theorem (though attributing it to Pythagoras, or even to the Pythagoreans, distorts the history of this remarkable result). It can be viewed as a result in algebra or a result in geometry. The Pythagorean Theorem states that in a right triangle (one where two sides meet at a 90 degree angle—that is, are perpendicular), the square (in the algebraic sense) of the length of the side opposite the right angle is the sum of the squares of the lengths of the other two sides. But this theorem about lengths can also be interpreted as a statement about the areas of the geometric squares that can be constructed on the sides of a right triangle. Here are diagrams (Figures 1 and 2) that support one of the many proofs of the Pythagorean Theorem that involve moving around pieces of squares and assembling them to form other squares.
Proofs of theorems using "physical models" such as the diagrams in Figures 1 and 2 are quite compelling because of the amazing ability of humans to take in and process visual information. The eye responds to the length of segments and the area of regions, even if the fact that area scales as the square of length, rather than as length itself, sometimes causes individuals to make misleading judgments about diagrams.
One might try to understand geometric objects in terms of the parts that make them up. These parts might be described as: points, lines, membrane patterns, corners, curves, etc. Sometimes these parts can be viewed with terms that overlap. In describing a shirt one might use terms like sleeves and buttons and in describing a car one might mention windows and wheels. Here we will give special consideration to the notion of a polygon.
After the point and the line, among the most fundamental of geometrical objects is the triangle. A triangle is a collection of three points not on a line, known as the vertices of the triangle, together with the segments joining pairs of those points. When we classify polygons drawn on a flat piece of paper in the plane, we can do so by counting the number of corners of the polygon or by counting the number of sides (edges) of the polygon. We can think of a polygon as a collection of points joined by sticks with no membrane filling in the result, or we can include the interior of the polygon along with the "sticks." There are pros and cons for defining shapes in particular ways. Here I just want to point out that we can classify polygons drawn in the plane not only by their number of corners, but by whether or not the polygon intersects itself or has notches. We say a set $X$ is convex if given any two elements $u$ and $v$ of $X$, the line segment joining $u$ and $v$ is contained in (is a subset of) $X$.
Figure 3 provides an illustration of this fundamental concept of modern geometry. Intuitively, a convex set is one that does not have notches or holes. In particular, polygons drawn in the plane are usually defined (as are circles) to be dots connected by straight line segments—"sticks"—without the points in the interior. Triangles with their interiors are convex sets, but as soon as we have a polygon with more than 3 vertices we can have non-convex polygons, polygons whose vertices do not lie in a plane, or polygons whose sides self-intersect. In some situations polygons are allowed to have several consecutive vertices lying along a straight line, but often it is required that pairs of consecutive sides not lie along the same line. This polygon has 6 vertices (corners) and 6 sides, and is thus described as a hexagon, in this case a non-convex hexagon.
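The convexity of a polygon can also be checked computationally: a simple polygon, with vertices listed in order, is convex exactly when consecutive edge vectors always turn the same way. A minimal sketch (the `notched` example is made up for illustration):

```python
def is_convex(poly):
    # A simple polygon (vertices listed in order) is convex iff the cross
    # products of all pairs of consecutive edge vectors share one sign.
    n = len(poly)
    signs = set()
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n], poly[(i + 2) % n]
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if cross != 0:  # skip collinear consecutive vertices
            signs.add(cross > 0)
    return len(signs) == 1

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
notched = [(0, 0), (2, 0), (1, 1), (2, 2), (0, 2)]  # a notch at (1, 1)
print(is_convex(square), is_convex(notched))  # → True False
```

The notch at $(1, 1)$ forces a turn in the opposite direction, which is exactly the failure of convexity described above.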
Historically, attention has been given to the length of sides and the measure of the interior (and sometimes exterior) angles of polygons. When the angles of a convex polygon are equal and its sides have the same length, it is called regular. However, one can consider polygons where the sides are all equal, that is the polygon is equilateral, or the angles are all equal, in which case it is equiangular. The polygon in Figure 4 is an equilateral hexagon. Its angles are not all equal, but there are three different sizes of angle, equal in pairs. If one adopts a partition-style way of classifying this polygon it is an example of a non-convex $\{6\}$; $\{2, 2, 2\}$ since there are six sides of equal length, and three types of angles equal in pairs.
Figure 4 (A non-convex hexagon where all of the sides have equal length.)
Figure 5 shows a small sampler of polygons, one convex and others non-convex. In one case, all consecutive pairs of sides of the polygon meet at right angles (a rectilinear or orthogonal polygon).
Figure 5 (A sampler of different kinds of polygons, convex and non-convex.)
The simplest polygon is one that has only three vertices: a triangle. In the spirit of decomposition, it is natural to ask if every simple polygon in the plane can be decomposed, using its existing vertices, into triangles. While this may seem intuitively obvious, it is true but not that easy to prove. It may also seem intuitively clear that the 3-dimensional analogues, polyhedra, including non-convex ones having the topology of a sphere, can be decomposed into tetrahedra (the "atoms" of 3-dimensional polyhedra, as it were), but this is in fact not true!
Given a polygon drawn in the plane, it is always possible to subdivide that polygon into triangles using its existing vertices. However, for some decomposition problems it is of interest to add additional vertices to the sides of the polygon as part of the decomposition effort, and sometimes one might also allow vertices in the interior of the polygon. Thus in Figure 12 you can see how a square can be subdivided into triangles by adding some additional vertices along the sides of the square, and also how to do the subdivision using an additional vertex. Figure 6 shows a simple (no self-intersections) non-convex polygon with 11 vertices (and 11 sides). Using 8 diagonals joining existing vertices, this 11-gon can be subdivided into 9 triangles. In fact there are many other such triangulations starting with this same polygon, but all of them will use 8 diagonals and give rise to 9 triangles. In general, any simple $n$-gon ($n$ at least 4) can be triangulated using $(n-3)$ diagonals into $(n-2)$ triangles. (One can see this using Euler’s Polyhedral Formula, $V + F - E = 2$, for a connected graph drawn in the plane.)
Figure 6 (A simple non-convex polygon converted using diagonals to a polygon subdivided into triangles.)
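The counts of $(n-3)$ diagonals and $(n-2)$ triangles can be checked directly for a convex polygon, where a "fan" from one vertex gives a triangulation. A short sketch using a regular 9-gon (the circumradius of 1 is an arbitrary choice), matching the counts seen in Figure 7:

```python
import math

n = 9  # a convex 9-gon (regular, for simplicity)
pts = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)) for k in range(n)]

# Fan triangulation from vertex 0: valid for any convex polygon.
triangles = [(0, k, k + 1) for k in range(1, n - 1)]
diagonals = [(0, k) for k in range(2, n - 1)]  # the chords the fan introduces

assert len(triangles) == n - 2 and len(diagonals) == n - 3

# Sanity check: triangle areas (via the cross product) sum to the polygon's area.
def area(p, q, r):
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2

poly_area = n / 2 * math.sin(2 * math.pi / n)  # regular n-gon, circumradius 1
assert math.isclose(sum(area(*[pts[i] for i in t]) for t in triangles), poly_area)
print(len(diagonals), "diagonals,", len(triangles), "triangles")
```

For non-convex polygons the fan can fail, and a more careful method (such as ear clipping) is needed, but the diagonal and triangle counts come out the same.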
Figures 7 and 8 show a convex 9-gon (Figure 7) and a non-convex 9-gon (Figure 8), each subdivided by 6 diagonals into 7 triangles. Unlike the triangulation in Figure 6, each of these triangulations includes one or more triangles that share edges with three other triangles.
Figure 7 (A convex polygon subdivided into triangles.)
Figure 8 (A non-convex polygon partitioned into triangles using the diagonal edges shown in red.)
In Figure 8 I have called attention to the edges that subdivide the original polygon into triangles by coloring the subdividing diagonals red. As a problem in graph theory (the theory of diagrams involving dots and the lines that join them), you may want to think about the question:
Given a collection of edges, when can they serve as the diagonals for a plane $n$-gon that turns the $n$-gon into a triangulated polygon?
A remarkable theorem involving decompositions is that if one has two simple plane polygons of the same area, it is possible to decompose either of the polygons into polygonal pieces that can be reassembled to form the other polygon. This result is known as the Wallace-Bolyai-Gerwien Theorem. By way of illustration, Figure 9 shows a way to decompose a square and an equilateral triangle of equal area into polygonal parts that can be used to form the other shape. The decomposition shown uses the minimal number of pieces. A lot of research has been done on equidecomposability with a minimal number of pieces, and on decompositions where the pieces are required to have particular shapes.
It is natural to ask if, in 3 dimensions, two polyhedra with the same volume can each be decomposed (cut) into pieces and reassembled to form the other. This problem was solved by Max Dehn (1876-1952) and was one of the set of famous problems posed by David Hilbert (1862-1943), whose solutions he thought would create progress in a variety of mathematical areas. Figure 10 shows two 3-dimensional polyhedra, one decomposed into the other. Dehn provided tools for telling when this can be done. However, it may surprise you to learn that the analogue of what we see in Figure 9 can’t be achieved: a cube (see the left of Figure 10) can’t be cut into polyhedral pieces and reassembled into a regular tetrahedron of the same volume.
A natural question about decomposition of a polygon is whether or not the polygon can be decomposed into convex pieces of equal area, and in particular into triangles, which are automatically convex.
Each of the squares in Figure 11 can be thought of as a square of side length 2, and both have been divided into 4 congruent triangles; thus in each square the 4 triangles have the same area. It is interesting that in each case the triangles are right triangles and thus satisfy the Pythagorean theorem. You can verify for yourself that on the left the triangles are isosceles right triangles with sides $\sqrt{2}$, $\sqrt{2}$, and 2, while on the right the triangles are scalene (all three sides of different lengths) with side lengths 1, 2, and $\sqrt{5}$. Also note that although the two squares are congruent, both of side length 2, it may not appear that the triangles on the left have the same area as the triangles on the right, but you can verify this using the Pythagorean Theorem.
Figure 11 (Two different ways to subdivide congruent initial squares into 4 congruent triangles.)
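The claimed side lengths and areas are easy to check numerically with Heron's formula (a sketch; the side lengths are those stated above, and each triangle should have area 1, one quarter of the side-2 square):

```python
import math

def heron(a, b, c):
    # Heron's formula: area of a triangle from its three side lengths.
    s = (a + b + c) / 2
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

# Left square: four isosceles right triangles with sides sqrt(2), sqrt(2), 2.
left = heron(math.sqrt(2), math.sqrt(2), 2)
# Right square: four scalene right triangles with sides 1, 2, sqrt(5).
right = heron(1, 2, math.sqrt(5))

assert math.isclose(left, 1.0) and math.isclose(right, 1.0)
assert math.isclose(4 * left, 2**2)  # four copies fill the side-2 square
print(left, right)
```

Both triangle shapes have area exactly 1, confirming that the two quite different-looking decompositions cut the same square into pieces of equal area.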
The decompositions of squares into equal-area triangular parts in Figure 11 can be extended to decompose a square in various ways into any even number of parts. The decompositions shown in Figure 11 have the property that when two triangles touch, they touch along a complete edge of another triangle. Are other kinds of decompositions of a square into an even number of triangles possible? Figure 12 shows how a square can be decomposed into 6 triangles of equal area, not all of them right triangles, where some of the triangles meet other triangles along only part of an edge rather than a full edge. Triangles that touch in this way are said not to meet edge-to-edge. In recent years, interest in tilings of the plane, as well as decompositions of polygons into pieces with special properties, has extended to tiles that don’t meet edge-to-edge.
Figure 12 (A decomposition of a square into 6 triangles of equal area, where not all of the triangles meet edge-to-edge. Image courtesy of Wikipedia.)
A question that has almost certainly occurred to you by now is whether a square can be decomposed into an odd number of triangles with the same area. If you try to find such a dissection you will find that you are not able to do it! So you might try to prove that it cannot be done. However, if you are like many people you will not find this so easy. The following theorem has come to be associated with the name of Paul Monsky:
Theorem (1970): It is not possible to decompose a square into an odd number of triangles of equal area.
Like many easy-to-state questions that are not so easy to settle, there is a lot of history in how Monsky came to prove his result. It might seem that the history would go back to ancient times, but in fact the problem seems to have been born quite recently. It appears to have originated in 1965 with Fred Richman, and other names associated with the problem are John Thomas and Sherman Stein. While many saw the problem with the publication of Monsky’s proof in 1970, it was the work of Sherman Stein that magnified interest in it under the title of what have come to be called equidissection problems. What Stein basically did was ask what other shapes in the plane cannot be dissected into triangles of equal area. You might want to think about this issue for yourself, and perhaps you can come up with some new variants that other people have not thought of. After all, why restrict oneself to squares!
The fact that a square can’t be decomposed into an odd number of triangles of the same area does not mean that one can’t think about dividing squares into an odd number of triangles, such as those in Figure 13. We know that the triangles in such a decomposition cannot all have equal area, but researchers have investigated how close to equal in area they can be made, in terms of the number of triangles in the decomposition.
Figure 13 (A square decomposed into an odd number of triangles. Such a decomposition with all the triangles having equal area is not possible.)
Much more recent than the ideas that led to Monsky’s Theorem is the question of when one can take a convex polygonal region in the plane and decompose it into $N$ convex pieces which have both the same area and the same perimeter! A published version of this challenge appeared in a paper posted to the arXiv in 2008 by R. Nandakumar and N. Ramana Rao, and, in the spirit of Monsky’s Theorem, it has subsequently attracted additional interest, along with generalizations.
This problem, while very easy to state, has inspired a huge amount of new geometrical facts as well as many new questions. It is very common in the process of making progress on one mathematical problem that it opens up new questions, the need to invent new tools and sometimes whole new clusters of mathematical questions.
We have looked at how surprisingly rich and complex the environment of decomposing a polygon into triangles can be, and in particular the decomposition of a square into triangles of equal area. What about decomposing a square into squares subject to various rules? Clearly one can take a square and decompose it into various numbers of other squares of the same area, and the number of squares in the decomposition can be even or odd. However, much earlier than the interest in what has come to be called Monsky’s Theorem, a group of British students at Cambridge University in the 1930s (R.L. Brooks, Cedric Smith, Arthur Stone and William Tutte, all of whom went on to distinction in various ways) looked at a problem which has led to much important and interesting work in various parts of mathematics and is, again, very much a decomposition problem. The idea is to take a square (or, in the generalized question, a rectangle) and divide it into squares with the initially curious restriction that all of the squares in the decomposition have different side lengths. This problem has come to be known as the perfect square problem. Figure 14 shows an interesting example that fails to achieve this goal but is nonetheless striking for what it does accomplish.
Figure 14 (A square of relatively small side length subdivided into smaller squares some of which have the same edge length. Image courtesy of Wikipedia.)
Figure 15 shows an example of a square with the property that all the subdividing squares have different edge lengths.
There are many "windows" (some not square) that serve as entries into mathematical insights and investigations. Looking at parts or decompositions of shapes as well as numbers leads to lots of fascinating mathematics and its applications.
References
Those who can access JSTOR can find some of the papers mentioned above there. For those with access, the American Mathematical Society’s MathSciNet can be used to get additional bibliographic information and reviews of some of these materials. Some of the items above can be found via the ACM Digital Library, which also provides bibliographic services.
Abrams, Aaron, and Jamie Pommersheim. "Generalized dissections and Monsky’s theorem." Discrete & Computational Geometry (2022): 1-37.
Alsina, Claudi, and Roger B. Nelsen. Math made visual: creating images for understanding mathematics. Vol. 28. American Mathematical Soc., 2006.
Akopyan, Arseniy, Sergey Avvakumov, and Roman Karasev. "Convex fair partitions into an arbitrary number of pieces." arXiv preprint arXiv:1804.03057 (2018).
Alsina, Claudi, and Roger B. Nelsen. A Cornucopia of quadrilaterals. Vol. 55. American Mathematical Soc., 2020.
Frederickson, G. Dissections: Plane and Fancy. New York: Cambridge University Press, pp. 28-29, 1997.
Hoehn, Larry. "A New Proof of the Pythagorean Theorem." Mathematics Teacher, February 1995. NCTM: Reston, VA.
Jepsen, Charles, and Roc Yang. "Making Squares from Pythagorean Triangles." The College Mathematics Journal 29.4 (1998): 284-288.
Karasev, Roman, Alfredo Hubard, and Boris Aronov. "Convex equipartitions: the spicy chicken theorem." Geometriae Dedicata 170.1 (2014): 263-279.
Kasimatis, Elaine Ann. "Dissections of regular polygons into triangles of equal areas." Discrete & Computational Geometry 4.4 (1989): 375-381.
Katz, Victor J. A History of Mathematics. Harper Collins: New York, 1993.
Loomis, E. S. The Pythagorean Proposition: Its Demonstrations Analyzed and Classified and Bibliography of Sources for Data of the Four Kinds of "Proofs," 2nd ed. Reston, VA: National Council of Teachers of Mathematics, 1968.
Machover, M. "Euler’s Theorem Implies the Pythagorean Proposition." Amer. Math. Monthly 103, 351, 1996.
Maldonado, Gerardo L., and Edgardo Roldán-Pensado. "Dissecting the square into seven or nine congruent parts." Discrete Mathematics 345.5 (2022): 112800.
Maor, Eli. The Pythagorean theorem: a 4,000-year history. Vol. 65. Princeton University Press, 2019.
Mead, David G. "Dissection of the hypercube into simplexes." Proceedings of the American Mathematical Society 76.2 (1979): 302-304.
Monsky, Paul. "On dividing a square into triangles." The American Mathematical Monthly 77.2 (1970): 161-164.
Nelsen, Roger B. Proofs without words: Exercises in visual thinking. No. 1. MAA, 1993.
Posamentier, Alfred S. The Pythagorean theorem: the story of its power and beauty. Prometheus books, 2010.
Nandakumar, R., and N. Ramana Rao. "Fair partitions of polygons: An elementary introduction." Proceedings-Mathematical Sciences 122.3 (2012): 459-467.
Rooney, Elaine Ann Kasimatis. "Dissection of regular polygons into triangles of equal areas." Ph.D. dissertation, 1987.
Stein, Sherman. "Equidissections of centrally symmetric octagons." Aequationes Mathematicae 37.2 (1989): 313-318.
Stein, Sherman K. "Cutting a polyomino into triangles of equal areas." The American Mathematical Monthly 106.3 (1999): 255-257.
Stein, Sherman. "A generalized conjecture about cutting a polygon into triangles of equal areas." Discrete & Computational Geometry 24.1 (2000): 141-145.
Stein, Sherman. "Cutting a polygon into triangles of equal areas." The Mathematical Intelligencer 26.1 (2004): 17-21.
Su, Zhanjun, and Ren Ding. "Dissections of polygons into triangles of equal areas." Journal of Applied Mathematics and Computing 13.1 (2003): 29-36.
Wang, Yang, Lei Ren, and Hui Rao. "Dissecting a square into congruent polygons." Discrete Mathematics & Theoretical Computer Science 22 (2020).
Sara Stoudt
Bucknell University
Fitting a line to a set of points… how hard can it be? When those points represent the temperature outside and a town’s ice cream consumption, I’m really invested in that line helping me to understand the relationship between those two quantities. (What if my favorite flavor runs out?!) I might even want to predict new values of ice cream consumption based on new temperature values. A line can give us a way to do that too. But when we start to think more about it, more questions arise. What makes a line “good”? How do we tell if a line is the “best”?
A technique called ordinary least squares (OLS), aka linear regression, is a principled way to pick the “best” line, where “best” is defined as minimizing the sum of the squared distances between the line and each point. We chant the assumptions of OLS and know what to look for in diagnostic plots, but where do these assumptions come from? Are some assumptions more hurtful than others if broken? These are questions I second-guess myself on, even though I frequently use and teach linear regression.
To prove that OLS performs “well,” we need to make certain assumptions along the way. Walking through the proof reveals which assumptions are most crucial and what the impact of breaking each one is.
Let’s first write out our theoretical model in matrix notation and recall what the dimensions of each piece of the puzzle are. We are looking for a linear relationship between $X$ and $Y$, but we know there may be some error, which we represent by $\epsilon$.
The OLS approach estimates the unknown parameters $\beta$ by $\hat{\beta} = (X’X)^{-1}X’Y$. Here $X’$ is the transpose of $X$, so the product $X’X$ is a square matrix.
Have we implicitly assumed anything yet? Actually yes, there is an inverse in play! What would $X’X$ have to look like to be invertible? Let’s put our linear algebra hats on.
This matrix would have to be “full rank”. Informally, that means that there is no redundant information in the columns of $X$. So if the number of predictor variables $p$ is greater than the number of data points $n$, that would be a problem. Even if $p < n$, the columns of $X$ still need to be linearly independent. Multicollinearity (when one covariate is highly correlated with another covariate) would make us suspicious of this. Let’s keep that in mind just in case it comes back to haunt us.
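A quick numerical illustration of this failure mode, using NumPy (the data here are simulated for illustration): when one column of $X$ is an exact multiple of another, $X'X$ loses rank and the OLS formula breaks down.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 2 * x1  # perfectly collinear with x1: redundant information
X = np.column_stack([np.ones(n), x1, x2])

print(np.linalg.matrix_rank(X.T @ X))  # 2, not 3: X'X is rank deficient
try:
    np.linalg.inv(X.T @ X)
except np.linalg.LinAlgError:
    print("X'X is not invertible, so the OLS formula breaks down")
```

In practice the collinearity is rarely this exact; near-collinearity leaves $X'X$ technically invertible but numerically unstable, which is why multicollinearity makes us suspicious even when the formula still runs.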
Now that we have our estimator for $\beta$, there are two properties we want to check on.
Let’s start with determining if we “expect” to get the right answer on average. It will help to rewrite $Y$ based on the proposed model.
As we break this expression down, we can see that we are on the right track. However, somehow that second term needs to have expectation zero.
How can we make this happen? This is where two regression assumptions are born. First we need the errors, $\epsilon$, to be independent of $X$. This seems plausible: if the errors depend on $X$, we still have some information left over that is not accounted for in the model. (A related warning sign: if the variance of the errors depends on $X$, we have heteroscedasticity, non-constant variance for short, which sets our OLS assumption alarm bells ringing.) We also need to make an assumption about the magnitude of the errors themselves. It would be nice if they were not systematically positive or negative (that doesn’t seem like a very good model), so assuming they are on average zero seems like a reasonable path forward. Together, these two assumptions give us $E[\epsilon \mid X] = 0$.
With these assumptions in hand we now have an estimator that is unbiased. We expect to get the right answer on average. So far we have seen the motivation for two assumptions. If either of those are broken, this unbiasedness is not guaranteed. So if our goal is to make predictions or interpret these coefficients in context, we will be out of luck if these assumptions aren’t met. The next step is to understand the estimator’s uncertainty. Let’s see what other assumptions reveal themselves in the process.
So let’s recap what we’ve had to assume about the errors so far.
What haven’t we had to assume yet? Normality! Then why is everyone always worried about that when doing regression? Why all of those QQ plots?! Well, we do need to know the whole sampling distribution of the estimates if we want to do inference, that is, to say something about a more general population based on a sample. If we assume that the distribution of the errors is actually Normal, we get a Normal sampling distribution for $\hat{\beta}$.
Ah, but there is that pesky inverse again! If we have multicollinearity, that inverse might get unstable, affecting our understanding of the spread of the sampling distribution. And of course nothing in life is perfectly normal. However, we can often get an approximately normal sampling distribution if our sample size is large enough thanks to the Central Limit Theorem (an explainer for another time), so there is some robustness to breaking this particular assumption.
We’ve now seen all of the regression assumptions unfold, so we should be able to build a sort of hierarchy of assumptions.
This hierarchy is also borne out in simulation studies like this one that try to stress test OLS. You may even want to code up a simple simulation yourself if you want to look for any gaps between the theory and practice.
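For instance, here is a minimal sketch of such a simulation (the true coefficients, sample size, and skewed error distribution are arbitrary choices): it breaks normality while keeping $E[\epsilon \mid X] = 0$, and unbiasedness survives.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([2.0, -1.0])  # true intercept and slope
n, reps = 200, 2000

estimates = np.empty((reps, 2))
for r in range(reps):
    x = rng.uniform(0, 10, size=n)
    X = np.column_stack([np.ones(n), x])
    # Skewed (exponential) errors, centered at zero: normality is broken,
    # but E[eps | X] = 0 still holds.
    eps = rng.exponential(1.0, size=n) - 1.0
    y = X @ beta + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(estimates.mean(axis=0))  # close to (2, -1): unbiasedness survives
```

A histogram of the slope estimates would also look approximately Normal here, illustrating the Central Limit Theorem robustness discussed above.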
Some broken assumptions are fixable though. We might transform variables to deal with a lack of linearity or use a Generalized Linear Model that allows the relationship between $X$ and $Y$ to take a non-linear form. We might just be happy with estimating the best linear projection of the true relationship. If errors aren’t independent and/or do not have constant variance, methods like generalized least squares (GLS) can step in. If errors are not normally distributed, they might be approximately normally distributed thanks to the Central Limit Theorem, or we might want to use a Generalized Linear Model (GLM) that can handle error distributions of other forms. We always have options!
It’s great to see the theory behind the scenes informing the practice, and in general, assumptions of any method have to come from somewhere. Seeking out where in the math assumptions help make our lives easier can help demystify where these assumptions actually come from and gives some insight into which ones, if broken, are more dangerous than others. Happy line fitting!
For reference, here are the formulas from the derivation above, collected in order. The model and the OLS estimator:
$$Y = X \beta + \epsilon, \qquad \hat{\beta} = (X’X)^{-1}X’Y$$
Rewriting $Y$ inside the estimator:
$$
\begin{aligned}
\hat{\beta} & = (X’X)^{-1} X’ Y = (X’X)^{-1} X’ (X\beta + \epsilon) \\
& = (X’X)^{-1} X’X\beta + (X’X)^{-1} X’\epsilon \\
& = \beta + (X’X)^{-1}X’ \epsilon
\end{aligned}
$$
Taking conditional expectations:
$$
\begin{aligned}
E[\hat{\beta} | X] & = E[(\beta + (X’X)^{-1} X’ \epsilon) | X] \\
& = E[\beta | X] + E[(X’X)^{-1}X’ \epsilon | X] \\
& = \beta + (X’X)^{-1} X’ E[\epsilon | X]
\end{aligned}
$$
so unbiasedness, $E[\hat{\beta} | X] = \beta$, holds exactly when $E[\epsilon | X] = 0$. Finally, adding the normality assumption
$$\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$$
gives the sampling distribution
$$\hat{\beta} \sim N(\beta, \sigma^2 (X’X)^{-1}).$$
Special thanks to my students in my Stat 2 class this semester for raising good questions about these regression assumptions and to my colleagues for helping me work through this hierarchy before reporting back. Also, thanks to Features Writer Courtney R. Gibbons for including drawings in her last post. That inspired me to do some doodling myself.
Note: The website https://drawdata.xyz/ made it possible to generate data to make the plots in the opening image.
David Austin
Grand Valley State University
I was running some errands recently when I spied the Chicken Robot ambling down the sidewalk, then veering into the bike lane. A local grocery chain needs to transport rotisserie chickens from one of its larger stores to its downtown market. Their solution? A semi-autonomous delivery vehicle that makes the three-mile journey navigating with GPS.
I was curious how this worked. The locations provided by most GPS systems are only accurate to within a meter or so. How could those juicy chickens know how to stay so carefully in the center of the bike lane?
A typical solution to problems like this is an elegant algorithm, known as Kalman filtering, that’s embedded in a wide range of technology that most of us frequently either use or benefit from in some way. The algorithm allows us to efficiently combine our expectations about the state of a system, the chickens’ physical location, for instance, with imperfect measurements about the system to develop a highly accurate picture of the system.
This column will describe some simple versions of Kalman filtering, the main observation that makes it work, and why it’s such a great idea.
Let’s begin with a simple example. Suppose we would like to determine the weight of some object. Of course, any scale is imperfect so we might weigh it repeatedly on the same scale and find the following weights, perhaps in grams.
50.4 | 46.0 | 46.1 | 45.1 | 48.9 | 42.6 | 50.7 | 45.7 | 47.8 | 46.7 |
As the result of each weighing is recorded, we update our estimate of the weight as the average of all the measurements we’ve seen so far. That is, after obtaining $z_n$, the $n^{th}$ weight, we could estimate the object’s weight by finding the average $$ x_n = \frac1n(z_1 + z_2 + \cdots + z_n). $$ The result is shown below.
As the next weight $z_{n}$ is recorded, we update our estimate:
$$
\begin{aligned}
x_{n} & = \frac1{n}\left(z_1 + z_2 + \cdots + z_{n}\right) \\
& = \frac1{n}\left((n-1)x_{n-1} + z_{n}\right) \\
& = x_{n-1} + \frac1{n}\left(z_{n} - x_{n-1}\right).
\end{aligned}
$$
This expression tells us how each new measurement influences the next estimate. More specifically, the new estimate $x_{n}$ is obtained from the previous estimate $x_{n-1}$ by adding in $\frac1{n}(z_{n} – x_{n-1})$. Notice that this term is proportional to the difference between the new measurement and the previous estimate with the proportionality constant being $K_n = \frac1{n}$. We call $K_n$ the Kalman gain; the fact that it decreases as we include more measurements reflects our increasing confidence in the estimates $x_n$ so that, as $n$ increases, new measurements have less effect on the updated estimates.
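The incremental form is easy to check against the ordinary average. Here is a minimal sketch in Python using the weights from the table above:

```python
def running_average(measurements):
    """Update the estimate incrementally: x_n = x_{n-1} + (1/n)(z_n - x_{n-1})."""
    x = 0.0
    estimates = []
    for n, z in enumerate(measurements, start=1):
        K = 1.0 / n          # the Kalman gain for a plain running average
        x = x + K * (z - x)  # incremental update
        estimates.append(x)
    return estimates

weights = [50.4, 46.0, 46.1, 45.1, 48.9, 42.6, 50.7, 45.7, 47.8, 46.7]
estimates = running_average(weights)
```

At every step, the incremental estimate agrees exactly with the average of the measurements seen so far.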
Suppose now that we have some additional information about the accuracy of our measurements $z_n$. For example, if $z_n$ is the chickens’ location reported by a GPS system, the true location is most likely within a meter or so. More generally, we might imagine that the true value is normally distributed about $z_n$ with standard deviation $\sigma_{z_n}$.
Returning to our weighings, we could perhaps estimate $\sigma_{z_n}$ by repeatedly weighing an object of known weight. The probability that the true value is near $z_n$ is given by the distribution
$$
p_{z_n}(x) =
\frac{1}{\sqrt{2\pi\sigma_{z_n}^2}}
e^{-(x-z_n)^2/(2\sigma_{z_n}^2)}.
$$
Let’s also suppose that we have some estimate of the uncertainty of our estimates $x_n$. In particular, we’ll assume the probability that the true value is near $x_n$ is given by the normal distribution having mean $x_n$ and standard deviation $\sigma_{x_n}$:
$$
p_{x_n}(x) =
\frac{1}{\sqrt{2\pi\sigma_{x_n}^2}}
e^{-(x-x_n)^2/(2\sigma_{x_n}^2)}.
$$
We’ll describe how we can estimate $\sigma_{x_n}$ a bit later.
These distributions could be as shown below.
Now here’s the beautiful idea that underlies Kalman’s filtering algorithm. Both distributions $p_{x_n}$ and $p_{z_n}$ describe the location of the true value we seek. We can combine them to obtain a better estimate of the true value by multiplying them. That is, we combine or “fuse” these two distributions using the product
$$
p_{x_n}(x) p_{z_n}(x).
$$
Since the product of two Gaussians is a new Gaussian, we obtain, after normalizing, a new normal distribution that better reflects the true value. In our example, the product $p_{x_n}p_{z_n}$ is shown in green.
More specifically, the product of the two distributions is
$$
\begin{aligned}
& \exp\left[-(x-x_n)^2/(2\sigma_{x_n}^2)\right]
\exp\left[-(x-z_n)^2/(2\sigma_{z_n}^2)\right] \\
& \hspace{48pt}=
\exp\left[-(x-x_n)^2/(2\sigma_{x_n}^2)-(x-z_n)^2/
(2\sigma_{z_n}^2)\right].
\end{aligned}
$$
Expanding the quadratics in the exponents and recombining leads to the new normal distribution
$$
\frac{1}{\sqrt{2\pi\sigma_{x_{n+1}}^2}}
\exp\left[-(x-x_{n+1})^2/(2\sigma_{x_{n+1}}^2)\right]
$$
where
$$
\begin{aligned}
x_{n+1} & =
\frac{\sigma_{z_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2} x_n +
\frac{\sigma_{x_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2} z_n \\
& = x_n + \frac{\sigma_{x_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2}(z_n - x_n).
\end{aligned}
$$
Notice that this expression is similar to the one we found when we updated our estimates by simply taking the average of all the measurements we have seen up to this point. The Kalman gain is now
$$
K_n = \frac{\sigma_{x_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2}
$$
In addition, the product of the Gaussians leads to the new variance
$$
\sigma_{x_{n+1}}^2 =
\frac{\sigma_{x_n}^2\sigma_{z_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2}
= (1-K_n) \sigma_{x_n}^2
$$
Notice that the Kalman gain satisfies $0\lt K_n \lt 1$. In the case that $\sigma_{x_n} \ll \sigma_{z_n}$, we feel much more confident in our estimate $x_n$ than our new measurement $z_n$. Therefore, $K_n\approx 0$ and so $x_{n+1} \approx x_n$, reflecting the fact that the new measurement has little influence on our next estimate.
However, if $\sigma_{x_n} \gg \sigma_{z_n}$, we feel much more confident in our new measurement $z_n$ than in our estimate $x_n$. Then $K_n\approx 1$ and $x_{n+1}\approx z_n$.
In both cases, the uncertainty in our new estimate, as measured by $\sigma_{x_{n+1}}^2 = (1-K_n)\sigma_{x_n}^2$, has decreased.
This leads to the following algorithm:
To illustrate, suppose that we make many measurements of a known weight with our scale and determine that the standard deviation of its measurements is a constant $\sigma_z=2$. Initializing so that $x_1 = z_1$ and $\sigma_{x_1} = \sigma_z = 2$, we have the following sequence of estimates along with the decreasing sequence of uncertainties.
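In code, the one-dimensional algorithm simply alternates the gain, estimate, and variance updates. A minimal sketch, using the made-up weights from before and the assumed constant $\sigma_z = 2$:

```python
def kalman_1d(measurements, sigma_z):
    """Scalar Kalman filter for a static quantity."""
    x, var = measurements[0], sigma_z**2  # initialize: x_1 = z_1, sigma_{x_1} = sigma_z
    history = [(x, var)]
    for z in measurements[1:]:
        K = var / (var + sigma_z**2)      # Kalman gain
        x = x + K * (z - x)               # update the estimate
        var = (1 - K) * var               # the uncertainty decreases
        history.append((x, var))
    return history

weights = [50.4, 46.0, 46.1, 45.1, 48.9, 42.6, 50.7, 45.7, 47.8, 46.7]
history = kalman_1d(weights, sigma_z=2)
```

With a constant $\sigma_z$ and this initialization, the variances work out to $\sigma_{x_n}^2 = \sigma_z^2/n$, so the gains are $K_n = 1/n$ and the filter exactly reproduces the running average from our first example.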
In our first example, the quantity we’re tracking, the weight of some object, doesn’t change. Let’s now consider a dynamic process where the quantities are changing. For instance, suppose we’re tracking the height and vertical velocity of a moving object. The true height of the object could look like this, although that true height is not known to us.
In addition to recording the height $h$, we will also track the vertical velocity $v$ so that the state of the object at some time is given by the vector
$$
{\mathbf x}_{n,n} = \begin{bmatrix} h_n \\ v_n \end{bmatrix}.
$$
The reason for writing the subscript in this particular way will become clear momentarily.
Moreover, we will assume that there is some uncertainty in both the position and the velocity. Initially, these uncertainties may be uncorrelated with one another so that we arrive at a Gaussian
$$
\exp\left[-(h-h_n)^2/(2\sigma_{h_n}^2) –
(v-v_n)^2/(2\sigma_{v_n}^2)\right]
$$
that describes the multivariate normal distribution of the true state. A more convenient expression for this Gaussian uses the covariance matrix
$$
P_{n,n} = \begin{bmatrix} \sigma_{h_n}^2 & 0 \\
0 & \sigma_{v_n}^2
\end{bmatrix}
$$
so that the Gaussian is defined in terms of the quadratic form associated to $P_{n,n}^{-1}$:
$$
\exp\left[-\frac12 ({\mathbf x} – {\mathbf x}_{n,n})^TP_{n,n}^{-1}({\mathbf x} –
{\mathbf x}_{n,n})\right].
$$
The following figure represents the value of this distribution by shading more likely regions more darkly.
Now if we know ${\mathbf x}_{n,n}$, the state at some time, we can predict the state at a later time using some assumption about the system. For instance, we might assume, after $\Delta t$ time units have passed, that the new state is ${\mathbf x}_{n+1,n} = \begin{bmatrix}h_{n+1} \\ v_{n+1} \end{bmatrix}$ where
$$
\begin{aligned}
h_{n+1} & = h_n + v_n\Delta t \\
v_{n+1} & = v_n.
\end{aligned}
$$
That is, we assume that the height has increased at the constant velocity given by the state vector ${\mathbf x}_{n,n}$ and that the velocity has remained constant. More succinctly, if
$$
F_n = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix},
$$
we can write
$$
{\mathbf x}_{n+1, n} = F_n{\mathbf x}_{n,n}.
$$
The subscript ${\mathbf x}_{n+1, n}$ reflects the fact that we are extrapolating the next state using information only from the previous state.
This is a particular model we are choosing; in other situations, another model may be more appropriate. For instance, if we have information about the object’s acceleration, we may want to incorporate it. In any case, we assume there is a matrix $F_n$ from which we extrapolate the next state:
$$
{\mathbf x}_{n+1,n} = F_n{\mathbf x}_{n,n}.
$$
Of course, uncertainty in the state ${\mathbf x}_{n,n}$ will translate into uncertainty in the extrapolated state ${\mathbf x}_{n+1,n}$. It is straightforward to verify that the covariance matrix is transformed as
$$
P_{n+1,n} = F_nP_{n,n}F_n^T.
$$
For instance, if the uncertainties in position and velocity are initially uncorrelated, they may become correlated after the transformation, which is to be expected.
| $P_{n,n}$ | $P_{n+1,n}$ |
These two transformations form the extrapolation phase of the algorithm:
$$
\begin{aligned}
{\mathbf x}_{n+1,n} & = F_n{\mathbf x}_{n,n} \\
P_{n+1,n} & = F_nP_{n,n}F_n^T.
\end{aligned}
$$
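The extrapolation phase amounts to two matrix products. Here is a sketch using NumPy, with an illustrative time step $\Delta t = 1$ and made-up state and covariance values:

```python
import numpy as np

dt = 1.0
F = np.array([[1.0, dt],
              [0.0, 1.0]])   # constant-velocity model

x = np.array([10.0, 2.0])    # state: height 10, velocity 2
P = np.diag([1.0, 0.25])     # initially uncorrelated uncertainties

x_pred = F @ x               # x_{n+1,n} = F x_{n,n}
P_pred = F @ P @ F.T         # P_{n+1,n} = F P F^T
# P_pred picks up nonzero off-diagonal entries: the position and velocity
# uncertainties become correlated after the extrapolation
```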
Next, suppose we have a new measurement ${\mathbf z}_{n}$ whose uncertainty is described by the covariance matrix $R_{n}$. We imagine that the normal distribution centered on ${\mathbf z}_{n}$ and with covariance matrix $R_n$ describes the distribution of the true state.
As before, we will fuse our predicted state ${\mathbf x}_{n+1,n}$ with the measured state ${\mathbf z}_{n}$ by multiplying the two normal distributions and rewriting as a single Gaussian. With some work, one finds the expression for the Kalman gain $$ K_{n} = P_{n+1,n}(P_{n+1,n} + R_{n})^{-1}, $$ which should be compared to our earlier expression.
We also obtain the updated state and covariance matrix
$$
\begin{aligned}
{\mathbf x}_{n+1, n+1} & = {\mathbf x}_{n+1,n} + K_{n}({\mathbf z}_{n} -
{\mathbf x}_{n+1,n}) \\
P_{n+1,n+1} & = P_{n+1, n} - K_{n}P_{n+1,n}.
\end{aligned}
$$
So now we arrive at a new version of the algorithm:
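A compact sketch of one cycle of this predict-then-update loop, assuming, as the text implicitly does, that the measurement ${\mathbf z}_n$ observes the full state directly; the numeric values are illustrative only.

```python
import numpy as np

def kalman_step(x, P, z, F, R):
    """One extrapolate-then-update cycle of the algorithm above."""
    # extrapolation phase
    x_pred = F @ x
    P_pred = F @ P @ F.T
    # update phase
    K = P_pred @ np.linalg.inv(P_pred + R)  # Kalman gain
    x_new = x_pred + K @ (z - x_pred)       # fuse prediction with measurement
    P_new = P_pred - K @ P_pred             # uncertainty shrinks
    return x_new, P_new

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
R = np.diag([4.0, 1.0])                     # assumed measurement noise
x, P = np.array([0.0, 1.0]), np.eye(2)
x, P = kalman_step(x, P, np.array([1.2, 0.9]), F, R)
```

Tracking an object over time is then just a matter of calling `kalman_step` once for each new measurement, feeding the returned state and covariance back in.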
Let’s see how this plays out in an example. Imagine we are tracking the height and vertical velocity of an object whose true height is as shown.
Remember that we don’t know the true height, but we do have some noisy measurements that reflect the object’s height and its velocity.
Applying the Kalman filtering algorithm naively, we obtain the green curve describing the object’s height. Notice how there is a time lag between the filtered height and the true height.
What’s the problem here? Clearly, the object is experiencing a significant amount of acceleration, which is not built into the model. As we saw in our earlier static example, the uncertainty in the estimated state ${\mathbf x}_{n,n}$ continually decreases, which means the confidence we have in our estimates grows and causes us to downplay the new measurements.
There are several ways to deal with this. If we have measurements about the acceleration, we could rebuild our model so that the state vector ${\mathbf x}_{n,n}$ includes the acceleration in addition to the position and velocity. Alternatively, if we have information about how an operator is controlling the object we’re tracking, we could build it into the extrapolation phase using the update
$$
{\mathbf x}_{n+1,n} = F_{n}{\mathbf x}_{n,n} + B_n{\mathbf u}_n
$$
where ${\mathbf u}_n$ is a vector describing some additional control and $B_n$ is a matrix describing how this control feeds into the extrapolated state. Clearly, we need to know more about the system to incorporate a term like this.
Finally, we can simply build additional uncertainty into the model by adding it into the extrapolation phase. For instance, we could define
$$
P_{n+1,n} = F_nP_{n,n}F_n^T + Q_n,
$$
where $Q_n$ is a covariance matrix, known as process noise, that represents our way of saying there are additional influences not incorporated in the extrapolation model:
${\mathbf x}_{n+1,n} = F_n{\mathbf x}_{n,n}$.
Adding some process noise into the extrapolation phase prevents us from becoming overly confident in the extrapolated states so that we continue to give sufficient weight to new measurements. This leads to the filtered state shown below, which is clearly much better than the measured signal.
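Incorporating the process noise is a one-line change to the extrapolation phase; a sketch with an illustrative constant $Q$:

```python
import numpy as np

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
Q = np.diag([0.1, 0.1])      # process noise per step (made-up values)

P = np.diag([1.0, 0.25])
P_pred = F @ P @ F.T + Q     # P_{n+1,n} = F P F^T + Q
# Q keeps the predicted covariance from collapsing toward zero, so the
# Kalman gain stays bounded away from zero and new measurements continue
# to carry weight
```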
Kalman developed this algorithm in 1960, though it seems to have appeared earlier in other guises, and found a significant early use in the Apollo guidance computer. Indeed, the algorithm is well suited for this application. While the guidance computer was a marvel of both hardware and software engineering, its memory and processing power were modest by our current standards. As the algorithm only relies on our current estimate of the state, its demands on memory are slight, and the computational complexity is similarly small.
In addition to being fast and efficient, the algorithm is also optimal in the sense that, under certain assumptions on the system being modeled, the algorithm has the smallest possible expected error obtained from a given set of measurements.
Kalman filtering is now ubiquitous in navigation and guidance applications. In fact, it is used to smooth the motion of computer trackpads so you may have used it while reading this article. If you are driving while navigating with an app like Google Maps, you may notice the effect of the algorithm when you come to a stop at a traffic light. It sometimes happens that your location continues with constant speed into the intersection and then quickly snaps back to your actual location. The extrapolation phase of the algorithm would lead us to believe that we continue with constant speed into the intersection before new measurements pull the extrapolated locations back to our true location.
Finally, the Chicken Robot needs your help to find a new name.
There are lots of relevant references on the internet, but few give an intuitive sense of what makes this algorithm work. Besides Kalman’s original paper, I’ve given a few of those here.