Courtney Gibbons
Hamilton College
It was a dark and stormy night… Okay, it was probably more like 3:30 in the afternoon on a crisp fall day back when I was teaching Calc 1 for the first time as “Professor Gibbons,” and I was looking through my colleagues’ past syllabi to see what problems they liked to assign. One of the problems sent me off on a tangent (pun intended!) because it evocatively named the rational function $y = \frac{1}{1+x^2}$ “The Witch of Agnesi” and, in this problem and others in subsequent chapters, proceeded to use differential calculus to tease out its secrets.
In fact, the Witch is one of a family of plane curves, $x^2 y + 4c^2y - 8c^3 = 0$, parametrized by $c$ (just take $c = \frac{1}{2}$).
I don’t want to spoil anyone’s calculus homework, but Evelyn Lamb has written a nice blog about the Witch over at Roots of Unity. (Okay, one spoiler: the Italian for “curve” is “versiera” while “witch” is “avversiera” – so the name of the curve is a mistranslation at best, or one of those groan-inducing mathematician puns at worst.)
My encounter with the Witch left me thinking about the many ways we discuss polynomials (and rational functions) with our college students, and I wanted to share some of those perspectives in this column.
A math student’s first encounter with polynomials often comes hot on the heels of the definition of a function. In this context, we meet our friends the single-variable polynomials $f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$ and spend a lot of time with the functions that plot lines ($y = mx + b$) and parabolas ($y = ax^2 + bx + c$). Implicit in this introduction is that the coefficients belong to the real numbers.
There are lots of ways to write down a line, and thinking back to high school algebra, you might remember that you and your math class best buddy might look at the line plotted below and come up with different point-slope equations.
You write down $y - 5 = \frac{1}{2}(x - 6)$ while your pal writes down $y - 3 = \frac{1}{2}(x-2)$. These should be the same because we say two functions are equal if they make the exact same input-output assignments (and in the case of functions we can plot in the $xy$-plane, this means they sketch out the same graph). Now, if you and your high school math pal wanted to check that your lines are the same in a different way, you’d each wrangle your line into slope-intercept form and make sure they match (in this case, you’d both end up with the single-variable polynomial expression $\frac{1}{2}x + 2$).
Okay, but. That last part? That relies on the definition of polynomial equality (two polynomials are equal if they are equal coefficient by coefficient) matching the definition of function equality, and that doesn’t always work!
Consider the field $F$ with two elements, $0$ and $1$, with addition and multiplication defined modulo $2$. That is, $0 + 0 = 0 = 1+1$ and $0+1 = 1 = 1+0$, while $0\cdot 1 = 1\cdot 0 = 0\cdot 0 = 0$ and $1\cdot 1 = 1$. We’re going to take our polynomial coefficients from this field.
As functions, $f(x) = x^2 + 1$ and $g(x) = x+1$ take elements of $F$ and assign them to elements of $F$ the same way. Indeed, $f(0) = g(0) = 1$ and $f(1) = g(1) = 0$. So, as functions, $f(x)$ and $g(x)$ are equal! But as polynomials, they are not, because when we do a coefficient comparison, we see that $f$ has a nonzero coefficient for $x^2$ while $g$ does not. (In this weird little example, we actually have that $f = g^2$.)
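For readers who like to check such things by machine, here is a minimal sketch (my own illustration, not from the column) that verifies all three claims at once: $f$ and $g$ agree as functions on $F$, disagree coefficient by coefficient, and satisfy $f = g^2$ in $F[x]$.

```python
# Compare f(x) = x^2 + 1 and g(x) = x + 1 over the two-element field F_2,
# where all arithmetic is done modulo 2.

def f(x):
    return (x * x + 1) % 2

def g(x):
    return (x + 1) % 2

# As functions on F_2 = {0, 1}, f and g agree at every input...
same_as_functions = all(f(x) == g(x) for x in (0, 1))

# ...but as polynomials their coefficient lists differ:
f_coeffs = [1, 0, 1]  # 1 + 0*x + 1*x^2
g_coeffs = [1, 1]     # 1 + 1*x

# In fact f = g^2 in F_2[x]: (x+1)^2 = x^2 + 2x + 1 = x^2 + 1 (mod 2).
def poly_mul_mod2(p, q):
    # multiply two coefficient lists, reducing each coefficient mod 2
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] = (r[i + j] + a * b) % 2
    return r

g_squared = poly_mul_mod2(g_coeffs, g_coeffs)
```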
It’s a fair question to ask why we would bother with a special definition of polynomial equality if it’s going to pull sneaky tricks like this. Polynomials, it turns out, are useful for more than just input-output assignments!
I will admit to taking a certain joy in teaching partial fraction decomposition in calculus. There are pedagogical arguments in favor and against including it in the syllabus, but my enthusiasm is epicurean: I like that students are seeing an example of a basis for a pretty weird vector space before they have taken linear algebra!
If your integrand contains the rational function $y = \frac{p(x)}{(x+2)^2(x^2 +1)}$, you may remember that you can decompose it by first performing polynomial long division to write $\frac{p(x)}{(x+2)^2(x^2+1)} = q(x) + \frac{r(x)}{(x+2)^2(x^2+1)}$ and then break the “reduced” rational function down into a sum of the form $$\frac{A}{x+2} + \frac{B}{(x+2)^2} + \frac{Cx}{x^2+1} + \frac{D}{x^2+1}.$$ This technique works because $\frac{1}{x+2}$, $\frac{1}{(x+2)^2}$, $\frac{x}{x^2+1}$, and $\frac{1}{x^2+1}$ form a basis for the vector space of rational functions with denominator $(x+2)^2(x^2 +1)$. One technique for finding the constants $A$, $B$, $C$, and $D$ is to use that rational functions are, well, functions! If two expressions really are the same functions, they’ll have the same input-output assignments. So, trying some handy values for $x$ (who doesn’t love $x = 0$?) gives us leverage to solve for the missing constants.
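To make the “plug in handy values” idea concrete, here is a hedged sketch with a numerator of my own choosing, $p(x) = x^3 + 1$ (degree less than the denominator, so the long-division step produces $q(x) = 0$): clearing denominators gives $A(x+2)(x^2+1) + B(x^2+1) + (Cx+D)(x+2)^2 = x^3+1$ for every $x$, so sampling four values of $x$ yields a $4 \times 4$ linear system, solved exactly here with rational arithmetic.

```python
# Partial fraction decomposition of (x^3 + 1)/((x+2)^2 (x^2+1)) as
# A/(x+2) + B/(x+2)^2 + Cx/(x^2+1) + D/(x^2+1),
# by sampling x = 0, 1, -1, -2 and solving the resulting linear system.
from fractions import Fraction

def row(x):
    # coefficients of A, B, C, D and the right-hand side at this sample x
    return [Fraction((x + 2) * (x * x + 1)),
            Fraction(x * x + 1),
            Fraction(x * (x + 2) ** 2),
            Fraction((x + 2) ** 2),
            Fraction(x ** 3 + 1)]

M = [row(x) for x in (0, 1, -1, -2)]

# Gaussian elimination over the rationals (no pivoting subtleties here)
n = 4
for col in range(n):
    piv = next(r for r in range(col, n) if M[r][col] != 0)
    M[col], M[piv] = M[piv], M[col]
    M[col] = [v / M[col][col] for v in M[col]]
    for r in range(n):
        if r != col and M[r][col] != 0:
            M[r] = [a - M[r][col] * b for a, b in zip(M[r], M[col])]

A, B, C, D = (M[r][n] for r in range(n))
```

With these values you can integrate each basis element separately, which is the whole point of the decomposition.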
More generally, many abstract mathematical objects have features that you might want to collect (or count), and you might want to stash your collection (or count) somewhere. For example, the characteristic polynomial $p(x) = |A - xI|$ of a matrix $A$ stashes the eigenvalues of $A$ in its linear factors! The entries in a Young tableau are stashed as the coefficients of a Schur polynomial (read all about it)! It’s a bonus when these polynomials turn out to be invariants, and even better when you can specialize to all sorts of other polynomial invariants, as you can with the Tutte polynomial of a graph (or link, or matroid, or…)!
In marshalling my resources for this blog post, I spent a little time reviewing the AMS Notices “What is…?” collection. There are some recent entries that piqued my interest and might pique yours, too: a column about multiple orthogonal polynomials and another about Sobolev orthogonal polynomials.
And, finally, Florian Cajori writes this entry in his A History of Mathematics (available for free through Project Gutenberg):
Maria Gaetana Agnesi (1718–1799) of Milan, distinguished as a linguist, mathematician, and philosopher, filled the mathematical chair at the University of Bologna during her father’s sickness. In 1748 she published her Instituzioni Analitiche, which was translated into English in 1801. The “witch of Agnesi” or “versiera” is a plane curve containing a straight line, $x = 0$, and a cubic, $(\frac{y}{c})^2 + 1 = \frac{c}{x}$.
It’s a terse entry for a rather remarkable person. Indeed, by 7 she had mastered Greek, Hebrew, and Latin (having already mastered French by 5); at 9, she defended higher education for women in her father’s salon. After her father died, she left her mathematical and scientific endeavors to care for dying women. Scientific American has a nice biography column for those who want to learn more!
Allechar Serrano López
Harvard University
The Jordan curve theorem is a result in topology that states that every Jordan curve (a plane simple closed curve) divides the plane into an “inside” region enclosed by the curve and an “outside” region. We can think of a plane simple closed curve as a closed loop that does not intersect itself. The theorem feels true: it intuitively makes sense, and we do not have to spend several minutes trying to convince ourselves it’s true. We have seen this theorem in action in our lives… this is why fences work, right?
In order to state the theorem in a formal manner, we need a definition:
A Jordan curve $C$ is a simple closed curve in $\mathbb{R}^2$. We can construct such a curve as the image of a continuous map $\phi: [0,1] \rightarrow \mathbb{R}^2$ such that:

(1) $\phi(0) = \phi(1)$, and

(2) the restriction of $\phi$ to $[0,1)$ is injective.
Here, condition (1) makes sure that we have a loop, and condition (2) ensures that our loop does not have any self-intersection points.
Then we can state the theorem as follows:
Let $C$ be a Jordan curve in the plane $\mathbb{R}^2$. Then its complement, $\mathbb{R}^2\setminus C$, consists of two connected components. One of these components is bounded (the interior) and the other one is unbounded (the exterior), and the curve $C$ is the boundary of each component.
So, the Jordan curve separates $\mathbb{R}^2$ into two pieces: an inside region (which has a finite area) and an outside region.
If we have a point on the plane and draw a closed curve, we can try to figure out whether the curve encloses the point, and if so, how many times it goes around it. The winding number of a closed curve around a given point is an integer that counts the number of times that the curve goes around the point. First, we give the curve an orientation; mathematicians chose a long time ago that counterclockwise is the way to go, so the winding number is positive if the curve encloses the point counterclockwise. If the curve does not encircle the point, then the winding number at that point is 0. The winding number of a Jordan curve around a point in its interior is 1 (or $-1$ if we travel in the other direction!).
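This counting can be done numerically. Below is a sketch (my own illustration, with a polygonal curve standing in for a smooth one, and assuming the point does not lie on the curve): accumulate the signed change in angle of the curve as seen from the point, and divide by $2\pi$.

```python
# Approximate the winding number of a closed polygonal curve around a point
# by summing signed angle increments, computed with atan2.
import math

def winding_number(vertices, point):
    px, py = point
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        a1 = math.atan2(y1 - py, x1 - px)
        a2 = math.atan2(y2 - py, x2 - px)
        d = a2 - a1
        # wrap each increment into (-pi, pi] so we count actual rotation
        while d <= -math.pi:
            d += 2 * math.pi
        while d > math.pi:
            d -= 2 * math.pi
        total += d
    return round(total / (2 * math.pi))

# A counterclockwise square is a Jordan curve: winding number 1 inside, 0 outside.
square = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
```

Traversing the same square clockwise flips the sign, exactly as described above.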
While the definition of winding numbers seems straightforward, they come up in advanced undergraduate- and graduate-level courses like differential geometry and complex analysis. In complex analysis, they came to haunt me disguised as line integrals. In fact, Stokes’ theorem and the residue theorem are related to the Jordan curve theorem.
I was first introduced to the Jordan curve theorem in my algebraic topology class. Like most definitions and theorems, it was introduced in clinical detail: as part of a long list of results, with no mention of context or why anyone would care about it. However, mathematics presents itself in different ways to different groups of people, so I would like to discuss the Jordan curve theorem as part of an intrinsic human activity: storytelling.
The Chokwe people live in Southwestern Africa and are known for their art, which they also employ in their storytelling. They have a tradition of drawing figures in the sand, known as lusona (plural: sona), to illustrate their stories. The sona illustrate fables, games, riddles, proverbs, and stories. Each lusona starts with a series of evenly spaced dots in a rectangular array, and the drawing consists of lines weaving in and out around the dots.
The storyteller draws and narrates simultaneously while keeping the audience engaged. The sona and the stories accompanying them played an important role in the passing down of knowledge and traditions from one generation to the next, but many of them were lost due to colonization and slavery. What we know about the sona today comes from documentation kept by missionaries.
The mukanda is a rite of passage for boys into adulthood, and it begins when the chief of a Chokwe village and his counselors decide that there is a sufficiently large group of children to carry out the rite. The mukanda is a camp enclosed by a fence with huts for the boys; the length of the stay at the camp varies from one year to three years. In the mukanda, they learn rituals, stories, and how to make masks, and they can return home after the prescribed education is complete. Kalelwa, a spirit incarnated by a mask of the same name, gives the signal for the comings and goings from the mukanda, and mothers are not allowed to see their sons while they are going through the rite of passage.
There are several sona referring to the mukanda. One lusona, a continuous closed curve with no self-intersections, accompanies a story in which the line of dots represents the children involved in the rite of passage, the two higher dots are the guardians of the camp, and the lower dots represent people who are not involved in the ceremony. The children and the guardians are inside the camp, and so they are in the bounded connected component, while people not participating in the ceremony are in the unbounded connected component.
The mukanda exemplifies a topological concern of the Chokwe: distinguishing between the inside of the mukanda (where the children are) and the outside world. That is, they needed to determine two regions with a common boundary, which is exactly the issue that the Jordan curve theorem addresses!
Bill Casselman
University of British Columbia
Commercial transactions on the internet are invariably passed through a process that hides them from unauthorized parties, using RSA public key encryption (named after its inventors Rivest, Shamir, and Adleman) to keep details best kept hidden away from prying spies.
I’ll remind you how this works, in very basic terms. Two people want to communicate with each other. The messages they exchange pass over a public network, so that they must assume that anybody at all can read them. They therefore want to express these messages—to encrypt them—in such a way that they are nonsense, for all practical purposes, to those not in their confidence.
The basic idea of RSA, and perhaps all public key encryption schemes, is that each person—say A— willing to receive messages chooses a set of private keys and computes from these a set of public keys, which are then made available at large. The point is that computing the public key from the private one is straightforward, whereas going from public to private is very, very difficult. Anyone—say B—who wishes to send A a message uses the public key to encode a message, before he sends it. Once it is in this enciphered form it can—one hopes!—be reconstituted only by knowing A’s private keys.
In this, and perhaps in all public key encryption systems, messages are made into a sequence of integers, and the enciphered message is then computed also as a sequence of integers. The difficulty of reading these secret messages depends strongly on the difficulty of carrying out certain mathematical tasks.
In the RSA scheme, one of the two components in the public key data a person publishes is the product of two large prime numbers, and reading a message sent using this key requires knowing these factors. As long as such factoring is impractical, only person A can do this. This factoring has been so far a very difficult computational problem, but as Tony Phillips explained in two earlier columns, it seems very possible that at some point in the not-so-distant future quantum computers will be able to apply impressively efficient parallel computational strategies to make it easy. (If it doesn’t become possible, it won’t be because a lot of clever people aren’t trying. Quantum computing will have benefits beyond the ability to read other people’s mail.)
In order to deal with this apparently looming threat, the NIST (National Institute of Standards and Technology) has optimistically engaged in what it calls a “Post-Quantum Cryptography (PQC) standardization process”, with the aim of finding difficult mathematical problems even quantum computers cannot handle. Preliminary results were announced in the summer of 2022. Mathematically, the most intriguing of the new proposals use lattices for message encryption. Lattices have been a topic of great theoretical interest since the late 18th century. That’s what this column is about.
Suppose $u$ and $v$ to be two vectors in the plane. The lattice they generate is the set of all vectors of the form $au +bv$ with integers $a$, $b$. These make up a discrete set of points, evenly spaced.
The choice of vectors $u$, $v$ partitions the whole plane into copies of the parallelogram they span:
But there are many other pairs of vectors that generate the same lattice. In this figure, the pair $U = 4u + v$, $V = 5u+v$:
Geometrically, a generating pair has the property that the parallelogram they span doesn’t contain any lattice points inside it.
There is a simple way to generate such pairs. Let $T$ be a double array of integers
$$ T = \left[ \matrix { a & b \cr c & d \cr } \right ] \hbox{ for example } \left[ \matrix { 4 & 1 \cr 5 & 1 \cr } \right ] \, . $$
This is a matrix of size $2 \times 2$. Impose the condition that its determinant $ad-bc$ be $\pm 1$ (as it is here). Then $au+bv$, $cu+dv$ will generate the same lattice, and all generating pairs are found in this way.
The point of requiring determinant $\pm 1$ is that the relationship is reversible. When the determinant is $1$, from
$$ U = au + bv, \qquad V = cu + dv $$
we can solve to get
$$ u = dU - bV, \qquad v = -cU + aV \, , $$
as you can check by substitution.
These things are best dealt with as matrix equations:
$$ \left[ \matrix { U \cr V \cr } \right ] = \left[ \matrix { a & b \cr c & d \cr } \right ]\left[ \matrix { u \cr v \cr } \right ] , \qquad \left[ \matrix { u \cr v \cr } \right ] = \left[ \matrix { a & b \cr c & d \cr } \right ]^{-1}\left[ \matrix { U \cr V \cr } \right ] = \left[ \matrix { \phantom{-}d & -b \cr -c & \phantom{-}a \cr } \right ]\left[ \matrix { U \cr V \cr } \right ] \, , $$ the last equality holding when $ad - bc = 1$. In the example, where the determinant is $-1$, $$ \left[ \matrix { u \cr v \cr } \right ] = \left[ \matrix { -1 & \phantom{-}1 \cr \phantom{-}5 & -4 \cr } \right ]\left[ \matrix { U \cr V \cr } \right ] , \quad u = V - U, \; v = 5U - 4V \, , $$
so that $u$ and $v$ are in the lattice generated by $U$, $V$.
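This change of basis is easy to check by machine. The sketch below (my own) verifies that the matrix $T$ from the example is unimodular and that its inverse again has integer entries, which is exactly why the two pairs generate the same lattice.

```python
# Check that T = [[4, 1], [5, 1]] has determinant -1 and an integer inverse.

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2_unimodular(m):
    # exact inverse of an integer 2x2 matrix with determinant +1 or -1;
    # the divisions are exact, so the result has integer entries
    d = det2(m)
    assert d in (1, -1)
    return [[m[1][1] // d, -m[0][1] // d],
            [-m[1][0] // d, m[0][0] // d]]

def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

T = [[4, 1], [5, 1]]
Tinv = inv2_unimodular(T)
```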
Every vector $w$ in the plane is a linear combination $au +bv$ of $u$ and $v$, with $a$, $b$ arbitrary real numbers. If
$$ \eqalign{ w &= \left[ \matrix { x & y \cr } \right] \cr u &= \left[ \matrix { u_{x} & u_{y} \cr } \right] \cr v &= \left[ \matrix { v_{x} & v_{y} \cr } \right] \, . \cr } $$
then we write
$$ \eqalign { w &= au + bv \cr x &= a u_{x} + b v_{x} \cr y &= a u_{y} + b v_{y} \cr } $$
which means that in order to find what $a$ and $b$ are, we have to solve two equations in two unknowns. The efficient way to do this is to write it as a matrix equation
$$ \left[ \matrix { x & y \cr } \right] = \left[ \matrix { a & b \cr } \right] \left[ \matrix { u_{x} & u_{y} \cr v_{x} & v_{y} } \right] $$
which gives us
$$ \left[ \matrix { a & b \cr } \right] = \left[ \matrix { x & y \cr } \right] \left[ \matrix { u_{x} & u_{y} \cr v_{x} & v_{y} } \right]^{-1}. $$
If $a$ and $b$ are integers, then $w$ will be in the lattice generated by $u$ and $v$, but otherwise not.
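This membership test is easy to automate. Here is a sketch (mine, using exact rational arithmetic so the integrality test is honest) that inverts the $2 \times 2$ system by hand:

```python
# Solve [x y] = [a b] * M for the row vector [a b], where the rows of M are
# the generators u and v; w is in the lattice exactly when a and b are integers.
from fractions import Fraction

def coords(w, u, v):
    # invert M = [[ux, uy], [vx, vy]] exactly over the rationals
    ux, uy = u
    vx, vy = v
    d = Fraction(ux * vy - uy * vx)
    x, y = w
    a = (x * vy - y * vx) / d
    b = (y * ux - x * uy) / d
    return a, b
```

For example, with $u = (5,1)$, $v = (-3,7)$ (the generators used below), the vector $(7,9) = 2u + v$ tests as a lattice point, while $(1,0)$ does not.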
Many of the new encryption methods rely on a problem that is easy to state, but not quite so easy to solve: Given a lattice in the plane and a point $P$ in the plane, what is the point in the lattice closest to $P$?
The difficulty of answering this question depends strongly on how the lattice is specified. The best situation is that in which it is given by two generators that are nearly orthogonal. The parallelogram they span can be used to partition the plane, and every point in the plane will belong to a copy of this basic one.
For example, suppose that the lattice is generated by $u = \left[\matrix { 5 & 1 \cr } \right]$, $v = \left[\matrix {-3 & 7 \cr } \right] $. Let $P = \left[\matrix { 9 & 10 \cr } \right]$. The formula from the previous section tells us that $P = a u + b v$ with
$$ \left[\matrix { a & b \cr } \right] = \left[\matrix { 9 & 10 \cr } \right]\left[\matrix {\phantom{-}5 & 1 \cr -3 & 7 \cr } \right]^{-1} \sim \left[\matrix { 2.45 & 1.08 \cr } \right] \, . $$
The important fact now is that the nearest point in the lattice will be one of the four nearest vertices of that copy. Finding which one is quite quick. In the example just above, the nearest lattice point is $2u + v = \left[\matrix { 7 & 9 \cr } \right]$.
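Putting the pieces together, here is a sketch of the search (mine, following the four-corner observation above): compute the real coordinates $(a, b)$ of $P$ in the basis $u$, $v$, then compare the four surrounding lattice points.

```python
# Nearest lattice point to P in the lattice generated by u and v, assuming
# (as in the text) that the nearest point is a corner of P's parallelogram.
import math

def nearest_lattice_point(P, u, v):
    ux, uy = u
    vx, vy = v
    d = ux * vy - uy * vx
    x, y = P
    # real coordinates of P in the basis u, v
    a = (x * vy - y * vx) / d
    b = (y * ux - x * uy) / d
    # the four corners of the parallelogram containing P
    candidates = [(ca, cb)
                  for ca in (math.floor(a), math.ceil(a))
                  for cb in (math.floor(b), math.ceil(b))]
    ca, cb = min(candidates,
                 key=lambda t: (x - t[0] * ux - t[1] * vx) ** 2
                             + (y - t[0] * uy - t[1] * vy) ** 2)
    return (ca * ux + cb * vx, ca * uy + cb * vy)
```

Run on the example above, it returns $2u + v = (7, 9)$.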
In the figure on the left (called a Voronoi diagram), each point in the lattice is surrounded by the region of points closest to it. On the right, this is overlaid by the span of good generators. You can verify by looking that what I write is true.
Thus, if the lattice is specified in terms of nearly orthogonal generators, we are in good shape. (In fact, this could be taken as the definition of ‘nearly orthogonal’.) But suppose we are given a pair of generators that are nowhere near orthogonal. We can still find which parallelogram we are in, and we can find the nearest vertex of that parallelogram easily, but now that vertex may be far from the nearest lattice point.
To summarize: we can answer the question about the nearest lattice point, if we have in hand a good pair of generators of the lattice, but if not we can expect trouble. We shall see in a moment what this has to do with cryptography.
There are several different ways to send encrypted messages using lattices. The simplest is known as GGH, after the authors (Goldreich, Goldwasser, and Halevi) of the paper that originated it. It has turned out that it is not secure enough to use in its basic form, but the mathematics it incorporates is also used in other, better schemes, so it is worth looking at.
All schemes for public key messaging depend on some way to transform messages into arrays of integers. In my examples, I’ll use the Roman alphabet and the standard ASCII code (acronym of American Standard Code for Information Interchange). This assigns to each letter and each punctuation symbol a number in the half-open range $[0, 256 = 2^{8})$.
With this encoding, the message “Hello!” becomes the sequence 72 101 108 108 111 33 . But I’ll make this in turn into a sequence of larger numbers by assembling them into groups of four, and then making each group a number in the range $[0, 4294967296=2^{32})$:
$$ \eqalign { 1,819,043,144 &= 72 + 101\cdot 2^8 + 108\cdot 2^{16} + 108 \cdot 2^{24} \cr 8,559 &= 111 + 33 \cdot 2^8 \, . \cr } $$
For the moment, I’ll look only at messages of $8$ characters, so this gives us an array $[m, n]$ of two large integers, here $\left[ \matrix{1819043144 & 8559} \right]$.
Of course the array $[m, n]$ can be turned back into the original message very easily. For example, if you divide $1,819,043,144$ by $256$ you get a remainder of $72$, which you convert to “H”. Continue:
$$ \eqalign { 1819043144 &= 72 + 256\cdot 7105637 \cr 7105637 &= 101 + 256 \cdot 27756 \cr 27756 &= 108 + 256\cdot 108 \cr 108 &= 108\, . \cr } $$
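Here is a sketch of this packing and unpacking in Python (my own; it assumes plain ASCII text with no null bytes, since unpacking stops when the quotient reaches zero):

```python
# Pack ASCII text into integers below 2**32, four bytes per integer
# (low byte first within each group), and unpack by repeated division by 256.

def pack(message):
    data = message.encode("ascii")
    return [sum(b << (8 * i) for i, b in enumerate(data[k:k + 4]))
            for k in range(0, len(data), 4)]

def unpack(numbers):
    chars = []
    for n in numbers:
        while n > 0:
            chars.append(chr(n % 256))
            n //= 256
    return "".join(chars)
```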
We want to transform this un-secret message into one that is a bit more secret. There are three steps to making this possible: (1) The person—say A—who is to receive messages makes up and posts some public data needed by people who want to write to her. (2) A person—say B—who writes must use these data to make up a publicly unreadable message. (3) Person A applies her private key to read the message.
Some preliminary work has to be done. A person—say A—who wants to receive messages first chooses a pair of vectors in the plane with integral coordinates. These shouldn’t be too small, and they should be nearly orthogonal. For example, say $ u = \left[\matrix { 5 & 1 \cr } \right]$, $v = \left[\matrix {-3 & 7 \cr } \right] $, and in a picture:
They generate a lattice, made up of all integral linear combinations $a u + bv $ with $a$ and $b$ integers:
As we have seen, the vectors $u$ and $v$ are a good set of generators of the lattice. But, also as we have seen, there are lots of other pairs of vectors that generate it. For maximum security, she should choose two that are not at all orthogonal:
This pair make up A’s public key, which fits into a matrix:
$$ W = \left[ \matrix {22 & 12 \cr 17 & 11 \cr } \right ] $$
When person B wants to send a message to A, he first encodes it as an array $m$ of two integers in the range $[0, 2^{32})$. He then looks up A’s key $W$, which is a $2 \times 2$ matrix. He also chooses a small random vector $r$, computes
$$ M = m \cdot W + r $$
and sends it to A.
Why do we expect that no unauthorized person will be able to reverse this computation? The vector $m \cdot W$ is in A’s lattice. Since $r$ is small, $m \cdot W$ is the vector in the lattice that is closest to $M$. But since we don’t know a good, nearly orthogonal generating pair for the lattice, computing the closest point, as we have seen, is not quite trivial.
But $A$ does have in hand a good generating pair of vectors, and she can therefore compute $m\cdot W$, then $m$, quite easily.
Here is an outline of how things go, with messages of any length.
Let’s take up again the earlier example, in which A’s secret key was the matrix
$$ T = \left[ \matrix { \phantom{-}5 & 1 \cr -3 & 7 } \right ] \, . $$
She chooses to multiply it by
$$ U = \left[ \matrix { 5 & 1 \cr 4 & 1 } \right ] $$
to get her public key
$$ W = UT = \left[\matrix { 22 & 12 \cr 17 & 11 \cr } \right] \,. $$
Suppose B wants to tell her “Hello!”. As we have seen, applying ASCII code produces as raw text the integer array $m = \left[ \matrix{1819043144 & 8559} \right]$. He multiplies it on the right by $W$, and adds a small vector $r$:
$$ M = m \cdot W + \left[ \matrix{2 & -1} \right] = \left[\matrix { 40019094673 & 21828611876 \cr } \right] \, . $$
Admittedly, this $r$ is a bit small. I should say, the role of $r$ is crucial, since without it this whole business just amounts to a really complicated substitution cipher of the kind read by Sherlock Holmes.
When $A$ gets this, she calculates $$ M \cdot T^{-1} = \left [ \matrix { 9095249956.29 & 1819051702.82 } \right ] \, , $$ which she rounds (correctly) to $m \cdot U = \left [ \matrix { 9095249956 & 1819051703 } \right ]$. Multiplying by $U^{-1} = \left[ \matrix{ \phantom{-}1 & -1 \cr -4 & \phantom{-}5 } \right]$ recovers $\left [ \matrix { 1819043144 & 8559 } \right ]$, which translates to “Hello!”.
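The whole exchange is small enough to replay by machine. Below is a hedged sketch (mine) of the round trip for the protocol $M = m \cdot W + r$, decrypted by rounding $M \cdot T^{-1}$ and then undoing $U$; the tiny $r$ and two-dimensional lattice are illustrative only.

```python
# Toy GGH-style round trip with the column's matrices, using exact rational
# arithmetic for the inverse so the rounding step is trustworthy.
from fractions import Fraction

T = [[5, 1], [-3, 7]]     # A's secret, nearly orthogonal generators (rows)
U = [[5, 1], [4, 1]]      # mixing matrix with determinant 1
W = [[sum(U[i][k] * T[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]   # public key W = U*T

def vecmat(v, M):
    # row vector times 2x2 matrix
    return [v[0] * M[0][0] + v[1] * M[1][0],
            v[0] * M[0][1] + v[1] * M[1][1]]

# B encrypts: M = m*W + r
m = [1819043144, 8559]    # "Hello!" packed as two integers
r = [2, -1]               # small error vector
Mmsg = [a + b for a, b in zip(vecmat(m, W), r)]

# A decrypts: M*T^{-1} is m*U plus the tiny perturbation r*T^{-1};
# round to the nearest integers, then undo U.
detT = Fraction(T[0][0] * T[1][1] - T[0][1] * T[1][0])
Tinv = [[T[1][1] / detT, -T[0][1] / detT],
        [-T[1][0] / detT, T[0][0] / detT]]
mU = [round(c) for c in vecmat(Mmsg, Tinv)]
Uinv = [[1, -1], [-4, 5]]  # integer inverse of U
recovered = vecmat(mU, Uinv)
```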
We have seen that reading messages in this business amounts to finding closest lattice points, and that this in turn depends on finding good generators of a lattice. It turns out that in low dimensions there are extremely efficient ways to do this. It is exactly this problem that arose in the classification of integral quadratic forms, a branch of number theory. Finding closest vectors for arrays of length $2$ is particularly easy, thanks to a well known ‘reduction algorithm’ due to the eighteenth century French mathematician Lagrange. It was later taken up in the work of Carl Friedrich Gauss, to whom it is sometimes attributed.
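Here is a sketch of that two-dimensional reduction algorithm (my own implementation of the classical recipe): repeatedly subtract the best integer multiple of the shorter generator from the longer one, swapping so the shorter always comes first, until no step shortens the basis further.

```python
# Lagrange (often "Gauss") reduction of a 2D lattice basis; assumes the two
# input vectors are linearly independent.

def lagrange_reduce(u, v):
    def norm2(w):
        return w[0] * w[0] + w[1] * w[1]
    if norm2(u) > norm2(v):
        u, v = v, u
    while True:
        # nearest integer to the projection coefficient (v . u)/(u . u)
        dot = u[0] * v[0] + u[1] * v[1]
        m = round(dot / norm2(u))
        v = (v[0] - m * u[0], v[1] - m * u[1])
        if norm2(v) >= norm2(u):
            return u, v
        u, v = v, u
```

Fed the lopsided public basis from the earlier example, it recovers the nearly orthogonal secret one.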
A similar algorithm for arrays of length $3$ was found by the nineteenth century German mathematician Gotthold Eisenstein (which seems to have been incorporated in the computer algebra system SageMath). There are very general algorithms known for all dimensions, but they run more and more slowly as the length of arrays increases, and are not really practical.
Sadly, even for long messages the GGH scheme was shown by Phong Nguyen not to be very secure. But schemes that do depend on the difficulty of finding good generating sets of lattice vectors in high dimensions still look quite plausible.
Further reading:
- Are quantum computers really in sight?
- Chapter 7 is about lattices.
- Lattices in higher dimensions are an active topic of research.
Anil Venkatesh
Adelphi University
Everyone knows that multiplication is iterated addition in the sense that $2 \times 3 = 2 + 2 + 2$. Similarly, exponentiation is generally introduced as iterated multiplication since $2^3 = 2 \times 2 \times 2$. What’s the name for the operation of iterated exponentiation? In 1947, R.L. Goodstein proposed the term tetration for the binary operation
\[(a, n) \mapsto \underbrace{a^{a^{\cdot^{\cdot^{\cdot^a}}}}}_{n}\]
where the power tower is evaluated by convention from right to left. Goodstein derived this name from the Greek word for four by thinking of addition as the first operation, multiplication as the second, and exponentiation as the third. He also coined pentation, hexation, and so on for even higher operations in the hierarchy, although it is difficult to write these down without some new notation.
Three decades later, Donald Knuth introduced up-arrow notation that made Goodstein’s operations easier to conceptualize. Putting $2 \uparrow 3 = 2^3$, we let $2 \uparrow\!\uparrow 3 = 2 \uparrow (2 \uparrow 2)$ represent the tetration of 2 by 3. It can be helpful here to slightly reframe the idea of iterated operations. Instead of thinking of $2 \times 3$ as three 2’s added together, let’s think of multiplication as an inductive process: compute $2 \times 2$ and add one final 2 afterwards. By extension, we write $a \uparrow^{n} b = a \uparrow^{n-1} (a \uparrow^{n} (b-1))$. Just how big is $2 \uparrow\!\uparrow b$? The table below shows the first several tetrations of 2.
| $b$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $2 \uparrow\!\uparrow b$ | 2 | 4 | 16 | 65,536 | $2^{65536} \approx 10^{19728}$ |
Challenge: Try writing down $2 \uparrow\!\uparrow 6$ in scientific notation!
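The recursion $a \uparrow^{n} b = a \uparrow^{n-1} (a \uparrow^{n} (b-1))$ is short enough to transcribe directly (a sketch of mine; Python's big integers make the small cases painless, though anything much past the table above is hopeless to evaluate):

```python
# Knuth's up-arrow recursion, with base cases a ↑^1 b = a**b and a ↑^n 1 = a.

def up_arrow(a, n, b):
    if n == 1:
        return a ** b
    if b == 1:
        return a
    return up_arrow(a, n - 1, up_arrow(a, n, b - 1))

# the first few tetrations of 2, as in the table above
tetrations = [up_arrow(2, 2, b) for b in range(1, 5)]
```

As a bonus, the same function computes a small pentation: $2 \uparrow\!\uparrow\!\uparrow 3 = 2 \uparrow\!\uparrow 4 = 65{,}536$.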
The idea of hyperoperations formed by iteration is solidly motivated by the relationship of multiplication to addition. In fact, we can even identify the successor operation $(a, b) \mapsto a+1$ (which ignores its second argument) as the operation prior to addition in the hierarchy. However, a few key arithmetical properties seem to go missing once we pass beyond multiplication. The higher operations are neither associative nor commutative; they are merely evaluated right-to-left by convention. They also fail to distribute over the immediately previous hyperoperation. Lastly, the hyperoperations of Goodstein lack an obvious extension to $\mathbb{R}$ or $\mathbb{C}$, as well as identity and inverse elements. To gain insight here, we need to wind the clock back 30 years before Goodstein to the time of Albert A. Bennett.
Challenge: Double-check that $\uparrow^{n}$ doesn’t distribute over $\uparrow^{n-1}$ for $n\geq 1$.
In 1915, the paper “Note on an Operation of the Third Grade” by Albert A. Bennett appeared in the Annals of Mathematics. A terse two-page note, it was largely neglected until the early 2000s and its contribution remains much less studied than Goodstein’s hyperoperations. In this comment thread on Math Stack Exchange, super-contributor Dave L. Renfro claims to have raised Bennett’s work from obscurity in 2001 or 2002 when he encountered it by chance while flipping through old volumes at a university library. What a find!
It seems that Bennett’s paper is the earliest known foray into the idea of operations beyond exponentiation. Bennett observed that for positive real numbers, multiplication can be rephrased in terms of addition like so: $a \times b = e^{\log(a) + \log(b)}$. He went on to examine the operation $\star_2: (a,b) \mapsto e^{\log(a) \times \log(b)}$, which he noted is an associative, commutative binary operation on the positive real numbers. Also, we can check that $\star_2$ distributes over multiplication:
\begin{align*}
a \star_2 (b \times c) &= e^{\log(a) \log(b \times c)} \\
&= e^{\log(a)(\log(b) + \log(c))} \\
&= e^{\log(a)\log(b) + \log(a)\log(c)} \\
&= (a \star_2 b) \times (a \star_2 c).
\end{align*}
Challenge: Double-check that $a \star_2 b = a^{\log b} = b^{\log a}$. Then show this to a mathematical friend. If your friends are like mine, they will be surprised that this is allowed.
Let $\star_0$ represent addition and $\star_1$ represent multiplication. Bennett then defines $a \star_n b = e^{\log(a) \star_{n-1} \log(b)}$ for positive integers $n$ and whatever real values of $a$ and $b$ make sense, yielding a hierarchy of commutative hyperoperations akin to but distinct from those of Goodstein. Unlike Goodstein’s hierarchy, the hyperoperations of Bennett extend infinitely below addition as well as above. We can define $a \star_{-1} b = \log(e^a + e^b)$, sometimes known as the smooth max function for its tendency to highlight the larger of the two inputs as shown in the figure below. For any negative integer $n$, we can define $a \star_{n} b = \log(e^a \star_{n+1} e^b)$, extending the hierarchy of operations in the opposite direction too.
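Under the obvious reading of these definitions, the whole two-sided hierarchy fits in a few lines (a numerical sketch of mine, using floating point, with inputs restricted so the logarithms make sense):

```python
# Bennett's commutative hyperoperations: a ⋆_n b = exp(log(a) ⋆_{n-1} log(b))
# above multiplication, and a ⋆_n b = log(exp(a) ⋆_{n+1} exp(b)) below addition.
import math

def star(n, a, b):
    if n == 0:
        return a + b          # addition
    if n == 1:
        return a * b          # multiplication
    if n > 1:
        return math.exp(star(n - 1, math.log(a), math.log(b)))
    return math.log(star(n + 1, math.exp(a), math.exp(b)))
```

Numerically, $\star_2$ agrees with $a^{\log b}$ and distributes over multiplication, and $\star_{-1}$ behaves as the smooth max.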
Challenge: What is the domain of $\star_n$? For which $n$ does $\star_n$ have an identity element? Which elements have an inverse under $\star_n$?
Surprisingly, we have that $\star_n$ distributes over $\star_{n-1}$ for all $n$. The upshot of Bennett’s work is that the exponential function gives us a hierarchy of associative, commutative operations that extend the distributive relationship of multiplication over addition infinitely in both directions. This immediately raises the question: are Bennett’s operations the only commutative hyperoperations out there?
To ask whether Bennett’s hierarchy is unique is to ask a question about the set of all binary operations (on $\mathbb{R}$, $\mathbb{C}$, or suitable subset thereof). That’s an intimidatingly rich set to sift through! Thankfully, our interest in associative, commutative operations helps to winnow the search space. Let’s drill down even further by asking a simpler question: “other than multiplication, what operations distribute over addition?”
Suppose $\star$ is a candidate alternative to multiplication. It’s fair to expect that $\star$ extends to the complex numbers, so there is an abelian group $(G, \star)$ that can be embedded into $\mathbb{C}$ in some way. Let’s add one more property of $\star$ to our wishlist: differentiability. Just as the linear map $x \mapsto 2x$ is differentiable with a derivative of $2$, we want the map $x \mapsto 2 \star x$ to be (infinitely) differentiable. When a group law is differentiable like this, it turns the group into a smooth geometric object called a Lie group. Since we want $\star$ to be commutative, our Lie group $(G, \star)$ is abelian.
At this point, we’ve piled on so many different properties of $\star$ that a special situation arises: while Lie groups are intricate, subtle objects in general, abelian Lie groups are all built out of addition in a certain sense. In our case, what this means is that there’s a surjective group homomorphism $f: (\mathbb{C}, +) \to (G, \star)$ so that
\begin{align*}
f(a) \star f(b) &= f(a+b)\textrm{, hence} \\
a \star b &= f(f^{-1}(a) + f^{-1}(b)).
\end{align*}
Wait a minute! I only said $f$ was surjective, not bijective. So how come I get to use $f^{-1}(a)$? Here, that notation means preimage, i.e., the set of all complex numbers that $f$ maps onto $a$. The upshot is that $\star$ is built out of addition suitably combined with a mystery function $f$. We’ll now see that there are very few options for this mystery function.
The function $f$ is an example of a covering map that relates the flat space $(\mathbb{C}, +)$ to the space $(G, \star)$. The word “covering” is a geometric metaphor for surjectivity, i.e., the fact that every point in $G$ is “covered” by a point in $\mathbb{C}$. If $\star$ is just standard multiplication, the group $G$ is $\mathbb{C} \backslash \{0\}$ and $f(z) = \exp(z)$, the familiar exponential map. The figure above shows how the complex plane can be twisted into a helicoid shape, thereby covering every point in $\mathbb{C} \backslash \{0\}$ infinitely many times. In this figure, the function $f$ is the projection that squashes the helicoid flat onto the punctured plane, like compressing a spring.
The only degree of freedom among universal covering maps $\mathbb{C} \to \mathbb{C} \backslash \{0\}$ is where to send the identity element. In the case of standard multiplication on $\mathbb{C} \backslash \{0\}$, we have $f(0) = 1$ because $f(z) = \exp(z)$. If we instead put $f(z) = \alpha \exp(z)$, this induces the group law $a \star b = \frac{1}{\alpha} ab$, which is just multiplication but with multiplicative identity $\alpha$ instead of 1.
Another option is to put $G = \mathbb{C}$. This imposes very strict limitations on $f$, however. Since $\mathbb{C}$ is its own universal covering, we aren’t allowed to twist it onto itself at all so $f$ has to be injective as well as surjective. The only functions with this property are of the form $f(z) = \alpha z + \beta$.
Challenge: Show that when $G = \mathbb{C}$ we must have $a \star b = a + b - \beta$ for some constant $\beta \in \mathbb{C}$. Then confirm that this definition of $\star$ fails to distribute over addition.
And that’s it! The only other option for $G$ is a torus, but this can’t be embedded into the complex plane so there’s no way to import its group law into $\mathbb{C}$. The upshot is that multiplication is the only operation that distributes over addition, at least if you care about things like associativity, commutativity, and complex-differentiability. Open questions may result if one or more of these conditions is relaxed!
One last challenge for the road: We’ve shown that the exponential map (more or less) uniquely encodes the distributive property of multiplication over addition. Does this imply that Bennett’s hierarchy of commutative hyperoperations is similarly unique?
David Austin
Grand Valley State University
In 2009, crypto miner James Howells of southern Wales mistakenly threw away a hard drive containing 8,000 bitcoins. That’s over $100 million even in today’s sinking crypto landscape. Thirteen years later, Howells has a plan, backed by venture capital, to search the 110,000 tons of garbage in the local landfill using robotic dogs and artificial intelligence.
This is a fairly common problem. Not searching landfills for cryptocurrency, of course, but searching for something in a vast sea of possibilities. That’s what we’ll be doing in this month’s column.
Let’s begin by breaking a code. More specifically, suppose we have a message encoded with a substitution cipher where each character in the message is substituted with another. To keep things simple, we’ll just imagine that the letters of the alphabet are permuted so that whenever we see an “a”, say, we write “m” instead. Here’s our message:
ziatwxafbssuladwafb uanubeablwdfunaywwmabywxdakfnzgdwsfunan wyzlatwxarbtanururyunadfbdafuawlkuafbeabagoblawnadfuagoblaf beakfnzgdwsfunanwyzlazaewldamlwoaofzkfableadfbdafuaxgueadwa kbjjadfzgagoblaswwfaadfbdaobgabajwlhadzruabhwableaofulaouag bzeahwweytuaouadwwmadfualbruaozdfaxgabgaouaezeldadfzlmadfua goblaowxjeaobldazdabltarwnuaoujjaofulaueobneayubnagbzeadfbd afuaowxjeajzmuablauqkzdzlhalbruabjjadwafzrgujiakfnzgdwsfuna nwyzlagbzeabdawlkuaozdfwxdagdwsszlhadwadfzlmadfbdafuaobgaoz llzudfuswwfableafuaobgagwabgazafb uauqsjbzlueadfuaswwfasbnd azaozjjalwoauqsjbzladfuanugdawiazd
Our task is to find the permutation that decodes the message. Of course, we could generate every permutation and look for one that generates something readable, just as Mr. Howells could turn over every piece of rubbish in the landfill. However, since there are 27 characters, the 26 letters of the alphabet along with a space, there are
$$
27! = 10,888,869,450,418,352,160,768,000,000 \approx 1.1\times 10^{28}
$$ permutations. That’s more than the number of seconds in the lifetime of the universe. We clearly need a better strategy to search through the permutations.
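The count itself is easy to verify, since Python integers have arbitrary precision:

```python
import math

# The number of permutations of 27 symbols (26 letters plus a space).
n = math.factorial(27)
print(n)  # 10888869450418352160768000000, about 1.1 x 10^28
```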
We know that letters in an English text occur at different frequencies. For instance, in Project Gutenberg’s translation of War and Peace, about 18% of the very large number of characters are spaces, 10% are “e”, 7% are “t” and so forth. In our encoded message, about 20% of the characters are “a” so it seems reasonable to suppose that “a” represents a space, about 8% are “u”, which probably represents an “e”, and so forth. This generates a permutation, which when applied to our message gives:
hk pow ntccei ao ntxe detl tioaned mooy tmowa gndhraocned d omhi pow ftp defefmed anta ne oige ntl t rsti od ane rsti n tl gndhraocned domhi h loia yios snhgn til anta ne wrel ao gtuu anhr rsti coon anta str t uoib ahfe tbo til snei se r thl boolmpe se aooy ane itfe shan wr tr se lhlia anhiy ane rsti sowul stia ha tip fode seuu snei elstdl metd rthl anta ne sowul uhye ti evghahib itfe tuu ao nhfreuk gndhraocned d omhi rthl ta oige shanowa raocchib ao anhiy anta ne str shi iheanecoon til ne str ro tr h ntxe evcuthiel ane coon ctda h shuu ios evcuthi ane dera ok ha
Hmmm. Rather than using just the frequency of single characters, perhaps we could look at the frequency of bigrams, pairs of consecutive characters. To that end, let $T(x,y)$ be the probability that $x$ is followed by $y$ in English text. For instance, we expect $T(\text{q}, \text{u})$ to be relatively large while $T(\text{q}, \text{b})$ should be relatively small, if not zero.
Suppose that our encoded message consists of the sequence of characters $c_i$ and that $\pi$ is the permutation of the symbols that decodes the message. If $c_ic_{i+1}$ is a bigram in the message, then $\pi(c_i)\pi(c_{i+1})$ is a bigram in a piece of English text, which means that $T(\pi(c_i), \pi(c_{i+1}))$ should be relatively high. This allows us to measure how likely a permutation is as a decoding key by defining the plausibility of a permutation to be
$$
P(\pi) = \prod_{i} T(\pi(c_i),\pi(c_{i+1})).
$$ Permutations with higher plausibilities are more likely to decode the message.
Now our task is to search through the $27!$ permutations looking for ones that are highly plausible. The following algorithm generates a sequence of increasingly plausible permutations $\pi_n$: begin with some permutation $\pi_0$; at each step, form a candidate $\pi^*$ by swapping two randomly chosen symbols in $\pi_n$; if $P(\pi^*) \geq P(\pi_n)$, accept the candidate and set $\pi_{n+1} = \pi^*$; otherwise, accept it with probability $P(\pi^*)/P(\pi_n)$, and the rest of the time keep $\pi_{n+1} = \pi_n$.
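This accept/reject loop is a small amount of code. The sketch below works with log-plausibilities, since the product of thousands of small probabilities underflows ordinary floating point; the bigram table `logT`, the floor value for unseen bigrams, and all names are illustrative assumptions rather than the author's exact setup:

```python
import math
import random

def metropolis_decode(message, logT, alphabet, steps=10000, seed=0):
    """Sample highly plausible permutations: repeatedly propose swapping two
    symbols and accept the swap with probability min(1, P(new)/P(old)).

    `logT[(x, y)]` holds log bigram frequencies estimated from a reference
    text; a small floor stands in for unseen bigrams.
    """
    rng = random.Random(seed)

    def log_plausibility(perm):
        # Sum of log T over bigrams: the log of the product in the column.
        return sum(logT.get((perm[a], perm[b]), math.log(1e-6))
                   for a, b in zip(message, message[1:]))

    perm = {c: c for c in alphabet}           # start from the identity
    current = log_plausibility(perm)
    for _ in range(steps):
        x, y = rng.sample(alphabet, 2)        # propose swapping two symbols
        perm[x], perm[y] = perm[y], perm[x]
        proposed = log_plausibility(perm)
        # Accept with probability min(1, exp(proposed - current)).
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed
        else:
            perm[x], perm[y] = perm[y], perm[x]  # reject: undo the swap
    return perm
```

The function returns the final state of the chain; in practice one would also track the most plausible permutation seen along the way.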
When I ran this algorithm, the permutation $\pi_{1000}$ applied to the encoded message produced:
ik mof tappen so tage wead anostew voou avofs ctwirsoptew w ovin mof bam webebvew stas te once tad a rhan ow ste rhan t ad ctwirsoptew wovin i dons unoh htict and stas te fred so call stir rhan poot stas har a lony sibe ayo and hten he r aid yoodvme he soou ste nabe hist fr ar he didns stinu ste rhan hofld hans is anm bowe hell hten edhawd veaw raid stas te hofld liue an excisiny nabe all so tibrelk ctwirsoptew w ovin raid as once histofs rsoppiny so stinu stas te har hin niestepoot and te har ro ar i tage explained ste poot paws i hill noh explain ste wers ok is
and $\pi_{2500}$ gave:
if you happen to have read another book about christopher r obin you may remember that he once had a swan or the swan h ad christopher robin i dont know which and that he used to call this swan pooh that was a long time ago and when we s aid goodbye we took the name with us as we didnt think the swan would want it any more well when edward bear said that he would like an exciting name all to himself christopher r obin said at once without stopping to think that he was win niethepooh and he was so as i have explained the pooh part i will now explain the rest of it
These results look pretty reasonable, and the whole thing, including scanning War and Peace, took about five seconds on my laptop. It’s remarkable that we can find the decoding permutation, out of a set of roughly $10^{28}$ permutations, with so little effort.
This is an example of what Persi Diaconis has called the Markov Chain Monte Carlo revolution. In his 2009 survey article, Diaconis describes how a psychologist came to the Stanford statistical consulting service with a collection of coded messages exchanged between incarcerated people in the California prison system. Guessing that the messages were encoded using a substitution cipher, Stanford student Marc Coram applied this algorithm to decode them.
There are a few things to be said before we investigate more deeply. First, the algorithm only worked about half of the time; I’ve just chosen to present here one of the times that it did work. Other times, it seemed to get stuck with a permutation that produced gibberish. The odds of success went up when the initial permutation was chosen to be the one that matched symbols based on the frequency with which they appeared in the message rather than choosing a permutation at random.
Even when the message was successfully decoded, continuing to run the algorithm for more iterations frequently gave something like:
if you happen to have read another mook amout christopher romin
Let’s look more carefully at the algorithm to see what’s going on. At each step, we generate a new permutation. If that permutation is more plausible, we accept it. However, there is still a chance that a less plausible permutation will be accepted. So while the algorithm generally tends to increase the plausibility of the permutations it produces, there’s a way for it to escape a local maximum.
There is an alternative point of view that’s helpful and that we’ll develop more fully. Rather than searching for the most plausible permutation, the algorithm is actually sampling from the permutations in such a way that more plausible permutations are more likely to appear. More specifically, the probability of obtaining a permutation $\pi$ is proportional to its plausibility so that the sampling distribution is
$$
s(\pi) = \frac{P(\pi)}{Z}
$$ where $Z$ is the normalizing constant
$$
Z = \sum_\pi P(\pi).
$$
This alternative point of view will become more clear after we make a digression through the world of Markov chains.
Let’s begin with a network consisting of nodes joined by edges and suppose we traverse the network at random by moving along its edges. In particular, if we are at a node labeled $x$, the probability that we move along an edge to node $y$ will be denoted $K(x,y)$. If there are $m$ nodes in our network, $K$ is an $m\times m$ matrix we refer to as the transition matrix.
For example, the nodes in the network could be the 27 characters in a piece of English text with $K(x,y)$ being the probability that character $x$ is followed by character $y$.
Since we’ll move from $x$ to some other node, we have $\sum_y K(x,y) = 1$ so that the sum across any row of $K$ is 1. We say that $K$ is row stochastic and note that a row $K(x,-)$ represents the distribution describing our location one step after beginning at $x$.
Notice that $K(x,y)K(y,z)$ is the probability that we begin at $x$, pass through $y$, and then move to $z$. Therefore,
$$
K^2(x,z) = \sum_y K(x,y)K(y,z)
$$ represents the probability that we begin at $x$ and end up at $z$ after two steps. Likewise, $K^n(x,z)$ represents the probability that we begin at $x$ and end up at $z$ after $n$ steps.
As we randomly walk around the network, suppose that the distribution of locations at step $n$ is given by the row vector $s_n$; that is, $s_n(x)$ gives the probability that we are at node $x$ at that step. The product $\sum_x s_n(x)K(x,y)$ describes the probability that we are at node $y$ after the next step. In other words, the product $s_nK$ provides the new distribution, $s_{n+1} = s_nK$.
As a simple example, consider the network with transition matrix
$K=\begin{bmatrix} 5/6 & 1/6 \\ 1/2 & 1/2 \\ \end{bmatrix}.$
This means that if we’re at node $x$, there is a $5/6$ chance we stay at $x$ and a $1/6$ chance we move to $y$. On the other hand, if we are at $y$, we move to $x$ or stay at $y$ with equal probability.
If we begin at $x$, the initial distribution is described by the row vector $s_0 = \begin{bmatrix}1 & 0\end{bmatrix}$. After one step, the distribution is $s_1=s_0K = \begin{bmatrix}5/6 & 1/6 \end{bmatrix}$. Here’s how the first few steps proceed:
$$
\begin{array}{c|c}
{\bf n} & {\bf s_n} \\
\hline
0 & \begin{bmatrix}1.000 & 0.000 \end{bmatrix} \\
1 & \begin{bmatrix}0.833 & 0.167 \end{bmatrix} \\
2 & \begin{bmatrix}0.778 & 0.222 \end{bmatrix} \\
3 & \begin{bmatrix}0.759 & 0.241 \end{bmatrix} \\
4 & \begin{bmatrix}0.753 & 0.247 \end{bmatrix} \\
\end{array}
$$
If we continue, something remarkable happens. The distributions converge: $s_n\to\overline{s} = \begin{bmatrix} 0.75 & 0.25 \end{bmatrix}$. This means that after a long time, we spend three-quarters of our time at node $x$ and one-quarter of our time at node $y$.
Since $s_{n+1} = s_nK$, this says that $$
\overline{s} = \overline{s}K
$$ and we call $\overline{s}$ a stationary distribution. In fact, the powers of $K$ converge $$
K^n \to \begin{bmatrix}
0.75 & 0.25 \\
0.75 & 0.25 \\
\end{bmatrix}
$$ to a rank one matrix so that no matter the initial state $s_0$, the distributions $s_n = s_0K^n$ converge to the same stationary distribution $\overline{s}=\begin{bmatrix} 0.75 & 0.25 \end{bmatrix}$.
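These numbers are easy to reproduce. A quick check of the worked example (NumPy is assumed here; any numerical linear algebra library would do):

```python
import numpy as np

# The 2x2 transition matrix from the worked example.
K = np.array([[5/6, 1/6],
              [1/2, 1/2]])

# Iterate s_{n+1} = s_n K, starting from node x.
s = np.array([1.0, 0.0])
for _ in range(50):
    s = s @ K
print(s)  # approaches the stationary distribution [0.75, 0.25]

# The powers of K converge to a rank-one matrix, each of whose rows
# is the stationary distribution.
Kn = np.linalg.matrix_power(K, 50)
print(Kn)
```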
This is an example of the beautiful Perron-Frobenius theorem: under a mild assumption on the row stochastic matrix $K$, any initial distribution $s_0$ will converge to a unique stationary distribution $\overline{s}$. In fact, this theorem is the basis for Google’s PageRank algorithm, as explained in an earlier Feature Column.
The sequence of distributions $s_n$ is called a Markov chain, so we say that such a Markov chain converges to the unique stationary distribution.
But what does this have to do with the decoding problem we began with? Let’s recast the sequence of permutations $\pi_n$ that we created in the language of Markov chains.
Let’s view the set of permutations $S_n$ as a network where each permutation represents a node and an edge connects two nodes when the permutations differ by a transposition. We want to sample permutations from $S_n$ in such a way that we are more likely to sample highly plausible permutations. With this in mind, we view the plausibility as defining a distribution on $S_n$. That is, the probability of choosing a given permutation is $s(\pi) = P(\pi)/Z$ where $Z$ is a normalizing constant
$$
Z = \sum_{\pi} P(\pi).
$$ Of course, $S_n$ is so large that evaluating $Z$ is not feasible, but we’ll see that this presents no problem.
If we can find a transition matrix $K$ whose unique stationary distribution is $\overline{s} = s$, then a Markov chain will produce a sequence of permutations that is a sample from the distribution $s(\pi)$.
In the discussion of the Perron-Frobenius theorem, we began with the transition matrix $K$ and found the resulting stationary vector. Now we’d like to invert this process: if we are given a distribution $s$, can we find a transition matrix $K$ whose stationary distribution is $\overline{s} = s$?
The Metropolis-Hastings algorithm tells us how to create the transition matrix $K$. We begin with any row stochastic transition matrix $J(x,y)$. Given $x$ and $y$, we define the acceptance ratio
$$
A(x,y) = \frac{s(y)J(y,x)}{s(x)J(x,y)}.
$$ This enables us to define
$$
K(x,y) = \begin{cases}
J(x,y) & \text{ if } x\neq y, A(x,y) \geq 1 \\
J(x,y)A(x,y) & \text{ if } x\neq y, A(x,y) \lt 1 \\
\end{cases}.
$$ The diagonal term $K(x,x)$ is chosen to enforce the condition that $\sum_y K(x,y) = 1$ so that $K$ is row stochastic.
What is the stationary distribution associated to $K$? Well, notice that $A(y,x) = 1/A(x,y)$. If the acceptance ratio $A(x,y) \lt 1$, then $A(y,x) \gt 1$ so that
$$
s(x)K(x,y) = s(x)J(x,y)A(x,y) = s(y)J(y,x) = s(y)K(y,x).
$$ Therefore
$$
\sum_x s(x)K(x,y) = \sum_x s(y)K(y,x) = s(y)\sum_xK(y,x) =
s(y).
$$ In other words, $sK = s$ so $s$ is the unique stationary distribution for $K$: $\overline{s} = s$.
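The construction is concrete enough to test numerically. Here is a sketch (all names are mine) that builds $K$ from a proposal matrix $J$ and a target distribution $s$ on a toy three-state space, then checks that $s$ is stationary:

```python
import numpy as np

def metropolis_hastings_K(J, s):
    """Build the Metropolis-Hastings transition matrix K from a
    row-stochastic proposal matrix J and a target distribution s,
    following the construction in the text (s is assumed positive)."""
    m = len(s)
    K = np.zeros((m, m))
    for x in range(m):
        for y in range(m):
            if x != y and J[x, y] > 0:
                A = (s[y] * J[y, x]) / (s[x] * J[x, y])  # acceptance ratio
                K[x, y] = J[x, y] * min(1.0, A)
        K[x, x] = 1.0 - K[x].sum()  # diagonal makes each row sum to 1
    return K

# A toy target on three states and a uniform proposal.
s = np.array([0.5, 0.3, 0.2])
J = np.full((3, 3), 1/3)
K = metropolis_hastings_K(J, s)
print(s @ K)  # equals s: the target distribution is stationary for K
```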
When we apply this to our decoding problem, we choose the initial transition matrix $J(x,y)$ so that from the permutation $x$, we uniformly choose any permutation $y$ with $y=\sigma x$ for a transposition $\sigma$. In this case, $J$ is symmetric so that $J(x,y) = J(y,x)$ and the acceptance ratio is
$$
A(x,y) = \frac{s(y)J(y,x)}{s(x)J(x,y)} = \frac{s(y)}{s(x)} =
\frac{P(y)/Z}{P(x)/Z} = \frac{P(y)}{P(x)}.
$$ Notice that we can find the ratio $s(y)/s(x)$ without knowing the normalizing constant $Z$.
Now we see that the sequence of permutations $\pi_n$ that we created to decode the message is actually a Markov chain given by the transition matrix $K$ whose stationary distribution is proportional to the plausibility. Since the Markov chain samples from this distribution, we find that the sequence of permutations favors permutations with a high plausibility, which makes it likely that we encounter the right permutation for decoding the message.
This method is known as Markov Chain Monte Carlo sampling, and it plays an important role in the mathematical analysis of political redistricting plans. As is well known, American state legislatures redraw maps of districts for political representation every 10 years using data from the latest Census. A typical state is geographically divided into small building blocks, such as Census blocks or tracts, and each building block is assigned to a Congressional district.
For instance, Michigan has 14 Congressional districts and 2813 Census tracts. A redistricting plan that assigns a Congressional district to each Census tract would be a function, $R:{\cal C}\to\{1,2,3,\ldots,14\}$, where $\cal C$ is the set of Census tracts. The number of such functions is $14^{2813}$, whose size we understate by calling it astronomically large. Most of these plans don’t satisfy legal requirements established by the state, but this just points to the fact that there are lots of possible redistricting plans.
Here are the requirements that must be satisfied by a Michigan redistricting plan. The populations of the districts should be approximately equal, and each district should be contiguous, meaning that one could walk over land between any two points in a district. Districts are also required to be “compact,” meaning the area of each district is a significant fraction of the area of the smallest circle that encloses the district.
When a particular party controls the state legislature, they are naturally motivated to draw districts that ensure their party will gain as many representatives as possible no matter the expressed preference of voters. For instance, they may put many voters from the opposing party into a small number of districts while spreading their own voters around just enough to ensure a majority in a large number of districts. This practice is called gerrymandering.
For example, Republicans won 60 of the 99 seats in the Wisconsin State Assembly in the 2012 elections after the Republican-controlled legislature created a new redistricting plan following the 2010 Census. That is, Republicans won about 60% of the seats in spite of winning only 50.05% of the votes. How can we assess whether this is the result of a gerrymander or simply due to constraints on the redistricting plans?
There are many mathematical approaches to this problem, but a promising recent approach is to generate a large ensemble of redistricting plans and, for each plan, determine how many seats each party would win had that plan been in place. Now the problem starts to sound familiar: we want to generate a representative sample from an extremely large set of possibilities. Markov Chain Monte Carlo sampling does just that!
There’s a network that provides a useful mathematical model of a redistricting plan. Each geographic building block is a node, and two nodes are joined by an edge if the corresponding building blocks abut one another. Nodes with the same color in the figure below belong to the same district in a particular redistricting plan.
We say that an edge joining two nodes in different districts is a conflicted edge.
To facilitate our sampling strategy, we will now build a network of redistricting plans with each plan forming a node. An edge joins two plans if one is obtained from the other by changing the district of one endpoint of a conflicted edge. For instance, here’s a new plan obtained by changing one endpoint of the conflicted edge from yellow to green. These two plans are connected by an edge in our network of redistricting plans.
We would like to draw samples from the huge set of redistricting plans by completing a random walk on this network. If we simply follow an edge at random, we have the transition matrix $J(R, R')$ where $R$ and $R'$ are redistricting plans joined by an edge. If $c(R)$ is the number of conflicted edges, we have
$$
J(R,R') = \frac{1}{2c(R)}.
$$
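To make the proposal concrete, here is a toy sketch (my own illustration, not the authors' code) of counting conflicted edges and flipping one endpoint:

```python
import random

def conflicted_edges(edges, district):
    """Edges whose endpoints lie in different districts; `district`
    maps each node to its district label."""
    return [(u, v) for u, v in edges if district[u] != district[v]]

def propose(edges, district, rng=random):
    """One step of the proposal J: pick a conflicted edge uniformly,
    pick one of its two endpoints, and move that endpoint into the
    other endpoint's district. Each of the 2*c(R) choices has
    probability 1/(2*c(R))."""
    conflicted = conflicted_edges(edges, district)
    u, v = rng.choice(conflicted)
    mover, other = (u, v) if rng.random() < 0.5 else (v, u)
    new_district = dict(district)
    new_district[mover] = district[other]
    return new_district
```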
But now we would like to sample redistricting plans that satisfy the additional requirements of equal population, contiguity, and so forth. For each requirement, there is a measure of how well a redistricting plan satisfies that requirement. For instance, $M_{\text{pop}}(R)$ will measure how well the equal population requirement is met. These measures are weighted and combined into a total scoring function
$$
M(R) = w_{\text{pop}}M_{\text{pop}}(R) + w_{\text{compact}}M_{\text{compact}}(R) + \ldots.
$$ The better the redistricting plan $R$ satisfies the requirements, the lower the scoring function $M(R)$.
Finally, the function $e^{-M(R)}$ defines a distribution
$$
s(R) = \frac{e^{-M(R)}}{Z}
$$ where $Z = \sum_R e^{-M(R)}$ is a normalizing constant that, as before, needn’t concern us. Sampling from this distribution means we are more likely to obtain redistricting plans that satisfy the requirements.
We’re now in a position to apply the Metropolis-Hastings algorithm to obtain the transition matrix $K(R, R')$ whose unique stationary distribution is $s(R)$. Taking a random walk using this transition matrix produces a sample whose redistricting plans are likely to satisfy the legal requirements for a redistricting plan.
To study the 2012 Wisconsin State Assembly election, Herschlag, Ravier, and Mattingly drew an ensemble of 19,184 redistricting plans in this way, and, for each plan in the ensemble, determined the number of seats that would have been won by the two parties had that plan been in place. The results are summarized in the following histogram, which shows the frequency of seats won by Republicans.
The plan in place resulted in 60 Republican seats, which is shaded red in the histogram. This result appears to be an outlier, which is evidence that the redistricting plan drawn by Republicans after the 2010 Census is the result of an egregious partisan gerrymander.
Herschlag, Ravier, and Mattingly summarize this result, combined with others from their analysis, by writing:
The Wisconsin redistricting seems to create a firewall which resists Republicans falling below 50 seats. The effect is striking around the mark of 60 seats where the number of Republican seats remains constant, despite the fraction of votes dropping from 51% to 48%.
In addition to the Wisconsin study described here, these techniques have been used to assess North Carolina’s Congressional redistricting plan, which saw Republicans capturing 10 of 13 seats in the 2016 election in spite of a 47-53 percent split in votes given to the two major parties. Fewer than one percent of the 24,000 maps generated gave Republicans ten seats.
While mathematicians have created other metrics to detect gerrymanders, the approach using Markov Chain Monte Carlo sampling described here offers several advantages. First, it can be adapted to the unique requirements of any state by modifying the scoring function $M(R)$.
It’s also relatively easy to communicate the results to a non-technical audience, which is important for explaining them in court. Indeed, a group of mathematicians and legal experts filed an amicus brief with the Supreme Court that was cited in Justice Elena Kagan’s dissent in a recent gerrymandering case regarding North Carolina’s redistricting map.
This column has focused on two applications of Markov Chain Monte Carlo sampling. Diaconis’ survey article, cited in the references, describes many more uses in areas such as group theory, chemistry, and theoretical computer science.
Persi Diaconis. The Markov chain Monte Carlo revolution. Bulletin of the American Mathematical Society, 46 (2009), 179-205.
Gregory Herschlag, Robert Ravier, and Jonathan C. Mattingly. Evaluating Partisan Gerrymandering in Wisconsin. 2017.
Gregory Herschlag, Han Sung Kang, Justin Luo, Christy Vaughn Graves, Sachet Bangia, Robert Ravier, and Jonathan C. Mattingly. Quantifying gerrymandering in North Carolina. Statistics and Public Policy, 7(1) (2020), 30–38.
Moon Duchin. Gerrymandering Metrics: How to measure? What’s the baseline? 2018.
Sachet Bangia, Christy Vaughn Graves, Gregory Herschlag, Han Sung Kang, Justin Luo, Jonathan C. Mattingly, Robert Ravier. Redistricting: Drawing the Line. 2017.
MGGG Redistricting Lab. The Metric Geometry and Gerrymandering Group is a great resource for learning about new developments in this area.
Noah Giansiracusa
Bentley University
Artificial intelligence (AI) breakthroughs make the news headlines with increasing frequency these days. At least for the time being, AI is synonymous with deep learning, which means machine learning based on neural networks (don't worry if you don't know what neural networks are—you're not going to need them in this post). One area of deep learning that has generated a lot of interest, and a lot of cool results, is graph neural networks (GNNs). This technique lets us feed a neural network data that naturally lives on a graph, rather than in a vector space like Euclidean space. A big reason for the popularity of this technique is that much of our modern internet-centric lives takes place in graphs. Social media platforms connect users into massive graphs, with accounts as vertices and friendships as edges (following another user corresponds to a directed edge in a directed graph), while search engines like Google view the web as a directed graph with webpages as vertices and hyperlinks as edges.
AirBnB provides an interesting additional example. At the beginning of 2021, the Chief Technology Officer at AirBnB predicted that GNNs would soon be big business for the company, and indeed just a few months ago an engineer at AirBnB explained in a blog post some of the ways they now use GNNs and their reasons for doing so. This engineer starts his post with the following bird’s-eye view of why graphs are important for modern data—which I'll quote here since it perfectly sets the stage for us:
Many real-world machine learning problems can be framed as graph problems. On online platforms, users often share assets (e.g. photos) and interact with each other (e.g. messages, bookings, reviews). These connections between users naturally form edges that can be used to create a graph. However, in many cases, machine learning practitioners do not leverage these connections when building machine learning models, and instead treat nodes (in this case, users) as completely independent entities. While this does simplify things, leaving out information around a node’s connections may reduce model performance by ignoring where this node is in the context of the overall graph.
In this Feature Column we're going to explore how to shoehorn this missing graph-theoretic "context" of each node back into a simple Euclidean format that is amenable to standard machine learning and statistical analysis. This is a more traditional approach to working with graph data, pre-dating GNNs. The basic idea is to cook up various metrics that transform the discrete geometry of graphs into numbers attached to each vertex. This is a fun setting to see some graph theory in action, and you don't need to know any machine learning beforehand—I'll start out with a quick gentle review of all you need.
The three main tasks in machine learning are regression, classification, and clustering.
For regression, you have a collection of variables called features and one additional variable, necessarily numerical (meaning $\mathbb{R}$-valued), called the target variable; by considering the training data, where the values of both the features and the target are known, you fit a model that attempts to predict the value of the target on new data where the features but not the target are known. For instance, predicting a college student's income after graduation based on their GPA and the college they are attending is a regression task. Suppose all the features are numerical—for instance, we could represent each college with its US News ranking (ignore for a moment how problematic those rankings are). Then a common approach is linear regression, which is when you find a hyperplane in the Euclidean space coordinatized by the features and the target that best fits the training data (i.e., minimizes the "vertical" distance from the training points down to the hyperplane).
Classification is very similar; the only difference is that the target variable is categorical rather than numerical—which in math terms just means that it takes values in a finite set rather than in $\mathbb{R}$. When this target set has size two (which is often $\{\mathrm{True}, \mathrm{False}\}$, or $\{\mathrm{yes}, \mathrm{no}\}$, or $\{0, 1\}$) this is called binary classification. For instance, predicting which students will be employed within a year of graduation can be framed as a binary classification task.
Clustering is slightly different because there is no target, only features, and you'd like to partition your data into a small number of subsets in some natural way based on these features. There is no right or wrong answer here—clustering tends to be more of an exploratory activity. For example, you could try clustering college students based on their GPA, SAT score, financial aid amount, number of honors classes taken, and number of intramural sports played, then see if the clusters have human-interpretable descriptions that might be helpful in understanding how students divide into cohorts.
I want to provide you with the details of one regression and classification method, to give you something concrete to have in mind when we turn to graphs. Let's do the k-Nearest Neighbors (k-NN) algorithm. This algorithm is a bit funny because it doesn't actually fit a model to training data in the usual sense—to predict the value of the target variable for each new data point, the algorithm looks directly back at the training data and makes a calculation based on it. Start by fixing an integer $k \ge 1$ (smaller values of $k$ provide a localized, granular look at the data, whereas larger values provide a smoothed, aggregate view). Given a data point $P$ with known feature values but unknown target value, the algorithm first finds the $k$ nearest training points $Q_1, \ldots, Q_k$—meaning the training points $Q_i$ whose distance to $P$ in the Euclidean space of feature values is smallest. Then if the task is regression, the predicted target value for $P$ is the average of the target values of $Q_1, \ldots, Q_k$, whereas if the task is classification then the classes of these $Q_i$ are treated as votes and the predicted class for $P$ is whichever class receives the most votes. (Needless to say, there are plenty of variants, such as weighting the average/vote by distance to $P$, changing the average from a mean to a median, or changing the metric from Euclidean to something else.)
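As a concrete illustration, here is a compact implementation of k-NN for both tasks (the toy data and all names are made up for the example):

```python
import math
from collections import Counter

def knn_predict(train, point, k, task="classification"):
    """k-Nearest Neighbors, as described above: find the k training points
    closest to `point` in Euclidean distance, then vote (classification)
    or average (regression) their targets."""
    def dist(q):
        return math.dist(q[0], point)          # q = (features, target)
    nearest = sorted(train, key=dist)[:k]
    targets = [t for _, t in nearest]
    if task == "regression":
        return sum(targets) / k
    return Counter(targets).most_common(1)[0][0]  # majority vote

# Toy data: (GPA, US News rank) -> employed within a year?
train = [((3.9, 5), "yes"), ((3.7, 12), "yes"), ((2.1, 80), "no"),
         ((2.4, 95), "no"), ((3.5, 30), "yes")]
print(knn_predict(train, (3.8, 10), k=3))  # "yes"
```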
That's all you need to know about machine learning in the usual Euclidean setting where data points live in $\mathbb{R}^n$. Before turning to data points that live in a graph, we need to discuss some graph theory.
First, some terminology. Two vertices in a graph are neighbors if they are connected by an edge. Two edges are adjacent if they have a vertex in common. A path is a sequence of adjacent edges. The distance between two vertices is the length of the shortest path between them, where length here just means the number of edges in the path.
A useful example to keep in mind is a social media platform like Facebook, where the vertices represent users and the edges represent "friendship" between them. (This is an undirected graph; platforms like Twitter and Instagram in which accounts follow each other asymmetrically form directed graphs. Everything in this article could be done for directed graphs with minor modification, but I'll stick to the undirected case for simplicity.) In this example, your neighbors are your Facebook friends, and a user has distance 2 from you if you're not friends but you have a friend in common. The closed ball of radius 6 centered at you (that is, all accounts of distance at most 6 from you) consists of all Facebook users you can reach through at most 6 degrees of separation on the platform.
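Graph distance is exactly what breadth-first search computes. Here's a small Python sketch on a toy "friendship" graph of my own invention:

```python
from collections import deque

def distance(adj, s, t):
    """Graph distance: the length (number of edges) of a shortest path
    from s to t, computed by breadth-first search on an adjacency dict.
    Returns None if t is unreachable from s."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return None

# A tiny network: you (0) are friends with 1, who is friends with 2;
# user 3 has no friends on the platform.
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
print(distance(adj, 0, 2))  # 2  (a friend of a friend)
print(distance(adj, 0, 3))  # None  (not connected)
```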
Next, we need some ways of quantifying the structural role vertices play in a graph. There are a bunch of these, many of which capture various notions of centrality in the graph; I'll just provide a few here.
Starting with the simplest, we have the degree of a vertex, which in a graph without loops or multiple edges is just the number of neighbors. In Facebook, your degree is your number of friends. (In a directed graph the degree splits as the sum of the in-degree and the out-degree, which on Twitter count the number of followers and the number of accounts followed.)
The closeness of a vertex captures whether it lies near the center or the periphery of the graph. It is defined as the reciprocal of the sum of distances between this vertex and each other vertex in the graph. A vertex near the center will have a relatively modest distance to the other vertices, whereas a more peripheral vertex will have a modest distance to some vertices but a large distance to the vertices on the “opposite” side of the graph. This means that the sum of distances for a central vertex is smaller than the sum of distances for a peripheral vertex; reciprocating this sum flips this around so that the closeness score is greater for a central vertex than a peripheral vertex.
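In code, closeness is just a breadth-first search followed by a reciprocal. Here's a sketch (assuming, as the definition implicitly does, that the graph is connected):

```python
from collections import deque

def closeness(adj, v):
    """Closeness centrality as defined above: the reciprocal of the sum
    of distances from v to every other vertex, with distances computed
    by breadth-first search. Assumes a connected graph."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return 1 / sum(dist.values())

# On the path graph 1-2-3, the central vertex 2 scores higher than
# the peripheral vertex 1, as expected.
path = {1: {2}, 2: {1, 3}, 3: {2}}
print(closeness(path, 2))  # 0.5
print(closeness(path, 1))  # 0.3333333333333333
```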
The betweenness of a vertex, roughly speaking, captures centrality in terms of the number of paths in the graph that pass through the vertex. More precisely, it is the sum over all pairs of other vertices in the graph of the fraction of shortest paths between the pair of vertices that pass through the vertex in question. That's a mouthful, so let's unpack it with a couple simple examples. Consider the following two graphs:
In graph (a), the betweenness of V1 is 0 because no shortest paths between the remaining vertices pass through V1. The same is true of V2 and V3. The betweenness of V4, however, is 2: between V1 and V2 there is a unique shortest path and it passes through V4, and similarly between V1 and V3 there is a unique shortest path and it also passes through V4. For the graph in (b), by symmetry it suffices to compute the betweenness of a single vertex. The betweenness of V1 is 0.5, because between V2 and V3 there are 2 shortest paths, exactly one of which passes through V1.
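Since the figure isn't reproduced here, here's a brute-force Python check of these numbers. I'm assuming graph (a) is a triangle on V2, V3, V4 with V1 attached to V4, and graph (b) is a 4-cycle; these are shapes that reproduce the stated scores. (Brute-force path enumeration is only sensible for tiny connected graphs like these.)

```python
from itertools import permutations

def shortest_paths(adj, s, t):
    """All shortest paths from s to t, found by brute-force enumeration
    of candidate paths in order of increasing length. Tiny connected
    graphs only!"""
    for k in range(len(adj)):
        paths = []
        for mid in permutations([v for v in adj if v not in (s, t)], k):
            path = (s,) + mid + (t,)
            if all(b in adj[a] for a, b in zip(path, path[1:])):
                paths.append(path)
        if paths:
            return paths
    return []

def betweenness(adj, v):
    """Sum over pairs {s, t} of other vertices of the fraction of
    shortest s-t paths passing through v."""
    score = 0.0
    others = [u for u in adj if u != v]
    for i, s in enumerate(others):
        for t in others[i + 1:]:
            sps = shortest_paths(adj, s, t)
            score += sum(v in p for p in sps) / len(sps)
    return score

# Graph (a), assumed: edges V1-V4, V2-V4, V3-V4, V2-V3.
a = {1: {4}, 2: {3, 4}, 3: {2, 4}, 4: {1, 2, 3}}
# Graph (b), assumed: a 4-cycle V2-V1-V3-V4-V2.
b = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3}}
print(betweenness(a, 4))  # 2.0
print(betweenness(b, 1))  # 0.5
```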
The following figure shows a randomly generated graph on 20 vertices where in (a) the size of each vertex corresponds to its closeness score and in (b) it corresponds to the betweenness score. Note that the closeness indeed reflects how central versus peripheral the vertices are; the betweenness is harder to interpret directly, but roughly speaking it helps identify important bridges in the graph.
Another useful pair of measures of vertex importance/centrality in a graph is the eigenvector centrality score and the PageRank score. I'll leave it to the interested reader to look these up. They both have nice interpretations in terms of eigenvectors related to the adjacency matrix and in terms of random walks on the graph.
Suppose we have data in the usual form for machine learning—so there are features for clustering, or if one is doing regression/classification then there is additionally a target variable—but suppose in addition that the data points form the vertices of a graph. An easy yet remarkably effective way to incorporate this graph structure (that is, to not ignore where each vertex is "in the context of the overall graph," in the words of the AirBnB engineer) is simply to append a few additional features given by the vertex metrics discussed earlier: degree, closeness, betweenness, eigenvector centrality, PageRank (and there are plenty of others beyond these as well).
For instance, one could perform clustering in this manner, and this would cluster the vertices based on both their graph-theoretic properties as well as the original non-graph-theoretic feature values. Concretely, if one added closeness as a single additional graph-theoretic feature, then the resulting clustering is more likely to put peripheral vertices together in the same clusters and it is more likely to put vertices near the center of the graph together in the same clusters.
The following figure shows the same 20-vertex random graph pictured earlier, now with vertices colored by a clustering algorithm (k-means, for $k=3$) that uses two graph-theoretic features: closeness and betweenness. We obtain one cluster comprising the two isolated vertices, one cluster comprising the two very central vertices, and one cluster comprising everything else.
If one is predicting the starting income of college students upon graduation, one could use a regression method with traditional features as discussed above but include additional features such as the eigenvector centrality of each student in the network formed by connecting students whenever they took at least one class together.
So far we've augmented traditional machine learning tasks by incorporating graph-theoretic features. Our last topic is a machine learning task without counterpart in the traditional non-graph-theoretic world: edge prediction. Given a graph (possibly with a collection of feature values for each vertex), we'd like to predict which edge is most likely to form next, when the graph is considered as a somewhat dynamic process in which the vertex set is held constant but the edges form over time. In the context of Facebook, this is predicting which two users who are not yet Facebook friends are most likely to become friends—and once Facebook makes this prediction, it can use it as a suggestion. We don't know the method Facebook actually uses for this (my guess is that it at least involves GNNs), but I can explain a very natural approach that is widely used in the data science community.
We first need one additional background ingredient from machine learning. Rather than directly predicting the class of a data point, most classifiers first compute propensity scores, which up to normalization are essentially the estimated probabilities of the classes—then the predicted class is whichever class has the highest propensity score. For example, in k-NN I said the prediction is given by counting the number of neighbors in each class and taking the most prevalent class; these class counts are the propensity scores for k-NN classification. Concretely, for 10-NN if a data point has 5 red neighbors, 3 green neighbors, and 2 blue neighbors, then the propensity scores are 0.5 for red, 0.3 for green, and 0.2 for blue (and of course the prediction itself is then red). For binary classification one usually just reports a single propensity score between 0 and 1, since the propensity score for the other class is just the complementary probability.
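In code, the propensity scores for the 10-NN example are just the normalized class counts:

```python
from collections import Counter

# The 10-NN example from the text: 5 red, 3 green, and 2 blue neighbors.
neighbor_classes = ["red"] * 5 + ["green"] * 3 + ["blue"] * 2
counts = Counter(neighbor_classes)
propensities = {c: n / len(neighbor_classes) for c, n in counts.items()}
print(propensities)  # {'red': 0.5, 'green': 0.3, 'blue': 0.2}

# The predicted class is the one with the highest propensity score.
print(max(propensities, key=propensities.get))  # red
```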
Returning to the edge prediction task, consider a graph with n vertices and imagine a matrix with n choose 2 rows indexed by the pairs of vertices in the graph. The columns for this matrix are features associated to pairs of vertices—which could be something like the mean (or min, or max) of the closeness (or betweenness, or eigenvector centrality, or...) score for the two vertices in the pair, and if there are non-graph-theoretic features associated with the vertices one could also draw from these, and one could also use the distance between the two vertices in the pair as a feature. Create an additional column, playing the role of the target variable, that is a 1 if the vertex pair are neighbors (that is, joined by an edge) and a 0 otherwise. Train a binary classifier on this data, and the vertex pair with the highest propensity score among those that are not neighbors is the pair most inclined to become neighbors—that is, this is the next edge most likely to form, based on the features used. This reveals the edges that don't exist yet but seem like they should, based on the structure of the graph (and the extrinsic non-graph data, if one also uses that).
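Here's a hypothetical Python sketch of assembling that table. The function and column names are my own, and I only implement the mean aggregation; a real pipeline would add min/max aggregations, the pairwise distance, non-graph features, and then hand the rows to a classifier.

```python
from itertools import combinations

def pair_rows(adj, vertex_features):
    """Build the edge-prediction training table: one row per vertex pair,
    with each per-vertex score (e.g. closeness, betweenness) aggregated
    by its mean over the pair, plus a 0/1 'edge' target marking whether
    the pair is currently joined by an edge."""
    rows = []
    for u, v in combinations(sorted(adj), 2):
        feats = {name: (vertex_features[u][name] + vertex_features[v][name]) / 2
                 for name in vertex_features[u]}
        feats["edge"] = 1 if v in adj[u] else 0  # the target variable
        rows.append(feats)
    return rows

# A toy graph with degree as the lone per-vertex feature.
adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
feats = {v: {"degree": len(adj[v])} for v in adj}
rows = pair_rows(adj, feats)
print(rows[0])  # {'degree': 1.5, 'edge': 1}  (the pair {1, 2})
```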
If one has snapshots of the graph's evolution across time, one can train this binary classifier on the graph at time $t$ then compare the predicted edges to the actual edges that exist at some later time $t' > t$, to get a sense of how accurate these edge predictions are.
In broad outlines, here's the path we took in this article and where we ended up. The distance between vertices in a graph—generalizing the popular "degrees of separation" games played with Kevin Bacon's movie roles and Paul Erdős' collaborations—allows one to quantify various graph-theoretic roles the vertices play, via notions like betweenness and closeness. These quantifications can then serve as features in clustering, regression, and classification tasks, which helps the machine learning algorithms involved incorporate the graph structure on the data points. By considering vertex pairs as data points and using the average closeness, betweenness, etc., across each pair (and/or the distance between the pair), we can predict which missing edges "should" exist in the graph. When the graph is a social media network, these missing edges can be framed as algorithmic friend/follower suggestions.
And when the graph is of mathematical collaborations (mathematicians as vertices and edges joining pairs that have co-authored a paper together), this can suggest to you who your next collaborator should be: just find the mathematician you haven't published with yet whose propensity score is highest!
Ursula Whitcher
Mathematical Reviews (AMS)
In Helsinki this summer, Ukrainian mathematician Maryna Viazovska was awarded a Fields Medal "for the proof that the $E_8$ lattice provides the densest packing of identical spheres in 8 dimensions, and further contributions to related extremal problems and interpolation problems in Fourier analysis."
Finding the most efficient way to pack identical spheres is an extremely challenging problem! Even in three dimensions, this was an open question for hundreds of years. In the seventeenth century, Johannes Kepler asserted that the familiar stacked pyramid was the best possible option, but the first person to provide a complete proof of this fact was Thomas Hales, in 1998. (Bill Casselman illustrated some of Hales' arguments using two-dimensional examples in the December 2000 Feature Column, Packing pennies in the plane.)
In 2016, Viazovska found an elegant way to show that a particular method of packing spheres in 8-dimensional space was the best. She then teamed up with four other mathematicians inspired by her arguments, Henry Cohn, Abhinav Kumar, Stephen D. Miller, and Danylo Radchenko, to prove a similar result in 24 dimensions. Six years later, 8 and 24 are still the only dimensions higher than 3 where the best way to pack identical spheres is known! The best way to pack spheres in these dimensions turns out to be extremely symmetrical. This might not be true in every dimension—maybe sometimes it's better to squeeze extra higher-dimensional spheres into unexpected corners!
What is the $E_8$ lattice that appears in Viazovska's proof? What makes it special? How do you use it to pack spheres? Let's explore these questions and check out some beautiful visualizations of $E_8$.
Before we define $E_8$, we should explore the more general concept of a lattice. It's important to be specific here, because the word "lattice" is used for multiple distinct mathematical concepts! The set of points in the plane with integer coordinates is an easy-to-visualize example of the lattices we'll be talking about today.
There are lots of different ways to describe the points in the plane with integer coordinates. One place to start is to focus on the points $(0,1)$ and $(1,0)$. If we add these points together, we get a new point, $(1,1)$, that also has integer coordinates. If we keep on adding or subtracting $(0,1)$ and $(1,0)$, we will eventually reach every point in the plane with integer coordinates!
We can connect the origin $(0,0)$, the starting points $(0,1)$ and $(1,0)$, and the first sum $(1,1)$ to make a square. A square is a special kind of parallelogram, and this square is called the fundamental parallelogram for the lattice of points with integer coordinates.
Now, what if we reversed the procedure? Instead of starting with an infinite set of scattered points and making a parallelogram, we could start with a parallelogram and create an infinite set. Our procedure is to place one vertex of the parallelogram at the origin, then repeatedly add or subtract the points corresponding to the two vertices of the parallelogram adjacent to the origin. In particular, the sum of these two vertices is the remaining vertex of the parallelogram.
In three dimensions, we could generalize this procedure using the corners of a box. Or more broadly, since we don't need the edges to meet at right angles, we could use the three-dimensional generalization of a parallelogram: a parallelepiped. (Here's an etymological puzzle I learned from John Conway: why don't we pronounce "parallelepiped" as "parallel-epi-ped," emphasizing the Greek root for "on"?)
To build a lattice in $n$ dimensions, we just need the $n$-dimensional generalization of a parallelogram and parallelepiped. Such a shape is called a parallelotope. Alternatively, we could simplify a bit by focusing on the $n$ vertices of our fundamental parallelotope that are connected by a line segment to the origin. (Linear algebra enthusiasts will recognize that we are specifying a basis of $\mathbb{R}^n$.)
We are ready to describe the $E_8$ lattice! Concretely, it is the eight-dimensional lattice determined by the eight following fundamental parallelotope vertices:
Repeatedly adding and subtracting these points creates the entire infinite $E_8$ lattice.
You might notice that the points of $E_8$ have either integer coordinates or half-integer coordinates, but never a combination of integers and half-integers. Another way to describe $E_8$ is that it consists of the points in eight dimensions with only integer or only half-integer coordinates where the sum of the coordinates is an even number.
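This description is easy to turn into code. Here's a Python membership test for $E_8$ (the function name is mine; exact rational arithmetic avoids floating-point headaches):

```python
from fractions import Fraction

def in_E8(point):
    """Membership test for the E8 lattice, per the description above:
    eight coordinates, all integers or all half-integers (odd multiples
    of 1/2), summing to an even number."""
    v = [Fraction(x) for x in point]
    all_int = all(x.denominator == 1 for x in v)
    all_half = all(x.denominator == 2 for x in v)
    return len(v) == 8 and (all_int or all_half) and sum(v) % 2 == 0

print(in_E8([1, 1, 0, 0, 0, 0, 0, 0]))               # True (a root)
print(in_E8([Fraction(1, 2)] * 8))                   # True (sum is 4)
print(in_E8([1, 0, 0, 0, 0, 0, 0, 0]))               # False (odd sum)
print(in_E8([1, Fraction(1, 2), 0, 0, 0, 0, 0, 0]))  # False (mixed)
```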
Maryna Viazovska showed that the most efficient sphere packing in eight dimensions places a sphere center at each of the points in the $E_8$ lattice. What is the radius of these spheres? You might notice that each of the eight fundamental parallelotope vertices we used to build $E_8$ is a distance of $\sqrt{2}$ from the origin. This turns out to be the smallest possible distance between any pair of points in the $E_8$ lattice, so if we give each sphere a radius of $\sqrt{2}/2$, the spheres will touch without overlapping. (Of course, if we want to pack identical eight-dimensional spheres that have some other radius, we can scale the entire picture up or down.)
The eight special points we chose are not the only points in $E_8$ with minimum distance from the origin! We can quickly construct eight more by subtracting our chosen points from the origin. But that is only the beginning. There are 240 points in $E_8$ at a distance of $\sqrt{2}$ from the origin. These special points are called roots. We can immediately see that in the $E_8$ lattice packing, a sphere centered at the origin will touch 240 other spheres. Because we can move any point of $E_8$ to the origin by subtracting its coordinates from every other lattice point, we conclude that every sphere in the $E_8$ lattice packing touches 240 other spheres.
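We can corroborate the count of 240 roots by brute-force enumeration in Python, following the coordinate description of $E_8$ above:

```python
from itertools import product
from fractions import Fraction

half = Fraction(1, 2)
roots = []
# Integer roots: exactly two nonzero coordinates, each +-1, giving
# squared length 2 (the coordinate sum is automatically even).
for v in product((-1, 0, 1), repeat=8):
    if sum(x * x for x in v) == 2:
        roots.append(v)
# Half-integer roots: every coordinate +-1/2, so the squared length is
# 8 * 1/4 = 2; keep only those whose coordinates sum to an even number.
for v in product((-half, half), repeat=8):
    if sum(v) % 2 == 0:
        roots.append(v)
print(len(roots))  # 240
```

The two families contribute $\binom{8}{2}\cdot 4 = 112$ and $2^8/2 = 128$ roots respectively.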
We can use the 240 roots of $E_8$ to construct a polytope—a higher-dimensional polyhedron—that has 240 vertices. Formally, this polytope is known as the $4_{21}$ polytope, based on a classification by the lawyer and amateur mathematician Thorold Gosset. Because it lives in eight dimensions, we can't visualize the $4_{21}$ polytope directly. However, we can create images describing which vertices are connected by edges to which other vertices, in a process analogous to the diagrams of a cube you might draw on a sheet of paper. This helps us visualize the structure of the 240 spheres surrounding the sphere at the origin in the $E_8$ lattice packing.
Here is a projection of the $E_8$ root polytope into three dimensions:
The mathematician José Luis Rodríguez Blancas led a project to visualize the $4_{21}$ polytope using colored thread. In this visualization, the 240 roots are divided into eight different "crowns," each containing 30 vertices.
Another way to understand the eight fundamental parallelotope vertices connected to the origin is to look at the angle described by the origin and each pair of vertices. Let's think of the vertices as vectors with their heads at the vertex and the tail at the origin. If we measure in radians, the angle $\theta$ between any pair of vectors $\mathbf{v}$ and $\mathbf{w}$ satisfies the equation
$$ \cos \theta = \frac{\mathbf{v} \cdot \mathbf{w}}{||\mathbf{v}|| ||\mathbf{w}||}.$$
In our eight-dimensional case, $(v_1, \dots, v_8) \cdot (w_1, \dots, w_8) = v_1 w_1 + \cdots + v_8 w_8$ and the length $||(v_1,\dots,v_8)||$ is given by $\sqrt{v_1^2+\dots+v_8^2}$. Because each of our eight special vertices is a distance of $\sqrt{2}$ from the origin, we know $||\mathbf{v}|| ||\mathbf{w}|| = 2$ for any pair of special vertices, so
$$ \cos \theta = \frac{1}{2} \mathbf{v} \cdot \mathbf{w}.$$
Thus, we can determine the angles between pairs of our special vertices by calculating their dot products. Here's a matrix with every possible pair of dot products:
$$\begin{pmatrix}
2 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 2 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 2 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 2 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 2 & -1 & -1 & 0 \\
0 & 0 & 0 & 0 & -1 & 2 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 2 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 2
\end{pmatrix}$$
We have 2s on the diagonal, as expected. Distinct vertices have dot product either 0 or -1, so the corresponding angles are either right angles or $2 \pi/3$ ($120^\circ$)—a strikingly symmetrical arrangement!
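As a quick sanity check, here's the angle computation for one pair of roots with dot product $-1$ (these two vectors, chosen by me for illustration, have integer coordinates summing to an even number, so they do lie in $E_8$):

```python
import math

# Two E8 roots of squared length 2 whose dot product is -1:
v = (1, -1, 0, 0, 0, 0, 0, 0)
w = (0, 1, -1, 0, 0, 0, 0, 0)
dot = sum(a * b for a, b in zip(v, w))
cos_theta = dot / 2  # ||v|| ||w|| = sqrt(2) * sqrt(2) = 2
theta = math.acos(cos_theta)
print(dot, round(math.degrees(theta)))  # -1 120
```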
A subtler fact about this matrix of dot products is that it has determinant 1. When a lattice's dot product matrix has this property, we say that the lattice is unimodular. In any dimension, the lattice of all points with integer coordinates is unimodular, since the corresponding dot product matrix is the identity matrix. But every point in the $E_8$ lattice has an even dot product with itself, and even unimodular lattices are much rarer. In fact, $E_8$ is the smallest even unimodular lattice—as long as we assume that our lattices can be embedded in $\mathbb{R}^n$! (If you're willing to admit imaginary lengths, your options for building unimodular objects expand.)
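You can check the determinant claim directly. Here's a short Python computation using naive cofactor expansion, which is perfectly fine at this size:

```python
def det(m):
    """Determinant by cofactor expansion along the first row
    (fine for an 8x8 integer matrix)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

# The matrix of dot products from the text.
gram = [
    [ 2, -1,  0,  0,  0,  0,  0,  0],
    [-1,  2, -1,  0,  0,  0,  0,  0],
    [ 0, -1,  2, -1,  0,  0,  0,  0],
    [ 0,  0, -1,  2, -1,  0,  0,  0],
    [ 0,  0,  0, -1,  2, -1, -1,  0],
    [ 0,  0,  0,  0, -1,  2,  0,  0],
    [ 0,  0,  0,  0, -1,  0,  2, -1],
    [ 0,  0,  0,  0,  0,  0, -1,  2],
]
print(det(gram))  # 1
```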
We can make a sixteen-dimensional even unimodular lattice by taking pairs of lattice points in $E_8$. There's a second even unimodular lattice in dimension sixteen, as well. But there are no other geometrically feasible options between 8 and 16. The next even unimodular lattices appear in 24 dimensions. That's why Viazovska and her collaborators guessed there should be a way to extend her techniques from 8 to 24-dimensional sphere packings—skipping everything in between!
Courtney Gibbons
Hamilton College
My interest in applied algebra was a long time coming. I’m not exactly a fan of living in reality, so the idea of taking something so lovely (to me) as algebra and applying it to things like disease modeling or phylogenetic trees seemed too, well, real. That’s not to say I didn’t realize how important these applications are, but those weren’t applications that captured my interest or imagination—at least, not right away!
Around 2010, I came across a paper by Elizabeth Arnold, Stephen Lucas, and Laura Taalman that described how to use algebra (particularly Groebner bases) to solve Sudoku (see reference [1]). The principles are the same as in many “real world” applied algebra settings, and thanks to this paper, I started to get interested in using algebra to help better understand the world we live in. At the same time, the 2010 Mathematics Subject Classification added “applications of commutative algebra (e.g., to statistics, control theory, optimization, etc.)” (13P25) to its list.
Today’s blog post will walk through some of the commutative algebra and algebraic geometry involved in solving a 4 by 4 Sudoku puzzle (called “shidoku”). The interested reader can play along by downloading this Macaulay2 file, shidoku.m2, and running computations on the University of Melbourne’s Macaulay2 web server: https://www.unimelb-macaulay2.cloud.edu.au/#home.
First off, let’s review the rules of the game. In a 4 by 4 grid, there are sixteen cells arranged into four rows, four columns, and four 2 by 2 cages that each take up a corner of the square.
Each cell can be populated with the numbers 1, 2, 3, or 4 in such a way that each row, column, and cage contains all four numbers exactly once.
Those rules describe everything you need to play on an empty board, but a board usually already has some cells filled in. Give it a whirl:
When you’ve finished, play it again—but get rid of the blue 4 and see how many solutions you can find. When you finish that one, put the blue 4 back, add an additional 4 to the fourth row and second column, and see what happens.
Regarding the variations, I’m pretty sure the etiquette of puzzle creation insists that a “good” puzzle has a unique solution—but bear with me! I promise I’m breaking the rules of etiquette for a good reason! Anyway, given how quickly you can mentally solve these puzzles, the natural question is: why bother with algebra? Aside from the obvious (and somewhat cheeky) answer, why NOT?, I often find it useful to understand mathematical ideas by seeing them applied to a situation I know pretty well. In this case, it’s solving a small logic puzzle.
The first step in applying algebra to a puzzle is figuring out how to model the game and its rules. If we take a big-picture look at what we’re doing, we’re trying to find a very special point in sixteen-dimensional space that satisfies the rules of the game. So, we could imagine setting up a bunch of polynomials in sixteen variables, one variable for each cell, that describe the rules and the clues so that a (hopefully unique!) zero of all the polynomials represents a (hopefully unique!) solution to our puzzle.
In other words, we’re going to start with the polynomial ring $\mathbb{C}[x_{11}, x_{12}, \ldots, x_{44}]$, and we’re going to make an ideal $I$ that represents our puzzle. Solutions to our puzzle will belong to the set $V(I) = \{(a_{11},a_{12},\ldots,a_{44}) \, : \, a_{ij} \in \mathbb{C} \text{ and } f(a_{11},a_{12},\ldots,a_{44}) = 0 \, \forall \, f \in I\}$.
As a reminder—or a teaser—an ideal $I$ in a (commutative) ring $R$ is a nonempty set that is closed under addition, subtraction, and the “multiplicative absorption” property (less poetically called “scalar multiplication” by some): for all $a$ in $I$ and $r \in R$, $ar \in I$. Its partner concept from geometry is that of a variety in an affine (or projective) space, which is a set of points that are simultaneous solutions to a set of polynomial equations. For instance, the set of polynomial equations could be the polynomials belonging to an ideal in a polynomial ring, in which case we call the variety $V(I)$.
Here’s a little example in a more familiar setting. Consider the polynomials $x+y$ and $y^2 – x^3$. They generate an ideal, $J = \langle x+y,y^2-x^3 \rangle$, which includes all linear combinations (with “scalars” from $\mathbb{R}[x,y]$) of the generators. Setting both generators equal to zero, we plot a line and a cusp, and their simultaneous solutions are the points $(1,-1)$ and $(0,0)$ in $\mathbb{R}^2$. That means $V(J) = \{(1,-1),(0,0)\}$.
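If you'd like to double-check those two points, here's a quick Python search. Taking the cusp as $y^2 = x^3$ and substituting $y = -x$ from the line gives $x^2 - x^3 = x^2(1-x)$, which vanishes only at $x = 0$ and $x = 1$, so searching a small integer range finds everything:

```python
# A point lies in V(J) exactly when it satisfies both generators.
def on_both_curves(x, y):
    return x + y == 0 and y**2 - x**3 == 0

# Substituting y = -x reduces the system to x^2 (1 - x) = 0,
# so x = 0 and x = 1 are the only solutions.
print([(x, -x) for x in range(-5, 6) if on_both_curves(x, -x)])  # [(0, 0), (1, -1)]
```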
You can read more about ideals and varieties, their interactions, and the algorithms that make them nice to work with in [2].
So, back to our games. Let’s build the ideal $I$ that represents the game. The clues on the board are easiest to encode because if we know that the cell in row one, column three must be 3, we must ensure that points in $V(I)$ satisfy $a_{13} = 3$. Since we insist (by definition of $V(I)$) that the polynomial $x_{13} – a_{13} = 0$ when we evaluate it at any point in $V(I)$, this means $x_{13} – 3$ is in our set of clues. You can work out the rest of the clues’ polynomials similarly.
The rules of the game are a little more complicated to model, but let’s muddle through. We know that we want each cell to be one of the numbers 1, 2, 3, or 4, which means for each cell, we have a polynomial $(x_{ij} – 1)(x_{ij} – 2)(x_{ij} – 3)(x_{ij} – 4)$. I suppose I could have tried fiddling with my coefficient ring or solution space to work modulo 4, but some of the algebra I want to use requires that we are working over an algebraically closed field. So I’ve chosen to work over the complex numbers (although if you’re playing with the code, you’ll notice I just picked a “big enough” finite field so that Macaulay2 wouldn’t gripe at me!).
We also know that we want each row (respectively, column; doubly respectively, cage) to contain the numbers 1, 2, 3, and 4 exactly once. In the case of the first row, this means solutions satisfy $x_{11} + x_{12} + x_{13} + x_{14} – 10$ and $x_{11}x_{12}x_{13}x_{14} – 24$. Alas, $1 + 1 + 4 + 4 = 10$ satisfies the addition rule if we leave off the multiplication rule, and $2\cdot 2\cdot 2 \cdot 3 = 24$ satisfies the multiplication rule if we leave off the addition rule. If we leave off the 1-through-4 rules for each cell, we get fun complex solutions like $a_{11} = 1+\sqrt{-5}, a_{12} = 1-\sqrt{-5}, a_{13} = 4+2\sqrt{3}, a_{14} = 4-2\sqrt{3}$; it’s an exercise for the reader to make sure these satisfy both the addition and multiplication rules.
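For the skeptical (or lazy) reader, here's a quick numerical check in Python of one such exotic row filling that satisfies both the sum and product constraints (the values are chosen for illustration):

```python
import cmath

# One row filling, ignoring the "each cell is 1-4" rule: the exotic
# parts cancel in the sum, and the conjugate pairs multiply to
# 6 and 4 respectively, so the product is 24.
row = [1 + cmath.sqrt(-5), 1 - cmath.sqrt(-5),
       4 + 2 * 3 ** 0.5, 4 - 2 * 3 ** 0.5]
total = sum(row)
prod = 1
for a in row:
    prod *= a
print(abs(total - 10) < 1e-9, abs(prod - 24) < 1e-9)  # True True
```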
This set completely describes our game board: two rules for each row, two for each column, two for each cage, and a rule for each cell. Putting all these polynomials together along with the clues, and then closing them up under addition, subtraction, and multiplication by elements of $R$, we find the ideal $I$ we’re interested in. And the points in $V(I)$ are precisely the solutions to our game.
The connection between our ideal $I$ and our variety $V(I)$ is via one of the best theorems out there: Hilbert’s Nullstellensatz! One form of the Nullstellensatz states that, over an algebraically closed field, there’s a one-to-one correspondence between maximal ideals and varieties consisting of a single point.
Let’s think about this in the two-variable example. The variety $V(J)$ can be expressed a union of simpler varieties, $V(J) = \{(1,-1)\} \cup \{(0,0)\} = V(x-1,y+1) \cup V(x,y)$. In terms of the Nullstellensatz, the point $(1,-1)$ corresponds to the maximal ideal $\langle x-1,y+1\rangle$ and the point $(0,0)$ corresponds to the maximal ideal $\langle x,y \rangle$. We don’t quite have that $J$ is the intersection of these ideals, but it does satisfy $J \subseteq \langle x-1, y+1 \rangle \cap \langle x, y\rangle$. (Why not equality in general? If you think about varieties as sets of points, the points themselves don’t carry information about whether they’re single or double or triple roots of a polynomial. By working with ideals, we can factor polynomials and recover multiplicity information about roots that varieties aren’t fine enough to catch.)
Back to our game! If we’re looking for points that solve our game on the algebraic geometry side, we harness the power of the Nullstellensatz to look for all the maximal ideals that contain our ideal $I$ on the algebra side. And there’s an app for that! By app, I mean a technique called “primary decomposition” that can be done computationally. Emmy Noether played a big role in establishing the generality and usefulness of primary decompositions. She proved that in a Noetherian ring, every ideal can be decomposed as a finite intersection of primary ideals—giving us another kind of factorization. The curious reader will find primary decomposition covered in commutative algebra classics like [3] and [4].
If your Sudoku board doesn’t have a unique solution, calculating the primary decomposition of $I$ will let you find all the solutions—and if the board has incompatible clues, you’ll learn that through primary decomposition, too.
If you’re curious about the number of solutions to the empty board, you (by which I mean your computer or U. Melbourne’s) can calculate the primary decomposition of the game board with no clues to find that there are (spoiler!) 288 solutions to the blank board.
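You can corroborate that count without any commutative algebra at all. Here's a brute-force Python enumeration: each row must be a permutation of 1 through 4, so there are only $24^4$ grids to check.

```python
from itertools import permutations, product

def valid(grid):
    """Check the remaining shidoku constraints for a grid whose rows are
    already permutations of 1-4: every column and every 2x2 cage must
    also contain 1, 2, 3, 4 exactly once."""
    target = {1, 2, 3, 4}
    cols = [set(c) for c in zip(*grid)]
    cages = [{grid[r][c] for r in (i, i + 1) for c in (j, j + 1)}
             for i in (0, 2) for j in (0, 2)]
    return all(s == target for s in cols + cages)

count = sum(valid(grid)
            for grid in product(permutations((1, 2, 3, 4)), repeat=4))
print(count)  # 288
```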
These tools—ideals, varieties, primary decomposition (among others)—form your starter kit for solving lots of real-world problems. In fact, these tools are so powerful that the 2020 Mathematics Subject Classification newly includes “statistics on algebraic and topological structures” covering algebraic statistics (62R01), statistical aspects of big data/data science (62R07), and topological data analysis (62R40), as well as new classes in 14Q for computational aspects of algebraic geometry. Applied algebra is a bustling and booming research area! For a charming introduction to algebraic statistics applied to computational biology, pick up [5].
References:
[1] Arnold, Elizabeth; Lucas, Stephen; Taalman, Laura. Groebner Basis Representations of Sudoku. The College Mathematics Journal, Vol. 41, No. 2, March, 2010.
[2] Cox, David A.; Little, John; O’Shea, Donald. Ideals, Varieties, and Algorithms: an introduction to computational algebraic geometry and commutative algebra. Fourth edition. Undergraduate Texts in Mathematics. Springer, 2015.
[3] Atiyah, M. F.; Macdonald, I. G. Introduction to Commutative Algebra. Addison-Wesley Publishing Co., Reading, Mass.-London-Don Mills, Ont. 1969.
[4] Matsumura, Hideyuki. Commutative Ring Theory. Translated from the Japanese by M. Reid. Second edition. Cambridge Studies in Advanced Mathematics, 8. Cambridge University Press, Cambridge, 1989.
[5] Algebraic Statistics for Computational Biology. Edited by Lior Pachter and Bernd Sturmfels. Cambridge University Press, New York, 2005.
–
Author’s Note: I wrote the first draft of this blog post before the US Supreme Court decision that gutted abortion rights (kicking off a season of decisions and opinions that also upended climate regulation, tribal law, and more). I hope readers found Sara Stoudt’s column last month timely! I also wish all readers of this blog the privilege and comfort of finding solace in mathematics, pure or applied, during turbulent times.
If your eyebrows quirked a bit at seeing “abortion” in an AMS feature column, I urge you to read Karen Saxe’s excellent piece over at the MAA’s Math Values Master Blog, Mathematicians’ Case for Preserving the Right to Abortion. As a mathematician and soon-to-be first-time mom, I have a redoubled commitment to (and new appreciation for) a person’s right to choose when and if to carry a pregnancy to term.
Sara Stoudt
Bucknell University
Can data help support or refute claims of wrongdoing? Take a case about claimed hiring discrimination. What information would you want to know about a company’s hiring practices before you made a decision? Maybe you would want to compare the demographics of the application pool to the demographics of those actually employed by the company to see if there are any discrepancies. If you found any, you might investigate further to see if those discrepancies are too large to have happened just by chance.
Statistical ideas abound in Kimberlé Crenshaw’s 1989 paper discussing antidiscrimination court cases. This paper also first defined the term “intersectionality.” Intersectionality provides a framework to explain that different elements of a person’s identity combine to create privilege or pathways to discrimination. If we try to think about this statistically, it means that for a response variable of “how people treat you” there might be an interaction effect of, for example, race and sex, as well as an additive effect of those characteristics individually. For example, a Black man may be treated differently than a Black woman. Although this term is often used in social discourse, did you know it has its origins in a legal setting?
Kimberlé Crenshaw in 2018. Photo by the Heinrich Böll Foundation, CC-BY SA 2.0.
We will examine the first two cases discussed in Crenshaw’s work, and then a case brought up by Ajele and McGill in a further study of intersectionality in the law, to make some connections to statistical ideas. Full disclosure before moving forward: I do not have any legal training, just an interest in how statistics is used in the courtroom, so this is my interpretation of these court summaries. If you have extra insight to share about the legal process, please reach out!
While this post was in press, the United States Supreme Court made a decision to overturn Roe v. Wade. We acknowledge that reading about court case implications is particularly heavy at this time. We frame some of the statistical questions that these cases bring up as intellectual exercises to emphasize the way that small, seemingly abstract decisions can have huge impacts on millions of people’s lives.
In this case, a group of Black women alleged that General Motors’ system of using seniority as a factor in determining who was laid off during a recession continued the effects of past discrimination against Black women. Importantly, the court would not allow the class of “Black woman” to be protected but required the plaintiffs to argue a sex discrimination case or a race discrimination case, but not both. As Arehart points out, the use of “or” in the Civil Rights Act (protects against discrimination based on race, color, religion, sex, or nationality) has led courts to interpret this as a plaintiff needing to choose one characteristic to focus on in their case.
There are some details that led the plaintiffs in this case to choose to pursue a sex discrimination claim. It was revealed during the case that General Motors did not hire Black women before the Civil Rights Act of 1964, so when everyone hired after 1970 was laid off, that meant that Black women were more likely to have less seniority. However, because white women were hired before 1964, there was a large enough pool of women who were not laid off that the court decided there was not enough evidence to support sex discrimination in this policy.
What does this have to do with statistics? We can frame this situation as an example of Simpson’s Paradox. When employee outcomes were examined overall, there was no evidence of discrimination between men and women. However, if employee outcomes were to be further broken down by race, there would have been a very clear discrepancy between Black women and white women.
To look at it visually, there is a very narrow pathway towards remaining at the company for Black women (in purple) while there seems to be a reasonable pathway towards remaining at the company for all women (in red).
What if a plaintiff was allowed to combine identities in a claim? In this case, the plaintiff, a Black woman, alleged discrimination based on race and sex. The court then determined that because the claim was made as a Black woman the plaintiff could not represent all Black workers nor all female workers. This limited the pool of workers that could be used in the statistics supporting the discrimination claim. The plaintiff could not use data for all female workers to make an argument, nor could they use data for all Black workers to make an argument. Instead, they were left with the small number of Black women as their data pool with which they could make an argument.
By limiting the pool of people eligible to be included in an analysis, the power to detect a real discrimination effect decreases. Consider a null hypothesis that the company is not discriminating based on race and sex. The power to reject that hypothesis when it is actually false is related to the sample size of each group, making a small group size a limiting factor. The court’s decision effectively raised the probability of a false negative, i.e., falsely concluding that there was no discrimination when there actually was.
If we consider a simplified framing of this question and determine the difference between the proportion of Black women promoted and the proportion of non-Black women promoted, we can use the pwr R package to investigate the power to detect a difference in proportions with unequal sample sizes. Take this investigation by Seongyong Park. They find that if the proportion of those who were fired in two groups is 0.15 and 0.30 (one group is twice as likely to be fired as the other) and both groups have an equal number of people in them, the power to detect the difference is about 0.86. However, if one group is 10 times as large as the other, the power drops to 0.69. Go ahead and use this code to investigate other situations! What would it take for the power to drop to 0.5?
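For readers without R handy, here is a rough Python sketch of the same kind of calculation. The group sizes below (140 per group, then a roughly 10:1 split of the same total of 280) are my own illustrative choices, not Park's exact setup, so the numbers will not match theirs precisely.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_proportions(p1, p2, n1, n2):
    """Approximate power of a two-sided test for a difference in two
    proportions, using Cohen's arcsine effect size h (this mirrors
    what pwr.2p2n.test computes in R)."""
    h = abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))
    z_crit = 1.959964  # two-sided critical value for alpha = 0.05
    z = h / math.sqrt(1 / n1 + 1 / n2)
    return norm_cdf(z - z_crit) + norm_cdf(-z - z_crit)

# Balanced groups of 140 each: power is roughly 0.86.
balanced = power_two_proportions(0.30, 0.15, 140, 140)
# Same total of 280 people, split roughly 10:1: power drops sharply.
lopsided = power_two_proportions(0.30, 0.15, 255, 25)
print(round(balanced, 2), round(lopsided, 2))
```

The point survives any reasonable choice of sizes: holding the total pool fixed, the more lopsided the two groups, the lower the power to detect a real difference.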
Is there another way for a plaintiff to combine identities in a claim and still face a statistical challenge? In this case, a Black woman alleged that she was discriminated against due to her race and sex. The court did evaluate both race and sex claims, but it did so separately. The court found no evidence of race discrimination alone nor evidence of sex discrimination alone, but the interaction was not investigated.
What makes this case particularly interesting? I picked this case because of a footnote about the statistician expert witness. From the case overview:
Dr. Jane Harworth, Ph.D., an expert in statistical analysis, examined applicant flow and hiring data for 1975-1983 and performed a binomial distribution analysis. When the raw data involved small pools, she utilized the Fisher’s Exact Test, a more precise version of the Student T Test. Dr. Harworth testified that there is no statistical support for the allegations of the existence of a non-neutral policy or of a pattern or practice of discrimination against blacks or females. Her analysis showed that the actual numbers of black or female hirees were within the range of two standard deviations. She further observed that the success rate of blacks and females exceeded whites and males, respectively.
This is an interesting example of how statistics are explained to the court. Note the translation of Fisher’s Exact Test as “a more precise version of the Student T test” and the decision to focus on plus or minus two standard deviations. However, there is nothing technically preventing a Fisher’s Exact Test from being used to compare Black women to everyone else.
Time for another exercise for the reader! I don’t love the binary distinctions in these court case scenarios, so let’s pick a different set of categories to work with. Consider a population of 100 people who can prefer summer or winter and who can prefer vanilla or chocolate. I’m considering this information when determining who to be friends with. Can you design a situation where it does not look like I discriminate based on season preference nor flavor preference, yet it does look like I prefer to befriend a particular combination of season and flavor preference? Here’s a hint: what if the recession happened a little earlier in the Moore example such that anyone hired after 1964 was laid off? It might be useful to sketch a 2×2 table.
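Here is one worked instance of the exercise, with invented numbers (the 25-person cells and friendship counts are one hypothetical configuration among many):

```python
# 100 hypothetical people, 25 in each cell of a 2x2 table
# (season preference x flavor preference).  The counts of people I
# befriend are chosen so that both margins look perfectly fair.
befriended = {("summer", "vanilla"): 5,  ("summer", "chocolate"): 15,
              ("winter", "vanilla"): 15, ("winter", "chocolate"): 5}

summer = befriended[("summer", "vanilla")] + befriended[("summer", "chocolate")]
winter = befriended[("winter", "vanilla")] + befriended[("winter", "chocolate")]
vanilla = befriended[("summer", "vanilla")] + befriended[("winter", "vanilla")]
chocolate = befriended[("summer", "chocolate")] + befriended[("winter", "chocolate")]

# Every marginal rate is 20 out of 50: no apparent preference by
# season alone or by flavor alone...
print(summer, winter, vanilla, chocolate)  # 20 20 20 20
# ...yet within cells the rates are 5/25 vs. 15/25: a pure interaction
# between season and flavor, invisible in either margin.
```

Each margin shows a 40% friendship rate either way, while the cells tell a very different story.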
Decisions made about how to measure discrimination involve statistical decisions behind the scenes. In fact, Crenshaw points this out in a footnote:
“A central issue in a disparate impact case is whether the impact proved is statistically significant. A related issue is how the protected group is defined. In many cases a Black female plaintiff would prefer to use statistics which include white women and/or Black men to indicate the policy in question does in fact disparately affect the protected class. If, as in Moore, the plaintiff may use only statistics involving Black women, there may not be enough Black women employees to create a statistically significant sample.”
Thinking about statistical decisions in context as well as the implications of court precedent or practice in terms of statistical concepts can help us both refine our practice of statistics and consider the consequences of our work. Real people are impacted by data-driven decisions; we must recognize and bear the responsibility of that.
References and Resources
William Casselman
University of British Columbia
The game Wordle, currently found on the New York Times’ official Wordle site, can be played by anybody with internet access. It has become extremely popular, particularly among mathematicians and programmers. One programmer comments facetiously, “At the current rate, I estimate that by the end of 2022 ninety-nine percent of all new software releases will be Wordle clones.”
The point for us is that it raises intriguing questions about the nature of information, and offers good motivation for understanding the corresponding mathematical theory introduced by Claude Shannon in the 1940s.
When you visit the official Wordle web site, you will be faced with an empty grid that looks like this:
Every day a secret new word of five letters is chosen by the site, and you are supposed to guess what it is by typing candidates into successive rows, responding to hints offered by the site regarding them. For example, today (May 22, 2022) I began with the word slate, and the site colored it like this:
What this means is that the secret word of the day has no letters “S”, “L”, “A”, or “T”, but does have an “E” in some location other than the last. I next entered the word homer and got in response:
This means that the secret word does not contain at all either an “H” or an “R”; contains an “O” in the second place and an “E” in the fourth place; and contains an “M” somewhere other than the third place. With my next choice I was lucky:
This is a pretty good session—the average game is predicted to require about three and a half tries. So three is better than average, and most of the time you should get the answer in at most four.
I must now tell you the precise rules for coloring proposed answers. If neither the secret word nor your proposal has repeated letters, the rules are very simple: (1) a square is colored green if your letter and that of the secret word are the same at that location; (2) a square is colored yellow if the secret word does contain the corresponding letter somewhere, but not at this particular location; (3) the square is colored gray if the corresponding letter in your guess does not appear at all in the secret word.
But if there are repeated letters, there is some potential ambiguity that has to be dealt with. First of all, all exact coincidences are colored green, and as this is done those letters are removed from your guess and, internally, from the secret word. Under consideration there are now two words of length possibly less than five, between which there are no exact coincidences. The remaining guess is now scanned left to right. A location is colored yellow if its letter still occurs somewhere in the reduced secret word, in which case one copy of that letter is removed from the secret word; otherwise the location is colored gray.
For example, if the secret word is decoy and your guess is odder, since there is only one “D” in decoy it will be colored
Scanning left to right is an arbitrary choice, and scanning in the opposite direction would give a different coloring. One consequence of this rule and some other elementary reasoning is that some colorings, such as
can never occur.
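The two-pass coloring rule can be written out explicitly. Here is a sketch in Python, with “G”, “Y”, and “X” standing in for green, yellow, and gray:

```python
def score(guess, secret):
    """Color a five-letter guess against the secret word.

    Returns a string over {G, Y, X}: green (right letter, right place),
    yellow (letter occurs elsewhere), gray (letter absent or used up).
    """
    colors = ["X"] * 5
    remaining = []               # letters of the secret not matched green
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            colors[i] = "G"      # first pass: exact coincidences
        else:
            remaining.append(s)
    for i, g in enumerate(guess):  # second pass: scan left to right
        if colors[i] == "X" and g in remaining:
            colors[i] = "Y"
            remaining.remove(g)  # each secret letter is used at most once
    return "".join(colors)

print(score("odder", "decoy"))  # YYXYX: only the first D turns yellow
```

Note how the single “D” in decoy is consumed by the first “D” of odder, leaving the second one gray, exactly as described above.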
Incidentally, the NYT Wordle site contains a convenient keyboard displayed underneath the grid, which illustrates graphically how the coloring is to be interpreted.
There are a few more somewhat arbitrary things you should know about. The principal one is that the Wordle grid will accept only a proper subset of English words of five letters, and many of the ones it does accept will probably be unfamiliar to you, such as aahed, aalii, and aarti, as well as zygal, zygon, and zymic. If you submit a word that will not be accepted, the Wordle display will complain by shuddering. The current list of words that will not cause a shudder has 12,957 words on it, and is itself divided into two subsets—the list of the approximately 2,300 words which are in fact possible answers (i.e., made up of words which should be familiar) and the rest (compiled from a huge list of all possible words in English text), which can help you to find answers even though they are not themselves among the possible answers. (You might see on the Internet mentions of 12,972 acceptable submissions. This was the original number. The game changed a bit when the NYT took it over, eliminating a few of the accepted words that might offend some people.)
The Internet is full of advice on how best to play, but I’ll say very little about that. What is interesting to a mathematician is that many of the proposed strategies use the notions introduced by Claude Shannon to solve problems of communication and elucidate the notion of redundancy.
In fact, redundancy in English is what Wordle is all about. What do I mean by redundancy? Most succinctly, that not all sequences of five letters of the alphabet are English words. (One constraint is that the ones that do occur must be pronounceable.) In other words, certain sequences of letters cannot appear. For example, in standard English (although not in Wordle’s list of acceptable words, alas) a “q” is always followed by a “u”, so the pairs “qa”, “qb”, … are forbidden. The “u” following a “q” is therefore redundant. Another way of saying this is that “u” after “q” conveys no new information—it will not help to distinguish one word from another.
For an example more relevant to Wordle, suppose you find yourself facing the array
Just as “u” always comes after “q”, here the last letter must be either “d”, “e”, or “p”, making possible answers shard, share, or sharp. So here the final letter carries information—it picks out one of three possibilities—but not a lot, since the possible options are very limited.
These two examples illustrate the general fact: information is the resolution of uncertainty. The more uncertainty there is, the more information is conveyed by choosing one of the options. The mathematical theory of information initiated by Shannon makes this quantitative.
To understand the relation between information and mathematics, I’ll look now at a simpler guessing game, a variant of ‘twenty questions’. In this game, somebody chooses a random number $n$ such that $0 \le n \lt 2^{3} = 8$ (i.e., in the half-open range $[0, 8)$), and asks you to guess what it is. Any question you ask will be answered with only a ‘yes’ or ‘no’. How many questions might you have to ask?
The simplest strategy is to start with “Is it 0?” and continue with possibly 7 more. But this is certainly unnecessarily inefficient. The best way to phrase the optimal procedure is to express numbers in the range $[0, 8)$ in base $2$ notation. Thus $5$ is expressed as $101$, since $$ 5 = 1 + 0 \cdot 2 + 1 \cdot 4 . $$ The coefficients in such an expression are called bits. With this in mind, you have only to ask three questions: is the $i$-th bit equal to $0$? for $i = 0$, $1$, and $2$. Whether the answer is ‘yes’ or ‘no’, you gain one bit of information.
The drawback to this procedure is that you will never get by with fewer than three questions, whereas with a naive strategy you might be lucky and get it in one. Sure, but you are more likely to be unlucky! In the naive scheme the average number of guesses is $(1 + 2 + \cdots + 7 + 8)/8 = 4.5$, whereas in the other it is $3$. When the number of choices is $2^{3}$ the difference in the average number of questions asked is small, but if $2^{3}$ is replaced by $2^{20}$ the difference is huge.
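A quick check of the two strategies, using the counting convention in the text (the naive scheme spends $k+1$ questions on the number $k$):

```python
# A sketch of the two strategies for guessing a number in [0, 8).
def bisection_questions(secret, n_bits=3):
    """Ask one yes/no question per bit ("is the i-th bit 0?"):
    always exactly n_bits questions, and the secret is recovered."""
    guess = 0
    for i in range(n_bits):
        if secret & (1 << i):        # the answer to question i
            guess |= 1 << i
    return n_bits, guess

# Naive scheme: ask "Is it 0?", "Is it 1?", ... (k+1 questions for secret k).
naive_avg = sum(k + 1 for k in range(8)) / 8
print(naive_avg)                     # 4.5 questions on average

for secret in range(8):
    questions, guess = bisection_questions(secret)
    assert questions == 3 and guess == secret
```

The bisection scheme always uses exactly three questions, beating the naive average of 4.5.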
Very generally, a game with $2^{n}$ possible items may be played by asking only $n$ questions with ‘yes’ or ‘no’ answers and receiving in return $n$ bits of information. Another way to put this: a string of zeroes and ones of length $n$, chosen randomly, conveys $n$ bits of information. As a simple variant of this, suppose each item of the string is a number in the range $[0, 2^{k})$. Then a string of such numbers of length $n$ is equivalent to one of $nk$ bits, and hence conveys $nk$ bits of information.
But now Shannon posed the question: suppose the bits are not chosen randomly? Then less information will be conveyed, in general. Can we measure how much?
For example, suppose the string is of length $2n$, and may therefore be considered a string of $n$ numbers in the range $[0, 4)$. Suppose further that each digit $k$ is constrained to the range $[0, 3)$ (i.e., with $0 \le k \le 2$), so that in effect the string is the expression of a number in base $3$. I’ll call it a $3$-string. Instead of $4^{n}$ possible strings, there are only $3^{n}$, so that fewer than $2n$ bits of information can be conveyed. Assuming that the individual ‘digits’ are chosen randomly, how much information is conveyed by such a $3$-string?
The most fruitful answer tells what happens as $n$ becomes larger and larger. Large strings of $n$ integers in the range $[0, 3)$ can be compressed, and more efficiently compressed for large $n$. A single integer $0 \le k \le 2$ requires two bits to be expressed, while one in the range $[0, 9)$ requires $4$ bits, or twice as many. But $3^{3} = 27 \lt 2^{5} = 32$, so a string of $3$ such digits requires only $5$ bits instead of $6$. In general, if $$ 2^{m-1} \lt 3^{n} \lt 2^{m} $$ then more than $m-1$ bits are required to specify every $3$-string of length $n$, but $m$ bits suffice. We can find a formula for $m$, in fact, since extracting $n$-th roots gives us $$ 2^{m/n - 1/n} \lt 3 \lt 2^{m/n} . $$ Since we can write $3 = 2^{\log_{2} 3}$, this is equivalent to $$ { m\over n } - { 1 \over n } \lt \log_{2} 3 \lt { m \over n } , $$ so that for large $n$ we see that $m$ is approximately $n \log_{2} 3$. What Shannon says is that a random $3$-string of length $n$ carries $n \log_{2} 3$ bits of information. Since there are $3^{n}$ such strings, if these are chosen randomly each one has probability $p = 1/3^{n}$ of occurring. Shannon’s way of putting this becomes in general the recipe: an outcome occurring with probability $p$ conveys $$ \log_{2} { 1 \over p } $$ bits of information.
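This estimate is easy to check with exact integer arithmetic; the length $n = 100$ below is just an illustrative choice.

```python
import math

n = 100
m = math.ceil(n * math.log2(3))  # predicted number of bits, about n * 1.585
print(m)  # 159

# Verify 2^(m-1) < 3^n < 2^m: m bits are needed, and m bits suffice,
# to encode any 3-string of length n.  Python's big integers are exact.
assert 2 ** (m - 1) < 3 ** n < 2 ** m
```

So a hundred base-3 digits compress into 159 bits rather than the naive 200.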
Note that since $p \le 1$, we know that $\log_{2} \frac{1}{p} \ge 0$, so this is always non-negative. When $p = 1$ the event always takes place. There are no surprises. Sure enough, $\log_{2} 1 = 0$, so Shannon’s rule says that no information is conveyed. The rarer an event, the more surprising it is, and the more information it conveys: “Dog bites man” is nothing new, but “Man bites dog” is a different story.
Let’s look at one simple example. Suppose we are again playing ‘three questions’. You feel lucky and can’t resist blurting out, “Is the number 6?” If the answer is ‘yes’, you have acquired, as we have already seen, three bits of information. But how much information does a ‘no’ give you? All we can say immediately is that it isn’t much, because it has only reduced the number of possibilities from $8$ to $7$. Now, if we assume the answers are chosen at random, the probability of getting a ‘yes’ here is $p = 1/8$, so the probability of getting a ‘no’ is $1 - p = 7/8$. Shannon assigns to it $\log_{2} 8/7 \sim 0.193$ bits of information.
For us, a random event is one with a finite number of outcomes. Suppose the $i$-th outcome has probability $p_{i}$. If the event takes place a large number of times, what is the average amount of information seen? The $i$-th outcome has probability $p_{i}$, and the associated amount of information is $\log_{2} \frac{1}{p_{i}}$, so the expected average is
$$ \sum_{i} p_{i} \log_{2} { 1 \over p_{i} } . $$
Even the case $p_{i} =0$ is allowed, since
$$ \lim_{p\rightarrow 0} p \cdot \log_{2} { 1 \over p } = 0 . $$
This average is what Shannon calls the entropy of the event, measured in bits. If the event has two outcomes, with probabilities $p$ and $1-p$, the entropy is
$$ (1-p)\log_{2} { 1 \over 1-p } + p \log_{2} { 1 \over p } . $$
Its graph looks like this:
If $p=0$ or $p=1$ there is no probability involved, and no information. The maximum possible entropy occurs when $p=1-p = 1/2$. This is when the maximum uncertainty is present, and in general entropy is a measure of overall uncertainty.
This last remark remains true when any number of outcomes are involved:
That is to say, whenever $(p_{i})$ is any sequence of $n$ numbers $p_{i} \ge 0$ with $\sum p_{i} = 1$ then
$$ \sum_{i} p_{i} \log_{2} { 1 \over p_{i} } \le \log_{2} n . $$
This is immediately evident when $n=2$, since the graph of $y = \log x$ is concave downwards.
In general it can then be derived by mathematical induction on $n$.
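These facts are easy to confirm numerically; here is a minimal sketch of the entropy function:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits of a finite probability distribution."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# Two outcomes: entropy peaks at the fair-coin distribution p = 1/2.
print(entropy([0.5, 0.5]))   # 1.0 bit -- maximum uncertainty
print(entropy([0.9, 0.1]))   # about 0.47 bits
print(entropy([1.0, 0.0]))   # 0.0 bits -- no surprise, no information

# In general the entropy of n outcomes is bounded by log2(n),
# with equality exactly at the uniform distribution.
n = 5
assert abs(entropy([1 / n] * n) - math.log2(n)) < 1e-12
```

The skewed distribution carries less than half a bit per trial, exactly the gap that makes compression possible.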
What does this have to do with Wordle?
When you enter a candidate word into the Wordle display, the game replies by coloring your entry with one of three colors—it is giving you information about your proposal. But Wordle differs from 20 questions in that you can use this information to make your next guess. How much information is the game giving you? How can you best use this information to make your next submission?
The secret word is effectively a choice of one of the $2,309$ valid answers. Each of these presumably has equal probability. But the coloring reduces this number considerably—the true answer has to be one of those compatible with the coloring. For example, a few days ago I chose slate as my initial guess, and got a coloring
I can scan through all possible answers and check which ones would give me this coloring. It happens that my choice was very bad, since there were $164$ words that do this. We can see exactly how bad this is by making a graph like the following:
This graph was constructed in the following way: I scanned through all of the 2,309 possible answers and computed the coloring each would give. I used this to list for each color all of the compatible answers. I made a list of the sizes for each color, and then sorted the list by magnitude. For example, the largest count was 221, corresponding to all grays. But the second highest was the one I got, at 164 (marked on the graph). As you can see, there was a very large reservoir of things I might have hoped for.
Could I have made a better choice of first guess? As should be clear from the above, each possible first guess gives me a graph like the one above. What I would like to see is a graph with a somewhat uniform height, for which the likelihood of narrowing my choices down is large. I display below a few graphs for other first guesses.
It turns out that it is impossible to get a uniform height, but that some choices do much better than others. The point is that the uniformity is maximized if the entropy of a certain probability distribution is maximized. Every choice of a starting word assigns a color to every possible answer. These colors partition the set of possible answers, and if $n_{i}$ answers give rise to the same color $i$ then $\sum n_{i} = 2309 = n$. I set $p_{i} = n_{i}/n$. It is the entropy of this probability distribution that is displayed in the graphs above, and you can see that a choice is better if the entropy is relatively large.
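The entropy of a first guess can be computed by partitioning the answer list according to the coloring pattern each answer would produce. The sketch below is self-contained, with a tiny made-up answer list standing in for the real 2,309-word list:

```python
import math
from collections import Counter

def score(guess, secret):
    """Wordle coloring: G = green, Y = yellow, X = gray."""
    colors, remaining = ["X"] * 5, []
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            colors[i] = "G"
        else:
            remaining.append(s)
    for i, g in enumerate(guess):
        if colors[i] == "X" and g in remaining:
            colors[i] = "Y"
            remaining.remove(g)
    return "".join(colors)

def guess_entropy(guess, answers):
    """Entropy (in bits) of the partition a guess induces on the answers."""
    counts = Counter(score(guess, answer) for answer in answers)
    n = len(answers)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# A toy answer list; the real computation runs over all 2,309 answers.
answers = ["sharp", "shard", "share", "decoy", "homer", "pedal", "crane"]
for guess in ["slate", "crane"]:
    print(guess, round(guess_entropy(guess, answers), 3))
```

A guess whose colorings split the answer list into many small, roughly equal cells has high entropy, which is exactly the "uniform height" sought in the graphs.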
The idea of using entropy to find optimal ways of playing Wordle has proliferated on the Internet. Many have used entropy to make a best first guess; among these are crane, which is the one used by the official NYT advisors, and slate. Very few of these add something new, and most seem to be just taking their ideas from the video of Grant Sanderson that I mention below. (That seems to be the original entropy-based investigation.)
I don’t want to add to this literature, but I do want to discuss the question of best second choices, about which less is said. It is a relatively simple calculation to list all possible answers that will color a first guess in a given way. For example, as I have already mentioned, my choice of slate above came up with 164 possibilities. This is a severe reduction of the original 2,309. But one of the quirks of Wordle is that choosing your second guess from this list might not be best. For example, if you get the coloring
you know that the secret word must be sharp, shard, or share. The obvious thing to do then is to try these, one by one. However, that might take three tries, while a careful choice of some totally different word (for example pedal) will give it to you in two tries by eliminating two of the three possibilities.
Absolutely optimal strategies for Wordle are now known and posted on the Internet. But these miss the real point—I’d like to see more theoretical discussion of exactly what Wordle’s colorings tell you.
I have ‘borrowed’ several ideas from this well-known and impressive YouTube video.