**Courtney Gibbons**

**Hamilton College**

My interest in applied algebra was a long time coming. I’m not exactly a fan of living in reality, so the idea of taking something so lovely (to me) as algebra and applying it to things like disease modeling or phylogenetic trees seemed too, well, *real*. That’s not to say I didn’t realize how important these applications are, but those weren’t applications that captured my interest or imagination—at least, not right away!

Around 2010, I came across a paper by Elizabeth Arnold, Stephen Lucas, and Laura Taalman that described how to use algebra (particularly Groebner bases) to solve Sudoku (see reference [1]). The principles are the same as in many “real world” applied algebra settings, and thanks to this paper, I started to get interested in using algebra to help better understand the world we live in. At the same time, the 2010 Mathematics Subject Classification added “applications of commutative algebra (e.g., to statistics, control theory, optimization, etc.)” (13P25) to its list.

Today’s blog post will walk through some of the commutative algebra and algebraic geometry involved in solving a 4 by 4 Sudoku puzzle (called “shidoku”). Interested readers can play along by downloading this Macaulay2 file, shidoku.m2, and running computations on the University of Melbourne’s Macaulay2 web server: https://www.unimelb-macaulay2.cloud.edu.au/#home.

First off, let’s review the rules of the game. In a 4 by 4 grid, there are sixteen cells arranged into four rows, four columns, and four 2 by 2 cages that each take up a corner of the square.

Each cell can be populated with the numbers 1, 2, 3, or 4 in such a way that each row, column, and cage contains all four numbers exactly once.

Those rules describe everything you need to play on an empty board, but a board usually already has some cells filled in. Give it a whirl:

When you’ve finished, play it again—but get rid of the blue 4 and see how many solutions you can find. When you finish that one, put the blue 4 back, add an additional 4 to the fourth row and second column, and see what happens.

Regarding the variations, I’m pretty sure the etiquette of puzzle creation insists that a “good” puzzle has a unique solution—but bear with me! I promise I’m breaking the rules of etiquette for a good reason! Anyway, given how quickly you can mentally solve these puzzles, the natural question is: why bother with algebra? Aside from the obvious (and somewhat cheeky) answer, *why NOT?*, I often find it useful to understand mathematical ideas by seeing them applied to a situation I know pretty well. In this case, it’s solving a small logic puzzle.

The first step in applying algebra to a puzzle is figuring out how to model the game and its rules. If we take a big-picture look at what we’re doing, we’re trying to find a very special point in sixteen-dimensional space that satisfies the rules of the game. So, we could imagine setting up a bunch of polynomials in sixteen variables, one variable for each cell, that describe the rules and the clues so that a (hopefully unique!) zero of all the polynomials represents a (hopefully unique!) solution to our puzzle.

In other words, we’re going to start with the polynomial ring $\mathbb{C}[x_{11}, x_{12}, \ldots, x_{44}]$, and we’re going to make an ideal $I$ that represents our puzzle. Solutions to our puzzle will belong to the set $V(I) = \{(a_{11},a_{12},\ldots,a_{44}) \, : \, a_{ij} \in \mathbb{C} \text{ and } f(a_{11},a_{12},\ldots,a_{44}) = 0 \, \forall \, f \in I\}$.

As a reminder—or a teaser—an *ideal* $I$ in a (commutative) ring $R$ is a nonempty set that is closed under addition, subtraction, and the “multiplicative absorption” property (less poetically called “scalar multiplication” by some): for all $a$ in $I$ and $r \in R$, $ar \in I$. Its partner concept from geometry is that of a *variety*: the set of points at which all the polynomials in $I$ vanish simultaneously.

Here’s a little example in a more familiar setting. Consider the polynomials $x+y$ and $y^2 - x^3$. They generate an ideal, $J = \langle x+y, y^2-x^3 \rangle$, which includes all linear combinations (with “scalars” from $\mathbb{R}[x,y]$) of the generators. Setting both generators equal to zero, we plot a line and a cusp, and their simultaneous solutions are the points $(1,-1)$ and $(0,0)$ in $\mathbb{R}^2$. That means $V(J) = \{(1,-1),(0,0)\}$.
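If you’d like to see this numerically, here is a quick Python sanity check (my own sketch, not part of the shidoku.m2 file), taking the line $x + y$ and the cusp $y^2 - x^3$ as the generators:

```python
# Test candidate points against both generators: a common zero of the
# line x + y and the cusp y^2 - x^3 belongs to V(J).
def line(x, y):
    return x + y

def cusp(x, y):
    return y**2 - x**3

candidates = [(1, -1), (0, 0), (2, -2)]  # (2, -2) lies on the line only
common_zeros = [p for p in candidates if line(*p) == 0 and cusp(*p) == 0]
print(common_zeros)  # [(1, -1), (0, 0)]
```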

You can read more about ideals and varieties, their interactions, and the algorithms that make them nice to work with in [2].

So, back to our games. Let’s build the ideal $I$ that represents the game. The clues on the board are easiest to encode because if we know that the cell in row one, column three must be 3, we must ensure that points in $V(I)$ satisfy $a_{13} = 3$. Since we insist (by definition of $V(I)$) that the polynomial $x_{13} - a_{13} = 0$ when we evaluate it at any point in $V(I)$, this means $x_{13} - 3$ is in our set of clues. You can work out the rest of the clues’ polynomials similarly.

The rules of the game are a little more complicated to model, but let’s muddle through. We know that we want each cell to be one of the numbers 1, 2, 3, or 4, which means for each cell, we have a polynomial $(x_{ij} - 1)(x_{ij} - 2)(x_{ij} - 3)(x_{ij} - 4)$. I suppose I could have tried fiddling with my coefficient ring or solution space to work modulo 4, but some of the algebra I want to use requires that we are working over an algebraically closed field. So I’ve chosen to work over the complex numbers (although if you’re playing with the code, you’ll notice I just picked a “big enough” finite field so that Macaulay2 wouldn’t gripe at me!).

We also know that we want each row (respectively, column; doubly respectively, cage) to contain the numbers 1, 2, 3, and 4 exactly once. In the case of the first row, this means solutions satisfy $x_{11} + x_{12} + x_{13} + x_{14} - 10$ and $x_{11}x_{12}x_{13}x_{14} - 24$. Alas, $1 + 1 + 4 + 4 = 10$ satisfies the addition rule if we leave off the multiplication rule, and $2\cdot 2\cdot 2 \cdot 3 = 24$ satisfies the multiplication rule if we leave off the addition rule. If we leave off the 1-through-4 rules for each cell, we get fun complex solutions like $a_{11} = 4+2\sqrt{-2}, a_{12} = 4-2\sqrt{-2}, a_{13} = 1, a_{14} = 1$; it’s an exercise for the reader to make sure these satisfy the addition and multiplication rules.
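Here is a quick numeric check of these row rules (my own sketch; the complex quadruple $4 \pm 2\sqrt{-2}, 1, 1$ is one family I chose that satisfies both rules at once):

```python
import cmath
from functools import reduce

def prod(vals):
    """Product of a sequence of numbers."""
    return reduce(lambda a, b: a * b, vals, 1)

add_only = (1, 1, 4, 4)   # sum is 10, but the product is 16, not 24
mul_only = (2, 2, 2, 3)   # product is 24, but the sum is 9, not 10
r = cmath.sqrt(-2)        # a square root of -2
complex_row = (4 + 2 * r, 4 - 2 * r, 1, 1)  # satisfies BOTH row rules

print(sum(add_only), prod(add_only))  # 10 16
print(sum(mul_only), prod(mul_only))  # 9 24
print(abs(sum(complex_row) - 10) < 1e-9, abs(prod(complex_row) - 24) < 1e-9)  # True True
```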

This set completely describes our game board: two rules for each row, two for each column, two for each cage, and a rule for each cell. Putting all these polynomials together along with the clues, and then closing them up under addition, subtraction, and multiplication by elements of $R$, we find the ideal $I$ we’re interested in. And the points in $V(I)$ are precisely the solutions to our game.

The connection between our ideal $I$ and our variety $V(I)$ is via one of the best theorems out there: Hilbert’s Nullstellensatz! One form of the Nullstellensatz states that, over an algebraically closed field, there’s a one-to-one correspondence between maximal ideals and varieties consisting of a single point.

Let’s think about this in the two-variable example. The variety $V(J)$ can be expressed as a union of simpler varieties, $V(J) = \{(1,-1)\} \cup \{(0,0)\} = V(x-1,y+1) \cup V(x,y)$. In terms of the Nullstellensatz, the point $(1,-1)$ corresponds to the maximal ideal $\langle x-1,y+1\rangle$ and the point $(0,0)$ corresponds to the maximal ideal $\langle x,y \rangle$. We don’t quite have that $J$ is the intersection of these ideals, but it *does* satisfy $J \subseteq \langle x-1, y+1 \rangle \cap \langle x, y\rangle$. (Why not equality in general? If you think about varieties as sets of points, the points themselves don’t carry information about whether they’re single or double or triple roots of a polynomial. By working with ideals, we can factor polynomials and recover multiplicity information about roots that varieties aren’t fine enough to catch.)

Back to our game! If we’re looking for points that solve our game on the algebraic geometry side, we harness the power of the Nullstellensatz to look for all the maximal ideals that contain our ideal $I$ on the algebra side. And there’s an app for that! By app, I mean a technique called “primary decomposition” that can be done computationally. Emmy Noether played a big role in establishing the generality and usefulness of primary decompositions. She proved that in a Noetherian ring, every ideal can be decomposed as a finite intersection of primary ideals—giving us another kind of factorization. The curious reader will find primary decomposition covered in commutative algebra classics like [3] and [4].

If your Sudoku board doesn’t have a unique solution, calculating the primary decomposition of $I$ will let you find *all* the solutions—and if the board has incompatible clues, you’ll learn that through primary decomposition, too.

If you’re curious about the number of solutions to the empty board, you (by which I mean your computer or U. Melbourne’s) can calculate the primary decomposition of the game board with no clues to find that there are (spoiler!) 288 solutions to the blank board.
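If you’d rather not fire up Macaulay2, a brute-force count in Python (my own sketch, not the primary-decomposition computation) confirms the number of blank-board solutions:

```python
def cage(r, c):
    """Index of the 2x2 cage containing cell (r, c)."""
    return (r // 2) * 2 + (c // 2)

def count_solutions(clues=None):
    """Count shidoku completions by backtracking; clues maps (r, c) -> value."""
    clues = clues or {}
    board = [[0] * 4 for _ in range(4)]

    def ok(r, c, v):
        if any(board[r][j] == v for j in range(4)):   # row rule
            return False
        if any(board[i][c] == v for i in range(4)):   # column rule
            return False
        for i in range(4):                            # cage rule
            for j in range(4):
                if cage(i, j) == cage(r, c) and board[i][j] == v:
                    return False
        return True

    def fill(k):
        if k == 16:
            return 1
        r, c = divmod(k, 4)
        vals = [clues[(r, c)]] if (r, c) in clues else [1, 2, 3, 4]
        total = 0
        for v in vals:
            if ok(r, c, v):
                board[r][c] = v
                total += fill(k + 1)
                board[r][c] = 0
        return total

    return fill(0)

print(count_solutions())  # 288: every completion of the empty board
```

Passing a dictionary of clues (say, `{(0, 2): 3}` for the 3 in row one, column three) prunes the count accordingly; by symmetry among the four values, a single clue cuts the 288 down to exactly 72.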

These tools—ideals, varieties, primary decomposition (among others)—form your starter kit for solving lots of real-world problems. In fact, these tools are so powerful that the 2020 Mathematics Subject Classification newly includes “statistics on algebraic and topological structures” covering algebraic statistics (62R01), statistical aspects of big data/data science (62R07), and topological data analysis (62R40), as well as new classes in 14Q for computational aspects of algebraic geometry. Applied algebra is a bustling and booming research area! For a charming introduction to algebraic statistics applied to computational biology, pick up [5].

**References:**

[1] Arnold, Elizabeth; Lucas, Stephen; Taalman, Laura. *Groebner Basis Representations of Sudoku*. The College Mathematics Journal, Vol. 41, No. 2, March, 2010.

[2] Cox, David A.; Little, John; O’Shea, Donald. *Ideals, Varieties, and Algorithms: an introduction to computational algebraic geometry and commutative algebra.* Fourth edition. Undergraduate Texts in Mathematics. Springer, 2015.

[3] Atiyah, M. F.; Macdonald, I. G. *Introduction to Commutative Algebra*. Addison-Wesley Publishing Co., Reading, Mass.-London-Don Mills, Ont. 1969.

[4] Matsumura, Hideyuki. *Commutative Ring Theory*. Translated from the Japanese by M. Reid. Second edition. Cambridge Studies in Advanced Mathematics, 8. Cambridge University Press, Cambridge, 1989.

[5] *Algebraic Statistics for Computational Biology*. Edited by Lior Pachter and Bernd Sturmfels. Cambridge University Press, New York, 2005.

–

**Author’s Note:** I wrote the first draft of this blog post before the US Supreme Court decision that gutted abortion rights (kicking off a season of decisions and opinions that also upended climate regulation, tribal law, and more). I hope readers found Sara Stoudt’s column last month timely! I also wish all readers of this blog the privilege and comfort of finding solace in mathematics, pure or applied, during turbulent times.

*If your eyebrows quirked a bit at seeing “abortion” in an AMS feature column, I urge you to read Karen Saxe’s excellent piece over at the MAA’s Math Values Master Blog, Mathematicians’ Case for Preserving the Right to Abortion. As a mathematician and soon-to-be first-time mom, I have a redoubled commitment to (and new appreciation for) a person’s right to choose when and if to carry a pregnancy to term.*

**Sara Stoudt**

**Bucknell University**

Can data help support or refute claims of wrongdoing? Take a case about claimed hiring discrimination. What information would you want to know about a company’s hiring practices before you made a decision? Maybe you would want to compare the demographics of the application pool to the demographics of those actually employed by the company to see if there are any discrepancies. If you found any, you might investigate further to see if those discrepancies are too large to have happened just by chance.

Statistical ideas abound in Kimberlé Crenshaw’s 1989 paper discussing antidiscrimination court cases. This paper also first defined the term “intersectionality.” Intersectionality provides a framework to explain that different elements of a person’s identity combine to create privilege or pathways to discrimination. If we try to think about this statistically, it means that for a response variable of “how people treat you” there might be an interaction effect of, for example, race and sex, as well as an additive effect of those characteristics individually. For example, a Black man may be treated differently than a Black woman. Although this term is often used in social discourse, did you know it has its origins in a legal setting?

Kimberlé Crenshaw in 2018. Photo by the Heinrich Böll Foundation, CC-BY SA 2.0.

We will consider the first two cases considered in Crenshaw’s work and then a case brought up by Ajele and McGill in a further study of intersectionality in the law to make some connections to statistical ideas. Full disclosure before moving forward: I do not have any legal training, just an interest in how statistics is used in the courtroom, so this is my interpretation of these court summaries. If you have extra insight to share about the legal process, please reach out!

While this post was in press, the United States Supreme Court made a decision to overturn Roe v. Wade. We acknowledge that reading about court case implications is particularly heavy at this time. We frame some of the statistical questions that these cases bring up as intellectual exercises to emphasize the way that small, seemingly abstract decisions can have huge impacts on millions of people’s lives.

In this case, *DeGraffenreid v. General Motors*, a group of Black women alleged that General Motors’ system of using seniority as a factor in determining who was laid off during a recession continued the effects of past discrimination against Black women. Importantly, the court would not allow the class of “Black woman” to be protected but required the plaintiffs to argue a sex discrimination case or a race discrimination case, but not both. As Areheart points out, the use of “or” in the Civil Rights Act (protects against discrimination based on race, color, religion, sex, *or* national origin) has led courts to interpret this as a plaintiff needing to choose one characteristic to focus on in their case.

There are some details that led the plaintiffs in this case to choose to pursue a sex discrimination claim. It was revealed during the case that General Motors did not hire Black women before the Civil Rights Act of 1964, so when everyone hired after 1970 was laid off, that meant that Black women were more likely to have less seniority. However, because white women were hired before 1964, there was a large enough pool of women who were not laid off that the court decided there was not enough evidence to support sex discrimination in this policy.

What does this have to do with statistics? We can formulate this situation into an example of Simpson’s Paradox. When employee outcomes were examined overall, there was no evidence of discrimination between men and women. However, if employee outcomes were to be further broken down by race, there would have been a very clear discrepancy between the Black women and white women.

To look at it visually, there is a very narrow pathway towards remaining at the company for Black women (in purple) while there seems to be a reasonable pathway towards remaining at the company for all women (in red).
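To make the aggregation effect concrete, here is a toy Python calculation with entirely made-up counts (my own illustration, not data from the actual case): retention looks comparable by sex overall, but splitting women by race reveals a stark gap.

```python
# Hypothetical retention counts after a layoff, chosen for illustration.
retained = {"men": 160, "white_women": 81, "black_women": 1}
employed = {"men": 200, "white_women": 90, "black_women": 10}

def rate(*groups):
    """Fraction retained across the named groups combined."""
    return sum(retained[g] for g in groups) / sum(employed[g] for g in groups)

print(f"men:         {rate('men'):.2f}")                         # 0.80
print(f"all women:   {rate('white_women', 'black_women'):.2f}")  # 0.82
print(f"white women: {rate('white_women'):.2f}")                 # 0.90
print(f"black women: {rate('black_women'):.2f}")                 # 0.10
```

Aggregated by sex alone, women fare no worse than men here; disaggregated, Black women fare far worse than anyone.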

What if a plaintiff was allowed to combine identities in a claim? In this case, *Moore v. Hughes Helicopter*, the plaintiff, a Black woman, alleged discrimination based on race *and* sex. The court then determined that because the claim was made as a Black woman, the plaintiff could not represent all Black workers nor all female workers. This limited the pool of workers that could be used in the statistics supporting the discrimination claim. The plaintiff could not use data for all female workers to make an argument, nor could they use data for all Black workers to make an argument. Instead, they were left with the small number of Black women as their data pool with which they could make an argument.

By limiting the pool of people eligible to be included in an analysis, the power to detect a real discrimination effect decreases. Consider a null hypothesis that the company is not discriminating based on race and sex. The power to reject that hypothesis when it is actually false is related to the sample size of each group, making a small group size a limiting factor. The court’s decision effectively raised the probability of a false negative, i.e., falsely concluding that there was no discrimination when there actually was.

If we consider a simplified framing of this question and determine the difference between the proportion of Black women promoted and the proportion of non-Black women promoted, we can use the *pwr* R package to investigate the power to detect a difference in proportions with unequal sample sizes. Take this investigation by Seongyong Park. They find that if the proportion of those who were fired in the two groups is 0.15 and 0.30 (one group is twice as likely to be fired as the other) and both groups have an equal number of people in them, the power to detect the difference is about 0.86. However, if one group is 10 times as large as the other, the power drops to 0.69. Go ahead and use this code to investigate other situations! What would it take for the power to drop to 0.5?
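If R’s *pwr* package isn’t handy, the same normal-approximation power calculation can be sketched in plain Python. The sample sizes below (280 workers total, split evenly versus roughly 10-to-1) are my own choices for illustration, so the exact numbers differ from Park’s, but the qualitative drop is the same:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_props(p1, p2, n1, n2):
    """Approximate power of a two-sided, alpha = 0.05 test comparing two
    proportions, using Cohen's effect size h and a normal approximation
    (essentially the recipe behind pwr.2p2n.test)."""
    h = abs(2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1)))
    z_crit = 1.959964  # Phi^{-1}(0.975)
    return norm_cdf(h / math.sqrt(1 / n1 + 1 / n2) - z_crit)

# 280 workers total: an even split versus a roughly 10-to-1 split.
print(round(power_two_props(0.15, 0.30, 140, 140), 2))  # 0.86
print(round(power_two_props(0.15, 0.30, 255, 25), 2))   # 0.41
```

With the total sample fixed, shrinking one group from half the pool to a tenth of it cuts the power roughly in half: the small group is the limiting factor.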

Is there another way for a plaintiff to combine identities in a claim and still face a statistical challenge? In this case, a Black woman alleged that she was discriminated against due to her race and sex. The court did evaluate both race and sex claims, but it did so separately. The court found no evidence of race discrimination alone nor evidence of sex discrimination alone, but the interaction was not investigated.

What makes this case particularly interesting? I picked this case because of a footnote about the statistician expert witness. From the case overview:

Dr. Jane Harworth, Ph.D., an expert in statistical analysis, examined applicant flow and hiring data for 1975-1983 and performed a binomial distribution analysis. When the raw data involved small pools, she utilized the Fisher’s Exact Test, a more precise version of the Student T Test. Dr. Harworth testified that there is no statistical support for the allegations of the existence of a non-neutral policy or of a pattern or practice of discrimination against blacks or females. Her analysis showed that the actual numbers of black or female hirees were within the range of two standard deviations. She further observed that the success rate of blacks and females exceeded whites and males, respectively.

This is an interesting example of how statistics are explained to the court. Note the translation of Fisher’s Exact Test as “a more precise version of the Student T test” and the decision to focus on plus or minus two standard deviations. However, there is nothing technically preventing a Fisher’s Exact Test from being used to compare Black women to everyone else.
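For the curious, the hypergeometric computation behind a (one-sided) Fisher’s exact test fits in a few lines of Python; the 2×2 table in the example is made up for illustration:

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    probability of a top-left count as small as `a`, given the row and
    column totals (a hypergeometric tail sum)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)
    p = 0.0
    for k in range(0, a + 1):
        if row1 - k <= n - col1 and k <= col1:
            p += comb(col1, k) * comb(n - col1, row1 - k) / denom
    return p

# Toy table: 0 of 5 people in one group retained vs 5 of 5 in the other.
print(fisher_exact_one_sided(0, 5, 5, 0))  # 1/252, about 0.004
```

Note that nothing in the computation cares how the two groups were defined; comparing Black women against everyone else is just another choice of table.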

Time for another exercise for the reader! I don’t love the binary distinctions in these court case scenarios, so let’s pick a different set of categories to work with. Consider a population of 100 people who can prefer summer or winter and who can prefer vanilla or chocolate. I’m considering this information when determining who to be friends with. Can you design a situation where it does not look like I discriminate based on season preference nor flavor preference, yet it does look like I prefer to befriend a particular combination of season *and* flavor preference? Here’s a hint: what if the recession happened a little earlier in the *Moore* example such that anyone hired after 1964 was laid off? It might be useful to sketch a 2×2 table.

Decisions made about how to measure discrimination involve statistical decisions behind the scenes. In fact, Crenshaw points this out in a footnote:

“A central issue in a disparate impact case is whether the impact proved is statistically significant. A related issue is how the protected group is defined. In many cases a Black female plaintiff would prefer to use statistics which include white women and/or Black men to indicate the policy in question does in fact disparately affect the protected class. If, as in *Moore*, the plaintiff may use only statistics involving Black women, there may not be enough Black women employees to create a statistically significant sample.”

Thinking about statistical decisions in context as well as the implications of court precedent or practice in terms of statistical concepts can help us both refine our practice of statistics and consider the consequences of our work. Real people are impacted by data-driven decisions; we must recognize and bear the responsibility of that.

References and Resources

- “The intersectionality wars” by Jane Coaston
- “Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory, and Antiracist Politics” by Kimberle Crenshaw
- “Intersectionality and Identity: Revisiting a Wrinkle in Title VII” by Bradley Allan Areheart
- “Intersectionality in Law and Legal Context” by Grace Ajele and Jena McGill
- Try searching for “t-test” within court documents.

**William Casselman**

**University of British Columbia**

The game **Wordle**, which is found currently on the *New York Times* official Wordle site, can be played by anybody with internet access. It has become extremely popular in recent times, particularly among mathematicians and programmers. One programmer comments facetiously, “At the current rate, I estimate that by the end of 2022 ninety-nine percent of all new software releases will be Wordle clones.”

The point for us is that it raises intriguing questions about the nature of information, and offers good motivation for understanding the corresponding mathematical theory introduced by Claude Shannon in the 1940s.

When you visit the official Wordle web site, you will be faced with an empty grid that looks like this:

Every day a secret new word of five letters is chosen by the site, and you are supposed to guess what it is by typing candidates into successive rows, responding to hints offered by the site regarding them. For example, today (May 22, 2022) I began with the word **slate**, and the site colored it like this:

What this means is that the secret word of the day has no letters “S”, “L”, “A”, or “T”, but does have an “E” in some location other than the last. I next entered the word **homer** and got in response:

This means that the secret word does not contain at all either an “H” or an “R”; contains an “O” in the second place and an “E” in the fourth place; and contains an “M” somewhere other than the third place. With my next choice I was lucky:

This is a pretty good session—the average game is predicted to require about three and a half tries. So three is better than average, and most of the time you should get the answer in at most four.

I must now tell you the precise rules for coloring proposed answers. If neither the secret word nor your proposal has repeated letters, the rules are very simple: (1) a square is colored green if your letter and that of the secret word are the same at that location; (2) a square is colored yellow if the secret word does contain the corresponding letter somewhere, but not at this particular location; (3) the square is colored gray if the corresponding letter in your guess does not appear at all in the secret word.

But if there are repeated letters, there is some potential ambiguity that has to be dealt with. First of all, all exact coincidences are colored green, and as this is done they are removed from your guess and, internally, from the secret word. Under consideration there are now two words of length possibly less than five, for which there are no exact coincidences. The remaining guess is now scanned *left to right*. A location is colored yellow if its letter occurs somewhere in the remaining secret word, and that letter is then removed from the secret word. A location is colored gray if its letter now occurs nowhere in the remaining secret word.

For example, if the secret word is **decoy** and your guess is **odder**, since there is only one “D” in **decoy** it will be colored

Scanning left to right is an arbitrary choice, and scanning in the opposite direction would give a different coloring. One consequence of this rule and some other elementary reasoning is that some colorings, such as

or

can never occur.
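The repeated-letter rules above can be sketched in a few lines of Python (my own reading of the rules, with “G” for green, “Y” for yellow, and “-” for gray):

```python
def score(guess, secret):
    """Color a five-letter guess against the secret word: greens first,
    then a left-to-right scan for yellows, consuming matched letters of
    the secret so that repeated letters are handled correctly."""
    colors = ["-"] * 5           # '-' stands for gray
    remaining = []               # secret letters not matched by a green
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            colors[i] = "G"      # green
        else:
            remaining.append(s)
    for i, g in enumerate(guess):
        if colors[i] == "G":
            continue
        if g in remaining:
            colors[i] = "Y"      # yellow
            remaining.remove(g)  # consume one copy of the letter
    return "".join(colors)

print(score("odder", "decoy"))  # YY-Y-  (only the first 'd' is yellow)
print(score("slate", "slate"))  # GGGGG
```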

Incidentally, the NYT Wordle site contains a convenient keyboard displayed underneath the grid, which illustrates graphically how the coloring is to be interpreted.

There are a few more somewhat arbitrary things you should know about. The principal one is that the Wordle grid will accept only a proper subset of English words of five letters, and many of the ones it does accept will probably be unfamiliar to you, such as **aahed**, **aalii**, and **aarti**, as well as **zygal**, **zygon**, and **zymic**. If you submit a word that will not be accepted, the Wordle display will complain by shuddering. The current list of words that will not cause a shudder has 12,957 words on it, and is itself divided into two subsets—the list of the approximately 2,300 words which are in fact possible answers (i.e., made up of words which should be familiar) and the rest (compiled from a huge list of all possible words in English text), which can help you to find answers even though they are not themselves among the possible answers. (You might see on the Internet mentions of 12,972 acceptable submissions. This was the original number. The game changed a bit when the NYT took it over, eliminating a few of the accepted words that might offend some people.)

The Internet is full of advice on how best to play, but I’ll say very little about that. What is interesting to a mathematician is that many of the proposed strategies use the notions introduced by Claude Shannon to solve problems of communication and elucidate the notion of redundancy.

In fact, redundancy in English is what Wordle is all about. What do I mean by redundancy? Most succinctly, that not all sequences of five letters of the alphabet are English words. (One constraint is that the ones that do occur must be pronounceable.) In other words, certain sequences of letters cannot appear. For example, in standard English (although not in Wordle’s list of acceptable words, alas) a “q” is always followed by a “u”, so the pairs “qa”, “qb”, … are forbidden. The “u” following a “q” is therefore *redundant*. Another way of saying this is that “u” after “q” conveys no new information—it will not help to distinguish one word from another.

For an example more relevant to Wordle, suppose you find yourself facing the array

Just as “u” always comes after “q”, here the last letter must be either “d”, “e”, or “p”, making possible answers **shard**, **share**, or **sharp**. So here the final letter carries information—it picks out one of three possibilities—but not a lot, since the possible options are very limited.

These two examples illustrate the general fact: *information is the resolution of uncertainty.* The more uncertainty there is, the more information is conveyed by choosing one of the options. The mathematical theory of information initiated by Shannon makes this quantitative.

To understand the relation between information and mathematics, I’ll look now at a simpler guessing game, a variant of ‘twenty questions’. In this game, somebody chooses a random number $n$ such that $0 \le n \lt 2^{3} = 8$ (i.e., in the half-open range $[0, 8)$), and asks you to guess what it is. Any question you ask will be answered with only a ‘yes’ or ‘no’. *How many questions might you have to ask?*

The simplest strategy is to start with “Is it 0?” and continue with possibly 7 more. But this is certainly unnecessarily inefficient. The best way to phrase the optimal procedure is to express numbers in the range $[0, 8)$ in base $2$ notation. Thus $5$ is expressed as $101$, since $$ 5 = 1 + 0 \cdot 2 + 1 \cdot 4 . $$ The coefficients in such an expression are called **bits**. With this in mind, you have only to ask three questions: *is the $i$-th bit equal to $0$?* for $i = 0$, $1$, and $2$. Whether the answer is ‘yes’ or ‘no’, you gain one bit of information.
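The bit-by-bit strategy can be sketched in a few lines of Python:

```python
# Recover a secret number in [0, 8) with exactly three yes/no questions:
# "is the i-th bit of the number equal to 1?" for i = 0, 1, 2.
def guess_by_bits(secret):
    answers = [bool(secret & (1 << i)) for i in range(3)]  # the three answers
    return sum(1 << i for i, yes in enumerate(answers) if yes)

print(all(guess_by_bits(n) == n for n in range(8)))  # True: 3 questions always suffice
```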

The drawback to this procedure is that you will never get by with fewer than three questions, whereas with a naive strategy you might be lucky and get it in one. Sure, but you are more likely to be unlucky! In the naive scheme the average number of guesses is $(1 + 2 + \cdots + 7 + 8)/8 = 4.5$, whereas in the other it is $3$. When the number of choices is $2^{3}$ the difference in the average number of questions asked is small, but if $2^{3}$ is replaced by $2^{20}$ the difference is huge.

Very generally, a game with $2^{n}$ possible items may be played by asking only $n$ questions with ‘yes’ or ‘no’ answers and receiving in return $n$ bits of information. Another way to put this: *a string of zeroes and ones of length $n$, chosen randomly, conveys $n$ bits of information.* As a simple variant of this, suppose each item of the string is a number in the range $[0, 2^{k})$. Then a string of such numbers of length $n$ is equivalent to one of $nk$ bits, and hence conveys $nk$ bits of information.

But now Shannon posed the question: *suppose the bits are not chosen randomly?* Then less information will be conveyed, in general. Can we measure how much?

For example, suppose the string is of length $2n$, and may therefore be considered a string of $n$ numbers in the range $[0, 4)$. Suppose further that each of these numbers $k$ is constrained to the range $[0, 3)$ (i.e., with $0 \le k \le 2$), so that in effect the string is the expression of a number in base $3$. I’ll call it a **$3$-string**. Instead of $4^{n}$ possible strings, there are only $3^{n}$, so that fewer than $2n$ bits of information can be conveyed. Assuming that the individual ‘digits’ are chosen randomly, how much information is conveyed by such a $3$-string?

The most fruitful answer tells what happens as $n$ becomes larger and larger. Large strings of $n$ integers in the range $[0, 3)$ can be compressed, and more efficiently compressed for large $n$. A single integer $0 \le k \le 2$ requires two bits to be expressed, and a pair of them (a number in the range $[0, 9)$) requires $4$ bits, or twice as many. But $3^{3} = 27 \lt 2^{5} = 32$, so a string of $3$ digits requires only $5$ bits instead of $6$. In general, if $$ 2^{m-1} \lt 3^{n} \lt 2^{m} $$ then more than $m-1$ bits are required to specify every $3$-string of length $n$, but fewer than $m$. We can find a formula for $m$, in fact, since extracting $n$-th roots gives us $$ 2^{m/n - 1/n} \lt 3 \lt 2^{m/n} . $$ Since we can write $3 = 2^{\log_{2} 3}$, this is equivalent to $$ { m\over n } - { 1 \over n } \lt \log_{2} 3 \lt { m \over n } , $$ so that for large $n$ we see that $m$ is approximately $n \log_{2} 3$. What Shannon says is that *a random $3$-string of length $n$ carries $n \log_{2} 3$ bits of information.* Since there are $3^{n}$ such strings, if these are chosen randomly each one has probability $p = 1/3^{n}$ of occurring, and $n \log_{2} 3 = \log_{2} (1/p)$. Shannon’s way of putting this becomes in general the recipe: *an event of probability $p$ conveys $\log_{2} \frac{1}{p}$ bits of information.*

Note that since $p \le 1$, we know that $\log_{2} \frac{1}{p} \ge 0$, so this is always non-negative. When $p = 1$ the event *always* takes place. There are no surprises. Sure enough, $\log_{2} 1 = 0$, so Shannon’s rule says that no information is conveyed. *The rarer an event, the more surprising it is, and the more information it conveys*: “Dog bites man” is nothing new, but “Man bites dog” is a different story.

Let’s look at one simple example. Suppose we are again playing ‘three questions’. You feel lucky and can’t resist blurting out, “Is the number 6?” If the answer is ‘yes’, you have acquired, as we have already seen, three bits of information. But how much information does a ‘no’ give you? All we can say immediately is that it isn’t much, because it has only reduced the number of possibilities from $8$ to $7$. Now, since we assume the number was chosen at random, the probability of getting a ‘yes’ here is $p = 1/8$, so the probability of getting a ‘no’ is $1 - p = 7/8$. Shannon assigns to it $\log_{2} 8/7 \sim 0.193$ bits of information.
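Shannon’s recipe is a one-liner in code; here it is applied to the lucky ‘yes’ and the unlucky ‘no’:

```python
import math

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return math.log2(1 / p)

print(surprisal(1 / 8))            # 3.0 bits for the lucky 'yes'
print(round(surprisal(7 / 8), 3))  # 0.193 bits for a 'no'
```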

For us, a **random event** is one with a finite number of outcomes. Suppose the $i$-th outcome has probability $p_{i}$. If the event takes place a large number of times, what is the average amount of information seen? The $i$-th outcome has probability $p_{i}$, and the associated amount of information is $\log_{2} \frac{1}{p_{i}}$, so the expected average is

$$ \sum_{i} p_{i} \log_{2} \frac{1}{p_{i}} . $$

Even the case $p_{i} =0$ is allowed, since

$$ \lim_{p\rightarrow 0} p \cdot \log_{2} \frac{1}{p} = 0 . $$

This average is what Shannon calls the **entropy** of the event, measured in bits. If the event has two outcomes, with probabilities $p$ and $1-p$, the entropy is

$$ (1-p)\log_{2} \frac{1}{1-p} + p \log_{2} \frac{1}{p} . $$

Its graph looks like this:

If $p=0$ or $p=1$ there is no uncertainty involved, and no information. The maximum possible entropy occurs when $p=1-p = 1/2$. This is when the maximum uncertainty is present, and in general entropy is a measure of overall uncertainty.

This last remark remains true when any number of outcomes are involved:

- If there are $n$ outcomes, the maximum possible entropy is when all $p_{i} = 1/n$, and in this case the entropy is $\log_{2}n$.

That is to say, whenever $(p_{i})$ is any sequence of $n$ numbers $p_{i} \ge 0$ with $\sum p_{i} = 1$ then

$$ \sum_{i} p_{i} \log_{2} \frac{1}{p_{i}} \le \log_{2} n . $$

This is immediately evident when $n=2$, since the graph of $y = \log x$ is concave downwards.

In general it can then be derived by mathematical induction on $n$.
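The entropy formula and the bound $\log_{2} n$ can be checked directly. Here is a minimal sketch in Python (the function name is mine):

```python
import math

def entropy(ps):
    """Entropy in bits: the sum of p * log2(1/p), with the convention 0 * log2(1/0) = 0."""
    return sum(p * math.log2(1 / p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # a biased coin: about 0.469 bits
n = 8
assert abs(entropy([1 / n] * n) - math.log2(n)) < 1e-12   # uniform attains log2(n)
```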

What does this have to do with Wordle?

When you enter a candidate word into the Wordle display, the game replies by coloring your entry with one of three colors—it is giving you *information* about your proposal. But Wordle differs from 20 questions in that you can use this information to make your next guess. *How much information is the game giving you? How can you best use this information to make your next submission?*

The secret word is effectively a choice of one of the $2,309$ valid answers. Each of these presumably has equal probability. But the coloring reduces this number considerably—the true answer has to be one of those compatible with the coloring. For example, a few days ago I chose **slate** as my initial guess, and got a coloring

I can scan through all possible answers and check which ones would give me this coloring. It happens that my choice was very bad, since there were $164$ words that do this. We can see exactly how bad this is by making a graph like the following:

This graph was constructed in the following way: I scanned through all of the 2,309 possible answers and computed the coloring each would give. I used this to list, for each coloring, all of the compatible answers. I then made a list of the sizes of these classes, sorted by magnitude. For example, the largest count was 221, corresponding to all grays. The second highest was the one I got, at 164 (marked on the graph). As you can see, there was a very large reservoir of things I might have hoped for.
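The scan just described is easy to reproduce. Below is a sketch in Python of the standard two-pass Wordle coloring rule (which handles repeated letters the way the game does), applied to a tiny made-up answer list; with the full 2,309-word list, the same loop yields the counts quoted above.

```python
from collections import Counter

def coloring(guess, answer):
    """Wordle coloring of `guess` against `answer`: 'g' green, 'y' yellow, '-' gray.
    Two passes, so repeated letters are colored the way Wordle colors them."""
    result = ['-'] * 5
    unused = Counter()                       # answer letters not matched exactly
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = 'g'
        else:
            unused[a] += 1
    for i, g in enumerate(guess):
        if result[i] != 'g' and unused[g] > 0:
            result[i] = 'y'
            unused[g] -= 1
    return ''.join(result)

# A tiny made-up list standing in for the real 2,309-word answer list.
answers = ['slate', 'shard', 'sharp', 'share', 'crane', 'pedal']
counts = Counter(coloring('slate', a) for a in answers)
print(sorted(counts.values(), reverse=True))  # sizes of the compatible-answer classes
```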

Could I have made a better choice of first guess? As should be clear from what I wrote just above, each possible first guess gives a graph like the one above. What I would like to see is a graph with a somewhat uniform height, for which the likelihood of narrowing down my choices is large. I display below a few graphs for other first guesses.

It turns out that it is impossible to get a uniform height, but some choices do much better than others. The point is that uniformity is maximized when the entropy of a certain probability distribution is maximized. Every choice of a starting word assigns a coloring to every possible answer. These colorings partition the set of possible answers, and if $n_{i}$ answers give rise to the same coloring $i$, then $\sum n_{i} = 2,309 = n$. I set $p_{i} = n_{i}/n$. It is the entropy of this probability distribution that is displayed in the graphs above, and you can see that a choice is better if its entropy is relatively large.
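Concretely, the quantity being plotted is easy to compute from the class sizes $n_{i}$. A sketch (the split sizes below are hypothetical fillers, except for the counts 221 and 164 quoted earlier):

```python
import math

def guess_entropy(class_sizes):
    """Entropy of the partition of the n answers induced by a guess's colorings,
    using p_i = n_i / n."""
    n = sum(class_sizes)
    return sum((k / n) * math.log2(n / k) for k in class_sizes if k > 0)

# A guess splitting the 2,309 answers into many small classes is more
# informative than one leaving a few huge classes.
even_split = [23] * 100 + [9]        # hypothetical: 101 classes summing to 2,309
lumpy_split = [221, 164, 1924]       # 221 and 164 are from the text; 1924 is filler
print(guess_entropy(even_split) > guess_entropy(lumpy_split))  # True
```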

The idea of using entropy to find optimal ways of playing Wordle has proliferated on the Internet. Many have used entropy to make a best first guess—among these are **crane**, which is the one used by the official NYT advisors, and **slate**. Very few of these add something new, and most seem to take their ideas from the video of Grant Sanderson that I mention below. (This seems to be the original investigation of entropy in this setting.)

I don’t want to add to this literature, but I do want to discuss the question of best second choices, about which less is said. It is a relatively simple calculation to list all possible answers that will color a first guess in a given way. For example, as I have already mentioned, my choice of **slate** above came up with 164 possibilities. This is a severe reduction of the original 2,309. But one of the quirks of Wordle is that choosing your second guess from this list might not be best. For example, if you get the coloring

you know that the secret word must be **sharp**, **shard**, or **share**. The obvious thing to then do is try these, one by one. However, that might take three tries, while a careful choice of some totally different word—for example **pedal**—will give it to you in two tries by eliminating two of the three possibilities.

Absolutely optimal strategies for Wordle are now known and posted on the Internet. But these miss the real point—I’d like to see more theoretical discussion of exactly what Wordle’s colorings tell you.

- The official Wordle page
- Wordle word lists as text files (both the answers and the words accepted)
- Grant Sanderson’s first 3Blue1Brown Wordle video
I have ‘borrowed’ several ideas from this well known and impressive YouTube video.

- Wordle-solving state of the art

A summary of optimal decision strategies for playing Wordle.

- A brutal decision tree for making Wordle choices.
- A very full list of all words found in English text
- An interesting if somewhat vague discussion thread by the well known mathematician Timothy Gowers.
- A Quanta article on Shannon and Wordle

**Ursula Whitcher**

**Mathematical Reviews (AMS)**

Mathematicians and physicists both love symmetry, but depending on who you’re talking to, the implications of a simple statement such as “This theory admits a symmetry” can be very different. In this column, I’ll describe one attempt at interdisciplinary translation using an entire rainbow of colors. Our subject is Adinkras. Here is an example:

Using Adinkras, we can move from physics questions to algebra questions to combinatorics questions. I’ll sketch ways to think about Adinkras from all three points of view, on the way to providing a formal definition. But to start with, let me introduce you to the inventors of mathematical Adinkras—and the symbols’ namesake.

The physicists Michael Faux and Sylvester James Gates, Jr.—Jim Gates, for short—first described Adinkras in a 2004 paper.

Jim Gates (US Department of Energy)

Michael Faux

In their introduction, Faux and Gates wrote:

The use of symbols to connote ideas which defy simple verbalization is perhaps one of the oldest of human traditions. The Asante people of West Africa have long been accustomed to using simple yet elegant motifs known as Adinkra symbols, to serve just this purpose. With a nod to this tradition, we christen our graphical symbols as “Adinkras.”

(Some traditions say the name “Adinkra” derives from the name of an early nineteenth-century ruler of Gyaman, a West African kingdom founded by the Bono people.)

As an example, here are two Adinkra symbols used in early twentieth-century Ghana: *Nkyimkyim*, the twisted pattern, and *Aya*, the fern, which can also mean “I am not afraid of you.”

Faux and Gates were motivated by the desire to understand a physical concept known as *supersymmetry*. This concept arose as a theoretical attempt to organize the vast quantities of information involved in particle physics.

In the Standard Model of particle physics, the fundamental components of the universe are the following types of particles:

- Six leptons (electrons, neutrinos, etc.) and their antiparticles
- Six quarks and their antiparticles.
- Four gauge bosons (photons, gluons, etc.)
- The Higgs boson

Some of these particles make up familiar types of energy and matter. For example, up and down quarks combine to make protons and neutrons, and thus atomic nuclei. Photons are the particles that make up beams of light. Other components of the Standard Model are harder to detect: the IceCube detector at the South Pole hunts for faint flashes of light due to rare neutrino interactions, and confirming the Higgs boson’s existence entailed detailed analysis of particle decay within the particle accelerator known as the Large Hadron Collider.

The fundamental particles in the Standard Model can be divided into two types, *bosons* and *fermions*. Bosons transmit fundamental forces—for example, photons carry electromagnetic energy—while we’ve already noted that fermions can combine to make up matter. Other distinctions are more technical. Every boson has an intrinsic integer amount of angular momentum—spin—when measured in units of Planck’s constant $\hbar$, while fermions have half-integer spin ($\frac{1}{2}$, $\frac{3}{2}$, etc.). Furthermore, bosons can cluster: identical bosons can share the same quantum state. In contrast, identical fermions must occupy distinct quantum states, a principle sometimes known as the *Pauli exclusion principle*.

Supersymmetry is a theoretical physical symmetry that exchanges bosons and fermions. In particular, in a supersymmetric theory, every type of boson must have a fermion partner, and vice versa.

“Where is my superpartner?”

Despite years of searching, experimental physicists have not found evidence of superpartners for any of the particles in the Standard Model. That doesn’t mean understanding supersymmetry isn’t useful! After all, a real-world tennis ball’s path will deviate from a perfect parabola due to air resistance, but parabolas are not meaningless. Indeed, the equations underlying quantum supersymmetry have proved useful in understanding optical and condensed-matter systems at larger scales.

Studying supersymmetry in physically realistic situations requires a tremendous amount of physical and mathematical sophistication. We’re going to simplify as much as possible: all the way down to zero spatial dimensions! In other words, from now on we’ll assume that all of our bosons and fermions live on a single point.

In physics at the scales where human beings usually operate, once you’ve restricted to a single point, nothing much can happen. But in quantum physics, particles can have intrinsic properties without obviously moving. For example, an electron can act like a single point with angular momentum, but a single point doesn’t have anywhere to spin. We’ll represent the intrinsic information in our model mathematically by writing our bosons and fermions as functions of a time parameter $t$. In physicist’s language, this is a one-dimensional theory: that one dimension is *time*.

Let’s suppose we have $m$ fundamental bosons, $\{\phi_1, \dots, \phi_m\}$, and $m$ fundamental fermions, $\{\psi_1, \dots, \psi_m\}$. We also assume that we have $N$ *supersymmetry operators* $\{Q_1, \dots, Q_N\}$. Each supersymmetry operator transforms bosons to fermions, and vice versa. We write this in a notation similar to function notation. For example, $Q_1 \phi_2$ should be a fermion (perhaps one of our fundamental fermions, or perhaps a transformation of a fundamental fermion, such as $-\psi_3$).

In quantum physics, the order in which you do things matters: for example, the outcome of two measurements may depend on which one you try first. That means that $Q_I Q_J$ and $Q_J Q_I$ could be different. We can measure this using the *anticommutator*, which we write with curly brackets:

\[\{A, B\} = AB + BA.\]

With this notation in hand, we’re ready to describe the rules that the supersymmetry operators have to follow. In order to do so, we’ll have to use a little bit of calculus notation, namely, the time derivative $\frac{d}{dt}$. If this notation is unfamiliar to you, don’t worry. You can think of $\frac{d}{dt}$ as telling you something about the way a boson or fermion changes over a small period of time. But we’ll replace all our derivatives by diagrams when we start building Adinkras!

Here are the two rules for supersymmetry operators.

- If $I \neq J$, then

\[\{Q_I, Q_J\} = 0.\]

In other words, for $I \neq J$, $Q_I Q_J = -Q_J Q_I$: if you apply two different supersymmetry operators in two different orders, the results will be opposites.

- Each supersymmetry operator $Q_I$ satisfies

\[\{Q_I, Q_I\} = 2 i \frac{d}{dt}.\]

Here $i$ is the square root of $-1$. (Physicists have standard rules for translating quantum quantities involving complex numbers into real-world observable measurements.) We can simplify this rule to $Q_I Q_I = i \frac{d}{dt}$.

We’re interested in understanding how many genuinely different solutions to the supersymmetry operator rules we can find. Even for small examples, there may be multiple possibilities. Each one is called a *representation of the supersymmetry algebra*. For example, suppose we have one fundamental boson $\phi_1$, one fundamental fermion $\psi_1$, and one supersymmetry operator $Q_1$. We need to specify $Q_1 \phi_1$ and $Q_1 \psi_1$. One possibility is:

$$ Q_1 \phi_1 = \psi_1 $$

$$ Q_1 \psi_1 = i \frac{d}{dt} \phi_1.$$

Another possibility is:

$$ Q_1 \phi_1 = \frac{d}{dt} \psi_1 $$

$$ Q_1 \psi_1 = i \phi_1.$$

You can check that in either case, $Q_1 Q_1 \phi_1 = i \frac{d}{dt} \phi_1$ and $Q_1 Q_1 \psi_1 = i \frac{d}{dt} \psi_1$.
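For instance, for the first representation the check amounts to (using that $Q_1$ commutes with $\frac{d}{dt}$):

$$ Q_1 Q_1 \phi_1 = Q_1 \psi_1 = i \frac{d}{dt} \phi_1, \qquad Q_1 Q_1 \psi_1 = Q_1 \left( i \frac{d}{dt} \phi_1 \right) = i \frac{d}{dt} Q_1 \phi_1 = i \frac{d}{dt} \psi_1 . $$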

To organize our possible supersymmetry operator representations, we use graphs. These are graphs in the sense of graph theory, with vertices (dots) connected by edges (line segments). We use $m$ open vertices and $m$ closed vertices. Each open vertex represents a fundamental boson $\phi_j$ together with its derivatives and constant multiples; each closed vertex represents a fundamental fermion together with its derivatives and multiples. We pick $N$ possible edge colors, one for each of the $N$ supersymmetry operators. We think of acting on a boson or fermion by a supersymmetry operator as traveling along the edge of the appropriate color. Since we can act on any boson or fermion by any of the operators, we need an edge of each possible color attached to every boson vertex and every fermion vertex. Here is an example of the resulting structure.

This example has a property we haven’t discussed yet, corresponding to the supersymmetry algebra rule $Q_I Q_J = -Q_J Q_I$. Applying two different supersymmetry operators in different orders yields the same result, up to a sign change. That means that if we pick a vertex and a pair of edge colors, it shouldn’t matter which edge we follow first: for example, blue then green should have the same result as green then blue.

We can summarize the requirements we’ve described so far using the notion of a *chromotopology*. A chromotopology is a finite simple graph (that is, it has a finite number of vertices and edges, and none of the edges starts and ends in the same place) with the following properties:

- *Bipartite* (there are two types of vertices, and each edge has one endpoint of each type)
- *$N$-regular* with a consistent edge coloring (every edge is colored one of $N$ colors, and every vertex has $N$ incident edges, one of each color)
- *Quadrilateral* (every simple cycle with two colors has length 4)

The quadrilateral property is a different way of describing the requirement that following a pair of edge colors in either order has the same result: instead of considering two possible paths away from the same vertex, we notice that following one path away from a vertex and the other path back again brings us back to the start (so blue-green-blue-green will return us to the vertex where we started).
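These three properties can be verified mechanically for a standard example: the cube chromotopology for $N=3$, with vertices labeled by bit strings and the edge of color $i$ flipping bit $i$. A sketch in Python (this encoding is my own illustration, not taken from the original papers):

```python
from itertools import combinations

# The N = 3 cube chromotopology: vertices are the bit strings of length 3,
# and the edge of color i joins each vertex to the vertex with bit i flipped.
N = 3
vertices = range(2 ** N)
flip = lambda v, i: v ^ (1 << i)
parity = lambda v: bin(v).count('1') % 2   # bosons: even parity; fermions: odd

# Bipartite: every edge joins a boson to a fermion.
assert all(parity(v) != parity(flip(v, i)) for v in vertices for i in range(N))

# N-regular with a consistent coloring: each vertex meets exactly one edge of
# each color (true here by construction, since color i means "flip bit i").

# Quadrilateral: following colors i, j, i, j returns to the start, so every
# two-color simple cycle has length 4.
for v in vertices:
    for i, j in combinations(range(N), 2):
        assert flip(flip(flip(flip(v, i), j), i), j) == v
print("the cube is a valid chromotopology")
```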

To make an Adinkra, we start with a chromotopology and add two additional structures. The first is a *height assignment*. To make a height assignment, we place our bosons and fermions on different levels, and require that every time we follow an edge, we go up or down one level. For example, when $N=1$, there are two possible height assignments, one with the boson on the lowest level and one with the fermion on the lowest level. You might notice that the example Adinkra and chromotopology we’ve already seen are drawn with consistent height assignments.

$N=1$ height assignment, boson on the lowest level

$N=1$ height assignment, fermion on the lowest level

In the conversion between Adinkras and supersymmetry representations, the height assignment tells us when to take derivatives. We use the convention that going up a level does not use a time derivative, but going back down does. Thus, in the one-dimensional example, our choice of height specifies whether we take a derivative when going from fermion to boson or from boson to fermion.

Our final structure is an *odd dashing*. We know that in a chromotopology, picking a vertex and a pair of colors gives us a 4-cycle of edges that will bring us back where we started. To make an odd dashing, for each of these 4-cycles, we dash an odd number of edges—either just one, or three of the four. For example, here is a possible dashing of our example $N=3$ chromotopology—making it into an Adinkra.

Here is a different odd dashing of the same chromotopology—and thus a different Adinkra!

The dashings tell us where to place minus signs in our supersymmetry representations. Because there are an odd number of dashed edges in each two-color 4-cycle, when we flip the order in which we follow a pair of colors, we will also change the number of dashed edges we encounter. This guarantees that the minus sign condition in the supersymmetry algebra rule $Q_I Q_J = -Q_J Q_I$ will be satisfied.
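Here is one concrete recipe for an odd dashing of the $N=3$ cube chromotopology, sketched in Python. The particular sign rule is my own choice of example (essentially the usual Clifford-algebra signs), and it is certainly not the only odd dashing: dash the color-$i$ edge at a vertex when the bits below position $i$ have odd parity.

```python
from itertools import combinations

N = 3
vertices = range(2 ** N)
flip = lambda v, i: v ^ (1 << i)

def dashed(v, i):
    # Dash the color-i edge at v when the bits of v below position i have odd
    # parity. Flipping bit i leaves those bits alone, so both endpoints of the
    # edge agree: this is a well-defined property of the edge.
    return bin(v & ((1 << i) - 1)).count('1') % 2

# Every two-color 4-cycle (colors i, j, i, j) contains an odd number of
# dashed edges, as an odd dashing requires.
for v in vertices:
    for i, j in combinations(range(N), 2):
        edges = [(v, i), (flip(v, i), j), (flip(flip(v, i), j), i), (flip(v, j), j)]
        assert sum(dashed(u, c) for u, c in edges) % 2 == 1
print("odd dashing verified")
```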

Adinkras help us generate a multitude of supersymmetry algebra representations. They also give us new ways to visualize relationships between distinct representations: for example, we can classify supersymmetry algebra representations that correspond to distinct dashings of the same chromotopology and height assignment, or move from representation to representation by raising and lowering vertices to different levels. The result is a powerful tool for exploration and discovery, in both physics and mathematics.

- Charles F. Doran, Michael G. Faux, S. James Gates, Jr., Tristan Hübsch, Kevin M. Iga, Gregory D. Landweber, and Robert L. Miller, Codes and supersymmetry in one dimension.
- Adv. Theor. Math. Phys. Vol. 15, No. 6 (2011), 1909-1970
- A classification of chromotopologies using error-correcting codes.
- Michael Faux and S. James Gates, Jr., Adinkras: A graphical technology for supersymmetric representation theory.
- Phys. Rev. D (3) Vol. 71 (2005), 065002
- The first Adinkras paper.
- S. James Gates, Jr., Surprises in Supersymmetry (Perimeter Institute).
- Brief YouTube lecture with enthusiastic comments on Adinkras.
- S. James Gates, Jr., The 1,358,954,496 Matrix Elements To Get From SUSY Diff EQ’s To Pictures, Codes, Card Games, Music, Computers and Back Again (American Physical Society).
- 2017 Kavli lecture.
- Yan X. Zhang, Adinkras for Mathematicians.
- Transactions of the American Mathematical Society Vol. 366, No. 6 (June 2014), pp. 3325-3355
- A useful introduction focused on Adinkra combinatorics.

**Joe Malkevitch**

**York College (CUNY)**

When looking at a body of mathematical ideas, one might look for the “atoms” or parts, so that one could see the whole by having insight into its parts. If in some future state of the Earth there were no automobiles, and some humans came across a well-preserved car from the 1970s with no prior knowledge of what such a thing was, how might they interpret what they were looking at? An archaeologist of that time might try to understand its parts as a way to think through what the whole thing was good for. Perhaps these people might decide it was a small movable house?

In an earlier Feature Column essay I looked at how, by studying the primes 2, 3, 5, 7, … we get insight into big integers such as 1111113. There I also looked at partitions of positive integers—for example, $5 = 4 + 1$ and $5 = 3 + 1 + 1$ are but two of the partitions of 5. Words connoting or related to *decomposition* in English include decomposition, dissection, factoring, irreducible, etc. It is not uncommon in mathematics to use as technical vocabulary words that suggest the ideas they carry in ordinary, non-mathematical usage. For example, consider the word "irreducible". This suggests something that cannot be broken up into parts.

Before addressing the issue of geometric decompositions in earnest, as a teaser recall that one of the most important and well known theorems in mathematics is the Pythagorean Theorem—though attributing it to Pythagoras, or even the Pythagoreans, distorts the history of this remarkable result, which can be viewed as a result in algebra or a result in geometry. The Pythagorean Theorem states that in a right triangle (one where two sides meet at a 90 degree angle—that is, are perpendicular), the square (in the algebraic sense) of the length of the side opposite the right angle is the sum of the squares of the lengths of the other two sides. But this theorem about lengths can also be interpreted as a statement about the areas of the geometric squares that can be constructed on the sides of a right triangle. Here are diagrams (Figures 1 and 2) that support one of the many proofs of the Pythagorean Theorem that involve moving around pieces of squares and assembling them to form other squares.

Figure 1 (A proof of the Pythagorean Theorem based on decomposing the squares into pieces that can be reassembled in other ways. Diagram courtesy of Wikipedia.)

Figure 2 (A proof of the Pythagorean Theorem based on reassembling the pieces of squares on the sides of a right triangle, shown in white. Image courtesy of Wikipedia.)

Proofs of theorems using "physical models" such as the diagrams in Figures 1 and 2 are quite compelling because of the amazing ability of humans to input and process visual information. The eye responds to issues related to the length of segments and the area of regions, even if the fact that area scales as the square of length sometimes causes individuals to make misleading judgments about diagrams.

One might try to understand geometric objects in terms of the parts that make them up. These parts might be described as: points, lines, membrane patterns, corners, curves, etc. Sometimes these parts can be viewed with terms that overlap. In describing a shirt one might use terms like sleeves and buttons and in describing a car one might mention windows and wheels. Here we will give special consideration to the notion of a polygon.

After the point and the line, among the most fundamental of geometrical objects is the triangle. A triangle is a collection of three points not all on a line, known as the vertices of the triangle, together with the segments joining pairs of these points. When we classify polygons drawn on a flat piece of paper in the plane, we can do so by counting the number of corners of the polygon or by counting the number of sides (edges) of the polygon. We can think of a polygon as a collection of points joined by sticks with no membrane filling in the result, or we can include the *interior* of the polygon along with the "sticks." There are pros and cons to defining shapes in particular ways. Here I just want to point out that we can classify polygons drawn in the plane not only by their number of corners, but by whether or not the polygon intersects itself or has notches. We say a set $X$ is *convex* if given any two elements $u$ and $v$ of $X$, the line segment joining $u$ and $v$ is contained in (is a subset of) $X$.

Figure 3 (A diagram illustrating the idea of convexity. Courtesy of Wikipedia.)

Figure 3 provides an illustration of this fundamental concept of modern geometry. Intuitively, a convex set is one that does not have notches or holes. Polygons drawn in the plane are usually defined (as are circles) to be dots connected by straight line segments—"sticks"—without the points in the interior. Triangles with their interiors are convex sets, but as soon as we have a polygon with more than 3 vertices we can have non-convex polygons, polygons whose vertices do not all lie in a plane, or polygons whose sides self-intersect. In some situations polygons are allowed to have several consecutive vertices lying along a straight line, but often it is required that pairs of consecutive sides not lie along the same line. The polygon in Figure 4 has 6 vertices (corners) and 6 sides, and is thus often described as a hexagon—in this case, a non-convex hexagon.
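Convexity of a polygon (taken with its interior) can also be tested mechanically: a simple polygon is convex exactly when all the turns along its boundary go the same way. A minimal sketch in Python (the function names and sample polygons are my own illustration):

```python
def cross(o, a, b):
    # z-component of the cross product (a - o) x (b - o): the sign gives the turn direction
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def is_convex(poly):
    """A simple polygon is convex exactly when every turn along its boundary
    has the same orientation (collinear triples are ignored)."""
    n = len(poly)
    signs = set()
    for i in range(n):
        c = cross(poly[i], poly[(i + 1) % n], poly[(i + 2) % n])
        if c != 0:
            signs.add(c > 0)
    return len(signs) <= 1

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
hexagon = [(0, 0), (2, 0), (3, 2), (2, 1), (1, 2), (0, 1)]  # notched, non-convex
print(is_convex(square), is_convex(hexagon))  # True False
```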

Historically, attention has been given to the length of sides and the measure of the interior (and sometimes exterior) angles of polygons. When the angles of a convex polygon are equal and its sides have the same length, it is called *regular*. However, one can consider polygons where the sides are all equal, that is the polygon is *equilateral*, or the angles are all equal, in which case it is *equiangular*. The polygon in Figure 4 is an equilateral hexagon. Its angles are not all equal, but there are three different sizes of angle, equal in pairs. If one adopts a partition-style way of classifying this polygon it is an example of a non-convex $\{6\}$; $\{2, 2, 2\}$ since there are six sides of equal length, and three types of angles equal in pairs.

Figure 4 (A non-convex hexagon where all of the sides have equal length.)

Figure 5 shows a small sampler of polygons, one convex and others non-convex. In one case, all consecutive pairs of sides of the polygon meet at right angles (a *rectilinear* or *orthogonal* polygon).

Figure 5 (A sampler of different kinds of polygons, convex and non-convex.)

The simplest polygon is one that has only three vertices, a triangle. In the spirit of decomposition, it is natural to ask if every simple polygon in the plane can be decomposed into triangles using only its existing vertices. While this may seem intuitively obvious, it is true but actually not that easy to prove. It may also seem intuitively clear that the 3-dimensional analogues of polygons—polyhedra, including non-convex ones that have the topology of a sphere—can be decomposed into tetrahedra (the "atoms" of 3-dimensional polyhedra, as it were), but this is in fact not true!

Given a polygon drawn in the plane, it is always possible to subdivide it into triangles using only the existing vertices. However, for some decomposition problems it is of interest to add additional vertices along the sides of the polygon as part of the decomposition effort, and sometimes one also allows vertices in the interior of the polygon. Thus in Figure 12 you can see how a square can be subdivided into triangles by adding some additional vertices along the sides of the square, and also how to do the subdivision using an additional interior vertex. Figure 6 shows a simple (no self-intersections) non-convex polygon with 11 vertices (and 11 sides). Using 8 diagonals joining existing vertices, this 11-gon can be subdivided into 9 triangles. In fact there are many other such triangulations of this same polygon, but all of them use 8 diagonals and give rise to 9 triangles. In general, any simple $n$-gon ($n$ at least 4) can be triangulated using $(n-3)$ diagonals into $(n-2)$ triangles. (One can see this using Euler’s Polyhedral Formula, $V + F - E = 2$, for a connected graph drawn in the plane.)

Figure 6 (A simple non-convex polygon converted using diagonals to a polygon subdivided into triangles.)
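The claim that a simple $n$-gon always has a triangulation into $n-2$ triangles can be demonstrated constructively by "ear clipping": repeatedly cut off a triangle at a convex corner whose triangle contains no other vertex. A sketch in Python for counterclockwise simple polygons (the code and the sample polygon are my own illustration):

```python
def cross(o, a, b):
    # z-component of the cross product (a - o) x (b - o)
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def inside(p, a, b, c):
    # True if point p lies in (or on the boundary of) triangle abc
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return not ((d1 < 0 or d2 < 0 or d3 < 0) and (d1 > 0 or d2 > 0 or d3 > 0))

def triangulate(poly):
    """Ear-clipping triangulation of a simple counterclockwise polygon."""
    verts = list(range(len(poly)))
    triangles = []
    while len(verts) > 3:
        for i in range(len(verts)):
            p, c, n = verts[i - 1], verts[i], verts[(i + 1) % len(verts)]
            if cross(poly[p], poly[c], poly[n]) <= 0:   # reflex corner: not an ear
                continue
            if any(inside(poly[v], poly[p], poly[c], poly[n])
                   for v in verts if v not in (p, c, n)):
                continue                                # another vertex blocks this ear
            triangles.append((p, c, n))
            verts.pop(i)                                # clip the ear
            break
    triangles.append(tuple(verts))
    return triangles

# A non-convex simple pentagon: 5 vertices give 5 - 2 = 3 triangles.
pentagon = [(0, 0), (4, 0), (4, 3), (2, 1), (0, 3)]
print(len(triangulate(pentagon)))  # 3
```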

Figures 7 and 8 show, respectively, a convex 9-gon subdivided by 6 diagonals into 7 triangles and a non-convex 9-gon subdivided by 6 diagonals into 7 triangles. Each of these triangulations includes one (or more) triangles which share edges with three other triangles, something which does not occur in Figure 6.

Figure 7 (A convex polygon subdivided into triangles.)

Figure 8 (A non-convex polygon partitioned into triangles using the diagonal edges shown in red.)

In Figure 8 I have called attention to the edges that subdivide the original polygon into triangles by coloring the subdividing diagonals red. As a problem in graph theory (the theory of diagrams involving dots and the lines that join them), you may want to think about the question:

Given a collection of edges, when can they serve as the diagonals for a plane $n$-gon that turns the $n$-gon into a triangulated polygon?

A remarkable theorem involving decompositions is that if one has two simple plane polygons of the same area, it is possible to decompose either of the polygons into polygonal pieces that can be reassembled to form the other polygon. This result is known as the Wallace-Bolyai-Gerwien Theorem. By way of illustration, Figure 9 shows a way to decompose a square and an equilateral triangle of equal area into polygonal parts that can be reassembled to form the other shape. The decomposition shown uses the minimal number of pieces. A lot of research has been done on *equidecomposability* with a minimal number of pieces, and on decompositions where the pieces are required to have particular shapes.

Figure 9 (A square and equilateral triangle of the same area can be decomposed into 4 pieces which can be assembled to form the other shape. Image courtesy of Wikipedia.)

It is natural to ask whether, in 3 dimensions, one of two polyhedra with the same volume can be decomposed (cut) into pieces and reassembled to form the other. This problem was solved by Max Dehn (1876-1952) and was one of a set of famous problems posed by David Hilbert (1862-1943), whose solutions he thought would create progress in a variety of mathematical areas. Figure 10 shows two 3-dimensional polyhedra, one decomposed into the other. Dehn provided tools for telling when this can be done. However, it may surprise you to learn that the analogue of what we see in Figure 9 can't be achieved: a cube (see the left of Figure 10) can't be cut into polyhedral pieces and reassembled into a regular tetrahedron of the same volume.

Figure 10 (A cube decomposed into a triangular prism of the same volume. Image courtesy of Wikipedia.)

A natural question about decomposing a polygon is whether or not it can be decomposed into convex pieces, in particular into triangles (which are automatically convex), where the pieces can be made to have equal area.

Each of the squares in Figure 11 can be thought of as a square of side length 2, and each has been divided into 4 congruent triangles; thus for the square on the left the 4 triangles have the same area, and the same is true on the right. It is interesting that in each case the triangles are special: they are right triangles, and thus satisfy the Pythagorean theorem. You can verify for yourself that on the left the triangles are isosceles right triangles with sides $\sqrt{2}$, $\sqrt{2}$, and 2, while on the right the triangles are scalene (all three sides of different lengths) with sides of lengths 1, 2, and $\sqrt{5}$. Also note that although the two squares are congruent, both of side length 2, it may not appear that the triangles on one side have the same area as those on the other, but you can verify using the Pythagorean Theorem that they do.

Figure 11 (Two different ways to subdivide congruent initial squares into 4 congruent triangles.)
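The two area computations can be checked with Heron's formula (a quick sketch in Python; the helper name `heron` is my own):

```python
import math

def heron(a, b, c):
    # Heron's formula: area of a triangle with side lengths a, b, c
    s = (a + b + c) / 2
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

left = heron(math.sqrt(2), math.sqrt(2), 2)   # the isosceles right triangles
right = heron(1, 2, math.sqrt(5))             # the scalene right triangles
print(left, right)  # each is 1 (up to rounding): four of either kind fill the area-4 square
```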

The decompositions of squares into equal-area triangular parts in Figure 11 can be extended to decompose a square in various ways into an even number of parts. The decompositions shown in Figure 11 have the property that when two triangles touch, they touch along a complete edge of another triangle. Are other kinds of decompositions of a square into an even number of triangles possible? Figure 12 shows an example of how a square can be decomposed into 6 parts, not all of them right triangles, where some of the triangles meet other triangles along part of an edge rather than a full edge, and yet the 6 parts can be shown to have equal area. Triangles which touch in this way are said not to meet edge-to-edge. In recent years there has been interest in tilings of the plane, as well as decompositions of polygons into pieces with special properties, that allow tiles that don't meet edge-to-edge.

Figure 12 (A decomposition of a square into 6 triangles of equal area, where not all of the triangles meet edge-to-edge. Image courtesy of Wikipedia.)

What has almost certainly already occurred to you in this discussion is the question of decomposing a square into an odd number of triangles with the same area. If you try to find such a dissection, you will find that you cannot! So you might try to prove that it is impossible. However, if you are like many people, you will not find this so easy to do. The following theorem is now associated with the name of Paul Monsky:

Theorem (1970): It is not possible to decompose a square into an odd number of triangles of equal area.

Like many easy-to-state results that are not so easy to demonstrate, there is a lot of history behind how Monsky came to prove his theorem. It might seem that this history would go back to ancient times, but in fact the problem seems to have been born quite recently. The problem appears to have originated with Fred Richman in 1965, and other names associated with it are John Thomas and Sherman Stein. While many first saw the problem with the publication of Monsky's proof in 1970, it was the work of Sherman Stein that magnified interest in what have come to be called equidissection problems. What Stein did, in essence, was ask what other shapes in the plane cannot be dissected into triangles of equal area. You might want to think about this issue for yourself; perhaps you can come up with new variants that other people have not thought of. After all, why restrict oneself to squares!

The fact that a square can't be decomposed into an odd number of triangles of the same area does not mean that one can't think about dividing a square into an odd number of triangles, such as those in Figure 13. We know that the triangles in such a decomposition cannot all have equal area, but researchers have investigated how close to equal in area they can be made, in terms of the number of triangles in the decomposition.

Figure 13 (A square decomposed into an odd number of triangles. Such a decomposition with all the triangles having equal area is not possible.)

Much more recent than the ideas that led to Monsky's Theorem is the question of when a convex polygonal region in the plane can be decomposed into $N$ convex pieces that have the same area and the same perimeter! A published version of this challenge appeared in a paper posted to the arXiv in 2008 by R. Nandakumar and N. Ramana Rao, and it has subsequently, in the spirit of Monsky's Theorem, attracted considerable interest, along with its generalizations.

This problem, while very easy to state, has inspired a large number of new geometric facts as well as many new questions. It is very common that progress on one mathematical problem opens up new questions, the need to invent new tools, and sometimes whole new clusters of mathematical questions.

We have looked at how surprisingly rich and complex the environment of decomposing a polygon into triangles can be, in particular the decomposition of a square into triangles of equal area. What about decomposing a square into squares, subject to various rules? Clearly one can decompose a square into various numbers of smaller squares of the same area, and the number of squares in the decomposition can be even or odd. However, much earlier than the interest in what has come to be called Monsky's Theorem, a group of British students at Cambridge University in the 1930s, R.L. Brooks, Cedric Smith, Arthur Stone, and William Tutte, all of whom went on to distinction in various ways, looked at a problem that has led to much important and interesting work in various parts of mathematics and is, again, very much a decomposition problem. The idea is to take a square (or, as the question was later generalized, a rectangle) and divide it into squares with the initially curious restriction that all of the squares in the decomposition have *different* side lengths. This has come to be known as the perfect square problem. Figure 14 shows an interesting example that fails to achieve this goal but is nonetheless striking for what it does accomplish.

Figure 14 (A square of relatively small side length subdivided into smaller squares some of which have the same edge length. Image courtesy of Wikipedia.)

Figure 15 shows an example of a square with the property that all the subdividing squares have different edge lengths.

Figure 15 (A square subdivided into squares all of which have different edge lengths. Image courtesy of Wikipedia.)

There are many "windows" (some not square) that serve as entries into mathematical insights and investigations. Looking at parts or decompositions of shapes as well as numbers leads to lots of fascinating mathematics and its applications.

**References**

Those who can access JSTOR can find some of the papers mentioned above there. For those with access, the American Mathematical Society’s MathSciNet can be used to get additional bibliographic information and reviews of some of these materials. Some of the items above can be found via the ACM Digital Library, which also provides bibliographic services.

Abrams, Aaron, and Jamie Pommersheim. "Generalized dissections and Monsky’s theorem." Discrete & Computational Geometry (2022): 1-37.

Alsina, Claudi, and Roger B. Nelsen. Math made visual: creating images for understanding mathematics. Vol. 28. American Mathematical Soc., 2006.

Akopyan, Arseniy, Sergey Avvakumov, and Roman Karasev. "Convex fair partitions into an arbitrary number of pieces." arXiv preprint arXiv:1804.03057 (2018).

Alsina, Claudi, and Roger B. Nelsen. A Cornucopia of quadrilaterals. Vol. 55. American Mathematical Soc., 2020.

Frederickson, G. Dissections: Plane and Fancy. New York: Cambridge University Press, pp. 28-29, 1997.

Hoehn, Larry. "A New Proof of the Pythagorean Theorem." Mathematics Teacher, February 1995. NCTM: Reston, VA.

Jepsen, Charles, and Roc Yang. "Making Squares from Pythagorean Triangles." The College Mathematics Journal 29.4 (1998): 284-288.

Karasev, Roman, Alfredo Hubard, and Boris Aronov. "Convex equipartitions: the spicy chicken theorem." Geometriae Dedicata 170.1 (2014): 263-279.

Kasimatis, Elaine Ann. "Dissections of regular polygons into triangles of equal areas." Discrete & Computational Geometry 4.4 (1989): 375-381.

Katz, Victor J. A History of Mathematics. Harper Collins: New York, 1993.

Loomis, E. S. The Pythagorean Proposition: Its Demonstrations Analyzed and Classified and Bibliography of Sources for Data of the Four Kinds of "Proofs," 2nd ed. Reston, VA: National Council of Teachers of Mathematics, 1968.

Machover, M. "Euler’s Theorem Implies the Pythagorean Proposition." Amer. Math. Monthly 103, 351, 1996.

Maldonado, Gerardo L., and Edgardo Roldán-Pensado. "Dissecting the square into seven or nine congruent parts." Discrete Mathematics 345.5 (2022): 112800.

Maor, Eli. The Pythagorean theorem: a 4,000-year history. Vol. 65. Princeton University Press, 2019.

Mead, David G. "Dissection of the hypercube into simplexes." Proceedings of the American Mathematical Society 76.2 (1979): 302-304.

Monsky, Paul. "On dividing a square into triangles." The American Mathematical Monthly 77.2 (1970): 161-164.

Nelsen, Roger B. Proofs without words: Exercises in visual thinking. No. 1. MAA, 1993.

Posamentier, Alfred S. The Pythagorean theorem: the story of its power and beauty. Prometheus books, 2010.

Nandakumar, R., and N. Ramana Rao. "Fair partitions of polygons: An elementary introduction." Proceedings-Mathematical Sciences 122.3 (2012): 459-467.

Rooney, Elaine Ann Kasimatis. "Dissection of Regular Polygons into Triangles of Equal Areas." (1987): 4188-4188.

Stein, Sherman. "Equidissections of centrally symmetric octagons." Aequationes Mathematicae 37.2 (1989): 313-318.

Stein, Sherman K. "Cutting a polyomino into triangles of equal areas." The American Mathematical Monthly 106.3 (1999): 255-257.

Stein, Sherman. "A generalized conjecture about cutting a polygon into triangles of equal areas." Discrete & Computational Geometry 24.1 (2000): 141-145.

Stein, Sherman. "Cutting a polygon into triangles of equal areas." The Mathematical Intelligencer 26.1 (2004): 17-21.

Su, Zhanjun, and Ren Ding. "Dissections of polygons into triangles of equal areas." Journal of Applied Mathematics and Computing 13.1 (2003): 29-36.

Wang, Yang, Lei Ren, and Hui Rao. "Dissecting a square into congruent polygons." Discrete Mathematics & Theoretical Computer Science 22 (2020).

**Sara Stoudt**

**Bucknell University**

Fitting a line to a set of points… how hard can it be? When those points represent the temperature outside and a town’s ice cream consumption, I’m really invested in that line helping me to understand the relationship between those two quantities. (What if my favorite flavor runs out?!) I might even want to predict new values of ice cream consumption based on new temperature values. A line can give us a way to do that too. But when we start to think more about it, more questions arise. What makes a line “good”? How do we tell if a line is the “best”?

A technique called ordinary least squares (OLS), aka linear regression, is a principled way to pick the “best” line, where “best” is defined as minimizing the sum of the squared vertical distances between the line and each point. We chant the assumptions of OLS and know what to look for in diagnostic plots, but where do these assumptions come from? Are some assumptions more hurtful than others when broken? These are questions I second-guess myself on, even though I frequently use and teach linear regression.

Proving that OLS performs “well” requires making certain assumptions to streamline the process. Walking through the proof reveals which assumptions are most crucial and what the impact of breaking each one is.

Let’s first write out our theoretical model in matrix notation and recall what the dimensions of each piece of the puzzle are. We are looking for a linear relationship between $X$ and $Y$, but we know there may be some error, which we represent by $\epsilon$.

The OLS approach estimates the unknown parameters $\beta$ by $\hat{\beta} = (X’X)^{-1}X’Y$. Here $X’$ is the transpose of $X$, so the product $X’X$ is a square matrix.

Have we implicitly assumed anything yet? Actually yes, there is an inverse in play! What would $X’X$ have to look like to be invertible? Let’s put our linear algebra hats on.

This matrix would have to be “full rank”. Informally, that means that there is no redundant information in the columns of $X$. So if the number of predictor variables $p$ is greater than the number of data points $n$, that would be a problem. Even if $p < n$, the columns of $X$ still need to be linearly independent. Multicollinearity (when one covariate is highly correlated with another covariate) would make us suspicious of this. Let’s keep that in mind just in case it comes back to haunt us.
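As a quick illustration (with made-up numbers), here is a Python sketch: a design matrix with a redundant column makes $X'X$ rank deficient, so the inverse does not exist, while an independent set of columns keeps it invertible.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Redundant design: the third column is just twice the second.
X_bad = np.column_stack([np.ones(5), x1, 2 * x1])
# Independent design: intercept, x1, and x1 squared.
X_ok = np.column_stack([np.ones(5), x1, x1 ** 2])

rank_bad = np.linalg.matrix_rank(X_bad.T @ X_bad)  # 2: not full rank, no inverse
rank_ok = np.linalg.matrix_rank(X_ok.T @ X_ok)     # 3: full rank, invertible
print(rank_bad, rank_ok)
```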

Now that we have our estimator for $\beta$, there are two properties we want to check on.

- We want to on average get the “right” answer.
- We want to be able to describe the uncertainty in our estimate. In other words, how much would our estimate wiggle around if we happened to have a new sample?

Let’s start with determining if we “expect” to get the right answer on average. It will help to rewrite $Y$ based on the proposed model.

As we break this expression down, we can see that we are on the right track. However, somehow that second term needs to have expectation zero.

How can we make this happen? This is where two regression assumptions are born. First we need the errors, $\epsilon$, to be independent of $X$. This seems plausible: if the errors depend on $X$, we still have some information left over that is not accounted for in the model. (A related warning sign is when the *spread* of the errors changes with $X$; that is heteroscedasticity, non-constant variance for short, and it sets our OLS assumption alarm bells ringing.) We also need to make an assumption about the magnitude of the errors themselves. It would be concerning if they were systematically positive or negative (that doesn’t seem like a very good model), so assuming they are on average zero seems like a reasonable path forward.

With these assumptions in hand we now have an estimator that is *unbiased*. We expect to get the right answer on average. So far we have seen the motivation for two assumptions. If either of those are broken, this unbiasedness is not guaranteed. So if our goal is to make predictions or interpret these coefficients in context, we will be out of luck if these assumptions aren’t met. The next step is to understand the estimator’s uncertainty. Let’s see what other assumptions reveal themselves in the process.

So let’s recap what we’ve had to assume about the errors so far.

What *haven’t* we had to assume yet? Normality! Then why is everyone always worried about that when doing regression? Why all of those qqplots?! Well, we do need to know the whole sampling distribution of the estimates if we want to do inference, or say something about a more general population based on a sample. If we assume the distribution that those errors have is actually Normal, that lets us get a normal sampling distribution of $\hat{\beta}$.

Ah, but there is that pesky inverse again! If we have multicollinearity, that inverse might get unstable, affecting our understanding of the spread of the sampling distribution. And of course nothing in life is perfectly normal. However, we can often get an approximately normal sampling distribution if our sample size is large enough thanks to the Central Limit Theorem (an explainer for another time), so there is some robustness to breaking this particular assumption.

We’ve now seen all of the regression assumptions unfold, so we should be able to build a sort of hierarchy of assumptions.

This hierarchy is also borne out in simulation studies like this one that try to stress test OLS. You may even want to code up a simple simulation yourself if you want to look for any gaps between the theory and practice.
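If you would like a starting point, here is a minimal Python simulation sketch (with made-up true coefficients $\beta_0 = 2$, $\beta_1 = 3$, and normal errors) that checks unbiasedness by averaging the OLS slope over many simulated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
slopes = []
for _ in range(500):
    x = rng.uniform(0, 10, size=50)
    y = 2 + 3 * x + rng.normal(0, 2, size=50)     # y = X beta + eps
    X = np.column_stack([np.ones_like(x), x])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X'X) beta = X'y
    slopes.append(beta_hat[1])

print(np.mean(slopes))  # close to the true slope of 3
```

Rerunning this with errors that depend on $x$, or with a duplicated predictor column, is an easy way to watch the assumptions break in practice.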

Some broken assumptions are fixable though. We might transform variables to deal with a lack of linearity or use a Generalized Linear Model that allows the relationship between $X$ and $Y$ to take a non-linear form. We might just be happy with estimating the best linear projection of the true relationship. If errors aren’t independent and/or do not have constant variance, methods like generalized least squares (GLS) can step in. If errors are not normally distributed, they might be approximately normally distributed thanks to the Central Limit Theorem, or we might want to use a Generalized Linear Model (GLM) that can handle error distributions of other forms. We always have options!

It’s great to see the theory behind the scenes informing the practice, and in general, assumptions of any method have to come from somewhere. Seeking out where in the math assumptions help make our lives easier can help demystify where these assumptions actually come from and gives some insight into which ones, if broken, are more dangerous than others. Happy line fitting!

$Y = X \beta + \epsilon$

$\hat{\beta} = (X’X)^{-1}X’Y$

$E[\hat{\beta} | X] = \beta$

$\hat{\beta} = (X’X)^{-1} X’ Y = (X’X)^{-1} X’ (X\beta + \epsilon)$

$= (X’X)^{-1} X’X\beta + (X’X)^{-1} X’\epsilon$

$= \beta + (X’X)^{-1}X’ \epsilon$

$E[\hat{\beta} | X] = E[(\beta + (X’X)^{-1} X’ \epsilon) | X]$

$ = E[\beta | X] + E[(X’X)^{-1}X’ \epsilon | X]$

$ = \beta + (X’X)^{-1} X’ E[\epsilon | X]$

$E[\epsilon | X] = 0?$

$\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$

$\hat{\beta} \sim N(\beta, \sigma^2 (X’X)^{-1})$

- *Statistical Models: Theory and Practice* by David A. Freedman, Cambridge University Press (2009)
- *Mostly Harmless Econometrics: An Empiricist’s Companion* by Joshua D. Angrist and Jörn-Steffen Pischke, Princeton University Press (2009)

Special thanks to my students in my Stat 2 class this semester for raising good questions about these regression assumptions and to my colleagues for helping me work through this hierarchy before reporting back. Also, thanks to Features Writer Courtney R. Gibbons for including drawings in her last post. That inspired me to do some doodling myself.

Note: The website https://drawdata.xyz/ made it possible to generate data to make the plots in the opening image.

**David Austin**

**Grand Valley State University**

I was running some errands recently when I spied the Chicken Robot ambling down the sidewalk, then veering into the bike lane. A local grocery chain needs to transport rotisserie chickens from one of its larger stores to its downtown market. Their solution? A semi-autonomous delivery vehicle that makes the three-mile journey navigating with GPS.

I was curious how this worked. The locations provided by most GPS systems are only accurate to within a meter or so. How could those juicy chickens know how to stay so carefully in the center of the bike lane?

A typical solution to problems like this is an elegant algorithm, known as Kalman filtering, that’s embedded in a wide range of technology that most of us frequently either use or benefit from in some way. The algorithm allows us to efficiently combine our expectations about the state of a system, the chickens’ physical location, for instance, with imperfect measurements about the system to develop a highly accurate picture of the system.

This column will describe some simple versions of Kalman filtering, the main observation that makes it work, and why it’s such a great idea.

A first example

Let’s begin with a simple example. Suppose we would like to determine the weight of some object. Of course, any scale is imperfect so we might weigh it repeatedly on the same scale and find the following weights, perhaps in grams.

| 50.4 | 46.0 | 46.1 | 45.1 | 48.9 | 42.6 | 50.7 | 45.7 | 47.8 | 46.7 |

As the result of each weighing is recorded, we update our estimate of the weight as the average of all the measurements we’ve seen so far. That is, after obtaining $z_n$, the $n^{th}$ weight, we could estimate the object’s weight by finding the average $$ x_n = \frac1n(z_1 + z_2 + \ldots + z_n). $$ The result is shown below.

As the next weight $z_{n}$ is recorded, we update our estimate:

$$
\begin{aligned}
x_{n} & = \frac1{n}\left(z_1 + z_2 + \ldots + z_{n}\right) \\
& = \frac1{n}\left((n-1)x_{n-1} + z_{n}\right) \\
& = x_{n-1} + \frac1{n}\left(z_{n} - x_{n-1}\right)
\end{aligned}
$$

This expression tells us how each new measurement influences the next estimate. More specifically, the new estimate $x_{n}$ is obtained from the previous estimate $x_{n-1}$ by adding in $\frac1{n}(z_{n} - x_{n-1})$. Notice that this term is proportional to the difference between the new measurement and the previous estimate, with proportionality constant $K_n = \frac1{n}$. We call $K_n$ the *Kalman gain*; the fact that it decreases as we include more measurements reflects our increasing confidence in the estimates $x_n$, so that, as $n$ increases, new measurements have less effect on the updated estimates.
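As a sanity check, here is a short Python sketch (using the weights from the table above) confirming that the recursive update with gain $K_n = 1/n$ reproduces the ordinary batch average:

```python
measurements = [50.4, 46.0, 46.1, 45.1, 48.9, 42.6, 50.7, 45.7, 47.8, 46.7]

x = 0.0
for n, z in enumerate(measurements, start=1):
    x = x + (z - x) / n  # x_n = x_{n-1} + (1/n)(z_n - x_{n-1})

batch_average = sum(measurements) / len(measurements)
print(x, batch_average)  # the two agree, up to floating-point error
```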

Suppose now that we have some additional information about the accuracy of our measurements $z_n$. For example, if $z_n$ is the chickens’ location reported by a GPS system, the true location is most likely within a meter or so. More generally, we might imagine that the true value is normally distributed about $z_n$ with standard deviation $\sigma_{z_n}$.

Returning to our weighings, we could perhaps estimate $\sigma_{z_n}$ by repeatedly weighing an object of known weight. The probability that the true value is near $z_n$ is given by the distribution

$$
p_{z_n}(x) =
\frac{1}{\sqrt{2\pi\sigma_{z_n}^2}}
e^{-(x-z_n)^2/(2\sigma_{z_n}^2)}.
$$

Let’s also suppose that we have some estimate of the uncertainty of our estimates $x_n$. In particular, we’ll assume the probability that the true value is near $x_n$ is given by the normal distribution having mean $x_n$ and standard deviation $\sigma_{x_n}$:

$$

p_{x_n}(x) =

\frac{1}{\sqrt{2\pi\sigma_{x_n}^2}}

e^{-(x-x_n)^2/(2\sigma_{x_n}^2)}.

$$

We’ll describe how we can estimate $\sigma_{x_n}$ a bit later.

These distributions could be as shown below.

Now here’s the beautiful idea that underlies Kalman’s filtering algorithm. Both distributions $p_{x_n}$ and $p_{z_n}$ describe the location of the true value we seek. We can combine them to obtain a better estimate of the true value by multiplying them. That is, we combine or “fuse” these two distributions using the product

$$

p_{x_n}(x) p_{z_n}(x).

$$

Since the product of two Gaussians is a new Gaussian, we obtain, after normalizing, a new normal distribution that better reflects the true value. In our example, the product $p_{x_n}p_{z_n}$ is shown in green.

More specifically, the product of the two distributions is

$$
\begin{aligned}
& \exp\left[-(x-x_n)^2/(2\sigma_{x_n}^2)\right]
\exp\left[-(x-z_n)^2/(2\sigma_{z_n}^2)\right] \\
& \hspace{48pt}=
\exp\left[-(x-x_n)^2/(2\sigma_{x_n}^2)-(x-z_n)^2/(2\sigma_{z_n}^2)\right].
\end{aligned}
$$

Expanding the quadratics in the exponents and recombining leads to the new normal distribution

$$

\frac{1}{\sqrt{2\pi\sigma_{x_{n+1}}^2}}

\exp\left[-(x-x_{n+1})^2/(2\sigma_{x_{n+1}}^2)\right]

$$

where

$$
\begin{aligned}
x_{n+1} & =
\frac{\sigma_{z_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2} x_n +
\frac{\sigma_{x_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2} z_n \\
& = x_n + \frac{\sigma_{x_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2}(z_n - x_n)
\end{aligned}
$$

Notice that this expression is similar to the one we found when we updated our estimates by simply taking the average of all the measurements we have seen up to this point. The Kalman gain is now

$$

K_n = \frac{\sigma_{x_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2}

$$

In addition, the product of the Gaussians leads to the new standard deviation

$$

\sigma_{x_{n+1}}^2 =

\frac{\sigma_{x_n}^2\sigma_{z_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2}

= (1-K_n) \sigma_{x_n}^2

$$

Notice that the Kalman gain satisfies $0\lt K_n \lt 1$. In the case that $\sigma_{x_n} \ll \sigma_{z_n}$, we feel much more confident in our estimate $x_n$ than our new measurement $z_n$. Therefore, $K_n\approx 0$ and so $x_{n+1} \approx x_n$, reflecting the fact that the new measurement has little influence on our next estimate.

However, if $\sigma_{x_n} \gg \sigma_{z_n}$, we feel much more confident in our new measurement $z_n$ than in our estimate $x_n$. Then $K_n\approx 1$ and $x_{n+1}\approx z_n$.

In both cases, the uncertainty in our new estimate, as measured by $\sigma_{x_{n+1}}^2 = (1-K_n)\sigma_{x_n}^2$, has decreased.
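These fusion formulas can be verified numerically. The following Python sketch (with made-up values for the two means and variances) checks that the product of the two Gaussians is proportional to the Gaussian with the fused mean and variance, by confirming that their ratio is the same constant at several values of $x$:

```python
import math

def gauss(x, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu_x, var_x = 4.0, 2.0   # current estimate x_n and its variance (made up)
mu_z, var_z = 6.0, 1.0   # new measurement z_n and its variance (made up)

K = var_x / (var_x + var_z)        # Kalman gain
mu_new = mu_x + K * (mu_z - mu_x)  # fused mean
var_new = (1 - K) * var_x          # fused variance

# The ratio product / fused-Gaussian should be constant in x.
ratios = [gauss(x, mu_x, var_x) * gauss(x, mu_z, var_z) / gauss(x, mu_new, var_new)
          for x in (3.0, 5.0, 7.0)]
print(ratios)
```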

This leads to the following algorithm:

- Initialize the first estimate $x_1$ and the uncertainty $\sigma_{x_1}$. We could use the first measurement as the first estimate, or we could simply make a guess for the estimate and choose a large uncertainty.
- Each time a new measurement $z_{n}$ arrives, update the estimate $x_{n+1}$ and uncertainty $\sigma_{x_{n+1}}$ as follows:
  - Find the Kalman gain $K_n = \frac{\sigma_{x_n}^2}{\sigma_{x_n}^2+\sigma_{z_n}^2}$.
  - Update our estimate $x_{n+1} = x_n + K_n(z_{n}-x_n)$.
  - Update our uncertainty $\sigma_{x_{n+1}}^2 = (1-K_n)\sigma_{x_n}^2$. Since $0\lt 1-K_n\lt 1$, the uncertainty will continually decrease, which makes the algorithm relatively insensitive to the initialization.

To illustrate, suppose that we make many measurements of a known weight with our scale and determine that the standard deviation of its measurements is a constant $\sigma_z=2$. Initializing so that $x_1 = z_1$ and $\sigma_{x_1} = \sigma_z = 2$, we have the following sequence of estimates along with the decreasing sequence of uncertainties.
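The steps above can be sketched in a few lines of Python, run here on the ten weighings from our table with $\sigma_z = 2$:

```python
measurements = [50.4, 46.0, 46.1, 45.1, 48.9, 42.6, 50.7, 45.7, 47.8, 46.7]
var_z = 2.0 ** 2  # measurement variance, sigma_z = 2

# Initialize with the first measurement and its uncertainty.
x, var_x = measurements[0], var_z
estimates, uncertainties = [x], [var_x]

for z in measurements[1:]:
    K = var_x / (var_x + var_z)  # Kalman gain
    x = x + K * (z - x)          # update the estimate
    var_x = (1 - K) * var_x      # uncertainty always decreases
    estimates.append(x)
    uncertainties.append(var_x)

print(estimates[-1], uncertainties[-1])
```

With this initialization the gain works out to $K_n = 1/(n+1)$ at step $n$, so the filter exactly reproduces the running average of the measurements: the final estimate is 47.0 grams with variance $\sigma_z^2/10$.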

Tracking a dynamic system

In our first example, the quantity we’re tracking, the weight of some object, doesn’t change. Let’s now consider a dynamic process where the quantities are changing. For instance, suppose we’re tracking the height and vertical velocity of a moving object. The true height of the object could look like this, although that true height is not known to us.

In addition to recording the height $h$, we will also track the vertical velocity $v$ so that the state of the object at some time is given by the vector

$$
{\mathbf x}_{n,n} = \begin{bmatrix} h_n \\ v_n \end{bmatrix}.
$$

The reason for writing the subscript in this particular way will become clear momentarily.

Moreover, we will assume that there is some uncertainty in both the position and the velocity. Initially, these uncertainties may be uncorrelated with one another so that we arrive at a Gaussian

$$
\exp\left[-(h-h_n)^2/(2\sigma_{h_n}^2) - (v-v_n)^2/(2\sigma_{v_n}^2)\right]
$$

that describes the multivariate normal distribution of the true state. A more convenient expression for this Gaussian uses the covariance matrix

$$
P_{n,n} = \begin{bmatrix} \sigma_{h_n}^2 & 0 \\
0 & \sigma_{v_n}^2
\end{bmatrix}
$$

so that the Gaussian is defined in terms of the quadratic form associated to $P_{n,n}^{-1}$:

$$
\exp\left[-\frac12 ({\mathbf x} - {\mathbf x}_{n,n})^T P_{n,n}^{-1}({\mathbf x} - {\mathbf x}_{n,n})\right].
$$

The following figure represents the value of this distribution by shading more likely regions more darkly.

Now if we know ${\mathbf x}_{n,n}$, the state at some time, we can predict the state at a later time using some assumption about the system. For instance, we might assume, after $\Delta t$ time units have passed, that the new state is ${\mathbf x}_{n+1,n} = \begin{bmatrix}h_{n+1} \\ v_{n+1} \end{bmatrix}$ where

$$
\begin{aligned}
h_{n+1} & = h_n + v_n\Delta t \\
v_{n+1} & = v_n.
\end{aligned}
$$

That is, we assume that the height has increased at the constant velocity given by the state vector ${\mathbf x}_{n,n}$ and that the velocity has remained constant. More succinctly, if
$$
F_n = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix},
$$
we can write
$$
{\mathbf x}_{n+1, n} = F_n{\mathbf x}_{n,n}.
$$

The subscript in ${\mathbf x}_{n+1,n}$ indicates that this is a prediction of the state at time $n+1$ based only on our estimate of the state at time $n$.

This is a particular model we are choosing; in other situations, another model may be more appropriate. For instance, if we have information about the object’s acceleration, we may want to incorporate it. In any case, we assume there is a matrix $F_n$ from which we extrapolate the next state:

$$
{\mathbf x}_{n+1,n} = F_n{\mathbf x}_{n,n}.
$$

Of course, uncertainty in the state ${\mathbf x}_{n,n}$ will translate into uncertainty in the extrapolated state ${\mathbf x}_{n+1,n}$. It is straightforward to verify that the covariance matrix is transformed as

$$

P_{n+1,n} = F_nP_{n,n}F_n^T.

$$

For instance, if the uncertainties in position and velocity are initially uncorrelated, they may become correlated after the transformation, which is to be expected.

Figure: the distribution with covariance $P_{n,n}$ (left) and, after the transformation, with covariance $P_{n+1,n}$ (right).

These two transformations form the extrapolation phase of the algorithm:

$$
\begin{aligned}
{\mathbf x}_{n+1,n} & = F_n{\mathbf x}_{n,n} \\
P_{n+1,n} & = F_nP_{n,n}F_n^T.
\end{aligned}
$$

Next, suppose we have a new measurement ${\mathbf z}_{n}$ whose uncertainty is described by the covariance matrix $R_{n}$. We imagine that the normal distribution centered on ${\mathbf z}_{n}$ and with covariance matrix $R_n$ describes the distribution of the true state.

As before, we will fuse our predicted state ${\mathbf x}_{n+1,n}$ with the measured state ${\mathbf z}_{n}$ by multiplying the two normal distributions and rewriting the product as a single Gaussian. With some work, one finds the expression for the Kalman gain $$ K_{n} = P_{n+1,n}(P_{n+1,n} + R_{n})^{-1}, $$ which should be compared to our earlier expression.

We also obtain the updated state and covariance matrix

$$
\begin{aligned}
{\mathbf x}_{n+1, n+1} & = {\mathbf x}_{n+1,n} + K_{n}({\mathbf z}_{n} - {\mathbf x}_{n+1,n}) \\
P_{n+1,n+1} & = P_{n+1, n} - K_{n}P_{n+1,n}.
\end{aligned}
$$

So now we arrive at a new version of the algorithm:

- Initialize the initial state ${\mathbf x}_{1,1}$ and covariance matrix $P_{1,1}$.
- As new measurements become available, repeat the following steps:
  - Form the extrapolated state and covariance:
    $$
    \begin{aligned}
    {\mathbf x}_{n+1,n} & = F_n{\mathbf x}_{n,n} \\
    P_{n+1,n} & = F_nP_{n,n}F_n^T.
    \end{aligned}
    $$
  - Use the result to find the Kalman gain
    $$
    K_{n} = P_{n+1,n}(P_{n+1,n} + R_{n})^{-1}.
    $$
  - Use the Kalman gain to fuse the predicted state with the measured state and update the covariance matrix:
    $$
    \begin{aligned}
    {\mathbf x}_{n+1, n+1} & = {\mathbf x}_{n+1,n} + K_{n}({\mathbf z}_{n} - {\mathbf x}_{n+1,n}) \\
    P_{n+1,n+1} & = P_{n+1, n} - K_{n}P_{n+1,n}.
    \end{aligned}
    $$

Let’s see how this plays out in an example. Imagine we are tracking the height and vertical velocity of an object whose true height is as shown.

Remember that we don’t know the true height, but we do have some noisy measurements that reflect the object’s height and its velocity.

Applying the Kalman filtering algorithm naively, we obtain the green curve describing the object’s height. Notice how there is a time lag between the filtered height and the true height.

What’s the problem here? Clearly, the object is experiencing a significant amount of acceleration, which is not built into the model. As we saw in our earlier static example, the uncertainty in the estimated state ${\mathbf x}_{n,n}$ continually decreases, which means the confidence we have in our estimates grows and causes us to downplay the new measurements.

There are several ways to deal with this. If we have measurements of the acceleration, we could rebuild our model so that the state vector ${\mathbf x}_{n,n}$ includes the acceleration in addition to the position and velocity. Alternatively, if we have information about how an operator is controlling the object we’re tracking, we could build it into the extrapolation phase using the update
$$
{\mathbf x}_{n+1,n} = F_{n}{\mathbf x}_{n,n} + B_n{\mathbf u}_n
$$

where ${\mathbf u}_n$ is a vector describing some additional control and $B_n$ is a matrix describing how this control feeds into the extrapolated state. Clearly, we need to know more about the system to incorporate a term like this.

Finally, we can simply build additional uncertainty into the model by adding it into the extrapolation phase. For instance, we could define

$$

P_{n+1,n} = F_nP_{n,n}F_n^T + Q_n,

$$

where $Q_n$ is a covariance matrix known as *process noise*; it represents our way of saying there are additional influences not incorporated in the extrapolation model:

${\mathbf x}_{n+1,n} = F_n{\mathbf x}_{n,n}$.

Adding some process noise into the extrapolation phase prevents us from becoming overly confident in the extrapolated states, so that we continue to give sufficient weight to new measurements. This leads to the filtered state shown below, which is clearly much better than the measured signal.
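To make the pieces concrete, here is a Python sketch of the full predict/update loop with process noise. All matrices and noise levels are made-up illustrative values; for the demo the measurements are noiseless states of an on-model constant-velocity track, just so the result is easy to check.

```python
import numpy as np

def kalman_step(x, P, z, F, Q, R):
    """One extrapolate-then-fuse step of the filter described above."""
    # Extrapolation phase, with process noise Q.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Fusion phase: gain, state update, covariance update.
    K = P_pred @ np.linalg.inv(P_pred + R)
    x_new = x_pred + K @ (z - x_pred)
    P_new = P_pred - K @ P_pred
    return x_new, P_new

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity model
Q = 0.01 * np.eye(2)                   # process noise (made up)
R = np.diag([1.0, 0.25])               # measurement noise (made up)

# Start from the initial state and feed in states of the true track.
x, P = np.array([0.0, 1.0]), R.copy()
for z in [np.array([0.1, 1.0]), np.array([0.2, 1.0]), np.array([0.3, 1.0])]:
    x, P = kalman_step(x, P, z, F, Q, R)

print(x)  # stays on the constant-velocity track
```

Replacing the demo measurements with noisy ones (and experimenting with the size of $Q$) shows the trade-off discussed above: too little process noise and the filter lags an accelerating object; a bit more and it keeps up.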

Summary

Kalman developed this algorithm in 1960, though it seems to have appeared earlier in other guises, and found a significant early use in the Apollo guidance computer. Indeed, the algorithm is well suited for this application. While the guidance computer was a marvel of both hardware and software engineering, its memory and processing power were modest by our current standards. As the algorithm only relies on our current estimate of the state, its demands on memory are slight, and the computational complexity is similarly small.

In addition to being fast and efficient, the algorithm is also optimal in the sense that, under certain assumptions on the system being modeled, the algorithm has the smallest possible expected error obtained from a given set of measurements.

Kalman filtering is now ubiquitous in navigation and guidance applications. In fact, it is used to smooth the motion of computer trackpads so you may have used it while reading this article. If you are driving while navigating with an app like Google Maps, you may notice the effect of the algorithm when you come to a stop at a traffic light. It sometimes happens that your location continues with constant speed into the intersection and then quickly snaps back to your actual location. The extrapolation phase of the algorithm would lead us to believe that we continue with constant speed into the intersection before new measurements pull the extrapolated locations back to our true location.

Finally, the Chicken Robot needs your help to find a new name.

References

There are lots of relevant references on the internet, but few give an intuitive sense of what makes this algorithm work. Besides Kalman’s original paper, I’ve given a few of those here.

- **Rudolph Kalman.** *A New Approach to Linear Filtering and Prediction Problems*, Journal of Basic Engineering **82**, 1960. Pages 35-45.
- **Tim Babb.** How a Kalman filter works, in pictures. A lovely introduction with a lot of visual explanations.
- **Alex Becker.** Kalman Filter Tutorial. A gentle introduction that includes many motivating examples.
- **Ramsey Faragher.** Understanding the Basis of the Kalman Filter Via a Simple and Intuitive Derivation. *Signal Processing Magazine*, IEEE **29**, 2012. Pages 128-132.

*For an integer $k$, is it possible to place $k$ rooks on a chess board so that no piece sits on the same row or column as any others? We wouldn’t want them stepping on each others’ toes.*

**Thomas Morrill**

**Trine University**

In the game of chess, each of the four rooks moves along a row or a column, attacking the first piece it encounters. Chess can be complicated, so let’s ignore everything that isn’t a rook — the pawns, knights, bishops, king and queen, and even the players. How else could we set up the board?

This is the $k$-rook problem: For an integer $k$, is it possible to place $k$ rooks on a chess board so that no piece sits on the same row or column as any others? We wouldn’t want them stepping on each others’ toes. Are there multiple solutions? How many?

We’ll begin our exploration on the ordinary $8 \times 8$ square board, using as many rooks of a single color as necessary. Thinking like a combinatorialist, the easiest case to reason through is not $k=1$, but actually $k=0$: When you attempt to place zero rooks on the chessboard, the job is done before you start. There is exactly one solution to the zero-rook problem.

Next, try to place a single rook on the chess board. There are sixty-four spaces on the board, so there are sixty-four distinct solutions to the one-rook problem.

Two rooks? Just start with a random one-rook board, then place a second rook. Since the first rook blocks out one row and one column, there will be forty-nine legal spaces for the second rook to occupy. Uh-oh, $64 \cdot 49$ would be an over-estimate! Swapping the position of the first and second rooks does not change the arrangement, so we need to divide out by the two different ways of permuting the rooks. The answer is that there are $64 \cdot 49 / 2$ different solutions to the two-rook problem. And $64 \cdot 49 \cdot 36 /6$ solutions to the three-rook problem.

If we keep at it, we’ll see that for $1 \leq k \leq 8$, there are exactly

[

\frac{64 \cdot 49 \cdots (8 - k + 1)^2}{k!}

]

solutions to the $k$-rook problem. Does this formula still work for larger values of $k$? Yes! Since there are only eight rows to work with, it is impossible to place nine or more non-attacking rooks on the standard chess board. In these cases, the numerator of our formula will include $(8-9+1)$, rendering the whole quantity equal to zero.

Let’s focus on the eight-rook problem for a moment. Our solution formula suggests there are $8!$ solutions. But *why* eight factorial? Well, every row and every column must be occupied by a rook. The first row rook can be placed on any of the eight columns. Then the second row rook can be placed on any of the seven remaining columns, and so on. So there are $8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 8!$ solutions.

Let’s apply what we’ve learned to a more general problem. How many solutions are there to the rook problem on an $n \times m$ rectangular board? We’ll rotate any non-square boards so that there are more rows than columns. That is, we’ll assume $m \leq n$. (The rooks won’t be able to tell the difference anyway.) This means there will only be room for up to $m$ rooks before we run out of open columns.

On an $n \times 1$ board, with $n \geq 1$, there is one solution to the zero-rook problem, and $n$ different solutions to the one-rook problem. Easy enough.

On an $n \times 2$ board, with $n \geq 2$, there is one solution to the zero-rook problem, then $2n$ different solutions to the one-rook problem, and $n(n-1)$ solutions to the two-rook problem. Let’s put the data in a table before we forget. For each $k$, we’ll say that $r_k$ is the number of solutions to the $k$-rook problem on whatever board we are currently looking at. These are called the *rook numbers* of the board.

| $k$ | $0$ | $1$ | $2$ |
| --- | --- | --- | --- |
| $r_k$ | $1$ | $2n$ | $n(n-1)$ |

The rook numbers for an $n \times 3$ board with $n \geq 3$ are given in the next table.

| $k$ | $0$ | $1$ | $2$ | $3$ |
| --- | --- | --- | --- | --- |
| $r_k$ | $1$ | $3n$ | $3n(n-1)$ | $n(n-1)(n-2)$ |

Some products have shown up which resemble the binomial coefficient. Recall, the binomial coefficient $\binom{n}{k}$ counts the number of ways to choose $k$ objects from a set of $n$ objects, and it may be calculated as

[

\binom{n}{k} = \frac{n!}{k!(n-k)!}.

]

Let’s clean up a bit by using the *falling factorial*

[

n^{\underline{k}} = n (n-1) \cdots (n-k+1),

]

so that $n^{\underline{k}}$ is a product of $k$ terms starting at $n$ and stopping at $n-k+1$. Let’s say that $n^{\underline{0}} = 1$, just to keep the pattern consistent. It looks like the rook numbers for a general $n \times m$ board with $m \leq n$ should be $r_k = \binom{m}{k} n^{\underline{k}}$. Can we prove this? Yes!

Fix $0 \leq k \leq m \leq n$. We need $k$ columns to work on, and there are $\binom{m}{k}$ ways to choose those. Similarly, we need $k$ rows, and there are $\binom{n}{k}$ ways to choose those. Placing rooks on the intersections of these $k$ rows and $k$ columns is equivalent to solving the $k$-rook problem on a $k\times k$ square board. From our work on the $8\times 8$ board earlier, we know that there are $k!$ ways to place these rooks.

All together, we have

[

\binom{m}{k} \binom{n}{k} k! = \binom{m}{k} \frac{n!k!}{k!(n-k)!}=\binom{m}{k} n^{\underline{k}}

]

solutions to the $k$-rook problem on this board. In fact, the $\binom{m}{k}$ term sets this formula to zero if we try to place more rooks than we have columns for. Fancy!
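As a sanity check, the counting formula translates directly into a few lines of Python (a sketch; the function names `falling` and `rook_number` are mine, not the article's):

```python
from math import comb, factorial

def falling(n, k):
    """Falling factorial n (n-1) ... (n-k+1), with the convention n^underline{0} = 1."""
    out = 1
    for i in range(k):
        out *= n - i
    return out

def rook_number(m, n, k):
    """Number of solutions to the k-rook problem on an m x n board (m <= n columns),
    via the formula binom(m, k) * n^underline{k}."""
    return comb(m, k) * falling(n, k)
```

For instance, `rook_number(8, 8, 8)` recovers the $8!$ solutions on the standard board, and `rook_number(8, 8, 9)` is zero because `comb(8, 9)` vanishes.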

Could we make a more complicated board? Let’s start with an $n \times n$ square board and mark some of the spaces as being out of bounds. Whatever is left is the board $\mathcal{B}$. How do we find the rook numbers now?

Like before, there is always one solution to the zero-rook problem. Going further, the number of solutions to the one-rook problem will be equal to the number of spaces in bounds.

Just to keep things clear, we’ll decorate our rook numbers $r_k^\mathcal{B}$ with the letter $\mathcal{B}$ to indicate that the number of solutions to the $k$-rook problem depends on which board we have picked. This will be important when we work with more than one board at the same time. The $\mathcal{B}$ is *not* an exponent!

Now, it may be tempting to start a new table to hold these numbers. Before we get too carried away, I suggest we use a different data structure instead — a *polynomial*.

What, you don’t think we could store data in a polynomial? All we need is a safe place to write down the rook numbers $r_k^\mathcal{B}$ for later. We’ll write

\begin{align*}
R_\mathcal{B}(x) &= r_0^\mathcal{B} + r_1^\mathcal{B} x + r_2^\mathcal{B} x^2 + \cdots \\
&= \sum_{k=0}^n r_k^\mathcal{B} x^k.
\end{align*}

Because $\mathcal{B}$ only has room for at most $n$ rooks, this sum only goes up to $k=n$. This means each board $\mathcal{B}$ has a *rook polynomial* $R_\mathcal{B}(x)$, which we can use to explore relationships amongst the rook numbers $r_k^\mathcal{B}$.

Let’s see how this helps. Start with some non-empty board $\mathcal{B}$, and choose one of the squares. Any $k$-rook solution either places a rook on the chosen square, or it doesn’t. The solutions that don’t place a rook there correspond to $k$-rook solutions on the sub-board $\mathcal{B}_1$, which is $\mathcal{B}$ with the chosen square removed. The solutions that do place a rook on the chosen square correspond to $(k-1)$-rook solutions of the sub-board $\mathcal{B}_2$, which is $\mathcal{B}$ with both the row and column of the chosen square removed.

This means that $r_k^\mathcal{B} = r_k^{\mathcal{B}_1} + r_{k-1}^{\mathcal{B}_2}$. We can encode this relationship using the rook polynomials for all three boards, namely

[

R_\mathcal{B}(x) = R_{\mathcal{B}_1}(x) + x R_{\mathcal{B}_2}(x).

]

This can be a useful tool for calculating rook polynomials by hand. Just keep subdividing your boards into smaller pieces until you get to a board you know.
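The decomposition $r_k^\mathcal{B} = r_k^{\mathcal{B}_1} + r_{k-1}^{\mathcal{B}_2}$ becomes a short recursive program. A sketch, assuming the (my own) convention that a board is a set of (row, column) pairs:

```python
def rook_poly(board):
    """Coefficient list [r_0, r_1, ...] of the rook polynomial of `board`,
    a set of (row, column) pairs, computed by the cell decomposition."""
    if not board:
        return [1]                      # empty board: one zero-rook solution
    cell = next(iter(board))
    r, c = cell
    b1 = board - {cell}                 # solutions that avoid the chosen cell
    b2 = {(i, j) for (i, j) in board    # solutions that use it: delete its
          if i != r and j != c}         # row and column
    p1, p2 = rook_poly(b1), rook_poly(b2)
    out = [0] * max(len(p1), len(p2) + 1)
    for k, a in enumerate(p1):
        out[k] += a
    for k, a in enumerate(p2):
        out[k + 1] += a                 # shifting by one index multiplies by x
    return out
```

On the full $2 \times 2$ board this returns `[1, 4, 2]`: one empty placement, four single rooks, and $2! = 2$ non-attacking pairs.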

Let’s think a bit harder about what the subdivisions tell us.

We can gather information about the rook problem by methodically removing spaces from our board and trying to solve the rook problem again. But each of these boards is a square board with some spaces marked as out of bounds. Shouldn’t we be able to get the rook numbers for the in-bounds spaces by looking at the spaces which are out of bounds?

For a given board $\mathcal{B}$ cut out of an $n\times n$ square board, let’s take a look at the spaces which were removed. This is called the *complement board*, $\overline{\mathcal{B}}$.

The complement board has its own rook numbers, $r_k^{\overline{\mathcal{B}}}$, which have a fascinating relationship with the numbers $r_k^\mathcal{B}$:

[

\sum_{k=0}^n r_k^\mathcal{B} (n-k)! x^k = \sum_{k=0}^n (-1)^k r_k^{\overline{\mathcal{B}}} (n-k)! x^k (x+1)^{n-k}

]

This non-obvious result is known as the Rook Reciprocity Theorem.

In short, the equation states that the rook numbers for $\mathcal{B}$ determine the rook numbers for $\overline{\mathcal{B}}$.
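We can sanity-check the identity by brute force on a small board. In the sketch below, the choice of $n = 3$ and the sub-board `B` are my own arbitrary example:

```python
from itertools import combinations
from math import factorial

def rook_numbers(board, n):
    """Brute-force rook numbers r_0, ..., r_n of `board`, a set of
    (row, col) pairs inside an n x n square."""
    cells = sorted(board)
    r = [0] * (n + 1)
    r[0] = 1
    for k in range(1, n + 1):
        for combo in combinations(cells, k):
            rows = {i for i, _ in combo}
            cols = {j for _, j in combo}
            if len(rows) == k and len(cols) == k:   # no two rooks attack
                r[k] += 1
    return r

n = 3
square = {(i, j) for i in range(n) for j in range(n)}
B = {(0, 0), (1, 1), (1, 2)}            # an arbitrary sub-board
rB = rook_numbers(B, n)
rBc = rook_numbers(square - B, n)       # rook numbers of the complement

# Check the reciprocity identity at several integer points x
for x in range(1, 6):
    lhs = sum(rB[k] * factorial(n - k) * x**k for k in range(n + 1))
    rhs = sum((-1)**k * rBc[k] * factorial(n - k) * x**k * (x + 1)**(n - k)
              for k in range(n + 1))
    assert lhs == rhs
```

Two polynomials of degree at most $n$ that agree at enough points agree everywhere, so checking a handful of integer values is already persuasive.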

Notice that both sides of the equation are polynomials, meaning that expanding the sums and distributing the products will produce the same coefficients on either side of the equation. Sounds complicated.

Remember, though, that the polynomials are just a data structure to us. They are a means to an end. These particular polynomials are equal to each other, but it is the relationship between the rook numbers that we are after. So instead, we will consider a simpler statement of the Rook Reciprocity Theorem. Rather than write our polynomials with powers of $x$

[

a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n,

]

let’s use falling factorials, as in

[

a_0 + a_1 x^{\underline{1}} + a_2 x^{\underline{2}} + \cdots + a_n x^{\underline{n}}.

]

We never said that the falling factorial required integer valued inputs!

The same coefficients now produce a different polynomial, and hopefully one that is more accommodating to our problem. With a little sleight of hand, let’s consider rook polynomials like

[

\tilde{R}_\mathcal{B}(x) = \sum_{k=0}^n r_k^\mathcal{B} x^{\underline{n-k}}.

]

And here is the Rook Reciprocity Theorem stated in its more elegant form:

*For any sub-board $\mathcal{B}$ of an $n\times n$ square board, we have*

[

\tilde{R}_{\overline{\mathcal{B}}}(x) = (-1)^n \tilde{R}_{\mathcal{B}}(-x-1).

]

Brilliant.

Let’s walk through a proof of this result due to Timothy Chow. Starting on the right-hand side, we have

[

(-1)^n \tilde{R}_{\mathcal{B}}(-x-1) = (-1)^n \sum_{k=0}^n r_k^\mathcal{B} (-x-1)^{\underline{n-k}}

]

When we distribute the $(-1)^n$ into the summation, we can combine $n-k$ powers of $-1$ with the terms of the falling factorial; this can be rewritten as

[

\sum_{k=0}^n (-1)^{k} r_k^\mathcal{B} (x+n-k)^{\underline{n-k}}.

]

For the next part of the argument, let’s fix $x$ to be some positive integer, so that $r_k^\mathcal{B} (x+n-k)^{\underline{n-k}}$ is a nonnegative integer. What could it be counting?

Embed the board $\mathcal{B}$ into an $n\times n$ square board. Then, append $x$ more rows to the bottom of the board, producing an $(x + n) \times n$ board. Now, there are still $r_k^\mathcal{B}$ ways to place $k$ rooks onto $\mathcal{B}$. Each of these arrangements leaves $x+n-k$ rows unoccupied and $n-k$ columns unoccupied, too.

We can then place $n-k$ more rooks on the board in $(x+n-k)^{\underline{n-k}}$ different ways. So together, $r_k^\mathcal{B} (x+n-k)^{\underline{n-k}}$ is the number of ways to place $n$ rooks on the $(x+n) \times n$ board by first placing $k$ rooks on $\mathcal{B}$ and then another $n-k$ rooks anywhere else.

Ah, but that second batch of rooks might have been placed on the extra rows, on the complement board $\overline{\mathcal{B}}$, or even on some unoccupied space of $\mathcal{B}$. When we sum over $0 \leq k \leq n$, the same configurations will be counted multiple times with differing signs from the $(-1)^{k}$ term. What gets cancelled out, and what remains?

Consider one of these arrangements without knowing how the two steps were performed. Each arrangement has some number of rooks on $\mathcal{B}$, which we’ll call $j$. In the summation, the same arrangement will be counted once for each of the $2^j$ subsets of the rooks on $\mathcal{B}$.

Rooks that are not on $\mathcal{B}$ could only have been placed in step two. However, both steps could have put a rook on $\mathcal{B}$, and the parity of $k$ in step one controls the minus sign! For a fixed arrangement with $j > 0$ rooks on $\mathcal{B}$, the signed counts cancel because $\sum_{k=0}^{j} \binom{j}{k} (-1)^k = 0$. However, if $j=0$, the only subset of the empty solution on $\mathcal{B}$ is itself, and there is no cancellation.

That means, only arrangements in which there are *no* rooks on $\mathcal{B}$ will remain in the count. The only solutions we see have all the rooks placed on $\overline{\mathcal{B}}$ or on the $x$ extra rows. We can count those solutions directly: If there are $j$ rooks on $\overline{\mathcal{B}}$ and $n-j$ rooks on the extra rows, then there are $r_j^{\overline{\mathcal{B}}} x^{\underline{n-j}}$ ways to place them.

So, by summing over all the possible values of $j$, we have that

[

(-1)^n \sum_{k=0}^n r_k^\mathcal{B} (-x-1)^{\underline{n-k}}

=\sum_{j=0}^n r_j^{\overline{\mathcal{B}}} x^{\underline{n-j}},

]

or rather,

[

(-1)^n \tilde{R}_{\mathcal{B}}(-x-1) = \tilde{R}_{\overline{\mathcal{B}}}(x).

]

Now, this equation describes the *outputs* of the two polynomials at some integer $x > 0$. Ah, but the size of $x$ was arbitrary in our argument. The only way two polynomials can agree at every positive integer is if they are in fact the same polynomial.

Done.

Now, there are many other directions the rook problem can lead. When can two different boards have the same rook polynomial? Which polynomials are rook polynomials for some board? Can we pursue the rook problem in three or more dimensions? Could we have used a different chess piece?

I encourage you to think about the way we have used polynomials to organize our solution to this problem. We are not using polynomials as single objects in a vacuum. The insight here was to attach our counting problem to an algebraic structure. This technique is more broadly known as a *generating function*. When we are interested in some sequence of numbers, we can think of them as the coefficients of a polynomial or a power series, and manipulate the larger objects according to whatever algebraic rules they follow. The trick is to find the function that counts.

- Timothy Chow. A short proof of the rook reciprocity theorem. *Electron. J. Combin.*, 3(1): Research Paper 10, approx. 2 pp., 1996.
- Jay R. Goldman, J. T. Joichi, and Dennis E. White. Rook theory. I. Rook equivalence of Ferrers boards. *Proc. Amer. Math. Soc.*, 52:485-492, 1975.
- John Riordan. *An introduction to combinatorial analysis.* Dover Publications, Inc., Mineola, NY, 2002. Reprint of the 1958 original [Wiley, New York; MR0096594 (20 #3077)].
- Herbert S. Wilf. *generatingfunctionology.* A K Peters, Ltd., Wellesley, MA, third edition, 2006.

*Turing’s methodology was unique: he imagined hypothetical machines that could perform complicated mathematical tasks in a deterministic manner, in the way computers do today. In this way, he inadvertently kickstarted the entire field of modern computer science…*

**Adam A. Smith**

**University of Puget Sound**

Alan Turing may be the most-influential-yet-least-read figure in 20th Century mathematics. His landmark paper, “On computable numbers, with an application to the Entscheidungsproblem”, was published in the Proceedings of the London Mathematical Society in late 1936. This is the paper that spun computer science off of mathematics, into the distinct field that we know today. Yet, if you ask mathematicians and computer scientists alike if they have read this paper, the answer is frequently “No.” It is so long and amazingly dense that even experts often have a very hard time parsing his arguments. This column aims to rectify this slightly, by explaining one small part of Turing’s paper: the set of computable numbers, and its place within the real numbers.

The Entscheidungsproblem (German for “decision problem”) is the problem of developing an algorithm that can determine whether a statement of first-order logic is universally valid. However, Turing did not know that the American mathematician Alonzo Church had already proven, a few months previously, that such an algorithm is impossible. Even so, Turing’s methodology was unique: he imagined hypothetical machines that could perform complicated mathematical tasks in a deterministic manner, in the way computers do today. In this way, he inadvertently kickstarted the entire field of modern computer science (which has made some modest improvements since 1936). As part of his proof, he developed the concept of computable numbers. Turing used his imaginary machines (which have since come to be called “Turing machines” in his honor) to calculate the numbers in this set. It contains any number that can be calculated to within an arbitrary precision in a finite amount of time. This set includes all rational and algebraic numbers (e.g. $\sqrt{2}$), as well as many transcendental numbers such as $\pi$ and $e$. Interestingly, Turing created a very natural extension to Georg Cantor’s set theory when he proved that the set of computable numbers is countably infinite!

Most mathematicians are familiar with the idea of countability. That is, the notion developed by Cantor in the 1870s that not all infinite sets have the same cardinality. A set that is countably infinite is one for which there exists some one-to-one correspondence between each of its elements and the set of natural numbers $\mathbb{N}$. For example, the set of integers $\mathbb{Z}$ (“Z” for “Zahlen”, meaning “numbers” in German) can be easily shown to be countably infinite. Cantor himself developed a well-known proof that the set of rational numbers $\mathbb{Q}$ (for “quotient”) is also countably infinite. Thus, in a very real sense $\mathbb{N}$, $\mathbb{Z}$, and $\mathbb{Q}$ all have the same cardinality, even though $\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q}$.

Conversely, an infinite set for which there is no one-to-one correspondence with $\mathbb{N}$ is said to be “uncountably infinite”, or just “uncountable”. $\mathbb{R}$, the set of real numbers, is one such set. Cantor’s “diagonalization proof” showed that no infinite enumeration of real numbers could possibly contain them all. Of course, there are many irrational real numbers that are useful in common computations ($\pi$, $e$, $\phi$, $\sqrt{2}$, and $\log~2$, just to name a few). A common misconception is that a set of all such useful numbers (including infinitely many transcendental numbers) is somehow “too complicated” to be merely countably infinite. Turing’s paper proved this intuition to be incorrect.

The rest of this column is laid out as follows. First, we repeat Cantor’s proofs showing that $\mathbb{Z}$ and $\mathbb{Q}$ are countable and $\mathbb{R}$ is uncountable. Then we will show how Turing extended Cantor’s work by proving the countability of the set of computable numbers. We will call this set $\mathbb{K}$, to better fit in with the other sets of numbers. However, we will reprove Turing’s ideas using Python rather than his original Turing machines. The ideas behind the proofs remain unchanged, while becoming much more easily understood by a modern audience.

Intuitively, one might think of the set of integers $\mathbb{Z}$ as being “bigger” than the set of the natural numbers $\mathbb{N}$. After all, $\mathbb{Z}$ is a proper superset of $\mathbb{N}$! However, by rearranging the integers to start with $0$ and count up in magnitude alternating between positive and negative, we can create an infinite list of all the integers, with a definite starting point. This rearrangement allows a bijective function between $\mathbb{N}$ and $\mathbb{Z}$. This means the function gives a one-to-one and onto correspondence between the two sets, so the function is invertible.

Here’s a proof. Rearrange \(\mathbb{Z}\) to the order \(\{0, 1, -1, 2, -2, 3, -3, …\}\). Then map them to the elements of \(\mathbb{N}\) in their standard order, so that \(0\rightarrow 0\), \(1\rightarrow 1\), \(-1\rightarrow 2\), and so on. This is injective (“one-to-one”) , since no two different integers will ever map to the same natural number. And it is surjective (“onto”), because there exists an integer to map to every natural number. Therefore the relationship is a bijection, and thus the integers are a countable set.
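The rearrangement above can be written as a pair of explicit functions (a sketch; the function names are mine):

```python
def nat_to_int(n):
    """The bijection described above, read from N to Z:
    0, 1, 2, 3, 4, ...  ->  0, 1, -1, 2, -2, ..."""
    return (n + 1) // 2 if n % 2 == 1 else -(n // 2)

def int_to_nat(z):
    """Its inverse, mapping each integer to its position in the list."""
    return 2 * z - 1 if z > 0 else -2 * z
```

Having an explicit inverse is exactly what makes the correspondence a bijection: each function undoes the other.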

Proving that the set of rational numbers is countable is more difficult, given that there are two “degrees of freedom” in a rational number: the numerator and the denominator. It seems difficult to rearrange $\mathbb{Q}$ into a list the same way we did with $\mathbb{Z}$. (That is, one with a definite starting point, that extends infinitely forward in a single dimension.) One cannot simply list all the rational numbers with numerator 1, then all with numerator 2, etc., because there are an infinite number of them in each subset. If we started out by listing all the rationals with numerator 1 (for example), we’d never get to the others!

Instead, Cantor thought to traverse the set in an alternating “zig-zag” fashion, as shown here. We start by only considering the positive rational numbers $\mathbb{Q}^+$. We’ll extend to the complete set later. This zig-zag pattern lets us get to any given positive rational number, eventually.

Let’s prove that the rational numbers are countable. We start by showing a bijection between the positive rationals \(\mathbb{Q}^+\) and \(\mathbb{N}\). Lay out the elements of \(\mathbb{Q}^+\) as shown in the figure above, so that the first row consists of all numbers with a \(1\) numerator in order of increasing denominator (\(\frac{1}{1}\), \(\frac{1}{2}\), \(\frac{1}{3}\), \(\ldots\)), the second row has all those with a \(2\) numerator in the same order, and so on. Then, traverse them as shown in the figure, moving diagonally while weaving back and forth. We skip any fractions not in lowest terms, so that no rational number appears twice. In this way, we can arrange the set \(\mathbb{Q}^+\) into the order \(\{\frac{1}{1}, \frac{2}{1}, \frac{1}{2}, \frac{1}{3}, \frac{3}{1}, \ldots\}\). When mapping them to the natural numbers in this order, the mapping function is an injection because no two different elements of \(\mathbb{Q}^+\) map to the same member of \(\mathbb{N}\), and it is a surjection because there exists a member of \(\mathbb{Q}^+\) for every member of \(\mathbb{N}\). Thus it is a bijection, and \(\mathbb{Q}^+\) is countable.

To show that the set of *all* rational numbers \(\mathbb{Q}\) is countable, use the reordering strategy employed to prove that \(\mathbb{Z}\) is countable. Reorder \(\mathbb{Q}\) to start with \(0\), and then proceed through the rationals in the order shown in the figure, alternating positive and negative. Thus \(\mathbb{Q}\) will be ordered \(\{\frac{0}{1}, \frac{1}{1}, -\frac{1}{1}, \frac{2}{1}, -\frac{2}{1}, \frac{1}{2}, -\frac{1}{2}, \frac{1}{3}, -\frac{1}{3}, \frac{3}{1}, -\frac{3}{1}, \ldots\}\). These elements can be mapped to the elements of \(\mathbb{N}\) in its usual order, and this is a bijection for the same reasons as given above.
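The zig-zag traversal, lowest-terms skipping, and positive/negative interleaving all fit in a short generator (a sketch; the diagonal-by-diagonal implementation is my own reading of the figure):

```python
from fractions import Fraction
from math import gcd

def rationals():
    """Enumerate all of Q in the zig-zag order described above:
    0, 1, -1, 2, -2, 1/2, -1/2, 1/3, -1/3, 3, -3, ..."""
    yield Fraction(0)
    s = 2  # numerator + denominator is constant along each diagonal
    while True:
        # alternate direction of travel on successive diagonals
        nums = range(1, s) if s % 2 == 0 else range(s - 1, 0, -1)
        for num in nums:
            den = s - num
            if gcd(num, den) == 1:      # skip fractions not in lowest terms
                yield Fraction(num, den)
                yield Fraction(-num, den)
        s += 1
```

Every rational appears exactly once, and each one appears after finitely many steps — which is all a bijection with \(\mathbb{N}\) requires.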

However, real numbers are inherently uncountable. A rephrasing of Cantor’s original proof follows, using a trick that has come to be known as “diagonalization.” No matter what infinite list of real numbers is given, we can generate a new number $x$ that cannot possibly be in that list.

To prove the real numbers are uncountable, let’s assume they are not. Then there would exist some one-to-one correspondence between the real numbers \(\mathbb{R}\) and the natural numbers \(\mathbb{N}\). If that were the case, then it would be possible to enumerate all the real numbers in some order, such as that illustrated in the table shown. However, we can construct a real number \(x\) that cannot possibly be in the list, which means that it’s not a complete list. We do this by taking one digit from each number in the list, and changing it. (In the table shown, the change is effected by adding 1 to the digit, but skipping 9 and 0 so as to avoid numbers such as \(0.0999\ldots\) and \(0.1000\ldots\) that are equal, even though their digits are different.) This way, \(x\) differs from every single number in the list in at least one of its digits. The real number \(x\) has no corresponding element in the natural numbers, and therefore the original assumption must be false. \(\mathbb{R}\) must be uncountable.
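The digit-changing rule can be made concrete. A sketch, assuming the table's rule means "add 1, sending 8, 9, and 0 to 1" so the constructed number never contains a 0 or 9 digit:

```python
def diagonal_number(listing):
    """Given a (finite prefix of an) enumeration of reals as strings of
    their digits after the decimal point, build a number that differs
    from the i-th entry in its i-th digit."""
    digits = []
    for i, row in enumerate(listing):
        d = int(row[i])
        digits.append(str(d + 1 if 1 <= d <= 7 else 1))  # avoid 0 and 9
    return "0." + "".join(digits)
```

For any listing, the result disagrees with entry $i$ in digit $i$, so it cannot appear anywhere in the listing.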

Thus, integers and rational numbers are both countable sets, while real numbers are uncountable. Cantor also proved that the algebraic numbers ($\mathbb{A}$) form a countable set. (These are the roots of polynomials with rational coefficients.) For his insight, Cantor was allegedly called an “apostate” (“Abtrünnigen” in German) and a “corrupter of youth” (“Verderber der Jugend”), among other colorful names. Many of his contemporaries felt that the nature of infinity was beyond human mathematics, being the realm of God. It was not until David Hilbert began to champion his ideas that they became more widely accepted in the early 1900s. Today, they are part of a standard undergraduate curriculum.

We will now move on to Turing’s work on computable numbers, which he defined in his famous paper. We use $\mathbb{K}$ to represent this set. ($\mathbb{C}$ is taken, as it usually refers to the set of *complex* numbers.) Turing defined a computable number as one that can be calculated to within an arbitrary precision, within a finite amount of time. These days, the easiest way to do that is often with a computer program—though any algorithmic approach will do.

First, let us show that all rational numbers are computable. When represented as a decimal, a rational number either terminates, or eventually begins to infinitely repeat the same set of digits. Thus, given enough time, a program could be written to display any rational number to the desired precision.

The figure here shows a non-terminating Python program, similar in spirit to the machines Turing laid out in his paper. This one outputs the never-ending decimal representation of one third. It first prints out “`0.`”, and then loops forever printing the digit “`3`” until the user gets tired and pulls the plug. In this way, it “calculates” the number to as accurate a degree as the user wishes.
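A program in the same spirit can stream the digits of any rational number by long division. This is a sketch of the idea, not the article's actual listing, and it yields digits rather than printing forever so the output can be inspected:

```python
def decimal_digits(p, q):
    """Stream the decimal digits of p/q (with 0 <= p < q) one at a time,
    forever, by repeated long division."""
    while True:
        p *= 10
        yield p // q   # next digit after the decimal point
        p %= q         # carry the remainder forward
```

Calling `decimal_digits(1, 3)` yields `3` endlessly; `decimal_digits(1, 7)` cycles through the repeating block of $\frac{1}{7}$. By Turing's terminology, a machine running this loop forever is “satisfactory.”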

It may come as a surprise that in his seminal paper, Turing preferred Turing machines that never terminated their calculations. Today, Turing machines are often associated with the “halting problem”, which is the impossible task of proving whether or not an arbitrary program will eventually terminate. However, Turing used his machines to calculate numbers to ever-increasing precisions. In fact, he referred to his machines as “unsatisfactory” if they ever stopped, and “satisfactory” if they kept on going forever. Although Turing’s paper touches on concepts similar to the halting problem, this would not be posed as we know it today until the 1950s.

The procedure to calculate a computable number does not need to be in code form (though any of the below approaches may be programmed, if needed). For example, $\pi$ is also a computable number, and to show this we only need to express it as an infinite sum, like this one:

$$\pi = \frac{4}{1} - \frac{4}{3} + \frac{4}{5} - \frac{4}{7} + \frac{4}{9} - \ldots$$

This is not as “computery” as the above code, but it is algorithmic all the same. One may continue adding further terms until the approximation is as accurate as needed. There are other well-known infinite series to calculate other irrational numbers like $e$ and $\phi$. So long as there exists a mechanical way to approximate a number to ever-increasing precision, within finite time, the number is computable.
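The partial sums of that series are themselves easy to compute (a sketch; this is the simple alternating-sum approach, not the streaming method discussed next):

```python
def leibniz_pi(terms):
    """Partial sum of the series 4/1 - 4/3 + 4/5 - 4/7 + ...,
    which converges (slowly) to pi."""
    return sum((-1) ** k * 4 / (2 * k + 1) for k in range(terms))
```

Because the series alternates, the error after $n$ terms is at most the next term, $\frac{4}{2n+1}$ — so the precision really is under our control, just very slowly.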

Writing an efficient Python program to calculate $\pi$ is a little more difficult. The program shown in this figure uses a “streaming” method developed by J. Gibbons and implemented by D. Bau. It is not as easily comprehensible as the above infinite series, but is more in the spirit of Turing’s paper. It is able to continue printing further digits of $\pi$, only needing to maintain six variables between each print statement (rather than reviewing all the digits it has already output), making for a very efficient calculation.

Some other numbers such as $\sqrt{2}$ and $\log~2$ also have infinite sums, products, or fractions that can be used to calculate them. However, it’s often more efficient to compute them via successive approximations. For example, let us approximate $\sqrt{2}$ with a number $r$. We know that $r$ is a positive number such that $r^2 = 2$. Since $1 < r^2 < 4$, that means that $1 < r < 2$. We can then successively approximate $r$ as $1.5$, $1.25$, $1.375$, $1.4375$, $1.40625$, and so on, each time adding or subtracting the next smaller power of two, depending on whether the current approximation’s square is greater than or less than 2. We stop when the power added or subtracted is less than half the desired maximum error.
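The successive-approximation scheme just described can be sketched directly, using exact `Fraction` arithmetic so the powers of two never suffer rounding (the function name and stopping-test phrasing are mine):

```python
from fractions import Fraction

def approx_sqrt2(max_error):
    """Approximate sqrt(2) as described above: start at 1.5, then add or
    subtract halving powers of two depending on whether the square of the
    current approximation overshoots 2."""
    r = Fraction(3, 2)       # first approximation, midway between 1 and 2
    step = Fraction(1, 4)    # next power of two to add or subtract
    while 2 * step >= max_error:   # remaining corrections sum to < max_error
        if r * r > 2:
            r -= step
        else:
            r += step
        step /= 2
    return r
```

The first few values of `r` are $1.5$, $1.25$, $1.375$, $1.4375$, $1.40625$, matching the sequence in the text; each halving of the step halves the remaining uncertainty.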

There is one flaw with calculating irrational numbers like these on a modern computer: any computer has only a finite amount of memory, and eventually that memory will be exhausted. This is one key difference between Turing machines and real computers: Turing assumed that his machines had inexhaustible memory. Thus, it is fair to do likewise when proving that the set of computable numbers is countable. However, one could also slightly redefine a computable number to be one that can be calculated to within an arbitrary precision in a finite amount of time *and with a finite amount of resources*. This is equivalent to Turing’s original definition, since if a calculation required unbounded memory, it would need unbounded time to access that memory.

In general, any number with a well-defined value, or well-defined method to calculate its value, may be said to be an element of $\mathbb{K}$. If you can write a program to calculate a number (to within some arbitrary precision, in a finite amount of time), it’s computable. This includes many well-known non-algebraic real numbers. In fact, it’s very difficult to come up with a number that’s *not* computable.

Now that we’ve defined our set $\mathbb{K}$, we need to show that it is countable. The proof relies on the concept of a Gödel number. Gödel numbers were developed by Kurt Gödel himself, in order to prove his “incompleteness theorems.” (If you are new to German, Gödel’s name is pronounced similarly to the English word “girdle”, but without the “r” sound.)

**Definition: Gödel Number**

A Gödel number is a natural number that is a unique representation of a particular member of a set \(S\). The Gödel number is calculated with a function \(\mathrm{g}\) whose domain is \(S\) and whose codomain is \(\mathbb{N}\).

Note that $\mathrm{g}$ is an injective function, so that if $\mathrm{g}(s) = \mathrm{g}(s’)$, it must be the case that $s = s’$. However, some integers may not be valid Gödel numbers for members of $S$, so the function $\mathrm{g}$ is frequently not a bijection between $S$ and $\mathbb{N}$.

If a set \(S\) is infinite, and there exists a function \(\mathrm{g}\) to calculate a valid Gödel number for every member \(s\), then \(S\) must be countably infinite. To prove this, we must show that \(S\) has a one-to-one correspondence with \(\mathbb{N}\). Since the Gödel number of each set element \(s\) is unique, one of them must have the least value. Let this element of \(S\) correspond to \(0\). Then, the one with the second-least Gödel number will correspond to \(1\), the next one to \(2\), and so on. Since both sets are infinite, the correspondence can continue infinitely. Hence there exists a bijection between \(S\) and \(\mathbb{N}\), and therefore \(S\) is countable.
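As a finite toy illustration of this ranking (the elements and Gödel numbers below are invented for the example; the sets in the actual proof are infinite):

```python
def rank_by_goedel_number(goedel):
    """Given a dict mapping set elements to their (unique) Gödel
    numbers, return the correspondence element -> natural number
    obtained by ordering the elements by Gödel number, as in the
    argument above."""
    ordered = sorted(goedel, key=goedel.get)
    return {s: n for n, s in enumerate(ordered)}

# hypothetical Gödel numbers for three elements of some set S
g = {"a": 907, "b": 15, "c": 4211}
assert rank_by_goedel_number(g) == {"b": 0, "a": 1, "c": 2}
```

In the infinite case the same idea works because any nonempty set of natural numbers has a least element, so the "next smallest Gödel number" is always well defined.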

It remains to show that a Gödel number function $\mathrm{g}$ exists for $\mathbb{K}$. The trick is to calculate a Gödel number from the program that generates a computable number, rather than from the number itself. Recall that a program consists of a finite block of text. Most computers use the ASCII code or Unicode to represent a text file, in which every valid character is encoded as a number between 1 and 127. (A 0 would represent a null character, which isn’t used in text files.) Thus, the source code itself may be represented as a base-128 number with no 0s, and as many digits as there are characters in the file. We will simply translate this base-128 number into a common base-10 one. For example, in the case of the tiny program above that prints one third, the computation is:

$$112\cdot128^0 + 114\cdot128^1 + 105\cdot128^2 + \ldots + 10\cdot128^{50}$$

Here, the $112$, $114$, and $105$ represent the characters `p`, `r`, and `i` respectively: the first three characters in the program. The fifty-first character at the end is a newline, which is represented by $10$. The resulting sum is the number:

23,674,419,604,088,055,738,162,829,936,727,249,274,729,959,756,590,023,485,007,031,946,824,166,916,359,320,099,837,891,922,699,574,436,329,840.

The other program, to calculate $\pi$, results in a Gödel number that is approximately $1.77\times10^{654}$. These numbers are unwieldy, but remember that they encode entire programs. Because one can recover the complete program from its number, it must be the case that the function $\mathrm{g}$ is injective.
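The encoding, and the inverse map that proves injectivity, take only a few lines of Python (the names are mine, and the `program` string is a stand-in for the one-third program discussed above):

```python
def goedel_number(source):
    """Encode program text as a base-128 integer, least significant
    digit first, exactly as in the sum above."""
    return sum(ord(c) * 128 ** i for i, c in enumerate(source))

def decode(n):
    """Recover the program from its Gödel number, which is why the
    encoding is injective: distinct programs get distinct numbers."""
    chars = []
    while n:
        chars.append(chr(n % 128))
        n //= 128
    return "".join(chars)

program = 'print(1/3)\n'   # a stand-in for the program in the text
assert decode(goedel_number(program)) == program
# the first three terms match the sum displayed above:
assert goedel_number("pri") == 112 + 114 * 128 + 105 * 128 ** 2
```

Note that no character code is 0, so the base-128 representation never has a leading-zero ambiguity.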

This is similar to Turing’s approach: he created Turing machines to compute numbers, and then represented each machine as a large table of states. This let him systematically create a unique “description number” for each possible machine. (The term “Gödel number” was not yet in common use.) The number was a large positive integer that described every aspect of the machine, and no two different machines could ever have the same number. Thus, as in our case, Turing showed that there exists a valid Gödel function $\mathrm{g}$, and used this to show that $\mathbb{K}$ is countable.

It may be argued that any given computable number has an infinite number of programs that can calculate it, and hence infinite Gödel numbers associated with it. We need the function $\mathrm{g}$ to map each computable number to one Gödel number. Therefore, we choose the number with the least magnitude, which corresponds to the shortest program that produces the computable number in question. A critic may argue that finding this shortest program is not always possible, and so the Gödel function $\mathrm{g}$ cannot be executed. However, for our purposes it does not matter if the function is practical. All that matters, to prove that the set of computable numbers is countable, is that each one can be shown to have an associated Gödel number.

Here’s the final proof, bringing this all together. The computable numbers are an infinite set. We have provided an injective function \(\mathrm{g}\) that maps every computable number to a single natural number: a Gödel number. Any set with such a function is countable, and therefore computable numbers are countable.

Some readers may wonder why we cannot use the same diagonalization trick that Cantor used, and so prove that the computable numbers are uncountable. Let’s try to do it, and we will see where it fails.

Let us say that we have an allegedly complete (though infinite) enumeration of every computable number. We will now attempt to use diagonalization in order to produce a new computable number $x$, that could not possibly be in our list. First we calculate the tenths place digit of the first computable number, and assign $x$’s tenths place digit to be something else. Then we calculate the hundredths place digit of the second computable number, and assign $x$’s corresponding digit to be another one. We proceed as before, until we have computed a number that could not possibly be in the enumeration. Thus, it appears that we have a contradiction, and so $\mathbb{K}$ appears uncountable.

The problem with this “proof” is that we never showed that the number $x$ is itself computable. When Cantor used diagonalization to prove $\mathbb{R}$ uncountable, the resulting $x$ was obviously a real number, even though it’s not in the “complete” list. Hence he had a contradiction. But proving that a number is computable is more difficult, because we would need to show that the computation to some arbitrary accuracy can be done in finite time. Since we haven’t done this (and in fact we cannot do it), the alleged proof falls apart.

It might seem straightforward at first to prove that $x$ is computable, since the diagonalization technique appears to give an algorithm to calculate it. However, the algorithm is flawed. Remember that it calculates $x$ by using each computable number in turn, which would include $x$ itself. Say that $x$ happens to be the $i$th number in the list. Then, in order to calculate $x$ to the $i$th digit, one must first calculate $x$ to the $i$th digit: the algorithm enters an infinite regress from which it cannot recover. Thus the computation fails, $x$ is not computable, there is no contradiction, and the diagonalization technique does not work.

What does it mean that the set of computable numbers is countable, but the set of real numbers is not? In a very real sense, the set $\mathbb{K}$ contains all the numbers that are *knowable*. Real numbers that aren’t computable can’t be calculated, or even thought about except in the most abstract manner (such as $x$ from the last section, the uncomputable result of applying the diagonalization process to $\mathbb{K}$). Yet, since $\mathbb{R}$ is uncountable, the vast majority of real numbers must be unknowable. They are the infinite streams of digits, with absolutely no rhyme or reason to their sequences. Any real number used in day-to-day life is part of the computable set $\mathbb{K}$.

Others have gone on to describe further noncomputable numbers. For example, in 1949 the Swiss mathematician Ernst Specker found a monotonically increasing, bounded, computable sequence of rational numbers whose least upper bound is not computable. Computability theory, which originated with Turing’s paper, has become a branch of mathematics in its own right. It attempts to classify noncomputable functions into a hierarchy based on their degrees of noncomputability, and it is of interest to both mathematicians and computer scientists.

Fortunately for Alan Turing, his Turing machines were thought to be innovative enough that his paper found publication, despite not being the first to prove that the Entscheidungsproblem is not solvable. In fact, his paper served as his introduction to Alonzo Church, the very man who had scooped him. Church would go on to supervise Turing’s doctoral work at Princeton, from which he graduated in 1938 at the age of 26! The two would go on to lay much of the groundwork for modern computer science, including the Church-Turing thesis (which roughly states that human computation, machine computation, and Church’s lambda calculus are all equivalent).

Turing’s paper doesn’t mention complex numbers, but an extension of his work is straightforward. A complex computable number would be a complex number whose real portion is a computable number, and whose imaginary portion is another computable number times $i$ $(\sqrt{-1})$. Such a number is computable to within an arbitrary precision in a finite amount of time, so long as the program that calculates it alternates regularly between calculating the real portion and the imaginary portion (perhaps outputting the two numbers via separate streams or into separate files). Since such a complex number can be calculated using a program, the same proof of countability applies.
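Here is one hedged sketch of such an alternating computation, using exact rationals so the digit streams are easy to generate (the representation and function names are mine, not Turing's):

```python
from fractions import Fraction

def digits(x, base=10):
    """Yield the fractional digits of a rational 0 <= x < 1."""
    while True:
        x *= base
        d = int(x)
        yield d
        x -= d

def complex_digits(re, im):
    """Alternate between the real and imaginary digit streams, so
    both parts are refined to arbitrary precision over time and
    neither starves the other."""
    r, i = digits(re), digits(im)
    while True:
        yield ('re', next(r))
        yield ('im', next(i))

# approximate (1/3) + (1/7)i, one tagged digit at a time
stream = complex_digits(Fraction(1, 3), Fraction(1, 7))
first = [next(stream) for _ in range(4)]
assert first == [('re', 3), ('im', 1), ('re', 3), ('im', 4)]
```

Tagging each digit with `'re'` or `'im'` stands in for "separate streams or separate files" in the text.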

Charles Petzold’s “The Annotated Turing” is an excellent source for those who wish to know more about Turing and his proof that the Entscheidungsproblem is impossible. It reproduces Turing’s entire paper one small piece at a time, interspersed with copious explanatory detail and context. If you wish to know more about Turing’s seminal work, this is without a doubt the best place to start. Those more interested in Turing’s personal life should read Andrew Hodges’ “Alan Turing: the Enigma”. This covers his whole life in great detail, from his early childhood, through his codebreaking efforts during the war years, and ending with his tragic persecution and death just shy of his 42nd birthday.

Turing’s idea of computable numbers forms a logical and intuitive extension to Cantor’s original work on countability. His work is a unique bridge between mathematics and computer science that is still relevant today.

Thank you to David Bau for permission to use his implementation of the $\pi$ spigot algorithm. Thank you also to Ursula Whitcher for encouragement and assistance with one of the proofs.

- D. Bau. A dabbler’s weblog: Python pi.py spigot. http://davidbau.com/archives/2010/03/14/python_pipy_spigot.html, March 2010.
- G. Cantor. Über eine Eigenschaft des Inbegriffs aller reellen algebraischen Zahlen (On a property of the collection of all real algebraic numbers). Journal für die reine und angewandte Mathematik, 77:258–262, 1874.
- G. Cantor. Über eine elementare Frage der Mannigfaltigkeitslehre (On an elementary question concerning the theory of manifolds). Jahresbericht der Deutschen Mathematiker-Vereinigung, 1, 1892.
- A. Church. A note on the Entscheidungsproblem. The Journal of Symbolic Logic, 1(01):40–41, 1936.
- J. Gibbons. Unbounded spigot algorithms for the digits of pi. The American Mathematical Monthly, 113(4):pp. 318–328, 2006.
- K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I (On formally undecidable propositions of Principia Mathematica and related systems I). Monatshefte für Mathematik und Physik, 38(1):173–198, 1931.
- A. Hodges. Alan Turing: the Enigma. Burnett Books, London, 1983.
- C. Petzold. The Annotated Turing: A Guided Tour Through Alan Turing’s Historic Paper on Computability and the Turing Machine. Wiley Publishing, 2008.
- E. Specker. Nicht konstruktiv beweisbare Sätze der Analysis (Theorems of analysis that cannot be proven constructively). The Journal of Symbolic Logic, 14:145–158, 1949.
- A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42, 1936.
- A. M. Turing. Systems of logic based on ordinals. PhD thesis, Princeton University, 1938.

*Some people view mathematics as a purely platonic realm of ideas independent of the humans who dream about those ideas. If that’s true, why can’t we agree on the definition of something as universal as a prime number?*

**Courtney R. Gibbons**

**Hamilton College**

*Scene:* It’s a dark and stormy night at SETI. You’re sitting alone, listening to static on the headphones, when all of a sudden you hear something: two distinct pulses in the static. Now three. Now five. Then seven, eleven, thirteen — it’s the sequence of prime numbers! A sequence unlikely to be generated by any astrophysical phenomenon (at least, so says Carl Sagan in *Contact*, the novel from which I’ve lifted this scene) — in short, proof of alien intelligence via the most fundamental mathematical objects in the universe…

Hi! I’m Courtney, and I’m new to this column. I’ve been enjoying reading my counterparts’ posts, including Joe Malkevitch’s column Decomposition and David Austin’s column Meet Me Up in Space. I’d like to riff on those columns a bit, both to get to some fun algebra (atoms and ideals!) and to poke at the idea that math is independent of our humanity.

Scene: It’s a dark and stormy afternoon in Clinton, NY. I’m sitting alone at my desk with two undergraduate abstract algebra books in front of me, both propped open to their definitions of a prime number…

- Book A says that an integer $p$ (of absolute value at least 2) is *prime* provided it has exactly two positive integer factors. Otherwise, Book A says $p$ is *composite*.
- Book B says that an integer $p$ (of absolute value at least 2) is *prime* provided whenever it divides a product of integers, it divides one of the factors (in any possible factorization). Otherwise, Book B says $p$ is *composite*.

*Note:* Book A is Bob Redfield’s *Abstract Algebra: A Concrete Introduction* (see the references below).

I reached for the nearest algebra textbook to use as a tie-breaker, which happened to be Dummit and Foote’s *Abstract Algebra*, only to find that the authors hedge their bets by providing Book A’s definition and then noting that, well, actually, Book B’s definition can be used instead. Yes, it’s a nice exercise to show these definitions are equivalent. I can’t help but wonder, though: which is what it really *is* to be prime, and which is merely a consequence of that definition?
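That exercise is also fun to sanity-check by machine. Here is a Python sketch of both definitions (the function names are mine, and Book B's condition quantifies over *all* products of integers, so the finite search below is an illustration, not a proof of equivalence):

```python
def prime_book_a(p):
    """Book A: |p| >= 2 and p has exactly two positive integer factors."""
    if abs(p) < 2:
        return False
    return sum(1 for d in range(1, abs(p) + 1) if abs(p) % d == 0) == 2

def prime_book_b(p, limit=50):
    """Book B: whenever p divides a product, it divides a factor.
    Checked here only over a finite range of products."""
    if abs(p) < 2:
        return False
    return all(a % p == 0 or b % p == 0
               for a in range(1, limit) for b in range(1, limit)
               if (a * b) % p == 0)

# the two definitions agree on every integer in a small window
assert all(prime_book_a(n) == prime_book_b(n) for n in range(-20, 21))
```

A composite $p = mn$ fails Book B's test at the witness pair $(m, n)$, which is the easy half of the equivalence; the other half is Euclid's lemma.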

Some folks take the view that math is a true and beautiful thing and we humans merely discover it. This seems to me to be a way of saying that math is independent of our humanity. Who we are, what communities we belong to — these don’t have any effect on Mathematics, Platonic Realm of Pure Ideas. To quantify this position as one might for an intro to proofs class: For each mathematical idea $x$, $x$ has a truth independent of humanity. And yet, two textbooks fundamental to the undergraduate math curriculum are sitting here on my desk with the audacity to disagree about the very definition of arguably the **most** pure, **most** platonic, **most** absolutely mathematical phenomenon you could hope to encounter: prime numbers!

This isn’t a perfect counterexample to the universally quantified statement above (maybe one of these books is wrong?). But in my informal survey of undergraduate algebra textbooks (the librarians at Hamilton really love me and the havoc I wreak in the stacks!), there’s not exactly a consensus on the definition of a prime!

As far as I can tell, the only consensus is that we shouldn’t consider $-1$, $0$, or $1$ to be prime numbers.

But, uh, why not?! In the case of $0$, it breaks both definitions. You can’t divide by zero (footnote: well, you shouldn’t divide by zero if you want numbers to be meaningful, which is, of course, a decision that someone made and that we continue to make when we assert “you can’t divide by zero”), and zero has infinitely many positive integer factors. But when $\pm 1$ divides a product, it divides one (all!) of the factors. And what’s so special about exactly two positive divisors anyway? Why not “at most two” positive divisors?
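A quick computational aside on how $0$ and $\pm 1$ sit against Book A's criterion (the helper is mine; for $0$ we can only search a finite range, since *every* positive integer divides $0$):

```python
def positive_divisors(n, search_up_to=None):
    """List positive divisors of n. For n = 0 we must cap the search,
    because every positive integer divides 0."""
    top = abs(n) if n != 0 else search_up_to
    return [d for d in range(1, top + 1) if n % d == 0]

assert positive_divisors(5) == [1, 5]      # exactly two: Book A prime
assert positive_divisors(1) == [1]         # only one: excluded
assert positive_divisors(0, search_up_to=10) == list(range(1, 11))

# meanwhile 1 divides every integer, so it passes Book B's
# "divides a product => divides a factor" test vacuously:
assert all(a % 1 == 0 for a in range(1, 100))
```

So $\pm 1$ really do slip through Book B's net, and only the explicit exclusion keeps them out.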

Well, if you’re reading this, you probably have had a course in algebra, and so you know (or can be easily persuaded, I hope!) that the integers have a natural (what’s natural is a matter of opinion, of course) algebraic analog in a ring of polynomials in a single variable with coefficients from a field $F$. The resemblance is so strong, algebraically, that we call $F[x]$ an integral domain (“a place where things are like integers” is my personal translation). The idea of prime, or “un-break-down-able”, comes back in the realm of polynomials, and Book A and Book B provide definitions as follows:

- Book B says that a nonconstant polynomial $p(x)$ is *irreducible* provided the only way it factors is into a product in which one of the factors must have degree 0 (and the other necessarily has the same degree as $p(x)$). Otherwise, Book B says $p(x)$ is *reducible*.
- Book A says that a nonconstant polynomial $p(x)$ is *irreducible* provided whenever $p(x)$ divides a product of polynomials in $F[x]$, it divides one of the factors. Otherwise, Book A says $p(x)$ is *reducible*.

Both books agree, however, that a polynomial is reducible if and only if it has a factorization that includes more than one irreducible factor (and thus a polynomial cannot be both reducible and irreducible).

Notice here that we have a similar restriction: the zero polynomial is excluded from the reducible/irreducible conversation, just as the integer 0 was excluded from the prime/composite conversation. But what about the other constant polynomials? They satisfy both definitions aside from the seemingly artificial caveat that they’re not allowed to be irreducible!

Well, folks, it turns out that in the integers and in $F[x]$, if you’re hoping to have meaningful theorems (like the Fundamental Theorem of Arithmetic or an analog for polynomials, both of which say that factorization into primes/irreducibles is unique up to a mild condition), you don’t want to allow things with multiplicative inverses to be among your un-break-down-ables! We call elements with multiplicative inverses *units*, and in the integers, $(-1)\cdot(-1) = 1$ and $1\cdot 1 = 1$, so both $-1$ and $1$ are units (they’re the only units in the integers).

In the integers, we want $6$ to factor uniquely into $2\cdot 3$, or, perhaps (if we’re being generous and allowing negative numbers to be prime, too) into $(-2)\cdot(-3)$. This generosity is pretty mild: $2$ and $-2$ are *associates*, meaning that they are the same up to multiplication by a unit. One statement of the Fundamental Theorem of Arithmetic is that every integer (of absolute value at least two) is prime or factors uniquely into a product of primes *up to the order of the factors and up to associates*. That means that the list of prime factors (up to associates) that appear in the factorization of $6$ is an invariant of $6$, and the number of prime factors (allowing for repetition) in any factorization of $6$ is another invariant (and it’s well-defined). Let’s call it the *length* of $6$.

But if we were to let $1$ or $-1$ be prime? Goodbye, fundamental theorem! We could write $6 = 2\cdot 3$, or $6 = 1\cdot 1\cdot 1 \cdots 1 \cdot 2 \cdot 3$, or $6 = (-1)\cdot (-2) \cdot 3$. We have cursed ourselves with the bounty of infinitely many distinct possible factorizations of $6$ into a product of primes (even accounting for the order of the factors or associates), and we can’t even agree on the length of $6$. Or $2$. Or $1$.

The skeptical, critical-thinking reader has already been working on workarounds. Take the minimum number of factors as the length. Write down the list of prime factors without their powers. Keep the associates in the list (or throw them out, but at that point, just agree that $1$ and $-1$ shouldn’t be prime!). But in the polynomial ring $F[x]$, dear reader, **every** nonzero constant polynomial is a unit: given $p(x) = a$ for some nonzero $a \in F$, the polynomial $d(x) = a^{-1}$ is also in $F[x]$ since $a^{-1}$ is in the field $F$, and $p(x)d(x) = 1$, the multiplicative identity in $F[x]$. So, if you allow units to be irreducible in $F[x]$, now even an innocent (and formerly irreducible) polynomial like $x$ has infinitely many factorizations into things like $(a)(1/a)(b)(1/b)\cdots x$. So much for those workarounds!
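Here is a tiny sketch of that phenomenon in $\mathbb{Q}[x]$, representing polynomials as coefficient lists over exact rationals (the representation is my choice, picked for brevity):

```python
from fractions import Fraction

def polymul(p, q):
    """Multiply polynomials given as coefficient lists over Q,
    lowest-degree coefficient first."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

x = [Fraction(0), Fraction(1)]   # the polynomial x in Q[x]
for a in (2, 3, 7):              # any nonzero constant is a unit
    unit = [Fraction(a)]
    inverse = [Fraction(1, a)]
    # x = (a) * (1/a) * x: one new "factorization" per unit
    assert polymul(unit, polymul(inverse, x)) == x
```

Every nonzero $a \in \mathbb{Q}$ supplies another factorization of $x$, which is exactly why the workarounds collapse.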

So, since we like our Fundamental Theorems to be neat, tidy, and useful, we agree to exclude units from our definitions of prime and composite (or irreducible and reducible, or indecomposable and decomposable, or…).

Lately I’ve been working on problems related to semigroups, by which I mean nonempty sets equipped with an associative binary operation — and I also insist that my semigroups be commutative and have a unit element. In the study of factorization in semigroups, the Fundamental Theorem of Arithmetic leads to the idea of counting the distinct factors an element can have in any factorization into *atoms* (the semigroup equivalent of irreducible/prime elements; these are elements $p$ that factor only into products involving units and associates of $p$).

One of my favorite (multiplicative) semigroups is $\mathbb{Z}[\sqrt{-5}] = \{a + b \sqrt{-5} \, : \, a,b \in \mathbb{Z}\}$, favored because the element $6$ factors distinctly into two different products of irreducibles! In this semigroup, $6 = 2\cdot 3$ and $6 = (1+\sqrt{-5})(1-\sqrt{-5})$. It’s a nice exercise to show that $1\pm \sqrt{-5}$ are not associates of $2$ or $3$, yielding two distinct factorizations into atoms!

While we aren’t lucky enough to have unique factorization, at least we have that the number of irreducible factors in any factorization of $6$ is always two. That is, excluding units from our list of atoms leads to an invariant of $6$ in the semigroup $\mathbb{Z}[\sqrt{-5}]$.
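Both factorizations, and the norm argument behind the exercise, are easy to check by machine. Representing $a + b\sqrt{-5}$ as the pair $(a, b)$ (my choice of encoding):

```python
def mul(u, v):
    """Multiply u = (a, b) and v = (c, d), representing
    a + b*sqrt(-5) and c + d*sqrt(-5) in Z[sqrt(-5)]."""
    (a, b), (c, d) = u, v
    return (a * c - 5 * b * d, a * d + b * c)

def norm(u):
    """N(a + b*sqrt(-5)) = a^2 + 5b^2; the norm is multiplicative."""
    a, b = u
    return a * a + 5 * b * b

six = (6, 0)
assert mul((2, 0), (3, 0)) == six
assert mul((1, 1), (1, -1)) == six    # (1 + sqrt(-5))(1 - sqrt(-5))

# No element has norm 2 or 3, so 2, 3, and 1 +/- sqrt(-5)
# (norms 4, 9, 6, 6) cannot factor nontrivially: all four are atoms.
assert not any(norm((a, b)) in (2, 3)
               for a in range(-3, 4) for b in range(-3, 4))
```

Since the norm is multiplicative and only units have norm 1, a factorization of an element of norm 4, 6, or 9 would force a factor of norm 2 or 3, and none exists.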

Anyway, without the context of more general situations like this semigroup (and I don’t know, is $\mathbb{Z}[\sqrt{-5}]$ one of those platonically true things, or were Gauss et al. just really imaginative weirdos?), would we feel so strongly that $1$ is not a prime integer?

Reminding ourselves yet again that the integers form a ring under addition and multiplication, we might be interested in the ideals generated by prime numbers. (What’s an ideal? It’s a nonempty subset of the ring closed under addition, additive inverses, and scalar multiplication from the ring.) We might even call those ideals prime ideals, and then generalize to other rings! The thing is, if we do that, we end up with this definition:

(Book A and B agree here:) An ideal $P$ is prime provided $xy \in P$ implies $x$ or $y$ belongs to $P$.

But in the case of the integers — a principal ideal domain! — that means that a product $ab$ belongs to the principal ideal generated by the prime $p$ precisely when $p$ divides one of the factors.

From the perspective of rings, every (nonzero) ring has two trivial ideals: the ring $R$ itself (and if $R$ has unity, then that ideal is generated by $1$, or any other unit in $R$) and the zero ideal (generated by $0$). If we want the study of prime ideals to be the study of interesting ideals, then we want to exclude units from our list of potential primes. And once we do, we recover nice results like: an ideal $P$ is prime in a commutative ring $R$ with unity if and only if $R/P$ is an integral domain.
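That last equivalence is fun to poke at computationally: $\mathbb{Z}/n\mathbb{Z}$ has zero divisors exactly when the ideal $(n)$ fails to be prime. A quick brute-force check (function name mine):

```python
def zero_divisors_mod(n):
    """Pairs of nonzero x, y in Z/nZ with x*y = 0. The list is
    empty iff Z/nZ is an integral domain, i.e. iff the ideal (n)
    is prime in Z."""
    return [(x, y) for x in range(1, n) for y in range(1, n)
            if (x * y) % n == 0]

assert zero_divisors_mod(7) == []        # (7) prime: Z/7 is a domain
assert (2, 3) in zero_divisors_mod(6)    # (6) not prime: 2*3 = 0 mod 6
```

And modding out by the unit ideal $(1)$ gives the zero ring, which is precisely why units are excluded from the conversation.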

I still have two books propped open on my desk, and after thinking about semigroups and ideals, I’m no closer to answering the question “But what is a prime, really?” than I was at the start of this column! All I have is some pretty good evidence that we, as mathematicians, might find it useful to exclude units from the prime-or-composite dichotomy (I haven’t consulted with the mathematicians on other planets, though). To me, that evidence is a reminder that we are constantly updating our mathematics framework in reference to what we learn as we do more math. We look back at these ideas that seemed so solid when we started — something fundamentally indivisible in some way — and realize that we’re making it up as we go along. (And ignoring a lot of what other humans consider math, too, as we insist on our axioms and law of the excluded middle and the rest of the apparatus of “modern mathematics” while we’re making it up…)

And the math that gets done, the math that allows us to update our framework… Well, that depends on *what* is trendy/fundable/publishable, *who* is trendy/fundable/publishable, and *who* is making all of those decisions. Perhaps, on planet Blarglesnort, math looks very different.

Anderson, Marlow; Feil, Todd. *A first course in abstract algebra. Rings, groups, and fields.* Third edition. ISBN: 9781482245523.

Dummit, David S.; Foote, Richard M. *Abstract algebra.* Third edition. ISBN: 0471433349.

Redfield, Robert. *Abstract algebra. A concrete introduction.* First edition. ISBN: 9780201437218.

Geroldinger, Alfred; Halter-Koch, Franz. *Non-unique factorizations. Algebraic, combinatorial and analytic theory.* ISBN: 9781584885764.