Decoding, Gerrymanders, and Markov Chains

David Austin
Grand Valley State University

In 2009, crypto miner James Howells of southern Wales mistakenly threw away a hard drive containing 8,000 bitcoins. That’s over $100 million even in today’s sinking crypto landscape. Thirteen years later, Howells has a plan, backed by venture capital, to search the 110,000 tons of garbage in the local landfill using robotic dogs and artificial intelligence. This is a fairly common problem. Not searching landfills for cryptocurrency, of course, but searching for something in a vast sea of possibilities. That’s what we’ll be doing in this month’s column. Code breaking Let’s begin by breaking a code. More specifically, suppose we have a message encoded with a substitution cipher where each character in the message is substituted with another. To keep things simple, we’ll just imagine that the letters of the alphabet are permuted so that whenever we see an “a”, say, we write “m” instead. Here’s our message:  ziatwxafbssuladwafb uanubeablwdfunaywwmabywxdakfnzgdwsfunan wyzlatwxarbtanururyunadfbdafuawlkuafbeabagoblawnadfuagoblaf beakfnzgdwsfunanwyzlazaewldamlwoaofzkfableadfbdafuaxgueadwa kbjjadfzgagoblaswwfaadfbdaobgabajwlhadzruabhwableaofulaouag bzeahwweytuaouadwwmadfualbruaozdfaxgabgaouaezeldadfzlmadfua goblaowxjeaobldazdabltarwnuaoujjaofulaueobneayubnagbzeadfbd afuaowxjeajzmuablauqkzdzlhalbruabjjadwafzrgujiakfnzgdwsfuna nwyzlagbzeabdawlkuaozdfwxdagdwsszlhadwadfzlmadfbdafuaobgaoz llzudfuswwfableafuaobgagwabgazafb uauqsjbzlueadfuaswwfasbnd azaozjjalwoauqsjbzladfuanugdawiazd  Our task is to find the permutation that decodes the message. Of course, we could generate every permutation and look for one that generates something readable just as Mr. Howells could turn over every piece of rubbish in the landfill. However, since there are 27 characters, the letters in the alphabet along with a space, there are $$27! = 10,888,869,450,418,352,160,768,000,000 \approx 10^{19}$$ permutations. That’s more than the number of seconds in the lifetime of the universe. We clearly need a better strategy to search through the permutations. We know that letters in an English text occur at different frequencies. For instance, in Project Gutenberg’s translation of War and Peace, about 18% of the very large number of characters are spaces, 10% are “e”, 7% are “t” and so forth. In our encoded message, about 20% of the characters are “a” so it seems reasonable to suppose that “a” represents a space, about 8% are “u”, which probably represents an “e”, and so forth. This generates a permutation, which when applied to our message gives:  hk pow ntccei ao ntxe detl tioaned mooy tmowa gndhraocned d omhi pow ftp defefmed anta ne oige ntl t rsti od ane rsti n tl gndhraocned domhi h loia yios snhgn til anta ne wrel ao gtuu anhr rsti coon anta str t uoib ahfe tbo til snei se r thl boolmpe se aooy ane itfe shan wr tr se lhlia anhiy ane rsti sowul stia ha tip fode seuu snei elstdl metd rthl anta ne sowul uhye ti evghahib itfe tuu ao nhfreuk gndhraocned d omhi rthl ta oige shanowa raocchib ao anhiy anta ne str shi iheanecoon til ne str ro tr h ntxe evcuthiel ane coon ctda h shuu ios evcuthi ane dera ok ha  Hmmm. Rather than using just the frequency of single characters, perhaps we could look at the frequency of bigrams, pairs of consecutive characters. To that end, let$T(x,y)$be the probability that$x$is followed by$y$in English text. For instance, we expect$T(\text{q}, \text{u})$to be relatively large while$T(\text{q}, \text{b})$should be relatively small, if not zero. Suppose that our encoded message consists of the sequence of characters$c_i$and that$\pi$is the permutation of the symbols that decodes the message. If$c_ic_{i+1}$is a bigram in the message, then$\pi(c_i)\pi(c_{i+1})$is a bigram in a piece of English text, which means that$T(\pi(c_i), \pi(c_{i+1}))$should be relatively high. This allows us to measure how likely a permutation is as a decoding key by defining the plausibility of a permutation to be $$P(\pi) = \prod_{i} T(\pi(c_i),\pi(c_{i+1})).$$ Permutations with higher plausibilities are more likely to decode the message. Now our task is to search through the$27!$permutations looking for ones that are highly plausible. The following algorithm will generate a sequence of highly plausible permutations$\pi_n$. 1. Begin with a random permutation$\pi_0$. 2. Given the permutation$\pi_n$, form a new permutation$\pi’ = \sigma \pi_n$where$\sigma$is a random transposition that interchanges two symbols. • If$P(\pi’)\geq P(\pi_n)$, then$\pi’$is more plausible than$\pi_n$, and we accept$\pi_{n+1} = \pi’.$• Otherwise,$P(\pi’)/P(\pi_n) \lt 1$. Choose a number$\alpha$in the interval$[0,1]$uniformly at random. If$\alpha \lt P(\pi’)/P(\pi_n)$, then$\pi_{n+1} = \pi’$. If not, we stay where we are:$\pi_{n+1} = \pi_n$. • Continually repeat this step. When I ran this algorithm, the permutation$\pi_{1000}$applied to the encoded message produced:  ik mof tappen so tage wead anostew voou avofs ctwirsoptew w ovin mof bam webebvew stas te once tad a rhan ow ste rhan t ad ctwirsoptew wovin i dons unoh htict and stas te fred so call stir rhan poot stas har a lony sibe ayo and hten he r aid yoodvme he soou ste nabe hist fr ar he didns stinu ste rhan hofld hans is anm bowe hell hten edhawd veaw raid stas te hofld liue an excisiny nabe all so tibrelk ctwirsoptew w ovin raid as once histofs rsoppiny so stinu stas te har hin niestepoot and te har ro ar i tage explained ste poot paws i hill noh explain ste wers ok is  and$\pi_{2500}$gave:  if you happen to have read another book about christopher r obin you may remember that he once had a swan or the swan h ad christopher robin i dont know which and that he used to call this swan pooh that was a long time ago and when we s aid goodbye we took the name with us as we didnt think the swan would want it any more well when edward bear said that he would like an exciting name all to himself christopher r obin said at once without stopping to think that he was win niethepooh and he was so as i have explained the pooh part i will now explain the rest of it  These results look pretty reasonable, and the whole thing, including scanning War and Peace, took about five seconds on my laptop. It’s remarkable that we can find the decoding permutation, out of the set of$10^{19}$permutations, with so little effort. This is an example of what Persi Diaconis has called the Markov Chain Monte Carlo revolution. In his 2007 survey article, Diaconis describes how a psychologist came to the Stanford statistical consulting service with a collection of coded messages exchanged between incarcerated people in the California prison system. Guessing the message was encoded using a substitution cipher, Stanford student Marc Coram applied this algorithm to decode it. There are a few things to be said before we investigate more deeply. First, the algorithm only worked about half of the time; I’ve just chosen to present here one of the times that it did work. Other times, it seemed to get stuck with a permutation that produced gibberish. The odds of success went up when the initial permutation was chosen to be the one that matched symbols based on the frequency with which they appeared in the message rather than choosing a permutation at random. Even when the message was successfully decoded, continuing to run the algorithm for more iterations frequently gave something like:  if you happen to have read another mook amout christopher romin  Let’s look more carefully at the algorithm to see what’s going on. At each step, we generate a new permutation. If that permutation is more plausible, we accept it. However, there is a still a chance that a less plausible permutation will be accepted. So while the algorithm generally tends to increase the plausibility of the permutations it produces, there’s a way for it to escape a local maximum. There is an alternative point of view that’s helpful and that we’ll develop more fully. Rather than searching for the most plausible permutation, the algorithm is actually sampling from the permutations in such a way that more plausible permutations are more likely to appear. More specifically, the probability of obtaining a permutation$\pi$is proportional to its plausibility so that the sampling distribution is $$s(\pi) = \frac{P(\pi)}{Z}$$ where$Z$is the normalizing constant $$Z = \sum_\pi P(\pi).$$ This alternative point of view will become more clear after we make a digression through the world of Markov chains. Markov chains Let’s begin with a network consisting of nodes joined by edges and suppose we traverse the network at random by moving along its edges. In particular, if we are at a node labeled$x$, the probability that we move along an edge to node$y$will be denoted$K(x,y)$. If there are$m$nodes in our network,$K$is an$m\times m$matrix we refer to as the transition matrix. For example, the nodes in the network could be the 27 characters in a piece of English text with$K(x,y)$being the probability that character$x$is followed by character$y$. Since we’ll move from$x$to some other node, we have$\sum_y K(x,y) = 1$so that the sum across any row of$K$is 1. We say that$K$is row stochastic and note that a row$K(x,-)$represents the distribution describing our location one step after beginning at$x$. Notice that$K(x,y)K(y,z)$is the probability that we begin at$x$, pass through$y$, and then move to$z$. Therefore, $$K^2(x,z) = \sum_y K(x,y)K(y,z)$$ represents the probability that we begin at$x$and end up at$z$after two steps. Likewise,$K^n(x,z)$represents the probability that we begin at$x$and end up at$z$after$n$steps. As we randomly walk around the network, suppose that the distribution of locations at step$n$is given by the row vector$s_n$; that is,$s_n(x)$gives the probability we are at node$x$at that step . The product$\sum_x s_n(x)K(x,y)$describes the probability that we are node$y$after the next step. In other words, the product$s_nK$provides the new distribution,$s_{n+1} = s_nK$. As a simple example, consider the network with transition matrix $K=\begin{bmatrix} 5/6 & 1/6 \\ 1/2 & 1/2 \\ \end{bmatrix}.$This means that if we’re at node$x$, there is a$5/6$chance we stay at$x$and a$1/6$chance we move to$y$. On the other hand, if we are at$y$, we move to$x$or stay at$y$with equal probability. If we begin at$x$, the initial distribution is described by the row vector$s_0 = \begin{bmatrix}1 & 0\end{bmatrix}$. After one step, the distribution is$s_1=s_0K = \begin{bmatrix}5/6 & 1/6 \end{bmatrix}$. Here’s how the first few steps proceed: $$\begin{array}{c|c} {\bf n} & {\bf s_n} \\ \hline 0 & \begin{bmatrix}1.000 & 0.000 \end{bmatrix} \\ 1 & \begin{bmatrix}0.833 & 0.167 \end{bmatrix} \\ 2 & \begin{bmatrix}0.778 & 0.222 \end{bmatrix} \\ 3 & \begin{bmatrix}0.759 & 0.241 \end{bmatrix} \\ 4 & \begin{bmatrix}0.753 & 0.247 \end{bmatrix} \\ \end{array}$$ If we continue, something remarkable happens. The distributions$s_n$converge to$s_n\to\overline{s} = \begin{bmatrix} 0.75 & 0.25 \end{bmatrix}$. This means that after a long time, we spend three-quarters of our time at node$x$and one-quarter of our time at node$y$. Since$s_{n+1} = s_nK$, this says that $$\overline{s} = \overline{s}K$$ and we call$\overline{s}$a stationary distribution. In fact, the powers of$K$converge $$K^n \to \begin{bmatrix} 0.75 & 0.25 \\ 0.75 & 0.25 \\ \end{bmatrix}$$ to a rank one matrix so that no matter the initial state$s_0$, the distributions$s_n = K^ns_0$converge to the same stationary distribution$\overline{s}=\begin{bmatrix} 0.75 & 0.25 \end{bmatrix}$. This is an example of the beautiful Perron-Frobenius theorem: under a mild assumption on the row stochastic matrix$K$, any initial distriubtion$s_0$will converge to a unique stationary distribution$\overline{s}$. In fact, this theorem is the basis for Google’s PageRank algorithm, as explained in an earlier Feature Column. The sequence of distributions$s_n$is called a Markov chain so we say that any Markov chain converges to the unique stationary distribution. But what does this have to do with the decoding problem we began with? Let’s recast the sequence of permutations$\pi_n$that we created in the language of Markov chains. The Metropolis-Hastings algorithm Let’s view the set of permutations$S_n$as a network where each permutation represents a node and an edge connects two nodes when the permutations differ by a transposition. We want to sample permutations from$S_n$in such a way that we are more likely to sample highly plausible permutations. With this in mind, we view the plausibility as defining a distribution on$S_n$. That is, the probability of choosing a given permutation is$s(\pi) = P(\pi)/Z$where$Z$is a normalizing constant $$Z = \sum_{\pi} P(\pi).$$ Of course,$S_n$is so large that evaluating$Z$is not feasible, but we’ll see that this presents no problem. If we can find a transition matrix$K$whose unique stationary distribution is$\overline{s} = s$, then a Markov chain will produce a sequence of permutations that is a sample from the distribution$s(\pi)$. In the discussion of the Perron-Frobenius theorem, we began with the transition matrix$K$and found the resulting stationary vector. Now we’d like to invert this process: if we are given a distribution$s$, can we find a transition matrix$K$whose stationary distribution$\overline{s} = s$? The Metropolis-Hastings algorithm tells us how to create the transition matrix$K$. We begin with any row stochastic transition matrix$J(x,y)$. Given$x$and$y$, we define the acceptance ratio $$A(x,y) = \frac{s(y)J(y,x)}{s(x)J(x,y)}.$$ This enables us to define $$K(x,y) = \begin{cases} J(x,y) & \text{ if } x\neq y, A(x,y) \geq 1 \\ J(x,y)A(x,y) & \text{ if } x\neq y, A(x,y) \lt 1 \\ \end{cases}.$$ The diagonal term$K(x,x)$is chosen to enforce the condition that$\sum_y K(x,y) = 1$so that$K$is row stochastic. What is the stationary distribution associated to$K$? Well, notice that$A(x,y) = 1/A(x,y)$. If the acceptance ratio$A(x,y) \lt 1$, then$A(y,x) \gt 1$so that $$s(x)K(x,y) = s(x)J(x,y)A(x,y) = s(y)J(y,x) = s(y)K(y,x).$$ Therefore $$\sum_x s(x)K(x,y) = \sum_x s(y)K(y,x) = s(y)\sum_xK(y,x) = s(y).$$ In other words,$sK = s$so$s$is the unique stationary distribution for$K$:$\overline{s} = s$. When we apply this to our decoding problem, we choose the initial transition matrix$J(x,y)$so that from the permutation$x$, we uniformly choose any permutation$y$with$y=\sigma x$for a transposition$\sigma$. In this case,$J$is symmetric so that$J(x,y) = J(y,x)$and the acceptance ratio is $$A(x,y) = \frac{s(y)J(y,x)}{s(x)J(x,y)} = \frac{s(y)}{s(x)} = \frac{P(y)/Z}{P(x)/Z} = \frac{P(y)}{P(x)}.$$ Notice that we can find the ratio$s(y)/s(x)$without knowing the normalizing constant$Z$. Now we see that the sequence of permutations$\pi_n$that we created to decode the message is actually a Markov chain given by the transition matrix$K$whose stationary distribution is proportional to the plausibility. Since the Markov chain samples from this distribution, we find that the sequence of permutations favors permutations with a high plausibility, which makes it likely that we encounter the right permutation for decoding the message. Detecting Gerrymanders This method is known as Markov Chain Monte Carlo sampling, and it plays an important role in the mathematical analysis of political redistricting plans. As is well known, American state legislatures redraw maps of districts for political representation every 10 years using data from the latest Census. A typical state is geographically divided into small building blocks, such as Census blocks or tracts, and each building block is assigned into a Congressional district. For instance, Michigan has 14 Congressional districts and 2813 Census tracts. A redistricting plan that assigns a Congressional district to each Census block would be a function,$R:{\cal C}\to\{1,2,3,\ldots,14\}$, where$\cal C$is the set of Census tracts. The number of such functions is$14^{2813}$, whose size we understate by calling it astronomically large. Most of these plans don’t satisfy legal requirements established by the state, but this just points to the fact that there are lots of possible redistricting plans. Here are the requirements that must be satisfied by a Michigan redistricting plan. The populations of the districts should be approximately equal, and each district should be contiguous, meaning that one could walk over land between any two points in a district. Districts are also required to be “compact,” meaning the area of each district is a significant fraction of the area of the smallest circle that encloses the district. When a particular party controls the state legislature, they are naturally motivated to draw districts that ensure their party will gain as many representatives as possible no matter the expressed preference of voters. For instance, they may put many voters from the opposing party into a small number of districts while spreading their own voters around just enough to ensure a majority in a large number of districts. This practice is called gerrymandering. For example, Republicans won 60 of the 99 seats in the Wisconsin State Assembly in the 2012 elections after the Republican-controlled legislature created a new redistricting plan in 2010. That is, Republicans won about 60% of the seats in spite of winning only 50.05% of the votes. How can we assess whether this is the result of a gerrymander or simply due to constraints on the redistricting plans? There are many mathematical approaches to this problem, but a promising recent approach is to generate a large ensemble of redistricting plans and, for each plan, determine how many seats each party would win had that plan been in place. Now the problem starts to sound familiar: we want to generate a representative sample from an extremely large set of possibilities. Markov Chain Monte Carlo sampling does just that! There’s a network that provides a useful mathematical model of a redistricting plan. Each geographic building block is a node, and two nodes are joined by an edge if the corresponding building blocks abut one another. Nodes with the same color in the figure below belong to the same district in a particular redistricting plan. We say that an edge joining two nodes in different districts is a conflicted edge. To facilitate our sampling strategy, we will now build a network of redistricting plans with each plan forming a node. An edge joins two districts if they are obtained by changing the district of one endpoint of a conflicted edge. For instance, here’s a new plan obtained by changing one endpoint of the conflicted edge from yellow to green. These two plans are connected by an edge in our network of redistricting plans. We would like to draw samples from the huge set of redistricting plans by completing a random walk on this network. If we simply follow an edge at random, we have the transition matrix$J(R, R’)$where$R$and$R’$are redistricting plans joined by an edge. If$c(R)$is the number of conflicted edges, we have $$J(R,R’) = \frac{1}{2c(R)}.$$ But now we would like to sample redistricting plans that satisfy the additional requirements of equal population, contiguity, and so forth. For each requirement, there is a measure of how well a redistricting plan satisfies that requirement. For instance,$M_{\text{pop}}(R)$will measure how well the equal population requirement is met. These measures are weighted and combined into a total scoring function $$M(R) = w_{\text{pop}}M_{\text{pop}}(R) + w_{\text{compact}}M_{\text{compact}}(R) + \ldots.$$ The better the redistricting plan$R$satisfies the requirements, the lower the scoring function$M(R)$. Finally, the function$e^{-M(R)}$defines a distribution $$s(R) = \frac{e^{-M(R)}}{Z}$$ where$Z = \sum_R e^{-M(R)}$is a normalizing constant that, as before, needn’t concern us. Sampling from this distribution means we are more likely to obtain redistricting plans that satisfy the requirements. We’re now in a position to apply the Metropolis-Hastings algorithm to obtain the transition matrix$K(R, R’)$whose unique stationary distribution is$s(R)$. Taking a random walk using this transition matrix produces a sample whose redistricting plans are likely to satisfy the legal requirements for a redistricting plan. To study the 2012 Wisconsin State Assembly election, Herschlag, Ravier, and Mattingly drew an ensemble of 19,184 redistricting plans in this way, and, for each plan in the ensemble, determined the number of seats that would have been won by the two parties had that plan been in place. The results are summarized in the following histogram, which shows the frequency of seats won by Republicans. The plan in place resulted in 60 Republican seats, which is shaded red in the histogram. This result appears to be an outlier, which is evidence that the redistricting plan drawn by Republicans in 2010 is the result of an egregious partisan gerrymander. Herschlag, Ravier, and Mattingly summarize this result, combined with others from their analysis, by writing: The Wisconsin redistricting seems to create a firewall which resists Republicans falling below 50 seats. The effect is striking around the mark of 60 seats where the number of Republican seats remains constant, despite the fraction of votes dropping from 51% to 48%. Summary In addition to the Wisconsin study described here, these techniques have been used to assess North Carolina’s Congressional redistricting plan, which saw Republicans capturing 10 of 13 seats in the 2016 election in spite of a 47-53 percent split in votes given to the two major parties. Fewer than one percent of the 24,000 maps generated gave Republicans ten seats. While mathematicians have created other metrics to detect gerrymanders, the approach using Markov Chain Monte Carlo sampling described here, however, offers several advantages. First, it can be adapted to the unique requirements of any state by modifying the scoring function$M(R)\$.

It’s also relatively easy to communicate the results to a non-technical audience, which is important for explaining them in court. Indeed, a group of mathematicians and legal experts filed an amicus brief with the Supreme Court that was cited in Justice Elena Kagan’s dissent in a recent gerrymandering case regarding North Carolina’s redistricting map.

This column has focused on two applications of Markov Chain Monte Carlo sampling. Diaconis’ survey article, cited in the references, describes many more uses in areas such as group theory, chemistry, and theoretical computer science.