Tony’s Take, August 2025

This month’s topics:

The mathematics of gerrymandering.

Gerrymandering, the practice of drawing legislative maps that disproportionately benefit one political party, is in the news these days as Texas’s new Republican-friendly legislative map generates lawsuits and threats from Democratic states to respond in kind. In a guest essay in the August 12, 2025 New York Times, Harvard economist Roland Fryer shares a metric that he thinks can help identify gerrymandered maps.

This metric, called the “Relative Proximity Index” (RPI), was developed by Fryer and Richard Holden back in 2007. It gives a precise mathematical interpretation of compactness, which is one of the National Council of State Legislatures’ two “traditional criteria” for fair legislative maps. (Although gerrymandering is not considered good practice, it is not expressly forbidden. States are only required to draw districts that are as equal in population as possible.) Different states measure compactness differently, as evidenced by these examples from the League of Women Voters.

Current measures of compactness are based on the geometry of the district: Elongated districts or districts with long perimeters are penalized. Fryer and Holden’s index, however, measures compactness based on the locations of voters—specifically, the average physical distance between voters within a district. It then compares this average physical distance to the minimal such distance achievable by any possible districting plan. The RPI is actually negatively correlated with current measures, and so would likely recommend maps that are quite different from those in use right now.

How is this average distance computed? We start by thinking of a state $S$ as a 2-dimensional surface inhabited by individuals numbered $1, 2, \dots, N$. If the state gets to elect $M$ legislators, the equal-population requirement specifies that $S$ should be partitioned into $M$ electoral districts $D_1, \dots, D_M$, each with around $N/M$ inhabitants. To calculate Fryer and Holden’s average-distance measure, calculate the physical distance $d_{ij}$ between any pair $i$ and $j$ that inhabit the same district. For each individual district $D_k$, sum the squares of all these distances:$$ \sum_{i,j \in D_k} d_{ij}^2.$$Then, sum up the results for all $M$ districts, to get$$\pi(D_1, \dots, D_M) = \sum_{k=1}^{M} \sum_{i,j \in D_k}(d_{ij})^2.$$

Fryer and Holden give an example to make this clear. The image below, adapted from their article, shows a hypothetical state with two possible legislative maps, one blue ($\mathbf{B}$) and the other orange ($\mathbf{O}$). The average distance between same-district constituents in the blue map comes out to $\pi(\mathbf{B}) = 24$; for the orange map, we get $\pi(\mathbf{O}) = 16$. None of the other partitions has a $\pi$-score less than $16$.

Six vertices arranged in two rows. The first row has vertices 1, 2, 3; the second has 4, 5, 6.
In this simple example, a state has six inhabitants, located at the vertices of a 1-km grid. They are to be allocated to two electoral districts. The image shows two partition schemes. In the blue scheme, the districts have populations $\{1, 2, 3\}$ and $\{4, 5, 6\}$. In the orange scheme, the populations are $\{1, 4, 5\}$ and $\{2, 3, 6\}$. Image credit: Tony Phillips.

Since the smallest average distance is the one achieved by the orange map, to compute the RPI of a specific map we take the ratio of its average distance with $\pi(\mathbf{O}) = 16$. So the blue partition has RPI equal to the ratio between $\pi(\mathbf{B})$ and $\pi(\mathbf{O})$, i.e. $24/16 = 1.5$. The RPI of the orange partition is, of course, $\pi(\mathbf{O})/\pi(\mathbf{O}) = 1$.

The drawback with this natural measure of compactness is that the only way known at present to find the minimum average distance involves testing every element in the set of all possible partitions, and the size of that set grows exponentially with the population of the state. The authors give that size as $78.4\times 10^{59,351}$ for a simplified model of California. In fact, as they remark, the most-compact-partition problem is one of a family of counting problems assigned by computational complexity theory to the category “NP-hard.” Whether or not there is a general way to bring these problems under control is an open question in mathematics considered important enough to be one of the million-dollar Millennium Prize Problems. So for practical purposes, as matters stand, an exact solution is unattainable. But the authors found a workaround: Using extra information contained in the Census—the center of population mass in each legislative district in the state—they can cook up a good approximation to a most-compact partition.

To get an idea of the practical consequences of a switch to maximally compact districts, the authors went back to data from the 2000 election in California, New York, Pennsylvania, and Texas, and examined how a change from the current system to an optimal one would affect election results. For both cases, analyzing hundreds of simulations of the election, they recorded how the proportion of seats won by one party varied with their proportion of the vote. They conclude that maximally compact districts would make the results of an election more responsive, in a statistically significant way, to how the actual votes were cast.

“The topological properties of the protein universe”

is the title of an article published in Nature Communications, August 13, 2025. As the title suggests, the work has a very large scope: the 214 million unique proteins modeled in the AlphaFold2 database. According to the authors, a main principle of protein science is that “the shape of the protein determines its function.” This motivates them to systematically inventory the shapes within the database. To do so, they use a 21st-century innovation in topology, the mathematical study of shape, called persistent homology. (See also this survey.)

Persistent homology is a way of assigning meaningful shape to a cloud of points $X$. Working with a positive real parameter $\epsilon$, we put an edge between any two points that are less than $\epsilon$ apart; we fill in a triangle when three points fit in a ball of diameter $\epsilon$, and a tetrahedron when this happens for four points. This gives us what is called a simplicial complex, which we’ll call $X^{\epsilon}.$ Simplicial complex means, in particular, that whenever a tetrahedron belongs to $X^{\epsilon},$ its triangular faces must also belong to $X^{\epsilon},$ etc. (This is automatic from the way we’ve defined $X^{\epsilon}.$)

It is customary to call points, edges, triangles and tetrahedra simplices, and to label them as follows. A point $p$ in $X$ is the $0$-simplex $\langle p\rangle$, the edge between $p$ and $q$ is the $1$-simplex $\langle p q\rangle$, etc. A linear combination of $k$-dimensional simplices (we’ll use $\mathbf{Z}_2$ coefficients, i.e. $0$ and $1$, with $1+1=0$) is called a $k$-chain. The $k$-chains of $X^{\epsilon}$ form a $\mathbf{Z}_2$-vector space. We’ll label this vector space $C_k^{\epsilon}(X)$. Its basis is the set of $k$-simplices.

Homology is built around the operation boundary. The boundary of the tetrahedral $3$-simplex $\langle p q r s\rangle$ is the sum of its four faces, $\langle p q r\rangle+\langle p q s\rangle+\langle p r s\rangle+\langle q r s\rangle$, and so forth: the boundary of a triangle is the sum of its three edges, the boundary of an edge is the sum of its two endpoints, the boundary of a point is $0$. This defines boundary on the basis elements, and the definition extends to a linear transformation $\partial_k:C_k^{\epsilon}(X) \rightarrow C_{k-1}^{\epsilon}(X)$, for $k=1,2,3$ in our example.

The fundamental fact that underlies homology is that the boundary of a boundary is zero. This can be easily checked on simplices, so it must hold for any chain. For example, each edge in the boundary of the boundary of a tetrahedron occurs exactly twice, on two adjacent triangular faces, giving coefficient 0 mod 2.

This means in particular that the space consisting of all the boundaries of some $(k+1)$-chain is a subspace of all the $k$-chains with boundary zero. This lets us define the $k$-th homology vector space $H_k^{\epsilon}(X)$ as a quotient space: in $C_k^{\epsilon}(X)$ take the vector subspace consisting of all $k$-chains with boundary 0, and divide by the subspace made up of all the boundaries of some $(k+1)$-chain. So in the quotient we identify two $k$-cycles if their sum is the boundary of a $(k+1)$-chain; in particular, if a cycle itself is a boundary, it gets identified to 0. Roughly speaking, $H_k^{\epsilon}(X)$ keeps track of the $k$-cycles in $X^{\epsilon}$ which are not boundaries of anything.

In this rough sense, $H_1^{\epsilon}(X)$ records polygons with no interior—”holes,” as the authors call them. $H_2^{\epsilon}(X)$ records empty polyhedral surfaces, or “voids”. Meanwhile, $H_0^{\epsilon}(X)$ records pairs of points that lie in different connected components of $X^{\epsilon}$.

Persistence appears as a phenomenon when we vary $\epsilon$. Suppose two points $1$ and $2$ are at distance $\delta$ which is greater than our chosen parameter $\epsilon$. Then the chain $\langle 1\rangle$ + $\langle 2\rangle$ in $C_0^{\epsilon}(X)$ is a 0-cycle which is not a boundary, and so represents a 0-dimensional homology class in $H_0^{\epsilon}(X)$. But if we increase $\epsilon$ until it is larger than the distance $\delta$, the chain $\langle 1\rangle$ + $\langle 2\rangle$ becomes the boundary of the 1-simplex $\langle 12\rangle$. That is, its homology class is now zero. The original class has not persisted. Here is a more elaborate example.

Left: Four points labeled 1, 2, 3, 4, with parameter epsilon. There are line segments from 2 to 3, 3 to 4, and 4 to 1. There is no line segment between 1 and 2. Center: Parameter has increased to delta, and there is now a line segment between 1 and 2. Right: Parameter is increased to delta', and the quadrilateral formed by the points is now filled in.
Birth and death of a 1-dimensional homology class. a. In $X^{\epsilon}$, the edges $\langle 14\rangle$, $\langle 34\rangle$ and $\langle 23\rangle$ are not part of a cycle, since the edge $\langle 12 \rangle$ is not in $X^{\epsilon}$. b. When $\epsilon$ increases to $\delta$, the edge $\langle 12\rangle$ joins the simplicial complex. The four edges now constitutes a non-bounding cycle and thus a non-zero class in $H_1^{\delta}(X)$. c. This class does not persist: when $\delta$ increases to $\delta’$, the 2-simplexes $\langle 123\rangle$ and $\langle 134\rangle$ join the complex $X^{\delta’}$. The chain $\langle 12\rangle+\langle 23\rangle+\langle 34\rangle
+\langle 14\rangle$ is a boundary of the sum $\langle 123\rangle + \langle 134\rangle$, so this cycle is now zero in $H_1^{\delta’}(X)$. Image credit: Tony Phillips.

Madsen and collaborators approximated each of the 214 million AlphaFold2 protein structures by a point cloud, where each point gives the location in 3-dimensional space of one of the backbone atoms of the protein. For human hemoglobin, an important protein, this cloud has 574 points; for some proteins, the number is several thousand. For each of these clouds, they recorded the persistent homology.

Left: Model of a protein, with a loop highlighted in purple. Right: The same protein, with a set of strands that form the edges of a polyhedron highlighted in purple.
A “loop” (a cycle representing a non-zero 1-dimensional homology class) and a “void” (a cycle representing a non-zero 2-dimensional class) in a stage of the persistent homology analysis of one of the subunits of hemoglobin. Image 1H from Nat Commun 16, 7503, used under a CC by-NC-ND 4.0 License.

The authors elaborate the concept of topological richness, which they define as “the measure of how many unique, persistent topological features each protein has, …, normalised by number of residues [the number of points in that protein’s point cloud, as described above].” It’s the count of how many holes and voids appear in the persistent homology analysis of that point cloud, as the parameter $\epsilon$ varies from $0$ to the size of the cloud. In their analysis, they discover something quite striking: The topological richness of proteins varies greatly among the three fundamental domains in biology—Eukaryota (organisms whose cells have nuclei; includes all animals, plants and fungi), Bacteria and Archaea. For Eukaryota, 32% of proteins showed topological richness, vs. only 10% for Bacteria and 8% for Archaea.

—Tony Phillips, Stony Brook University