January’s topics:
Linear Algebra and Netflix
Until 2007, Netflix relied on user ratings and linear algebra to select its movie recommendations. Krešo Josić (University of Houston) talks about the technique on the December 24 episode of NPR’s “Engines of our Ingenuity.” Though the full story is most likely considerably more complicated, here are some of the central ideas.
On any movie or TV show, Netflix lets you give a rating. Nowadays the rating is a thumbs down, a thumbs up, or a double thumbs up (signaling “Love this!”), but previously ratings ran from 1 to 5 stars. The recommendation system attempts to guess how you’ll rate a movie that you haven’t watched yet.
To do this, user ratings become entries in an enormous matrix ${\bf R}$. ${\bf R}$ has a column for every movie and a row for every subscriber. Most people rate fewer than a couple hundred films, while Netflix has tens of thousands of options, so most of the entries are empty. How can Netflix fill in the missing entries? In other words, if you rated Shrek 4 stars, will you like or dislike The Shining?
There are a couple of clues Netflix can lean on. First, even if all the movies you’ve rated are more like Shrek, some of them might have a nonzero horror element. Your ratings on those will signal something about your appetite for The Shining. Second, there may be other subscribers with taste similar to yours who have seen The Shining, or at least some other Shining-like movie.
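To picture the shape of ${\bf R}$, here is a minimal sketch (in Python, with made-up ratings) of a toy version of the matrix; the real one has millions of rows, tens of thousands of columns, and the same preponderance of empty entries.

```python
import numpy as np

# Toy ratings matrix R: rows = subscribers, columns = movies,
# entries = star ratings (1 to 5), NaN = "never rated". All numbers invented.
R = np.array([
    [4.0,    np.nan, 1.0,    np.nan],   # subscriber 0
    [np.nan, 5.0,    np.nan, 2.0   ],   # subscriber 1
    [3.0,    np.nan, np.nan, np.nan],   # subscriber 2
])

rated = ~np.isnan(R)
print(f"{rated.sum()} of {R.size} entries contain a rating")   # 5 of 12
```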
The key to making this work mathematically, says Josić, is that there are just “a few dozen stereotypical rating profiles.” We can think of each of these profiles as a basis vector in taste-space. The profiles are identified abstractly rather than by name, but imagine they represent “Western,” “historical,” “rom-com,” “sci-fi,” “with actress X,” “samurai,” “without actor Y,” “disaster,” etc.
Your unique taste will probably not correspond to one of these stereotypical profiles, but it is likely to be a mixture of those profiles, in the sense that your cloud of preferences can be approximated by a vector with components in each of those stereotypical directions. And likewise, for this analysis, each film’s appeal to viewers can be approximated by a vector of features, with components in each of the directions: “Western,” “historical,” etc.
If these vectors are known, then you can calculate how well the film’s profile matches your own taste. Mathematically, this is measured by the angle between the film’s feature vector and your taste vector. The smaller the angle, the closer the two vectors are. Linear algebra computes this angle from the dot product of the two vectors, together with their lengths. (If you are a subscriber with taste vector $(s_1,\dots,s_N)$ considering a movie with feature vector $(m_1,\dots,m_N)$, the dot product $s \cdot m$ is the sum $s_1m_1+ \cdots +s_Nm_N$.)
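As a rough sketch of that calculation (the three “directions” and the numbers are invented for illustration, not taken from Netflix), the angle can be computed like this:

```python
import numpy as np

# Hypothetical 3-component vectors; think of the components as weights on
# stereotypical directions such as "Western", "horror", "rom-com".
s = np.array([0.1, 0.9, 0.2])   # a subscriber's taste vector
m = np.array([0.0, 0.8, 0.1])   # a movie's feature vector

dot = np.dot(s, m)                                         # s1*m1 + ... + sN*mN
cos_angle = dot / (np.linalg.norm(s) * np.linalg.norm(m))  # cosine of the angle
angle_deg = np.degrees(np.arccos(cos_angle))

print(f"dot product = {dot:.2f}, angle = {angle_deg:.1f} degrees")
```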
How can we compute these vectors? All we know is that, for a movie you have seen, the taste-feature dot product should be close to the rating you gave the movie. This is the entry in ${\bf R}$ where your row meets the movie’s column.
What Netflix does is start by randomly assigning components to each of the subscribers’ taste vectors and to each of the movies’ features vectors. By an iterative process, those components get adjusted so that their dot products better and better approximate the actual ratings. This is done over and over, until the approximation is close enough to be useful.
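Here is a minimal sketch of that starting point, with invented toy sizes (the rank, the seed, and the matrix names S and M are my own choices, not Netflix’s). The iterative adjustment that turns these random guesses into useful predictions is described next.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the toy example is reproducible

n_subscribers, n_movies = 3, 4   # toy sizes; the real problem is vastly larger
n_profiles = 2                   # "a few dozen" in practice; 2 keeps things small

# One random taste vector per subscriber (rows of S) and one random feature
# vector per movie (rows of M). These initial guesses carry no information yet.
S = rng.normal(size=(n_subscribers, n_profiles))
M = rng.normal(size=(n_movies, n_profiles))

# The predicted rating for subscriber i and movie j is the dot product S[i] . M[j].
predicted = S @ M.T              # a full matrix of (currently random) predictions
```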

The standard method for improving guesses is gradient descent. In this case, the method tells us what adjustments to make in order to decrease an error function, which we compute as follows. For each non-empty entry $r_{ij}$ in the matrix ${\bf R}$ (that is, each rating actually given), calculate the taste-feature dot product $s_i\cdot m_j$ for subscriber $i$ and movie $j$. Now calculate the discrepancy between the taste-feature dot product (the predicted rating at this stage) and the actual rating given:
$$r_{ij} - s_i \cdot m_j.$$
To get the total error, we square the discrepancy for each $r_{ij}$, and sum:
$$\sum_{i,j} (r_{ij} - s_i \cdot m_j)^2.$$
This is our error function; let’s call it $E$. The squares are there, in part, so that discrepancies from different ratings don’t cancel out. With so many subscribers and movies, minimizing $E$ takes a gigantic number of simple computations: a perfect job for a computer.
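In code, $E$ could be computed with a sketch like the following, using the toy ${\bf R}$, S, and M from the earlier snippets (`total_error` is a made-up helper name):

```python
def total_error(R, S, M):
    """Sum of squared discrepancies (r_ij - s_i . m_j)^2 over the rated entries."""
    E = 0.0
    for i in range(R.shape[0]):           # loop over subscribers
        for j in range(R.shape[1]):       # loop over movies
            if not np.isnan(R[i, j]):     # skip entries with no rating
                E += (R[i, j] - np.dot(S[i], M[j])) ** 2
    return E

print(total_error(R, S, M))   # error of the current (still random) guesses
```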
To understand how gradient descent works, let’s consider an unrealistic numerical example with a single subscriber who has a single taste component $s$ and a single movie with feature $m$. Suppose our initial random taste assignment is $s_0=2$ for the taste component and $m_0=3$ for the feature, and the actual rating our subscriber gave was $r=2$.
Computed as above, the error from this choice is $E = (r-s_0m_0)^2 = 16$. We want to move in the direction that will cause the greatest decrease in the error; let’s go by steps of size 0.5.
The gradient $\nabla E$ gives the direction in which $E$ is increasing the fastest. The gradient descent method tells us to move in the direction opposite to the gradient. The components of $\nabla E$ are the partial derivatives of $E$ with respect to $s$ and $m$:
$$\nabla E(s,m)= \left(\frac{\partial E}{\partial s}, \frac{\partial E}{\partial m}\right) = (-2(r-sm)m,\; -2(r-sm)s),$$
so $\nabla E(2,3)=(24,16)$. The vector of length $0.5$ in the opposite direction is approximately $d_1=(-0.42, -0.28)$. Taking this step from $(2,3)$ lands us at $(s_1, m_1)=(1.58, 2.72)$. This is our first iteration. Computing the error shows that it has been reduced to $E=5.28$. For the next iteration we repeat the calculation at $(s_1, m_1)$. There $\nabla E(1.58, 2.72)=(12.51, 7.27)$ and $d_2 = (-0.43, -0.25)$. A second step leads us to $(s_2, m_2) = (1.15, 2.47)$. The error is now $E(1.15, 2.47) = 0.71$.
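For readers who want to check the arithmetic, here is a short script (not part of the original article) that reproduces the two iterations above, rounding to two decimals at each step as the hand computation does:

```python
import numpy as np

# Worked example: one subscriber, one movie, actual rating r = 2,
# initial guesses s = 2 and m = 3, steps of length 0.5 against the gradient.
r, s, m = 2.0, 2.0, 3.0
step = 0.5

for k in range(2):
    grad = np.array([-2 * (r - s * m) * m,      # dE/ds
                     -2 * (r - s * m) * s])     # dE/dm
    # Step of length 0.5 opposite to the gradient; round to two decimals,
    # matching the hand computation in the text.
    s, m = np.round(np.array([s, m]) - step * grad / np.linalg.norm(grad), 2)
    print(f"step {k + 1}: s = {s:.2f}, m = {m:.2f}, E = {(r - s * m) ** 2:.2f}")

# Prints: step 1: s = 1.58, m = 2.72, E = 5.28
#         step 2: s = 1.15, m = 2.47, E = 0.71
```

With a fixed step length the iterates eventually overshoot the minimum, so in practice the step size is shrunk as the error gets small.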

This application of mathematics to the recommendation problem was developed, according to Simon Funk, in response to Netflix’s October 2006 offer of \$1 million for an algorithm 10% better than the one they had. Funk (an alias) was one of the finalists. Dan Jackson has another, more recent, account of that competition. He also tells us that Netflix has moved on to using data directly available from their streaming operations; knowing what people are actually watching is more useful than reading their ratings. For a clear, slow-paced presentation of the algorithm, I recommend Luis Serrano’s half-hour YouTube presentation, “How does Netflix recommend movies? Matrix Factorization.”
Report from “the mathapalooza in Seattle”
Siobhan Roberts visited this year’s Joint Mathematics Meetings in Seattle and wrote up her observations for The New York Times.
The official theme of the meetings was “Mathematics in the Age of A.I.” Roberts interviewed Bryna Kra, the outgoing AMS President. “A.I. is something that is in our lives, and it’s time to start thinking about how it impacts your teaching, your students, your research. … What does it mean to have A.I. as a co-author? These are the kinds of questions that we have to grapple with,” Kra said. Roberts balances these concerns with a quote from a lecture by Yann LeCun, who runs A.I. at Meta: “The current stage of machine learning is that it sucks. … We can’t even reproduce what a cat can do.”
The Joint Meetings had sessions on mathematics and the arts, as well as an art exhibit. The sessions included Susan Goldstine (St. Mary’s College), who spoke about her “Poincaré Blues” project—a denim skirt made from old jeans, using as its pattern the tiling of the Poincaré model of the hyperbolic plane by $30^{\circ}-45^{\circ}-90^{\circ}$ triangles (note the angle sum, strictly less than $180^{\circ}$, characteristic of negative curvature); and Barry Cipra (University of Minnesota), who showed a hidden, coded magic square in Max Bill’s painting Gelbes Feld. The article featured pictures of two sculptures from the art exhibit; see them below.


A Comedy of Errors?
From New Scientist on December 26: “Mathematicians found – and fixed – an error in a 60-year-old proof.” The author, Alex Wilkins, is writing about the project to formally verify a proof of Fermat’s last theorem. To state Fermat’s last theorem, take an integer $n$ strictly larger than 2, and consider the equation
$$x^n + y^n = z^n$$
where $x$, $y$, and $z$ are all integers. The theorem, proved by Andrew Wiles in 1995, says this equation has no solutions in which $x$, $y$, and $z$ are all non-zero.
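Formalizing a theorem begins with writing its statement in Lean’s language. As a rough illustration (my own hypothetical phrasing, stated over the natural numbers, and not the formulation used in Buzzard’s project), the statement might look like this, with the proof left as a placeholder:

```lean
-- A hypothetical, self-contained Lean 4 statement of Fermat's last theorem,
-- written only to illustrate what a computer-checkable statement looks like.
theorem fermat_last_theorem :
    ∀ n : Nat, 2 < n →
      ∀ x y z : Nat, 0 < x → 0 < y → 0 < z → x ^ n + y ^ n ≠ z ^ n := by
  sorry  -- `sorry` marks an admitted gap: the hard part is filling this in
```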
Wiles’ proof has long been accepted by the mathematical community, but Kevin Buzzard (Imperial College London) is leading an effort to re-prove Fermat’s last theorem in Lean, a computer-checkable programming language. The formalization effort is housed by the Xena project and involves a method of proof different from the one Wiles used.
The problems arose over the summer, according to Buzzard. Lean “did that irritating thing which it sometimes does,” as he wrote in a blog post on December 11. That irritating thing was to refuse to swallow part of an argument.
Lean’s choking point was an argument by the late French mathematician Norbert Roby. Upon examination, that proof turned out to be wrong. In transcribing the statement $\Gamma_A(M)\otimes_A R = \Gamma_R(M\otimes_A R)$ from an earlier paper of Roby’s, “one of the tensor products [$\otimes$] accidentally falls off,” as Buzzard puts it, and the resulting incorrect identity was part of the proof.
Not to worry. Buzzard never thought that Roby’s lemma was wrong, just that the proof needed repair. Word soon got around to the experts. One of them, Brian Conrad (Stanford University), dug up another argument in the appendix to the 1978 Notes on Crystalline Cohomology. As Buzzard puts it, “The proof was back on!”
Update: Notes on Crystalline Cohomology was co-written by Arthur Ogus (University of California, Berkeley) and the late Pierre Berthelot. Buzzard visited Berkeley recently and met Ogus for lunch. He told Ogus how his appendix had saved the day. Whereupon, Buzzard tells us, Ogus answered: “Oh! That appendix has several errors in it! But it’s OK, I think I know how to fix them.”
—Tony Phillips, Stony Brook University