# Fingerprinting, Polynomial Identities, Matchings and Isolation Lemma

## Motivation

It is hard to overstate the importance of algebraic techniques in computer science. Algebraic techniques are used in many areas of computer science, including randomized algorithms (hashing, today’s lecture), parallel algorithms (also this lecture), efficient proof/program verification (PCPs), coding theory, cryptography, and complexity theory.

## Fingerprinting

We begin with a basic problem: suppose Alice and Bob each maintain the same large database of information (think of each as being a server from a company that deals with a lot of data). Alice and Bob want to check whether their databases are consistent. However, they do not want to send their entire databases to each other (as that would be too expensive).

So, sending the entire database to each other is not an option. What can they do? Any deterministic consistency check requires communicating the entire database in the worst case. However, using randomness we can do much better, via a technique called fingerprinting.

The problem above can be more succinctly stated as follows: if Alice’s version of the database is given by string $a = (a_0, a_1, \ldots, a_{n-1})$ and Bob’s is given by $b = (b_0, b_1, \ldots, b_{n-1})$, then given two strings $a, b \in \{0,1\}^n$, how can we check if they are equal?

### Fingerprinting Mechanism

Let $\alpha := \sum_{i=0}^{n-1} a_i \cdot 2^i$ and $\beta := \sum_{i=0}^{n-1} b_i \cdot 2^i$. Let $p$ be a prime number and let $F_p(x) := x \bmod p$ be the function that maps $x$ to its remainder modulo $p$. This function is called the fingerprinting function.

Now, we can describe the fingerprinting mechanism/protocol as follows:

- Alice picks a random prime $p$ and sends $(p, F_p(\alpha))$ to Bob.
- Bob checks whether $F_p(\beta) = F_p(\alpha)$, and sends to Alice $$\begin{cases} 1 & \text{if } F_p(\beta) = F_p(\alpha) \\ 0 & \text{otherwise} \end{cases}$$

In the above algorithm, the total number of bits communicated is $O(\log p)$. And it is easy to see that if $a = b$, the protocol always outputs $1$. What happens when $a \neq b$?
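As a concrete illustration, here is a minimal sketch of the protocol in Python. The helper names (`is_probable_prime`, `random_prime`, `fingerprint_protocol`) and the rejection-sampling prime generation via a Miller-Rabin test are our own choices, not part of the lecture:

```python
import random

def is_probable_prime(m, rounds=20):
    """Miller-Rabin primality test (probabilistic)."""
    if m < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if m % p == 0:
            return m == p
    d, s = m - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, m - 1)
        x = pow(a, d, m)
        if x in (1, m - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, m)
            if x == m - 1:
                break
        else:
            return False
    return True

def random_prime(limit):
    """Sample a uniformly random prime in [2, limit] by rejection."""
    while True:
        p = random.randrange(2, limit + 1)
        if is_probable_prime(p):
            return p

def fingerprint_protocol(a_bits, b_bits, t=100):
    """Alice sends (p, alpha mod p); Bob replies 1 iff fingerprints match."""
    n = len(a_bits)
    alpha = sum(bit << i for i, bit in enumerate(a_bits))
    beta = sum(bit << i for i, bit in enumerate(b_bits))
    # primes among the first ~ t*n*log(t*n) integers
    limit = max(4, t * n * max(1, (t * n).bit_length()))
    p = random_prime(limit)
    return 1 if alpha % p == beta % p else 0
```

If $a = b$ the protocol always outputs $1$; if $a \neq b$ it outputs $1$ only when the randomly chosen prime divides $\alpha - \beta$, which happens with probability $\tilde{O}(1/t)$.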

### Verifying String Inequality

If $a \neq b$, then $\alpha \neq \beta$. For how many primes $p$ is it true that $F_p(\alpha) = F_p(\beta)$ (i.e., for how many primes does the protocol fail)? Note that $F_p(\alpha) = F_p(\beta)$ if and only if $p \mid \alpha - \beta$. This leads us to the following claim:

**Claim:** If a number $M \in \{-2^n, \ldots, 2^n\}$ is nonzero, then the number of distinct primes $p$ such that $p \mid M$ is less than $n$.

**Proof:** Suppose $M$ has $k \geq 2$ distinct prime divisors. At most one of them equals $2$ and the rest are at least $3$, so $|M| \geq 2 \cdot 3^{k-1} > 2^k$.
Since $|M| \leq 2^n$, we have $k < n$. (If $k \leq 1$, the claim is trivial.)

By the above claim, the number of primes $p$ such that $p \mid \alpha - \beta$ is at most $n$. By the prime number theorem, there are $\Theta(m/\log m)$ primes among the first $m$ positive integers. Choosing our prime $p$ uniformly at random among the primes in the first $tn \log(tn)$ positive integers, we have that the probability that $p \mid \alpha - \beta$ is at most $\dfrac{n}{\Theta(tn \log(tn) /\log(tn \cdot \log tn))} = \tilde{O}(1/t)$.

Thus, the number of bits sent is $\tilde{O}(\log(tn))$. Choosing $t = n$ gives us a protocol which works with high probability.

## Polynomial Identity Testing

The technique of fingerprinting can be used to solve a more general problem: given two polynomials $f(x), g(x) \in \mathbb{F}[x]$ (where $\mathbb{F}$ is a field), how can we check if $f(x) = g(x)$?

Two polynomials are equal if and only if their difference is the zero polynomial. Hence, the problem reduces to checking whether a polynomial is identically zero. Since a polynomial of degree $d$ is uniquely determined by its values at $d+1$ points, we can check deterministically whether a polynomial is zero by evaluating it at any $d+1$ distinct points. To turn this into a randomized algorithm, we can simply sample one point uniformly at random from a set $S \subseteq \mathbb{F}$ with $|S| = 2(d+1)$ and check whether the polynomial is zero at that point. Since a nonzero polynomial of degree $d$ has at most $d$ roots, the probability that it evaluates to zero at the sampled point is at most $d/(2(d+1)) < 1/2$.
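A minimal sketch of this one-point test in Python, assuming only oracle access to the polynomial. The name `probably_zero` is ours, and for simplicity we work over the integers so that a sample set of size $2(d+1)$ is always available:

```python
import random

def probably_zero(evaluate, degree):
    """One-sided randomized zero test for a univariate polynomial of the
    given degree, accessed only through the `evaluate` oracle.

    If the polynomial is identically zero, this always returns True;
    if it is nonzero, it returns True with probability at most 1/2."""
    S = range(2 * (degree + 1))   # sample set of size 2(d+1)
    a = random.choice(S)
    return evaluate(a) == 0
```

Repeating the test $k$ times independently drives the error probability for nonzero polynomials down to $2^{-k}$.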

If we want to increase the success probability, there are two ways to do it: either we can increase the number of points we check, or we can repeat the above procedure multiple times.

The above problem as well as the approach can be generalized to polynomials in many variables.

The general problem is known as polynomial identity testing, which we now formally state:

**Polynomial Identity Testing (PIT):** Given a polynomial $f(x_1, \ldots, x_n) \in \mathbb{F}[x_1, \ldots, x_n]$, is $f(x_1, \ldots, x_n) \equiv 0$?

What do we mean by “given a polynomial”? This can come in many forms, but in this class we will only assume that we have access to an oracle that can evaluate the polynomial at any point in $\mathbb{F}^n$.

Generalizing the above approach yields the following lemma, that can be used in a randomized algorithm for polynomial identity testing.

**Lemma 1 (Ore-DeMillo-Lipton-Schwartz-Zippel):** Let $f(x_1, \ldots, x_n) \in \mathbb{F}[x_1, \ldots, x_n]$ be a nonzero polynomial of total degree $d$.
Then, for any finite set $S \subseteq \mathbb{F}$, we have
$$\Pr_{a_1, \ldots, a_n \in S}[f(a_1, \ldots, a_n) = 0] \leq \dfrac{d}{|S|}$$
where $a_1, \ldots, a_n$ are drawn independently and uniformly from $S$.

**Proof:** We prove the lemma by induction on $n$.
The base case $n = 1$ follows from the argument above.

For the inductive step, we assume that the lemma holds for $n-1$ variables. Let $f(x_1, \ldots, x_n) = \sum_{i=0}^d f_i(x_1, \ldots, x_{n-1}) x_n^i$. Since $f$ is non-zero, it must be the case that $f_i$ is non-zero for some $i$. Let $k$ be the largest index such that $f_k$ is non-zero. Then $f_k(x_1, \ldots, x_{n-1})$ is a nonzero polynomial of degree at most $d-k$ in $n-1$ variables. By the inductive hypothesis, we have that $$\Pr_{a_1, \ldots, a_{n-1} \in S}[f_k(a_1, \ldots, a_{n-1}) = 0] \leq \dfrac{d-k}{|S|}$$

Now, writing $f = 0$ as shorthand for the event $f(a_1, \ldots, a_n) = 0$ and $f_k = 0$ for the event $f_k(a_1, \ldots, a_{n-1}) = 0$, we have
$$\begin{aligned} \Pr_{a_1, \ldots, a_n \in S}[f = 0] &= \Pr[f = 0 \mid f_k \neq 0] \cdot \Pr[f_k \neq 0] + \Pr[f = 0 \mid f_k = 0] \cdot \Pr[f_k = 0] \\ &\leq \dfrac{k}{|S|} \cdot \Pr[f_k \neq 0] + \Pr[f = 0 \mid f_k = 0] \cdot \dfrac{d-k}{|S|} \\ &\leq \dfrac{k}{|S|} \cdot 1 + 1 \cdot \dfrac{d-k}{|S|} = \dfrac{d}{|S|} \end{aligned}$$

where the first inequality uses the inductive hypothesis for $f_k$ (in $n-1$ variables) together with the base case applied to the univariate polynomial $x_n \mapsto f(a_1, \ldots, a_{n-1}, x_n)$, which is nonzero of degree $k$ whenever $f_k(a_1, \ldots, a_{n-1}) \neq 0$. The second inequality simply uses the fact that any probability is upper bounded by $1$.
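As a sanity check of the bound (not part of the proof), one can empirically estimate the vanishing probability of a small multivariate polynomial; a sketch in Python, with the helper name `empirical_zero_rate` ours:

```python
import random

def empirical_zero_rate(f, n_vars, S, trials=20000):
    """Estimate Pr[f(a_1, ..., a_n) = 0] over independent uniform a_i in S."""
    hits = sum(
        1 for _ in range(trials)
        if f(*[random.choice(S) for _ in range(n_vars)]) == 0
    )
    return hits / trials

# f(x, y) = y * (x - 2): a nonzero polynomial of total degree d = 2
rate = empirical_zero_rate(lambda x, y: y * (x - 2), 2, list(range(10)))
# Lemma 1 bounds the true probability by d/|S| = 2/10
```

Here the exact probability is $1/10 + (9/10)(1/10) = 0.19$, safely below the Lemma 1 bound of $0.2$.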

## Randomized Matching Algorithms

We now use the above lemma to give a randomized algorithm for the perfect matching problem. We begin with the problem of deciding whether a bipartite graph $G = (L \cup R, E)$ has a perfect matching.

**Input:** A bipartite graph $G = (L \cup R, E)$.

**Output:** YES if $G$ has a perfect matching, NO otherwise.

Let $n = |L| = |R|$ and let $X \in \mathbb{F}[x_{11}, x_{12}, \ldots, x_{nn}]^{n \times n}$ be the symbolic adjacency matrix of $G$. That is, $X_{ij} = x_{ij}$ if $(i,j) \in E$ and $X_{ij} = 0$ otherwise.

Since $$\det(X) = \sum_{\sigma \in S_n} \mathrm{sgn}(\sigma) \prod_{i=1}^n X_{i, \sigma(i)}$$ and since each permutation $\sigma$ whose product is nonzero corresponds to a perfect matching (and distinct permutations contribute distinct monomials, so no cancellation can occur), we have that $\det(X) \not\equiv 0$ (as a polynomial) if and only if $G$ has a perfect matching.

Thus, we can use Lemma 1 to give a randomized algorithm for the perfect matching problem! In other words, the perfect matching problem for bipartite graphs is a special case of the polynomial identity testing problem.

Thus, our algorithm is simply to evaluate the polynomial $\det(X)$ at a uniformly random point of $S^{n \times n}$, where $S \subseteq \mathbb{F}$ is a set of size at least $2n$ (note that $\det(X)$ has degree $n$); the determinant of the resulting numeric matrix can then be computed efficiently, e.g., by Gaussian elimination. The analysis is the same as the one in the previous section.
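A minimal sketch of this algorithm in Python, working over $\mathbb{F}_p$ for a large prime $p$ (so $|S| = p$ and the one-sided error probability is at most $n/p$). The function names and the choice $p = 2^{61} - 1$ are our own:

```python
import random

def det_mod_p(M, p):
    """Determinant of an integer matrix mod a prime p, via Gaussian elimination."""
    n = len(M)
    M = [row[:] for row in M]          # work on a copy
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return 0                   # column of zeros below the diagonal
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = (-det) % p           # row swap flips the sign
        det = det * M[col][col] % p
        inv = pow(M[col][col], p - 2, p)   # modular inverse (Fermat)
        for r in range(col + 1, n):
            factor = M[r][col] * inv % p
            for c in range(col, n):
                M[r][c] = (M[r][c] - factor * M[col][c]) % p
    return det

def has_perfect_matching(adj, p=(1 << 61) - 1):
    """Substitute uniformly random field elements for the variables x_ij of
    the symbolic adjacency matrix X and test det != 0 mod p."""
    n = len(adj)
    X = [[random.randrange(1, p) if adj[i][j] else 0 for j in range(n)]
         for i in range(n)]
    return det_mod_p(X, p) != 0
```

The test never errs when $G$ has no perfect matching (then $\det(X) \equiv 0$), and declares a false "no" with probability at most $n/p$ otherwise.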

## Isolation Lemma

Often times in parallel algorithms, when solving a problem with *many possible solutions*, it is important to make sure that *different processors* are working towards the *same solution*.

For this, we need to *single out* (i.e. *isolate*) a specific solution *without knowing* any element of the solution space.
How can we do this?

One way to do this is to implicitly define a *random* ordering on the solution space and then pick the *first* solution (i.e. lowest order solution) in this ordering.
This approach also has applications in distributed computing, where we want to pick a *leader* among a set of processors, or break deadlocks.
We can also use this approach to compute a minimum weight perfect matching in a graph (see references in slides).

We now state the isolation lemma:

**Lemma 2 (Isolation Lemma):** Given a set system over $[n] := \{1, 2, \dots, n\}$, if we assign each element a weight drawn independently and uniformly at random from $[2n]$ (where the weight of a set is the sum of the weights of its elements), then the probability that the minimum weight set of the system is unique is at least $1/2$.

**Example:** Suppose $n = 4$, and our set system is given by $S_1 = \{1, 4\}, S_2 = \{2, 3\}, S_3 = \{1, 2, 3\}$.

Then a random weight function $w: [4] \rightarrow [8]$ might be $w(1) = 3, w(2) = 5, w(3) = 8, w(4) = 4$, in which case the minimum weight set is $S_1$ (of weight $7$). However, if we had instead chosen $w(1) = 5, w(2) = 1, w(3) = 7, w(4) = 3$, then there would be two sets of minimum weight $8$, namely $S_1$ and $S_2$.

**Remark:** The isolation lemma can be quite counter-intuitive.
A set system can have $\Omega(2^n)$ sets, while every set has weight at most $2n^2$, so on average there are $\Omega(2^n/n^2)$ sets of a given weight.
The isolation lemma tells us that even though there are exponentially many sets, the probability that the minimum weight set is unique is still at least $1/2$.
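As an illustration (not a proof), one can empirically check the lemma on a concrete family of sets; a sketch in Python, with the helper name `min_weight_unique` and the choice of family ours:

```python
import random
from itertools import combinations

def min_weight_unique(family, n):
    """Draw w: [n] -> [2n] uniformly at random and report whether the
    minimum weight set of the family is unique."""
    w = {i: random.randrange(1, 2 * n + 1) for i in range(1, n + 1)}
    weights = [sum(w[i] for i in A) for A in family]
    return weights.count(min(weights)) == 1

# the family of all 2-element subsets of [8]: C(8, 2) = 28 sets
n = 8
family = [frozenset(c) for c in combinations(range(1, n + 1), 2)]
rate = sum(min_weight_unique(family, n) for _ in range(5000)) / 5000
# Lemma 2 guarantees the true probability is at least 1/2
```

For this particular family the empirical uniqueness rate comfortably exceeds the $1/2$ guarantee, which holds for *every* set system.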

**Proof of Isolation Lemma:** Let $\mathcal{S}$ be a set system over $[n]$ and, for each $A \in \mathcal{S}$, let $w(A) := \sum_{i \in A} w(i)$ be the weight of $A$.
Fix an element $v \in [n]$, let $\mathcal{S}_v \subseteq \mathcal{S}$ be the family of sets in $\mathcal{S}$ that contain $v$, and similarly define $\mathcal{N}_v := \mathcal{S} \setminus \mathcal{S}_v$, that is, the family of sets in $\mathcal{S}$ that do not contain $v$.

Let $$ \alpha_v := \min_{A \in \mathcal{N}_v} w(A) - \min_{B \in \mathcal{S}_v} w(B \setminus \{v\})$$

Note that

- $\alpha_v < w(v) \Rightarrow v$ does not belong to any minimum weight set.
- $\alpha_v > w(v) \Rightarrow v$ belongs to every minimum weight set.
- $\alpha_v = w(v) \Rightarrow v$ may belong to some but not all minimum weight sets (call such an element *ambiguous*).

Since $\alpha_v$ depends only on the weights $w(u)$ for $u \neq v$, it is independent of $w(v)$. As $w(v)$ is uniform over $[2n]$, we have $$\Pr[\alpha_v = w(v)] \leq \dfrac{1}{2n},$$ and by a union bound over the $n$ elements, $$\Pr[\text{there is an ambiguous element}] \leq \dfrac{1}{2}.$$

Finally, note that if there are two distinct minimum weight sets $A, B$, then any element $v \in A \,\Delta\, B$ is ambiguous. Hence the probability that the minimum weight set is not unique is at most the probability that some element is ambiguous, which as we saw above is at most $1/2$. This proves the lemma.