Fingerprinting, Polynomial Identities, Matchings and the Isolation Lemma

Motivation

It is hard to overstate the importance of algebraic techniques in computer science. They appear in many areas, including randomized algorithms (hashing, today’s lecture), parallel algorithms (also this lecture), efficient proof/program verification (PCPs), coding theory, cryptography, and complexity theory.

Fingerprinting

We begin with a basic problem: suppose Alice and Bob each maintain the same large database of information (think of each as being a server from a company that deals with a lot of data). Alice and Bob want to check if their databases are consistent. However, they do not want to reveal their entire database to each other (as that would be too expensive).

So, sending the entire database to each other is not an option. What can they do? Any deterministic consistency check requires communicating the entire database in the worst case. However, if we use randomness we can do much better, using a technique called fingerprinting.

The problem above can be more succinctly stated as follows: if Alice’s version of the database is given by the string $a = (a_0, a_1, \ldots, a_{n-1})$ and Bob’s is given by $b = (b_0, b_1, \ldots, b_{n-1})$, then given two strings $a, b \in \{0,1\}^n$, how can we check whether they are equal?

Fingerprinting Mechanism

Let $\alpha := \sum_{i=0}^{n-1} a_i 2^i$ and $\beta := \sum_{i=0}^{n-1} b_i 2^i$. Let $p$ be a prime number and let $F_p(x) := x \bmod p$ be the function that maps $x$ to its remainder modulo $p$. This function is called the fingerprinting function.

Now, we can describe the fingerprinting mechanism/protocol as follows:

  1. Alice picks a random prime $p$ and sends $(p, F_p(\alpha))$ to Bob.
  2. Bob checks whether $F_p(\beta) \equiv F_p(\alpha) \pmod{p}$, and sends to Alice
     $$\begin{cases} 1 & \text{if } F_p(\beta) \equiv F_p(\alpha) \pmod{p} \\ 0 & \text{otherwise.} \end{cases}$$

In the above algorithm, the total number of bits communicated is $O(\log p)$. And it is easy to see that if $a = b$, the protocol always outputs $1$. What happens when $a \neq b$?
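The protocol above can be sketched in a few lines of Python. This is a toy sketch: the function names are ours, and a real implementation would use a fast primality test rather than trial division.

```python
import random

def is_prime(m):
    """Trial-division primality check (fine for this toy sketch)."""
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True

def fingerprint_equal(a_bits, b_bits, prime_bound):
    """One round of the protocol: Alice sends (p, alpha mod p), Bob compares."""
    alpha = sum(bit << i for i, bit in enumerate(a_bits))  # Alice's integer
    beta = sum(bit << i for i, bit in enumerate(b_bits))   # Bob's integer
    primes = [m for m in range(2, prime_bound) if is_prime(m)]
    p = random.choice(primes)         # Alice's random prime
    return beta % p == alpha % p      # Bob's answer: True iff fingerprints agree
```

If $a = b$ the answer is always correct; for $a \neq b$ the error probability shrinks as the prime bound grows, as the analysis below quantifies.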

Verifying String Inequality

If $a \neq b$, then $\alpha \neq \beta$. For how many primes $p$ is it true that $F_p(\alpha) = F_p(\beta)$ (i.e., for how many primes will the protocol fail)? Note that $F_p(\alpha) \equiv F_p(\beta) \pmod{p}$ if and only if $p \mid \alpha - \beta$. This leads us to the following claim:


Claim: If a nonzero number $M \in \{-2^n, \ldots, 2^n\}$, then the number of distinct primes $p$ such that $p \mid M$ is at most $n$.


Proof: each prime divisor of $M$ is at least $2$, so if $M$ has $k$ distinct prime divisors, then $|M| \ge 2^k$. Since $|M| \le 2^n$, we have $k \le n$. $\blacksquare$


By the above claim, the number of primes $p$ such that $p \mid \alpha - \beta$ is at most $n$. By the prime number theorem, there are roughly $m/\log m$ primes among the first $m$ positive integers. Choosing our prime $p$ uniformly at random among the primes in the first $tn\log(tn)$ positive integers, the probability that $p \mid \alpha - \beta$ is at most
$$\frac{n}{tn\log(tn)/\log\big(tn\log(tn)\big)} = \tilde{O}(1/t).$$

Thus, the number of bits sent is $\tilde{O}(\log(tn))$. Choosing $t = n$ gives us a protocol which works with high probability.

Polynomial Identity Testing

The technique of fingerprinting can be used to solve a more general problem: given two polynomials $f(x), g(x) \in \mathbb{F}[x]$ (where $\mathbb{F}$ is a field), how can we check whether $f(x) = g(x)$?

Two polynomials are equal if and only if their difference is the zero polynomial. Hence, the problem reduces to checking whether a polynomial is the zero polynomial. Since a polynomial of degree $d$ is uniquely determined by its values at $d+1$ points, we can check whether a polynomial is the zero polynomial by checking whether it is zero at $d+1$ points. To turn this into a randomized algorithm, we can simply sample one point uniformly at random from a set $S \subseteq \mathbb{F}$ with $|S| = 2(d+1)$ and check whether the polynomial is zero at that point. Since a nonzero polynomial of degree $d$ has at most $d$ roots, the probability that it evaluates to zero at the sampled point is at most $d/|S| < 1/2$.

If we want to increase the success probability, there are two ways to do it: either we can increase the number of points we check, or we can repeat the above procedure multiple times.
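A minimal Python sketch of this univariate zero test, assuming the polynomial is given as an evaluation oracle (the function name is ours):

```python
import random

def is_zero_poly(eval_f, degree, trials=20):
    """Randomized zero test for a univariate polynomial given as an oracle.
    Each trial samples a point from a set S of size 2*(degree+1); a nonzero
    polynomial of degree d vanishes there with probability < 1/2."""
    S = range(2 * (degree + 1))  # sample set S with |S| = 2(d+1)
    for _ in range(trials):
        if eval_f(random.choice(S)) != 0:
            return False  # a nonzero evaluation certifies f is nonzero
    return True           # zero with probability > 1 - 2**(-trials)
```

Here the repetition illustrates the second boosting option: each independent trial errs with probability less than $1/2$, so `trials` repetitions drive the error below $2^{-\texttt{trials}}$.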

The above problem as well as the approach can be generalized to polynomials in many variables.

The general problem is known as polynomial identity testing, which we now formally state:


Polynomial Identity Testing (PIT): Given a polynomial $f(x_1, \ldots, x_n) \in \mathbb{F}[x_1, \ldots, x_n]$, is $f(x_1, \ldots, x_n) \equiv 0$?


What do we mean by “given a polynomial”? This can come in many forms, but in this class we will only assume that we have access to an oracle that can evaluate the polynomial at any point in $\mathbb{F}^n$.

Generalizing the above approach yields the following lemma, which can be used in a randomized algorithm for polynomial identity testing.


Lemma 1 (Ore-Schwartz-Zippel-DeMillo-Lipton): Let $f(x_1, \ldots, x_n) \in \mathbb{F}[x_1, \ldots, x_n]$ be a nonzero polynomial of total degree $d$. Then, for any finite set $S \subseteq \mathbb{F}$, we have
$$\Pr_{a_1, \ldots, a_n \sim S}[f(a_1, \ldots, a_n) = 0] \le \frac{d}{|S|},$$
where $a_1, \ldots, a_n$ are drawn independently and uniformly from $S$.


Proof: We prove the lemma by induction on n. The base case n=1 follows from the argument above.

For the inductive step, we assume that the lemma holds for $n-1$ variables. Write $f(x_1, \ldots, x_n) = \sum_{i=0}^{d} f_i(x_1, \ldots, x_{n-1})\, x_n^i$. Since $f$ is nonzero, it must be the case that $f_i$ is nonzero for some $i$. Let $k$ be the largest index such that $f_k$ is nonzero. Then $f_k(x_1, \ldots, x_{n-1})$ is a nonzero polynomial of degree at most $d-k$ in $n-1$ variables. By the inductive hypothesis, we have
$$\Pr_{a_1, \ldots, a_{n-1} \sim S}[f_k(a_1, \ldots, a_{n-1}) = 0] \le \frac{d-k}{|S|}.$$

Now, we have
$$\begin{aligned}
\Pr_{a_1, \ldots, a_n \sim S}[f(a_1, \ldots, a_n) = 0]
&= \Pr[f(a_1, \ldots, a_{n-1}, a_n) = 0 \mid f_k(a_1, \ldots, a_{n-1}) \neq 0] \cdot \Pr[f_k(a_1, \ldots, a_{n-1}) \neq 0] \\
&\quad + \Pr[f(a_1, \ldots, a_{n-1}, a_n) = 0 \mid f_k(a_1, \ldots, a_{n-1}) = 0] \cdot \Pr[f_k(a_1, \ldots, a_{n-1}) = 0] \\
&\le \frac{k}{|S|} \cdot \Pr[f_k(a_1, \ldots, a_{n-1}) \neq 0]
 + \Pr[f(a_1, \ldots, a_{n-1}, a_n) = 0 \mid f_k(a_1, \ldots, a_{n-1}) = 0] \cdot \frac{d-k}{|S|} \\
&\le \frac{k}{|S|} \cdot 1 + 1 \cdot \frac{d-k}{|S|} = \frac{d}{|S|},
\end{aligned}$$

where in the second-to-last inequality we applied the inductive hypothesis twice: once for $1$ variable (conditioned on $f_k(a_1, \ldots, a_{n-1}) \neq 0$, the polynomial $f(a_1, \ldots, a_{n-1}, x_n)$ is a nonzero univariate polynomial of degree $k$ in $x_n$), and once for $n-1$ variables. In the last inequality, we simply used the fact that any probability is upper bounded by $1$. $\blacksquare$
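Lemma 1 immediately gives a randomized PIT algorithm with oracle access to $f$: sample a uniform point of $S^n$ and evaluate. A minimal Python sketch (names are ours):

```python
import random

def sz_test(eval_f, n_vars, degree, S, trials=20):
    """Schwartz-Zippel test: if f is a nonzero polynomial of total degree d,
    a uniform point of S^n is a root with probability at most d/|S| (Lemma 1)."""
    for _ in range(trials):
        point = [random.choice(S) for _ in range(n_vars)]
        if eval_f(point) != 0:
            return False  # witness point: f is certainly nonzero
    return True           # f is identically zero with high probability
```

For example, with $|S| \ge 2d$ each trial errs with probability at most $1/2$, so `trials` independent repetitions drive the error down to $2^{-\texttt{trials}}$.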


Randomized Matching Algorithms

We now use the above lemma to give a randomized algorithm for the perfect matching problem. We begin with the problem of deciding whether a bipartite graph G=(LR,E) has a perfect matching.


Input: A bipartite graph G=(LR,E).

Output: YES if G has a perfect matching, NO otherwise.


Let $n = |L| = |R|$ and let $X \in \mathbb{F}[x_{11}, x_{12}, \ldots, x_{nn}]^{n \times n}$ be the symbolic adjacency matrix of $G$. That is, $X_{ij} = x_{ij}$ if $(i,j) \in E$ and $X_{ij} = 0$ otherwise.

Since $\det(X) = \sum_{\sigma \in S_n} (-1)^{\sigma} \prod_{i=1}^{n} X_{i,\sigma(i)}$, and each permutation $\sigma$ whose term is nonzero corresponds to a perfect matching of $G$ (distinct permutations give distinct monomials, so no cancellation occurs), we have that $\det(X) \not\equiv 0$ (as a polynomial) if and only if $G$ has a perfect matching.

Thus, we can use Lemma 1 to give a randomized algorithm for the perfect matching problem! In other words, the perfect matching problem for bipartite graphs is a special case of the polynomial identity testing problem.

Thus, our algorithm is simply to evaluate the polynomial $\det(X)$ at a random point in $\mathbb{F}^{n \times n}$, with each coordinate drawn from a sample set $S \subseteq \mathbb{F}$. Since $\det(X)$ has degree $n$, the analysis is the same as the one in the previous section.
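Here is a self-contained Python sketch of this algorithm over a prime field $\mathbb{F}_p$; the helper names are ours, and the determinant is computed by Gaussian elimination mod $p$.

```python
import random

def det_mod_p(M, p):
    """Determinant of an integer matrix M over F_p via Gaussian elimination."""
    n = len(M)
    M = [row[:] for row in M]  # work on a copy
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] % p != 0), None)
        if pivot is None:
            return 0                       # singular: a zero column remains
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det                     # row swap flips the sign
        det = det * M[col][col] % p
        inv = pow(M[col][col], p - 2, p)   # inverse via Fermat (p is prime)
        for r in range(col + 1, n):
            f = M[r][col] * inv % p
            for c in range(col, n):
                M[r][c] = (M[r][c] - f * M[col][c]) % p
    return det % p

def has_perfect_matching(n, edges, p=1000003, trials=10):
    """Plug random F_p values into the symbolic adjacency matrix X
    and test whether det(X) evaluates to nonzero (Lemma 1)."""
    for _ in range(trials):
        X = [[random.randrange(1, p) if (i, j) in edges else 0
              for j in range(n)] for i in range(n)]
        if det_mod_p(X, p) != 0:
            return True   # det(X) is a nonzero polynomial: matching exists
    return False          # likely no perfect matching
```

By Lemma 1, when a perfect matching exists a single random evaluation is already nonzero with probability at least $1 - n/p$.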

Isolation Lemma

Oftentimes in parallel algorithms, when solving a problem with many possible solutions, it is important to make sure that different processors are working towards the same solution.

For this, we need to single out (i.e. isolate) a specific solution without knowing any element of the solution space. How can we do this?

One way to do this is to implicitly define a random ordering on the solution space and then pick the first solution (i.e. lowest order solution) in this ordering. This approach also has applications in distributed computing, where we want to pick a leader among a set of processors, or break deadlocks. We can also use this approach to compute a minimum weight perfect matching in a graph (see references in slides).

We now state the isolation lemma:


Lemma 2 (Isolation Lemma): Given a set system $\mathcal{S}$ over $[n] := \{1, 2, \ldots, n\}$, if we assign a weight function $w: [n] \to [2n]$ with each $w(i)$ chosen independently and uniformly at random, then the probability that the minimum weight set of $\mathcal{S}$ is unique is at least $1/2$.


Example: Suppose $n = 4$, and our set system is given by $S_1 = \{1,4\}$, $S_2 = \{2,3\}$, $S_3 = \{1,2,3\}$.

Then a random weight function $w: [4] \to [8]$ might be $w(1) = 3$, $w(2) = 5$, $w(3) = 8$, $w(4) = 4$; here the minimum weight set is $S_1$, with weight $7$. However, if we had instead chosen $w(1) = 5$, $w(2) = 1$, $w(3) = 7$, $w(4) = 3$, then we would have two sets of minimum weight $8$, namely $S_1$ and $S_2$.

Remark: The isolation lemma can be quite counter-intuitive. A set system can have $\Omega(2^n)$ sets, and on average there are $\Omega(2^n/(2n^2))$ sets of any given weight, since the maximum possible weight of a set is $2n^2$. The isolation lemma tells us that even though there may be exponentially many sets, the probability that the minimum weight set is unique is still at least $1/2$.
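One can check the lemma empirically with a short Monte Carlo simulation; below we use the example set system from above (the function name is ours):

```python
import random

def min_weight_unique_rate(sets, n, trials=20000):
    """Estimate Pr[the minimum weight set is unique] under a uniformly
    random weight function w: [n] -> [2n]."""
    unique = 0
    for _ in range(trials):
        w = [random.randint(1, 2 * n) for _ in range(n)]    # w(1), ..., w(n)
        weights = [sum(w[v - 1] for v in A) for A in sets]  # w(A) for each set
        unique += weights.count(min(weights)) == 1
    return unique / trials

example = [{1, 4}, {2, 3}, {1, 2, 3}]  # the set system from the example above
```

For this particular system the empirical rate comes out well above the $1/2$ guaranteed by the lemma.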


Proof of Isolation Lemma: Let $\mathcal{S}$ be a set system over $[n]$, let $v \in [n]$, and for each $A \in \mathcal{S}$, let $w(A) := \sum_{u \in A} w(u)$ be the weight of $A$. Also, let $\mathcal{S}_v \subseteq \mathcal{S}$ be the family of sets in $\mathcal{S}$ that contain $v$, and similarly define $\mathcal{N}_v := \mathcal{S} \setminus \mathcal{S}_v$, that is, the family of sets in $\mathcal{S}$ that do not contain $v$.

Let
$$\alpha_v := \min_{A \in \mathcal{N}_v} w(A) \;-\; \min_{B \in \mathcal{S}_v} w(B \setminus \{v\}).$$

Note that

  • $\alpha_v < w(v) \implies v$ does not belong to any minimum weight set.
  • $\alpha_v > w(v) \implies v$ belongs to every minimum weight set.
  • $\alpha_v = w(v) \implies v$ may belong to some but not all minimum weight sets (so this is an ambiguous case).

Since $\alpha_v$ is determined by the weights $w(u)$ for $u \neq v$, it is independent of $w(v)$, which is uniform over $[2n]$. Hence, for each $v$ we have $\Pr[\alpha_v = w(v)] \le \frac{1}{2n}$, and so
$$\Pr[\text{there is an ambiguous element}] \le n \cdot \frac{1}{2n} = \frac{1}{2},$$
where the last inequality follows from the union bound over all $n$ elements.

Note that if we have two distinct sets $A, B$ of minimum weight, then any element $v \in A \,\Delta\, B$ is ambiguous. But as we saw above, the probability that there is an ambiguous element is at most $1/2$, so the minimum weight set is unique with probability at least $1/2$. $\blacksquare$
