4. Minimax Principle

One of the other foundational results in the study of randomized complexity theory is Yao's Minimax Principle. Before we can introduce it, let's examine distributional complexity measures in a bit more detail.

Distributional Complexity

We have already seen distributional complexity measures for randomized algorithms. Similar complexity measures can also be defined for deterministic algorithms. Let's define one such measure now.

Recall that the expected distributional error of a randomized algorithm R with respect to a target function f and the distribution μ on the domain of f is

$$\mathrm{err}_f^{\mu}(R) = \mathbb{E}_{x \sim \mu}\,\mathrm{err}_f(R, x).$$

The distributional error of a deterministic algorithm D is defined in the same way:

$$\mathrm{err}_f^{\mu}(D) = \mathbb{E}_{x \sim \mu}\,\mathrm{err}_f(D, x) = \Pr_{x \sim \mu}\left[ D(x) \neq f(x) \right].$$

The worst-case distributional complexity of a function f for error parameter ϵ and distribution μ is

$$D^{\mu}_{\epsilon}(f) = \min_{D \,:\, \mathrm{err}_f^{\mu}(D) \le \epsilon} \mathrm{cost}(D).$$

Note that in this definition, the error of D is measured on average over inputs drawn from the distribution μ but the cost of D is measured in the worst case over all possible inputs.
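As a quick illustration of these definitions, here is a toy computation in Python (an illustrative sketch; the function, algorithm, and distribution are hypothetical choices, not taken from the notes): the distributional error is just the probability of error over an input drawn from μ, while the cost would still be measured on a worst-case input.

```python
# Toy illustration (hypothetical example): the distributional error of a
# deterministic algorithm D with respect to f and mu is Pr_{x ~ mu}[D(x) != f(x)].

def distributional_error(D, f, mu):
    """Return Pr_{x ~ mu}[D(x) != f(x)] for a distribution mu on a finite domain."""
    return sum(p for x, p in mu.items() if D(x) != f(x))

# f is OR on two bits; D is the trivial algorithm that always answers 1.
f = lambda x: int(any(x))
D = lambda x: 1

# mu is the uniform distribution on {0,1}^2.
mu = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

print(distributional_error(D, f, mu))  # 0.25: D errs only on input (0, 0)
```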

The advantage of working with distributional complexity is that it is often easier to prove lower bounds on the cost of deterministic algorithms against random inputs than it is to argue about randomized algorithms directly. Better still, there is a very useful relation between randomized and distributional complexity which shows that distributional lower bounds also give lower bounds for randomized algorithms:

Lemma 1. For every function f and every ϵ>0,

$$R_{\epsilon}(f) \ge \max_{\mu} D^{\mu}_{\epsilon}(f).$$

Proof. Fix any distribution μ.

Since the set of randomized algorithms R with $\mathrm{err}_f^{\mu}(R) \le \epsilon$ is a superset of the set of randomized algorithms with error at most $\epsilon$ on all inputs,

$$R_{\epsilon}(f) = \min_{R \,:\, \mathrm{err}_f(R) \le \epsilon} \mathrm{cost}(R) \ge \min_{R \,:\, \mathrm{err}_f^{\mu}(R) \le \epsilon} \mathrm{cost}(R).$$

Now $\mathrm{err}_f^{\mu}(R) = \mathbb{E}_{D \sim R}\left[\mathrm{err}_f^{\mu}(D)\right]$, so there must always be a deterministic algorithm D in the support of R with error at most that of R. And the worst-case cost of R equals the maximum cost of any deterministic algorithm in its support. So

$$R_{\epsilon}(f) \ge \min_{D \,:\, \mathrm{err}_f^{\mu}(D) \le \epsilon} \mathrm{cost}(D) = D^{\mu}_{\epsilon}(f).$$

Since μ was an arbitrary distribution, the same bound holds for the maximum over all distributions μ.
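To spell out the averaging step used in the middle of this proof: writing R as a distribution over the deterministic algorithms in its support,

$$\mathrm{err}_f^{\mu}(R) = \sum_{D \in \mathrm{supp}(R)} \Pr[R = D]\,\mathrm{err}_f^{\mu}(D) \ge \min_{D \in \mathrm{supp}(R)} \mathrm{err}_f^{\mu}(D),$$

so any R with $\mathrm{err}_f^{\mu}(R) \le \epsilon$ has at least one deterministic algorithm D in its support with $\mathrm{err}_f^{\mu}(D) \le \epsilon$, and that D satisfies $\mathrm{cost}(D) \le \max_{D' \in \mathrm{supp}(R)} \mathrm{cost}(D') = \mathrm{cost}(R)$.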

This lemma and its related variants are often invoked under the name "Yao's Minimax Principle", but that name is not quite accurate; before we see some example uses of the lemma, let's first clarify what the Minimax Principle actually shows.

Yao’s Minimax Principle

The lemma above is often called "Yao's Minimax Principle", but it is more accurate to call it the "easy (or perhaps even trivial) direction of Yao's Minimax Principle".

The full Minimax Principle states something much stronger: the inequality in the last lemma can be replaced with equality. In other words, the method of proving lower bounds on randomized complexity measures by bounding the distributional complexity of the same function is complete, in the sense that there is always a hard distribution for which the distributional complexity matches the randomized complexity of the function.

Minimax Principle. (Worst-case randomized complexity variant) For every function f and every ϵ>0,

$$R_{\epsilon}(f) = \max_{\mu} D^{\mu}_{\epsilon}(f).$$

This Minimax Principle is only one of many (many!) examples of minimax theorems in different areas of mathematics. Generally, a minimax theorem establishes the identity

$$\min_{x \in X} \max_{y \in Y} h(x, y) = \max_{y \in Y} \min_{x \in X} h(x, y)$$

for some sets X, Y and function $h \colon X \times Y \to \mathbb{R}$. But one quick warning: to see how the Minimax Principle above can be stated (and proved) in this form, it doesn't suffice to just expand the definitions; we need to restate the equation in a different way. We'll postpone doing so until a bit later in the course.
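One familiar instance, included here only as an illustration of the general form, is von Neumann's minimax theorem for finite zero-sum games: for any payoff matrix $A \in \mathbb{R}^{m \times n}$,

$$\min_{p \in \Delta_m} \max_{q \in \Delta_n} p^{\top} A q = \max_{q \in \Delta_n} \min_{p \in \Delta_m} p^{\top} A q,$$

where $\Delta_m$ and $\Delta_n$ denote the sets of probability distributions (mixed strategies) over the two players' pure strategies.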

There are other variants of the Minimax Principle for randomized complexity theory. The following one was the original Minimax Principle, the one that Andrew Yao first established. It is concerned with the average complexity of zero-error randomized algorithms.

Minimax Principle. (Average-case zero-error randomized complexity variant) For every function f,

$$R_0(f) = \max_{\mu} D^{\mu}_0(f).$$

Let us see how this variant of the Minimax Principle is a minimax theorem. Let $D_f$ denote the set of deterministic algorithms that correctly compute f. Let $R_f$ denote the set of randomized algorithms that correctly compute f. Then

$$R_0(f) = \min_{R \in R_f} \max_{x} \mathrm{cost}(R, x) = \min_{R \in R_f} \max_{\mu} \mathrm{cost}(R, \mu)$$

where the last max is over all distributions on the domain of f; the two expressions are equal because a distribution can place all of its mass on a single worst-case input, while the μ-average cost of R can never exceed its worst-case cost. And

$$\max_{\mu} D^{\mu}_0(f) = \max_{\mu} \min_{D \in D_f} \mathrm{cost}(D, \mu) = \max_{\mu} \min_{R \in R_f} \mathrm{cost}(R, \mu).$$

Therefore, the Minimax Principle for average complexity of zero-error algorithms can be expressed as the minimax theorem

$$\min_{R \in R_f} \max_{\mu} \mathrm{cost}(R, \mu) = \max_{\mu} \min_{R \in R_f} \mathrm{cost}(R, \mu).$$

The last identity in the expression for $\max_{\mu} D^{\mu}_0(f)$ follows from the fact that the average cost of any randomized algorithm with respect to μ is at least as large as the minimum average cost of the deterministic algorithms in its support.
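To spell out this fact in symbols: for any zero-error randomized algorithm R, every deterministic algorithm in the support of R must itself compute f correctly on all inputs, so for every distribution μ,

$$\mathrm{cost}(R, \mu) = \mathbb{E}_{D \sim R}\left[\mathrm{cost}(D, \mu)\right] \ge \min_{D \in \mathrm{supp}(R)} \mathrm{cost}(D, \mu) \ge \min_{D \in D_f} \mathrm{cost}(D, \mu),$$

and since every algorithm in $D_f$ also belongs to $R_f$, the two minima over μ-average cost coincide.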

Note, however, that the Minimax Principle does not hold in all randomized models of computation. For example, we have the following “non-minimax” result.

Lemma 2. For every function f and every ϵ>0,

$$R_{\epsilon}(f) \ge \max_{\mu} D^{\mu}_{\epsilon}(f)$$

and there are choices of f and ϵ for which the inequality is strict.

Using the (Easy Direction of the) Minimax Principle

We can see some examples of how to use Lemma 1 to obtain lower bounds on the randomized complexity of functions by revisiting the evasiveness conjecture from the first lecture. Recall that in this setting, the cost of a deterministic algorithm on a graph G is the number of vertex pairs that it queries to learn whether or not the two vertices are connected by an edge in G. We can use Lemma 1 to obtain a simple bound on the randomized query complexity of determining whether a graph is bipartite.
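Before stating the first bound, here is a minimal sketch of this query model in Python (an illustrative toy, not part of the notes): the algorithm can access the graph only through an edge oracle, and its cost is the number of distinct vertex pairs it queries.

```python
# Minimal sketch (illustrative) of the query model for graph properties:
# the algorithm learns a graph G only through vertex-pair queries, and its
# cost is the number of distinct pairs it queries.

class EdgeOracle:
    def __init__(self, edges):
        self.edges = {frozenset(e) for e in edges}
        self.queried = set()

    def query(self, u, v):
        """Report whether {u, v} is an edge of G; each distinct pair counts once."""
        self.queried.add(frozenset((u, v)))
        return frozenset((u, v)) in self.edges

    @property
    def cost(self):
        return len(self.queried)

# Example: a triangle on vertices 0, 1, 2.
oracle = EdgeOracle([(0, 1), (1, 2), (0, 2)])
print(oracle.query(0, 1), oracle.query(0, 3))  # True False
print(oracle.cost)                             # 2
```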

Theorem. $R_{1/3}(\mathrm{Bipartite}) = \Theta(n^2)$.

Proof. By Lemma 1, it suffices to construct a distribution μ on graphs for which every deterministic algorithm that queries at most $\frac12\binom{n/2}{2}$ pairs of vertices must err with probability greater than $\frac13$ on graphs drawn from μ. (Since $\frac12\binom{n/2}{2} = \Omega(n^2)$ and any graph property can be decided by querying all $\binom{n}{2}$ vertex pairs, this gives the theorem.) Take μ to be the distribution where, with probability $\frac12$, the graph is the complete bipartite graph $K_{n/2,n/2}$ and, with probability $\frac12$, the graph is $K_{n/2,n/2}$ with one additional edge chosen uniformly at random among the $2\binom{n/2}{2}$ pairs of vertices that lie within one of its two parts.

In the first case, the graph drawn from μ is bipartite. In the second case, it is not. But consider any deterministic algorithm that queries at most $\frac12\binom{n/2}{2}$ vertex pairs before terminating. By the union bound, the probability in the second case that it queries the additional edge added within one of the two parts is at most

$$\frac{\frac12\binom{n/2}{2}}{2\binom{n/2}{2}} = \frac14.$$

Let's say that it accepts (or outputs "bipartite") with probability p in the event that it does not query any edge outside of the ones in the base $K_{n/2,n/2}$ graph. The error of the algorithm with respect to μ is then at least

$$\frac12(1 - p) + \frac12 \cdot \frac34\, p = \frac12 - \frac18 p \ge \frac38 > \frac13.$$
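As a sanity check on this calculation, here is a small Python simulation (an illustrative sketch with assumed parameters, not part of the notes). It samples graphs from μ and estimates the error of one natural deterministic strategy that queries a fixed set of $\frac12\binom{n/2}{2}$ within-part pairs and answers "bipartite" unless one of them turns out to be an edge; the estimate comes out near the $\frac38$ bound above.

```python
import itertools
import random

# Illustrative simulation (hypothetical strategy): estimate the error over mu of a
# deterministic algorithm that queries a fixed set of (1/2) * C(n/2, 2) within-part
# pairs and says "bipartite" iff none of them is an edge.

n = 20                      # number of vertices
half = n // 2               # parts are {0,...,half-1} and {half,...,n-1}
within_pairs = [p for part in (range(half), range(half, n))
                for p in itertools.combinations(part, 2)]

# Query budget from the proof: (1/2) * C(n/2, 2), i.e. a quarter of the within-part pairs.
queries = set(within_pairs[: len(within_pairs) // 4])

def sample_extra_edge():
    """Draw from mu: None means K_{n/2,n/2}; otherwise one uniform within-part edge is added."""
    if random.random() < 0.5:
        return None
    return random.choice(within_pairs)

def says_bipartite(extra_edge):
    """The fixed-query algorithm accepts iff no queried within-part pair is an edge."""
    return extra_edge not in queries

trials = 100_000
errors = 0
for _ in range(trials):
    extra = sample_extra_edge()
    truly_bipartite = (extra is None)
    if says_bipartite(extra) != truly_bipartite:
        errors += 1

print(errors / trials)  # roughly 1/2 * Pr[added edge not queried], about 3/8 here
```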

Here’s another example of the use of Lemma 1.

Theorem. $R_{1/3}(\mathrm{Planar}) = \Theta(n^2)$.

Proof. We again construct a hard distribution μ. This time,

A deterministic algorithm that queries only $\frac{1}{24}\binom{n}{2}$ vertex pairs will observe an edge in the second case with probability at most $\frac14$. By the same reasoning as in the last proof, any such algorithm must have error greater than $\frac13$ on μ.

There are many other examples of the use of Lemma 1 or similar variants in other models of computation, as we will see throughout the rest of the course.