4. Minimax Principle

One of the other foundational results in the study of randomized complexity theory is Yao's Minimax Principle. Before we can introduce it, let's examine distributional complexity measures in a bit more detail.

Distributional Complexity

We have already seen distributional complexity measures for randomized algorithms. Similar complexity measures can also be defined for deterministic algorithms. Let's define one such measure now.

Recall that the expected distributional error of a randomized algorithm R with respect to a target function f and the distribution μ on the domain of f is

$$\mathrm{err}_f^{\mu}(R) = \mathbb{E}_{x \sim \mu}\,\mathrm{err}_f(R, x).$$

The distributional error of a deterministic algorithm D is defined in the same way:

$$\mathrm{err}_f^{\mu}(D) = \mathbb{E}_{x \sim \mu}\,\mathrm{err}_f(D, x) = \Pr_{x \sim \mu}\left[ D(x) \neq f(x) \right].$$

The worst-case distributional complexity of a function f for error parameter ϵ and distribution μ is

$$D^{\mu}_{\epsilon}(f) = \min_{D \,:\, \mathrm{err}_f^{\mu}(D) \le \epsilon} \mathrm{cost}(D).$$

Note that in this definition, the error of D is measured on average over inputs drawn from the distribution μ but the cost of D is measured in the worst case over all possible inputs.
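As a quick illustration of these definitions, here is a toy computation in Python (an illustrative sketch; the function, algorithm, and distribution are hypothetical choices, not taken from the notes): the distributional error is just the probability of error over an input drawn from μ, while the cost would still be measured on a worst-case input.

```python
# Toy illustration (hypothetical example): the distributional error of a
# deterministic algorithm D with respect to f and mu is Pr_{x ~ mu}[D(x) != f(x)].

def distributional_error(D, f, mu):
    """Return Pr_{x ~ mu}[D(x) != f(x)] for a distribution mu on a finite domain."""
    return sum(p for x, p in mu.items() if D(x) != f(x))

# f is OR on two bits; D is the trivial algorithm that always answers 1.
f = lambda x: int(any(x))
D = lambda x: 1

# mu is the uniform distribution on {0,1}^2.
mu = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

print(distributional_error(D, f, mu))  # 0.25: D errs only on input (0, 0)
```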

The advantage of working with distributional complexity is that it is often easier to prove lower bounds on the cost of deterministic algorithms against random inputs than it is to argue about randomized algorithms directly. Better still, there is a very useful relation between randomized and distributional complexity which shows that distributional lower bounds also give lower bounds for randomized algorithms:

Lemma 1. For every function f and every ϵ>0,

$$R_{\epsilon}(f) \ge \max_{\mu} D^{\mu}_{\epsilon}(f).$$

Proof. Fix any distribution μ.

Since the set of randomized algorithms R with $\mathrm{err}_f^{\mu}(R) \le \epsilon$ is a superset of the set of randomized algorithms with error at most $\epsilon$ on all inputs,

$$R_{\epsilon}(f) = \min_{R \,:\, \mathrm{err}_f(R) \le \epsilon} \mathrm{cost}(R) \ge \min_{R \,:\, \mathrm{err}_f^{\mu}(R) \le \epsilon} \mathrm{cost}(R).$$

Now $\mathrm{err}_f^{\mu}(R) = \mathbb{E}_{D \sim R}\left[\mathrm{err}_f^{\mu}(D)\right]$, so there must always be a deterministic algorithm D in the support of R with error at most that of R. And the worst-case cost of R equals the maximum cost of any deterministic algorithm in its support. So

$$R_{\epsilon}(f) \ge \min_{D \,:\, \mathrm{err}_f^{\mu}(D) \le \epsilon} \mathrm{cost}(D) = D^{\mu}_{\epsilon}(f).$$

Since μ was an arbitrary distribution, the same bound holds for the maximum over all distributions μ.
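To spell out the averaging step used in the middle of this proof: writing R as a distribution over the deterministic algorithms in its support,

$$\mathrm{err}_f^{\mu}(R) = \sum_{D \in \mathrm{supp}(R)} \Pr[R = D]\,\mathrm{err}_f^{\mu}(D) \ge \min_{D \in \mathrm{supp}(R)} \mathrm{err}_f^{\mu}(D),$$

so any R with $\mathrm{err}_f^{\mu}(R) \le \epsilon$ has at least one deterministic algorithm D in its support with $\mathrm{err}_f^{\mu}(D) \le \epsilon$, and that D satisfies $\mathrm{cost}(D) \le \max_{D' \in \mathrm{supp}(R)} \mathrm{cost}(D') = \mathrm{cost}(R)$.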

This lemma and its related variants are often invoked under the name "Yao's Minimax Principle", but that name is not quite accurate; before we see some example uses of the lemma, let's first clarify what the Minimax Principle actually shows.

Yao’s Minimax Principle

The lemma above is often called "Yao's Minimax Principle", but it is more accurate to call it the "easy (or perhaps even trivial) direction of Yao's Minimax Principle".

The full Minimax Principle states something much stronger: the inequality in the last lemma can be replaced with equality. In other words, the method of proving lower bounds on randomized complexity measures by bounding the distributional complexity of the same function is complete, in the sense that there is always a hard distribution for which the distributional complexity matches the randomized complexity of the function.

Minimax Principle. (Worst-case randomized complexity variant) For every function f and every ϵ>0,

$$R_{\epsilon}(f) = \max_{\mu} D^{\mu}_{\epsilon}(f).$$

This Minimax Principle is only one of many (many!) examples of minimax theorems in different areas of mathematics. Generally, a minimax theorem establishes the identity

$$\min_{x \in X} \max_{y \in Y} h(x, y) = \max_{y \in Y} \min_{x \in X} h(x, y)$$

for some sets X, Y and function $h \colon X \times Y \to \mathbb{R}$. But one quick warning: to see how the Minimax Principle above can be stated (and proved) in this form, it doesn't suffice to just expand the definitions; we need to restate the equation in a different way. We'll postpone doing so until a bit later in the course.
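One familiar instance, included here only as an illustration of the general form, is von Neumann's minimax theorem for finite zero-sum games: for any payoff matrix $A \in \mathbb{R}^{m \times n}$,

$$\min_{p \in \Delta_m} \max_{q \in \Delta_n} p^{\top} A q = \max_{q \in \Delta_n} \min_{p \in \Delta_m} p^{\top} A q,$$

where $\Delta_m$ and $\Delta_n$ denote the sets of probability distributions (mixed strategies) over the two players' pure strategies.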

There are other variants of the Minimax Principle for randomized complexity theory. The following one was the original Minimax Principle, the one that Andrew Yao first established. It is concerned with the average complexity of zero-error randomized algorithms.

Minimax Principle. (Average-case zero-error randomized complexity variant) For every function f,

$$R_0(f) = \max_{\mu} D^{\mu}_0(f).$$

Let us see how this variant of the Minimax Principle is a minimax theorem. Let $D_f$ denote the set of deterministic algorithms that correctly compute f. Let $R_f$ denote the set of randomized algorithms that correctly compute f. Then

$$R_0(f) = \min_{R \in R_f} \max_{x} \mathrm{cost}(R, x) = \min_{R \in R_f} \max_{\mu} \mathrm{cost}(R, \mu)$$

where the last max is over all distributions on the domain of f; the two expressions are equal because a distribution can place all of its mass on a single worst-case input, while the μ-average cost of R can never exceed its worst-case cost. And

$$\max_{\mu} D^{\mu}_0(f) = \max_{\mu} \min_{D \in D_f} \mathrm{cost}(D, \mu) = \max_{\mu} \min_{R \in R_f} \mathrm{cost}(R, \mu).$$

Therefore, the Minimax Principle for average complexity of zero-error algorithms can be expressed as the minimax theorem

$$\min_{R \in R_f} \max_{\mu} \mathrm{cost}(R, \mu) = \max_{\mu} \min_{R \in R_f} \mathrm{cost}(R, \mu).$$

The last identity in the expression for $\max_{\mu} D^{\mu}_0(f)$ follows from the fact that the average cost of any randomized algorithm with respect to μ is at least as large as the minimum average cost of the deterministic algorithms in its support.
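To spell out this fact in symbols: for any zero-error randomized algorithm R, every deterministic algorithm in the support of R must itself compute f correctly on all inputs, so for every distribution μ,

$$\mathrm{cost}(R, \mu) = \mathbb{E}_{D \sim R}\left[\mathrm{cost}(D, \mu)\right] \ge \min_{D \in \mathrm{supp}(R)} \mathrm{cost}(D, \mu) \ge \min_{D \in D_f} \mathrm{cost}(D, \mu),$$

and since every algorithm in $D_f$ also belongs to $R_f$, the two minima over μ-average cost coincide.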

Note, however, that the Minimax Principle does not hold in all randomized models of computation. For example, we have the following “non-minimax” result.

Lemma 2. For every function f and every ϵ>0,

$$R_{\epsilon}(f) \ge \max_{\mu} D^{\mu}_{\epsilon}(f)$$

and there are choices of f and ϵ for which the inequality is strict.

Using the (Easy Direction of the) Minimax Principle

We can see some examples of how to use Lemma 1 to obtain lower bounds on the randomized complexity of functions by revisiting the evasiveness conjecture from the first lecture. Recall that in this setting, the cost of a deterministic algorithm on a graph G is the number of vertex pairs that it queries to learn whether or not the two vertices are connected by an edge in G. We can use Lemma 1 to obtain a simple bound on the randomized query complexity of determining whether a graph is bipartite.
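Before stating the first bound, here is a minimal sketch of this query model in Python (an illustrative toy, not part of the notes): the algorithm can access the graph only through an edge oracle, and its cost is the number of distinct vertex pairs it queries.

```python
# Minimal sketch (illustrative) of the query model for graph properties:
# the algorithm learns a graph G only through vertex-pair queries, and its
# cost is the number of distinct pairs it queries.

class EdgeOracle:
    def __init__(self, edges):
        self.edges = {frozenset(e) for e in edges}
        self.queried = set()

    def query(self, u, v):
        """Report whether {u, v} is an edge of G; each distinct pair counts once."""
        self.queried.add(frozenset((u, v)))
        return frozenset((u, v)) in self.edges

    @property
    def cost(self):
        return len(self.queried)

# Example: a triangle on vertices 0, 1, 2.
oracle = EdgeOracle([(0, 1), (1, 2), (0, 2)])
print(oracle.query(0, 1), oracle.query(0, 3))  # True False
print(oracle.cost)                             # 2
```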

Theorem. $R_{1/3}(\mathrm{Bipartite}) = \Theta(n^2)$.

Proof. By Lemma 1, it suffices to construct a distribution μ on graphs for which every deterministic algorithm that queries at most $\frac12\binom{n/2}{2}$ pairs of vertices must err with probability greater than $\frac13$ on graphs drawn from μ. (Since $\frac12\binom{n/2}{2} = \Omega(n^2)$ and any graph property can be decided by querying all $\binom{n}{2}$ vertex pairs, this gives the theorem.) Take μ to be the distribution where, with probability $\frac12$, the graph is the complete bipartite graph $K_{n/2,n/2}$ and, with probability $\frac12$, the graph is $K_{n/2,n/2}$ with one additional edge chosen uniformly at random among the $2\binom{n/2}{2}$ pairs of vertices that lie within one of its two parts.

In the first case, the graph drawn from μ is bipartite. In the second case, it is not. But consider any deterministic algorithm that queries at most $\frac12\binom{n/2}{2}$ vertex pairs before terminating. By the union bound, the probability in the second case that it queries the additional edge added within one of the two parts is at most

$$\frac{\frac12\binom{n/2}{2}}{2\binom{n/2}{2}} = \frac14.$$

Let's say that it accepts (or outputs "bipartite") with probability p in the event that it does not query any edge outside of the ones in the base $K_{n/2,n/2}$ graph. The error of the algorithm with respect to μ is then at least

$$\frac12(1 - p) + \frac12 \cdot \frac34\, p = \frac12 - \frac18 p \ge \frac38 > \frac13.$$
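As a sanity check on this calculation, here is a small Python simulation (an illustrative sketch with assumed parameters, not part of the notes). It samples graphs from μ and estimates the error of one natural deterministic strategy that queries a fixed set of $\frac12\binom{n/2}{2}$ within-part pairs and answers "bipartite" unless one of them turns out to be an edge; the estimate comes out near the $\frac38$ bound above.

```python
import itertools
import random

# Illustrative simulation (hypothetical strategy): estimate the error over mu of a
# deterministic algorithm that queries a fixed set of (1/2) * C(n/2, 2) within-part
# pairs and says "bipartite" iff none of them is an edge.

n = 20                      # number of vertices
half = n // 2               # parts are {0,...,half-1} and {half,...,n-1}
within_pairs = [p for part in (range(half), range(half, n))
                for p in itertools.combinations(part, 2)]

# Query budget from the proof: (1/2) * C(n/2, 2), i.e. a quarter of the within-part pairs.
queries = set(within_pairs[: len(within_pairs) // 4])

def sample_extra_edge():
    """Draw from mu: None means K_{n/2,n/2}; otherwise one uniform within-part edge is added."""
    if random.random() < 0.5:
        return None
    return random.choice(within_pairs)

def says_bipartite(extra_edge):
    """The fixed-query algorithm accepts iff no queried within-part pair is an edge."""
    return extra_edge not in queries

trials = 100_000
errors = 0
for _ in range(trials):
    extra = sample_extra_edge()
    truly_bipartite = (extra is None)
    if says_bipartite(extra) != truly_bipartite:
        errors += 1

print(errors / trials)  # roughly 1/2 * Pr[added edge not queried], about 3/8 here
```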

Here’s another example of the use of Lemma 1.

Theorem. $R_{1/3}(\mathrm{Planar}) = \Theta(n^2)$.

Proof. We again construct a hard distribution μ. This time,

A deterministic algorithm that queries only $\frac{1}{24}\binom{n}{2}$ vertex pairs will observe an edge in the second case with probability at most $\frac14$. By the same reasoning as in the last proof, any such algorithm must have error greater than $\frac13$ on μ.

There are many other examples of the use of Lemma 1 or similar variants in other models of computation, as we will see throughout the rest of the course.