3. Foundational Inequalities
The Inequalities
A number of inequalities are extremely useful in the study of randomized complexity classes. Three fundamental inequalities, however, stand out for their ubiquity.
The first is so basic that it almost seems unnecessary to list it here. But it is called upon so often that it does deserve highlighting:
Union Bound. For any two events \(A\) and \(B\),
\[\Pr[ A \vee B ] \le \Pr[A] + \Pr[B].\]More generally, for any finite or countable set of events \(A_1, A_2, A_3, \ldots\),
\[\Pr\left[ \bigvee_{n \ge 1} A_n \right] \le \sum_{n \ge 1} \Pr[ A_n ].\]
Proof. The first inequality follows from the identity
\[\Pr[ A \vee B ] = \Pr[A] + \Pr[ B \setminus A]\]and the inequality
\[\Pr[ B \setminus A ] = \Pr[ B \cap \overline{A} ] \le \Pr[ B ].\]Both of these facts follow directly from the axioms of probability, and the same argument extends to countably many events, yielding the second inequality.
Note that there are no conditions whatsoever on the events \(A_1,A_2,A_3,\ldots\) in the statement of the union bound. This is part of what makes it such a useful inequality; we can use it even when there are some complicated (and sometimes unknown) dependencies between these events.
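As a quick illustration, here is a small Python simulation of the union bound applied to two dependent events; the events themselves are arbitrary choices made just for this demonstration.

```python
import random

# Monte Carlo check of the union bound on two *dependent* events.
# Illustrative choices: A = {die roll is even}, B = {die roll is at
# least 4}; the events overlap, so they are certainly not independent.
TRIALS = 100_000
count_A = count_B = count_A_or_B = 0
for _ in range(TRIALS):
    roll = random.randint(1, 6)
    a = (roll % 2 == 0)
    b = (roll >= 4)
    count_A += a
    count_B += b
    count_A_or_B += (a or b)

pr_A, pr_B = count_A / TRIALS, count_B / TRIALS
pr_union = count_A_or_B / TRIALS
print(f"Pr[A or B] ~ {pr_union:.3f} <= Pr[A] + Pr[B] ~ {pr_A + pr_B:.3f}")
```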
The second main tool is Markov’s inequality.
Markov’s Inequality. For any non-negative random variable \(X\) and parameter \(t > 0\),
\[\Pr[ X \ge t ] \le \frac{\mathrm{E}[X]}{t}.\]
Proof. The expected value of \(X\) can be expressed as
\[\mathrm{E}[X] = \Pr[ X \ge t ] \cdot \mathrm{E}[X \mid X \ge t] + \Pr[ X < t ] \cdot \mathrm{E}[ X \mid X < t].\]The conditional expectation \(\mathrm{E}[X \mid X \ge t]\) must be at least \(t\), since the expectation of a random variable is at least as large as its minimum value. And since \(X\) is non-negative, \(\mathrm{E}[X \mid X < t] \ge 0\) for the same reason. Therefore,
\[\mathrm{E}[X] \ge \Pr[ X \ge t ] \cdot t + \Pr[X < t] \cdot 0.\]Re-arranging the terms yields Markov’s inequality.
Markov’s inequality is also extremely useful because it only requires non-negativity of \(X\). It works particularly well with linearity of expectation: for any random variables \(X_1,X_2,\ldots,X_n\), we always have that
\[\mathrm{E}\left[ \sum_{i=1}^n X_i \right] = \sum_{i=1}^n \mathrm{E}[ X_i ].\]So when \(X_1,\ldots,X_n\) are all non-negative, the probability that the sum \(X_1 + \cdots + X_n\) is much larger than its expected value can always be bounded by Markov’s inequality, even when the variables \(X_1,\ldots,X_n\) have some complicated dependency between them.
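As a quick sanity check, here is a small Python experiment applying Markov’s inequality to a sum of highly dependent non-negative variables; the distribution below is an arbitrary illustrative choice.

```python
import random

# Empirical check of Markov's inequality on a sum of *dependent*
# non-negative variables: here all n indicators equal the same coin
# flip, an extreme dependency chosen purely for illustration.
n, TRIALS = 10, 100_000
samples = []
for _ in range(TRIALS):
    coin = random.random() < 0.1      # one shared Bernoulli(0.1) flip
    samples.append(n * coin)          # X = X_1 + ... + X_n, all X_i = coin

mean = sum(samples) / TRIALS          # E[X] = n * 0.1 = 1
t = 5.0
tail = sum(x >= t for x in samples) / TRIALS
print(f"Pr[X >= {t}] ~ {tail:.3f} <= E[X]/t ~ {mean / t:.3f}")
```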
When we are considering the sum of independent random variables, another tool gives us much stronger concentration bounds. This tool has numerous variants and a few different names, including Hoeffding’s inequality and Chernoff bounds. Here is one standard variant of the inequality.
Hoeffding’s inequality. Let \(X_1,X_2,\ldots,X_n\) be independent random variables that each take values in \([0,1]\). Define \(X = \sum_{i=1}^n X_i\) and let \(\mu = \mathrm{E}[X]\) denote the expected value of this sum. Then for any \(t > 0\),
\[\Pr[ X > \mu + t ] \le e^{-2t^2/n}\]and
\[\Pr[ X < \mu - t ] \le e^{-2t^2/n}.\]
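To get a feel for how strong this concentration is, here is a small Python experiment (the parameters are arbitrary illustrative choices) comparing the empirical tail of a sum of independent coin flips against the Hoeffding bound.

```python
import math
import random

# Empirical check of Hoeffding's inequality for a sum of n independent
# Bernoulli(1/2) variables; n, t, and TRIALS are illustrative.
n, t, TRIALS = 100, 15, 20_000
mu = n / 2
exceed = 0
for _ in range(TRIALS):
    x = sum(random.random() < 0.5 for _ in range(n))
    exceed += (x > mu + t)

print(f"Pr[X > mu + {t}] ~ {exceed / TRIALS:.4f}")
print(f"Hoeffding bound:    {math.exp(-2 * t**2 / n):.4f}")
```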
Truncation
Let’s start to use our foundational inequalities to examine some basic questions in randomized complexity theory. As a first step, let’s compare worst-case and average-case randomized complexity of Boolean functions. We have already seen that there are randomized algorithms for which the worst-case cost of the algorithm is much larger than the average cost. Can there be a similar separation for the complexity of functions? In other words, does there exist a function \(f\) for which \(\overline{\mathrm{R}}(f) \ll \mathrm{R}(f)\)?
The answer, it turns out, is no for most natural models of computation. The only restriction is that the result applies to models in which, at each step of an algorithm’s execution, there is a well-defined notion of the cost incurred by the algorithm so far. Let us call models of computation where this condition holds odometered.
Theorem. In any odometered model of computation, for every function \(f\) and all pairs of parameters \(\epsilon \ge 0, \delta > 0\),
\[\mathrm{R}_{\epsilon + \delta}(f) \le \frac1\delta \cdot \overline{\mathrm{R}}_\epsilon(f).\]
Proof. Let \(R\) be a randomized algorithm that has error at most \(\epsilon\) and satisfies \(\overline{\mathrm{cost}}(R) = \overline{\mathrm{R}}_\epsilon(f)\). Let \(R'\) be the randomized algorithm obtained by truncating \(R\) whenever its cost exceeds \(\frac1\delta \overline{\mathrm{cost}}(R)\): at any step where \(R\) reaches this cost and has not yet halted, we abort the computation and guess the answer. By construction, \(R'\) has worst-case cost at most \(\frac1\delta \overline{\mathrm{cost}}(R)\). It remains to show that it computes \(f\) with error at most \(\epsilon + \delta\).
Fix any input \(x\). Let \(E\) denote the event that \(R'\) aborts and guesses the value of \(f\). By the Union Bound, the probability that \(R'\) errs is at most
\[\mathrm{err}_f(R, x) + \Pr[ E ] \le \epsilon + \Pr[E].\]And by Markov’s inequality, the probability that \(R'\) aborts is
\[\Pr[E] = \Pr_{D \sim R}\left[ \mathrm{cost}(D,x) \ge \frac1\delta \overline{\mathrm{cost}}(R) \right] \le \frac{\overline{\mathrm{cost}}(R,x)}{\frac1\delta \overline{\mathrm{cost}}(R)} \le \delta,\]where the last inequality uses the fact that \(\overline{\mathrm{cost}}(R,x) \le \overline{\mathrm{cost}}(R)\). Combining the two bounds completes the proof.
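Here is a sketch of the truncation construction in code. It is a minimal sketch assuming a hypothetical odometered interface, where a run exposes `step()` and `cost_so_far()` methods; these names are illustrative assumptions, not part of any formal model.

```python
import random

def truncated_run(run, avg_cost, delta):
    """Run an algorithm, but abort and guess once the cost incurred so
    far reaches (1/delta) * avg_cost, as in the truncation theorem.

    `run` is assumed to have a hypothetical odometered interface:
    step() advances one step and returns the output when the run halts
    (None otherwise), and cost_so_far() reports the cost incurred so
    far.  Both names are illustrative assumptions.
    """
    budget = avg_cost / delta
    while run.cost_so_far() < budget:
        output = run.step()
        if output is not None:       # the underlying algorithm halted
            return output
    return random.randint(0, 1)      # abort: guess the Boolean answer
```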
Success Amplification
With 1-sided error, it is easy to reduce the error probability exponentially fast. As an example, consider the \(\mathsf{EQ}\) function.
Theorem. \(\mathrm{R}_\epsilon^{\mathrm{pub}}(\mathsf{EQ}) = O( \log \tfrac1\epsilon)\) and \(\mathrm{R}_\epsilon^{\mathrm{pri}}(\mathsf{EQ}) = O( \log n \log \tfrac1\epsilon )\).
Proof. Repeat the 1-sided error protocol \(\log \frac1\epsilon\) times. Output 1 if and only if all instances of the protocol output 1. This protocol still has 1-sided error: it can err only by incorrectly outputting 1 when \(x \neq y\). But since the protocols are independent, the probability that this occurs is \(\underbrace{\frac12 \cdot \frac12 \cdot \cdots \cdot \frac12}_{\log 1/\epsilon} = 2^{-\log 1/\epsilon} = \epsilon\).
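In code, the repetition scheme looks like the following sketch. Here `run_eq_protocol` is a hypothetical stand-in for a 1-sided error equality protocol: it always outputs 1 when \(x = y\), and outputs 1 with probability at most \(\frac12\) when \(x \neq y\).

```python
import math

def amplified_eq(x, y, run_eq_protocol, eps):
    """Amplify a 1-sided error equality protocol.

    `run_eq_protocol` is a hypothetical stand-in: it must output 1
    whenever x == y, and output 1 with probability at most 1/2 when
    x != y.  Repeating it ceil(log2(1/eps)) times and taking the AND
    of the answers drives the (1-sided) error below eps.
    """
    repetitions = math.ceil(math.log2(1 / eps))
    return int(all(run_eq_protocol(x, y) == 1 for _ in range(repetitions)))
```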
This success amplification of algorithms with 1-sided error is also what is used in the analysis of Karger’s algorithm. In that setting, the original error of \(1 - \binom{n}{2}^{-1}\) is very close to 1, but by repeating the algorithm enough times we still get the desired error reduction. The success amplification theorem in this case looks like this:
Theorem. For every \(\delta > 0\), \(\mathrm{R}_{\frac12,0}(f) \le \frac1\delta \mathrm{R}_{1-\delta,0}(f).\)
Proof. The probability that all of the \(1/\delta\) independent instances of the algorithm err is at most
\[\underbrace{(1-\delta) \cdots (1-\delta)}_{1/\delta} = (1-\delta)^{1/\delta} \le (e^{-\delta})^{1/\delta} = 1/e,\]where we used the remarkably useful inequality \(1-x \le e^{-x}\) that holds for all \(x \in \mathbb{R}\). Since \(1/e < \frac12\), the repeated algorithm has error at most \(\frac12\).
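A quick numeric sanity check of the inequality \((1-\delta)^{1/\delta} \le 1/e\), with arbitrary sample values of \(\delta\):

```python
import math

# Numeric check of (1 - delta)^(1/delta) <= 1/e for a few deltas;
# the sampled values of delta are arbitrary.
for delta in (0.5, 0.1, 0.01, 0.001):
    lhs = (1 - delta) ** (1 / delta)
    print(f"delta={delta}: (1-delta)^(1/delta) = {lhs:.5f}"
          f" <= 1/e = {1 / math.e:.5f}")
```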
What about the setting where algorithms have 2-sided error? We can obtain the same qualitative success amplification results using Hoeffding’s inequality.
Theorem. For every \(\epsilon > 0\), \(\mathrm{R}_{\epsilon}(f) \le O \big( \log \tfrac1\epsilon \cdot \mathrm{R}_{\frac13}(f) \big).\)
Proof. Let \(A\) be a randomized algorithm with error at most \(\frac13\) and cost \(\mathrm{R}_{\frac13}(f)\). Consider the algorithm \(B\) that runs \(k\) independent instances of \(A\) and outputs the majority answer, for some value of \(k\) that we will fix later. Let \(X_1, \ldots, X_k\) be random variables where \(X_i = 1\) if the \(i\)th instance of \(A\) errs and \(X_i = 0\) otherwise. The variables are independent and all take values in the range \([0,1]\). By linearity of expectation, the expected value of their sum \(X\) is
\[\mu := \mathrm{E}[X] = \sum_i \mathrm{E}[X_i] \le \frac k3.\]The algorithm \(B\) is wrong only if \(X > \frac k2\). By Hoeffding’s inequality, this occurs with probability
\[\Pr\left[ X > \frac k2 \right] \le \Pr\left[ X - \mu > \frac k6 \right] \le e^{-2(k/6)^2/k} = e^{-k/18}.\]When \(k = \lceil 18 \ln \frac1\epsilon \rceil = \Theta(\log \frac 1\epsilon)\), this error probability is at most \(\epsilon\).
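Finally, here is a sketch of the majority-vote amplification from the proof, with `run_A` as a hypothetical stand-in for an algorithm with error at most \(\frac13\) on every input.

```python
import math
import random

def amplified(run_A, x, eps):
    """Majority-vote amplification of a 2-sided error algorithm.

    `run_A` is a hypothetical stand-in for a randomized algorithm with
    error at most 1/3 on every input.  Taking the majority answer of
    k = ceil(18 ln(1/eps)) independent runs brings the error below eps,
    per the theorem above.
    """
    k = math.ceil(18 * math.log(1 / eps))
    ones = sum(run_A(x) for _ in range(k))
    return int(ones > k / 2)

# Toy usage: a "noisy" algorithm whose correct answer is 1 but which
# returns 1 only with probability 2/3 (so its error is exactly 1/3).
noisy = lambda x: int(random.random() < 2 / 3)
print(amplified(noisy, None, eps=0.01))  # outputs 1 except w.p. <= 0.01
```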