# Concentration Inequalities

In the next lectures, we will study randomized algorithms. When evaluating the performance of a randomized algorithm, it is not enough to analyze its average-case performance. In other words, it is not enough to know that our algorithm runs in expected time $T$; what we want to say is that, with high probability, the algorithm runs in time close to $T$.

That is, we not only need to analyze the *expected running time* of our algorithm, but also need to show *concentration* of the running time around the expected value (that is, the *typical* running time will be close to the expected running time).

To do the above, we will make use of **concentration inequalities**.
More precisely, today we will study the **Markov inequality**, the **Chebyshev inequality** and the **Chernoff-Hoeffding inequality**.

## Markov Inequality

**Theorem (Markov’s inequality):** Let $X$ be a discrete, non-negative random variable.
Then, for any $t > 0$, we have
$$ \Pr[X \geq t] \leq \frac{\mathbb{E}[X]}{t}.$$

**Proof:** Note that
$$ \mathbb{E}[X] = \sum_{j \geq 0} j \cdot \Pr[X = j] \geq \sum_{j \geq t} j \cdot \Pr[X = j] \geq \sum_{j \geq t} t \cdot \Pr[X = j] = t \cdot \Pr[X \geq t]. $$

Markov’s inequality is a very simple inequality, but it is very useful. It is most useful when we have very little information about the distribution of $X$ (more precisely, we only need non-negativity and we need to know the expected value). However, as we will see soon, if we have more information about our random variables, we can get better concentration inequalities.

Applications of Markov’s inequality:

- **quicksort:** the expected running time of quicksort is $2n \log n$. Markov’s inequality implies that the probability that quicksort takes longer than $2c \cdot n \log n$ is at most $1/c$.
- **coin flipping:** the expected number of heads in $n$ (unbiased) coin flips is $n/2$. Markov’s inequality tells us that $\Pr[ \text{\# heads } \geq 3n/4] \leq 2/3$.

We will see that we can get better concentration inequalities for the coin flipping example, if we know that the coin flips are independent (so we will have more information about our random variable).
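As a quick sanity check on the coin-flipping example, the following sketch (the values of `n` and the number of trials are arbitrary illustrative choices) estimates $\Pr[\text{\# heads} \geq 3n/4]$ by simulation and compares it to the Markov bound of $2/3$:

```python
import random

def empirical_tail(n, trials, seed=0):
    """Estimate Pr[# heads >= 3n/4] over repeated experiments of n fair flips."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = sum(rng.randint(0, 1) for _ in range(n))
        if heads >= 3 * n / 4:
            hits += 1
    return hits / trials

n, trials = 20, 10_000
p_hat = empirical_tail(n, trials)
markov_bound = (n / 2) / (3 * n / 4)  # E[X] / t = 2/3
print(p_hat, markov_bound)
```

The empirical tail probability comes out far below $2/3$, illustrating that Markov's bound, while valid, is quite loose here.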

## Moments of Probability Distributions

To get better concentration inequalities, we will need to know more (than the expected value) about our random variables. For instance, how do we distinguish between the following two probability distributions?

- $X$ is the random variable defined by
$\Pr[X = i] =
\begin{cases}
1/n, & \text{if } 1 \leq i \leq n \\
0, & \text{otherwise.}
\end{cases}$
- $Y$ is the random variable that takes value $1$ with probability $1/2$ and value $n$ with probability $1/2$.

They have the same expectation, but they are very different random variables.

To get more information on our random variables, we will define the **moments** of a random variable.
Moments tell us how much the random variable deviates from its mean.

- The **$k^{th}$ moment** of a random variable $X$ is defined as $\mathbb{E}[X^k]$.
- The **$k^{th}$ central moment** of a random variable $X$ is defined as $\mu_X^{(k)} := \mathbb{E}[(X - \mathbb{E}[X])^k]$.

Note that the expected value is the first moment, and the first central moment is $0$.

Now, in our example above, we have that $\mathbb{E}[X] = \mathbb{E}[Y] = (n+1)/2$.
So they are equal in the first moment.
However, looking at the *second moments*, we have that
$$\mathbb{E}[X^2] = n(n+1)(2n+1)/(6n) = (n+1)(2n+1)/6$$
and
$$\mathbb{E}[Y^2] = 1/2 + n^2/2 = (n^2+1)/2.$$
So, we can see that the higher moments are able to distinguish between the two random variables.
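These moment computations can be verified exactly with rational arithmetic (a sketch; $n = 10$ is an arbitrary choice):

```python
from fractions import Fraction

def moments_X(n):
    # X is uniform on {1, ..., n}
    mean = Fraction(sum(range(1, n + 1)), n)
    second = Fraction(sum(i * i for i in range(1, n + 1)), n)
    return mean, second

def moments_Y(n):
    # Y is 1 with probability 1/2 and n with probability 1/2
    mean = Fraction(1 + n, 2)
    second = Fraction(1 + n * n, 2)
    return mean, second

n = 10
mx, sx = moments_X(n)
my, sy = moments_Y(n)
print(mx, my)  # first moments agree: (n+1)/2
print(sx, sy)  # second moments differ
```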

We will now see that the higher moments are also useful to get better concentration inequalities.

## Chebyshev Inequality

One particularly useful moment is the **variance** of a random variable, which is the second central moment.
So we define $Var[X] := \mathbb{E}[(X - \mathbb{E}[X])^2]$.
With information about both the expected value and the variance, we can get a better concentration inequality: Chebyshev’s inequality.

**Theorem (Chebyshev’s inequality):** Let $X$ be a discrete random variable with finite variance.
Then, for any $t > 0$, we have
$$ \Pr[|X - \mathbb{E}[X]| \geq t] \leq \frac{\text{Var}[X]}{t^2}.$$

**Proof:** Let $Z = (X - \mathbb{E}[X])^2$.
$Z$ is a non-negative random variable, so we can apply Markov’s inequality to $Z$.
Then, we have that
$$ \Pr[|X - \mathbb{E}[X]| \geq t] = \Pr[Z \geq t^2] \leq \frac{\mathbb{E}[Z]}{t^2} = \frac{\text{Var}[X]}{t^2}.$$

An important measure of correlation between two random variables is the **covariance**.
It is defined as
$$Cov[X,Y] := \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])].$$
Note that $Cov[X,X] = Var[X]$.

We say that two random variables $X$ and $Y$ are **positively correlated** if $Cov[X,Y] > 0$ and **negatively correlated** if $Cov[X,Y] < 0$.
We say that two random variables are **uncorrelated** if $Cov[X,Y] = 0$.

*Remark:* Note that independent random variables are uncorrelated, but uncorrelated random variables are not necessarily independent.

**Proposition:** Let $X$ and $Y$ be two random variables. Then,

- $Var[X+Y] = Var[ X ] + Var[Y] + 2 Cov[X,Y]$.
- If $X$ and $Y$ are uncorrelated, then $Var[X+Y] = Var[ X ] + Var[Y]$.

Now that we learned about the covariance, we can apply it to our coin flipping process.

**coin flipping:** let $X$ be the number of heads in $n$ unbiased coin flips. We can describe the $i^{th}$ coin flip by the random variable
$$X_i = \begin{cases} 1, & \text{if the } i^{th} \text{ coin flip is heads} \\ 0, & \text{otherwise.} \end{cases}$$

Since the coin flips are independent, they are also uncorrelated. Thus, by the above proposition, we have that $Var[X] = \sum_{i=1}^n Var[X_i] = n/4$. So, by Chebyshev’s inequality, we have that $$ \Pr[X \geq 3n/4] \leq \Pr[|X - \mathbb{E}[X]| \geq n/4] \leq \frac{n/4}{(n/4)^2} = \frac{4}{n}.$$
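To see how much Chebyshev improves on Markov for this example, a small sketch (values of $n$ chosen for illustration) evaluates both bounds on $\Pr[X \geq 3n/4]$:

```python
def markov_bound(n):
    # Pr[X >= 3n/4] <= E[X] / (3n/4) = 2/3, independent of n
    return (n / 2) / (3 * n / 4)

def chebyshev_bound(n):
    # Pr[|X - n/2| >= n/4] <= Var[X] / (n/4)^2 = (n/4) / (n/4)^2 = 4/n
    return (n / 4) / (n / 4) ** 2

for n in [10, 100, 1000]:
    print(n, markov_bound(n), chebyshev_bound(n))
```

Markov’s bound stays at $2/3$ for every $n$, while Chebyshev’s bound $4/n$ tends to $0$: knowing the variance buys a bound that improves with $n$.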

**Practice problem:** can you generalize Chebyshev’s inequality to higher moments?

## Chernoff-Hoeffding Inequality

Oftentimes in algorithm analysis, we deal with random variables that are sums of independent random variables (distinct elements, balls and bins, etc.). Can we get a better concentration inequality for these types of random variables?

The law of large numbers tells us that the average of a large number of independent, identically distributed, random variables is close to the expected value. Chernoff’s inequality tells us how likely it is for the average to be far from the expected value.

**Theorem (Chernoff inequality):** Let $X_1, \ldots, X_n$ be independent random variables such that $X_i \in \{0,1\}$ for all $i \in [n]$.
Let $X = \sum_{i=1}^n X_i$ and $\mu = \mathbb{E}[X]$. Then, for $0 < \delta < 1$,
$$ \Pr[X \geq (1+\delta)\mu] \leq e^{-\frac{\delta^2 \mu}{3}}$$
also,
$$ \Pr[X \leq (1-\delta)\mu] \leq e^{-\frac{\delta^2 \mu}{2}}.$$

**Proof:** We will prove the first inequality.
The proof of the lower tail bound is similar.

Let $p_i := \Pr[X_i = 1]$, and thus $1-p_i = \Pr[X_i = 0]$ and $\mu = \sum_{i=1}^n p_i$.

The idea of the proof is to apply Markov’s inequality to the random variable $e^{tX}$.
Since the exponential function is increasing, we have
$$\Pr[X \geq a] = \Pr[e^{tX} \geq e^{ta}] \leq \mathbb{E}[e^{tX}]/e^{ta}, \text{ for any } t > 0.$$
What do we gain by doing this?
When we look at the exponential function, we are using information about **all** the moments of $X$.
This is because the Taylor series of $e^{tX}$ is
$$ e^{tX} = \sum_{k=0}^\infty \frac{(tX)^k}{k!} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \ldots$$
In particular, we define the **moment generating function** of $X$ as
$$ M_X(t) := \mathbb{E}[e^{tX}] = \sum_{k=0}^\infty \frac{t^k \mathbb{E}[X^k]}{k!}.$$
If $X = X_1 + X_2$, where $X_1$ and $X_2$ are independent, then $M_X(t) = M_{X_1}(t) M_{X_2}(t)$.
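This product property is easy to verify numerically for two independent Bernoulli variables (a sketch; the probabilities and the value of $t$ below are arbitrary illustrative choices):

```python
import math

def mgf_bernoulli(p, t):
    # M_{X_i}(t) = E[e^{t X_i}] = p * e^t + (1 - p)
    return p * math.exp(t) + (1 - p)

def mgf_sum_exact(p1, p2, t):
    # E[e^{t(X1 + X2)}] by enumerating the four joint outcomes
    total = 0.0
    for x1, q1 in [(1, p1), (0, 1 - p1)]:
        for x2, q2 in [(1, p2), (0, 1 - p2)]:
            total += q1 * q2 * math.exp(t * (x1 + x2))
    return total

t, p1, p2 = 0.7, 0.3, 0.6
lhs = mgf_sum_exact(p1, p2, t)
rhs = mgf_bernoulli(p1, t) * mgf_bernoulli(p2, t)
print(lhs, rhs)  # equal up to floating-point error
```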

Now, let’s apply Markov’s inequality to $e^{tX}$. We have that
$$ \Pr[X \geq (1+\delta)\mu] = \Pr[e^{tX} \geq e^{t(1+\delta)\mu}] \leq \frac{\mathbb{E}[e^{tX}]}{e^{t(1+\delta)\mu}}.$$
By the above, and independence of the $X_i$’s, we have
$$\mathbb{E}[e^{tX}] = \prod_{i=1}^n \mathbb{E}[e^{t X_i}] = \prod_{i=1}^n \left(p_i \cdot e^t + (1-p_i) \cdot 1 \right). $$
Since $p_i \cdot e^t + (1-p_i) \cdot 1 = 1 + p_i (e^t - 1) \leq e^{p_i (e^t -1)}$, as $e^x \geq 1 + x$ for all $x$, we have
$$ \dfrac{\mathbb{E}[e^{tX}]}{e^{t(1+\delta) \mu}} \leq \dfrac{1}{e^{t(1+\delta) \mu}} \cdot \prod_{i=1}^n e^{p_i (e^t -1)} = \left( \dfrac{e^{e^t-1}}{e^{t (1+\delta)}} \right)^\mu \leq \left( \dfrac{e^\delta}{(1+\delta)^{1+\delta}} \right)^\mu, $$
where in the last inequality we plugged in $t = \ln(1 + \delta)$.

The main inequality now follows from the fact that $e^\delta/(1+\delta)^{1+\delta} \leq e^{-\delta^2/3}$ for all $0 < \delta < 1$. To see this, we can use the Taylor series $\ln(1+\delta) = \delta - \delta^2/2 + \delta^3/3 - \cdots$ for $\delta \in (0,1)$. Multiplying through by $(1+\delta)$, we get
$$(1+\delta) \ln(1+\delta) = \delta + \frac{\delta^2}{2} - \frac{\delta^3}{6} + \frac{\delta^4}{12} - \cdots$$
Then,
$$ \frac{e^\delta}{(1+\delta)^{1+\delta}} = \exp\big(\delta - (1+\delta) \ln(1+\delta)\big) = \exp\left(-\frac{\delta^2}{2} + \frac{\delta^3}{6} - \frac{\delta^4}{12} + \cdots\right) $$
$$ \leq e^{-\delta^2/3} \cdot \exp\left(- \frac{\delta^2}{6} + \frac{\delta^3}{6}\right) \leq e^{-\delta^2/3}, $$
since $-\delta^2/6 + \delta^3/6 \leq 0$ for $\delta \leq 1$.
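For the coin-flipping example ($\mu = n/2$ and $\delta = 1/2$), the upper tail bound becomes $e^{-n/24}$. The following sketch (values of $n$ are illustrative; $n$ is kept small so the exact binomial tail can be computed by brute force) compares the bound against the exact tail probability:

```python
import math

def chernoff_upper(n, delta=0.5):
    # Upper tail bound e^{-delta^2 * mu / 3} with mu = n/2 for fair coins
    mu = n / 2
    return math.exp(-delta ** 2 * mu / 3)

def exact_tail(n):
    # Exact Pr[X >= 3n/4] for X ~ Binomial(n, 1/2), by summing the pmf
    k0 = math.ceil(3 * n / 4)
    total = sum(math.comb(n, k) for k in range(k0, n + 1))
    return total / 2 ** n

for n in [24, 48, 96]:
    print(n, exact_tail(n), chernoff_upper(n))
```

Both columns decay exponentially in $n$, in contrast with Chebyshev’s $4/n$ bound from the previous section.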

**Theorem (Hoeffding’s inequality):** Let $X_1, \ldots, X_n$ be independent random variables such that $X_i \in [a_i,b_i]$ for all $i \in [n]$.
Let $X = \sum_{i=1}^n X_i$ and $\mu = \mathbb{E}[X]$. Then, for any $\ell > 0$, we have
$$ \Pr[|X - \mu |\geq \ell] \leq 2 \cdot \exp\left(-\frac{2\ell^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

**Proof (sketch):** the proof is similar to the proof of Chernoff’s inequality: we again apply Markov’s inequality to $e^{tX}$, but bound the moment generating function using Hoeffding’s lemma instead of the Bernoulli-specific estimate above.

Hoeffding’s lemma states that if $Z$ is a random variable such that $Z \in [a,b]$, then $$ \mathbb{E}[e^{t(Z - \mathbb{E}[Z])}] \leq e^{t^2(b-a)^2/8}.$$
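Applied to $n$ fair coin flips ($X_i \in [0,1]$, so $\sum_i (b_i - a_i)^2 = n$) with deviation $\ell = n/4$, Hoeffding’s inequality gives a bound of $2e^{-n/8}$, exponentially smaller than Chebyshev’s $4/n$. A small sketch (the value of $n$ is illustrative):

```python
import math

def hoeffding_bound(n, ell):
    # X_i in [0,1] for each flip, so sum_i (b_i - a_i)^2 = n
    return 2 * math.exp(-2 * ell ** 2 / n)

def chebyshev_bound(n, ell):
    # Var[X] = n/4 for n fair coin flips
    return (n / 4) / ell ** 2

n = 100
ell = n / 4
print(hoeffding_bound(n, ell), chebyshev_bound(n, ell))
```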