# Balls and Bins

In this lecture, we will analyse a random process called **balls and bins**, which underlies several randomized algorithms, from data structures (hashing) and graph algorithms (graph sparsification) to routing in parallel computers, among many others.

**Setup (balls and bins):** We have $m$ balls and $n$ bins. We throw each ball into a bin chosen uniformly at random. Each throw is independent of the others.
We are interested in the following questions:

- What is the expected number of balls in a bin?
- What is the expected number of empty bins?
- What is “typically” the maximum number of balls in a bin?
- What is the expected number of bins with at least $k$ balls?
- For what value of $m$ do we expect to have no empty bins? (coupon collector problem)

### First question: what is the expected number of balls in a bin?

Let us label the balls $1,2,\ldots,m$ and the bins $1,2,\ldots,n$. Let $B_{ij}$ be the indicator variable that ball $i$ lands in bin $j$. Then, the number of balls in bin $j$ is given by $X_j = \sum_{i=1}^m B_{ij}$. Thus, the expected number of balls in bin $j$ is given by $$ \mathbb{E}[X_j] = \sum_{i=1}^m \mathbb{E}[B_{ij}] = \sum_{i=1}^m \frac{1}{n} = \frac{m}{n}.$$ When $m = n$, we expect one ball per bin. How often will this actually happen?
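As a quick sanity check, here is a minimal simulation sketch in Python (the function name `bin_loads` and the parameter values are our own choices, not from the lecture): it estimates $\mathbb{E}[X_j]$ for a fixed bin by averaging over many independent trials.

```python
import random

def bin_loads(m, n):
    """Throw m balls into n bins uniformly at random; return the load of each bin."""
    loads = [0] * n
    for _ in range(m):
        loads[random.randrange(n)] += 1
    return loads

# Estimate E[X_0], the load of a fixed bin, by averaging over many trials.
m, n, trials = 100, 50, 10_000
avg = sum(bin_loads(m, n)[0] for _ in range(trials)) / trials
print(avg, m / n)  # the empirical average should be close to m/n = 2
```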

### Second question: what is the expected number of empty bins?

Let $N_i$ be the indicator variable that bin $i$ is empty. Then, the number of empty bins is given by $\sum_{i=1}^n N_i$. Thus, the expected number of empty bins is given by $$ \mathbb{E}\left[\sum_{i=1}^n N_i\right] = \sum_{i=1}^n \mathbb{E}[N_i] = \sum_{i=1}^n \left(1 - \frac{1}{n}\right)^m \approx n e^{-m/n}.$$ When $m = n$, the expected number of empty bins is approximately $n/e$.

**Head scratching moment:** When $m=n$, we expect one ball per bin, but we also expect $n/e$ empty bins. How can this be?
Which expectation should I actually “expect” to see?
As we mentioned in the previous lecture, this is where *concentration of probability measure* comes in.
It turns out that the second random variable (the number of empty bins) is concentrated around its mean, and thus its expectation is a good indicator of the actual number of empty bins (i.e., it is the “typical” situation).
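We can observe this concentration empirically. Below is a short simulation sketch (the function name `empty_bins` and the parameter values are our own): for $m = n$, repeated runs of the process produce empty-bin counts that all lie very close to $n/e$.

```python
import math
import random

def empty_bins(m, n):
    """Throw m balls into n bins; count the bins that remain empty."""
    hit = [False] * n
    for _ in range(m):
        hit[random.randrange(n)] = True
    return hit.count(False)

n = 10**5
samples = [empty_bins(n, n) for _ in range(20)]
print(sorted(samples))    # the samples cluster tightly...
print(round(n / math.e))  # ...around the mean n/e ≈ 36788
```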

### Third question: what is “typically” the maximum number of balls in a bin?

As we saw in the previous question, “typical” is related to concentration of probability measure.

Let us first see a simpler problem, known as the *birthday paradox*.
For what value of $m$ do we expect to see a collision (i.e., two balls in the same bin)?

The probability that there is no collision is given by $$ 1 \cdot (1 -1/n) \cdot (1 - 2/n) \cdots (1 - (m-1)/n) \leq \prod_{i=1}^{m-1} e^{-i/n} = e^{-m(m-1)/2n} $$ using $1 - x \leq e^{-x}$. This bound is $\leq 1/2$ when $m(m-1) \geq 2n \ln 2$, i.e., roughly when $m \approx \sqrt{2n \ln 2}$. For $n = 365$, we have $m \approx 23$.

Thus, we expect to see a collision when $m = \Theta(\sqrt{n})$. This appears in several places, such as hashing, graph sparsification, factoring, and many others.
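A quick simulation makes the $n = 365$ case concrete (a sketch; the function name `has_collision` is ours): with $m = 23$ balls, the empirical collision probability should come out close to $1/2$.

```python
import math
import random

def has_collision(m, n):
    """Throw m balls into n bins; report whether any bin receives two balls."""
    seen = set()
    for _ in range(m):
        b = random.randrange(n)
        if b in seen:
            return True
        seen.add(b)
    return False

n = 365
m = math.ceil(math.sqrt(2 * n * math.log(2)))  # m = 23 for n = 365
trials = 10_000
p = sum(has_collision(m, n) for _ in range(trials)) / trials
print(m, p)  # the empirical collision probability should be around 1/2
```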

Now, let us return to our original question: what is “typically” the maximum number of balls in a bin? We will address this question when $m = n$.

What is the probability that bin $j$ has at least $k$ balls? This is given by $$\Pr[X_j \geq k] = \Pr[\text{at least } k \text{ balls land in bin } j] \leq \binom{n}{k} \cdot \left(\frac{1}{n}\right)^k \leq \left( \dfrac{ne}{k} \right)^k \cdot \dfrac{1}{n^k} = \dfrac{e^k}{k^k}$$ where the first inequality follows from the union bound over all $\binom{n}{k}$ sets of $k$ balls, and the second from the standard estimate $\binom{n}{k} \leq (ne/k)^k$.

By the union bound once again, $\Pr[\text{some bin has } \geq k \text{ balls}] \leq n \cdot \dfrac{e^k}{k^k} = e^{\ln n + k - k \ln k}$.

Thus, $\Pr[\text{max load is } < k] = 1 - \Pr[\text{some bin has } \geq k \text{ balls}] \geq 1 - e^{\ln n + k - k \ln k}$. When will this probability be large (say $\gg 1/2$)? The answer is when $k \ln k - k \gg \ln n$. Setting $k = 3 \dfrac{\ln n}{\ln \ln n}$ does it: then $k \ln k \approx 3 \ln n$, so the exponent is roughly $\ln n + k - 3 \ln n \leq -\ln n$ for large $n$, and the probability is at least $1 - 1/n$. Hence, with high probability, the maximum load is $O\left(\dfrac{\ln n}{\ln \ln n}\right)$.
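The bound is easy to check in simulation. Here is a sketch (the function name `max_load` and the trial counts are our own choices): for $n = 10^5$, the observed maximum load should sit comfortably below $3 \ln n / \ln \ln n \approx 14$.

```python
import math
import random

def max_load(n):
    """Throw n balls into n bins uniformly at random; return the maximum load."""
    loads = [0] * n
    for _ in range(n):
        loads[random.randrange(n)] += 1
    return max(loads)

n = 10**5
bound = 3 * math.log(n) / math.log(math.log(n))
print(max(max_load(n) for _ in range(10)), bound)
# the observed max load (typically 6-8 here) stays below the bound (~14.1)
```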

**Remark:** The quantity above comes up in hashing and in the analysis of approximation algorithms (for instance, the best known approximation ratio for congestion minimization).

### Coupon Collector Problem

We now address the question: for what value of $m$ do we expect to have no empty bins?

But first, let us see why this is called the *coupon collector problem*.
We can formulate the problem in the following way: we have $n$ different types of coupons (the bins), and we want to collect all of them.
We buy one coupon at a time (like Kinder eggs or packs of trading cards), and each coupon is equally likely to be any of the $n$ types.
How many coupons do we need to buy to collect all $n$ coupons?

Let $X_i$ be the number of balls thrown to get from $i+1$ empty bins to $i$ empty bins. Then, $X = \sum_{i=0}^{n-1} X_i$ is the number of balls thrown to get from $n$ empty bins to $0$ empty bins. By linearity of expectation, we have $\mathbb{E}[X] = \sum_{i=0}^{n-1} \mathbb{E}[X_i]$. But what is $\mathbb{E}[X_i]$?

Note that when $i+1$ bins are empty, each ball lands in an empty bin with probability $\frac{i+1}{n}$, so $X_i$ is a geometric random variable with parameter $p = \frac{i+1}{n}$. Thus, $\Pr[X_i = k] = (1-p)^{k-1} p$, which implies that $\mathbb{E}[X_i] = \frac{1}{p} = \frac{n}{i+1}$. Now, we have $$\mathbb{E}[X] = \sum_{i=0}^{n-1} \frac{n}{i+1} = n \sum_{j=1}^{n} \frac{1}{j} = n H_n \approx n \ln n.$$
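Here is a short simulation sketch of the coupon collector process (the function name `coupons_needed` is ours): the empirical average should match $n H_n \approx n \ln n$ closely.

```python
import random

def coupons_needed(n):
    """Buy uniformly random coupons until all n types are collected."""
    collected, bought = set(), 0
    while len(collected) < n:
        collected.add(random.randrange(n))
        bought += 1
    return bought

n = 10**4
trials = 100
avg = sum(coupons_needed(n) for _ in range(trials)) / trials
print(avg, n * sum(1 / k for k in range(1, n + 1)))  # empirical mean vs. n * H_n
```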

This $n \ln n$ bound shows up in several places, such as the cover time of random walks on the complete graph, the number of edges needed in graph sparsification, and many others.

### Power of Two Choices

We now know that when $n$ balls are thrown into $n$ bins, the maximum load is $O\left(\dfrac{\ln n}{\ln \ln n}\right)$ with high probability.
Consider the following variant of the problem: what if when throwing a ball, we choose *two bins* uniformly at random, and then throw the ball into the bin with the *least number* of balls?
It turns out that this simple modification improves the maximum load to $O(\ln \ln n)$ with high probability.

Intuition/idea: let the height of a bin be the number of balls in it. In the process above, for a ball to reach height $h+1$, both of its chosen bins must already have height at least $h$. We can bound the number of bins with height $\geq h$, as that will tell us how likely it is to get a bin with height $h+1$.

If $N_h$ is the number of bins with height $\geq h$, then for any fixed ball,

$$ \Pr[\text{the ball is placed at height} \geq h+1] \leq \left( \dfrac{N_h}{n} \right)^2, $$

since both of its chosen bins must be among the $N_h$ bins of height $\geq h$.

Now, to bound $N_h$, here is a bit more intuition: say we have only $n/4$ bins with at least $4$ balls (i.e., height $\geq 4$). Then the probability that a ball's two chosen bins both have height $\geq 4$ is $\leq 1/16$, so we should expect only about $n/16$ bins with height $\geq 5$. Analogously, we should expect only $n/16^2 = n/256 = n/2^{2^3}$ bins with height $\geq 6$. Repeating this, we should expect only $n/2^{2^{h-3}}$ bins with height $\geq h$, which drops below $1$ once $h \approx \log_2 \log_2 n + 3$. So we expect $O(\log \log n)$ maximum height after throwing $n$ balls.
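The improvement is easy to see in simulation. Below is a sketch comparing the two processes (the function name `max_load` and the choice of $n$ are ours):

```python
import random

def max_load(n, d):
    """Throw n balls into n bins; each ball samples d bins uniformly at
    random and goes into whichever currently has the fewest balls."""
    loads = [0] * n
    for _ in range(n):
        candidates = [random.randrange(n) for _ in range(d)]
        target = min(candidates, key=lambda b: loads[b])
        loads[target] += 1
    return max(loads)

n = 10**5
print(max_load(n, 1))  # one choice:  Theta(ln n / ln ln n), typically 6-8 here
print(max_load(n, 2))  # two choices: Theta(ln ln n), typically 4 here
```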

To turn the above intuition into a proof, see either [Mitzenmacher & Upfal, Chapter 14] or Prof. Lau’s notes.