Balls and Bins
In this lecture, we will analyse a random process called balls and bins, which underlies several randomized algorithms, ranging from data structures (hashing) to graph algorithms (graph sparsification), routing in parallel computers, and many others.
Setup (balls and bins): We have $m$ balls and $n$ bins. We throw each ball into a bin chosen uniformly at random. Each throw is independent of the others. We are interested in the following questions:
- What is the expected number of balls in a bin?
- What is the expected number of empty bins?
- What is “typically” the maximum number of balls in a bin?
- What is the expected number of bins with at least $k$ balls?
- For what value of $m$ do we expect to have no empty bins? (coupon collector problem)
First question: what is the expected number of balls in a bin?
Let us label the balls $1,2,\ldots,m$ and the bins $1,2,\ldots,n$. Let $B_{ij}$ be the indicator variable that ball $i$ lands in bin $j$. Then, the number of balls in bin $j$ is given by $X_j = \sum_{i=1}^m B_{ij}$. By linearity of expectation, the expected number of balls in bin $j$ is $$ \mathbb{E}[X_j] = \sum_{i=1}^m \mathbb{E}[B_{ij}] = \sum_{i=1}^m \frac{1}{n} = \frac{m}{n}.$$ When $m = n$, the expected load is one ball per bin. How often will a bin actually hold exactly one ball?
Second question: what is the expected number of empty bins?
Let $N_i$ be the indicator variable that bin $i$ is empty. Bin $i$ is empty exactly when all $m$ balls miss it, so $\mathbb{E}[N_i] = \Pr[N_i = 1] = \left(1 - \frac{1}{n}\right)^m$. The number of empty bins is $\sum_{i=1}^n N_i$, so the expected number of empty bins is $$ \mathbb{E}\left[\sum_{i=1}^n N_i\right] = \sum_{i=1}^n \mathbb{E}[N_i] = \sum_{i=1}^n \left(1 - \frac{1}{n}\right)^m \approx n e^{-m/n}.$$ When $m = n$, the expected number of empty bins is approximately $n/e$.
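The formula above can be checked empirically. The following is a minimal simulation sketch, not part of the original lecture; the function names, the seed, and the choice $n = m = 10{,}000$ are illustrative assumptions.

```python
import math
import random


def count_empty_bins(m, n, rng):
    """Throw m balls into n bins uniformly at random; return the number of empty bins."""
    loads = [0] * n
    for _ in range(m):
        loads[rng.randrange(n)] += 1
    return sum(1 for load in loads if load == 0)


def expected_empty_bins(m, n):
    """Exact expectation n * (1 - 1/n)^m obtained from linearity of expectation."""
    return n * (1 - 1 / n) ** m


rng = random.Random(0)  # fixed seed (arbitrary) so the run is reproducible
n = m = 10_000          # illustrative sizes, not taken from the lecture
trials = 20
average = sum(count_empty_bins(m, n, rng) for _ in range(trials)) / trials

# Empirical average, exact expectation, and the n/e approximation side by side.
print(average, expected_empty_bins(m, n), n / math.e)
```

The three printed numbers should be close to one another, which is the concentration phenomenon discussed next.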
Head-scratching moment: when $m=n$, we expect one ball per bin, but we also expect about $n/e$ empty bins. How can this be? Which expectation should I actually “expect” to see? Both expectations are correct: an average load of one does not mean that every bin holds exactly one ball. As we mentioned in the previous lecture, this is where concentration of probability measure comes in. It turns out that the second random variable (the number of empty bins) is concentrated around its mean, and thus the expected number of empty bins is a good indicator of the actual number of empty bins (i.e., it is the “typical” situation).
Third question: what is “typically” the maximum number of balls in a bin?
As we saw in the previous question, “typical” is related to concentration of probability measure.
Let us first see a simpler problem, known as the birthday paradox. For what value of $m$ do we expect to see a collision (i.e., two balls in the same bin)?
The probability that there is no collision is given by $$ 1 \cdot (1 -1/n) \cdot (1 - 2/n) \cdots (1 - (m-1)/n) \leq \prod_{i=1}^{m-1} e^{-i/n} = e^{-m(m-1)/2n}, $$ where we used $1 - x \leq e^{-x}$. This is $\leq 1/2$ once $m(m-1) \geq 2n \ln 2$, that is, roughly when $m \approx \sqrt{2n \ln 2}$. For $n = 365$, we have $m \approx 23$.
Thus, we expect to see a collision when $m = \Theta(\sqrt{n})$. This appears in several places, such as hashing, graph sparsification, factoring, and many others.
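The threshold can also be computed exactly rather than through the exponential bound. The sketch below is an illustration with hypothetical helper names; it evaluates the no-collision product directly and searches for the smallest $m$ at which a collision is more likely than not.

```python
import math


def no_collision_probability(m, n):
    """Exact probability that m balls thrown into n bins land in distinct bins."""
    prob = 1.0
    for i in range(m):
        prob *= 1 - i / n
    return prob


def smallest_m_with_likely_collision(n):
    """Smallest m for which the no-collision probability is at most 1/2."""
    m = 1
    while no_collision_probability(m, n) > 0.5:
        m += 1
    return m


n = 365
print(smallest_m_with_likely_collision(n))  # 23
print(math.sqrt(2 * n * math.log(2)))       # the estimate sqrt(2 n ln 2) from the bound
```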
Now, let us return to our original question: what is “typically” the maximum number of balls in a bin? We will address this question when $m = n$.
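What is the probability that bin $j$ has at least $k$ balls? This is bounded by $$\Pr[X_j \geq k] = \Pr[\text{at least } k \text{ balls land in bin } j] \leq \binom{n}{k} \cdot \left(\frac{1}{n}\right)^k \leq \left( \dfrac{ne}{k} \right)^k \cdot \dfrac{1}{n^k} = \dfrac{e^k}{k^k}$$ where the first inequality follows from the union bound over all $\binom{n}{k}$ sets of $k$ balls (a fixed set of $k$ balls lands entirely in bin $j$ with probability $1/n^k$), and the second uses $\binom{n}{k} \leq (ne/k)^k$.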
By the union bound once again, $\Pr[\text{some bin has } \geq k \text{ balls}] \leq n \cdot \dfrac{e^k}{k^k} = e^{\ln n + k - k \ln k}$.
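Thus, $\Pr[\text{max load is } < k] = 1 - \Pr[\text{some bin has } \geq k \text{ balls}] \geq 1 - e^{\ln n + k - k \ln k}$. When will this probability be large (say $\gg 1/2$)? The answer is when $k \ln k - k$ exceeds $\ln n$ by a growing margin. Setting $k = 3 \dfrac{\ln n}{\ln \ln n}$ does it: then $k \ln k = 3 \ln n \, (1 - o(1))$ and $k = o(\ln n)$, so for large $n$ the exponent is at most $-\ln n$ and the failure probability is at most $1/n$. Hence, with high probability, the maximum load is $O\left(\dfrac{\ln n}{\ln \ln n}\right)$.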
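To get a feeling for the bound, one can simulate the process and compare the observed maximum load with $3 \ln n / \ln \ln n$. This is a minimal sketch under assumed parameters (the seed and the values of $n$ are arbitrary), with a hypothetical helper name.

```python
import math
import random


def max_load(n, rng):
    """Throw n balls into n bins uniformly at random; return the largest bin load."""
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return max(loads)


rng = random.Random(1)  # fixed seed (arbitrary) for reproducibility
for n in (1_000, 10_000, 100_000):
    bound = 3 * math.log(n) / math.log(math.log(n))
    # Observed maximum load next to the bound 3 ln n / ln ln n.
    print(n, max_load(n, rng), round(bound, 2))
```

In such runs the observed maximum is usually well below the bound, which is expected since the constant $3$ was chosen for convenience rather than tightness.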
Remark: the quantity above comes up in hashing and in the analysis of approximation algorithms (for instance, the best known approximation ratio for congestion minimization).
Coupon Collector Problem
We now address the question: for what value of $m$ do we expect to have no empty bins?
But first, let us see why this is called the coupon collector problem. We can formulate the problem in the following way: we have $n$ different types of coupons (the bins), and we want to collect all of them. We buy one coupon at a time (like Kinder eggs or packs of collectible cards), and each coupon is equally likely to be any of the $n$ types. How many coupons do we need to buy to collect all $n$ types?
Let $X_i$ be the number of balls thrown to get from $i+1$ empty bins to $i$ empty bins. Then, $X = \sum_{i=0}^{n-1} X_i$ is the number of balls thrown to get from $n$ empty bins to $0$ empty bins. By linearity of expectation, we have $\mathbb{E}[X] = \sum_{i=0}^{n-1} \mathbb{E}[X_i]$. But what is $\mathbb{E}[X_i]$?
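Note that while there are $i+1$ empty bins, each ball lands in an empty bin with probability $\frac{i+1}{n}$. Hence $X_i$ is a geometric random variable with parameter $p = \frac{i+1}{n}$. Thus, $\Pr[X_i = k] = (1-p)^{k-1} p$, which implies that $\mathbb{E}[X_i] = \frac{1}{p} = \frac{n}{i+1}$. Now, we have $$\mathbb{E}[X] = \sum_{i=0}^{n-1} \frac{n}{i+1} = n \sum_{j=1}^{n} \frac{1}{j} = n H_n \approx n \ln n,$$ where $H_n$ is the $n$-th harmonic number.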
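The expectation $n H_n$, with $H_n = 1 + \frac{1}{2} + \cdots + \frac{1}{n}$, can be compared against a direct simulation. The sketch below is illustrative only; the helper names, the seed, and the choice $n = 100$ are assumptions.

```python
import random


def throws_until_no_empty_bin(n, rng):
    """Throw balls uniformly at random until every one of the n bins is non-empty."""
    seen = set()
    throws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        throws += 1
    return throws


def harmonic_expectation(n):
    """Exact expected number of throws, n * H_n."""
    return n * sum(1 / j for j in range(1, n + 1))


rng = random.Random(2)  # fixed seed (arbitrary) for reproducibility
n = 100
trials = 200
average = sum(throws_until_no_empty_bin(n, rng) for _ in range(trials)) / trials

# Empirical average next to the exact value n * H_n.
print(average, harmonic_expectation(n))
```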
This $n \ln n$ bound shows up in several places, such as cover time of random walks in the complete graph, number of edges needed in graph sparsification and many others.
Power of Two Choices
We now know that when $n$ balls are thrown into $n$ bins, the maximum load is $O\left(\dfrac{\ln n}{\ln \ln n}\right)$ with high probability. Consider the following variant of the problem: what if, when throwing a ball, we choose two bins uniformly at random and then throw the ball into the bin with fewer balls? It turns out that this simple modification improves the maximum load to $O(\ln \ln n)$ with high probability.
Intuition/idea: let the height of a bin be the number of balls in it. The process above tells us that a ball can create a bin of height $h+1$ only if both bins it chose already have height at least $h$. We can bound the number of bins with height $\geq h$, as that will tell us how likely it is to get a bin with height $h+1$.
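If $N_h$ is the number of bins with height $\geq h$ at the end of the process, then a given ball lands at height $\geq h+1$ only if both of its choices have height $\geq h$, so we have
$$ \Pr[\text{a given ball lands at height } \geq h+1] \leq \left( \dfrac{N_h}{n} \right)^2 $$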
Now, to bound $N_h$, here is a bit more intuition: at most $n/4$ bins can have height $\geq 4$, since there are only $n$ balls. Then the probability that a ball selects two such bins is $\leq 1/16$. So, we should expect at most $n/16$ bins with height $\geq 5$. Analogously, we should expect at most $n/16^2 = n/256 = n/2^{2^3}$ bins with height $\geq 6$. Repeating this, we should expect at most $n/2^{2^{h-3}}$ bins with height $\geq h$. This quantity drops below $1$ once $h > \log_2 \log_2 n + 3$, so we expect a maximum height of $O(\log \log n)$ after throwing $n$ balls.
To turn the above intuition into a proof, see either [Mitzenmacher & Upfal, Chapter 14] or Prof. Lau’s notes.
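The effect of the second choice is easy to observe experimentally. The following sketch, with a hypothetical helper name and arbitrary seed and sizes, generalizes the process to $d$ choices per ball so that $d = 1$ recovers the original process and $d = 2$ gives the variant above. Ties are broken in favour of the first bin chosen, which is an assumption of this sketch.

```python
import random


def max_load_d_choices(n, d, rng):
    """Throw n balls into n bins; each ball samples d bins uniformly at random
    (with replacement) and goes to the least loaded one. Returns the max load."""
    loads = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(d)]
        # min() returns the first minimal element, so ties go to the first choice.
        target = min(candidates, key=loads.__getitem__)
        loads[target] += 1
    return max(loads)


rng = random.Random(3)  # fixed seed (arbitrary) for reproducibility
for n in (1_000, 10_000, 100_000):
    one = max_load_d_choices(n, 1, rng)
    two = max_load_d_choices(n, 2, rng)
    # Maximum load with one choice versus two choices.
    print(n, one, two)
```

In typical runs the two-choice maximum grows far more slowly with $n$ than the one-choice maximum, in line with the $O(\ln \ln n)$ versus $O(\ln n / \ln \ln n)$ bounds.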