Hashing
Motivation
Hashing is a very useful technique for designing data structures. It is used in many applications, including:
- efficient dictionaries for searching in a database
- data streaming
- derandomization
- cryptography
- complexity theory
and many more.
Model of computation
Before we talk about hashing, we need to state our model of computation.
Model of computation (Word RAM model): in the Word RAM model, we assume that:
- all elements are integers that fit in a machine word of $w$ bits, and each memory cell can store one machine word;
- all operations on a word (comparison, bit operations, arithmetic) take unit time;
- all memory accesses take unit time.
Hashing
Given a universe $U := \{0, 1, \ldots, m-1\}$ of size $m$, we want to store a set $S \subseteq U$ of size $\ell$, where $\ell \ll m$, in a data structure that supports "as efficiently as possible" the following operations: insertion, deletion, search.
We will assume that $m < 2^w$, so that each element of the universe fits in a machine word.
Naive approach: use an array $A$ of size $m$, where $A[i] = 1$ if $i \in S$ and $A[i] = 0$ otherwise. This approach is fast: insertion, deletion, and search all take $\Theta(1)$ time. However, the space needed is $\Theta(m)$, which is far too large.
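For reference, a minimal sketch of the naive approach in Python:

```python
class BitArrayTable:
    """Naive dictionary: one cell per element of the universe U = {0, ..., m-1}."""

    def __init__(self, m):
        self.A = [0] * m            # Theta(m) space, regardless of |S|

    def insert(self, x):
        self.A[x] = 1               # Theta(1) time

    def delete(self, x):
        self.A[x] = 0               # Theta(1) time

    def search(self, x):
        return self.A[x] == 1       # Theta(1) time
```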
We would like to achieve optimal memory ($O(\ell)$) as well as optimal time. We will see that hashing achieves this. A hashing scheme is composed of two parts:
- A hash function is a function $h : U \rightarrow [0, n-1]$.
- A hash table is a data structure that consists of an array $T$ of size $n$ (i.e., $n$ cells) and a hash function $h$. The function $h$ maps elements of $U$ (keys) to cells of $T$.
Thus, we would like to construct an efficient hash table $T$ with $n = O(\ell)$ cells. So we will assume from now on that $n \ll m$ as well.
Challenges in hashing: ideally, we would like to map different keys to different cells in the hash table. However, as we will see, this is not always possible: collisions are unavoidable.
Collision: two keys $x, y \in U$ such that $x \neq y$ and $h(x) = h(y)$.
How will we handle collisions? For (most of) this lecture, we will use a technique called chaining to handle collisions. In chaining, each cell of the hash table $T$ is a pointer to a linked list of keys that hash to that cell.
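To make chaining concrete, here is a minimal sketch of a chained hash table in Python (the hash function `h` is a parameter; choosing it well is the subject of the rest of the lecture):

```python
class ChainedHashTable:
    """Hash table with collisions resolved by chaining.

    h maps keys of U to cells {0, ..., n-1}; each cell holds a Python
    list, playing the role of the linked list of keys hashing there.
    """

    def __init__(self, n, h):
        self.h = h
        self.table = [[] for _ in range(n)]

    def insert(self, x):
        cell = self.table[self.h(x)]
        if x not in cell:           # store each key at most once
            cell.append(x)

    def delete(self, x):
        cell = self.table[self.h(x)]
        if x in cell:
            cell.remove(x)

    def search(self, x):
        return x in self.table[self.h(x)]
```

Each operation costs one evaluation of $h$ plus time proportional to the length of the list in cell $h(x)$, so everything hinges on keeping these lists short.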
By the pigeonhole principle, since $m > n$, every hash function maps some two keys of $U$ to the same cell, so collision-free hashing is impossible without knowing the keys in advance. Therefore, we need to deal with collisions. Since we are resolving collisions with chaining, we would like to have as few collisions as possible, so that the linked lists are short. Hence, we will settle for having few collisions with high probability.
Random functions: ideally, we would pick $h : U \rightarrow [0, n-1]$ uniformly at random among all such functions. Then the values $h(0), \ldots, h(m-1)$ are uniform and fully independent, so for any $x \neq y$ we have $\Pr[h(x) = h(y)] = 1/n$, and collisions are rare. The computational problem: storing a truly random function means storing $h(x)$ for every $x \in U$, which takes $\Theta(m \log n)$ bits, even worse than the naive array (see the lecture slides for a fuller discussion).
$k$-wise independence
To fix the computational problem with random functions, we will use a weaker notion of randomness, called $k$-wise independence.
Definition (Full Independence): a set of random variables $X_1, \ldots, X_n$ is fully independent if $$ \forall \alpha_1, \ldots, \alpha_n, \ \ \ \Pr[X_{1} = \alpha_{1}, \ldots, X_{n} = \alpha_{n}] = \prod_{j=1}^n \Pr[X_{j} = \alpha_{j}].$$ Equivalently, for any $k \in [n]$ and any distinct $i_1, \ldots, i_k \in [n]$, the random variables $X_{i_1}, \ldots, X_{i_k}$ are independent.
A weaker notion of independence is $k$-wise independence, where we only require that any subset of $k$ random variables are independent. The formal definition is as follows:
Definition ($k$-wise independence): a set of random variables $X_1, \ldots, X_n$ is $k$-wise independent if for any distinct $i_1, \ldots, i_k \in [n]$, the random variables $X_{i_1}, \ldots, X_{i_k}$ are independent. That is, $$ \forall \alpha_1, \ldots, \alpha_k, \ \ \ \Pr[X_{i_1} = \alpha_{1}, \ldots, X_{i_k} = \alpha_{k}] = \prod_{j=1}^k \Pr[X_{i_j} = \alpha_{j}].$$
Let us now see some examples of $k$-wise independent random variables.
When $k = 1$, any set of random variables is $1$-wise independent.
When $k = 2$, we call the random variables pairwise independent. We will now construct a couple of examples of pairwise independent random variables.
Example 1 (XOR of random variables): let $Y_1, \ldots, Y_t$ be uniformly distributed, independent random variables in $\{0, 1\}$. For each nonempty subset $\emptyset \neq S \subseteq [t]$, let $X_S := \bigoplus_{i \in S} Y_i$. Then, the random variables $X_S$ are uniformly distributed and pairwise independent.
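Here is an exhaustive sanity check of Example 1 in Python (a sketch, feasible only for small $t$): it enumerates all $2^t$ outcomes of $(Y_1, \ldots, Y_t)$ and verifies that every pair $(X_S, X_T)$ with $S \neq T$ is uniform on $\{0,1\}^2$, which is exactly uniformity plus pairwise independence.

```python
import itertools

t = 3
subsets = [S for r in range(1, t + 1)
           for S in itertools.combinations(range(t), r)]

def x_val(S, y):
    # X_S = XOR of the bits y_i for i in S
    return sum(y[i] for i in S) % 2

for S, T in itertools.combinations(subsets, 2):
    counts = {}
    for y in itertools.product((0, 1), repeat=t):
        pair = (x_val(S, y), x_val(T, y))
        counts[pair] = counts.get(pair, 0) + 1
    # uniform on {0,1}^2: each of the 4 value pairs occurs 2^t / 4 times
    assert all(c == 2 ** t // 4 for c in counts.values()), (S, T)
print("all pairs (X_S, X_T) are uniform and independent")
```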
Example 2 (Pairwise independence of lines in $\mathbb{F}_p^2$): let $p$ be a prime number, and let $Y_1, Y_2 \in \mathbb{F}_p$ be uniformly distributed, independent random variables. For each $i \in [0, p-1]$, let $X_i := (Y_1 \cdot i + Y_2) \bmod p$. Then, the random variables $X_i \in \mathbb{F}_p$ are uniformly distributed and pairwise independent.
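The analogous exhaustive check for Example 2, over all $p^2$ choices of $(Y_1, Y_2)$ (a sketch with a small prime):

```python
import itertools

p = 5
for i, j in itertools.combinations(range(p), 2):
    counts = {}
    for y1, y2 in itertools.product(range(p), repeat=2):
        pair = ((y1 * i + y2) % p, (y1 * j + y2) % p)
        counts[pair] = counts.get(pair, 0) + 1
    # uniform on F_p x F_p: each value pair occurs for exactly one (y1, y2)
    assert len(counts) == p * p and all(c == 1 for c in counts.values())
print("all pairs (X_i, X_j), i != j, are uniform and independent")
```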
Universal hash families
To achieve the goal of obtaining hash functions that “look random” and are efficiently computable, we will construct a special (small) family of functions $\mathcal{H}$, and we will pick a random hash function $h$ from $\mathcal{H}$. We will show that, with high probability, the number of collisions is small when we pick a random $h \in \mathcal{H}$.
The simplest version of the goal that we are trying to achieve is the following: $$ \forall x \neq y \in U, \ \ \ \Pr_{h \in_R \mathcal{H}}[h(x) = h(y)] \leq \dfrac{1}{\textsf{poly}(n)} $$
Assumptions:
- we will assume that the keys are independent of the choice of the hash function $h$;
- we do not know the keys in advance. (Even if we did, it is a non-trivial problem to find a hash function that maps each key to a different cell, as we will see at the end.)
We would also like to have the property that any hash function $h \in \mathcal{H}$ is efficiently computable, since the time to compute $h(x)$ contributes to the running time of the data structure. So we would like to be able to compute $h(x)$ in $O(1)$ time.
Formalizing the above, we have the following definitions:
Definition (Universal hash family): a family of hash functions $\mathcal{H} := \{ h : U \rightarrow [0,n-1] \}$ is $k$-universal if for any distinct elements $x_1, x_2, \ldots, x_k \in U$, we have $$ \Pr_{h \in_R \mathcal{H}}[h(x_1) = h(x_2) = \cdots = h(x_k)] \leq \dfrac{1}{n^{k-1}}. $$
Definition (Strongly universal hash family): a family of hash functions $\mathcal{H} := \{ h : U \rightarrow [0,n-1] \}$ is strongly $k$-universal if for any distinct elements $x_1, x_2, \ldots, x_k \in U$ and any $y_1, y_2, \ldots, y_k \in [0,n-1]$, we have $$ \Pr_{h \in_R \mathcal{H}}[h(x_1) = y_1, h(x_2) = y_2, \ldots, h(x_k) = y_k] \leq \dfrac{1}{n^{k}}. $$
It turns out that strong universality is an (almost) equivalent concept to $k$-wise independence. Formally, we have: a family of hash functions $\mathcal{H} := \{ h : U \rightarrow [0,n-1] \}$ is strongly $k$-universal if and only if the random variables $h(0), \ldots, h(|U|-1)$ are uniformly random and $k$-wise independent.
An upshot of the above equivalence is that we can construct strongly universal hash families using the same techniques that we used to construct $k$-wise independent random variables.
For instance, we get the following examples of strongly universal hash families:
Example 3: let $p$ be a prime number, and $\mathcal{H}$ be a family of functions from $\mathbb{F}_p$ to $\mathbb{F}_p$ defined by $$ \mathcal{H} := \{ h_{a,b}(x) = (ax + b) \bmod p \mid a, b \in \mathbb{F}_p \} $$ Then, $\mathcal{H}$ is strongly $2$-universal.
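A sketch of sampling from this family, plus an exhaustive check of strong $2$-universality for one (arbitrarily chosen) pair of keys: for fixed $x \neq y$ and any targets $(y_1, y_2)$, exactly one of the $p^2$ pairs $(a, b)$ solves $h_{a,b}(x) = y_1$ and $h_{a,b}(y) = y_2$, so the probability is exactly $1/p^2$.

```python
import itertools, random

p = 7  # a small prime, so the exhaustive check is fast

def sample_h(p):
    """Pick h_{a,b}(x) = (a*x + b) mod p uniformly from the family."""
    a, b = random.randrange(p), random.randrange(p)
    return lambda x: (a * x + b) % p

x, y = 2, 5  # any two distinct keys
for y1, y2 in itertools.product(range(p), repeat=2):
    hits = sum(1 for a in range(p) for b in range(p)
               if (a * x + b) % p == y1 and (a * y + b) % p == y2)
    assert hits == 1  # probability over (a, b) is exactly 1/p^2
```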
If we want to make the domain larger than the image of the maps, we can use the following example:
Example 4: let $p$ be a prime number, and $\mathcal{H}$ be a family of functions from $\mathbb{F}_p$ to $[0, n-1]$ defined by $$ \mathcal{H} := \{ h_{a,b}(x) = ((ax + b) \bmod p) \bmod n \mid a, b \in \mathbb{F}_p, \ a \neq 0 \} $$ Then, $\mathcal{H}$ is $2$-universal.
Remark: example 4 is not strongly $2$-universal. As a practice problem, prove this fact.
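In code, the only changes relative to Example 3 are the restriction $a \neq 0$ and the final reduction modulo $n$ (a sketch):

```python
import random

def sample_h(p, n):
    """Pick h_{a,b}(x) = ((a*x + b) mod p) mod n with a != 0."""
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: ((a * x + b) % p) % n
```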
Another way to increase the domain is to use the following example:
Example 5: let $p$ be a prime number, and $\mathcal{H}$ be a family of functions from $\mathbb{F}_p^t$ to $\mathbb{F}_p$ defined by $$ \mathcal{H} := \left\{ h_{\vec{a},b}(x) = \left( \sum_{i=1}^t a_i x_i + b \right) \bmod p \;\middle|\; \vec{a} \in \mathbb{F}_p^t, b \in \mathbb{F}_p \right\} $$ Then, $\mathcal{H}$ is strongly $2$-universal.
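A sketch of sampling from this family; one natural use (my reading, not stated above) is hashing a key longer than one word by splitting it into $t$ chunks, each viewed as an element of $\mathbb{F}_p$:

```python
import random

def sample_h(p, t):
    """Pick h_{a,b}(x) = (sum_i a_i * x_i + b) mod p for x in F_p^t."""
    a = [random.randrange(p) for _ in range(t)]
    b = random.randrange(p)
    return lambda x: (sum(ai * xi for ai, xi in zip(a, x)) + b) % p
```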
So far we have only seen examples of (strongly) $2$-universal hash families. It turns out that we can construct (strongly) $k$-universal hash families for any $k$. One way to do this is to generalize the above example of lines in $\mathbb{F}_p^2$ to curves of degree $k-1$ in $\mathbb{F}_p^2$, i.e., graphs of degree-$(k-1)$ polynomials in $\mathbb{F}_p[x]$. For $2$-universality, we took lines (i.e., polynomials of degree $1$).
Practice problem: generalize the above example of lines in $\mathbb{F}_p^2$ to curves of degree $k-1$ in $\mathbb{F}_p^2$, and show that they are strongly $k$-universal.
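A sketch of this construction (proving strong $k$-universality is the exercise): the hash function is a uniformly random polynomial of degree at most $k-1$ over $\mathbb{F}_p$, evaluated via Horner's rule in $O(k)$ word operations.

```python
import random

def sample_poly_hash(p, k):
    """Pick h(x) = (a_0 + a_1*x + ... + a_{k-1}*x^{k-1}) mod p."""
    coeffs = [random.randrange(p) for _ in range(k)]

    def h(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % p
        return acc

    return h
```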
Note that the (strongly) $2$-universal hash families that we have seen so far are efficiently computable. Thus, we can use these families to select our hash functions!
When $\mathcal{H}$ is a $2$-universal hash family of maps from $U$ to $[0, n-1]$, what is the expected number of collisions when we pick a random hash function $h \in \mathcal{H}$?
Lemma 1: let our set of keys $S$ be of size $\ell$, and let $\mathcal{H}$ be a $2$-universal hash family of maps from $U$ to $[0, n-1]$. The expected number of collisions when we pick a random hash function $h \in \mathcal{H}$ is at most $\ell^2/2n$.
Proof: write $S = \{x_1, \ldots, x_\ell\}$, and let $X_{ij}$ be the indicator random variable for the event that $h(x_i) = h(x_j)$. Then, the number of collisions is $X := \sum_{i < j} X_{ij}$. By linearity of expectation, we have $$ \mathbb{E}[X] = \sum_{i < j} \mathbb{E}[X_{ij}] = \sum_{i < j} \Pr_{h \in_R \mathcal{H}}[h(x_i) = h(x_j)] \leq \sum_{i < j} \dfrac{1}{n} = \dfrac{\ell(\ell-1)}{2n} \leq \dfrac{\ell^2}{2n}. $$
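A small simulation illustrating Lemma 1, using the family of Example 4 (a sketch; the parameters $p$, $n$, $\ell$ are arbitrary choices):

```python
import random

p, n, ell = 10_007, 100, 50           # p prime; hypothetical parameters
keys = random.sample(range(p), ell)   # an arbitrary key set S

def num_collisions(h):
    return sum(1 for i in range(ell) for j in range(i + 1, ell)
               if h(keys[i]) == h(keys[j]))

trials, total = 1000, 0
for _ in range(trials):
    a, b = random.randrange(1, p), random.randrange(p)
    h = lambda x, a=a, b=b: ((a * x + b) % p) % n
    total += num_collisions(h)

# the empirical average should come out at or below the bound
print("average number of collisions:", total / trials)
print("Lemma 1 bound ell^2 / 2n:    ", ell ** 2 / (2 * n))
```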
By combining the above lemma with Markov’s inequality, we get the following corollary:
Corollary 1: let our set of keys $S$ be of size $\ell$, and let $\mathcal{H}$ be a $2$-universal hash family of maps from $U$ to $[0, n-1]$. With probability $\geq 1/2$, the number of collisions is $\leq \ell^2/n$ and the maximum load of any entry of the hash table when we pick a random hash function $h \in \mathcal{H}$ is at most $\sqrt{2\ell^2/n}$.
In particular, when $\ell \approx n$, the maximum load is at most $\approx \sqrt{2n}$ with probability $\geq 1/2$.
Proof: let $C$ be the maximum load of any entry of the hash table, and let $X$ be the number of collisions, just as in Lemma 1. The keys in the fullest cell alone contribute $\binom{C}{2}$ colliding pairs, so $C^2/2 \sim \binom{C}{2} \leq X$. Thus, we have $$ \Pr[C \geq \sqrt{2\ell^2/n}] = \Pr[C^2/2 \geq \ell^2/n] \leq \Pr[X \geq \ell^2/n] \leq \dfrac{\mathbb{E}[X]}{\ell^2/n} \leq \dfrac{\ell^2}{2n} \cdot \dfrac{n}{\ell^2} = \dfrac{1}{2}. $$
Perfect Hashing
Suppose that we are given (in advance) a set of keys $S$ of size $n$. This is the static setting. Our goal now is to (efficiently, and with high probability) construct a hash table with no collisions and $O(n)$ memory. Can we do this with a $2$-universal hash family?
The answer is yes, but we will need to use two layers of hash tables.
For the first layer, we will use a hash table with $n$ entries. Let $\mathcal{H}$ be a $2$-universal hash family of maps from $U$ to $[0, n-1]$. Then, if we pick a random hash function $h \in \mathcal{H}$, by Corollary 1, with probability $\geq 1/2$ the maximum load of any entry is at most $\sqrt{2n}$. Now, since we have the keys, we can check whether the maximum load is $\leq \sqrt{2n}$. If it is not, we simply pick another random hash function $h \in \mathcal{H}$ and try again. Each try succeeds with probability $\geq 1/2$, so the expected number of tries is at most $2$, and after $t$ tries the failure probability is at most $2^{-t}$.
Now that we have a hash function $h$ whose maximum load is $\leq \sqrt{2n}$, we can construct the second layer of hash tables. For each entry $i \in [0, n-1]$, we will construct a hash table with $n_i^2$ entries, where $n_i$ is the number of keys that hash to $i$.
Since $n_i$ is the load of entry $i$, the number of collisions within entry $i$ is $n_i(n_i-1)/2$. By Corollary 1 (with $\ell = n$), with probability $\geq 1/2$ the total number of collisions is $\leq \ell^2/n = n$, i.e., $\sum_{i=0}^{n-1} n_i(n_i-1)/2 \leq n$. Since $\sum_{i=0}^{n-1} n_i = n$, this gives $$\sum_{i=0}^{n-1} n_i^2 \leq 2n + \sum_{i=0}^{n-1} n_i = 3n = O(n).$$
Let $\mathcal{H}_i$ be a $2$-universal hash family of maps from $U$ to $[0, n_i^2-1]$. Then, for each entry $i \in [0, n-1]$, if we pick a random hash function $h_i \in \mathcal{H}_i$, by Corollary 1, with probability $\geq 1/2$ the maximum load of any entry is $\leq \sqrt{2}$; since loads are integers, this means the maximum load is $1$, i.e., there are no collisions. Again, since we have the keys, we can check whether there are collisions, and if so pick another random $h_i \in \mathcal{H}_i$ and try again; a constant expected number of tries suffices for each $i$.
Thus, we have constructed a hashing scheme with no collisions and total memory $$n + \sum_{i=0}^{n-1} n_i^2 = O(n),$$ as we wanted.
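Putting the whole two-level construction together, here is a sketch in Python, using the Example 4 family at both levels (function and parameter names are mine):

```python
import random

def sample_hash(p, size):
    """h(x) = ((a*x + b) % p) % size: the 2-universal family of Example 4.

    p must be a prime larger than every key.
    """
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: ((a * x + b) % p) % size

def build_perfect_table(keys, p):
    n = len(keys)
    # Level 1: retry until the number of collisions is <= n.
    # (By Corollary 1 with ell = n, each attempt succeeds w.p. >= 1/2.)
    while True:
        h = sample_hash(p, n)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(b) * (len(b) - 1) // 2 for b in buckets) <= n:
            break
    # Level 2: bucket i with n_i keys gets its own table of n_i^2 cells,
    # resampled until it is collision-free.
    tables = []
    for bucket in buckets:
        size = len(bucket) ** 2
        while True:
            hi = sample_hash(p, size) if size > 0 else None
            cells = [None] * size
            ok = True
            for x in bucket:
                c = hi(x)
                if cells[c] is not None:   # collision: resample hi
                    ok = False
                    break
                cells[c] = x
            if ok:
                tables.append((hi, cells))
                break
    return h, tables

def search(x, h, tables):
    hi, cells = tables[h(x)]
    return hi is not None and cells[hi(x)] == x
```

For instance, `h, tables = build_perfect_table(random.sample(range(10_007), 100), 10_007)` builds a collision-free table for 100 keys, after which `search` answers each query with two hash evaluations and one comparison.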