Homework 2

This homework covers the material from lectures 5 to 9.

Due date: June 11th.

PDF version of Homework 2.

LaTeX template, in case you want to write it in LaTeX.

Required Exercises

Remember that you are only required to turn in 5 out of the 6 exercises below.

Problem 1 - Bashing (20 points)

Perfect hashing is nice, but does have the drawback that the perfect hash function has a lengthy description (since you have to describe the second-level hash function for each bucket). Consider the following alternative approach to producing a perfect hash function with a small description. Define bi-bucket hashing, or bashing, as follows. Given $n$ items, allocate two arrays of size $n^{3/2}$. When inserting an item, map it to one bucket in each array, and place it in the emptier of the two buckets.
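To make the scheme concrete, here is a minimal Python sketch of the insertion rule (the class and method names are ours; the truly random functions of part 1 are simulated by memoizing a uniformly random bucket choice per key):

```python
import random

class Basher:
    """Bi-bucket hashing ("bashing") sketch: two arrays of size n^{3/2};
    each item is mapped to one bucket in each array and placed in the
    emptier of the two."""

    def __init__(self, n):
        self.size = int(n ** 1.5)
        self.tables = ([[] for _ in range(self.size)],
                       [[] for _ in range(self.size)])
        self._maps = ({}, {})  # memoized "random functions" for part 1

    def _bucket(self, which, key):
        # Uniformly random, mutually independent bucket choice per key.
        m = self._maps[which]
        if key not in m:
            m[key] = random.randrange(self.size)
        return m[key]

    def insert(self, key):
        b0, b1 = self._bucket(0, key), self._bucket(1, key)
        t0, t1 = self.tables
        # The bashing rule: go to the emptier of the two candidate buckets.
        if len(t0[b0]) <= len(t1[b1]):
            t0[b0].append(key)
        else:
            t1[b1].append(key)
```

For part 2, the two memoized maps would be replaced by functions drawn from a 2-universal family.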

  1. Suppose a random function (i.e., all function values are uniformly random and mutually independent) is used to map each item to buckets. Give a good upper bound on the expected number of collisions (i.e., the number of pairs of items that are placed in the same bucket).

Hint: what is the probability that the $k^{th}$ inserted item collides with some previously inserted item?

  2. Argue that bashing can be implemented efficiently, with the same expected outcome, using the ideas from 2-universal hashing.

  3. Conclude that there is an algorithm with linear expected running time (ignoring array initialization) for identifying a perfect bash function for a set of $n$ items. How large is the description of the resulting function?

Problem 2 - Distinct Elements (20 points)

Consider again the distinct elements problem that we saw in class. We are given a stream of elements $a_1, \ldots, a_n$ from our universe $U = \{ 0, 1, \ldots, 2^b -1 \}$, possibly with repetitions, and we would like to know how many distinct elements there are in the sequence. Since we are in the streaming setting, we will make only one pass through the sequence above, and we have little memory.

We will now analyze a different algorithm for the distinct elements problem (a runnable sketch follows the questions below):

  • Let $N > n$ be an integer
  • Pick at random a function $h : U \rightarrow [N]$ from a strongly 2-universal family.
  • Let $m := \min \{ h(a_1), \ldots, h(a_n) \}$
  • Output $N/m$
  1. Suppose that $a_1, \ldots, a_n$ contains $k$ distinct elements. Show that $$ \Pr[ \text{ algorithm outputs a number } > 4k ] \leq \dfrac{1}{4} $$
  2. Suppose that $a_1, \ldots, a_n$ contains $k$ distinct elements. Show that $$ \Pr[ \text{ algorithm outputs a number } < k/4 ] \leq \dfrac{1}{4} $$
  3. Assuming that $U = [\mathsf{poly}(n)]$, what is the memory requirement of the algorithm above?
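For intuition, here is a Python sketch of the estimator above, with $h$ instantiated as $x \mapsto ((ax + b) \bmod p) \bmod N$ for a large prime $p$, a standard stand-in for a strongly 2-universal family (the shift into $\{1, \ldots, N\}$ avoids division by zero and is our choice, not part of the problem):

```python
import random

def distinct_estimate(stream, N, p=(1 << 61) - 1):
    """Min-of-hashes estimator for the number of distinct elements.
    Single pass, tiny memory: only (a, b) and the running minimum."""
    a, b = random.randrange(p), random.randrange(p)  # random member of the family
    h = lambda x: ((a * x + b) % p) % N + 1          # bucket in {1, ..., N}
    m = min(h(x) for x in stream)
    return N / m
```

For instance, `distinct_estimate([3, 1, 4, 1, 5, 9, 2, 6], N=10**6)` should land within a factor of $4$ of $k = 7$ with constant probability, which is exactly what parts 1 and 2 bound.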

Problem 3 - Karger’s Algorithm (20 points)

One advantage of Karger’s random contraction algorithm for the minimum cut problem is that it can be used to output all minimum cuts. In this question, we treat Karger’s algorithm as a black box which, given a graph $G$ on $n$ vertices as input, runs in time $O(n^2)$ and, for any fixed minimum cut $(S, \overline{S})$, outputs that cut with probability at least $\frac{2}{n(n-1)}$. Explain how Karger’s algorithm can be used to output all minimum cuts, and analyze the running time needed to output all minimum cuts with success probability at least $0.9999$.
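As a reminder, the black box is the random contraction algorithm from class; here is a minimal sketch of one run (the union-find bookkeeping and names are ours):

```python
import random

def karger_once(n, edges):
    """One run of random contraction on a connected multigraph with
    vertices 0..n-1 and edge list `edges`; returns the edges crossing
    the final 2-way cut."""
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    while components > 2:
        u, v = random.choice(edges)   # uniform over edges; picks inside a
        ru, rv = find(u), find(v)     # super-vertex are skipped, so the
        if ru != rv:                  # contracted edge stays uniform
            parent[ru] = rv           # contract (u, v)
            components -= 1
    return [(u, v) for (u, v) in edges if find(u) != find(v)]
```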

Problem 4 - Karger strikes back (20 points)

Another problem about Karger’s randomized algorithm for minimum cut:

  1. Suppose Karger’s algorithm is applied to a tree. Show that it finds a minimum cut in the tree with probability 1.
  2. Consider the following modification of Karger’s algorithm: instead of choosing an edge uniformly at random and merging the endpoints, the algorithm chooses any two distinct vertices uniformly at random and merges them (a runnable sketch of this variant follows the list). Show that for any $n$ there is a graph $G_n$ with $n$ vertices such that when the modified algorithm is run on $G_n$, the probability that it finds a minimum cut is exponentially small in $n$.
  3. How many times would you have to repeat the modified algorithm of the previous part to have a reasonable chance of finding a minimum cut? What does this tell us about the practicality of the modified algorithm?
  4. Show that for any $n \geq 3$ there is a graph $G_n$ with $n$ vertices that has $n(n-1)/2$ distinct minimum cuts.
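For experimenting with part 2, here is a sketch of the modified contraction step (again with our own union-find bookkeeping):

```python
import random

def modified_karger_once(n, edges):
    """Part 2's variant: merge two uniformly random distinct
    super-vertices (ignoring the edges entirely) until two remain;
    return the edges crossing the resulting cut."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    supers = list(range(n))              # current super-vertex representatives
    while len(supers) > 2:
        u, v = random.sample(supers, 2)  # uniform random distinct pair
        parent[u] = v                    # merge u into v
        supers.remove(u)
    return [(u, v) for (u, v) in edges if find(u) != find(v)]
```

Comparing its success frequency with `karger_once` on a graph of your choice is a quick way to build intuition for why the analysis breaks.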

Problem 5 - Sublinear-time Algorithms (20 points)

Sublinear-time algorithms for connectedness in graphs with bounded degree.

Given a graph $G$ of maximum degree $d$ (presented as an adjacency list) and a parameter $\epsilon > 0$, give an algorithm with the following behavior: if $G$ is connected, then the algorithm should output PASS with probability $1$, and if $G$ is $\epsilon$-far from connected (i.e., at least $\epsilon \cdot n \cdot d$ edges must be added to make $G$ connected), then the algorithm should output FAIL with probability at least $3/4$. Your algorithm should look at a number of edges that is independent of $n$ and polynomial in $d$ and $1/\epsilon$.

For this problem, when proving the correctness of your algorithm, it is OK to show that if the input graph $G$ is likely to be passed, then it is $\epsilon$-close to a connected graph $G_0$, without requiring that $G_0$ have degree at most $d$.

Problem 6 - Application of Dimension Reduction (20 points)

In this problem we analyze a common algorithmic application of the Johnson-Lindenstrauss lemma.

Consider the $k$-means clustering problem. Given points $x_1, \ldots, x_n \in \mathbb{R}^d$, the goal is to partition the points into $k$ disjoint sets (clusters) $C_1, \ldots, C_k$ to minimize the following cost:

$$ Cost(x_1, \ldots, x_n, C_1, \ldots, C_k) = \sum_{i=1}^k \sum_{j \in C_i} || x_j - \mu_i ||^2, $$

where the norm is the Euclidean norm and $\mu_i = \dfrac{1}{|C_i|} \cdot \sum_{j \in C_i} x_j$ is the mean of the points in cluster $C_i$.
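The cost function translates directly into code. Here is a small numpy sketch (function names ours), together with the pairwise form from part 1 as a numerical sanity check (not a proof):

```python
import numpy as np

def kmeans_cost(X, clusters):
    """Cost as defined above: X is an (n, d) array, `clusters` a list of
    k lists of point indices partitioning range(n)."""
    total = 0.0
    for C in clusters:
        mu = X[C].mean(axis=0)               # cluster mean mu_i
        total += ((X[C] - mu) ** 2).sum()    # squared distances to mu_i
    return total

def kmeans_cost_pairwise(X, clusters):
    """Equivalent pairwise form from part 1."""
    total = 0.0
    for C in clusters:
        for a in range(len(C)):
            for b in range(a + 1, len(C)):
                total += ((X[C[a]] - X[C[b]]) ** 2).sum() / len(C)
    return total
```

On random data the two functions should agree up to floating-point error, which is a useful check on a proof of part 1.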

  1. Prove that: $$ Cost(x_1, \ldots, x_n, C_1, \ldots, C_k) = \sum_{i=1}^k \dfrac{1}{|C_i|} \cdot \sum_{\substack{j, \ell \in C_i \\ j < \ell}} || x_j - x_\ell ||^2 $$
  2. Suppose we embed $x_1, \ldots, x_n$ into $O(\log n/\epsilon^2)$-dimensional vectors $y_1, \ldots, y_n$ using the Johnson-Lindenstrauss construction. Show that for all clusterings simultaneously: $$ (1-\epsilon) \cdot Cost(x_1, \ldots, x_n, C_1, \ldots, C_k) \leq Cost(y_1, \ldots, y_n, C_1, \ldots, C_k) \leq (1+ \epsilon) \cdot Cost(x_1, \ldots, x_n, C_1, \ldots, C_k) $$ with high probability.
  3. Suppose we find a set of clusters $\Gamma_1,\ldots, \Gamma_k$ such that: $$ Cost(y_1, \ldots, y_n, \Gamma_1, \ldots, \Gamma_k) \leq \gamma \cdot Cost(y_1, \ldots, y_n, C_1^*, \ldots, C_k^*) $$ where $C_1^*, \ldots, C_k^*$ is the optimal clustering for the points $y_1,\ldots, y_n$. That is, we found a $\gamma$-approximation to the optimal clustering for the low-dimensional points. Show that for $\epsilon \leq 1/2$, we have $$ Cost(x_1, \ldots, x_n, \Gamma_1, \ldots, \Gamma_k) \leq (1+ O(\epsilon)) \cdot \gamma \cdot Cost(x_1, \ldots, x_n, C_1^{opt}, \ldots, C_k^{opt}), $$ where $C_1^{opt}, \ldots, C_k^{opt}$ is the optimal clustering for $x_1,\ldots, x_n$.

In other words, we can compute an approximate clustering for our original points using the low dimensional data. This speeds up algorithms for $k$-means whenever $\log n/\epsilon^2 < d$.
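A minimal sketch of the embedding step in part 2, using the Gaussian Johnson-Lindenstrauss construction (the constant 8 in the target dimension is a typical choice, not necessarily the one from lecture):

```python
import numpy as np

def jl_embed(X, eps, seed=None):
    """Embed the rows of the (n, d) array X into m = O(log n / eps^2)
    dimensions via a random Gaussian projection."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = int(np.ceil(8 * np.log(n) / eps ** 2))
    G = rng.normal(size=(d, m)) / np.sqrt(m)   # i.i.d. N(0, 1/m) entries
    return X @ G                               # each row y_i = G^T x_i
```

Pairing this with the `kmeans_cost` sketch above makes the $(1 \pm \epsilon)$ cost preservation of part 2 easy to observe empirically.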

Practice Problems

You are not required to submit solutions to these problems, but you are highly encouraged to attempt them for the sake of your own knowledge. :)

Problem 1 - From Lecture 5

  1. Check that the hash family we constructed on slide 51 is strongly 2-universal. Is it strongly 3-universal?
  2. Prove that the hash family from slide 54 is 2-universal. Prove also that it is not strongly 2-universal.
  3. Prove the proposition from slide 55.
  4. Construct a family of 3-wise independent hash functions, which is not 4-wise independent. Prove that this family is 3-wise independent and not 4-wise independent.

(Hint: see slide 59.)

  5. Prove that the number of collisions is an upper bound on the square of the maximum load of a hash table.
  6. Use Markov’s inequality and the previous part to prove the second lemma from slide 72.

Problem 2 - More hash functions

Consider the following examples of hash families. For each one, prove that it is 2-universal or give a counterexample.

  1. Let $p$ be a prime number and $n \leq p$ be an integer. Let $$\mathcal{H} := \{ h_a(x) = (ax \bmod p) \bmod n \ \mid \ a \in \{ 1, \ldots, p-1 \} \} $$
  2. Let $p$ be a prime number and $n \leq p$ be an integer. Let $$\mathcal{H} := \{ h_b(x) = ((x + b) \bmod p) \bmod n \ \mid \ b \in \{ 0,1, \ldots, p-1 \} \} $$
  3. Let $m$ be an integer multiple of $n$. Let $$\mathcal{H} := \{ h_{a, b}(x) = ((ax + b) \bmod m) \bmod n \ \mid \ a \in \{ 1, \ldots, m-1 \}, b \in \{ 0, 1, \ldots, m-1 \} \} $$
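When hunting for counterexamples, it can help to test 2-universality exhaustively on small parameters. A brute-force sketch (names ours):

```python
def max_collision_prob(H, universe):
    """For a finite hash family H (a list of functions) on `universe`,
    return the largest Pr_h[h(x) == h(y)] over pairs x != y; the family
    is 2-universal iff this is at most 1/n."""
    worst = 0.0
    for x in universe:
        for y in universe:
            if x < y:
                worst = max(worst, sum(h(x) == h(y) for h in H) / len(H))
    return worst

# Example: family 1 with small p and n (the a=a default pins each a).
p, n = 7, 3
H = [lambda x, a=a: ((a * x) % p) % n for a in range(1, p)]
print(max_collision_prob(H, range(p)), "vs", 1 / n)
```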

Problem 3 - Probability Amplification

In the distinct element algorithm we saw in lecture, we showed that with probability at least $2/3$ we have $(1-\varepsilon) D \leq Y \leq (1+\varepsilon) D$, where $Y$ is the answer that our algorithm returns. Suppose we run $k$ independent copies of the algorithm in parallel, obtain the estimates $Y_1, \ldots, Y_k$, and return the median $Y'$ of $Y_1, \ldots, Y_k$ as our answer. Prove that $(1-\varepsilon) D \leq Y' \leq (1+\varepsilon) D$ with probability $1-\delta$ when $k = O(\log 1/\delta)$.

This technique for amplifying the success probability is very general and useful when one would like an algorithm with an arbitrarily high success rate, given only an algorithm whose success probability is a constant strictly greater than $1/2$.
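The median trick is a one-liner in code; a sketch (function name ours):

```python
import statistics

def amplified(run_once, k):
    """Run k independent copies of an estimator that succeeds with
    constant probability > 1/2 and return the median of the answers."""
    return statistics.median(run_once() for _ in range(k))
```

For example, `amplified(lambda: my_estimator(stream), k=30)`, where `my_estimator` is any such estimator (the `distinct_estimate` sketch from earlier in this assignment would do).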

Problem 4 - From Lecture 6

  1. Prove that $$\Pr[ Y < (1-\epsilon) \cdot D ] \leq \dfrac{4}{\epsilon^2 t}.$$
  2. Work through the weighted version of heavy hitters.

Problem 5 - Testing Sortedness in Sublinear Time (courtesy of Nathan Harms)

In this question, an array is sorted if its entries are in increasing order.

Let $A$ be an array of $n$ integers, and let $\varepsilon \in (0,1)$. We say that $A$ is $\varepsilon$-far from sorted if every sorted array $B$ has at least $\varepsilon n$ entries that are different from $A$. In other words, at least $\varepsilon n$ entries of $A$ must be modified to turn it into a sorted array.

An $\varepsilon$-tester for sortedness with query complexity $q$ is a randomized algorithm that queries a sequence $i_1, \ldots, i_q$ of indices from $A$ and outputs YES or NO as follows:

  1. if $A$ is sorted, the algorithm outputs YES with probability 1;
  2. if $A$ is $\varepsilon$-far from sorted, the algorithm outputs NO with probability at least $2/3$.

Question 1: the simplest idea for a tester is the random adjacent pair tester, which chooses a random index $i \in \{ 1, \ldots, n-1 \}$, queries $A[i]$ and $A[i+1]$, and rejects if $A[i] > A[i+1]$. This procedure is repeated $q$ times, and the tester outputs NO if any repetition rejects.
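For concreteness, a sketch of this tester (0-indexed; names ours):

```python
import random

def adjacent_pair_tester(A, q):
    """Sample q random adjacent pairs; output NO (False) iff some
    sampled pair is out of order."""
    n = len(A)
    for _ in range(q):
        i = random.randrange(n - 1)   # 0-indexed counterpart of i in {1, ..., n-1}
        if A[i] > A[i + 1]:
            return False              # reject: witnessed an inversion
    return True                       # all sampled pairs were in order
```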

Give an example array $A$ that is $1/2$-far from sorted, where this algorithm with $q=1$ repetitions outputs NO with probability at most $1/(n-1)$. Conclude that the tester must repeat at least $q = \Omega(n)$ times, so the query complexity in this case is $\Omega(n)$.

Question 2: Another idea is to choose $q$ independently, uniformly random indices $i_1, \ldots, i_q$ and output NO if the subsequence $A[i_1], A[i_2], \ldots, A[i_q]$ is not sorted.

Give an example of an array that is $1/2$-far from sorted where this algorithm requires $q = \Omega(\sqrt{n})$ queries.

Hint: remember the birthday paradox.

Question 3: Assume each entry of $A$ is distinct. Consider the set $G$ of good indices of $A$, where index $i$ is good if running binary search on $A$ for the value $A[i]$ successfully arrives at index $i$.

  1. For any set $G = \{ i_1, \ldots, i_g \}$ of good indices, show that the subsequence $A[i_1], \ldots, A[i_g]$ is sorted.
  2. Show that, if $A$ is $\varepsilon$-far from sorted, then $|G| \leq (1-\varepsilon)n$.
  3. Show that there is an $\varepsilon$-tester for sortedness that uses $O(\frac{1}{\varepsilon} \log n)$ queries.

Hint: consider choosing a uniformly random index $i \in \{1, \ldots, n\}$, querying $A[i]$, and then using binary search to look for the value $A[i]$.
