# Sublinear Time Algorithms

In this lecture, we will explore sublinear time algorithms. These are algorithms that do not read the entire input, but rather read only a small portion of the input.

## Motivation

Oftentimes, we are interested in solving problems where the input is so large that even reading it in full is expensive. In some situations, the input also changes over time, and we would like to update our knowledge about it without reading the entire input again.

One model where such problems arise is the *sublinear time* model.
In this model, we are given access to the input by being able to *query small pieces of it*.
For instance:

- Social networks: each user is a node, and two users are connected by an edge if they are friends.
- Is the graph connected?
- What is the diameter of the graph? (the longest shortest path between any two nodes)

In the above case, we can query the graph by asking for the neighbors of a node, or by asking whether an edge is present in the graph.

- Program checking: given a program, we want to check if it is correct on all/most inputs of a particular size.
- Is the program correct on all inputs? (too many inputs to check the program on)

In the above case, we can query the program by asking for the output of the program on a particular input.

### What can we hope to achieve?

In the sublinear time model, we are **not** able to answer “for all”, “there exists”, or “exactly” type questions, such as:

- Are **all** individuals in the social network connected?
- Are **all** individuals in the social network connected by at most 6 degrees of separation?
- Does the program **always** output the correct answer?

Instead, we can hope to answer “for most”, “average”, or “approximately” type questions, with high probability, such as:

- Are **most** individuals in the social network connected?
- Are **most** individuals in the social network connected by at most 6 degrees of separation?
- Approximately how many individuals are left-handed?
- Is my program correct on **most** inputs?

And we are once again in the setting of randomized and approximate algorithms.

## Sublinear Models of Computation

There are several models of computation that allow us to access the input in a sublinear fashion. We will discuss two such models in this lecture.

### Random Access Queries

In this model, we can access any word of the input in unit time. There are many ways to represent the input in this model, such as:

- Adjacency matrix representation of a graph
- Adjacency list representation of a graph
- Individual bits of a string, indexed by position
- many others

### Samples

In this model, we get samples from a certain distribution at each query, which takes unit time.

## Example 1: Approximating the diameter of a point set

**Input**: A set of $m$ points, labeled $1,2,\ldots,m$, and a distance matrix $D$ such that $D_{ij}$ is the distance between points $i$ and $j$. Note that matrix $D$ must be symmetric, with non-negative entries, have zeros on the diagonal, and satisfy the triangle inequality.

**Output**: a pair of points $a,b$ that are at least half the maximum distance (diameter) apart. That is, $$ D_{ab} \geq \dfrac{1}{2} \cdot \max_{i,j} D_{ij}.$$

In this case note that the input size is $N := m^2$. Additionally, we are requiring a (multiplicative) 2-approximation to the diameter of the point set.

We will work in the random access query model. Our algorithm will be sublinear-time if it makes $o(N)$ queries to the input matrix.

### Algorithm

- Pick a random point $a$ uniformly at random.
- Pick point $b$ such that $D_{ab} = \max_j D_{aj}$.
- Output the pair $(a,b)$.
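The steps above can be sketched in a few lines of Python (an illustrative implementation; the distance matrix is assumed to be given as a list of lists):

```python
import random

def approx_diameter(D):
    """2-approximation to the diameter: pick a random point a, then
    return (a, b) where b maximizes D[a][j]. This queries only one
    row of D, i.e. m entries out of N = m^2."""
    m = len(D)
    a = random.randrange(m)
    b = max(range(m), key=lambda j: D[a][j])
    return a, b

# Example: three points on a line at positions 0, 1, 5.
pos = [0, 1, 5]
D = [[abs(p - q) for q in pos] for p in pos]
a, b = approx_diameter(D)
true_diameter = max(max(row) for row in D)  # 5
assert D[a][b] >= true_diameter / 2         # 2-approximation guarantee
```

Note that the guarantee holds no matter which point $a$ is sampled; randomness is not needed for correctness here, only for symmetry of exposition.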

**Correctness Analysis:** to see that the algorithm is correct, we use the triangle inequality, and the property of the maximum distance.
Let $(k, \ell)$ be the pair of points which are furthest apart.
Then, we have
$$ D_{k\ell} \leq D_{ka} + D_{a\ell} \leq D_{ab} + D_{ab} = 2D_{ab}.$$

**Running Time Analysis:** The algorithm makes $m$ queries to the input matrix, and hence runs in time $O(m) = O(N^{1/2})$.

Upon seeing the above algorithm, you might wonder if we can do better - that is, can we get a better approximation ratio than 2 in sublinear time? It turns out, however, that this is the best we can do in the random access query model.

To see this, consider the following example: let $D$ be the following distance matrix: $$ D_{ij} = \begin{cases} 0 & \text{if } i = j, \\ 1 & \text{if } i \neq j. \end{cases} $$

Let $D'$ be the same matrix as $D$, except that for one pair of points $a,b$, we set $D'_{ab} = D'_{ba} = 2 - \delta$, where $\delta > 0$ is a small constant.

Then, we can check that both matrices are valid distance matrices, and that the diameter of $D$ is 1, while the diameter of $D'$ is $2 - \delta$. Thus, any algorithm which obtains a better than 2-approximation to the diameter of the point set must distinguish between $D$ and $D'$. However, this requires $\Omega(N)$ queries to the input matrix, and hence the algorithm above is optimal.
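To make the construction concrete, here is a short Python sketch (illustrative, not part of the proof) that builds the two instances and brute-force checks that both are valid metrics:

```python
from itertools import permutations

def make_instances(m, delta):
    """D: all off-diagonal distances equal to 1.
    D': identical, except one pair at distance 2 - delta."""
    D = [[0.0 if i == j else 1.0 for j in range(m)] for i in range(m)]
    Dp = [row[:] for row in D]
    Dp[0][1] = Dp[1][0] = 2 - delta
    return D, Dp

def is_metric(M):
    """Brute-force check of symmetry and the triangle inequality."""
    m = len(M)
    sym = all(M[i][j] == M[j][i] for i in range(m) for j in range(m))
    tri = all(M[i][j] <= M[i][k] + M[k][j]
              for i, j, k in permutations(range(m), 3))
    return sym and tri

D, Dp = make_instances(6, 0.5)
assert is_metric(D) and is_metric(Dp)
assert max(map(max, D)) == 1.0   # diameter of D
assert max(map(max, Dp)) == 1.5  # diameter of D' is 2 - delta
```

The triangle inequality survives in $D'$ because every two-hop path between distinct points has length at least $2 \geq 2 - \delta$.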

**Practice Problem:** Prove the above claim.

## Main Result: Computing the Number of Connected Components

We are once again in the random access query model.

**Input**: A graph $G = (V,E)$, in the adjacency list representation, where $|V| = n$ and $|E| = m$, together with an approximation parameter $\varepsilon > 0$.

**Output**: if $C$ is the number of connected components in $G$, then output a number $C'$ such that $$ |C - C'| \leq \varepsilon n.$$

Note that in this case, the input size is $N = O(n+m)$. We want an algorithm that makes $o(N)$ queries to the input graph and succeeds with high probability (say $\geq 3/4$).

To achieve this, we will use the following characterization of the number of connected components in a graph:

**Lemma (Number of Connected Components):** Let $G = (V,E)$ be a graph.
For each vertex $v \in V$, let $n_v$ be the number of vertices in the connected component of $v$.
Then, the number of connected components in $G$, denoted by $C$, is given by:
$$ C = \sum_{v \in V} \dfrac{1}{n_v}.$$

**Proof:** Note that for each connected component $\Gamma$ of $G$, we have
$$ \sum_{v \in \Gamma} \dfrac{1}{n_v} = \sum_{v \in \Gamma} \dfrac{1}{|\Gamma|} = 1.$$
Summing over all connected components, we get the desired result.
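As a sanity check, here is a short Python snippet (illustrative) that computes $n_v$ for every vertex and verifies the lemma on a small graph:

```python
from collections import deque

def component_sizes(adj):
    """Return n_v (the size of v's connected component) for every
    vertex, via a BFS over each component."""
    n = len(adj)
    size = [0] * n
    seen = [False] * n
    for s in range(n):
        if seen[s]:
            continue
        comp, queue = [], deque([s])
        seen[s] = True
        while queue:
            u = queue.popleft()
            comp.append(u)
            for w in adj[u]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        for v in comp:
            size[v] = len(comp)
    return size

# 5 vertices: path 0-1-2 and edge 3-4, so C = 2 components.
adj = [[1], [0, 2], [1], [4], [3]]
C = sum(1 / nv for nv in component_sizes(adj))
assert abs(C - 2) < 1e-9
```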

With this characterization in hand, a naive approach would be to sample a small number of vertices $v_1, v_2, \ldots, v_s$ uniformly at random, compute the connected component sizes $n_{v_1}, n_{v_2}, \ldots, n_{v_s}$, and output the normalized estimate: $$ C' = \dfrac{n}{s} \cdot \sum_{i=1}^s \dfrac{1}{n_{v_i}}.$$

However, just computing the connected component sizes $n_{v_i}$ may take linear time in the worst case - for instance, if the graph is connected.

To overcome this, we will use the following idea: if $n_v$ is too large, then we can simply “drop it” (in a careful way, of course). This intuition makes sense: if $n_v$ is large, then $1/n_v$ is small, and hence the contribution of $v$ to the sum is small.

Here is how we could do it: we can set a threshold $T$ and check if $n_v \leq T$. In case $n_v > T$, we simply add $1/T$ to our estimate. This will essentially be our algorithm.

But why would this work? The following lemma will help us understand why.

**Lemma (Estimating the Number of Connected Components):** Let $G = (V,E)$ be a graph, and for each vertex $v \in V$, let
$$ n_v' := \min\{n_v, 2/\varepsilon\}.$$
Then,
$$ \left| \sum_{v \in V} \dfrac{1}{n_v} - \sum_{v \in V} \dfrac{1}{n_v'} \right| \leq \varepsilon n/2.$$

**Proof:** By the triangle inequality, we have
$$ \left| \sum_{v \in V} \dfrac{1}{n_v} - \sum_{v \in V} \dfrac{1}{n_v'} \right| \leq \sum_{v \in V} \left| \dfrac{1}{n_v} - \dfrac{1}{n_v'} \right| \leq \sum_{v \in V} \varepsilon/2 = \varepsilon n/2,$$
where in the second inequality we used the fact that $\frac{1}{n_v'} \geq \frac{1}{n_v}$, and that whenever the two values differ we have $n_v' = 2/\varepsilon$, so the difference is $\varepsilon/2 - 1/n_v \leq \varepsilon/2$.

How do we do this estimation? We don’t know how to estimate sums, but we do know how to estimate averages! By the above lemma, we have that the averages don’t differ much, as $$ \left| \dfrac{1}{n} \sum_{v \in V} \dfrac{1}{n_v} - \dfrac{1}{n} \sum_{v \in V} \dfrac{1}{n_v'} \right| \leq \varepsilon/2.$$

Thus, if we can estimate the average $\dfrac{1}{n} \sum_{v \in V} \dfrac{1}{n_v'}$ to within $\varepsilon/2$, then by the triangle inequality we can estimate the number of connected components to within $\varepsilon n$.

Now, to estimate the average, we can simply sample a small number of vertices $v_1, v_2, \ldots, v_s$ uniformly at random, and output the sample average. We know that this is a good estimate, since the sample average is a good estimate of the true average (from our lecture on concentration inequalities).

Thus, we are led to the following algorithm:

### Algorithm

For $i = 1,2,\ldots,s = \Theta(1/\varepsilon^2)$:

- Pick a vertex $v_i$ uniformly at random.
- Compute $n_{v_i}'$ by running a BFS from $v_i$, stopping the BFS once it reaches $2/\varepsilon$ vertices.

Output the estimate $C' = \dfrac{n}{s} \sum_{i=1}^s \dfrac{1}{n_{v_i}'}$.

Note that we output the scaled average (scaled by $n$) to get the estimate of the number of connected components.
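A minimal Python sketch of this estimator (illustrative; the sample count $s = 8/\varepsilon^2$ matches the constant derived in the analysis below):

```python
import random
from collections import deque

def truncated_size(adj, v, cap):
    """BFS from v, stopping once cap vertices have been discovered;
    returns n_v' = min(n_v, cap)."""
    seen = {v}
    queue = deque([v])
    while queue and len(seen) < cap:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
                if len(seen) >= cap:
                    break
    return len(seen)

def estimate_components(adj, eps):
    """Estimate the number of connected components to within eps * n."""
    n = len(adj)
    cap = max(1, int(2 / eps))   # threshold 2/eps
    s = int(8 / eps**2)          # number of samples
    total = sum(1 / truncated_size(adj, random.randrange(n), cap)
                for _ in range(s))
    return n * total / s

# 100 isolated vertices: exactly 100 components, and every n_v' = 1,
# so the estimate is exact on this instance.
adj = [[] for _ in range(100)]
assert abs(estimate_components(adj, 0.5) - 100) < 1e-9
```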

We can analyze the running time of the algorithm as follows:

- we sampled $\Theta(1/\varepsilon^2)$ vertices
- for each vertex, we ran a truncated BFS, which visits at most $2/\varepsilon$ vertices and hence examines at most $O(1/\varepsilon^2)$ edges.
- computing the estimate takes time $O(s) = O(1/\varepsilon^2)$, as we need to sum the results.

Hence, the algorithm runs in time $O(1/\varepsilon^4)$, which is dominated by the total runtime of the $s$ runs of the truncated BFS.

To prove correctness, we need to show that the estimate is close to the true value with high probability. By the above lemmas, this will be true if the sample average is close to the true average with high probability. That is, we need to show that with probability at least $3/4$, we have $$ \left| \dfrac{1}{s} \sum_{i=1}^s \dfrac{1}{n_{v_i}'} - \dfrac{1}{n} \sum_{v \in V} \dfrac{1}{n_v'} \right| \leq \varepsilon/2.$$

To show this, we will use Hoeffding’s inequality:

**Hoeffding’s Inequality:** Let $X_1, X_2, \ldots, X_s$ be independent random variables such that $a_i \leq X_i \leq b_i$ for all $i$.
Let $X = \sum X_i$.
Then, for any $\ell > 0$, we have
$$ \Pr\left[ |X - \mathbb{E}[X]| \geq \ell \right] \leq 2 \exp\left( -\dfrac{2\ell^2}{\sum_{i=1}^s (b_i - a_i)^2} \right).$$

Let $X_i$ be the random variable corresponding to the (scaled) outcome of the $i$-th run of BFS of the algorithm. That is, $X_i$ is the random variable corresponding to sampling a vertex $v_i \in V$ uniformly at random, and outputting the scaled estimate $\dfrac{1}{s} \cdot \dfrac{1}{n_{v_i}'}$. Thus, for each $v \in V$ we have $X_i = \dfrac{1}{s} \cdot \dfrac{1}{n_{v}'}$ with probability $1/n$. In particular, for each $i \in [s]$ we have $0 \leq X_i \leq 1/s$.

Let $X = \sum_{i=1}^s X_i$. Then, we have $$ \mathbb{E}[X] = \sum_{i=1}^s \mathbb{E}[X_i] = \sum_{i=1}^s \sum_{v \in V} \dfrac{1}{n} \cdot \dfrac{1}{s \cdot n_v'} = \sum_{i=1}^s \dfrac{1}{s} \cdot \dfrac{1}{n} \cdot \sum_{v \in V} \dfrac{1}{n_v'} = \dfrac{1}{n} \sum_{v \in V} \dfrac{1}{n_v'}.$$

Hence, by Hoeffding’s inequality, we have $$ \Pr\left[ |X - \mathbb{E}[X]| \geq \varepsilon/2 \right] \leq 2 \exp\left( -\dfrac{2(\varepsilon/2)^2}{s \cdot (1/s)^2} \right) = 2 \exp\left( -\dfrac{\varepsilon^2 \cdot s}{2} \right).$$

Note that the RHS is at most $1/4$ when $s = 8/\varepsilon^2$, since in that case it equals $2e^{-4} \approx 0.037$.
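A quick numeric check of this constant in Python:

```python
import math

eps = 0.1
s = 8 / eps**2                          # s = 8 / eps^2 samples
bound = 2 * math.exp(-eps**2 * s / 2)   # Hoeffding failure probability
# The eps factors cancel: the bound is always 2 * e^{-4}, about 0.037.
assert abs(bound - 2 * math.exp(-4)) < 1e-12
assert bound <= 0.25
```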