# Data Streaming

We often need to process data streams that are too large to fit into memory. This happens, for instance, when processing data from sensor networks, satellite data feeds, database logs and transaction records, internet search logs, network traffic, and so on.

Two important aspects of the above setting are:

- The data does not come all at once, but rather arrives in a stream.
- The data is too large to fit into memory.

In such cases, we need to process the data in a single pass (or a constant number of passes), using sublinear space and with limited time to process each item.

In the data stream model, we have a stream of elements $a_1, a_2, \ldots, a_N \in \Sigma$ where $\Sigma$ is the alphabet.

- Each element takes $b$ bits to represent, and one usually assumes that $N$ is known in advance.
- Basic operations (comparison, arithmetic, bitwise operations) can be done in $O(1)$ time.
- Any algorithm is allowed a single pass (or a small number of passes) over the data.
- Bounded storage: the algorithm is allowed to use $O(\log^c N)$ bits of memory for some constant $c$, or $O(N^\alpha)$ bits for some $0 < \alpha < 1$.
- Algorithms are allowed to use randomness (almost always necessary), so we are once again in the probabilistic model.
- We usually want *approximate answers* that are close to the true answer with high probability.

The goal of a streaming algorithm is to minimize the space used and the processing time, while still providing a good approximation to the answer.

Here are some examples of problems that can be studied in the data streaming model:

1. **Sum of Elements**
   - **Input**: A stream of $N$ integers $a_1, a_2, \ldots, a_N \in [-2^b+1, 2^b-1]$.
   - **Output**: maintain the current sum of the elements we have seen so far.

2. **Median of Elements**
   - **Input**: A stream of $N$ integers $a_1, a_2, \ldots, a_N \in [-2^b+1, 2^b-1]$.
   - **Output**: maintain the current median of the elements we have seen so far.

3. **Distinct Elements**
   - **Input**: A stream of $N$ integers $a_1, a_2, \ldots, a_N \in [-2^b+1, 2^b-1]$.
   - **Output**: maintain the number of distinct elements we have seen so far.

4. **Heavy Hitters**
   - **Input**: A stream of $N$ integers $a_1, a_2, \ldots, a_N \in [-2^b+1, 2^b-1]$, and a parameter $\varepsilon \in (0,1)$.
   - **Output**: maintain a set of elements that contains all elements that appear more than $\varepsilon N$ times (i.e., the heavy hitters).
   - **Note**: we are allowed to output false positives (low hitters), but not allowed to miss any heavy hitter!

In this lecture, we will study problems 3 and 4 above.

## Heavy Hitters

Let us begin by studying a simple version of the heavy hitters problem, where $\varepsilon = 1/2$. In this case, we want to maintain a set of elements that appear more than $N/2$ times in the stream. Since this set can only contain at most one element (the majority element), it will be enough to maintain a set of size at most one.

### Majority Element

To solve the majority element problem (i.e., the heavy hitters problem with $\varepsilon = 1/2$), at time $t$ we maintain a set $S_t$ holding the element we currently believe to be the majority element. In addition, we maintain a counter $c_t$ that tracks the difference between the number of times this element has appeared and the number of times **any other element** has appeared.
This leads to the following algorithm:

**Majority Element Algorithm**:

**Initialization**: Set $S_0 = \emptyset$ and $c_0 = 0$.

**Processing**: when the next element $a_t$ arrives, do the following:

- If $c_{t-1} = 0$: set $S_t = \{a_t\}$ and $c_t = 1$.
- Else, if $a_t \in S_{t-1}$: set $S_t = S_{t-1}$ and increment the counter ($c_t = c_{t-1} + 1$).
- Else: set $S_t = S_{t-1}$, decrement the counter ($c_t = c_{t-1} - 1$), and discard $a_t$.

**Output**: return the element in $S_N$.
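The algorithm above (known as the Boyer–Moore majority vote) can be sketched in Python as follows; the function name is our own, and the candidate/counter pair plays the role of $S_t$ and $c_t$:

```python
def majority_candidate(stream):
    """One-pass majority vote using O(1) words of memory.

    If some element appears more than N/2 times, it is returned;
    otherwise the output may be an arbitrary element (a false
    positive), matching the guarantee discussed in the text.
    """
    candidate, count = None, 0  # play the roles of S_t and c_t
    for a in stream:
        if count == 0:
            candidate, count = a, 1  # adopt a new candidate
        elif a == candidate:
            count += 1               # one more copy of the candidate
        else:
            count -= 1               # discard a_t and one copy of the candidate
    return candidate
```

For example, on the stream `1, 2, 1, 3, 1, 1` the element `1` appears $4 > 6/2$ times and is returned.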

Let us now analyze the correctness of the above algorithm.

If there is no majority element, then the algorithm may output any element and still be correct: false positives are allowed, and there is no majority element to miss.

What happens if there is a majority element $m$? Note that every time we discard a copy of $m$, we also discard a copy of a different element. To see this, there are two cases in which a copy of $m$ can be discarded:

- If $a_t = m$, $m \not\in S_t$, and the counter is positive, then we discard this copy of $m$ and decrement $c_t$, the latter being equivalent to discarding a copy of the element stored in $S_t$.
- If $a_t \neq m$ and $m \in S_t$, then we discard $a_t$ and decrement $c_t$, the latter being equivalent to discarding a copy of $m$.

Since $m$ appears more than $N/2$ times, fewer than $N/2$ of its copies can be discarded in this way, so at the end of the stream $m$ is in $S_N$ and the counter $c_N$ is positive.

The space used by the algorithm is $O(b + \log N)$ bits, as we need to store the majority element and the counter.

### General Heavy Hitters

We are now ready to study the general heavy hitters problem, where $\varepsilon \in (0,1)$. We can assume that $\varepsilon < 1/2$, otherwise we can simply use the majority element algorithm.

**General Heavy Hitters Algorithm**:

**Initialization**: Set $k := \lceil 1/\varepsilon \rceil - 1$ and two arrays $T, C$ of size $k$.

- $T[i]$ will be used to store an element from the alphabet $\Sigma \ (= [-2^b+1, 2^b-1])$.
- $C[i]$ will be used to store the count of the element stored in $T[i]$.
- Initialize $T[i] \leftarrow \bot$ and $C[i] \leftarrow 0$ for all $i \in [k]$.

**Processing**: when the next element $a_t$ arrives, do the following:

- If there is $i \in [k]$ such that $a_t = T[i]$, then increment $C[i]$.
- Else, if there is $i \in [k]$ such that $C[i] = 0$, then set $T[i] \leftarrow a_t$ and $C[i] \leftarrow 1$.
- Else, decrement the count of all elements in $T$ and discard $a_t$.

**Output**: return array $T$.
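This algorithm is known as the Misra–Gries summary. A Python sketch (naming is our own; a dictionary folds the arrays $T$ and $C$ together, with absent keys standing for $\bot$ entries and zero counters):

```python
import math

def heavy_hitters(stream, eps):
    """One pass with k = ceil(1/eps) - 1 counters (Misra-Gries).

    Returns a dict mapping at most k elements to their estimated
    counts; every element occurring more than eps * N times in the
    stream is guaranteed to be among the keys.
    """
    k = math.ceil(1 / eps) - 1
    counters = {}  # element -> counter; plays the role of T and C
    for a in stream:
        if a in counters:
            counters[a] += 1          # a_t already stored: increment
        elif len(counters) < k:
            counters[a] = 1           # a free slot: store a_t
        else:
            # no free slot: decrement everything and discard a_t
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]  # counter hit zero: free the slot
    return counters
```

For instance, `heavy_hitters([1, 1, 1, 1, 1, 2, 3, 4, 2], 0.25)` uses $k = 3$ counters and returns `{1: 4, 2: 1}`: the heavy hitter `1` (true count $5 > \varepsilon N$) is reported, with an estimate that undershoots the true count by at most $N/(k+1)$, as the analysis below shows.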

Let us now analyze the correctness of the above algorithm.

For each element $e \in \Sigma$, let $est(e)$ be the estimate of the count of $e$ at any point in time. That is, $$est(e) = \begin{cases} C[j] & \text{if } T[j] = e \text{ for some } j \in [k], \\ 0 & \text{otherwise}. \end{cases}$$

We have the following lemma:

**Lemma 1**: Let $count(e)$ be the number of times $e$ appears in the stream.
Then,
$$0 \leq count(e) - est(e) \leq \dfrac{N}{k+1} \leq \varepsilon N.$$

**Proof**: note that $count(e) \geq est(e)$, because we never increase $est(e)$ without seeing $e$.

If we don't increase $est(e)$ by $1$ upon seeing a copy of $e$, it must be that $e$ is not stored in $T$ and no counter is zero, so we decrement the count of all elements in $T$. In this case, we have effectively discarded this copy of $e$ together with one copy of each of the $k$ elements in $T$, that is, $k+1$ elements in total. Since the stream contains only $N$ elements, this can happen at most $\dfrac{N}{k+1}$ times, which proves the lemma.

Equipped with the above lemma, we can now prove the correctness of the algorithm.

**Correctness of algorithm**: At the end of the stream (time $N$), it is enough to show that the array $T$ contains all the heavy hitters.
If $e$ is a heavy hitter, then $count(e) > \varepsilon N$.
Thus, by lemma 1, we have $est(e) \geq count(e) - \varepsilon N > 0$.
By definition of $est(e)$, this implies that $e$ is in $T$, which concludes the proof.
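Both guarantees can be sanity-checked empirically. The sketch below (a compact re-implementation of the algorithm, with a dictionary standing in for the arrays $T$ and $C$) plants one heavy hitter in an otherwise random stream and asserts Lemma 1 and the correctness claim; all names and parameters here are our own choices:

```python
import math
import random
from collections import Counter

def misra_gries(stream, k):
    """Misra-Gries summary with k counters: element -> estimated count."""
    counters = {}
    for a in stream:
        if a in counters:
            counters[a] += 1
        elif len(counters) < k:
            counters[a] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

random.seed(0)
eps, N = 0.1, 10_000
# One planted heavy hitter (element 0, count N/5 > eps * N) amid random noise.
stream = [0] * (N // 5) + [random.randint(1, 10**6) for _ in range(4 * N // 5)]
random.shuffle(stream)

k = math.ceil(1 / eps) - 1
est = misra_gries(stream, k)

# Lemma 1: 0 <= count(e) - est(e) <= N / (k + 1) for every element e.
true_counts = Counter(stream)
for e, c in true_counts.items():
    assert 0 <= c - est.get(e, 0) <= N / (k + 1)

# Correctness: the heavy hitter is among the reported elements.
assert 0 in est
```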

**Space complexity**: The space used by the algorithm is $O(k \cdot b + k \cdot \log N) = O\left(\dfrac{1}{\varepsilon} \cdot (b + \log N)\right)$ bits.