Data Streaming
In today’s world, we often need to process data streams that are too large to fit into memory. This happens, for instance, when we are processing data from sensor networks, satellite data feeds, database logs and transaction records, internet search logs, network traffic, and so on.
Two important aspects of the above setting are:
- The data does not come all at once, but rather arrives in a stream.
- The data is too large to fit into memory.
In such cases, we need to process the data in a single pass (or a constant number of passes), using sublinear space, and with limited time to process each item.
In the data stream model, we have a stream of elements $a_1, a_2, \ldots, a_n$, with the following assumptions:
- Each element takes $b$ bits to represent, and one usually assumes that $b$ is known in advance.
- Basic operations (comparison, arithmetic, bitwise operations) can be done in $O(b)$ time.
- Any algorithm is allowed a single pass (or a small number of passes) over the data.
- Bounded storage: the algorithm is allowed to use $O(\log^c n)$ bits of memory for some constant $c$, or $O(n^\alpha)$ bits for some $\alpha < 1$.
- Algorithms are allowed to use randomness (almost always necessary), so we are once again in the probabilistic model.
- We usually want answers that approximate the true answer with high probability.
The goal of a streaming algorithm is to minimize the space used and the processing time, while still providing a good approximation to the answer.
Here are some examples of problems that can be studied in the data streaming model:
- Sum of Elements:
  - Input: A stream of $n$ integers $a_1, \ldots, a_n$.
  - Output: maintain the current sum of the elements we have seen so far.
- Median of Elements:
  - Input: A stream of $n$ integers $a_1, \ldots, a_n$.
  - Output: maintain the current median of the elements we have seen so far.
- Distinct Elements:
  - Input: A stream of $n$ integers $a_1, \ldots, a_n$.
  - Output: maintain the number of distinct elements we have seen so far.
- Heavy Hitters:
  - Input: A stream of $n$ integers $a_1, \ldots, a_n$, and a parameter $k$.
  - Output: maintain a set of elements that contains all elements that appear more than $n/k$ times (i.e., the heavy hitters).
  - Note: the algorithm is allowed to output false positives (low hitters), but it is not allowed to miss any heavy hitter!
In this lecture, we will study the third and fourth problems above: distinct elements and heavy hitters.
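As a small illustration of the model, the first problem above (sum of elements) can be handled in one pass while storing only a single running total. The sketch below is a minimal example, not part of the lecture's algorithms; the function name `running_sums` is a hypothetical choice.

```python
from typing import Iterable, Iterator


def running_sums(stream: Iterable[int]) -> Iterator[int]:
    """Yield the sum of the elements seen so far after each arrival.

    Only one integer of state is kept, and each item is touched once.
    """
    total = 0
    for x in stream:
        total += x  # constant work per arriving element
        yield total


print(list(running_sums([4, -1, 7, 2])))  # prints [4, 3, 10, 12]
```

The median and distinct-elements problems do not admit such a direct solution in small space, which is one reason approximation and randomness enter the model.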
Heavy Hitters
Let us begin by studying a simple version of the heavy hitters problem, where $k = 2$: finding an element that appears more than $n/2$ times, i.e., a majority element.
Majority Element
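To solve the majority element problem (i.e., the heavy hitters problem with $k = 2$), we can use the following algorithm, which stores a single element $m$ and a single counter $c$.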
Majority Element Algorithm:
- Initialization: Set $m = \perp$ and $c = 0$.
- Processing: when the next element $a_i$ arrives, do the following:
  - if $c = 0$:
    - set $m = a_i$ and $c = 1$.
  - else:
    - if $a_i = m$, then increment $c$.
    - else, decrement $c$ and discard $a_i$.
- Output: return the element in $m$.
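The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `majority_candidate`, `candidate`, and `count` are hypothetical, with `candidate` playing the role of the stored element and `count` the role of the counter.

```python
from typing import Hashable, Iterable, Optional


def majority_candidate(stream: Iterable[Hashable]) -> Optional[Hashable]:
    """One pass over the stream, keeping one stored element and one counter.

    If some element appears more than half the time, it is returned.
    Otherwise the result is an arbitrary element (a permitted false positive),
    or None for an empty stream.
    """
    candidate = None  # stored element
    count = 0         # counter for the stored element
    for item in stream:
        if count == 0:
            # No element is currently held: adopt the arriving one.
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            # Discard the arriving item together with one stored copy.
            count -= 1
    return candidate


print(majority_candidate([3, 1, 3, 2, 3, 3, 1]))  # prints 3
```

In the example, 3 appears 4 times out of 7, so it is a majority element and survives the pairwise discards.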
Let us now analyze the correctness of the above algorithm.
If there is no majority element, then the algorithm could output any element, and we will be correct, as we can output false positives (and we are not missing any majority element, since it doesn’t exist).
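What happens if there is a majority element $e$? Every decrement discards two distinct elements: the arriving element $a_i$ and one copy of the stored element $m$. A copy of $e$ can only be discarded in one of the following two cases:
- If $a_i \neq e$ and $m = e$, then we discard $a_i$ and decrement the counter $c$, the latter being equivalent to discarding a copy of the element in $m$, namely $e$.
- If $a_i = e$ and $m \neq e$, then we discard $a_i$ and decrement the counter $c$, the latter being equivalent to discarding a copy of $m$.
In both cases, each discarded copy of $e$ is paired with a discarded element different from $e$. Since $e$ appears more than $n/2$ times, fewer than $n/2$ other elements are available for such pairings, so not all copies of $e$ can be discarded.
Thus, the majority element will always be in $m$ at the end of the stream.
The space used by the algorithm is $O(\log n + b)$ bits: $O(\log n)$ bits for the counter $c$ and $b$ bits for the stored element $m$.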
General Heavy Hitters
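We are now ready to study the general heavy hitters problem, where $k$ is arbitrary. The idea is to generalize the majority element algorithm by keeping $k - 1$ stored elements and counters instead of one.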
General Heavy Hitters Algorithm:
- Initialization: Set up two arrays $T$ and $C$, each of size $k - 1$.
  - Each entry $T[j]$ will be used to store an element from the alphabet $\Sigma$. Each entry $C[j]$ will be used to store the count of the element stored in $T[j]$.
  - Initialize $T[j] = \perp$ and $C[j] = 0$ for all $1 \le j \le k - 1$.
- Processing: when the next element $a_i$ arrives, do the following:
  - if there is $j$ such that $T[j] = a_i$, then increment $C[j]$.
  - else, if there is $j$ such that $C[j] = 0$, then set $T[j] = a_i$ and $C[j] = 1$.
  - else, decrement the count of all elements in $T$ and discard $a_i$.
- Output: return array $T$.
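A minimal Python sketch of the algorithm above follows. As an assumption made for brevity, a single dictionary stands in for the two arrays: a key is a stored element, its value is the count, and a missing key corresponds to a slot whose count is zero. The name `heavy_hitter_candidates` is hypothetical.

```python
from typing import Dict, Hashable, Iterable


def heavy_hitter_candidates(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """One pass with at most k - 1 counters.

    Every element appearing more than n/k times ends up as a key of the
    returned dictionary; some keys may be false positives.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            # The element is already stored: increment its count.
            counters[item] += 1
        elif len(counters) < k - 1:
            # A free slot exists: store the element with count 1.
            counters[item] = 1
        else:
            # All slots are busy: decrement every count and discard the item.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]  # the slot becomes free again
    return counters


stream = [1, 2, 1, 3, 1, 2, 4, 1, 2, 2]
print(sorted(heavy_hitter_candidates(stream, k=3)))  # prints [1, 2]
```

Here $n = 10$ and $k = 3$, so the heavy hitters are the elements appearing more than $10/3$ times, namely 1 and 2 (four occurrences each), and both are reported.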
Let us now analyze the correctness of the above algorithm.
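For each element $e \in \Sigma$, let $\mathrm{est}(e) = C[j]$ if $T[j] = e$ for some $j$, and $\mathrm{est}(e) = 0$ otherwise.
We have the following lemma:
Lemma 1: Let $f_e$ be the number of times $e$ appears in the stream.
Then, $f_e - n/k \le \mathrm{est}(e) \le f_e$.
Proof: note that $\mathrm{est}(e) \le f_e$, since the count of $e$ is incremented only when a copy of $e$ arrives.
If we don’t increase $\mathrm{est}(e)$ when a copy of $e$ arrives, then all $k - 1$ counts are nonzero and none of them belongs to $e$, so we decrement all counts and discard $e$. Likewise, $\mathrm{est}(e)$ can decrease only in such a decrement step, and it decreases by one. Each decrement step discards $k$ elements of the stream ($k - 1$ stored copies and the arriving element), so there are at most $n/k$ decrement steps. Since each decrement step loses at most one copy of $e$, we get $f_e - \mathrm{est}(e) \le n/k$.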
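As a sanity check, the bound can be tested empirically. The sketch below assumes the lemma takes the standard form $f_e - n/k \le \mathrm{est}(e) \le f_e$ for an algorithm with $k - 1$ counters; the helper names `misra_gries` and `lemma_holds` are hypothetical, and the dictionary-based summary is a compact stand-in for the two arrays.

```python
import random
from collections import Counter


def misra_gries(stream, k):
    """Return the k - 1 counter summary as a dict (element -> estimate)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every count; entries that reach zero are dropped.
            counters = {e: c - 1 for e, c in counters.items() if c > 1}
    return counters


def lemma_holds(stream, k):
    """Check f_e - n/k <= est(e) <= f_e for every element e in the stream."""
    n = len(stream)
    truth = Counter(stream)          # exact frequencies f_e
    est = misra_gries(stream, k)     # estimates; absent elements count as 0
    return all(truth[e] - n / k <= est.get(e, 0) <= truth[e] for e in truth)


rng = random.Random(0)  # fixed seed so the run is reproducible
stream = [rng.choice("aaaabbbcde") for _ in range(1000)]
print(lemma_holds(stream, k=4))  # prints True
```

A check like this does not replace the proof, but it makes the two-sided nature of the bound concrete: estimates never overshoot, and they undershoot by at most $n/k$.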
Equipped with the above lemma, we can now prove the correctness of the algorithm.
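Correctness of algorithm: At any time $i$, applying Lemma 1 to the first $i$ elements of the stream shows that every element $e$ that has appeared more than $i/k$ times so far has $\mathrm{est}(e) > 0$, and is therefore stored in $T$. In particular, at the end of the stream, $T$ contains all heavy hitters.
Space complexity: The space used by the algorithm is $O(k(\log n + b))$ bits: $k - 1$ counts of $O(\log n)$ bits each, and $k - 1$ stored elements of $b$ bits each.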