Identifying Frequent Items in Sliding Windows over On-Line Packet Streams

[.ps] [.pdf]


Queries that return a list of frequently occurring items are popular in the analysis of data streams such as real-time Internet traffic logs. In particular, Internet traffic patterns are believed to obey the power law, implying that most of the bandwidth is consumed by a small set of heavy users. While several results exist for computing frequent item queries using limited memory in the infinite stream model, in this paper we consider the limited-memory sliding window model. This model maintains the last N items that have arrived at any given time and forbids the storage of the entire window in memory. We present several deterministic algorithms for identifying frequent items in sliding windows using limited memory and making only one pass over the data, both under arbitrary distributions and assuming that packet types in each instance of the window conform to a multinomial distribution. The former is a straightforward extension of existing techniques and is shown to work well when tested on TCP traffic logs. Our algorithms for the multinomial distribution are shown to outperform classical inference based on random sampling from the sliding window, but lose their accuracy as predictors of item frequencies when the underlying distribution is not multinomial.

Bibtex Entry