author: | Jessica H. Fong and Martin Strauss |
---|---|
title: | An Approximate Lp Difference Algorithm for Massive Data Streams |
keywords: | streaming algorithms, data streams, Lp norms |
abstract: | Several recent papers have shown how to approximate the difference $\sum_i |a_i - b_i|$
or $\sum_i |a_i - b_i|^2$ between
two functions, when the function values $a_i$ and
$b_i$ are given in a data stream, and their order
is chosen by an adversary. These algorithms use little space (much
less than would be needed to store the entire stream) and little
time to process each item in the stream. They approximate with
small relative error. Using different techniques, we show how to
approximate the $L^p$-difference
$\sum_i |a_i - b_i|^p$
for any rational-valued $p \in (0,2]$, with comparable
efficiency and error. We also show how to approximate
$\sum_i |a_i - b_i|^p$
for larger values of $p$ but with a worse error
guarantee. Our results fill in gaps left by recent work, by
providing an algorithm that is precisely tunable for the application
at hand. These results can be used to assess the difference between
two chronologically or physically separated massive data sets,
making one quick pass over each data set, without buffering the data
or requiring the data source to pause. For example, one can use our
techniques to judge whether the traffic on two remote network
routers is similar without requiring either router to transmit a
copy of its traffic. A web search engine could use such algorithms
to construct a library of small "sketches," one for each distinct
page on the web; one can approximate the extent to which new web
pages duplicate old ones by comparing the sketches of the web pages.
Such techniques will become increasingly important as the enormous
scale, distributional nature, and one-pass processing requirements
of data sets become more commonplace.
If your browser does not display the abstract correctly (because of the different mathematical symbols) you can look it up in the PostScript or PDF files. |
reference: | Jessica H. Fong and Martin Strauss (2001), An Approximate Lp Difference Algorithm for Massive Data Streams , Discrete Mathematics and Theoretical Computer Science 4, pp. 301-322 |
bibtex: | For a corresponding BibTeX entry, please consider our BibTeX-file. |
ps.gz-source: | dm040217.ps.gz (0 K) |
ps-source: | dm040217.ps (240 K) |
pdf-source: | dm040217.pdf (165 K) |
The first source is the gzipped PostScript file, the second the plain PostScript file, and the third the PDF file for Adobe Acrobat Reader. Depending on how your web browser is set up, at least one of these should (after some time) open a window showing the full article. If not, contact your system administrator to have your browser configured correctly.
Due to limitations of your local software, the formats may display differently on your screen. If, for example, you use xpdf to view the PDF, some of the graphics in the file may not come across. On the other hand, PDF can provide links to sections, the bibliography, and external references that will not appear with PostScript.
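As an illustration of the sketching idea mentioned in the abstract, the following minimal Python sketch computes the exact difference $\sum_i |a_i - b_i|^p$ as a baseline and then estimates the $p = 2$ case from small linear sketches built with shared pseudorandom signs. This is not the algorithm of Fong and Strauss; it is the well-known random-sign approach for $p = 2$, shown only to make the notion of comparing sketches concrete. The names `lp_difference`, `L2Sketch`, `sign`, and `estimate_l2_difference`, the SHA-256-based sign function, and the choice of 64 counters are all assumptions made for this example.

```python
import hashlib
from statistics import mean


def lp_difference(a, b, p):
    """Exact sum_i |a_i - b_i|^p for sparse vectors stored as dicts (baseline)."""
    keys = set(a) | set(b)
    return sum(abs(a.get(i, 0) - b.get(i, 0)) ** p for i in keys)


def sign(seed, row, i):
    """Pseudorandom +1/-1 for (row, i), reproducible from a shared seed."""
    digest = hashlib.sha256(f"{seed}:{row}:{i}".encode()).digest()
    return 1 if digest[0] & 1 else -1


class L2Sketch:
    """Hypothetical linear sketch: each counter holds sum_i sign(row, i) * value_i."""

    def __init__(self, rows, seed):
        self.seed = seed
        self.counters = [0] * rows

    def update(self, i, value):
        # Process one stream item (index i, value); arrival order does not matter.
        for r in range(len(self.counters)):
            self.counters[r] += sign(self.seed, r, i) * value


def estimate_l2_difference(sa, sb):
    """Estimate sum_i |a_i - b_i|^2 from two sketches that share a seed."""
    return mean((x - y) ** 2 for x, y in zip(sa.counters, sb.counters))


a = {0: 3, 1: 1}
b = {0: 1, 2: 2}
print(lp_difference(a, b, 1))  # 5
print(lp_difference(a, b, 2))  # 9

sketch_a, sketch_b = L2Sketch(64, seed=7), L2Sketch(64, seed=7)
for i, v in a.items():
    sketch_a.update(i, v)
for i, v in b.items():
    sketch_b.update(i, v)
print(estimate_l2_difference(sketch_a, sketch_b))  # approximates the exact value 9
```

Because each sketch is a linear function of its stream, the two sites only need to agree on the seed and exchange their counters, not the underlying data. Averaging more counters tightens the estimate at the cost of more space.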