DIMACS Workshop on Streaming Data Analysis and Mining

November 5, 2001
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Organizers:
Adam Buchsbaum, AT&T Labs - Research, alb@research.att.com
Rajeev Motwani, Stanford University, rajeev@cs.stanford.edu
Jennifer Rexford, AT&T Labs, jrex@research.att.com
Presented under the auspices of the Special Focus on Data Analysis and Mining.

Working Group on Streaming Data Analysis and Mining Home Page.


This material is based upon work supported by the National Science Foundation under Grant No. 0100921.


Abstracts:


1. Brian Babcock (Stanford)
Title: Characterizing Memory Requirements for Queries over Continuous Data Streams
Slides(.ppt.gzip)

Abstract: Many queries over continuous data streams can require unbounded storage to answer (that is, storage proportional to the size of the data streams). When building a system for processing queries over data streams, knowing the maximum memory requirements for a query in advance can be helpful in allocating resources among concurrent queries. We consider conjunctive queries with equality and inequality predicates over multiple data streams. For this class of queries, we specify an algorithm for determining whether or not a query can be answered using a bounded amount of memory, regardless of the size and contents of the input data streams. Our algorithm is constructive: if a query is answerable using bounded memory, our algorithm outputs a description of how to build constant-sized synopses of the data streams and then use those synopses to answer the query.
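As a toy illustration of the kind of constant-sized synopsis the abstract refers to (this example is mine, not taken from the talk): the existential query "is there an s in S and a t in T with s.a > t.b?" can be answered exactly while storing only two numbers, one per stream.

    # Toy illustration (not the talk's algorithm): a constant-sized synopsis
    # per stream -- the running maximum of S.a and the running minimum of T.b --
    # suffices to answer "does some pair (s, t) satisfy s.a > t.b?".
    class MaxMinJoinMonitor:
        def __init__(self):
            self.max_s_a = float("-inf")   # synopsis of stream S
            self.min_t_b = float("inf")    # synopsis of stream T

        def on_s(self, a):
            self.max_s_a = max(self.max_s_a, a)

        def on_t(self, b):
            self.min_t_b = min(self.min_t_b, b)

        def answer(self):
            # True iff some pair (s, t) seen so far satisfies s.a > t.b
            return self.max_s_a > self.min_t_b
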


2. Kevin Chen (University of California, Berkeley)
Title: Finding Frequent Items in Data Streams
Slides(.ppt.gzip)

Abstract: We present algorithms for estimating the most frequent items in a data stream using limited storage space. Our algorithms achieve better space bounds than the best known algorithms for this problem for several common distributions. We introduce the notion of a count sketch, which lets us estimate the frequency of the most common items in a stream. The sketches for two streams can be added or subtracted, so our algorithm is easily adapted to estimate the items with the largest change in frequency between two data streams. Previous approaches were not able to solve this latter problem. This is joint work with Moses Charikar and Martin Farach-Colton.
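A minimal sketch of the count-sketch idea: hash each item to a counter per row with a random sign, and report the median over rows. The hashing below (Python's built-in hash with per-row salts) stands in for the pairwise-independent hash families the analysis assumes.

    # Minimal count-sketch sketch; the salted built-in hash is a stand-in for
    # proper pairwise-independent hash functions.
    import random

    class CountSketch:
        def __init__(self, rows=5, cols=256, seed=0):
            rng = random.Random(seed)
            self.rows, self.cols = rows, cols
            self.salts = [(rng.random(), rng.random()) for _ in range(rows)]
            self.table = [[0] * cols for _ in range(rows)]

        def _bucket(self, r, item):
            return hash((self.salts[r][0], item)) % self.cols

        def _sign(self, r, item):
            return 1 if hash((self.salts[r][1], item)) % 2 else -1

        def update(self, item, count=1):
            for r in range(self.rows):
                self.table[r][self._bucket(r, item)] += self._sign(r, item) * count

        def estimate(self, item):
            vals = sorted(self._sign(r, item) * self.table[r][self._bucket(r, item)]
                          for r in range(self.rows))
            return vals[len(vals) // 2]   # median over rows

        def subtract(self, other):
            # Sketches are linear, so differencing two sketches (built with the
            # same seed) estimates each item's change in frequency.
            for r in range(self.rows):
                for c in range(self.cols):
                    self.table[r][c] -= other.table[r][c]

Because the sketch is linear, subtracting the sketch of yesterday's stream from today's gives frequency-change estimates directly, which is the "largest change" application mentioned above.
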
3. Mayur Datar (Stanford)
Title: Maintaining Stream Statistics over Sliding Windows
Slides(.ppt.gzip)

Abstract: We consider the problem of maintaining aggregates and statistics over data streams with respect to the last $N$ data elements seen so far. We refer to this model as the {\em sliding window} model. We consider the following basic problem: given a stream of bits, maintain a count of the number of $1$'s among the last $N$ elements of the stream. We show that using $O(\frac{1}{\epsilon} \log^2 N)$ bits of memory, we can estimate the number of $1$'s to within a factor of $1 + \epsilon$. We also give a matching lower bound of $\Omega(\frac{1}{\epsilon} \log^2 N)$ memory bits for any deterministic or randomized algorithm. We extend our scheme to maintain the sum of the last $N$ positive integers, and we provide matching upper and lower bounds for this more general problem as well. We apply our techniques to obtain efficient algorithms for the $L_p$ norms (for $p \in [1,2]$) of vectors under the sliding window model. Using the algorithm for the basic counting problem, many other techniques can be adapted to the sliding window model with a multiplicative overhead of $O(\frac{1}{\epsilon} \log N)$ in memory and a $1 + \epsilon$ factor loss in accuracy. These include maintaining approximate histograms, hash tables, and statistics or aggregates such as sums and averages.
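A simplified sketch of one way to realize the basic counting problem with bucket merging (the details here are simplified and not necessarily the talk's exact scheme): keep buckets of 1's whose sizes are powers of two, at most k per size, merging the two oldest whenever a size overflows; the only uncertainty is the partially expired oldest bucket.

    # Simplified bucket-merging counter for 1's in the last `window` bits.
    # Larger k gives a smaller relative error.
    from collections import deque

    class SlidingWindowCounter:
        def __init__(self, window, k=2):
            self.window, self.k = window, k
            self.buckets = deque()               # (timestamp, size), newest first
            self.time = 0

        def add(self, bit):
            self.time += 1
            # drop buckets that have fallen entirely out of the window
            while self.buckets and self.buckets[-1][0] <= self.time - self.window:
                self.buckets.pop()
            if bit != 1:
                return
            self.buckets.appendleft((self.time, 1))
            size = 1
            while True:
                same = [i for i, b in enumerate(self.buckets) if b[1] == size]
                if len(same) <= self.k:
                    break
                i, j = same[-2], same[-1]        # two oldest buckets of this size
                self.buckets[j] = (self.buckets[i][0], 2 * size)  # keep newer timestamp
                del self.buckets[i]
                size *= 2

        def estimate(self):
            if not self.buckets:
                return 0
            total = sum(s for _, s in self.buckets)
            return total - self.buckets[-1][1] // 2   # oldest bucket is only partial
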
4. Phil Gibbons (Bell Labs)
Title: Distinct Sampling of Streams: Theory and Practice

Abstract: One of the earliest interesting streaming algorithms is due to Flajolet and Martin, who described in the early 1980s an O(log n)-space synopsis for approximately counting the number of distinct values in a data stream. In many applications, however, the distinct-values query of interest is not over the entire stream, but over a subset of the stream specified by an ad hoc predicate. What is needed, then, is a single synopsis that can handle predicates specified only after the stream has flown by. We show how a new synopsis, called a "distinct sample," can handle this more general problem. We present both analytical guarantees and experimental results demonstrating the effectiveness of our synopsis. Moreover, we show how distinct samples have broad applicability in practice for session-based event-recording environments.
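A simplified sketch of the flavor of such a synopsis (not the exact construction from the talk): keep the distinct values whose hashes fall in a geometrically shrinking range; an arbitrary predicate can then be applied to the retained sample after the fact, and the matching count scaled back up.

    # Simplified distinct sample: values survive subsampling with probability
    # 2**(-level), so scaling a filtered count by 2**level is unbiased.
    import hashlib

    def _level(value):
        # number of trailing zero bits of a 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(repr(value).encode()).digest()[:8], "big")
        return (h & -h).bit_length() - 1 if h else 64

    class DistinctSample:
        def __init__(self, capacity=1000):
            self.capacity = capacity
            self.level = 0
            self.sample = {}                  # value -> its hash level

        def add(self, value):
            lvl = _level(value)
            if lvl >= self.level:
                self.sample[value] = lvl
            while len(self.sample) > self.capacity:
                self.level += 1               # subsample: keep roughly half the values
                self.sample = {v: l for v, l in self.sample.items() if l >= self.level}

        def count_distinct(self, predicate=lambda v: True):
            matches = sum(1 for v in self.sample if predicate(v))
            return matches * (2 ** self.level)

For example, count_distinct(lambda v: v.startswith("908")) can be evaluated long after the stream has passed, even though the predicate was not known while the synopsis was being built.
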
5. Greg Humphreys (Stanford)
Title: A Streaming Framework for Scalable Visualization on Clusters
Slides(.ppt.gzip)

Abstract: I'll be talking about my research on cluster graphics. Our group has developed systems for supporting unmodified applications in novel display environments, as well as enabling cluster-parallel applications to use the aggregated rendering power of commodity graphics accelerators housed in the cluster nodes. To make this possible, we abstract the entire graphics API as a stream, and we manipulate these streams to form traditional parallel graphics pipelines using commodity parts as building blocks. I'll describe the basic parallel rendering problem and show how stream manipulation has allowed us to build a very flexible and general system based entirely on commodity technology. Our software is currently in use in over 100 installations worldwide, and its underlying stream-processing technology has been used to accomplish tasks other than scalability, such as non-photorealistic rendering.
6. Raul Jimenez (Rutgers)
Title: Massive Lossless Data Compression and Multiple Parameter Estimation from Galaxy Spectra
Slides(.ps.gzip)

Abstract: We present a method for radical linear compression of datasets where the data depend on some number $M$ of parameters. We show that, if the noise in the data is independent of the parameters, we can form $M$ linear combinations of the data which contain as much information about all the parameters as the entire dataset, in the sense that the Fisher information matrices are identical; i.e., the method is lossless. We explore how these compressed numbers fare when the noise is dependent on the parameters, and show that the method, although not precisely lossless, increases errors by a very modest factor. The method is general, but we illustrate it with a problem for which it is well suited: galaxy spectra, whose data typically consist of $\sim 10^3$ fluxes, and whose properties are set by a handful of parameters such as age, brightness, and a parametrised star formation history. The spectra are reduced to a small number of data, which are connected to the physical processes entering the problem. This data compression offers the possibility of a large increase in the speed of determining physical parameters. This is an important consideration as datasets of galaxy spectra reach $10^6$ in size and the complexity of model spectra increases. In addition to this practical advantage, the compressed data may offer a classification scheme for galaxy spectra which is based rather directly on physical processes.

Reference: Alan F. Heavens, Raul Jimenez, and Ofer Lahav, Massive lossless data compression and multiple parameter estimation from galaxy spectra, Mon. Not. R. Astron. Soc. 317, 965-972 (2000).
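A small numerical sketch of the single-parameter case under the stated assumptions (Gaussian noise with a parameter-independent covariance C); the paper's multi-parameter construction additionally orthogonalizes successive weighting vectors, which is omitted here.

    # Single-parameter linear compression: the weighting vector b = C^{-1} dmu/dtheta,
    # normalized so Var(b.x) = 1, yields one number b.x carrying the same Fisher
    # information about theta as the full data vector x.
    import numpy as np

    def compression_vector(dmu_dtheta, C):
        """dmu_dtheta: derivative of the mean data vector w.r.t. the parameter;
        C: noise covariance matrix, assumed independent of the parameter."""
        b = np.linalg.solve(C, dmu_dtheta)
        return b / np.sqrt(dmu_dtheta @ b)     # normalize to unit variance

    def compress(x, b):
        return float(b @ x)                    # one compressed number per parameter
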
7. Sampath Kannan (U. Penn)
Title: Open Problems in Data Stream Algorithmics
Slides(.ppt.gzip)
8. Stephen North (AT&T Labs)
Title: A Large-Scale Network Visualization System

Abstract: Visualization is part of a feedback loop in analyzing large data sets. Far from being just the end stage of analysis, visualization itself can provide new metaphors for expressing data, which can help spot trends missed by other methods of analysis. We have built an interactive viewer for large data sets of events and transactions on networks, with the initial goal of operating on a day's worth of telephone call records (about 400 million). We then adapted the viewer to a large frame/ATM packet data network. I will talk about the engineering of this system, its applications, and some ideas about future goals for such systems.
9. Liadan O'Callaghan (Stanford)
Title: Clustering Data Streams
Slides(.ppt.gzip)

Abstract: Clustering a set of data points means grouping them according to some notion of similarity. For example, we might want to cluster web pages by content, putting similar web pages into the same group and ensuring that web pages that are very different are in different groups. We might want to classify phone call records so that we can identify which customers would benefit from certain promotions, or so that we can more easily identify fraud.

For many recent clustering applications, the {\em data stream} model is more appropriate than the conventional data set model. A stored data set is an appropriate model when significant portions of the data are queried again and again, and updates are small and/or relatively infrequent. In contrast, a data stream is an appropriate model when a large volume of data is arriving continuously and it is either unnecessary or impractical to store the data in some form of memory. Data streams are also an appropriate model of access to large data sets stored in secondary memory, where performance requirements necessitate access via linear scans. In the data stream model~\cite{henzinger98digital}, the data points can only be accessed in the order in which they arrive. Random access to the data is not allowed; memory is assumed to be small relative to the number of points, so only a limited amount of information can be stored. In general, algorithms operating on streams are restricted to fairly simple calculations because of these time and space constraints. The challenge facing algorithm designers is to perform meaningful computation under these restrictions.

I will discuss an approximation algorithm for clustering according to the $k$-Median objective function; this algorithm, due to Charikar and Guha, is based on their local search algorithm for the related Facility Location problem. Next I will discuss improvements made by Meyerson that maintain most of the theoretical soundness of the original algorithm but allow faster clustering, given some additional (reasonable) assumptions. Finally, I will discuss a general method for applying any clustering algorithm to data streams, even if the algorithm of choice is not itself usable on data streams; this method has approximation guarantees for $k$-Median. A schematic of that general method appears below.
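A schematic of the chunk-and-recluster strategy mentioned at the end of the abstract: cluster each chunk of the stream with any plugged-in k-Median-style routine, retain only the weighted centers, and re-cluster those centers at the end. Here base_cluster is a placeholder for whatever clustering algorithm is chosen.

    # General method: stream -> per-chunk weighted centers -> final clustering.
    # `base_cluster(weighted_points, k)` returns k (center, weight) pairs.
    def cluster_stream(chunks, k, base_cluster):
        retained = []                                    # weighted centers so far
        for chunk in chunks:
            weighted_points = [(p, 1) for p in chunk]
            retained.extend(base_cluster(weighted_points, k))
        return base_cluster(retained, k)                 # final k centers

Only the retained centers ever need to be in memory, so the memory footprint depends on the number of chunks and k, not on the length of the stream.
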
10. Jennifer Rexford (AT&T Labs)
Title: Computing Traffic Demands From Flow-Level Measurements
Slides(.ppt.gzip)

Abstract: Controlling the distribution of traffic in a large Internet Service Provider (ISP) backbone requires an understanding of the network topology, traffic demands, and routing policies. Shifts in user behavior, the emergence of new applications, and the failure of network elements can result in significant (and sudden) fluctuations in the distribution of traffic across the backbone. Network operators can alleviate congestion by adapting the routing configuration to the prevailing traffic. This talk describes how to acquire an accurate, network-wide view of the traffic demands on an operational ISP backbone. We show that traffic demands can be computed based on flow-level measurements collected at each entry point to the provider network and information about the destinations reachable from each exit point. We describe our experiences applying this methodology to routing and measurement data collected in the AT&T IP backbone.

References:
1) Anja Feldmann, Albert Greenberg, Carsten Lund, Nick Reingold, Jennifer Rexford, and Fred True, Deriving traffic demands for operational IP networks: Methodology and experience, IEEE/ACM Transactions on Networking, June 2001, pp. 265-279.
2) Anja Feldmann, Albert Greenberg, Carsten Lund, Nick Reingold, and Jennifer Rexford, NetScope: Traffic engineering for IP networks, IEEE Network Magazine, March/April 2000, pp. 11-19.
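A schematic of the aggregation step, with invented record fields (this is an illustration of the general idea, not the papers' exact procedure): each flow measured at an ingress link is attributed to the set of egress points that can reach its destination prefix, yielding a (possibly point-to-multipoint) demand.

    # Aggregate ingress flow measurements into (ingress, egress-set) demands.
    from collections import defaultdict

    def compute_demands(flow_records, egress_points_for_prefix, longest_match):
        """flow_records: iterable of (ingress_link, dst_address, byte_count);
        egress_points_for_prefix: prefix -> frozenset of egress links;
        longest_match: function mapping a destination address to its prefix."""
        demands = defaultdict(int)               # (ingress, egress set) -> bytes
        for ingress, dst, byte_count in flow_records:
            prefix = longest_match(dst)
            egress_set = egress_points_for_prefix.get(prefix, frozenset())
            demands[(ingress, egress_set)] += byte_count
        return demands
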
11. Anne Rogers (AT&T Labs)
Title: Analyzing Transaction Streams with Hancock

Abstract: Massive transaction streams present a number of opportunities for data mining techniques. Transactions might represent calls on a telephone network, commercial credit card purchases, stock market trades, or HTTP requests to a web server. While historically such data have been collected for billing or security purposes, they are now being used to discover how clients use the underlying services. For several years, we have computed evolving profiles (called signatures) of the clients in transaction streams using handwritten C code. The signature for each client captures the salient features of the client's transactions through time. These programs were carefully hand-optimized to ensure that the data could be processed in a timely fashion. They achieved the necessary performance, but at the expense of readability, which led to programs that were difficult to verify and maintain. Hancock is a domain-specific language created to analyze transaction streams efficiently without sacrificing readability. In this talk, I will describe the obstacles to computing with large streams and explain how Hancock addresses these problems. Hancock is joint work with Corinna Cortes, Kathleen Fisher, Karin Hogstedt, Daryl Pregibon, and Fred Smith.
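An illustrative, plain-Python version of an evolving signature (not Hancock code, and with invented transaction fields): each client's profile is a small set of exponentially decayed counts, so recent behavior dominates while older behavior fades.

    # Per-client signatures as exponentially decayed counts.
    from collections import defaultdict

    DECAY = 0.95                                 # weight retained per transaction

    def update_signatures(signatures, transactions):
        """signatures: client -> dict of decayed counts;
        transactions: iterable of (client, duration_minutes, is_international)."""
        for client, minutes, intl in transactions:
            sig = signatures.setdefault(client, defaultdict(float))
            sig["calls"] = DECAY * sig["calls"] + 1
            sig["minutes"] = DECAY * sig["minutes"] + minutes
            sig["intl_calls"] = DECAY * sig["intl_calls"] + (1 if intl else 0)
        return signatures

Ratios such as intl_calls / calls then give a smoothed estimate of a client's recent international calling fraction without storing the transaction history.
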
12. Martin Strauss (AT&T Labs)
Title: Fast, Small-Space Algorithms for Approximate Histogram Maintenance

Abstract: Consider an array, A, of N values, defined implicitly by a stream of updates of the form "add/subtract 3 to/from A[5]". We find a B-bucket piecewise-constant representation H for A whose sum-square-error is at most (1+epsilon) times the error of the optimal representation. We consider three computational costs: the time to process an update, the time to reconstruct H from the synopsis data structure, and the total space used; each of these is polynomial in B, log(N), and 1/epsilon. We also give extensions to piecewise-linear representations, Haar wavelet representations, and piecewise-constant representations under absolute (L1) error. Recently this problem has received attention, especially in the database community, where it corresponds to the problem of maintaining approximate histograms. Our work advances previous streaming work on arrays because we estimate the entire array, not just a statistical summary such as a norm or quantile; we hope our results will therefore be more useful in practice than simple summaries. Furthermore, unlike a norm, which has a concise, data-independent, declarative definition, the best representation for A is defined as the optimum over the exponentially large set of all representations, so estimating it quickly requires a number of optimization techniques. Joint work with Anna C. Gilbert, Sudipto Guha, Piotr Indyk, Yannis Kotidis, and S. Muthukrishnan.
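For orientation, the baseline being approximated: given the full array offline, the optimal B-bucket piecewise-constant representation under sum-square-error can be found by a standard dynamic program. This is the target of the (1+epsilon) guarantee above, not the streaming algorithm itself.

    # Offline optimal B-bucket histogram error by dynamic programming (assumes
    # 1 <= B <= len(A)); O(n^2 * B) time, for reference only.
    def optimal_histogram_error(A, B):
        n = len(A)
        prefix = [0.0] * (n + 1)
        prefix_sq = [0.0] * (n + 1)
        for i, x in enumerate(A):
            prefix[i + 1] = prefix[i] + x
            prefix_sq[i + 1] = prefix_sq[i] + x * x

        def sse(i, j):                    # error of one bucket covering A[i:j]
            s, sq, m = prefix[j] - prefix[i], prefix_sq[j] - prefix_sq[i], j - i
            return sq - s * s / m

        INF = float("inf")
        err = [[INF] * (B + 1) for _ in range(n + 1)]
        err[0][0] = 0.0
        for j in range(1, n + 1):
            for b in range(1, min(B, j) + 1):
                err[j][b] = min(err[i][b - 1] + sse(i, j) for i in range(b - 1, j))
        return err[n][B]
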
13. Nitin Thaper (MIT)
Title: Space-Efficient Algorithms for Maintaining Multi-Dimensional Histograms
Slides(.ps.gzip)

Abstract: Maintaining approximate histograms of data has recently attracted a significant amount of attention in the database community. However, most of the research has focused on the case of one-dimensional data. In this talk we present novel algorithms for maintaining histograms over data in two or more dimensions, together with their experimental evaluation. Unlike earlier work in this area, our algorithms also have *provable* guarantees on their performance. This is joint work with S. Guha, P. Indyk, and N. Koudas.

Working Group Presentations:

1. Corinna Cortes (AT&T Labs - Research)
Title: Communities of Interest

Reference: Corinna Cortes, Daryl Pregibon, and Chris Volinsky, Communities of Interest, in 4th Int'l Symp. on Intelligent Data Analysis (IDA 2001), Lisbon, Portugal, 2001.

Document last modified on April 16, 2002.