Working Groups in Data Analysis and Mining

IILS-0100921

Principal Investigator

Fred S Roberts
The Center for Discrete Mathematics and Theoretical Computer Science
Rutgers University
CoRE Building, 4th floor
96 Frelinghuysen Road
Piscataway, NJ  08854-8018
(732)-445-5930
Fax (732) 445-5932
Email froberts@dimacs.rutgers.edu
http://www.dimacs.rutgers.edu/People/Staff/froberts/index.html

Senior Personnel: J. Douglas Carroll, Phipps Arabie, Adam Buchsbaum, Patrick Fowler, Pierre Hansen, Rajeev Motwani, James Abello, Melvin Janowitz

Graduate Students: Sorin Alexe, Gabrielle Alexe, Akshay Vashisht, Igor Zverovich, Khaled Elbassioni, Ulas Akkucuk

Keywords

computer-generated conjectures, data mining, massive data sets, multidimensional scaling, streaming data analysis

Project Summary

This project supports three interdisciplinary working groups who are exploring specific research areas. Each group consists of researchers with expertise in the field and/or in one of the applications areas. The groups are concerned with streaming data analysis and mining, multidimensional scaling, and computer-generated conjectures.   Two of the groups have met twice (the third has another meeting planned), with informal presentations and lots of time for discussion and interaction. There have  been three public workshops and two tutorials with more formal presentations, where others in the community interested in the field were given an opportunity to learn about the working group's efforts and the working group had the opportunity to learn about the efforts of others.  We have invited subgroups of researchers back for more intensive collaborations on ideas generated in the preliminary meetings.  The fields of the groups’ members include theoretical and applied computer science, statistics, discrete and non-discrete mathematics, chemistry, astronomy, economics, psychology, information theory, management science, ecology, molecular biology, and others. DIMACS has a long and successful history of getting researchers with different backgrounds and approaches together, stimulating new collaborations, setting the agenda for future research, and acting as a catalyst for major new developments at the interface among disciplines and we built on this tradition with these working groups.

Publications and Products

B. Babcock and S. Babu and M. Datar and R. Motwani, "Chain: Operator Scheduling for Memory Minimization in Data Stream Systems". Proc. of the ACM Intl Conf. on Management of Data (SIGMOD 2003), To appear.

P. Bose, E. Kranakis, P. Morin, Y. Tang, “Bounds for frequency estimation of packet streams,” Proceedings of the International Colloquium on Structural Information and Communication (SIROCCO), 2003.

G. Brinkmann, G. Caporossi, and P. Hansen, "A Survey and New Results on Computer Enumeration of Polyhex and Fusene Hydrocarbons", Journal of Chemical Information and Computer Sciences, 43(3): 842-851 (2003).

G. Brinkmann, G. Caporossi, P. Hansen, "A Constructive Enumeration of Fusenes and Benzenoids", Journal of Algorithms, 45(2): 155-166 (2002).

E. Cohen and M. Strauss, "Maintaining Time-Decaying Stream Aggregates", Proceedings 22nd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems  (PODS 2003), 223 - 233 (2003).

B. Choi, M. Mahoui, D. Wood, "On the Optimality of the Holistic Algorithms on Twig Queries", Proc. 14th International Conference on Database and Expert Systems Applications (DEXA 2003) Springer, LNCS, 28-37 (2003).

M. Charikar, L. O'Callaghan, R. Panigrahy, "Better streaming algorithms for clustering problems", Proceedings of the thirty-fifth ACM symposium on Theory of computing, (2003), ACM Press, New York, NY, USA (2003), 30-39.

N. Dean, "Mathematical programming model of bond length and angular resolution for minimum energy carbon nanotubes", Proceedings of the 2001 1st IEEE Conference on Nanotechnology, 513-515 (2001).

S. M. Husband, C. P. Husband, N. Dean and J. M. Tour, "Mathematical Details for the Nanocell Approach to Molecular Electronics", submitted.

L.J. Hubert, P. Arabie, J.J. Meulman, "Modelling dissimilarity: Generalizing untramatic and additive tree representations", British Journal of Mathematical and Statistical Psychology, p. 103-123, vol. 54, (2001).

L.J. Hubert, Phipps Arabie, J.J. Meulman, "Linear Unidimensional Scaling in the L2-Norm: Basic Optimization Methods Using MATLAB", Journal of Classification, p. 303-328, vol. 19, (2002).

M. Saks and X. Sun, "Space Lower Bounds for Distance Approximation in the Data Stream Model", Proc. 34th Annual ACM STOC, 360-369, (2002).

 

S. Fajtlowicz, P. Fowler, P. Hansen, M. Janowitz and F. Roberts (eds.), Graphs and Discovery, Proceedings of the DIMACS Workshop on Computer-Generated Conjectures from Graph Theoretic and Chemical Databases, American Math. Society. In preparation. http://dimacs.rutgers.edu/SpecialYears/2001_Data/Volume_CompGen.html

Project Impact

There have been important developments in each of the working groups. The interconnections among people working in data analysis but from different points of view were particularly striking. In the case of multidimensional scaling, experts in cluster analysis were matched with researchers in combinatorial optimization, for example. In streaming data analysis, those working on frequency statistics were matched with those working on probabilistic 'sketches.' In the case of computer-generated conjectures, developers of different methods compared and contrasted their software, with lively discussions as to the pros and cons, advantages and limitations, of each method. There were many connections made across disciplines.  In the MDS working group, participants represented areas such as chemometrics, marketing, social networks, ecology, biomolecular databases, social and clinical psychology. Of particular note is the connection to industry, with Kraft Foods, AT&T Labs, etc. participating. Several new applications of MDS have been jump-started or significantly enhanced; image visualization, structure of molecules, MRI imaging, distance geometry, data mining.  In the Streaming Data Analysis and Mining working group, applications areas discussed included financial modeling, astrophysics, and homeland security.  In the Computer-generated Conjectures working group, there was a strong mix of computer scientists, mathematicians, and chemists. A large number of the problems dealt with chemical structures. One of the most exciting results obtained was proof of a computer-generated conjecture about the separator of fullerenes.

Goals, Objectives and Targeted Activities

The goals were to formulate problems, share ideas and approaches, and set an agenda for future interactions. The emphasis was on unifying promising approaches that come from many distinct communities of researchers. The topics of interest included methodologies and algorithms for data mining, including clustering, discriminant analysis, enumerative methods, and multidimensional scaling; the increasingly abstract formulations and models of data mining questions using logical methods, conceptual clustering, learning and discovery that are critical in data mining and in particular for automatic, intelligent decision making; and the special problems that arise from applications to such important areas as fraud and intrusion detection, web mining, medical and scientific databases, marketing, and natural language data. Interdisciplinary working groups were formed and meetings were held from which fruitful collaborations developed.

Area Background

Theoretical and algorithmic approaches to data analysis have played a central role in the development of modern methods for handling data. Now, however, the massive amounts of data gathered in important modern applications ranging from the Internet to credit card fraud detection to astronomy and medicine have dramatically changed the requirements for algorithms and provide ample motivation for a great deal of new theoretical development. We need methods for data analysis and mining that scale to the huge volumes of data in such applications.

Area References (web pages of bibliographies)

Streaming Data Analysis and Mining http://dimacs.rutgers.edu/SpecialYears/2001_Data/Streaming/references.html

Multidimensional Scaling http://dimacs.rutgers.edu/SpecialYears/2001_Data/Algorithms/NewReferences.htm

Computer-generated Conjectures http://dimacs.rutgers.edu/SpecialYears/2001_Data/Conjectures/references.html

Potential Related Projects

DIMACS Monitoring Message Streams Research Project with its associated Seminar Series; The Special Focus on Communication Security and Information Privacy includes a Working Group on Privacy-Preserving Data Mining; The Special Focus on Computational and Mathematical Epidemiology includes a Working Group on Adverse Event/Disease Reporting, Surveillance and Analysis

Project Websites

http://dimacs.rutgers.edu/SpecialYears/2001_Data/workinggroups.html is the page for the entire project.

 

Each working group has its own web page with links to software, papers, and open problems.

Streaming Data Analysis and Mining http://dimacs.rutgers.edu/SpecialYears/2001_Data/Streaming_Web/Streaming_Home.htm

Multidimensional Scaling http://dimacs.rutgers.edu/SpecialYears/2001_Data/Algorithms/AlgorithmsMS.htm

Computer-generated conjectures http://dimacs.rutgers.edu/SpecialYears/2001_Data/Conjectures/GC_Discovery/

Illustrations

An algorithm for solving the distance geometry problem http://dimacs.rutgers.edu/SpecialYears/2001_Data/Algorithms/MDSTalk.ppt

Graph theory and the Kőnigsberg bridges: http://www.icmsstephens.com/graph.htm

Graph gallery showcases Tom Sawyer Software: http://www.tomsawyer.com/gallery/gallery.html

Online Software

GRAPH an expert system for the classification and extension of knowledge in the field of graph theory,  Cvetkovic, Kraus

Project Vega Home Page  http://vega.ijp.si/Htmldoc/Vega03.html

The MOLGEN pages: http://www.mathe2.uni-bayreuth.de/molgen4/ and http://www.mathe2.uni-bayreuth.de/markus/ei-ms/

The design theory page: http://btm2xl.mat.uni-bayreuth.de/discreta/

Graph generators:http://www.mathe2.uni-bayreuth.de/markus/markus.html and http://www.mathe2.uni-bayreuth.de/thomas_g/gradpart.html

Programs for generation of certain types of planar graph, plantri and fullgen: http://cs.anu.edu.au/people/bdm/plantri/

Graphviz - open source graph drawing software: http://www.research.att.com/sw/tools/graphviz/

GRaph INterface (GRIN): http://www.geocities.com/pechv_ru/

A toolkit for graph editors and graph algorithms http://www.infosun.fmi.uni-passau.de/Graphlet/

GTL - Graph Template Library Class Hierarchy http://www.infosun.fmi.uni-passau.de/GTL/index.html

DIMACS Projects: LINK: http://dimacs.rutgers.edu/~berryj/LINK.html

Open Directory http://dmoz.org/Science/Math/Combinatorics/Software/

Online Data

A database with graph visualizations: http://btm2xg.mat.uni-bayreuth.de/GRAPHDB/

Other Resources

Multidimensional Scaling Software:

Xgobi/XGvis: systems for multivariate data visualization /multidimensional scaling and graph layout  in any dimension

Clustering software (source + executables) and documentation

GTREE - program to fit additive trees to proximity data (Corter, 1998)

ADDTREE/P - Corter's (1982) version of Sattah & Tversky's ADDTREE algorithm

EXTREE - program to fit extended trees to proximity data (Corter & Tvesky, 1986)

MDSX - a MDS Library of Professor Tony Coxon 

Willem Heiser's PROXSCAL 

Kruskal's KYST2A

Some PC versions of various MDS software, including both KYST2A and the SINDSCAL program for three-way individual differences MDS implemented by Phipps Arabie (some versions are available at Netlib  MDS :  http://www.netlib.org/mds/)

Leland Wilkinson’s SYSTAT

 

A tutorial on combinatorial objects: http://www.mathe2.uni-bayreuth.de/people/buch.ps.gz
Combinatorics and chemistry, background: http://www.mathe2.uni-bayreuth.de/kerber/dimacs.psl
Some Graph Theorists' Home Page: http://www.geocities.com/jlwu65/theorist.htm

Doing chemistry in a virtual world:   http://pubs.acs.org/hotartcl/cenear/961209/virtual.html

Research Groups   http://www.math.hkbu.edu.hk/Research/research_groups.html#graph

The Mathematical Atlas http://www.math.niu.edu/~rusin/known-math/index/05CXX.html

Geometry.Net – Pure And Applied_Math: Graph Theory: http://www.geometry.net/pure_and_applied_math/graph_theory.html
Open Questions in Graph Theory http://www.cs.uwa.edu.au/~gordon/remote/questions.html

Open Problems in Discrete Math http://www.maths.qmw.ac.uk/~pjc/oldprob.html

http://www.math.binghamton.edu/zaslav/Bsg/sgbgprobs.html, http://www.emba.uvm.edu/~archdeac/newlist/problems.html

http://www.eecs.umich.edu/~qstout/constantques.html

Graph Theory  with Graffiti http://cms.dt.uh.edu/faculty/delavinae/StudentPictures/B_Chervenka_Poster_Contents.pdf

Tutte's 3, 4, and 5-flow conjectures   http://www.math.princeton.edu/~matdevos/open/345flow.html  

Conjectures on edge-connectivity        http://www.math.princeton.edu/~matdevos/open/edgecon.html