DIMACS Working Group on Monitoring Message Streams
Retrospective and Prospective Event Detection
This working group meets every two weeks and comprises DIMACS researchers
working on the project "Monitoring Message Streams: Retrospective and
Prospective Event Detection." The National
Science Foundation supports this project through its KDD program. Fred
Roberts is the principal investigator. Paul Kantor is the project coordinator.
The group is interested in topics such as:
 large-scale text categorization
 streaming data
 dimensionality reduction
Some documents are password-protected. Please contact madigan@stat.rutgers.edu,
froberts@dmac.rutgers.edu, or kantorp@cs.rutgers.edu for access.
BBR/BMR/BXR Software
Various versions of the software are at
bayesianregression.org
Schedule, Personnel, and Internal Documents
Papers

Algorithms for Sparse Linear Classifiers in the Massive Data Setting.
Balakrishnan, S. and Madigan, D., Journal of Machine Learning Research, 2008, to appear.

LAPS: Lasso with Partition Search.
Balakrishnan, S. and Madigan, D., ICDM 2007.

A Bayesian feature selection score based on Naive Bayes models.
Eyheramendy, S. and Madigan, D., In: Computational Methods
of Feature Selection, H. Liu and H. Motoda, Editors, 2007.

A Flexible Bayesian Generalized Linear Model for Dichotomous Response
Data with an Application to Text Categorization.
Eyheramendy, S. and Madigan, D. In: IMS Lecture Notes - Monograph Series, Volume
54, Complex datasets and inverse problems: tomography, networks and
beyond. Regina Liu, William Strawderman & Cun-Hui Zhang, Editors,
76-91, 2007.

Decision Trees for Functional Variables.
Balakrishnan, S. and Madigan, D., ICDM 2006.
 Constructing Informative Prior Distributions from Domain Knowledge in Text
Classification.
A. Dayanik, D. Lewis, D. Madigan, V. Menkov, A. Genkin.
SIGIR 2006.
 Author
Identification on the Large Scale
D. Madigan, A. Genkin, D. Lewis, S. Argamon, D. Fradkin, and L. Ye. CSNA, 2005.
 Exploration
Approaches to Adaptive Filtering
D. Fradkin and M. Littman, January 2005,
DIMACS Technical Report 2005-01
 Methods for
Learning Classifier Combinations: No Clear Winner.
D. Fradkin and P.
Kantor. Proceedings of ACM Symposium on Applied Computing Information Access
and Retrieval Track, 2005.
 Summarizing
and Mining Skewed Data Streams.
G. Cormode and S. Muthukrishnan. SIAM Conf
on Data Mining (SDM), 2005.
 Space Efficient Mining of Multigraph Streams.
G.
Cormode, S. Muthukrishnan. ACM Principles of Database Systems (PODS), 2005.
 Bayesian
Multinomial Logistic Regression for Author identification.
D. Madigan, A.
Genkin, D. Lewis and D. Fradkin, MaxEnt 2005.
 A One-Pass Sequential Monte Carlo Method for Bayesian
Analysis of Massive Datasets
 Suhrid Balakrishnan and David Madigan, 2004.
Presentation
Software
 Holistic
UDAFs at streaming speeds.
T.
Johnson, G. Cormode, F. Korn, S. Muthukrishnan, O. Spatscheck, D. Srivastava.
SIGMOD 2004.
 Diamond
in the Rough: Finding Hierarchical Heavy Hitters in Multi-Dimensional Data.
G. Cormode, F. Korn, S. Muthukrishnan, D. Srivastava. SIGMOD 2004.
 What is New: Finding
Significant Differences in Network Data Streams.
G. Cormode and S.
Muthukrishnan. INFOCOM 2004.
 An
Improved Data Stream Summary: The Count-Min Sketch and its Applications.
G.
Cormode and S. Muthukrishnan. LATIN 2004.
 Large-scale
Bayesian Logistic Regression for Text Categorization (PS) (PDF).
A.
Genkin, D. Lewis and D. Madigan. Technometrics, 2007.
 A Design Space
Approach to Analysis of Information Retrieval Adaptive Filtering Systems.
D. Fradkin and P. Kantor. Proceedings of the
Thirteenth ACM Conference on Information and Knowledge Management, CIKM 2004.
 Distinguishing
Mislabeled Data from Correctly Labeled Data in Classifier Design.
S.
Venkataraman, D. Metaxas, D. Fradkin, C. Kulikowski, I. Muchnik, International
Conference on Tools with AI (ICTAI), 2004.
 DIMACS
at the TREC 2004 Genomics Track.
A. Dayanik, D. Fradkin, A. Genkin, P.
Kantor, D. Lewis, D. Madigan, V. Menkov. TREC 2004.
 Combinatorial
PCA and SVM Methods for Feature Selection in Learning Classifications
(Applications to Text Categorization).
A. Anghelescu and I. Muchnik.
Proceedings of the International Conference on Integration of Knowledge
Intensive Multi-Agent Systems, pp. 491-496. ISBN 0-7803-7958-6, 2003.
 Maintenance of
Multidimensional Histograms.
S. Muthukrishnan and M. Strauss. FST & TCS
2003.
 Finding
Hierarchical Heavy Hitters in Data Streams.
G. Cormode, F. Korn, S.
Muthukrishnan and D. Srivastava. VLDB 2003.
 What is Hot and What is
Not: Tracking Most Frequent Items Dynamically.
G. Cormode and S.
Muthukrishnan. PODS 2003.
 On the Naive
Bayes Model for Text Classification (PS) (PDF).
S.
Eyheramendy, D. Lewis, and D. Madigan. In Proceedings of The Ninth International
Workshop on Artificial Intelligence and Statistics, C.M. Bishop and B.J. Frey
(Editors), 332-339, 2003.
 Experiments with
Random Projections for Machine Learning.
D. Fradkin and D. Madigan. In
Proceedings of KDD-03, The Ninth International Conference on Knowledge
Discovery and Data Mining, 2003.
Project Reports
 Constructing Informative Prior Distributions from Domain Knowledge in Text Classification
Aynur Dayanik, David D. Lewis, David Madigan, Vladimir Menkov, and Alexander Genkin, January 2006.

Simulated Entity Resolution by Diverse Means: DIMACS Work on the KDD Challenge of 2005
Andrei Anghelescu, Aynur Dayanik, Dmitriy Fradkin, Alex Genkin, Paul Kantor, David Lewis, David Madigan, Ilya Muchnik and Fred Roberts,
December 2005, DIMACS Technical Report 2005-42.
 Decision Trees for Functional Variables
Suhrid Balakrishnan and David Madigan, 2006.
 A Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization
Susana Eyheramendy and David Madigan, 2006
 Bayesian Multinomial Logistic Regression
for Author Identification
David Madigan, Alexander Genkin, David D. Lewis, and Dmitriy Fradkin,
August 2005.
 DIMACS at the TREC 2005 Genomics Track
Aynur Dayanik, Alexander Genkin, Paul Kantor, David D. Lewis, and David Madigan,
2005.
 Sparse
Logistic Regression for Text Categorization
Alexander Genkin, David D. Lewis, and David Madigan. April 2005
 Polytomous Logistic
Regression
David D. Lewis, October 2004.
 Project
Overview (PS)
Paul B. Kantor and Fred S. Roberts, April 2003.
 A design
space approach to analysis of IR/AF systems
Dmitriy Fradkin and Paul Kantor, February 2004.
 Methods for learning
classifier combinations: No clear winner
Dmitriy Fradkin, Paul Kantor, and Kwong Bor Ng, February 2004.
 Optimization
of SVM in a Space of Two Parameters: Weak Margin and Intercept.
Andrei V. Anghelescu and Ilya B. Muchnik, May 2003.
 Prospective Data Fusion for Batch
Filtering
Andrei Anghelescu, Endre Boros, Dmitriy Fradkin, David D. Lewis, Vladimir
Menkov, David J. Neu, Kwong Bor Ng, and Paul Kantor, May 2003.
 Feature
Selection and Batch Filtering
Endre Boros and David J. Neu, May 2003.
 Sparse
Bayesian Classifiers for Text Categorization
Susana Eyheramendy, Alexander Genkin, Wen-Hua Ju, David D. Lewis, and
David Madigan, May 2003.
 A Closer Look at Nearest
Neighbor Text Classification.
David D. Lewis and Vladimir Menkov, May 2003.
 Prospective Data Fusion for Batch Filtering
(PS) (PDF)
Andrei Anghelescu, Dmitriy Fradkin, David D. Lewis, Vladimir Menkov, Kwong
Bor Ng, and Paul Kantor, February 2003.
 A Closer Look at
Nearest Neighbor Classification (PS)
David D. Lewis and Vladimir Menkov, January 2003.
 Towards
a Formal Framework for Adaptive Filtering (PS) (PDF)
Paul Kantor and David Madigan, December 2002.
 Towards
Fast, Effective Nearest Neighbor Text Classification (PS) (PDF)
David D. Lewis, S. Muthukrishnan, R. Ostrovsky, M. Strauss, November 2002.
 Distances
from Sketches (PS) (PDF)
Martin Strauss, November 2002.
 Sparse
Bayesian Classifiers for Text Categorization - Batch Learning (PS) (PDF)
Alexander Genkin, David D. Lewis, and David Madigan, November 2002.
 Sparse
Bayesian Classifiers for Text Categorization - Online Learning (PS) (PDF)
Wen-Hua Ju and David Madigan, November 2002.
 Combinatorial
Clustering for Textual Data Representation in Machine Learning Models
Andrei Anghelescu and Ilya Muchnik, November 2002.
 An
Experimental Evaluation of a Combinatorial Clustering Model for Textual
Data Representation
Andrei Anghelescu and Ilya Muchnik, November 2002.
 A
Simple, Coarse Vector Nearest Neighbors Algorithm (PS) (PDF)
S. Muthukrishnan, November 2002.
 Some
Statistical Analysis Algorithms on Data Streams (PS) (PDF)
S. Muthukrishnan, November 2002.
 Notes on Term Selection
Endre Boros, November 2002.
 Training
a Rocchio classifier for adaptive filtering (PS) (PDF)
Andrei Anghelescu, David D. Lewis, Vladimir Menkov, and Paul Kantor, TREC
2002.
 Feature
Selection and Batch Filtering (PS) (PDF)
Endre Boros and David J. Neu, TREC 2002.
 CCR-DIMACS
Workshop
Bob Grossman, Paul Kantor, and Muthu Muthukrishnan, June 2002.
Online Tools
Project Deliverables
Project Presentations
 August 2006 Site Visit
 September 2004 PI's Meeting Overview
Presentation
 April 2005 Site Visit
 July 2004 Site Visit
 February 2004 Site Visit
 Monitoring Message
Streams: Algorithmic Methods for Automatic Processing of Messages (PPT)
Fred Roberts, November 2003.
 Monitoring
Message Streams: Retrospective and Prospective Event Detection (PPT)
Fred Roberts, December 2002.
 Monitoring Message
Streams: Future Work (PPT)
Fred Roberts, December 2002.
 Combinatorial
Clustering for Textual Data Representation (PDF)
Andrei Anghelescu and Ilya Muchnik, December 2002.
 Towards Fast Algorithms for
Nearest-Neighbor Searching (PPT)
Rafail Ostrovsky, December 2002.
 Nearest Neighbor Text
Classification (PPT)
David D. Lewis, S. Muthukrishnan, R. Ostrovsky, and M. Strauss, December 2002.
 Infrastructure:
Software and Data (PPT)
Andrei Anghelescu, Paul Kantor, David D. Lewis, and Vladimir Menkov, December
2002.
 Fusion of Methods (PPT)
Paul Kantor and Dmitriy Fradkin, December 2002.
 How do we measure success?
(PPT)
Paul Kantor, December 2002.
 Towards a Formal
Framework for Adaptive Filtering (PPT)
Paul Kantor and David Madigan, December 2002.
 Bayesian Methods for Text
Categorization (PPT)
Susana Eyheramendy, Alex Genkin, Wen-Hua Ju, David D. Lewis, and David Madigan,
November 2002.
 Streaming Data Analysis (PPT)
S. Muthu Muthukrishnan, November 2002.
 Randomized Projection Methods
(PS)
Martin Strauss, November 2002.
 Feature Selection
Endre Boros, November 2002.
 Project
presentation (PPT), August 2002
 Project
presentation (PPT), May 2002
 Project proposal (Word),
March 2002
Informal Results Pages
Related Events
Other pointers to relevant papers by group members: