2012-2017 Special Focus on Information Sharing and Dynamic Data Analysis: Overview
If the 20th century was dominated by rapid advances in technology, most notably the development
and growth of computers, then the 21st will be defined by the enormous growth of data.
The ability to instrument, monitor and collect data on every action within society provides huge
potential to use this information to improve life: designing better systems (e.g. in healthcare),
better managing interactions between large complex systems (e.g. in urban planning and traffic
management), identifying subtle problems and bad outcomes (e.g. in homeland security),
and more.
Yet while there have been great strides made in building and designing systems which have
strong control over their data sources (think of large web-based applications, such as search engines
and social networks), systems which need to span multiple, noisy data sources have shown less
progress. Consider traffic management in a modern city: there are many complex inputs (historical
travel patterns, current and predicted weather activity, maintenance, accident reports, and current
observed traffic density). Various controls are possible (adjusting traffic signal timings, blocking or
opening lanes, reversible flows etc.), but it remains a hard problem to ensure smooth flow of traffic
at peak times. A key stumbling block is the difficulty inherent in bringing together multiple
diverse sources of information and using them to draw correct conclusions and make sound
decisions. Similar examples can be seen in other application areas: identifying disease spread from
signals as broad as pharmacy purchases and web search trends; finding potential terrorist plots
from a mixture of open and classified sources.
The goal of the Special Focus on Information Sharing and Dynamic Data Analysis is to address
the technical problems at the heart of these data challenges. No simple solution or single algorithm
can revolutionize this area. Instead, there needs to be protracted effort to understand, model and
make progress on these fundamental issues. Moreover, there are many embedded technical problems
that need to be solved: this is a matter of science, as well as engineering. Our approach is to
highlight the key technical problems that need to be solved in this area. This cannot be done
in clinical isolation, but requires the participation of practitioners and data users, in addition to
scientists and academics. The special focus aims to provide venues for these interactions to occur,
and to stimulate further progress via meetings, reports, and identification of open problems. The
focus will take place over four years, allowing time for some topics to be visited multiple times as
advances are made.
Cross-cutting themes throughout the Special Focus will be:
- Data Preparation and Quality. Real world data collection systems must deal with substantial
quality problems in the data: values are missing or corrupted, and the same real-world entity
may appear in the data under multiple identifiers. Detecting and correcting such problems is challenging:
there is the need to define (or learn) complex rules that define the expected behavior, and use these
to identify errors without erasing genuine outlier behavior. This is compounded when bringing
together data from multiple disparate sources (as in the traffic management example): sources
might appear self-consistent, but be mutually inconsistent. It is then a challenge to reconcile these
discrepancies.
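As a toy illustration of the record-linkage problem described above, the sketch below pairs records from two hypothetical traffic data sources whose location fields describe the same intersection in different ways. The field names, threshold, and similarity rule are illustrative assumptions, not part of any DIMACS system; real entity resolution uses far more sophisticated (often learned) matching rules.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity between two field values (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_records(source_a, source_b, threshold=0.8):
    """Pair records from two sources whose 'name' fields are similar enough.
    A naive all-pairs comparison; real systems use blocking to avoid O(n*m)."""
    links = []
    for ra in source_a:
        for rb in source_b:
            if similarity(ra["name"], rb["name"]) >= threshold:
                links.append((ra["id"], rb["id"]))
    return links

# Two sources naming the same intersection differently (hypothetical data).
sensors = [{"id": "A1", "name": "Main St & 5th Ave"}]
reports = [{"id": "R9", "name": "Main Street & 5th Avenue"},
           {"id": "R10", "name": "Oak Blvd & 2nd St"}]
print(link_records(sensors, reports))  # links A1 to R9 only
```

Even this simple rule shows the core tension: the threshold must be loose enough to catch genuine variants but tight enough not to merge distinct entities.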
- Privacy and Security. When information is shared between different entities, there are inevitably
concerns about the privacy of individuals in the data, and the security of the information
against attackers. Recent years have seen renewed interest in these topics from both the research
community and the broader legislative/public policy domain, but these two efforts have only begun
to meet. In particular, while there has been much technical progress in methods to
provide privacy for data sharing and release, this has not been matched in practice or regulation:
law and policy have so far adhered to simplistic definitions of privacy (such as in HIPAA), but data
releases which meet these requirements, such as the AOL and Netflix releases, have nevertheless
failed to provide the protection needed. By running this SF concurrently with the Cybersecurity
SF, we will be able to bring together people with a broad range of expertise, including security, privacy,
public policy, and a number of practical domains in order to lay out and advance the research
agenda in data privacy and security for the next decade.
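One of the technical methods for privacy-preserving data release alluded to above is differential privacy, often implemented by adding calibrated Laplace noise to released statistics. The sketch below is a minimal, self-contained version for a single count query of sensitivity 1; the function names and parameters are illustrative, and production systems must also account for the privacy budget across repeated queries.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution via inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count with Laplace noise of scale 1/epsilon.
    Smaller epsilon means stronger privacy but noisier answers."""
    rng = rng or random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)

# E.g., releasing a noisy count of patients with some condition.
rng = random.Random(42)
print(private_count(1000, epsilon=0.5, rng=rng))
```

The contrast with the simplistic anonymization criticized above is that the guarantee here is a property of the release mechanism itself, not of which identifying fields were stripped.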
- Continual and Distributed Processing. The historic model of data sharing is of receiving a
data set from another party. But rather than data sets, the current world works on data feeds:
streams of data arriving continually. This brings new challenges, not just in how
to combine multiple such feeds, but in how to do so efficiently: when streams arrive at high rates,
there are additional costs in terms of network bandwidth, local storage, and processing power to
manage. These push towards distributed computation in place of centralizing data, which leads
to new models of parallel computation: cloud computing, and paradigms such as MapReduce and
its open-source implementation, Hadoop.
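As a small sketch of the feed-combination problem described above, the code below merges two time-ordered feeds into a single stream and maintains a rolling average with bounded memory: only the readings inside the trailing window are kept. The feed contents and window size are made-up examples.

```python
import heapq
from collections import deque

def merge_feeds(*feeds):
    """Merge time-ordered (timestamp, value) feeds into one stream.
    heapq.merge holds only one pending item per feed in memory."""
    yield from heapq.merge(*feeds)

def rolling_average(stream, window=60.0):
    """Average of readings in the trailing `window` seconds, updated per item."""
    buf = deque()
    total = 0.0
    for t, v in stream:
        buf.append((t, v))
        total += v
        while buf and buf[0][0] < t - window:  # evict expired readings
            _, old = buf.popleft()
            total -= old
        yield t, total / len(buf)

# E.g., two traffic-sensor feeds reporting (seconds, vehicles/minute).
feed_a = [(0, 10.0), (30, 20.0), (90, 30.0)]
feed_b = [(15, 40.0), (75, 50.0)]
for t, avg in rolling_average(merge_feeds(feed_a, feed_b)):
    print(t, avg)
```

The point is the shape of the computation: state is proportional to the window, not the stream, which is exactly what lets such processing be pushed out toward the data sources instead of centralizing everything.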
- Fusion and Inference. Current techniques for data fusion and inference over data from multiple
sources are still somewhat crude. Learning techniques are being developed to identify optimal
ways to collect information in uncertain environments. It is possible to draw on existing
and developing techniques, such as Bayesian hierarchical modeling, Kalman filters, and so on.
But there is still much potential for new results on combining heterogeneous data from disparate
sources to draw conclusions and make decisions.
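To make the Kalman-filter idea mentioned above concrete, here is a minimal one-dimensional measurement-update sketch that fuses a prior estimate with two noisy sensor readings, weighting each by its variance. The traffic-speed scenario and numbers are invented for illustration; real applications use the full multivariate filter with a dynamics model.

```python
def kalman_update(mean, var, measurement, meas_var):
    """Fuse a Gaussian state estimate (mean, var) with one noisy measurement."""
    k = var / (var + meas_var)               # Kalman gain: trust in the measurement
    new_mean = mean + k * (measurement - mean)
    new_var = (1.0 - k) * var                # fused estimate is always less uncertain
    return new_mean, new_var

# Prior belief about traffic speed (mph), then two sensors of differing quality.
mean, var = 50.0, 100.0
for z, r in [(48.0, 4.0), (53.0, 9.0)]:     # (reading, measurement variance)
    mean, var = kalman_update(mean, var, z, r)
print(mean, var)
```

Note how the fused estimate lands between the readings, closer to the more reliable sensor, and its variance drops below that of either sensor alone: a simple instance of the principled combination of disparate sources the focus aims to advance.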
Opportunities to Participate:
- Workshops: A variety of workshops and mini-workshops are
being planned.
- Working Groups: Interdisciplinary "working groups" will
explore special focus research topics.
- Implementation Challenges: Implementation challenges focus on empirical aspects of algorithm design and evaluation.
- Tutorial: A tutorial will provide background knowledge to
those who wish to participate in the special focus or just get an
introduction to some of the fundamental issues in the field.
- Seminar Series: There will be a mix of research talks and
practitioner presentations.
- Visitor Programs: Applications for research and graduate
student visits to the center are invited. Some funds may be available
for travel and local support.
- Graduate Student Support: Funds will be set aside for graduate students interested in attending workshops. Students interested in visiting DIMACS during the special focus are encouraged to apply to the special focus organizers.
- Publications: We anticipate that a variety of
publications, including AMS-DIMACS volumes, technical reports,
abstracts and notes on the WWW, and DIMACS modules will result from
the special focus.
Document last modified on September 9, 2017.