|
Computer Sciences Seminar
Tuesday, March 12
12:30 PM, NAC 8/206
Algorithms for High-Dimensional Clustering
Sanjoy Dasgupta
ATT Research
Abstract
Clustering -- finding groups in data -- lies at the
heart of some of the most exciting new developments
in computer science, such as computational genomics and web
document retrieval.
These emerging areas require the analysis of massive data
sets, often involving millions of data points in very high-
dimensional spaces. The importance of these data, along
with the technological challenges they pose, have brought
to the forefront the need for efficient, provably good
clustering algorithms.
I will give an overview of my work on three long-standing
open problems in clustering:
- the first efficient, provably good algorithm for a
classic statistical task
(learning mixtures of Gaussians);
- a proof that a popular clustering procedure (EM)
will actually converge to the global optimum of
its search space, rather than merely a local
optimum, under certain conditions;
- an algorithm for constructing a "hierarchical
clustering" -- a hierarchy of nested clusterings --
which is close to optimal (under a natural clustering
cost function) at every level of granularity, simultaneously.
Bio
Sanjoy Dasgupta received his doctorate in
Computer Science in May 2000, at the
University of California, Berkeley.
Since then he has been at AT&T Shannon
Laboratory. The main focus of his research
is on algorithmic aspects of statistics and
machine learning, particularly for high-dimensional
data.
|