Computer Sciences Seminar
Tuesday, March 12
12:30 PM, NAC 8/206

Algorithms for High-Dimensional Clustering

Sanjoy Dasgupta
ATT Research

Abstract
Clustering -- finding groups in data -- lies at the heart of some of the most exciting new developments in computer science, such as computational genomics and web document retrieval. These emerging areas require the analysis of massive data sets, often involving millions of data points in very high- dimensional spaces. The importance of these data, along with the technological challenges they pose, have brought to the forefront the need for efficient, provably good clustering algorithms.

I will give an overview of my work on three long-standing open problems in clustering:

  1. the first efficient, provably good algorithm for a classic statistical task (learning mixtures of Gaussians);
  2. a proof that a popular clustering procedure (EM) will actually converge to the global optimum of its search space, rather than merely a local optimum, under certain conditions;
  3. an algorithm for constructing a "hierarchical clustering" -- a hierarchy of nested clusterings -- which is close to optimal (under a natural clustering cost function) at every level of granularity, simultaneously.

Bio
Sanjoy Dasgupta received his doctorate in Computer Science in May 2000, at the University of California, Berkeley. Since then he has been at AT&T Shannon Laboratory. The main focus of his research is on algorithmic aspects of statistics and machine learning, particularly for high-dimensional data.