Computer Sciences Seminar
Monday, April 14
11:00 AM, NAC 8/206

The Effect of Class Distribution on Classifier Learning

Gary Weiss
Rutgers University/AT&T Labs

Abstract

Classifier learning involves generating a predictive model from a set of preclassified examples; the model can then be used to classify future examples. Each learning problem has an associated class distribution. Although in some cases the data will contain equal numbers of examples belonging to each class, in most cases the class distribution will be unbalanced, often severely so.

In this talk I present results from the first comprehensive study on the effect of class distribution on learning, which analyzes the decision-tree classifiers induced from twenty-six data sets. I begin by showing that the examples belonging to rare classes are misclassified much more frequently than examples belonging to common classes. Next, I show how varying the class distribution of the training data affects the performance of the induced classifier. These results are used to answer important questions, such as, "What class distribution is best for learning?" My conclusion is that the naturally occurring class distribution is often not best for learning and that a balanced class distribution generally yields a classifier robust to different misclassification costs.

In real-world situations it is often necessary to limit the amount of training data because of the costs associated with obtaining and learning from the data. I describe a budget-sensitive progressive-sampling algorithm for selecting training examples in this situation, such that the resulting class distribution yields a classifier that performs well. This sampling algorithm can thus be used to maximize classifier performance when the cost of procuring training examples is high.

Bio
Gary Weiss received his B.S. from Cornell University in 1985 and his M.S. from Stanford University in 1986, and will receive his Ph.D. from Rutgers University in May 2003. Since 1985 he has been employed at Bell Labs and AT&T Labs. For the first several years at AT&T, Gary worked as a software engineer developing telephone-switching software. He then went on to develop an expert system to remotely monitor and diagnose faults in central office switches. For the past several years Gary has been using machine learning and data mining techniques to analyze AT&T business data to improve AT&T sales and marketing efforts. Gary's research interests include machine learning/data mining and the fundamental issues that arise when tackling complex, real-world problems.