Computer Sciences Seminar

The Effect of Class Distribution on Classifier Learning
Gary Weiss

Abstract

Classifier learning involves generating a predictive model from a set of preclassified examples, which can then be used to classify future examples. Each learning problem will have an associated class distribution. Although in some cases the distribution of examples will contain equal numbers of examples belonging to each class, in most cases the class distribution will be unbalanced, often severely so. In this talk I present results from the first comprehensive study on the effect of class distribution on learning, which analyzes the decision-tree classifiers induced from twenty-six data sets. I begin by showing that the examples belonging to rare classes are misclassified much more frequently than examples belonging to common classes. Next, I show how varying the class distribution of the training data affects the performance of the induced classifier. These results are used to answer important questions, such as, "What class distribution is best for learning?" My conclusion is that the naturally occurring class distribution is often not best for learning and that a balanced class distribution generally yields a classifier robust to different misclassification costs.

In real-world situations it is often necessary to limit the amount of training data, due to costs associated with obtaining and learning from the data. I describe a budget-sensitive progressive-sampling algorithm for selecting training examples in this situation, such that the resulting class distribution performs well for learning. This sampling algorithm can thus be used to maximize classifier performance when the cost of procuring training examples is high.

Bio