Summarizing Email Threads

Dr. Owen Rambow, Columbia University

Abstract

Email is an interesting genre of inter-human communication as it has aspects of both spoken and written language. On the one hand, it has an interactive structure that resembles dialog, is typically unedited, and often uses informal language. On the other hand, email exchanges happen over time, so discourse participants have developed special strategies to remind each other of the context in which they are communicating. These strategies include various ways of citing previous parts of the conversation. In this talk, I will review some work done by members of the Columbia Natural Language Processing (NLP) Group on email. I will describe a small corpus collection effort we have undertaken, and then concentrate on summarization by sentence extraction. While summarization by extraction works well for certain genres such as newswire, the approach needs to be modified for email. I will show how using email-specific features improves the choice of relevant sentences for the extraction. I will also present some related work we have been doing. One line of research aims at classifying email, both at the thread level and at the email message level, into different categories. In a second line of research, we identify questions that are asked and attempt to identify corresponding answers. We have investigated various ways of integrating the simple sentence-extraction approach with question/answer information. I will also present a summarization client we have designed and implemented, which can be used in conjunction with Microsoft Outlook. In conclusion, I will outline some plans for extending our work to the immense Enron email corpus and describe some initial investigations.

Biography

Owen Rambow is a Research Scientist at the Center for Computational Learning Systems at Columbia University, New York. He received his Ph.D. in Computer and Information Sciences from the University of Pennsylvania in 1994. Rambow's research interests lie in the areas of formal representations for linguistic knowledge, especially syntax and lexical semantics, and applications of such representations to summarization, natural language generation, and dialog systems. His recent work has used machine learning in combination with sophisticated linguistic representations. For example, he has used machine learning to determine optimal ways of achieving communicative goals in dialog systems and to make surface generation choices in natural language generation. His work on email summarization uses features both general and specific to email to automatically learn what information to include in summaries of multi-party email threads. Current work also includes a project aimed at finding an optimal representation for the lexicon, morphology, and syntax of a group of closely related languages (dialects).