Email is an interesting genre of inter-human communication as it has
aspects of both spoken and written language. On the one hand, it has an
interactive structure that resembles dialog, is typically unedited, and
often uses informal language. On the other hand, email exchanges happen
over time, so discourse participants have developed special
strategies to remind each other of the context in which they are
communicating. These strategies include various ways of citing previous
parts of the conversation.

In this talk, I will review some work done by members of the Columbia
Natural Language Processing (NLP) Group on email. I will describe a small
corpus collection effort we have undertaken, and then concentrate on
summarization by sentence extraction. While summarization by extraction
works well for certain genres such as newswire, the approach needs to be
modified for email. I will show how using email-specific features
improves the choice of relevant sentences for extraction.

I will also present some related work we have been doing. One line of
research aims at classifying email, both at the thread level and at the
email message level, into different categories. In a second line of
research, we identify questions that are asked and attempt to identify
corresponding answers. We have investigated various ways of integrating
the simple sentence-extraction approach with question/answer information.
I will also present a summarization client we have designed and
implemented, which can be used in conjunction with Microsoft Outlook. In
conclusion, I will outline some plans for extending our work to the immense
Enron email corpus and report on some initial investigations.

Biography
Owen Rambow is a Research Scientist at the Center for Computational
Learning Systems at Columbia University, New York. He received his
Ph.D. in Computer and Information Sciences from the University of
Pennsylvania in 1994. Rambow's research interests lie in the areas of
formal representations for linguistic knowledge, especially syntax and
lexical semantics, and applications of such representations to
summarization, natural language generation, and dialog systems. His
recent work has used machine learning in combination with sophisticated
linguistic representations. For example, he has used machine learning to
determine optimal ways of achieving communicative goals in dialog systems
and to make surface generation choices in natural language generation.
His work on email summarization uses features, both general and specific to
email, to automatically learn what information to include in summaries of
multi-party email threads. His current work also includes a project aimed at
finding an optimal representation for the lexicon, morphology, and syntax
of a group of closely related languages (dialects).