WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

Hiroaki Hayashi; Prashant Budania; Peng Wang; Chris Ackerson; Raj Neervannan; Graham Neubig

doi:10.1162/tacl_a_00362

Transactions of the Association for Computational Linguistics (Jan 2021)


Affiliations

Hiroaki Hayashi: Language Technologies Institute, Carnegie Mellon University, United States.
Prashant Budania: AlphaSense, United States.
Peng Wang: AlphaSense, United States.
Chris Ackerson: AlphaSense, United States.
Raj Neervannan: AlphaSense, United States.
Graham Neubig: Language Technologies Institute, Carnegie Mellon University, United States.

DOI: https://doi.org/10.1162/tacl_a_00362
Journal volume & issue: Vol. 9, pp. 211–225

Abstract


Aspect-based summarization is the task of generating focused summaries based on specific points of interest. Such summaries aid efficient analysis of text, such as quickly understanding reviews or opinions from different angles. However, due to large differences in the type of aspects for different domains (e.g., sentiment, product features), the development of previous models has tended to be domain-specific. In this paper, we propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization that attempts to spur research in the direction of open-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. We propose several straightforward baseline models for this task and conduct experiments on the dataset. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
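
To illustrate the idea of using section titles and boundaries as a proxy for aspect annotation, here is a minimal Python sketch. It is not the authors' pipeline. It assumes a simplified article format in which top-level sections are marked as `== Title ==`, and the function name `split_into_aspects` is hypothetical. Text before the first heading is ignored, and nested headings are not handled.

```python
import re

# Assumed markup: a top-level heading is a line of the form "== Title ==".
HEADING = re.compile(r"^==\s*([^=].*?)\s*==\s*$")


def split_into_aspects(article_text):
    """Map each section title (the aspect) to the text inside its boundaries."""
    aspects = {}
    current = None  # no aspect until the first heading is seen
    for line in article_text.splitlines():
        match = HEADING.match(line)
        if match:
            # A heading closes the previous section and opens a new one.
            current = match.group(1).strip().lower()
            aspects.setdefault(current, [])
        elif current is not None and line.strip():
            aspects[current].append(line.strip())
    return {title: " ".join(lines) for title, lines in aspects.items()}


article = (
    "Lead paragraph.\n"
    "== History ==\n"
    "Founded in 1900.\n"
    "Expanded later.\n"
    "== Reception ==\n"
    "Critics praised it.\n"
)
aspect_summaries = split_into_aspects(article)
```

In this toy setup, each section title plays the role of an aspect and the section body plays the role of the aspect-specific reference summary.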

Published in Transactions of the Association for Computational Linguistics

ISSN: 2307-387X (Online)
Publisher: The MIT Press
Country of publisher: United States
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing
Website: https://direct.mit.edu/tacl
