SECTOR: A Neural Model for Coherent Topic Segmentation and Classification

Arnold, Sebastian; Schneider, Rudolf; Cudré-Mauroux, Philippe; Gers, Felix A.; Löser, Alexander

doi:10.1162/tacl_a_00261

Transactions of the Association for Computational Linguistics (Nov 2019)

DOI: https://doi.org/10.1162/tacl_a_00261
Journal volume & issue: Vol. 7, pp. 169–184

Abstract

When searching for information, a human reader first glances over a document, spots relevant sections, and then focuses on a few sentences for resolving her intention. However, the high variance of document structure complicates the identification of the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available data set with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a highest score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR long short-term memory model with Bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 over state-of-the-art CNN classifiers with baseline segmentation.
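The idea of segmenting a document at topic shifts and labeling each section can be illustrated with a minimal sketch. This is not the SECTOR architecture described in the abstract: it assumes each sentence already comes with a topic score vector (one dimension per topic), and it places a section boundary wherever the cosine distance between consecutive vectors exceeds a fixed threshold. The function names, the threshold of 0.5, and the topic labels are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def segment_at_topic_shifts(
    embeddings: List[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Return the indices of sentences that start a new section.

    Assumption: a boundary is placed wherever consecutive sentence
    vectors differ by more than `threshold` in cosine distance.
    """
    starts = [0] if embeddings else []
    for i in range(1, len(embeddings)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            starts.append(i)
    return starts


def label_sections(
    embeddings: List[Sequence[float]], starts: List[int], topics: List[str]
) -> List[str]:
    """Label each section with the topic that has the highest mean score."""
    labels = []
    bounds = starts + [len(embeddings)]
    for begin, end in zip(bounds, bounds[1:]):
        mean = [
            sum(embeddings[i][d] for i in range(begin, end)) / (end - begin)
            for d in range(len(topics))
        ]
        labels.append(topics[mean.index(max(mean))])
    return labels


# Hypothetical per-sentence topic scores for a four-sentence document.
topics = ["symptoms", "treatment"]
sentence_scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

starts = segment_at_topic_shifts(sentence_scores)
print(starts)                                          # [0, 2]
print(label_sections(sentence_scores, starts, topics)) # ['symptoms', 'treatment']
```

In this toy input the first two sentences score highest on one topic and the last two on the other, so a single boundary is placed before the third sentence. A fixed threshold on raw distances is the simplest possible choice and would likely need smoothing or tuning on real text.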

Published in Transactions of the Association for Computational Linguistics

ISSN: 2307-387X (Online)
Publisher: The MIT Press
Country of publisher: United States
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing
Website: https://direct.mit.edu/tacl
