Dataset construction method of cross-lingual summarization based on filtering and text augmentation

Hangyu Pan; Yaoyi Xi; Ling Wang; Yu Nan; Zhizhong Su; Rong Cao

doi:10.7717/peerj-cs.1299

PeerJ Computer Science (Mar 2023)

Affiliations

Hangyu Pan: State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China
Yaoyi Xi: State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China
Ling Wang: State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China
Yu Nan: State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China
Zhizhong Su: State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China
Rong Cao: State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China

DOI: https://doi.org/10.7717/peerj-cs.1299
Journal volume & issue: Vol. 9, p. e1299

Abstract

Existing cross-lingual summarization (CLS) datasets suffer from inconsistent sample quality and small scale. To address these problems, we propose a method that jointly supervises quality and scale when building CLS datasets. For quality supervision, the method adopts a multi-strategy filtering algorithm that removes low-quality monolingual summarization (MS) samples from both character and semantic perspectives, thereby improving the quality of the MS dataset. For scale supervision, the method adopts a text augmentation algorithm based on a pretrained model to increase the size of CLS datasets while assuring quality. We used this method to build an English-Chinese CLS dataset and evaluated it with a reasonable data quality evaluation framework. The evaluation results show that the dataset is of good quality and large size. These outcomes suggest that the proposed method can comprehensively improve quality and scale, yielding a high-quality, large-scale CLS dataset at a lower cost.
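
The abstract does not give the concrete filtering rules, so the following is only a minimal sketch of what a two-stage filter over MS samples might look like. Everything in it is an assumption made for illustration: the `Sample` type, the function names, the thresholds, and above all the token-overlap score, which is a crude stand-in for a semantic check and is not the method used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monolingual summarization pair (hypothetical container)."""
    document: str
    summary: str


def char_level_ok(s: Sample, min_summary_chars: int = 10, max_ratio: float = 0.5) -> bool:
    """Character-level check with assumed thresholds: the summary must be
    non-trivial and clearly shorter than its document."""
    if len(s.summary.strip()) < min_summary_chars:
        return False
    return len(s.summary) / max(len(s.document), 1) <= max_ratio


def token_overlap(s: Sample) -> float:
    """Fraction of distinct summary tokens that also occur in the document.
    Stand-in for a semantic similarity score (assumption, not the paper's metric)."""
    doc_tokens = set(s.document.lower().split())
    sum_tokens = set(s.summary.lower().split())
    if not sum_tokens:
        return 0.0
    return len(sum_tokens & doc_tokens) / len(sum_tokens)


def filter_samples(samples, min_overlap: float = 0.5):
    """Keep only samples that pass both the character-level and the overlap check."""
    return [s for s in samples if char_level_ok(s) and token_overlap(s) >= min_overlap]
```

In a real pipeline the overlap score would presumably be replaced by a learned similarity measure, and the thresholds tuned on held-out data; the sketch only illustrates chaining a cheap surface check before a more expensive semantic one.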

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
