Data Augmentation for Sentiment Analysis Using Sentence Compression-Based SeqGAN With Data Screening

Jiawei Luo; Mondher Bouazizi; Tomoaki Ohtsuki

doi:10.1109/ACCESS.2021.3094023

IEEE Access (Jan 2021)

Affiliations

Jiawei Luo: Graduate School of Science and Technology, Keio University, Yokohama, Japan
Mondher Bouazizi: Department of Information and Computer Science, Keio University, Yokohama, Japan
Tomoaki Ohtsuki: Department of Information and Computer Science, Keio University, Yokohama, Japan

DOI: https://doi.org/10.1109/ACCESS.2021.3094023
Journal volume & issue: Vol. 9
pp. 99922 – 99931

Abstract

Sentiment analysis is the task of automatically identifying the emotions that people express. Its accuracy depends heavily on the amount of training data, but collecting a large amount of data manually is time-consuming and costly. Many studies have therefore used generative models to produce a large amount of data from a small amount of data for sentiment analysis. However, two severe challenges remain: training on long texts is difficult, and the generated data may carry inaccurate sentiment information. Both make it hard to improve sentiment analysis accuracy effectively. In this paper, we propose a novel data augmentation framework based on sequence generative adversarial networks (SeqGAN) to improve sentiment analysis accuracy when the dataset already has a certain amount of data and contains long texts. A penalty-based SeqGAN is used to generate high-quality and diversified text data. Long short-term memory (LSTM) networks with attention mechanisms perform sentence compression on the training data of SeqGAN, and a sentiment dictionary is used to retain the sentiment words in the compressed data. We also propose a data screening method to obtain more accurate data from the generated data. Evaluations of the usability, novelty, and diversity of the generated data show that the proposed sentence compression method helps SeqGAN learn more information from long text data. The data generated by the proposed framework improve the classification accuracy of four classifiers applied to two distinct text datasets.
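
As an illustration of the idea of retaining sentiment words during sentence compression, the following is a minimal sketch and not the authors' implementation. It assumes a deletion-based compressor that outputs one keep/delete decision per token (in the paper this role is played by the attention-based LSTM; here the mask is simply supplied by hand) and a sentiment dictionary represented as a plain set of lowercase words. The function name, the mask, and the tiny lexicon are all hypothetical.

```python
from typing import List, Set


def compress_with_sentiment_retention(
    tokens: List[str],
    keep_mask: List[bool],
    sentiment_lexicon: Set[str],
) -> List[str]:
    """Apply a keep/delete mask to a sentence, but never drop sentiment words.

    tokens            -- the tokenized input sentence
    keep_mask         -- hypothetical output of a compression model:
                         True means "keep this token"
    sentiment_lexicon -- hypothetical sentiment dictionary (lowercase words)
    """
    if len(tokens) != len(keep_mask):
        raise ValueError("tokens and keep_mask must have the same length")

    # A token survives if the compressor keeps it OR the dictionary
    # marks it as a sentiment word.
    return [
        tok
        for tok, keep in zip(tokens, keep_mask)
        if keep or tok.lower() in sentiment_lexicon
    ]


# Toy inputs, invented for illustration only.
tokens = "the plot was painfully slow but the acting was wonderful".split()
keep_mask = [False, True, True, False, True, False, False, True, True, False]
lexicon = {"painfully", "slow", "wonderful"}

compressed = compress_with_sentiment_retention(tokens, keep_mask, lexicon)
print(" ".join(compressed))
```

In this toy case the compressor alone would have dropped "painfully" and "wonderful"; the dictionary override keeps them, so the shortened sentence still carries the sentiment cues that a downstream generator would learn from.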

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
