Applied Sciences (Sep 2023)

A Study of Contrastive Learning Algorithms for Sentence Representation Based on Simple Data Augmentation

  • Xiaodong Liu,
  • Wenyin Gong,
  • Yuxin Li,
  • Yanchi Li,
  • Xiang Li

DOI
https://doi.org/10.3390/app131810120
Journal volume & issue
Vol. 13, no. 18
p. 10120

Abstract

In the era of deep learning, representation-based text-matching algorithms built on BERT and its variants have become mainstream, but their performance is limited by the quality of the sentence vectors that BERT generates. The SimCSE algorithm, proposed in 2021, improved sentence vector quality to a certain extent. SimCSE, however, has a known problem: the greater the difference in length between two sentences, the less likely the model is to judge the pair as similar. To address this, this paper proposes the EdaCSE algorithm, which perturbs sentence length with a simple data augmentation method that does not affect the semantics of the sentence. The perturbation adds meaningless English punctuation marks to the original sentence, so that the model no longer tends to treat sentences of similar length as similar sentences. Experiments based on the BERT series of models were conducted on five different datasets, and the results show that EdaCSE yields average improvements of 1.67, 0.84, and 1.08 across the five datasets.
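
The length perturbation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the punctuation set, the number of marks inserted, or where they are placed, so the `PUNCTUATION` list, the `ratio` parameter, and the function name `insert_punctuation` below are all assumptions made for illustration, not the authors' implementation.

```python
import random

# Assumed set of "meaningless" English punctuation marks (not given in the abstract).
PUNCTUATION = [".", ",", "!", "?", ";", ":"]


def insert_punctuation(sentence, ratio=0.3, rng=None):
    """Return a copy of `sentence` with punctuation tokens inserted before
    randomly chosen words. The words and their order are left unchanged;
    only the token count grows.

    `ratio` (hypothetical parameter) is the number of inserted marks as a
    fraction of the word count, with at least one mark inserted.
    """
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence

    n_insert = max(1, int(len(words) * ratio))
    # Pick distinct word positions that will be preceded by a punctuation mark.
    positions = set(rng.sample(range(len(words)), min(n_insert, len(words))))

    augmented = []
    for i, word in enumerate(words):
        if i in positions:
            augmented.append(rng.choice(PUNCTUATION))
        augmented.append(word)
    return " ".join(augmented)


if __name__ == "__main__":
    original = "the cat sat on the mat"
    # The augmented sentence is longer but contains the same words in the same order.
    print(insert_punctuation(original, ratio=0.5, rng=random.Random(0)))
```

In a SimCSE-style setup, the original sentence and its augmented copy could be encoded as a positive pair, so that the two views differ in length while keeping the same words. How EdaCSE actually integrates the augmented sentences into training is not stated in the abstract.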

Keywords