PeerJ Computer Science (Oct 2024)

Learning-based short text compression using BERT models

  • Emir Öztürk,
  • Altan Mesut

DOI
https://doi.org/10.7717/peerj-cs.2423
Journal volume & issue
Vol. 10
p. e2423

Abstract

Learning-based data compression methods have gained significant attention in recent years. Although these methods achieve higher compression ratios than traditional techniques, their slow processing times make them less suitable for compressing large datasets, and they are generally more effective for short texts than for longer ones. In this study, MLMCompress, a word-based text compression method that can utilize any BERT masked language model, is introduced. The performance of MLMCompress is evaluated using four BERT models: two large models and two smaller models referred to as “tiny”. The large models are used without training, while the smaller models are fine-tuned. The results indicate that MLMCompress, when using the best-performing model, achieves 38% higher compression ratios for English text and 42% higher compression ratios for multilingual text compared to NNCP, another learning-based method. Although the method does not yield better results than GPTZip, which was developed in recent years, it achieves comparable outcomes while being up to 35 times faster in the worst case. Additionally, it demonstrates a 20% improvement in compression speed and a 180% improvement in decompression speed in the best case. Furthermore, MLMCompress outperforms traditional compression methods such as Gzip and specialized short text compression methods such as Smaz and Shoco, particularly for short texts, even when using smaller models.
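
For readers unfamiliar with compression based on masked language models, the sketch below illustrates the general idea only: mask a word, rank the model's vocabulary predictions, and record the rank of the true token, since frequently small ranks can be encoded cheaply. This is a minimal illustration assuming the Hugging Face transformers library and the prajjwal1/bert-tiny checkpoint; it is not the authors' MLMCompress implementation, whose actual word-based encoding scheme is described in the paper.

    # Illustrative sketch only: rank coding of tokens with a BERT masked LM.
    # Model choice and tokenization details are assumptions, not the paper's setup.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    model_name = "prajjwal1/bert-tiny"  # any BERT masked language model could be used
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    def rank_of_token(input_ids, position, true_id):
        """Mask the token at `position` and return the rank of the true token
        among the model's predictions (0 = model's top guess)."""
        ids = input_ids.clone()
        ids[0, position] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(ids).logits[0, position]
        order = torch.argsort(logits, descending=True)
        return (order == true_id).nonzero(as_tuple=True)[0].item()

    text = "data compression is a classic problem"
    enc = tokenizer(text, return_tensors="pt")
    # Replace each token by its prediction rank; skip the [CLS] and [SEP] positions.
    ranks = [rank_of_token(enc.input_ids, i, enc.input_ids[0, i].item())
             for i in range(1, enc.input_ids.shape[1] - 1)]
    print(ranks)  # small ranks indicate the model predicted the word well

In a complete compressor, the sequence of ranks would then be passed to an entropy coder; the decompressor can recover the text because it runs the same model and reverses the rank lookup.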

Keywords