EPJ Data Science (Feb 2025)

Addressing long-tailed distribution in judicial text for criminal motive classification: a balanced contrastive learning approach

  • Ting Li,
  • Lewen Mi,
  • Xiangyu Meng,
  • Yongju Jia,
  • Lin Zhao,
  • Qi Zhao,
  • Zihao Wei,
  • Guandong Gao,
  • Xiangxian Li

DOI
https://doi.org/10.1140/epjds/s13688-025-00533-1
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 22

Abstract

Understanding criminal motives is crucial for analyzing criminal psychology and predicting judicial outcomes. Traditional methods for criminal motive analysis rely heavily on statistical techniques, which require specialized knowledge and substantial human resources. With the increasing availability of judicial data, such as legal documents, machine learning approaches hold great potential in this domain. However, a significant challenge is the lack of comprehensive datasets for training these models, and the distribution of criminal motive categories in publicly available legal texts often exhibits a long-tailed imbalance. This imbalance can bias a model toward predicting the more common criminal motives. To address these challenges, we collected 11,589 legal documents from China Judgements Online (2019–2024) to create a criminal motive text dataset. To mitigate the long-tailed issue, we propose a Category-Aware Balanced Contrastive Learning (CA-BCL) method, which enhances the model’s representation of long-tailed data. Specifically, CA-BCL first balances the sampling process to alleviate class imbalance during prototype construction, and then applies balanced contrastive learning to improve the model’s ability to generalize to long-tailed categories, leading to better overall classification performance. Our experimental results demonstrate that CA-BCL significantly outperforms existing text classification models in criminal motive classification, while also showing strong generalization capabilities on standard text classification benchmarks.
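The balanced sampling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' CA-BCL implementation, which the abstract does not specify; it only shows one common way to balance sampling, namely weighting each example by the inverse of its class frequency so that every class is drawn about equally often. The function name `balanced_sample` and the motive labels in the toy data are hypothetical.

```python
import random
from collections import Counter


def balanced_sample(labels, k, seed=0):
    """Draw k indices (with replacement) so each class is about equally likely.

    Illustrative assumption: inverse class-frequency weighting, which is one
    standard balancing scheme and not necessarily the one used in CA-BCL.
    """
    counts = Counter(labels)
    # Each example's weight is 1 / (size of its class), so the total weight
    # of every class sums to 1 regardless of how many examples it has.
    weights = [1.0 / counts[y] for y in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=k)


# Toy long-tailed label set with invented category names:
# one head class (90 examples) and two tail classes (8 and 2 examples).
labels = ["greed"] * 90 + ["revenge"] * 8 + ["jealousy"] * 2
picked = balanced_sample(labels, k=3000)
sampled_counts = Counter(labels[i] for i in picked)
```

In the toy data the head class makes up 90% of the examples, yet after reweighting each of the three classes accounts for roughly one third of the draws, which is the effect a balanced sampler aims for before any prototype or contrastive step.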

Keywords