Addressing long-tailed distribution in judicial text for criminal motive classification: a balanced contrastive learning approach

Ting Li; Lewen Mi; Xiangyu Meng; Yongju Jia; Lin Zhao; Qi Zhao; Zihao Wei; Guandong Gao; Xiangxian Li

doi:10.1140/epjds/s13688-025-00533-1

EPJ Data Science (Feb 2025)

Affiliations

Ting Li: Shandong University
Lewen Mi: Shandong University
Xiangyu Meng: Shandong University
Yongju Jia: Shandong University
Lin Zhao: Shandong University
Qi Zhao: Shandong University
Zihao Wei: Shandong University
Guandong Gao: The National Police University for Criminal Justice
Xiangxian Li: Shandong University

Journal volume & issue: Vol. 14, no. 1
pp. 1 – 22

Abstract

Understanding criminal motives is crucial for analyzing criminal psychology and predicting judicial outcomes. Traditional methods for criminal motive analysis rely heavily on statistical techniques, which require specialized knowledge and substantial human resources. With the increasing availability of judicial data, such as legal documents, machine learning approaches hold great potential in this domain. However, two challenges stand in the way: comprehensive datasets for training such models are lacking, and the distribution of criminal motive categories in publicly available legal texts is often long-tailed. This imbalance can bias a model toward predicting the more common criminal motives. To address these challenges, we collected 11,589 legal documents from China Judgements Online (2019–2024) to create a criminal motive text dataset. To mitigate the long-tailed issue, we propose a Category-Aware Balanced Contrastive Learning (CA-BCL) method, which enhances the model’s representation of long-tailed data. Specifically, CA-BCL first balances the sampling process to alleviate class imbalance during prototype construction, and then applies balanced contrastive learning to improve the model’s ability to generalize to long-tailed categories, leading to better overall classification performance. Our experimental results demonstrate that CA-BCL significantly outperforms existing text classification models in criminal motive classification, while also generalizing well to standard text classification benchmarks.
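
The two stages named in the abstract (balanced sampling for prototype construction, then balanced contrastive learning) can be illustrated with a minimal sketch. This is not the authors' CA-BCL implementation, which the abstract does not specify. The function names are hypothetical, the features stand in for encoder outputs, and the class-averaged denominator is only one common way to balance a supervised contrastive loss, assumed here for illustration.

```python
import numpy as np

def balanced_sample_indices(labels, per_class, rng):
    """Draw the same number of examples from every class.

    Rare (tail) classes are sampled with replacement when they have
    fewer than `per_class` examples.
    """
    labels = np.asarray(labels)
    picks = []
    for c in np.unique(labels):
        pool = np.flatnonzero(labels == c)
        picks.append(rng.choice(pool, size=per_class, replace=len(pool) < per_class))
    return np.concatenate(picks)

def class_prototypes(features, labels):
    """Return (classes, prototypes): the L2-normalised mean feature per class."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    return classes, protos

def balanced_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss with a class-averaged denominator.

    Assumption: each class contributes the MEAN (not the sum) of its
    exponentiated similarities, so head classes cannot dominate.
    """
    z = np.asarray(features, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    labels = np.asarray(labels)
    n = len(labels)
    logits = z @ z.T / temperature
    exp_logits = np.exp(logits)
    not_self = ~np.eye(n, dtype=bool)

    # Denominator: sum over classes of the mean similarity to that class.
    denom = np.zeros(n)
    for c in np.unique(labels):
        mask = (labels == c)[None, :] & not_self
        count = mask.sum(axis=1)
        denom += (exp_logits * mask).sum(axis=1) / np.maximum(count, 1)

    # Numerator: average log-probability over same-class positives.
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one positive
    log_prob = logits - np.log(denom)[:, None]
    per_anchor = -(log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return per_anchor.mean()
```

In this sketch, a long-tailed label array would first be passed to `balanced_sample_indices`, and the resulting batch would feed both `class_prototypes` and `balanced_contrastive_loss`. Averaging within each class in the denominator is the design choice that keeps frequent motive categories from swamping the rare ones.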

Published in EPJ Data Science

ISSN: 2193-1127 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: http://www.epjdatascience.com/
