Automated labeling of PDF mathematical exercises with word N-grams VSM classification

Taisei Yamauchi; Brendan Flanagan; Ryosuke Nakamoto; Yiling Dai; Kyosuke Takami; Hiroaki Ogata

doi:10.1186/s40561-023-00271-9

Smart Learning Environments (Oct 2023)

Affiliations

Taisei Yamauchi: Graduate School of Informatics, Kyoto University
Brendan Flanagan: Center for Innovative Research and Education in Data Science, Institute for Liberal Arts and Sciences, Kyoto University
Ryosuke Nakamoto: Graduate School of Informatics, Kyoto University
Yiling Dai: Academic Center for Computing and Media Studies, Kyoto University
Kyosuke Takami: Education Data Science Center, National Institute for Educational Policy Research
Hiroaki Ogata: Academic Center for Computing and Media Studies, Kyoto University

DOI: https://doi.org/10.1186/s40561-023-00271-9
Journal volume & issue: Vol. 10, no. 1
pp. 1 – 30

Abstract

In recent years, smart learning environments have become central to modern education, supporting students and instructors through tools based on prediction and recommendation models. These methods often use learning material metadata, such as the knowledge contained in an exercise, which is usually labeled by domain experts; this labeling is costly and difficult to scale. Automated labeling can ease the workload on experts, as seen in previous studies that applied automatic classification algorithms to research papers and Japanese mathematical exercises. However, these studies did not delve into fine-grained labeling. In addition, as the use of materials in such systems becomes more widespread, paper materials are converted into PDF format, and text extraction from these files can be incomplete. Previous research has placed little emphasis on labeling incomplete mathematical sentences. This study aims to achieve precise automated classification even from incomplete text inputs. To tackle these challenges, we propose a mathematical exercise labeling algorithm that uses word n-grams and can handle detailed labels, even for incomplete sentences, and we compare it with a state-of-the-art word embedding method. The results of the experiment show that mono-gram features with Random Forest models achieved the best performance, with macro F-measures of 92.50% and 61.28% for the 24-class and 297-class labeling tasks, respectively. The contribution of this research is to show that the proposed method, based on traditional simple n-grams, can find context-independent similarities in incomplete sentences and outperforms state-of-the-art word embedding methods in specific tasks such as classifying short and incomplete texts.
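
As a rough illustration of the two ideas the abstract relies on, the following minimal Python sketch builds word n-gram count vectors (the vector space model features) and computes a macro F-measure. It is not the authors' implementation: the function names `word_ngrams`, `vectorize`, and `macro_f1` are hypothetical, whitespace tokenization is an assumption (Japanese text would need a proper word segmenter), and the classifier step (Random Forest in the study) is omitted.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the contiguous word n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(text, n=1):
    """Map a text to a sparse count vector over its word n-grams.

    Assumption: tokens are separated by whitespace.
    """
    return Counter(word_ngrams(text.lower().split(), n))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro F-measure)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # A complete sentence and a truncated fragment still share n-gram features.
    full = vectorize("solve the quadratic equation", n=2)
    fragment = vectorize("quadratic equation x", n=2)
    print(sorted(set(full) & set(fragment)))
    print(macro_f1(["a", "a", "b"], ["a", "b", "b"]))
```

Because each n-gram is counted independently of its position, a fragment that loses its beginning or end during PDF extraction still shares features with the complete sentence, which is the kind of context-independent similarity the abstract refers to. The macro F-measure weights every class equally, so rare labels in the 297-class task count as much as frequent ones.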

Published in Smart Learning Environments

ISSN: 2196-7091 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Education: Special aspects of education
Website: https://slejournal.springeropen.com
