Applied Sciences (Feb 2025)

A Small-Scale Evaluation of Large Language Models Used for Grammatical Error Correction in a German Children’s Literature Corpus: A Comparative Study

  • Phuong Thao Nguyen
  • Bernd Nuss
  • Roswita Dressler
  • Katie Ovens

DOI
https://doi.org/10.3390/app15052476
Journal volume & issue
Vol. 15, no. 5
p. 2476

Abstract

Grammatical error correction (GEC) has become increasingly important for enhancing the quality of OCR-scanned texts. This small-scale study explores the application of Large Language Models (LLMs) for GEC in German children’s literature, a genre that poses unique linguistic challenges due to modified language, colloquial expressions, and complex layouts that often lead to OCR-induced errors. While conventional rule-based and statistical approaches have been used in the past, advances in machine learning and artificial intelligence have introduced models capable of more contextually nuanced corrections. Despite these developments, little research has evaluated the effectiveness of state-of-the-art LLMs specifically in the context of German children’s literature. To address this gap, we fine-tuned the encoder-based models GBERT and GELECTRA on German children’s literature and compared their performance with that of the decoder-based models GPT-4o and the Llama series (versions 3.2 and 3.1) in a zero-shot setting. Our results demonstrate that all pretrained models, both encoder-based (GBERT, GELECTRA) and decoder-based (GPT-4o, Llama series), failed to remove OCR-generated noise from children’s literature effectively, highlighting the necessity of a preprocessing step to handle structural inconsistencies and artifacts introduced during scanning. This study also addresses the lack of comparative evaluations of encoder-based and decoder-based models for German GEC, as most prior work has focused on English. Quantitative analysis reveals that the decoder-based models significantly outperform the fine-tuned encoder-based models, with GPT-4o and Llama-3.1-70B achieving the highest accuracy in both error detection and correction. Qualitative assessment further highlights distinct model behaviors: GPT-4o demonstrates the most consistent correction performance, handling grammatical nuances effectively while minimizing overcorrection, whereas Llama-3.1-70B excels in error detection but occasionally relies on frequency-based substitutions rather than meaning-driven corrections. Our findings also indicate that, unlike earlier decoder-based models, which often exhibited overcorrection tendencies, state-of-the-art decoder-based models strike a better balance between correction accuracy and semantic preservation. By identifying the strengths and limitations of different model architectures, this study supports efforts to improve the accessibility and readability of OCR-scanned German children’s literature. It also provides new insights into the role of preprocessing in digitized text correction, the comparative performance of encoder- and decoder-based models, and the evolving correction tendencies of modern LLMs. These findings contribute to language preservation, corpus linguistics, and digital archiving, offering an AI-driven approach to improving the quality of digitized children’s literature while preserving linguistic and cultural integrity. Future research should explore multimodal approaches that integrate visual context to further improve correction accuracy for children’s books with image-embedded text.
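
To make the zero-shot setup concrete, the following minimal sketch shows how a decoder-based model such as GPT-4o can be prompted for sentence-level correction. This is an illustration only, not the authors' pipeline: the prompt wording, the use of the openai Python client, and the temperature setting are all assumptions.

    # Minimal zero-shot GEC sketch (assumptions: the openai Python client is
    # installed and OPENAI_API_KEY is set; the prompt wording is illustrative,
    # not the prompt used in the study).
    from openai import OpenAI

    client = OpenAI()

    def correct_sentence(sentence: str) -> str:
        """Ask GPT-4o to fix OCR and grammar errors in one German sentence."""
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,  # deterministic output for evaluation
            messages=[
                {"role": "system",
                 "content": ("Du bist ein Korrekturleser. Korrigiere OCR- und "
                             "Grammatikfehler im folgenden Satz. Gib nur den "
                             "korrigierten Satz zurück.")},
                {"role": "user", "content": sentence},
            ],
        )
        return response.choices[0].message.content.strip()

    print(correct_sentence("Der Hnnd lief sclmell über die Wiese."))

Setting the temperature to 0 keeps the output deterministic, which matters when comparing several models on the same test sentences.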
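
The reported detection and correction accuracies imply a token-level comparison of the OCR source, the model output, and a gold reference. The sketch below shows one simple way such scores could be computed; the equal-token-count alignment assumption and the metric definitions are ours, not necessarily those used in the paper.

    # Hedged sketch of a token-level detection/correction score, assuming the
    # source, hypothesis, and reference sentences align token for token.
    def detection_and_correction(src: str, hyp: str, ref: str):
        src_t, hyp_t, ref_t = src.split(), hyp.split(), ref.split()
        errors = detected = corrected = 0
        for s, h, r in zip(src_t, hyp_t, ref_t):
            if s != r:          # gold reference says this token was wrong
                errors += 1
                if h != s:      # the model changed it -> error detected
                    detected += 1
                if h == r:      # the model produced the gold token -> corrected
                    corrected += 1
        if errors == 0:
            return 1.0, 1.0
        return detected / errors, corrected / errors

    det, cor = detection_and_correction(
        "Der Hnnd lief sclmell über die Wiese .",
        "Der Hund lief schnell über die Wiese .",
        "Der Hund lief schnell über die Wiese .")
    print(f"detection: {det:.2f}, correction: {cor:.2f}")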

Keywords