CzeGPT-2–Training New Model for Czech Generative Text Processing Evaluated With the Summarization Task

Adam Hajek; Ales Horak

doi:10.1109/access.2024.3371689

IEEE Access (Jan 2024)

Affiliations

Adam Hajek: Natural Language Processing Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ales Horak: Natural Language Processing Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic

Journal volume & issue: Vol. 12, pp. 34570–34581

Abstract

Automatic text summarization (ATS), alongside neural machine translation and question answering, is one of the leading tasks in Natural Language Processing (NLP). In recent years, ATS has developed significantly, especially for English. Modern approaches are mainly based on the versatile Transformer architecture proposed by Vaswani et al. in 2017, which revolutionized the field and was later tuned and adjusted to the needs of different tasks. Non-mainstream languages, with Czech taken as a representative, lag somewhat behind these efforts and tend to use lighter or heuristic methods. With the new CzeGPT-2 model and abstractive summarizer, we take a step forward: we detail the process of training a GPT-2 generative transformer model for a new language, provide a comprehensive evaluation on the task of Czech summarization, and point out the benefits of this approach. We also present an in-depth analysis of the errors in generated summaries, which makes it possible to locate the model’s weak spots.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
