Arbitrer (Sep 2024)

Developing the Corpus of Minangkabau Language: Insights, Challenges, and Future Directions

  • Handoko Handoko

DOI
https://doi.org/10.25077/ar.11.3.413-429.2024
Journal volume & issue
Vol. 11, no. 3
pp. 413 – 429

Abstract

This paper discusses the design of a Minangkabau language corpus, with particular attention to the opportunities and challenges involved. Developing a Minangkabau corpus is a crucial project for documenting, preserving, and reviving the cultural wealth carried by the language. A Minangkabau language corpus can open opportunities for more intensive research on the language using a more modern, data-based approach. It can also encourage the development of corpus-based teaching materials for Minangkabau. The corpus is assembled manually from a variety of sources through comprehensive data collection, annotation, and curation pipelines. These sources may be written texts, such as manuscripts, books, and newspapers, or spoken material, such as interviews and public speeches. Multimedia resources, such as television and radio broadcasts, audio-video recordings, and social media content, add to the diversity of the data gathered. Accessible digital sources, such as online videos, online radio programs, and ebooks, can make data collection easier. However, several challenges may arise in developing the corpus, such as limited access to technology, dialect variation, and the need for highly skilled human resources. This paper explains some opportunities for developing the Minangkabau language corpus and for increasing its role in revitalizing and documenting the language. Furthermore, the corpus can serve as a starting point for developing language technology, such as voice recognition, text-to-speech, and natural language processing.

Keywords