Building English – Punjabi Aligned Parallel Corpora of Nouns from Comparable Corpora

Kaur Dilshad; Singh Satwinder

doi:10.2478/acss-2023-0024

Applied Computer Systems (Dec 2023)

Affiliations

Kaur Dilshad: Department of Computer Science and Technology, Central University of Punjab, Bathinda, Punjab, India
Singh Satwinder: Department of Computer Science and Technology, Central University of Punjab, Bathinda, Punjab, India

DOI: https://doi.org/10.2478/acss-2023-0024
Journal volume & issue: Vol. 28, no. 2
pp. 245 – 251

Abstract

Comparable corpora are a practical source for extracting parallel data because they are abundantly available, which matters most for language pairs where parallel data are scarce. This study focuses on building parallel data for the English-Punjabi language pair. The raw data were collected from the web content of “Mann Ki Baat”, a collection of speech transcripts of the Prime Minister of India, Mr. Narendra Modi, broadcast on the last Sunday of every month. The data were cleaned and pre-processed using a natural language toolkit. A BERT-based alignment model was built to align the two text files at the sentence level. Nouns were then extracted with the NLTK library in Python. The resulting noun-aligned dataset for the English-Punjabi language pair is available in the Mendeley Data repository.
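The abstract does not give the details of the alignment model or the noun filter, so the following is only a minimal sketch of how the two steps could fit together, not the authors' implementation. It assumes each sentence has already been turned into a vector (for example by a BERT encoder, which is not shown here) and that tokens have already been tagged with Penn Treebank tags, as `nltk.pos_tag` produces. The function names, the greedy one-to-one matching strategy, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math

# Penn Treebank noun tags (the tagset nltk.pos_tag emits for English).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_sentences(src_vecs, tgt_vecs, threshold=0.5):
    """Greedy one-to-one alignment (an assumed strategy, not the paper's).

    Scores every source/target pair by cosine similarity, then repeatedly
    keeps the best-scoring pair whose sentences are both still unused.
    Pairs scoring below `threshold` are discarded.
    Returns a sorted list of (source_index, target_index) tuples.
    """
    scored = sorted(
        (
            (cosine(u, v), i, j)
            for i, u in enumerate(src_vecs)
            for j, v in enumerate(tgt_vecs)
        ),
        reverse=True,
    )
    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining pairs are all weaker
        if i not in used_src and j not in used_tgt:
            pairs.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(pairs)


def extract_nouns(tagged_tokens):
    """Keep only the words whose POS tag is a noun tag."""
    return [word for word, tag in tagged_tokens if tag in NOUN_TAGS]
```

In use, `src_vecs` and `tgt_vecs` would hold one embedding per English and per Punjabi sentence, and `extract_nouns` would be applied to the tagged tokens of each aligned English sentence. A greedy matcher is chosen here only because it is short; it does not handle one-to-many alignments, which comparable corpora often contain.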

Published in Applied Computer Systems

ISSN: 2255-8691 (Online)
Publisher: Sciendo
Country of publisher: Poland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://sciendo.com/journal/ACSS
