The Shona Corpus and the Problem of Tagging

Lexikos. 2011;10 DOI 10.5788/10--887


Journal Title: Lexikos

ISSN: 1684-4904 (Print); 2224-0039 (Online)

Publisher: Woordeboek van die Afrikaanse Taal-WAT

Society/Institution: Stellenbosch University

LCC Subject Category: Language and Literature: Philology. Linguistics | Language and Literature: Languages and literature of Eastern Asia, Africa, Oceania | Language and Literature: Germanic languages. Scandinavian languages

Country of publisher: South Africa

Language of fulltext: English, French, Dutch, Afrikaans, German

Full-text formats available: PDF



Emmanuel Chabata


Blind peer review

Time From Submission to Publication: 7 weeks


<p>Abstract: In this paper the writer examines problems the African Languages Lexical (ALLEX) Project (at present the African Languages Research Institute (ALRI)) encountered while tagging the Shona corpus. The problems to be highlighted include general problems which apply to more than one language as well as problems peculiar to Shona. The paper was inspired by the challenges the writer encountered when he took part in building the Shona corpus. An analysis of the problems that most corpus builders face shows that more problems are likely to be encountered when dealing with spoken corpora than with written corpora. The paper demonstrates that tagging is an important component of corpus building as it makes it easier for a researcher to extract relevant data. To utilise the benefits of a tagged corpus, the tagging should be thorough and accurate. Well-informed decisions form an integral part of the tagging process since the utility of a tagged corpus depends largely on the input of the tagging process. This paper shows the need to take the tagging process seriously.</p>
<p>Keywords: ALLEX PROJECT, COMPUTER, CORPUS, ENCODING, FOREIGN WORD, LEMMATIZATION, LEXICOGRAPHY, MONITOR CORPUS, PART OF SPEECH, SCANNING, SHONA, SLANG, TAGGING, TRANSCRIPTION, WORD</p>
<p>Opsomming: Die Shonakorpus en die probleem van etikettering. In hierdie artikel ondersoek die outeur probleme wat die African Languages Lexical (ALLEX) Project (tans die African Languages Research Institute (ALRI)) teëgekom het terwyl die Shonakorpus geëtiketteer is. Die probleme wat bespreek word, sluit algemene probleme in wat van toepassing is op meer as een taal, sowel as spesifieke probleme wat eie aan Shona is. Die artikel het sy ontstaan in die uitdagings wat die outeur teëgekom het terwyl hy deel gehad het aan die opbou van die Shonakorpus. 'n Ontleding van die probleme waarvoor die meeste korpusbouers te staan kom, toon dat daar waarskynlik meer probleme teëgekom word wanneer daar met gesproke korpora as met geskrewe korpora gewerk word. Die artikel toon dat etikettering 'n belangrike komponent van korpusbou is, aangesien dit dit vir die navorser makliker maak om relevante data te onttrek. Om die voordele van korpusetikettering te realiseer, moet die etikettering deeglik en akkuraat wees. Ingeligte besluite vorm 'n integrale deel van die etiketteringsproses aangesien die bruikbaarheid van 'n geëtiketteerde korpus hoofsaaklik afhang van die inset tydens die etiketteringsproses. Hierdie artikel toon die noodsaaklikheid om die etiketteringsproses ernstig op te neem.</p>
<p>Keywords: ALLEX-PROJEK, REKENAAR, KORPUS, ENKODERING, VREEMDE WOORD, LEMMATISERING, LEKSIKOGRAFIE, MONITORKORPUS, WOORDSOORT, SKANDERING, SHONA, SLENG, ETIKETTERING, TRANSKRIPSIE, WOORD</p>