Eesti Rakenduslingvistika Ühingu Aastaraamat (May 2017)

Automatic detection of good dictionary examples in Estonian learner's dictionaries

  • Kristina Koppel

DOI
https://doi.org/10.5128/ERYa13.04
Journal volume & issue
Vol. 13
pp. 53 – 71

Abstract

"Automatic detection of good dictionary examples in Estonian learner's dictionaries". This paper first explains how a tool called Good Dictionary Example (GDEX) (Kilgarriff et al. 2008) scores corpus sentences and helps the lexicographer automatically select the best examples for dictionaries. Second, it introduces the training datasets, which contain example sentences from the Estonian Collocations Dictionary (ECD). Third, it examines the parameters that characterize good dictionary examples. Most of the paper is based on an analysis of the training datasets and an evaluation of the previous GDEX configurations, carried out with the graphical user interface GDEX Editor. Based on the results of the statistical analysis and the evaluation of the different configurations, a new configuration, GDEX 1.4, is introduced, which implements 16 new parameters. Its main parameters are as follows: the sentence is a full sentence; sentence length is 4–20 tokens; the sentence contains a verb; it contains no low-frequency words and no words from the blacklist; the optimal length is 6–12 tokens; sentences are penalized if they contain more than one adverb, pronoun, proper name, numeral, conjunction or comma, more than two verbs, or certain pronouns. The output of GDEX 1.4 can be applied in the ECD project and used to create SkELL, a web interface for learners of Estonian.

The article focuses on the creation of version 1.4 of the Estonian module of the tool Good Dictionary Example, or GDEX (Kilgarriff et al. 2008). GDEX is a tool that helps to automatically identify corpus sentences suitable for use as dictionary examples. GDEX modules have so far been created for English, Slovene, Dutch, Portuguese, Spanish, Japanese and Estonian. The article first explains the general working principles of the tool. It then turns to the statistical analysis of the parameters used to identify example sentences and to the setting of their values. Based on the evaluation of the parameter values and a comparison of the different modules, a new version 1.4 of the Estonian module is proposed.
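
The main parameters listed in the abstract can be illustrated with a minimal Python sketch. This is not the GDEX 1.4 configuration itself. The function name `score_sentence`, the part-of-speech labels, the penalty weight, the split between hard filters and soft penalties, and the test for a "full sentence" are all assumptions made for illustration. The penalty for "certain pronouns" is omitted because the abstract does not name them.

```python
from collections import Counter

# Hypothetical part-of-speech labels; a real tagger's tag set will differ.
# Each value is the largest count that is NOT penalized.
PENALIZED_ABOVE = {"adverb": 1, "pronoun": 1, "proper": 1, "numeral": 1,
                   "conjunction": 1, "comma": 1, "verb": 2}
PENALTY = 0.1  # assumed weight per violated limit, not taken from the paper


def score_sentence(tagged, blacklist=frozenset(), rare_words=frozenset()):
    """Return a score in [0, 1]; 0.0 means a hard filter rejected the sentence.

    `tagged` is a list of (token, pos) pairs, punctuation included.
    `blacklist` and `rare_words` are sets of lower-cased word forms.
    """
    if not tagged:
        return 0.0
    tokens = [tok for tok, _ in tagged]
    counts = Counter(pos for _, pos in tagged)
    counts["comma"] = tokens.count(",")  # commas are counted by token

    # Hard filters (assumed): full sentence, 4-20 tokens, at least one verb.
    is_full = tokens[0][:1].isupper() and tokens[-1] in {".", "!", "?"}
    if not is_full or not 4 <= len(tokens) <= 20 or counts["verb"] == 0:
        return 0.0
    lowered = {tok.lower() for tok in tokens}
    if lowered & blacklist or lowered & rare_words:
        return 0.0

    # Soft penalties: length outside 6-12 tokens, too many of a category.
    violations = 0 if 6 <= len(tokens) <= 12 else 1
    violations += sum(counts[pos] > limit
                      for pos, limit in PENALIZED_ABOVE.items())
    return max(0.0, 1.0 - PENALTY * violations)
```

Under these assumptions, a candidate that fails any hard filter scores 0.0, and every violated soft limit lowers the score by the same fixed amount, so candidates can be ranked by score.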

Keywords