Psihologija (Feb 2009)
STABILITY OF THE SYNTAGMATIC PROBABILITY DISTRIBUTIONS
Abstract
The aim of the present study is to establish criteria for the optimal size of a corpus that can provide stable conditional probabilities of morphological and/or syntagmatic types. The optimal corpus size is defined as the smallest sample that generates a probability distribution equal to the distribution derived from a large sample that generates stable probabilities. We refer to the latter distribution as the “target distribution”. To establish these criteria we varied the sample size, the word sequence size (bigrams and trigrams), the sampling procedure (randomly chosen words and continuous text), and the position of the target word in a sequence. The distributions of conditional probabilities derived from the smaller samples were correlated with the target distributions. The sample size at which the probability distribution reaches maximal correlation (r=1) with the target distribution was taken as optimal. The research was conducted on the Corpus of Serbian Language. For bigrams, the optimal sample size with random word selection is 65,000 words; for trigrams it is 281,000 words. In contrast, continuous text sampling requires much larger samples to reach stability: 810,000 words for bigrams and 868,000 words for trigrams. The factors that cause these differences remain unclear and need additional empirical investigation.
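The procedure described above can be sketched roughly as follows for the bigram case. This is a minimal illustration under stated assumptions, not the code used in the study: the function names and the toy tag sequence are hypothetical, random sampling is assumed to mean drawing bigrams from random positions in the corpus, and a type that does not occur in a sample is assumed to have probability 0 when the sample is correlated with the target distribution.

```python
import random
from collections import Counter
from math import sqrt


def conditional_probs(bigrams):
    """Estimate P(second | first) from a list of (first, second) pairs."""
    pair_counts = Counter(bigrams)
    first_counts = Counter(first for first, _ in bigrams)
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_with_target(sample_probs, target_probs):
    """Correlate a sample distribution with the target distribution.

    Assumption: bigram types absent from the sample get probability 0.
    """
    keys = sorted(target_probs)
    xs = [sample_probs.get(key, 0.0) for key in keys]
    ys = [target_probs[key] for key in keys]
    return pearson(xs, ys)


def stability_curve(bigrams, sample_sizes, randomize=False, seed=0):
    """Return (sample size, correlation with target) for each sample size.

    randomize=False takes the first `size` bigrams (continuous text);
    randomize=True draws `size` bigrams from random positions (assumed
    reading of "randomly chosen words").
    """
    target = conditional_probs(bigrams)
    rng = random.Random(seed)
    curve = []
    for size in sample_sizes:
        sample = rng.sample(bigrams, size) if randomize else bigrams[:size]
        r = correlation_with_target(conditional_probs(sample), target)
        curve.append((size, r))
    return curve


# Toy sequence of morphological tags (hypothetical data, for illustration only).
tags = ["N", "V", "N", "A", "N", "V", "A", "N", "V", "N"]
toy_bigrams = list(zip(tags, tags[1:]))
continuous = stability_curve(toy_bigrams, [3, 6, 9])
randomized = stability_curve(toy_bigrams, [3, 6, 9], randomize=True)
```

Under these assumptions, the optimal sample size would be read off each curve as the smallest size at which the correlation reaches its maximum; on a real corpus the sample sizes would range over thousands of words rather than the handful of bigrams used here.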