Entropy (Jun 2017)
The Entropy of Words—Learnability and Expressivity across More than 1000 Languages
Abstract
The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and language sciences more generally. Information theory provides the tools to measure precisely the average amount of choice associated with words: the word entropy. Here, we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present two main findings: Firstly, word entropies display relatively narrow, unimodal distributions. There is no language in our sample with a unigram entropy of less than six bits/word. We argue that this is in line with information-theoretic models of communication. Languages are held in a narrow range by two fundamental pressures: word learnability and word expressivity, with a potential bias towards expressivity. Secondly, there is a strong linear relationship between unigram entropies and entropy rates. The entropy difference between words with and without co-textual information is narrowly distributed around three bits/word. In other words, knowing the preceding text reduces the uncertainty of words by roughly the same amount across languages of the world.
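To make the two quantities contrasted in the abstract concrete, the sketch below computes a plug-in (maximum-likelihood) estimate of unigram word entropy in bits/word, together with a bigram-conditioned entropy as a crude stand-in for the entropy rate; their difference illustrates the "reduction of uncertainty from co-text" discussed above. This is an illustration only, not the estimators, tokenization, or corpora used in the paper; the toy text and whitespace tokenizer are assumptions.

```python
from collections import Counter
from math import log2


def unigram_entropy(tokens):
    """Plug-in estimate H = -sum_w p(w) * log2 p(w) over word types."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def bigram_conditional_entropy(tokens):
    """Plug-in estimate of H(W_i | W_{i-1}), a rough proxy for the entropy rate."""
    pair_counts = Counter(zip(tokens, tokens[1:]))   # counts of (previous, current) pairs
    prev_counts = Counter(tokens[:-1])               # counts of the conditioning word
    n_pairs = len(tokens) - 1
    h = 0.0
    for (prev, _), c in pair_counts.items():
        p_pair = c / n_pairs                         # joint probability p(prev, w)
        p_cond = c / prev_counts[prev]               # conditional probability p(w | prev)
        h -= p_pair * log2(p_cond)
    return h


if __name__ == "__main__":
    # Toy "corpus" and whitespace tokenization: assumptions for illustration only.
    text = "the cat sat on the mat and the dog sat on the rug"
    tokens = text.split()
    h_uni = unigram_entropy(tokens)
    h_cond = bigram_conditional_entropy(tokens)
    print(f"unigram entropy:        {h_uni:.2f} bits/word")
    print(f"bigram cond. entropy:   {h_cond:.2f} bits/word")
    print(f"reduction from co-text: {h_uni - h_cond:.2f} bits/word")
```

On realistic corpora, such plug-in estimates are sensitive to text size, which is one of the estimation problems the paper addresses with larger parallel corpora and bias-corrected estimators.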
Keywords