Preparing Non-English Texts for Computational Analysis

Quinn Dombrowski

doi:10.3828/mlo.v0i0.294

Modern Languages Open (Aug 2020)

Affiliations

Quinn Dombrowski: Stanford University

DOI: https://doi.org/10.3828/mlo.v0i0.294
Journal volume & issue: Vol. 0, no. 1

Abstract

Most methods for computational text analysis involve doing things with “words”: counting them, looking at their distribution within a text, or seeing how they are juxtaposed with other words. While there’s nothing about these methods that limits their use to English, they tend to be developed with certain assumptions about how “words” work – among them, that words are separated by a space, and that words are minimally inflected (i.e. that there aren’t a lot of different forms of a word). English fits both of these assumptions, but many languages do not. This tutorial covers major challenges for doing computational text analysis caused by the grammar or writing systems of various languages, and ways to overcome these issues.
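
As a minimal illustration of the two assumptions named above (a sketch that is not taken from the tutorial itself), the following Python snippet counts "words" by splitting on whitespace. The sample sentences and the helper name `whitespace_tokens` are assumptions chosen for this example. The approach behaves sensibly for English, treats an unspaced Chinese sentence as a single "word", and counts four case forms of the Russian word for "book" as four unrelated words.

```python
from collections import Counter


def whitespace_tokens(text):
    """Lowercase the text and split on whitespace: the naive notion of a 'word'."""
    return text.lower().split()


# Illustrative sample texts (assumptions for this sketch, not from the tutorial).
english = "the book is on the table and the books are on the shelf"
chinese = "我喜欢读书"  # "I like reading books"; written without spaces
russian = "книга книги книгу книгой"  # four case forms of the word for "book"

english_counts = Counter(whitespace_tokens(english))
chinese_tokens = whitespace_tokens(chinese)
russian_counts = Counter(whitespace_tokens(russian))

# English: splitting on spaces yields usable word counts.
print(english_counts.most_common(2))

# Chinese: with no spaces to split on, the whole sentence is one "word".
print(chinese_tokens)

# Russian: each inflected form is counted as a separate word.
print(russian_counts)
```

Handling these cases generally calls for a language-specific segmentation step for unspaced writing systems and a lemmatization or stemming step for highly inflected languages, applied before any counting takes place.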

Published in Modern Languages Open

ISSN: 2052-5397 (Online)
Publisher: Liverpool University Press
Country of publisher: United Kingdom
LCC subjects: Language and Literature
Website: http://www.modernlanguagesopen.org
