Considering Performance in the Automated and Manual Coding of Sociolinguistic Variables: Lessons From Variable (ING)

Tyler Kendall; Charlotte Vaughn; Charlie Farrington; Kaylynn Gunter; Jaidan McLean; Chloe Tacata; Shelby Arnson

doi:10.3389/frai.2021.648543

Frontiers in Artificial Intelligence (Apr 2021)

Affiliations

Tyler Kendall: Linguistics Department, University of Oregon, Eugene, OR, United States
Charlotte Vaughn: Linguistics Department, University of Oregon, Eugene, OR, United States
Charlotte Vaughn: Language Science Center, University of Maryland, College Park, MD, United States
Charlie Farrington: Linguistics Department, University of Oregon, Eugene, OR, United States
Charlie Farrington: English Department, North Carolina State University, Raleigh, NC, United States
Kaylynn Gunter: Linguistics Department, University of Oregon, Eugene, OR, United States
Jaidan McLean: Linguistics Department, University of Oregon, Eugene, OR, United States
Chloe Tacata: Linguistics Department, University of Oregon, Eugene, OR, United States
Shelby Arnson: Linguistics Department, University of Oregon, Eugene, OR, United States

DOI: https://doi.org/10.3389/frai.2021.648543
Journal volume & issue: Vol. 4

Abstract

Impressionistic coding of sociolinguistic variables like English (ING), the alternation between pronunciations like talkin' and talking, has been a central part of the analytic workflow in studies of language variation and change for over half a century. Techniques for automating the measurement and coding of a wide range of sociolinguistic data have been on the rise over recent decades, but procedures for coding some features, especially those without clearly defined acoustic correlates like (ING), have lagged behind others, such as vowels and sibilants. This paper explores computational methods for automatically coding variable (ING) in speech recordings, examining the use of automatic speech recognition procedures related to forced alignment (using the Montreal Forced Aligner) as well as supervised machine learning algorithms (linear and radial support vector machines, and random forests). Considering the automated coding of pronunciation variables like (ING) raises broader questions for sociolinguistic methods, such as how much different human analysts agree in their impressionistic codes for such variables and what data might act as the “gold standard” for training and testing of automated procedures. This paper explores several of these considerations in automated and manual coding of sociolinguistic variables and provides baseline performance data for automated and manual coding methods. We consider multiple ways of assessing algorithms' performance, including agreement with human coders, as well as the impact on the outcome of an analysis of (ING) that includes linguistic and social factors. Our results show promise for automated coding methods but also highlight that variability in results should be expected even with carefully human-coded data. All data for our study come from the public Corpus of Regional African American Language, and code and derivative datasets (including our hand-coded data) are available with the paper.

Published in Frontiers in Artificial Intelligence

ISSN: 2624-8212 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.frontiersin.org/journals/artificial-intelligence#
