Principes et outils pour l’annotation des corpus [Principles and tools for corpus annotation]

Mary Amoyal; Roxane Bertrand; Brigitte Bigi; Auriane Boudin; Christine Meunier; Berthille Pallaud; Béatrice Priego-Valverde; Stéphane Rauzy; Marion Tellier

doi:10.4000/tipa.5424

TIPA. Travaux interdisciplinaires sur la parole et le langage (Jan 2023)

Journal volume & issue: Vol. 38

Abstract

Corpus linguistics (i.e. research on language based on written or oral linguistic material that has been collected and stored) has developed considerably over recent decades, drawing on ever more numerous and larger corpora. The increased size of corpora has required the development of automatic tools for their analysis, but also genuine reflection on the nature and objectives of corpus annotation. Enriching corpora with a set of specific annotations has emerged, in most cases, as a prerequisite for any linguistic analysis.

Annotating a corpus consists of adding information relevant to its exploitation. The value of annotated corpora (i.e. corpora enriched at different linguistic levels) is that each annotated level, and the links between levels, can be studied. The work carried out at the LPL on corpus enrichment was initially intended to make the study of multimodality possible, from the finest levels of granularity (phonemes) up to the mimo-gestural levels, by way of the syntactic, discursive, prosodic, and interactional levels. It was therefore necessary to think about annotation early on, at the level of information representation. A global annotation scheme makes it possible to handle all these levels within a single formal approach that facilitates their subsequent querying.

Whatever the level of annotation, several questions arise. On the one hand, there are questions about the labels used (e.g. decomposition, typology, function, gradual/categorical nature); on the other hand, there are questions about the temporal anchoring of these labels (location and boundaries). For certain levels of annotation, it is also necessary to describe the dependencies between the different labels.

These questions must be considered in relation to the research objectives. The work within each annotation level is then relatively similar: the aim is to establish an annotation scheme that allows the most consistent and robust annotation possible. This scheme is grounded in theoretical knowledge and designed to answer research questions. Once the annotation scheme is established, an annotation guide can be written for potential annotators (expert or naive). Most often, annotations are produced by several annotators so that their consistency can be evaluated (inter-annotator agreement). The cross-cutting issue of heterogeneity in human annotations is addressed in this chapter.

In this chapter, we present some of the main steps that have been carried out to annotate corpora manually or automatically, together with the research issues associated with them. These steps are listed below:

- Automatic search of IPUs and orthographic transcription
From the collected primary data, we automatically search for IPUs (Inter-Pausal Units), which yields a segmentation into silence blocks versus sound blocks. We then perform the orthographic transcription within these IPUs. This transcription step is crucial, as it constitutes the tier from which the other annotation levels are developed. Here again, the choices made in terms of transcription (the chosen convention) have an impact on the links between annotation levels. Once the orthographic transcription is done and aligned with the signal at the IPU level, many annotations can be obtained manually, automatically, or semi-automatically.

- Phonetic and lexical annotation
We develop, distribute, and regularly enrich an automatic annotation software package, SPPAS, which also normalizes the transcribed text, that is, produces the tokens. From these tokens within the IPUs, SPPAS performs the grapheme-phoneme conversion based on a grammar of the possible pronunciations of each IPU. Finally, SPPAS provides the temporal alignment of phonemes, which is now rarely performed manually. Manual and automatic phonetic annotation are nevertheless different but complementary processes: spontaneous speech produces phonetic realizations (reductions) that are difficult to handle in automatic alignment. Consequently, (1) it may be necessary to manually correct parts of the automatic alignment, and (2) alignment errors can be used to locate these specific phonetic realizations. In this chapter, we address the issues related to these two aspects. Other annotations can then be derived automatically from this phoneme segmentation: the alignment of tokens, and the grouping of phonemes into syllables by a rule-based system.

- Syntactic annotation
Syntactic annotation is based on tokens. Although automatic syntactic analyzers are available for written language, the syntactic analysis of spoken French remains a challenge. We present here the methodology we have adopted to adapt our tagger, originally built for written text, to spontaneous spoken transcripts. Although the performance of our MarsaTag tagger is already acceptable, improving the tool will require multi-level modeling that includes disfluency phenomena (see below) and a more precise treatment of discourse markers.

- Annotation of disfluencies
Oral utterances contain many variations in verbal fluency at several levels (e.g. the rate of pronunciation of words, phrases, or clauses). These variations can also occur at the acoustic and phonetic levels. At the morphological and syntactic levels, some of these variations take the form of genuine self-interruptions that suspend the syntagmatic flow of the utterance. In our corpus analyses, we chose to retain, in addition to filled and unfilled pauses, discourse elements, and interjections, the traces of discourse elaboration, which include, among other things, word beginnings or fragments and breaks within syntagms. This strategy made it possible to envisage a detailed and exhaustive description of the phenomena grouped under the term “disfluency”.

- Annotation of speech and interactions
From the speech signal and its transcription, it is also possible to annotate several pragmatic levels, such as the thematic organization of conversational interactions. Several levels of annotation are described in this chapter: the annotation of conversational themes, thematic transitions (i.e. conversational moves that lead from one topic to another), and the phases of these transitions. Other phenomena are also described, such as feedback items and humorous sequences. We present the annotation protocol associated with these different phenomena, as well as the evaluation methods chosen to assess the reliability of these annotations.

- Mimo-gestural annotation
From the video signal, it is possible to produce a mimo-gestural annotation (facial expressions or co-verbal manual gestures, for example). This can be done either manually or semi-automatically. In this chapter, we first present the semi-automatic annotation protocol for smiles that we developed in order to annotate two conversational corpora. We present the SMAD tool, which annotates smiles automatically. We then describe the protocol for correcting these annotations. Finally, we discuss the evaluation method chosen to assess the robustness of the annotated data. We also present the manual annotation of co-verbal gestures, together with the inherent methodological issues such as annotation schemes and guides, typologies, and segmentation. We give examples of studies carried out at the LPL that propose different approaches to gesture annotation.
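
As an illustration of the first step above (segmenting a recording into silence blocks versus sound blocks), here is a minimal Python sketch. It is not the method used by SPPAS: the function name `find_ipus`, the per-frame energy input, and every parameter (`frame_s`, `threshold`, `min_silence_s`, `min_ipu_s`) are assumptions introduced only for this example. It simply thresholds frame energies, merges sounding stretches separated by very short silences, and drops stretches that are too short.

```python
from typing import List, Tuple


def find_ipus(
    energies: List[float],
    frame_s: float = 0.01,       # hypothetical frame duration, in seconds
    threshold: float = 0.1,      # hypothetical energy threshold
    min_silence_s: float = 0.2,  # hypothetical minimum pause duration
    min_ipu_s: float = 0.1,      # hypothetical minimum sounding-block duration
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of sounding blocks."""
    # 1. Label each frame as sounding (True) or silent (False).
    sounding = [e >= threshold for e in energies]

    # 2. Collect runs of consecutive sounding frames as [start, end) indices.
    runs: List[List[int]] = []
    start = None
    for i, is_sound in enumerate(sounding):
        if is_sound and start is None:
            start = i
        elif not is_sound and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(sounding)])

    # 3. Merge runs separated by a silence shorter than min_silence_s:
    #    such a short gap is not treated as a pause.
    min_sil = round(min_silence_s / frame_s)
    merged: List[List[int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_sil:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 4. Drop blocks shorter than min_ipu_s and convert frames to seconds.
    min_ipu = round(min_ipu_s / frame_s)
    return [
        (round(s * frame_s, 3), round(e * frame_s, 3))
        for s, e in merged
        if e - s >= min_ipu
    ]
```

In practice the threshold and the minimum durations would have to be tuned to the recording conditions, and the resulting boundaries would still need to be checked before the orthographic transcription is carried out within each block.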

Published in TIPA. Travaux interdisciplinaires sur la parole et le langage

ISSN: 2264-7082 (Online)
Publisher: Publications de l’Université de Provence
Country of publisher: France
LCC subjects: Language and Literature: Philology. Linguistics
Website: https://journals.openedition.org/tipa/
