TIPA. Travaux interdisciplinaires sur la parole et le langage (Dec 2013)

A quantitative study of discourse markers, disfluencies and speech overlaps in political interviews

  • Philippe Boula de Mareüil,
  • Gilles Adda,
  • Martine Adda-Decker,
  • Claude Barras,
  • Benoît Habert,
  • Patrick Paroubek

DOI
https://doi.org/10.4000/tipa.830
Journal volume & issue
Vol. 29

Abstract


At the interface between corpus linguistics and automatic speech processing, this study aims to improve our understanding of phenomena related to spontaneous speech, based on 8 hours of French political interviews from the television show L’heure de vérité, recorded in the early nineties. During each show, a political figure or a representative of civil society is interviewed by several journalists. The reported work focuses on the transcription, annotation and analysis of discourse markers, disfluencies and speech overlaps. Press-oriented (bona fide) transcripts available for these shows and the output of a speech recognition system were aligned and used to speed up the transcription process, in order to provide a fine-grained (verbatim) transcription of the audio data, including all audible speech events. Sibling corpora are very useful resources to facilitate hand corrections. A segmentation into multi-speaker speech portions was also performed manually, relaxing temporal anchoring constraints in the case of overlaps, because even locating the precise beginning and end of such events is not straightforward. The Transcriber software (trans.sourceforge.net/en/presentation.php) was customised accordingly to facilitate this task. Two situations were distinguished: (1) the overlap does not entail a speaker change (the primary speaker remains the same at the end of the overlap); (2) the primary speaker stops and the secondary speaker becomes the primary speaker of a new turn.

Three types of disfluencies were distinguished: filled pauses, repetitions and false starts. Together with discourse markers, they were analysed by utterance, speaker and pattern types. Silent pauses and lengthening phenomena were also measured, but they are not addressed in this paper. Speech overlaps were annotated using four tags: back-channel, turn stealing, anticipated turn taking, and complementary.
Back-channels like “hmm”s indicate that we follow our interlocutor, understand him/her, agree with him/her; they barely disturb the main speaker. Conversely, turn stealings clearly interrupt the main speaker, even though the attempt may fail, like any other speech act. Anticipated turn taking corresponds to the case where the incoming speaker seems to perceive cues indicating that the main speaker has finished (phrase or clause boundary, falling pitch, etc.). Finally, the complementary label was introduced for overlaps which aim at complementing the main speaker’s utterance: a possibly paraphrased repetition of the primary speaker’s statement, an explicit agreement or disagreement, a short anticipated answer, a clarification offered or requested, not only on the content but also on the form of the exchange (schedule, topic addressed), a witty remark or the continuation of the utterance. This complementary label, unlike the turn stealing one, is assigned to self-sufficient comments or utterances: the entering speaker does not take the floor to develop an argument. This type of overlap may be favoured by the situational context: beyond the speakers actively involved in the show, an actor may wish to provide additional information to the audience.

Differences between overlap tags can be subtle and give rise to diverging interpretations. Assigning a single label is not always straightforward. Even “hmm”s can have different communicative functions, such as signalling that one is eager to jump in. From one extreme to the other, progressive transitions are common during long-lasting turns. Two shows were annotated by 5 annotators, and the reference resulted from harmonising the different annotations, first through negotiation and then through adjudication for the disputed labels.
The label distribution across annotators confirmed the intermediate nature of the complementary label, and showed a rather high confusion rate (24%) between anticipated turn takings and turn stealings. Yet the manual annotation of the corpus based on the four overlap types gave good inter-annotator agreement (Kappa measures around 0.7). This first result allowed us to study the distribution of overlaps and their interplay with disfluencies and discourse markers.

In non-overlapping speech, each disfluency type (as well as discourse markers) accounts for about 2% of the corpus. Among disfluencies, hesitations (transcribed as euh in French) can be found almost anywhere. More precisely, 35% of filled pauses occur at a sentence boundary indicated by a full stop (14%) or at a major phrase boundary indicated by a comma (21%) in the bona fide transcription. In the middle of a sentence, hesitations frequently precede a determiner or a preposition, and they tend instead to follow a conjunction or a preposition. This asymmetry suggests that hesitations are avoided within noun phrases, especially between a determiner and a noun. In this situation, other mechanisms such as final lengthening or repetitions are preferred. Repetitions and false starts share some features. First, both involve 1 or 2 words on average, and there is a high correlation (0.8) across speakers between the numbers of repetitions and false starts: speakers who produce many repetitions also tend to make many revisions. Second, the most frequent repetitions and false starts tend to be monosyllabic function words: de ‘of’, le (corresponding to the determiner ‘the’ far more often than the pronoun ‘him’), etc. Interestingly, le outweighs la in both repetitions and false starts: it may be considered not only as the masculine form but also as the neutral form of the determiner. By contrast, the conjunction et ‘and’ hardly lends itself to revisions, and it is only found among repetitions.
It may also be considered as a discourse marker: as such, it is even more frequent than alors ‘so’ in the corpus.

Our study then focused on overlaps, which are frequent (3-4 per minute on average) even though they are short (2.5 words on average, compared with 30 words for speaker turns). Their cumulative duration represents less than 5% of the data. Non-intrusive overlaps such as back-channels, which encourage a fluid interaction, are particularly short. Figures are comparable for active and passive speakers (i.e. incoming speakers who produce the overlap situation and floor-holding speakers who are interrupted). However, active speakers in the turn stealing situation tend to speak faster (they produce more words) than their passive competitors.

Overlaps generate twice as many disfluencies as non-overlapping speech portions. The increase in disfluency rate mainly concerns repetitions, in particular for active speakers in intrusive overlap situations such as turn stealings. More repetitions and discourse markers are observed for active speakers than for passive speakers, which can also be explained by the turn-start position. Our study showed that disfluencies and discourse markers occur at the beginning rather than at the end of utterances.

Passive (primary) speakers become dramatically disfluent during complementary comments made by their interlocutors. This corroborates the intrusive nature of these complementary overlaps, which do not aim at a speaker change but may disturb the main speaker due to their length and informational content. By contrast, back-channels do not increase the disfluency rate of passive speakers. This rate is even lower than it is in non-overlapping speech. Finally, interesting differences are observed between journalists and interviewees, whose roles are asymmetric.
Even though their disfluency rates are on the whole comparable, journalists show higher disfluency rates when they are passive speakers in intrusive (turn stealing or complementary) overlap situations. In this case, there seems to be an exchange of standard roles (active interruption for journalists and passive overlaps for interviewees).

Enriched and more accurate models are necessary for both talk-in-interaction analysis and speech recognition. We think that drawing up a descriptive inventory of discourse markers, disfluencies and speech overlaps may contribute to the design of a pragmatics model and may help improve automatic conversational speech transcription, whose performance is still poor compared with prepared speech recognition.
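
As an illustration of the agreement figure reported above (Kappa around 0.7 for five annotators and four overlap tags), the following minimal Python sketch computes Fleiss' kappa, one common variant for more than two annotators. The abstract does not state which kappa variant was used, so this choice, the function names fleiss_kappa and to_counts, and the toy annotations are all assumptions made for illustration; they are not the authors' procedure or data.

```python
from collections import Counter

# The four overlap tags named in the abstract.
LABELS = ("back-channel", "turn stealing", "anticipated turn taking", "complementary")


def fleiss_kappa(count_rows):
    """Fleiss' kappa for a table of per-item label counts.

    count_rows[i][j] is the number of annotators who gave item i the
    j-th label; every item must carry the same number of annotations.
    """
    n_items = len(count_rows)
    n_raters = sum(count_rows[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in count_rows):
        raise ValueError("each item needs the same number (>= 2) of annotations")

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in count_rows
    ) / n_items

    # Chance agreement, estimated from the overall label proportions.
    n_labels = len(count_rows[0])
    totals = [sum(row[j] for row in count_rows) for j in range(n_labels)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    if p_e == 1.0:
        # Kappa is undefined when a single label is used throughout;
        # this sketch reports full agreement in that degenerate case.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


def to_counts(annotations, labels=LABELS):
    """Turn per-item lists of labels into rows of per-label counts."""
    return [[Counter(item)[label] for label in labels] for item in annotations]


# Invented toy data (hypothetical): 4 overlaps, each tagged by 5 annotators.
toy = [
    ["back-channel"] * 5,
    ["turn stealing"] * 3 + ["anticipated turn taking"] * 2,
    ["complementary"] * 4 + ["turn stealing"],
    ["anticipated turn taking"] * 5,
]
print(round(fleiss_kappa(to_counts(toy)), 3))
```

The count-table layout keeps the computation independent of how the labels are stored; pairwise Cohen's kappa between annotators would be an equally plausible reading of "Kappa measures" and would need a different function.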

Keywords