Remote Sensing in Ecology and Conservation (Mar 2021)

Fine‐scale observations of spatio‐spectro‐temporal dynamics of bird vocalizations using robot audition techniques

  • Shinji Sumitani,
  • Reiji Suzuki,
  • Shiho Matsubayashi,
  • Takaya Arita,
  • Kazuhiro Nakadai,
  • Hiroshi G. Okuno

DOI
https://doi.org/10.1002/rse2.152
Journal volume & issue
Vol. 7, no. 1
pp. 18–35

Abstract


Ecoacoustics needs sophisticated acoustic monitoring tools to extract a wide range of features from an observed mixture of sounds. We have developed a portable acoustic monitoring system called ‘HARKBird’, which consists of a laptop PC and an inexpensive commercial microphone array running the robot audition software HARK. HARKBird can extract acoustic events from a recording, providing the start and end timing of each event, its spatial information (e.g., position or direction relative to the microphone array), and the spectrogram of the sound separated from the original recording. In this study, we report how robot audition techniques contribute to monitoring the spatio‐spectro‐temporal dynamics of bird behaviors, using an extended and minimal system based on multiple microphone arrays. Dimension reduction of the separated sounds is important for integrating the information from multiple microphone arrays. As a dimension reduction algorithm, we use t‐SNE both to aid manual annotation of each sound and to generate the vocalization distribution automatically. We conducted playback experiments on the Spotted Towhee (Pipilo maculatus) to simulate different cases of territorial intrusion (song, call, or no playback). Our hypothesis was that playback of conspecific vocalizations would provoke aggressive responses in males, and that the effect of song playbacks would be more pronounced than that of call playbacks. Our primary aim is to test whether our system can extract the information on the aggressiveness of target individuals needed to examine this hypothesis. We show that the system, with manual annotation of vocalizations, can extract their different spatio‐spectro‐temporal dynamics under the different conditions, which supported our hypothesis. We also consider spectral‐affinity‐based automatic matching of localized sounds from different microphone arrays. The relative number of localized songs under each playback condition showed a trend similar to that obtained with the manual approach, implying that we can grasp the long‐term dynamics of vocalizations without costly annotation.
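As a rough illustration of the two automated steps described above, the sketch below shows (1) a t‐SNE embedding of separated sounds to aid annotation and (2) spectral‐affinity matching of sounds localized by two microphone arrays. This is not the authors' HARKBird code: the feature extraction (plain log spectrograms via SciPy), the thresholds, and all function names are illustrative assumptions.

```python
# Minimal sketch, NOT the HARKBird implementation: assumed features,
# names, and thresholds throughout.
import numpy as np
from scipy.signal import spectrogram
from sklearn.manifold import TSNE

def spectro_feature(waveform, fs=16000, n_frames=32):
    """Fixed-length log-spectrogram feature for one separated sound."""
    _, _, sxx = spectrogram(waveform, fs=fs, nperseg=512, noverlap=256)
    logs = np.log1p(sxx)
    # Subsample the time axis to a fixed number of frames so every
    # sound yields a feature vector of equal dimension.
    idx = np.linspace(0, logs.shape[1] - 1, n_frames).astype(int)
    return logs[:, idx].ravel()

def embed_sounds(features, seed=0):
    """2-D t-SNE embedding for visually clustering similar vocalizations.
    (perplexity must be below the number of sounds; 5 is an assumption.)"""
    return TSNE(n_components=2, perplexity=5,
                random_state=seed).fit_transform(np.asarray(features))

def match_across_arrays(feats_a, times_a, feats_b, times_b,
                        max_lag=0.5, min_cos=0.8):
    """Pair sounds from two arrays whose spectra are similar (cosine
    similarity >= min_cos) and whose onsets differ by <= max_lag s."""
    def unit(x):
        return x / (np.linalg.norm(x) + 1e-12)
    pairs = []
    for i, (fa, ta) in enumerate(zip(feats_a, times_a)):
        for j, (fb, tb) in enumerate(zip(feats_b, times_b)):
            if abs(ta - tb) <= max_lag and unit(fa) @ unit(fb) >= min_cos:
                pairs.append((i, j))
    return pairs
```

In practice, the 2‐D embedding would be scatter‐plotted so an annotator can label clusters of songs and calls at a glance, while the matched pairs approximate which sounds were heard simultaneously by both arrays.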

Keywords