Data in Brief (Aug 2024)
KritiSamhita: A machine learning dataset of South Indian classical music audio clips with tonic classification
Abstract
Few Indian classical music (ICM) datasets exist, and fewer still are both large enough and annotated with subtler attributes, such as the tonic, to train classification or prediction models. The dataset described in this paper provides tonic annotations to fill this gap. The tonic pitch, or base pitch, plays so important a role in music that it is sometimes called the keynote. The vocalists and the accompanying instrumental ensemble are fine-tuned to this keynote to render the composition. The first and second authors of this paper, who are vocalists themselves, recorded songs in four different tonics: F#, G, G#, and A. Using the Python library pydub, each song, more than three minutes long, was segmented into 20-second snippets, with any remainder kept as a separate, shorter snippet. The raw audio snippets are available in folders separated by tonic, and a directory lists each snippet's file path and tonic. This dataset can be reused for future tonic classification work, as well as for training other automated systems that target higher-level attributes of ICM, such as the melodic framework, since the tonic can serve as the basis for all of them.
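The segmentation and directory steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the function names, the WAV output format, the snippet file naming, the CSV layout of the directory, and the assumption that each folder is named after its tonic are all hypothetical. The only pydub features relied on are `AudioSegment.from_file`, millisecond-based slicing, and `export`.

```python
import csv
from pathlib import Path

SNIPPET_MS = 20_000  # 20-second snippets, as stated in the abstract


def split_into_snippets(audio, snippet_ms=SNIPPET_MS):
    """Cut `audio` into consecutive snippets of `snippet_ms` milliseconds.

    `audio` only needs len() and slicing, which pydub's AudioSegment
    provides in milliseconds. The final slice holds the shorter remainder.
    """
    return [audio[start:start + snippet_ms]
            for start in range(0, len(audio), snippet_ms)]


def segment_file(path, out_dir, snippet_ms=SNIPPET_MS):
    """Load one recording with pydub and export its snippets (hypothetical naming)."""
    from pydub import AudioSegment  # imported here so the helper above has no hard dependency

    audio = AudioSegment.from_file(path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, snippet in enumerate(split_into_snippets(audio, snippet_ms)):
        out_path = out_dir / f"{Path(path).stem}_{i:03d}.wav"  # assumed file naming
        snippet.export(str(out_path), format="wav")  # assumed output format
        written.append(out_path)
    return written


def build_index(root, csv_path):
    """Write a CSV of (file_path, tonic), assuming folders are named by tonic."""
    root = Path(root)
    rows = [(str(p.relative_to(root)), p.parent.name)
            for p in sorted(root.glob("*/*.wav"))]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_path", "tonic"])  # assumed column names
        writer.writerows(rows)
    return rows
```

Under these assumptions, a recording of 3 minutes 10 seconds would yield nine 20-second snippets plus one 10-second remainder. pydub also ships a `make_chunks` helper in `pydub.utils` that performs the same fixed-length split.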