Data in Brief (Aug 2024)

IndicDialogue: A dataset of subtitles in 10 Indic languages for Indic language modeling

  • Noor Mairukh Khan Arnob,
  • A. Faiyaz,
  • Md Mubtasim Fuad,
  • Shah Murtaza Rashid Al Masud,
  • Baivab Das,
  • M.F. Mridha

Journal volume & issue
Vol. 55
p. 110690

Abstract

The languages of the Indian subcontinent are underrepresented in current NLP literature. To narrow this gap, we present the IndicDialogue dataset, which contains subtitles and dialogues in 10 major Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali, and Assamese. The subtitles are sourced from OpenSubtitles.org and pre-processed to remove irrelevant tags, timestamps, square brackets, and links, so that only the relevant dialogue is retained in JSONL files. The IndicDialogue dataset comprises 7,750 raw subtitle files (SRT), 11 JSONL files, 6,853,518 dialogues, and 42,188,569 words. It is designed to serve as a foundation for pre-training language models for low-resource languages, enabling a wide range of downstream tasks including word embeddings, topic modeling, conversation synthesis, neural machine translation, and text summarization.
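
The pre-processing described in the abstract (dropping tags, timestamps, square-bracketed text, and links, then storing dialogues as JSONL) might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' pipeline: the function names `srt_to_dialogues` and `to_jsonl`, the regular expressions, and the JSONL field names `language` and `dialogue` are all hypothetical, and the actual schema of the released files may differ.

```python
import json
import re

# Illustrative patterns only; the authors' exact cleaning rules are not given.
TAG_RE = re.compile(r"<[^>]+>|\{[^}]*\}")      # HTML-like tags; brace-style tags are an assumption
BRACKET_RE = re.compile(r"\[[^\]]*\]")         # square-bracketed cues such as [music]
LINK_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
TIMESTAMP_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_dialogues(srt_text):
    """Return the cleaned dialogue strings found in one SRT file's text."""
    dialogues = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # SRT cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = [line.strip() for line in block.split("\n")]
        # The first line of a cue is normally its numeric index.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        kept = []
        for line in lines:
            if not line or TIMESTAMP_RE.match(line):
                continue
            line = TAG_RE.sub("", line)
            line = BRACKET_RE.sub("", line)
            line = LINK_RE.sub("", line)
            line = " ".join(line.split())  # collapse leftover whitespace
            if line:
                kept.append(line)
        if kept:
            dialogues.append(" ".join(kept))
    return dialogues


def to_jsonl(dialogues, language):
    """Serialize dialogues as JSON Lines; the field names are hypothetical."""
    return "\n".join(
        json.dumps({"language": language, "dialogue": d}, ensure_ascii=False)
        for d in dialogues
    )
```

Passing `ensure_ascii=False` keeps Indic scripts readable in the output instead of escaping them as `\uXXXX` sequences, which seems a sensible choice for a corpus intended for language-model pre-training.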

Keywords