Nepali Dialogue Corpus

This research examines the current landscape of Nepali dialogue corpora, identifies their limitations, and aims to establish a new benchmark for Nepali dialogue data.

Problem:

Comprehensive Nepali dialogue corpora are scarce, which limits the development and evaluation of dialogue-based applications and systems. The corpora that do exist, if any, may be limited in size, restricted to a single domain, or lacking in diversity, which reduces their usefulness for broader research and development. In addition, robust methodologies for collecting dialogues in a weakly supervised manner are lacking, which makes building a larger corpus harder.

Research Aim:

The primary aim of this research is threefold:

  1. Evaluate existing Nepali dialogue corpora, if available, to identify their limitations and areas for improvement.
  2. Establish a new benchmark for Nepali dialogue data by curating a diverse and comprehensive corpus that addresses the identified limitations.
  3. Explore methodologies for collecting dialogues in a weakly supervised manner from platforms such as Twitter, to improve the scalability and diversity of the dialogue corpus.
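
As an illustration of the third aim, the sketch below shows one possible way to assemble multi-turn dialogues from public reply threads without manual annotation. It is not the project's actual pipeline: the `Post` record, the `build_dialogues` function, and the thresholds are hypothetical names and values chosen for this example, and no platform API is called. The weak supervision signal is the reply link itself, combined with a simple script check that keeps only threads written mostly in Devanagari. Note that Devanagari is shared with Hindi and other languages, so a real pipeline would need a proper language identifier on top of this filter.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Post:
    """Hypothetical minimal record for one public post or reply."""
    post_id: str
    author: str
    text: str
    reply_to: Optional[str] = None  # id of the parent post, if any


def devanagari_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Devanagari block (U+0900-U+097F)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0900" <= ch <= "\u097f" for ch in letters) / len(letters)


def build_dialogues(posts: List[Post], min_turns: int = 2,
                    min_ratio: float = 0.5) -> List[List[Post]]:
    """Follow reply links from each leaf post back to its root to form dialogues.

    A chain is kept only if it has at least `min_turns` turns and every turn
    passes the Devanagari script check (assumed thresholds, not tuned values).
    """
    by_id: Dict[str, Post] = {p.post_id: p for p in posts}
    parents = {p.reply_to for p in posts if p.reply_to is not None}
    leaves = [p for p in posts if p.post_id not in parents]

    dialogues: List[List[Post]] = []
    for leaf in leaves:
        chain: List[Post] = []
        seen = set()  # guards against malformed, cyclic reply links
        node: Optional[Post] = leaf
        while node is not None and node.post_id not in seen:
            seen.add(node.post_id)
            chain.append(node)
            node = by_id.get(node.reply_to) if node.reply_to else None
        chain.reverse()  # oldest turn first
        if len(chain) >= min_turns and all(
            devanagari_ratio(turn.text) >= min_ratio for turn in chain
        ):
            dialogues.append(chain)
    return dialogues
```

Walking from leaves to roots means a branching thread yields one dialogue per branch, which shares early turns across dialogues; whether to deduplicate those is a design choice left open here.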

Outcome So Far:

The research has begun with a review of existing Nepali dialogue corpora, focusing on their strengths, weaknesses, and potential areas for enhancement. Work has also started on methods for collecting dialogues in a weakly supervised manner from the conversational data available on platforms like Twitter. The ultimate goal is to establish a curated Nepali dialogue benchmark and to apply baseline machine learning methods to validate how well the curated dataset generalizes, thereby contributing to dialogue-based research and applications in Nepali.
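
The baseline methods and metrics for this benchmark are not specified above. One common choice for dialogue corpora, used by Lowe et al. for the Ubuntu Dialogue Corpus, is next-response selection scored with Recall@k: a model scores a set of candidate responses for each context, and the metric counts how often the true response lands in the top k. The sketch below assumes that setup; `recall_at_k` and its arguments are illustrative names, not part of the project.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]],
                true_index: Sequence[int], k: int) -> float:
    """Share of examples whose true response ranks in the top k by score.

    scores[i][j] is the model score for candidate j of example i;
    true_index[i] is the position of the correct response for example i.
    """
    if len(scores) != len(true_index):
        raise ValueError("scores and true_index must have the same length")
    if not scores:
        return 0.0
    hits = 0
    for cand_scores, gold in zip(scores, true_index):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(cand_scores)),
                        key=lambda i: cand_scores[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(scores)
```

Reporting the metric at several values of k, with a fixed number of candidates per context, makes results comparable across baselines.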

References:
Lowe, Ryan, et al. “The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems.” arXiv preprint arXiv:1506.08909 (2015).

Research Themes: B Bhattarai MultiModal Learning Lab (MMLL)
Project Category: NLP