NLP for Low Resource Languages of South Asia

South Asia is linguistically rich. Nepal alone, despite its relatively moderate population, has more than 100 spoken languages (not merely dialects!). Yet even Nepali, the most widely spoken language in the country, has limited data and digital resources, so recent advances in Natural Language Processing (NLP) and AI, such as large transformer models that demand vast amounts of data and compute, cannot be applied directly. We explore more data- and compute-efficient methods for various stages of the NLP pipeline, including language models. AI models that better understand these languages will make AI more inclusive, enabling tools and applications such as chatbots and voice bots for non-English-speaking populations. This will also broaden access to digital technology, which AI is rapidly transforming.

Latest Related Publications

Sulav Timilsina, Milan Gautam, Binod Bhattarai
NepBERTa: Nepali Language Model Trained in a Large Corpus
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2022), 2022
BibTeX

@inproceedings{timilsina-etal-2022-nepberta,
    title = "{N}ep{BERT}a: {N}epali Language Model Trained in a Large Corpus",
    author = "Timilsina, Sulav  and
      Gautam, Milan  and
      Bhattarai, Binod",
    booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
    month = nov,
    year = "2022",
    address = "Online only",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.aacl-short.34",
    pages = "273--284",
    abstract = "Nepali is a low-resource language with more than 40 million speakers worldwide. It is written in Devnagari script and has rich semantics and a complex grammatical structure. To this date, multilingual models such as Multilingual BERT, XLM and XLM-RoBERTa haven{'}t been able to achieve promising results in Nepali NLP tasks, and no such large-scale monolingual corpus existed. This study presents NepBERTa, a BERT-based Natural Language Understanding (NLU) model trained on the most extensive monolingual Nepali corpus ever. We collected a dataset of 0.8B words from 36 different popular news sites in Nepal and introduced the model. This dataset is three times larger than the previous publicly available corpus. We evaluated the performance of NepBERTa on multiple Nepali-specific NLP tasks, including Named-Entity Recognition, Content Classification, POS Tagging, and Sequence Pair Similarity. We also introduce two different datasets for two new downstream tasks and benchmark four diverse NLU tasks altogether. We bring all these four tasks under the first-ever Nepali Language Understanding Evaluation (Nep-gLUE) benchmark. We will make Nep-gLUE along with the pre-trained model and datasets publicly available for research.",
}
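NepBERTa is a BERT-based model, and BERT-style pretraining uses the masked language modeling (MLM) objective. As a minimal illustration of how MLM training examples are produced, here is a sketch of the standard masking scheme from the original BERT recipe (15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged). The token IDs and vocabulary size below are toy values, not NepBERTa's actual tokenizer settings:

```python
import random

MASK_ID = 103       # toy [MASK] id; real ids depend on the tokenizer
VOCAB_SIZE = 30000  # toy vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Produce (inputs, labels) for masked language modeling.

    BERT's scheme: 15% of tokens are selected for prediction; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Labels are -100 (ignored by the loss) at unselected positions.
    """
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            labels.append(tid)          # model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK_ID)
            elif r < 0.9:
                inputs.append(rng.randrange(VOCAB_SIZE))
            else:
                inputs.append(tid)      # kept as-is, still predicted
        else:
            inputs.append(tid)
            labels.append(-100)         # position ignored by the loss
    return inputs, labels

inputs, labels = mask_tokens(list(range(1000, 1020)))
```

Keeping 10% of the selected tokens unchanged forces the model to build a contextual representation for every input token, not only the visibly masked ones.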

Rabin Adhikari, Safal Thapaliya, Nirajan Basnet, Samip Poudel, Aman Shakya, Bishesh Khanal
COVID-19-related Nepali Tweets Classification in a Low Resource Setting
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications (SMM4H), Workshop & Shared Task, COLING 2022, Korea, 2022
BibTeX

@inproceedings{adhikari-etal-2022-covid,
    title = "{COVID}-19-related {N}epali Tweets Classification in a Low Resource Setting",
    author = "Adhikari, Rabin  and
      Thapaliya, Safal  and
      Basnet, Nirajan  and
      Poudel, Samip  and
      Shakya, Aman  and
      Khanal, Bishesh",
    booktitle = "Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop {\&} Shared Task",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.smm4h-1.52",
    pages = "209--215",
}
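The paper above tackles tweet classification with limited labeled Nepali data. For a frame of reference, a classical bag-of-words baseline needs no pretraining at all. Below is a minimal multinomial Naive Bayes sketch with Laplace smoothing; the toy English examples are hypothetical stand-ins and do not reproduce the paper's models or dataset:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial Naive Bayes on (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the label maximizing log P(label) + sum log P(token|label)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, n_docs in class_counts.items():
        lp = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # add-one (Laplace) smoothing handles unseen words
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy labeled tweets (hypothetical; stand-ins for the Nepali dataset)
train = [
    ("new covid cases reported today".split(), "covid"),
    ("vaccine drive starts in the city".split(), "covid"),
    ("great football match last night".split(), "other"),
    ("my favourite song on repeat".split(), "other"),
]
model = train_nb(train)
print(predict_nb(model, "covid vaccine cases".split()))  # → covid
```

Baselines like this are worth reporting in low-resource settings: when labeled data is scarce, simple count-based models can be surprisingly competitive with large pretrained ones.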
