
Browsing by Author "Mukiibi, Jonathan"

Now showing 1 - 5 of 5
  • IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
    (arXiv preprint, 2024-06-05) Adelani, David Ifeoluwa; Zhuang, Jian Yun; Ochieng, Millicent; Mukiibi, Jonathan; Kabongo, Salomon; Stenetorp, Pontus
    Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g., African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 17 typologically diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multiple-choice knowledge-based question answering (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and six proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant gap between open and proprietary models: the best-performing open model, Gemma 2 27B, reaches only 63% of the performance of the best-performing proprietary model, GPT-4o. In addition, machine-translating the test set into English before evaluation helped to close the gap for larger English-centric models such as Gemma 2 27B and LLaMa 3.1 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.
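The three evaluation settings named in the abstract (zero-shot, few-shot, and translate-test) can be sketched as a small scoring harness. This is an illustrative sketch only, not IrokoBench's actual evaluation code; `model` and `translate` are hypothetical callables.

```python
# Hedged sketch of zero-shot / few-shot / translate-test evaluation.
# `model` and `translate` are hypothetical stand-ins, not IrokoBench code.

def evaluate(model, examples, n_shots=0, translate=None):
    """Accuracy of `model` on (question, answer) pairs under one setting.

    Simplification: few-shot demonstrations are drawn from the same list;
    a real harness would hold them out separately.
    """
    correct = 0
    for question, answer in examples:
        if translate is not None:
            question = translate(question)  # translate-test: evaluate in English
        context = "".join(f"Q: {q}\nA: {a}\n" for q, a in examples[:n_shots])
        prediction = model(context + f"Q: {question}\nA:")
        correct += prediction.strip() == answer
    return correct / len(examples)
```

The translate-test setting helps English-centric models precisely because `translate` moves the input into the language they handle best.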
  • Keyword Spotter Model for Crop Pest and Disease Monitoring from Community Radio Data
    (arXiv preprint, 2019) Akera, Benjamin; Nakatumba-Nabende, Joyce; Mukiibi, Jonathan; Hussein, Ali; Baleeta, Nathan; Ssendiwala, Daniel; Nalwooga, Samiiha
    In societies with well-developed internet infrastructure, social media is the leading medium of communication for various social issues, especially in breaking-news situations. In rural Uganda, however, public community radio is still a dominant means of news dissemination. Community radio gives audience to the general public, especially individuals living in rural areas, and thus plays an important role in giving a voice to those living in the broadcast area. It is an avenue for participatory communication and a tool relevant to both economic and social development. This is supported by the rise to ubiquity of mobile phones providing access to phone-in or text-in talk shows. In this paper, we describe an approach to analysing readily available community radio data with machine-learning-based speech keyword spotting techniques. We identify keywords of interest related to agriculture and build models to automatically identify these keywords in audio streams. Our contribution through these techniques is a cost-efficient and effective way to monitor food security concerns, particularly in rural areas. Through keyword spotting and radio talk-show analysis, issues such as crop diseases, pests, drought and famine can be captured and fed into an early-warning system for stakeholders and policy makers.
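The monitoring pipeline the abstract describes — spot keywords in an audio stream, then aggregate hits for an early-warning system — can be sketched as follows. The keyword list and the `classify` callable are hypothetical placeholders; the paper's actual models classify radio audio windows.

```python
# Hedged sketch of keyword spotting over a stream of audio windows.
# KEYWORDS and classify() are illustrative, not the paper's actual model.
from collections import Counter

KEYWORDS = {"drought", "famine", "pests", "crop disease"}  # assumed terms

def spot_keywords(windows, classify):
    """Run classify(window) -> keyword string or None on each audio
    window, and tally how often each keyword of interest was spotted."""
    counts = Counter()
    for window in windows:
        label = classify(window)
        if label in KEYWORDS:
            counts[label] += 1
    return counts
```

The resulting per-keyword counts, tracked over time and broadcast region, are what would feed the early-warning dashboard for stakeholders.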
  • Machine Translation for African Languages: Community Creation of Datasets and Models in Uganda
    (African Natural Language Processing, 2022) Akera, Benjamin; Mukiibi, Jonathan; Naggayi, Lydia Sanyu; Babirye, Claire; Owomugisha, Isaac; Nsumba, Solomon; Nakatumba-Nabende, Joyce; Bainomugisha, Engineer; Mwebaze, Ernest; Quinn, John
    Reliable machine translation systems are only available for a small proportion of the world’s languages, the key limitation being a shortage of training and evaluation data. We provide a case study in the creation of such resources by NLP teams who are local to the communities in which these languages are spoken. A parallel text corpus, SALT, was created for five Ugandan languages (Luganda, Runyankole, Acholi, Lugbara and Ateso), and various methods were explored to train and evaluate translation models. The resulting models were found to be effective for practical translation applications, even for languages with no previous NLP data available, achieving a mean BLEU score of 26.2 for translations into English and 19.9 from English. The SALT dataset and models described are publicly available at https://github.com/SunbirdAI/salt.
  • Machine Translation for African Languages: Community Creation of Datasets and Models in Uganda
    (ICLR, 2022-03-30) Akera, Benjamin; Mukiibi, Jonathan; Naggayi, Lydia Sanyu; Nsumba, Solomon; Mwebaze, Ernest; Quinn, John
    Reliable machine translation systems are only available for a small proportion of the world’s languages, the key limitation being a shortage of training and evaluation data. We provide a case study in the creation of such resources by NLP teams who are local to the communities in which these languages are spoken. A parallel text corpus, SALT, was created for five Ugandan languages (Luganda, Runyankole, Acholi, Lugbara and Ateso), and various methods were explored to train and evaluate translation models. The resulting models were found to be effective for practical translation applications, even for languages with no previous NLP data available, achieving a mean BLEU score of 26.2 for translations into English and 19.9 from English. The SALT dataset and models described are publicly available at https://github.com/SunbirdAI/salt.
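The mean BLEU scores quoted in the SALT abstracts (26.2 into English, 19.9 from English) are standard corpus-level BLEU. In practice one would use a library such as sacreBLEU; the simplified single-reference version below shows what the metric computes (clipped n-gram precisions combined with a brevity penalty):

```python
# Simplified corpus BLEU: modified n-gram precision x brevity penalty.
# Illustrative only -- real evaluations use sacreBLEU with its standard
# tokenization and smoothing, not whitespace splitting.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_ngrams & r_ngrams).values())  # clipping
            totals[n - 1] += sum(h_ngrams.values())
    if 0 in matches:
        return 0.0  # unsmoothed: any empty precision zeroes the score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

A perfect match scores 100; the brevity penalty keeps a system from gaming precision by emitting very short translations.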
  • The Makerere Radio Speech Corpus: A Luganda Radio Corpus for Automatic Speech Recognition
    (arXiv, 2022) Mukiibi, Jonathan; Katumba, Andrew; Nakatumba-Nabende, Joyce; Hussein, Ali; Meyer, Josh
    Building a usable radio-monitoring automatic speech recognition (ASR) system is a challenging task for under-resourced languages, and yet this is paramount in societies where radio is the main medium of public communication and discussion. Initial efforts by the United Nations in Uganda have shown how important understanding the perceptions of rural people who are excluded from social media is in national planning. However, these efforts are being challenged by the absence of transcribed speech datasets. In this paper, the Makerere Artificial Intelligence research lab releases a Luganda radio speech corpus of 155 hours. To our knowledge, this is the first publicly available radio dataset in sub-Saharan Africa. The paper describes the development of the voice corpus and presents baseline Luganda ASR performance results using Coqui STT, an open-source speech recognition toolkit.
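ASR baselines like the one described are conventionally scored with word error rate (WER): the word-level edit distance between reference and hypothesis transcripts, divided by the reference length. The abstract does not name its metric, so this is a sketch of the standard computation, not the paper's code:

```python
# Minimal word error rate (WER) via Levenshtein distance over words.

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)
```

A WER of 0.0 means a perfect transcript; values above 1.0 are possible when the hypothesis inserts many extra words.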

Research Dissemination Platform copyright © 2002-2025 NRU
