Building Text and Speech Benchmark Datasets and Models for Low-Resourced East African Languages: Experiences and Lessons

Nakatumba-Nabende, Joyce; Nabende, Peter; Mukiibi, Jonathan; Mutebi, Chodrine; Katumba, Andrew

dc.contributor.author	Nakatumba-Nabende, Joyce
dc.contributor.author	Nabende, Peter
dc.contributor.author	Mukiibi, Jonathan
dc.contributor.author	Mutebi, Chodrine
dc.contributor.author	Katumba, Andrew
dc.date.accessioned	2025-03-11T08:23:20Z
dc.date.available	2025-03-11T08:23:20Z
dc.date.issued	2025-03-26
dc.description.abstract	Africa has over 2000 languages; however, those languages are not well represented in the existing natural language processing ecosystem. African languages lack essential digital resources to effectively engage in advancing language technologies. There is a need to generate high-quality natural language processing resources for low-resourced African languages. Obtaining high-quality speech and text data is expensive and tedious because it can involve manual sourcing and verification of data sources. This paper discusses the process taken to curate and annotate text and speech datasets for five East African languages: Luganda, Runyankore-Rukiga, Acholi, Lumasaba, and Swahili. We also present results obtained from baseline models for machine translation, topic modeling and classification, sentiment classification, and automatic speech recognition tasks. Finally, we discuss the experiences, challenges, and lessons learned in creating the text and speech datasets.
dc.identifier.citation	Nakatumba‐Nabende, J., Babirye, C., Nabende, P., Tusubira, J. F., Mukiibi, J., Wairagala, E. P., ... & Katumba, A. (2024). Building Text and Speech Benchmark Datasets and Models for Low‐Resourced East African Languages: Experiences and Lessons. Applied AI Letters, 5(2), e92. https://doi.org/10.1002/ail2.92
dc.identifier.other	https://doi.org/10.1002/ail2.92
dc.identifier.uri	https://nru.uncst.go.ug/handle/123456789/10111
dc.language.iso	en
dc.publisher	Applied AI Letters
dc.title	Building Text and Speech Benchmark Datasets and Models for Low-Resourced East African Languages: Experiences and Lessons
dc.type	Article

Files

Original bundle

Name: Applied AI Letters - 2024 - Nakatumba‐Nabende - Building Text and Speech Benchmark Datasets and Models for Low‐Resourced.pdf
Size: 1.36 MB
Format: Adobe Portable Document Format

Collections

Humanities and the Arts