MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

Cheikh M. Bamba Dione¹,†,∗, David Ifeoluwa Adelani²,†,∗, Peter Nabende³,†, Jesujoba O. Alabi⁴,†, Thapelo Sindane⁵, Happy Buzaaba⁶†, Shamsuddeen Hassan Muhammad⁷,⁸†, Chris Chinenye Emezue⁹,¹⁰†, Perez Ogayo¹¹†, Anuoluwapo Aremu†, Catherine Gitau†, Derguene Mbaye¹²†, Jonathan Mukiibi³†, Blessing Sibanda†, Bonaventure F. P. Dossou¹⁰,¹³,¹⁴†, Andiswa Bukula¹⁵, Rooweither Mabuya¹⁵, Allahsera Auguste Tapo¹⁶†, Edwin Munkoh-Buabeng¹⁷†, Victoire Memdjokam Koagne†, Fatoumata Ouoba Kabore¹⁸†, Amelia Taylor¹⁹, Godson Kalipe†, Tebogo Macucwa⁵, Vukosi Marivate⁵,¹³†, Tajuddeen Gwadabe†, Elvis Tchiaze Mboning†, Ikechukwu Onyenwe²⁰, Gratien Atindogbe²¹, Tolulope Anu Adelani†, Idris Akinade²², Olanrewaju Samuel†, Marien Nahimana, Théogène Musabeyezu, Emile Niyomutabazi, Ester Chimhenga, Kudzai Gotosa, Patrick Mizha, Apelete Agbolo²³, Seydou Traore²⁴, Chinedu Uchechukwu²⁰, Aliyu Yusuf⁸, Muhammad Abdullahi⁸, Dietrich Klakow⁴

†Masakhane NLP, ¹Université Gaston Berger, Senegal, ²University College London, UK, ³Makerere University, Uganda, ⁴Saarland University, Germany, ⁵University of Pretoria, South Africa, ⁶RIKEN Center for AIP, Japan, ⁷Bayero University Kano, Nigeria, ⁸University of Porto, Portugal, ⁹Technical University of Munich, Germany, ¹⁰Lanfrica, ¹¹Carnegie Mellon University, USA, ¹²Baamtu, Senegal, ¹³Lelapa AI, ¹⁴Mila Quebec AI Institute, Canada, ¹⁵SADiLaR, South Africa, ¹⁶Rochester Institute of Technology, USA, ¹⁷TU Clausthal, Germany, ¹⁸Uppsala University, Sweden, ¹⁹Malawi University of Business and Applied Science, Malawi, ²⁰Nnamdi Azikiwe University, Nigeria, ²¹University of Buea, Cameroon, ²²University of Ibadan, Nigeria, ²³Ewegbe Akademi, Togo, ²⁴AMALAN, Mali.

Abstract

In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages.
We discuss the challenges in annotating POS for these languages using the Universal Dependencies (UD) guidelines. We conducted extensive POS baseline experiments using conditional random fields and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems more effective for POS tagging in unseen languages.

∗ Equal contribution.

1 Introduction

Part-of-Speech (POS) tagging is the process of assigning the most probable grammatical category (or tag) to each word (or token) in a given sentence of a particular natural language. POS tagging is one of the fundamental steps for many natural language processing (NLP) applications, including machine translation, parsing, text chunking, and spell and grammar checking. While great strides have been made for (major) Indo-European languages such as English, French and German, work on African languages is quite scarce. The vast majority of African languages lack annotated datasets for training and evaluating basic NLP systems.

There has been recent work on the development of benchmark datasets for training and evaluating models in African languages for various NLP tasks, including machine translation (NLLB-Team et al., 2022; Adelani et al., 2022a), text-to-speech (Ogayo et al., 2022; Meyer et al., 2022), speech recognition (Ritchie et al., 2022), sentiment analysis (Muhammad et al., 2022, 2023), news topic classification (Adelani et al., 2023), and named entity recognition (Adelani et al., 2021, 2022b).
However, there is no large-scale dataset for POS covering several African languages.

To tackle the data bottleneck issue for low-resource languages, recent work applied cross-lingual transfer (Artetxe et al., 2020; Pfeiffer et al., 2020; Ponti et al., 2020) using multilingual pre-trained language models (PLMs) (Conneau et al., 2020) to model specific phenomena in low-resource target languages. While such cross-lingual transfer is often evaluated by fine-tuning multilingual models on English data, more recent work has shown that English is often not the best transfer language (Lin et al., 2019; de Vries et al., 2022; Adelani et al., 2022b).

arXiv:2305.13989v1 [cs.CL] 23 May 2023

Contributions. In this paper, we develop MasakhaPOS, the largest POS dataset for 20 typologically diverse African languages. We highlight the challenges of annotating POS for these diverse languages using the Universal Dependencies (UD) (Nivre et al., 2016) guidelines, such as tokenization issues and POS tag ambiguities. We provide extensive POS baselines using a conditional random field (CRF) and several multilingual pre-trained language models (PLMs). Furthermore, we experimented with different parameter-efficient cross-lingual transfer methods (Pfeiffer et al., 2021; Ansell et al., 2022), and with transfer languages that have training data available in UD. Our evaluation demonstrates that choosing the best transfer language(s) in both single-source and multi-source setups leads to large improvements in POS tagging performance, especially when combined with parameter-efficient fine-tuning methods. Finally, we show that a transfer language that belongs to the same language family and shares similar morphological characteristics (e.g. Non-Bantu Niger-Congo) seems to be more effective for tagging POS in unseen languages.
For reproducibility, we release our code, data and models on GitHub.¹

¹ https://github.com/masakhane-io/masakhane-pos

2 Related Work

In the past, efforts have been made to build POS taggers for several African languages, including Hausa (Tukur et al., 2020), Igbo (Onyenwe et al., 2014), Kinyarwanda (Cardenas et al., 2019), Luo (De Pauw et al., 2010), Setswana (Malema et al., 2017, 2020), isiXhosa (Delman, 2016), Wolof (Dione et al., 2010), Yorùbá (Sèmiyou et al., 2012; Ishola and Zeman, 2020), and isiZulu (Koleva, 2013). While POS tagging has been investigated for the aforementioned languages, annotated datasets exist for only a few African languages. In the Universal Dependencies dataset (Nivre et al., 2016), nine African languages² are represented. Still, only four of the nine languages have training data, i.e. Afrikaans, Coptic, Nigerian-Pidgin, and Wolof. In this work, we create the largest POS dataset for 20 African languages following the UD annotation guidelines.

3 Languages and their characteristics

We focus on 20 Sub-Saharan African languages, spoken in circa 27 countries in the Western, Eastern, Central and Southern regions of Africa. An overview of the focus languages is provided in Table 1. The selected languages represent four language families: Niger-Congo (17), Afro-Asiatic (Hausa), Nilo-Saharan (Luo), and English Creole (Naija). Among the Niger-Congo languages, eight belong to the Bantu languages.

The writing system of our focus languages is mostly based on Latin script (sometimes with additional letters and diacritics). Besides Naija, Kiswahili, and Wolof, the remaining languages are all tonal. As far as morphosyntax is concerned, noun classification is a prominent grammatical feature for an important part of our focus languages. Twelve of the languages actively make use of between 6 and 20 noun classes.
This includes all Bantu languages, Ghomálá’, Mossi, Akan and Wolof (Nurse and Philippson, 2006; Payne et al., 2017; Bodomo and Marfo, 2002; Babou and Loporcaro, 2016). Noun classes can play a central role in POS annotation. For instance, in isiXhosa, adding the class prefix can change the grammatical category of the word (Delman, 2016). All languages use the SVO word order, while Bambara additionally uses the SOV word order. Appendix A provides the details about the language characteristics.

² Including Amharic, Bambara, Beja, Yorùbá, and Zaar, which have no training data in UD.

4 Data and Annotation for MasakhaPOS

4.1 Data collection

Table 1 lists the data sources used for POS annotation, all collected from online newspapers. We chose the news domain for three reasons. First, it is the second most available resource after the religious domain for most African languages. Second, it covers a diverse range of topics. Third, the news domain is one of the dominant domains in UD. We collected monolingual news corpora with an open license for about eight African languages, mostly from local newspapers. For the remaining 12 languages, we make use of the MAFAND-MT (Adelani et al., 2022a) translation corpus, which is based on the news domain. While translation corpora have a few issues, such as the translationese effect, we did not observe serious issues in annotation. The only issue we experienced was a few misspelled words, which led annotators to label a few words with the "X" tag. However, as a post-processing step, we corrected the misspellings and assigned the correct POS tags.

| Language | Family | African Region | No. of Speakers | Source | Train / dev / test | # Tokens | Average sentence length (# Tokens) |
|---|---|---|---|---|---|---|---|
| Bambara (bam) | NC / Mande | West | 14M | MAFAND-MT (Adelani et al., 2022a) | 793 / 158 / 634 | 40,137 | 25.9 |
| Ghomálá’ (bbj) | NC / Grassfields | Central | 1M | MAFAND-MT | 750 / 149 / 599 | 23,111 | 15.4 |
| Éwé (ewe) | NC / Kwa | West | 7M | MAFAND-MT | 728 / 145 / 582 | 28,159 | 19.4 |
| Fon (fon) | NC / Volta-Niger | West | 2M | MAFAND-MT | 798 / 159 / 637 | 49,460 | 30.6 |
| Hausa (hau) | Afro-Asiatic / Chadic | West | 63M | Kano Focus and Freedom Radio | 753 / 150 / 601 | 41,346 | 27.5 |
| Igbo (ibo) | NC / Volta-Niger | West | 27M | IgboRadio and Ka Ọdị Taa | 803 / 160 / 642 | 52,195 | 32.5 |
| Kinyarwanda (kin) | NC / Bantu | East | 10M | IGIHE, Rwanda | 757 / 151 / 604 | 40,558 | 26.8 |
| Luganda (lug) | NC / Bantu | East | 7M | MAFAND-MT | 733 / 146 / 586 | 24,658 | 16.8 |
| Luo (luo) | Nilo-Saharan | East | 4M | MAFAND-MT | 757 / 151 / 604 | 45,734 | 30.2 |
| Mossi (mos) | NC / Gur | West | 8M | MAFAND-MT | 757 / 151 / 604 | 33,791 | 22.3 |
| Chichewa (nya) | NC / Bantu | South-East | 14M | Nation Online Malawi | 728 / 145 / 582 | 24,163 | 16.6 |
| Naija (pcm) | English-Creole | West | 75M | MAFAND-MT | 752 / 150 / 600 | 38,570 | 25.7 |
| chiShona (sna) | NC / Bantu | South | 12M | VOA Shona | 747 / 149 / 596 | 39,785 | 26.7 |
| Kiswahili (swa) | NC / Bantu | East & Central | 98M | VOA Swahili | 675 / 134 / 539 | 40,789 | 29.5 |
| Setswana (tsn) | NC / Bantu | South | 14M | MAFAND-MT | 753 / 150 / 602 | 41,811 | 27.9 |
| Akan/Twi (twi) | NC / Kwa | West | 9M | MAFAND-MT | 775 / 154 / 618 | 41,203 | 26.2 |
| Wolof (wol) | NC / Senegambia | West | 5M | MAFAND-MT | 770 / 154 / 616 | 44,002 | 28.2 |
| isiXhosa (xho) | NC / Bantu | South | 9M | Isolezwe Newspaper | 752 / 150 / 601 | 25,313 | 16.8 |
| Yorùbá (yor) | NC / Volta-Niger | West | 42M | Voice of Nigeria and Asejere | 875 / 174 / 698 | 43,601 | 24.4 |
| isiZulu (zul) | NC / Bantu | South | 27M | Isolezwe Newspaper | 753 / 150 / 601 | 24,028 | 16.0 |

Table 1: Languages and Data Splits for MasakhaPOS Corpus. Language, family (NC: Niger-Congo), number of speakers, news source, and data split in number of sentences.

4.2 POS Annotation Methodology

For the POS annotation task, we collected 1,500 sentences per language. As manual POS annotation is very tedious, we agreed to manually annotate 100 sentences per language in the first instance.
These sentences were then used as training data for automatic POS tagging (i.e., fine-tuning the RemBERT PLM (Chung et al., 2021)) of the remaining unannotated sentences. Annotators then fixed the mistakes in the predictions (i.e. for the remaining 1,400 sentences). This drastically reduced the manual annotation effort, since a few tags, such as punctuation marks, numbers and symbols, are predicted with almost 100% accuracy. Proper nouns were also predicted with high accuracy due to the casing feature.

To support manual correction of annotations, most of the language teams used the IO Annotator³ tool, a collaborative annotation platform for text and images. The tool supports simultaneous multi-user annotation of datasets. For each language, we hired three native speakers with linguistics backgrounds to perform POS annotation.⁴ To ensure high-quality annotation, we recruited a language coordinator to supervise annotation in each language. In addition, we provided online support (documentation and video tutorials) to train annotators on POS annotation. We made use of the Universal POS tagset (Petrov et al., 2012), which contains 17 tags.⁵ To avoid the use of spurious tags, annotators had to choose, for each word, one of the possible tags made available in the IO Annotator tool through a drop-down menu. For each language, annotation was done independently by each annotator. At the end of annotation, language coordinators worked with their team to resolve disagreements using IO Annotator or Google Spreadsheet. We refer to our newly annotated POS dataset as MasakhaPOS.

³ https://ioannotator.com/

4.3 Quality Control

Computing automatic inter-annotator agreement metrics such as Fleiss' kappa was challenging due to tokenization issues, e.g. many compound family names are split. Instead, we adopted the tokenization defined by annotators, since they annotate all words in the sentence.
Due to several annotation challenges, as described in Section 5, seven language teams (Ghomálá’, Fon, Igbo, Chichewa, chiShona, Kiswahili, and Wolof) decided to engage annotators in online calls (or in-person discussions) to agree on the correct annotation for each word in the sentence. The other language teams allowed their annotators to work individually, and only discuss sentences on which they did not agree. Seven of the 13 languages achieved a sentence-level annotation agreement of over 75%. Two more languages (Luganda and isiZulu) have sentence-level agreement scores between 64.0% and 67.0%. The remaining four languages (Ewe, Luo, Mossi, and Setswana) agreed on less than 50% of the annotated sentences. This confirms the difficulty of the annotation task for many language teams. Despite this challenge, we ensured that all teams resolved all disagreements to produce a high-quality POS corpus. Appendix B provides details of the number of agreed annotations for each language team.

⁴ Each annotator was paid $750 for 1,500 sentences.
⁵ https://universaldependencies.org/u/pos/

After quality control, we divided the annotated sentences into training, development and test splits consisting of 50%, 10%, and 40% of the data respectively. We chose a larger test set proportion so that the test sets are similar in size to those in UD, which are usually larger than 500 sentences. Table 1 provides the details of the data split. We split very long sentences into two to fit the maximum sequence length of 200 for PLM fine-tuning. We further performed manual checks to correct sentences split at arbitrary points.

5 Annotation challenges

When annotating our focus languages, we faced two main challenges: tokenization and POS ambiguities.

5.1 Tokenization and word segmentation

In UD, the basic annotation units are syntactic words (rather than phonological or orthographical words) (Nivre et al., 2016).
Accordingly, clitics need to be split off and contractions must be undone where necessary. Applying the UD annotation scheme to our focus languages was not straightforward due to the nature of those languages, especially with respect to the notion of word, the use of clitics and multiword units.

5.1.1 Definition of word

For many of our focus languages (e.g. Chichewa, Luo, chiShona, Wolof and isiXhosa), it was difficult to establish a dividing line between a word and a phrase. For instance, the chiShona word ndakazomuona translates into English as a whole sentence (‘I eventually saw him’). This word consists of several morphemes that convey distinct morphosyntactic information (Chabata, 2000): Nda- (subject concord), -ka- (aspect), -zo- (auxiliary), -mu- (object concord), -ona- (verb stem). This illustrates pronoun incorporation (Bresnan and Mchombo, 1987), i.e. subject and/or object pronouns appear as bits of morphology on a verb or other head, functioning as agreement markers. Naturally, one may want to split this word into several tokens reflecting the different grammatical functions. For UD, however, morphological features such as agreement are encoded as properties of words and there is no attempt at segmenting words into morphemes, implying that items like ndakazomuona should be treated as a single unit.

5.1.2 Clitics

In languages like Hausa, Igbo, isiZulu, Kinyarwanda, Wolof and Yorùbá, we observed an extensive use of cliticization. Function words such as prepositions, conjunctions, auxiliaries and determiners can attach to other function or content words. For example, the Igbo contracted form yana consists of a pronoun (PRON) ya and a coordinating conjunction (CCONJ) na. Following UD, we segmented such contracted forms, as they correspond to multiple (syntactic) words. However, there were many cases of fusion where a word has morphemes that are not necessarily easily segmentable.
For instance, the chiShona word vave translates into English as ‘who (PRON) are (AUX) now (ADV)’. Here, the morpheme -ve, which functions both as auxiliary and adverb, cannot be further segmented, even though it corresponds to multiple syntactic words. Ultimately, we treated the word vave as a unit, which received the AUX POS tag.

In addition, there were word contractions with phonological changes, which pose serious challenges, as proper segmentation may require recovering the underlying form first. For instance, the Wolof contracted form “cib” (Dione, 2019) consists of the preposition ci ‘in’ and the indefinite article ab ‘a’. However, as a result of phonological change, the initial vowel of the article is deleted. Accordingly, to properly segment the contracted form, it is not sufficient to just extract the preposition ci, because the remaining form b has no meaning on its own. Also, some word contractions are ambiguous. For instance, in Wolof, a form like geek can be split into gi ‘the’ and ak, where ak can function as a conjunction ‘and’ or as a preposition ‘with’.

5.1.3 One unit or multitoken words?

Unlike the issue just described in 5.1.2, it was sometimes necessary to go in the other direction, and combine several orthographic tokens into a single syntactic word. Examples of such multitoken words are found e.g. in Setswana (Malema et al., 2017). For instance, in the relative structure ngwana yo o ratang (the child who likes ...), the relative marker yo o is a multitoken word that matches the noun class (class 1) of the relativized noun ngwana (‘child’), which is the subject of the verb ratang (‘to like’). In UD, multitoken words are allowed for a restricted class of phenomena, such as numerical expressions like 20 000 and abbreviations (e.g.). We advocate that this restricted class be expanded to phenomena like Setswana relative markers.

5.2 POS ambiguities

There were cases where a word form lies on the boundary between two (or more) POS categories.
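As an illustration of the segmentation decisions discussed in Section 5.1.2, the following minimal Python sketch expands contracted forms into syntactic words through a lookup table. The table, the function name `segment`, and the mapping of "preposition" and "indefinite article" to the UPOS tags ADP and DET are assumptions made for illustration; only the Igbo form yana and the Wolof form cib come from the text, and this is not the procedure used to build MasakhaPOS. A lookup is used rather than string splitting because, as with cib, the underlying form ab cannot be recovered by cutting the surface string.

```python
# Hypothetical lookup table: it lists only the two contracted forms discussed
# in Section 5.1.2. A real resource would have to be built per language.
CONTRACTIONS = {
    ("ibo", "yana"): [("ya", "PRON"), ("na", "CCONJ")],
    # The article is restored to its underlying form "ab", not the surface "b".
    ("wol", "cib"): [("ci", "ADP"), ("ab", "DET")],
}


def segment(tokens, lang):
    """Expand known contracted forms into syntactic words.

    Returns a list of (surface_token, [(word, upos_or_None), ...]) pairs, so
    the original orthographic token stays recoverable. Tokens that are not in
    the table are passed through unchanged with no tag.
    """
    out = []
    for tok in tokens:
        parts = CONTRACTIONS.get((lang, tok.lower()))
        if parts is None:
            parts = [(tok, None)]
        out.append((tok, parts))
    return out


if __name__ == "__main__":
    for surface, words in segment(["cib"], "wol"):
        print(surface, "->", words)
```

Ambiguous contractions such as Wolof geek are deliberately left out of the table, since choosing between the conjunction and preposition readings of ak needs sentence context that a lookup cannot provide.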
5.2.1 Verb or conjunction?

In quite a few of our focus languages (e.g. Yorùbá, Wolof), a form of the verb ‘say’ is also used as a subordinate conjunction (to mark out clause boundaries) with verbs of speaking. For example, in the Yorùbá sentence Olú gbàgbé pé Bolá tí jàde (lit. ‘Olu forgot that Bola has gone’) (Lawal, 1991), the item pé seems to behave both like a verb and a subordinate conjunction. On the one hand, because of the presence of another verb gbàgbé ‘to forget’, the pattern may be analyzed as a serial verb construction (SVC) (Oyelaran, 1982; Güldemann, 2008), i.e. a construction that contains sequences of two or more verbs without any syntactic marker of subordination. This would mean that pé is a verb. On the other hand, however, this item shows properties of a complementizer (Lawal, 1991). For instance, pé can occur in sentence-initial position, which in Yorùbá is typically occupied by subordinating conjunctions. Also, unlike verbs, pé cannot undergo reduplication for nominalization (an ability that all Yorùbá verbs have). This seems to provide evidence for treating this item as a subordinate conjunction rather than a verb.

5.2.2 Adjective or Verb?

In some of our focus languages, the category of adjectives is not entirely distinct morpho-syntactically from verbs. In Wolof and Yorùbá, the notions that would be expressed by adjectives in English are encoded through verbs (McLaughlin, 2004). Igbo (Welmers, 2018) and Éwé (McLaughlin, 2004) have a very limited set of underived adjectives (8 and 5, respectively). For instance, in Wolof, unlike in English, an ‘adjective’ like gaaw ‘be quick’ does not need a copula (e.g. ‘be’ in English) to function as a predicate. Likewise, the Bambara item téli ‘quick’ as in the sentence Sò ka téli ‘The horse is quick’ (Aplonova and Tyers, 2017) has adjectival properties, as it is typically used to modify nouns and specify their properties or attributes.
It also has verbal properties, as it can be used in the main predicative position functioning as a verb. This is signaled by the presence of the auxiliary ka, a special predicative marker that typically accompanies qualitative verbs (Vydrin, 2018).

5.2.3 Adverbs or particles?

The distinction between adverbs and particles was not always straightforward. For instance, many of our focus languages have ideophones, i.e. words that convey an idea by means of a sound (often reduplicated) that expresses an action, quality, manner, etc. Ideophones may behave like adverbs by modifying verbs for such categories as time, place, direction or manner. However, they can also function as verbal particles. For instance, in Wolof, an ideophone like jërr as in tàng jërr “very hot” (tàng means “to be hot”) is an intensifier that only co-occurs as a particle of that verb. Thus, there is no motivation for assigning it a POS other than PART. Whether such ideophones are PART or ADV or the like varies depending on the language.

6 Baseline Experiments

6.1 Baseline models

We provide POS tagging baselines using both CRF and multilingual PLMs. For the PLMs, we fine-tune three massively multilingual PLMs pre-trained on at least 100 languages (mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), and RemBERT (Chung et al., 2021)), and three Africa-centric PLMs pre-trained on several African languages (AfriBERTa (Ogueji et al., 2021), AfroXLMR (Alabi et al., 2022), and AfroLM (Dossou et al., 2022)). The baseline models are:

CRF is one of the most successful sequence labeling approaches prior to PLMs. CRF models the sequence labeling task as an undirected graphical model, using both labelled observations and contextual information as features.
We implemented the CRF model using sklearn-crfsuite,⁶ with the following features: the word to be tagged, the two preceding and two following words, the word in lowercase, prefixes and suffixes of the word, the length of the word, and boolean features such as whether the word is a digit, a punctuation mark, or at the beginning or end of a sentence.

⁶ https://sklearn-crfsuite.readthedocs.io/

| Model | bam | bbj | ewe | fon | hau | ibo | kin | lug | luo | mos | nya | pcm | sna | swa | tsn | twi | wol | xho | yor | zul | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRF | 89.1 | 78.9 | 88.0 | 88.1 | 89.8 | 75.2 | 95.3 | 88.3 | 84.6 | 86.0 | 77.7 | 85.6 | 85.9 | 89.3 | 81.4 | 81.5 | 91.0 | 81.8 | 92.0 | 84.2 | 85.7 |
| Massively-multilingual PLMs | | | | | | | | | | | | | | | | | | | | | |
| mBERT (172M) | 89.9 | 75.2 | 86.0 | 87.6 | 90.7 | 76.5 | 96.9 | 89.6 | 87.0 | 86.5 | 79.9 | 90.4 | 87.5 | 92.0 | 81.9 | 83.9 | 92.5 | 85.9 | 93.4 | 86.8 | 87.0 |
| XLM-R-base (270M) | 90.1 | 83.6 | 88.5 | 90.1 | 92.5 | 77.2 | 96.7 | 89.1 | 87.2 | 90.7 | 79.9 | 90.5 | 87.9 | 92.9 | 81.3 | 84.1 | 92.4 | 87.4 | 93.7 | 88.0 | 88.2 |
| XLM-R-large (550M) | 90.2 | 85.4 | 88.8 | 90.2 | 92.8 | 78.1 | 97.3 | 90.0 | 88.0 | 91.1 | 80.5 | 90.8 | 88.1 | 93.2 | 82.2 | 84.9 | 92.9 | 88.1 | 94.2 | 89.4 | 88.8 |
| RemBERT (575M) | 90.6 | 82.6 | 88.9 | 90.8 | 93.0 | 79.3 | 98.0 | 90.3 | 87.5 | 90.4 | 82.4 | 90.9 | 89.1 | 93.1 | 83.6 | 86.0 | 92.1 | 89.3 | 94.7 | 90.2 | 89.1 |
| Africa-centric PLMs | | | | | | | | | | | | | | | | | | | | | |
| AfroLM (270M) | 89.2 | 77.8 | 87.5 | 82.4 | 92.7 | 77.8 | 97.4 | 90.8 | 86.8 | 89.6 | 81.1 | 89.5 | 88.7 | 92.8 | 83.8 | 83.9 | 92.1 | 87.5 | 91.1 | 88.8 | 87.6 |
| AfriBERTa-large (126M) | 89.4 | 79.6 | 87.4 | 88.4 | 93.0 | 79.3 | 97.8 | 89.8 | 86.5 | 89.9 | 79.7 | 89.8 | 87.8 | 93.0 | 82.5 | 83.7 | 91.7 | 86.1 | 94.5 | 86.9 | 87.8 |
| AfroXLMR-base (270M) | 90.2 | 83.5 | 88.5 | 90.1 | 93.0 | 79.1 | 98.2 | 90.9 | 86.9 | 90.9 | 82.7 | 90.8 | 89.2 | 92.9 | 82.7 | 84.3 | 92.4 | 88.5 | 94.5 | 89.4 | 88.9 |
| AfroXLMR-large (550M) | 90.5 | 85.3 | 88.7 | 90.4 | 93.0 | 78.9 | 98.4 | 91.6 | 88.1 | 91.2 | 83.2 | 91.2 | 89.5 | 93.2 | 83.0 | 84.9 | 92.9 | 88.7 | 95.0 | 90.1 | 89.4 |

Table 2: Accuracy of baseline models on the MasakhaPOS dataset. We compare several multilingual PLMs, including the ones trained on African languages. Average is over 5 runs.

Columns: ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X ACC
bam 41.0 77.0 72.0 82.0 91.0 0.0 91.0 90.0 95.0 97.0 82.0 100.0 71.0 25.0 83.0 0.0 90.7
bbj 71.0 80.0 67.0 89.0 84.0 85.0 0.0 82.0 86.0 78.0 91.0 92.0 100.0 88.0 86.0 85.6
ewe 72.0 83.0 57.0 94.0 89.0 100.0 91.0 91.0 87.0 90.0 93.0 100.0 84.0 13.0 82.0 88.7
fon 91.0 88.0 69.0 75.0 94.0 96.0 91.0 90.0 89.0 95.0 91.0 100.0 51.0 89.0 90.4
hau 86.0 80.0 71.0 96.0 89.0 84.0 0.0 94.0 98.0 95.0 76.0 98.0 99.0 86.0 96.0 62.0 92.9
ibo 95.0 89.0 56.0 98.0 76.0 79.0 0.0 70.0 95.0 0.0 98.0 95.0 100.0 6.0 0.0 81.0 79.2
kin 86.0 99.0 91.0 0.0 100.0 99.0 99.0 100.0 84.0 98.0 97.0 100.0 97.0 0.0 99.0 0.0 98.4
lug 71.0 96.0 72.0 90.0 90.0 76.0 94.0 93.0 94.0 15.0 94.0 100.0 89.0 92.0 91.6
luo 73.0 88.0 69.0 87.0 69.0 82.0 89.0 96.0 86.0 42.0 89.0 100.0 94.0 100.0 86.0 0.0 88.2
mos 64.0 83.0 72.0 91.0 93.0 84.0 91.0 93.0 94.0 83.0 90.0 100.0 95.0 92.0 91.2
nya 74.0 79.0 56.0 25.0 77.0 81.0 20.0 92.0 86.0 12.0 73.0 86.0 99.0 6.0 89.0 83.1
pcm 78.0 97.0 74.0 86.0 98.0 92.0 95.0 98.0 90.0 86.0 91.0 98.0 86.0 45.0 91.0 91.1
sna 51.0 94.0 44.0 87.0 89.0 83.0 95.0 96.0 0.0 78.0 92.0 99.0 58.0 60.0 94.0 89.4
swa 95.0 86.0 65.0 82.0 95.0 56.0 97.0 98.0 86.0 51.0 97.0 100.0 91.0 95.0 0.0 93.1
tsn 57.0 80.0 82.0 42.0 53.0 78.0 17.0 94.0 97.0 62.0 76.0 91.0 99.0 18.0 0.0 95.0 0.0 82.4
twi 55.0 82.0 68.0 52.0 87.0 93.0 0.0 86.0 77.0 21.0 82.0 92.0 100.0 9.0 0.0 87.0 84.8
wol 0.0 94.0 81.0 94.0 96.0 90.0 22.0 91.0 90.0 98.0 92.0 96.0 100.0 85.0 62.0 94.0 92.9
xho 73.0 69.0 47.0 17.0 88.0 54.0 0.0 87.0 100.0 80.0 95.0 100.0 57.0 0.0 90.0 88.3
yor 84.0 92.0 82.0 99.0 97.0 97.0 95.0 94.0 83.0 95.0 96.0 100.0 98.0 95.0 0.0 95.1
zul 68.0 26.0 72.0 21.0 67.0 82.0 0.0 91.0 99.0 81.0 99.0 100.0 91.0 100.0 91.0 96.0 90.0
AVE 69.2 83.1 68.4 69.1 86.4 79.0 15.9 90.8 93.4 69.7 79.0 92.8 99.7 68.0 33.8 90.4 19.8 89.4

Table 3: Tag distribution of the “AfroXLMR-large”-based POS tagger (reporting results from the first run). The tags with high average accuracy (> 90.0%) across all languages are highlighted in gray.

Massively multilingual PLMs. We fine-tune mBERT, XLM-R (base & large), and RemBERT, which were pre-trained on 100-110 languages but only a few African languages. mBERT, XLM-R, and RemBERT were pre-trained on two (swa & yor), three (hau, swa, & xho), and eight (hau, ibo, nya, sna, swa, xho, yor, & zul) of our focus languages respectively. The three models were all pre-trained using masked language modeling (MLM); mBERT and RemBERT additionally use the next-sentence prediction objective.

Africa-centric PLMs. We fine-tune AfriBERTa, AfroLM and AfroXLMR (base & large). The first two PLMs were pre-trained using XLM-R style pre-training; AfroLM additionally makes use of active learning during pre-training to address the data scarcity of many African languages. On the other hand, AfroXLMR was created through language adaptation (Pfeiffer et al., 2020) of XLM-R on 17 African languages, “eng”, “fra”, and “ara”. AfroLM was pre-trained on all our focus languages, while AfriBERTa and AfroXLMR were pre-trained on 6 (hau, ibo, kin, pcm, swa, & yor) and 10 (hau, ibo, kin, nya, pcm, sna, swa, xho, yor, & zul) respectively. We fine-tune all PLMs using the HuggingFace Transformers library (Wolf et al., 2020).

For PLM fine-tuning, we use a maximum sequence length of 200, a batch size of 16, gradient accumulation of 2, a learning rate of 5e-5, and 50 epochs. The experiments were performed on an Nvidia V100 GPU.

6.2 Baseline results

Table 2 shows the results of training POS taggers for each focus language using the CRF and PLMs. Surprisingly, the CRF model gave a very impressive result for all languages, only a few points below the best PLM on average (-3.7). In general, fine-tuning PLMs gave a better result for all languages.
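The CRF feature set described in Section 6.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names, the feature keys, the affix lengths (1 to 3 characters) and the ASCII-only punctuation test are all choices made here, and the context window of two words on each side is one reading of the description. The per-token dictionaries are in the shape that sklearn-crfsuite accepts as input (one list of feature dicts per sentence).

```python
import string


def token_features(sentence, i, affix_len=3):
    """Build a feature dict for the i-th token of a tokenized sentence.

    affix_len is an assumed setting: prefixes and suffixes of 1..affix_len
    characters are emitted when the word is long enough.
    """
    word = sentence[i]
    feats = {
        "word": word,
        "word.lower": word.lower(),
        "length": len(word),
        "is_digit": word.isdigit(),
        # ASCII punctuation only; an assumption made to stay in the stdlib.
        "is_punct": bool(word) and all(ch in string.punctuation for ch in word),
        "BOS": i == 0,                   # beginning of sentence
        "EOS": i == len(sentence) - 1,   # end of sentence
    }
    for k in range(1, affix_len + 1):
        if len(word) >= k:
            feats[f"prefix{k}"] = word[:k]
            feats[f"suffix{k}"] = word[-k:]
    # Context window: up to two words to the left and to the right.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(sentence):
            feats[f"word[{offset:+d}]"] = sentence[j]
    return feats


def sentence_features(sentence):
    """Feature dicts for every token in one sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


if __name__ == "__main__":
    # Bambara example sentence from Section 5.2.2, with a final period added.
    for feats in sentence_features(["Sò", "ka", "téli", "."]):
        print(feats)
```

Because the features are purely surface-based, a tagger built on them has no access to information beyond the two-word window, which is one plausible reason the PLMs in Table 2 still outperform the CRF on every language.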
mBERT is 1.3 points more accurate than the CRF. AfroLM and AfriBERTa are only slightly better than mBERT (by less than 1 point). One of the reasons for AfriBERTa’s poor performance is that most of the languages are unseen during pre-training.⁷ On the other hand, AfroLM was pre-trained on all our focus languages, but on a small dataset (0.73GB), which makes it difficult to learn a good representation for each of the languages covered during pre-training. Furthermore, XLM-R-base gave slightly better accuracy on average than both AfroLM (+0.6) and AfriBERTa (+0.4) despite seeing fewer African languages. However, the performance of AfroXLMR-base exceeds that of XLM-R-base because it has been further adapted to 17 typologically diverse African languages, and its performance is similar (±0.1) to that of the larger PLMs, i.e. RemBERT and XLM-R-large.

Impressive performance was achieved by the large versions of massively multilingual PLMs, like XLM-R-large and RemBERT, and by AfroXLMR (base & large), i.e. better than mBERT (+1.8 to +2.4) and better than CRF (+3.1 to +3.7). The gain of the large PLMs (e.g. AfroXLMR-large) over mBERT is especially large for some languages, like bbj (+10.1), mos (+4.7), nya (+3.3), and zul (+3.3). Overall, AfroXLMR-large achieves the best accuracy on average over all languages (89.4) because it has been pre-trained on more African languages with larger monolingual data and because of its large size. Interestingly, 11 out of 20 languages reach an impressive accuracy of over 90% with the best PLM, which is an indication of consistent and high-quality POS annotation.

Accuracy by tag distribution. Table 3 shows the POS tagging results by tag distribution using our best model, “AfroXLMR-large”. The tags that are easiest to detect across all languages (with accuracy over 90%) are PUNCT, NUM, PROPN, NOUN, and VERB, while the most difficult are the SYM, INTJ, and X tags. The difficult tags are often infrequent, which does not affect the overall accuracy.
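To make the last point concrete, the sketch below computes per-tag and overall accuracy from aligned gold and predicted tag sequences. The function name `per_tag_accuracy` and the toy data are invented for illustration and are not taken from MasakhaPOS; the example only shows the arithmetic by which a rare tag can have 0% accuracy while the overall accuracy stays high.

```python
from collections import Counter


def per_tag_accuracy(gold, pred):
    """Return ({tag: accuracy in %}, overall accuracy in %).

    gold and pred are flat, aligned lists of tags (one entry per token).
    Per-tag accuracy is the share of tokens with that gold tag that were
    predicted correctly.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    by_tag = {tag: 100.0 * correct[tag] / n for tag, n in total.items()}
    overall = 100.0 * sum(correct.values()) / len(gold) if gold else 0.0
    return by_tag, overall


if __name__ == "__main__":
    # Toy data (invented): 100 tokens, of which only 2 are interjections,
    # and both interjections are mistagged as nouns.
    gold = ["NOUN"] * 50 + ["VERB"] * 48 + ["INTJ"] * 2
    pred = ["NOUN"] * 50 + ["VERB"] * 48 + ["NOUN"] * 2
    by_tag, overall = per_tag_accuracy(gold, pred)
    print(by_tag)   # INTJ is 0.0, NOUN and VERB are 100.0
    print(overall)  # 98.0
```

In this toy case the two mistagged interjections cost only two points of overall accuracy, even though the INTJ row would show 0.0.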
Surprisingly, a few languages, like Yorùbá and Kinyarwanda, have very good accuracy on almost all tags except for the tags that are infrequent in the language.

⁷ 14 out of 20 languages are unseen.

7 Cross-lingual Transfer

7.1 Experimental setup for effective transfer

The effectiveness of zero-shot cross-lingual transfer depends on several factors, including the choice of the best-performing PLM, the choice of an effective cross-lingual transfer method, and the choice of the best source language for transfer. Oftentimes, the source language chosen for cross-lingual transfer is English, due to the availability of training data, which may not be ideal for distant languages, especially for POS tagging (de Vries et al., 2022). To further improve performance, parameter-efficient fine-tuning approaches (Pfeiffer et al., 2020; Ansell et al., 2022) can be leveraged with additional monolingual data for both source and target languages. We highlight how we combine these different factors for effective transfer below:

Choice of source languages. Prior work on the choice of source language for POS tagging shows that the most important features are geographical similarity, genetic similarity (or closeness in the language family tree) and word overlap between source and target language (Lin et al., 2019). We choose seven source languages for zero-shot transfer based on the following criteria: (1) availability of POS training data in UD⁸ (only three African languages satisfy this criterion: Wolof, Nigerian-Pidgin, and Afrikaans); (2) geographical proximity to African languages, which includes non-indigenous languages that have official status in Africa, like English, French, Afrikaans, and Arabic; (3) language family similarity to target languages. The languages chosen are: Afrikaans (afr), Arabic (ara), English (eng), French (fra), Nigerian-Pidgin (pcm), Wolof (wol), and Romanian (ron).
While Romanian does not satisfy the last two criteria, it was selected based on the findings of de Vries et al. (2022): Romanian achieves the best transfer performance to the largest number of languages in UD. Appendix C shows the data split for the source languages.

Parameter-efficient cross-lingual transfer The standard way of performing zero-shot cross-lingual transfer is to fine-tune a multilingual PLM on labelled data in the source language (e.g. on a POS task) and evaluate it on a target language. We refer to this as FT-Eval (fine-tune and evaluate). However, the performance is often poor for languages unseen by the PLM and for distant languages. One way to address this is to perform language adaptation using a monolingual corpus in the target language before fine-tuning on the downstream task (Pfeiffer et al., 2020), but this setup does not scale to many languages, since it requires modifying all the parameters of the PLM and requires large disk space (Alabi et al., 2022). Several parameter-efficient approaches have been proposed, such as Adapters (Houlsby et al., 2019) and Lottery Ticket Sparse Fine-Tuning (LT-SFT) (Ansell et al., 2022); they are also modular and composable, making them ideal for cross-lingual transfer.

⁸ https://universaldependencies.org/

Figure 1: Zero-shot cross-lingual transfer results using FT-Eval, LT-SFT and MAD-X. Average over 20 languages. Experiments performed using AfroXLMR-base. Evaluation metric is Accuracy. Bar labels by source language:
FT-Eval: afr 48.9, ara 36.9, eng 52.6, fra 45.6, pcm 32.8, ron 53.1, wol 48.1, eng-ron-wol 56.7
LT-SFT: afr 60.5, ara 48.4, eng 63.4, fra 61.7, pcm 49.6, ron 63.5, wol 64.8, eng-ron-wol 66.8
MAD-X: afr 61.4, ara 50, eng 66, fra 64.9, pcm 52.5, ron 67, wol 65.7, eng-ron-wol 68.8

Here, we make use of the MAD-X 2.0⁹ adapter-based approach (Pfeiffer et al., 2020, 2021) and the LT-SFT approach. The setup is as follows: (1) We train language adapters/SFTs using monolingual news corpora of our focus languages.
We perform language adaptation on the news corpus to match the domain of the POS task, similar to Alabi et al. (2022). We provide details of the monolingual corpus in Appendix E. (2) We train a task adapter/SFT on the source language labelled data using the source language adapter/SFT. (3) We substitute the source language adapter/SFT with the target language adapter/SFT to run prediction on the target language test set, while retaining the task adapter/SFT.

Choice of PLM We make use of AfroXLMR-base as the backbone PLM for all experiments because it gave an impressive performance in Table 2 and because language adapters/SFTs for some of the languages are available from prior work (Pfeiffer et al., 2021; Ansell et al., 2022; Alabi et al., 2022). When a target language adapter/SFT for AfroXLMR-base is absent, an XLM-R-base language adapter/SFT can be used instead, since the two models share the same architecture and number of parameters, as demonstrated in Alabi et al. (2022). We did not find XLM-R-large based adapters and SFTs online,¹⁰ and they are time-consuming to train, especially for high-resource languages like English.

⁹ An extension of MAD-X where the last adapter layers are dropped, which has been shown to improve performance.
¹⁰ https://adapterhub.ml/

7.2 Experimental Results

Parameter-efficient fine-tuning is more effective Figure 1 shows the results of cross-lingual transfer from seven source languages with POS training data in UD, as average accuracy on 20 African languages. We report the performance of standard zero-shot cross-lingual transfer with AfroXLMR-base (i.e. FT-Eval) and of the parameter-efficient fine-tuning approaches, i.e. MAD-X and LT-SFT. Our results show that MAD-X and LT-SFT give significantly better results than FT-Eval: the difference in average accuracy is over 10 points for every source language.
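The size of this gap can be checked directly from the averages reported in Figure 1. The short sketch below does so; the variable and function names are ours, and the numbers are the bar labels of Figure 1 (averages over the 20 target languages).

```python
# Average accuracy over the 20 target languages, copied from the bar labels of Figure 1.
SOURCES = ["afr", "ara", "eng", "fra", "pcm", "ron", "wol", "eng-ron-wol"]
FIGURE_1 = {
    "FT-Eval": [48.9, 36.9, 52.6, 45.6, 32.8, 53.1, 48.1, 56.7],
    "LT-SFT": [60.5, 48.4, 63.4, 61.7, 49.6, 63.5, 64.8, 66.8],
    "MAD-X": [61.4, 50.0, 66.0, 64.9, 52.5, 67.0, 65.7, 68.8],
}


def gains(method, baseline="FT-Eval"):
    """Accuracy-point difference between two methods, per source language."""
    return {
        source: round(m - b, 1)
        for source, m, b in zip(SOURCES, FIGURE_1[method], FIGURE_1[baseline])
    }


for method in ("LT-SFT", "MAD-X"):
    print(method, "vs FT-Eval:", gains(method))

# The same helper gives the MAD-X advantage over LT-SFT.
print("MAD-X vs LT-SFT:", gains("MAD-X", baseline="LT-SFT"))
```

Every LT-SFT and MAD-X entry in the first two printed dictionaries is above 10 points, and the last dictionary gives the per-source differences between MAD-X and LT-SFT.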
These gains show the effectiveness of parameter-efficient fine-tuning approaches for cross-lingual transfer to low-resource languages, even though only small monolingual data (433KB - 50.2MB, as shown in Appendix E) is used for training target language adapters and SFTs. Furthermore, we find MAD-X to be slightly better than LT-SFT, especially when ron (+3.5), fra (+3.2), pcm (+2.9), and eng (+2.6) are used as source languages.

The best source language In general, we find eng, ron, and wol to be better source languages for the 20 African languages. For FT-Eval, eng and ron have similar performance. For LT-SFT, however, wol was slightly better than the other two, probably because we are transferring from an African language that shares the same family or geographical location with the target languages. For MAD-X, ron was the best single source (67.0 on average), slightly ahead of eng (66.0) and wol (65.7).

Multi-source fine-tuning leads to further gains Table 4 shows that co-training on the best three source languages (eng, ron, and wol) leads to improved performance, reaching an accuracy of 68.8% with MAD-X. For FT-Eval, we performed multi-task training on the combined training set of the three languages. LT-SFT supports multi-source fine-tuning, where a task SFT can be trained on data from several languages jointly. However, the MAD-X implementation does not support multi-source fine-tuning. We created our version of multi-source fine-tuning following these steps: (1) We combine all the training data of the three languages. (2) We train a task adapter using the combined data and the language adapter of one of the best source languages. We experiment with eng, ron, and wol as the source language adapter for the combined data. Our experiments show that eng and wol achieve similar performance when used as the language adapter for multi-source fine-tuning. We only report the result using wol as the source adapter in Table 4. Appendix F provides more details on MAD-X multi-source fine-tuning.

| Source | Method | bam | bbj | ewe | fon | hau | ibo | kin | lug | luo | mos | nya | pcm | sna | swa | tsn | twi | wol | xho | yor | zul | AVG | AVG* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| eng | FT-Eval | 52.1 | 31.9 | 47.8 | 32.5 | 67.1 | 74.5 | 63.9 | 57.8 | 38.4 | 45.3 | 59.0 | 82.1 | 63.7 | 56.9 | 49.4 | 35.9 | 35.9 | 45.9 | 63.3 | 48.8 | 52.6 | 51.9 |
| eng | LT-SFT | 67.9 | 57.6 | 67.9 | 55.5 | 69.0 | 76.3 | 64.2 | 61.0 | 74.5 | 70.3 | 59.4 | 82.4 | 64.6 | 56.9 | 49.5 | 52.1 | 78.2 | 45.9 | 65.3 | 49.8 | 63.4 | 61.5 |
| eng | MAD-X | 62.9 | 58.5 | 68.7 | 55.8 | 67.0 | 77.8 | 70.9 | 65.7 | 73.0 | 71.8 | 70.1 | 83.2 | 69.8 | 61.2 | 49.8 | 53.0 | 75.2 | 57.1 | 66.9 | 60.9 | 66.0 | 64.5 |
| ron | FT-Eval | 46.5 | 30.5 | 37.6 | 30.9 | 67.3 | 77.7 | 73.3 | 56.9 | 36.7 | 40.6 | 62.2 | 78.9 | 66.3 | 61.0 | 55.8 | 35.7 | 33.8 | 49.6 | 63.5 | 56.3 | 53.1 | 52.7 |
| ron | LT-SFT | 60.6 | 57.0 | 64.9 | 60.4 | 67.5 | 77.4 | 68.2 | 58.5 | 70.2 | 67.9 | 58.2 | 78.1 | 64.6 | 59.7 | 57.4 | 55.7 | 81.9 | 46.3 | 64.8 | 51.2 | 63.5 | 61.7 |
| ron | MAD-X | 63.5 | 62.2 | 66.6 | 61.8 | 66.5 | 80.0 | 73.5 | 62.7 | 76.5 | 71.8 | 66.0 | 83.7 | 71.1 | 64.5 | 61.2 | 53.5 | 79.5 | 48.6 | 69.5 | 57.8 | 67.0 | 65.4 |
| wol | FT-Eval | 40.8 | 36.5 | 39.8 | 37.4 | 55.1 | 58.6 | 49.2 | 51.8 | 35.1 | 44.9 | 49.0 | 51.6 | 53.8 | 42.9 | 45.0 | 38.4 | 88.6 | 46.0 | 52.5 | 45.5 | 48.1 | 45.7 |
| wol | LT-SFT (N) | 64.4 | 64.3 | 69.8 | 63.0 | 67.0 | 79.7 | 63.7 | 64.0 | 74.1 | 72.2 | 56.5 | 72.7 | 67.7 | 53.0 | 51.3 | 56.2 | 92.5 | 46.0 | 69.8 | 47.7 | 64.8 | 62.8 |
| wol | MAD-X (N) | 46.6 | 41.8 | 47.2 | 37.8 | 53.9 | 51.8 | 41.0 | 39.0 | 46.5 | 44.0 | 38.3 | 40.2 | 44.3 | 38.8 | 44.6 | 40.1 | 85.6 | 39.2 | 46.4 | 36.0 | 45.2 | 43.2 |
| wol | MAD-X (N+W) | 61.7 | 63.6 | 68.9 | 63.1 | 66.8 | 77.0 | 67.8 | 69.1 | 73.7 | 71.3 | 63.2 | 75.1 | 68.9 | 55.8 | 50.7 | 54.9 | 90.4 | 49.6 | 70.0 | 51.7 | 65.7 | 63.8 |
| eng-ron-wol (multi-source) | FT-Eval | 44.2 | 36.3 | 39.3 | 39.3 | 69.4 | 78.5 | 70.6 | 59.2 | 35.5 | 46.8 | 60.9 | 81.4 | 65.8 | 58.5 | 53.8 | 38.8 | 89.1 | 48.8 | 65.2 | 53.5 | 56.7 | 53.6 |
| eng-ron-wol (multi-source) | LT-SFT | 67.4 | 64.6 | 70.0 | 64.2 | 70.4 | 81.1 | 68.7 | 63.9 | 76.4 | 73.9 | 58.8 | 83.0 | 69.6 | 57.3 | 52.7 | 57.2 | 93.1 | 45.8 | 69.8 | 48.3 | 66.8 | 64.4 |
| eng-ron-wol (multi-source) | MAD-X | 66.2 | 65.5 | 70.3 | 64.9 | 69.1 | 82.3 | 73.1 | 68.0 | 75.1 | 74.2 | 69.2 | 83.9 | 69.4 | 62.6 | 53.6 | 55.2 | 90.1 | 52.3 | 70.8 | 59.4 | 68.8 | 66.7 |

Table 4: Cross-lingual transfer to MasakhaPOS. Zero-shot evaluation using FT-Eval, LT-SFT, and MAD-X, with ron, eng, and wol as source languages. Experiments are based on AfroXLMR-base. Non-Bantu Niger-Congo languages are highlighted in gray in the original table. AVG* excludes pcm and wol from the average since they are source languages.

Performance difference by language family Table 4 shows the transfer result per language for the three best source languages. wol has a better transfer performance to non-Bantu Niger-Congo languages in West Africa than eng and ron, especially for bbj, ewe, fon, ibo, mos, twi, and yor, despite having a smaller POS training set (1.2k sentences) than ron (8k sentences) and eng (12.5k sentences). Also, the wol adapter was trained on a small monolingual corpus (5.2MB). This result aligns with prior studies showing that choosing a source language from the same family leads to more effective transfer (Lin et al., 2019; de Vries et al., 2022). However, we find MAD-X to be more sensitive to the size of the monolingual corpus. We obtained a very poor transfer accuracy, lower than FT-Eval, when we trained the language adapter for wol only on the news domain (2.5MB), i.e. MAD-X (N). By additionally combining the news corpus with a Wikipedia corpus (2.7MB), i.e. MAD-X (N+W), we were able to obtain a result comparable to LT-SFT. This highlights the importance of using a larger monolingual corpus to train the source language adapter. wol was not the best source language for Bantu languages, probably because of the difference in language characteristics.
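The family-level pattern can be checked from the single-source LT-SFT rows of Table 4. The sketch below averages those rows over two groups of target languages. The first group is the list of non-Bantu Niger-Congo languages named in the text; the second group (the eight Bantu languages in the dataset) and all helper names are our own assumptions for illustration.

```python
from statistics import mean

# Single-source LT-SFT rows of Table 4, restricted to the languages used below.
LT_SFT = {
    "eng": {"bbj": 57.6, "ewe": 67.9, "fon": 55.5, "ibo": 76.3, "mos": 70.3,
            "twi": 52.1, "yor": 65.3, "kin": 64.2, "lug": 61.0, "nya": 59.4,
            "sna": 64.6, "swa": 56.9, "tsn": 49.5, "xho": 45.9, "zul": 49.8},
    "ron": {"bbj": 57.0, "ewe": 64.9, "fon": 60.4, "ibo": 77.4, "mos": 67.9,
            "twi": 55.7, "yor": 64.8, "kin": 68.2, "lug": 58.5, "nya": 58.2,
            "sna": 64.6, "swa": 59.7, "tsn": 57.4, "xho": 46.3, "zul": 51.2},
    "wol": {"bbj": 64.3, "ewe": 69.8, "fon": 63.0, "ibo": 79.7, "mos": 72.2,
            "twi": 56.2, "yor": 69.8, "kin": 63.7, "lug": 64.0, "nya": 56.5,
            "sna": 67.7, "swa": 53.0, "tsn": 51.3, "xho": 46.0, "zul": 47.7},
}

# Languages named in the text as non-Bantu Niger-Congo languages in West Africa.
NON_BANTU_WEST = ["bbj", "ewe", "fon", "ibo", "mos", "twi", "yor"]
# Assumption: our grouping of the Bantu target languages in the dataset.
BANTU = ["kin", "lug", "nya", "sna", "swa", "tsn", "xho", "zul"]


def group_mean(source, langs):
    """Mean LT-SFT accuracy of one source language over a group of targets."""
    return round(mean(LT_SFT[source][lang] for lang in langs), 1)


def best_source(langs):
    """Source language with the highest mean accuracy on the group."""
    return max(LT_SFT, key=lambda source: group_mean(source, langs))


for name, group in (("non-Bantu (West Africa)", NON_BANTU_WEST), ("Bantu", BANTU)):
    means = {source: group_mean(source, group) for source in LT_SFT}
    print(name, means, "best:", best_source(group))
```

Under this grouping, wol comes out ahead on the first group and ron on the second, which matches the description above. Other groupings or other methods (for example the MAD-X rows) may give a different picture.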
Bantu languages, for instance, are morphologically very rich, while non-Bantu Niger-Congo languages (like wol) are not. Our further analysis shows that sna was better at transferring to Bantu languages. Appendix G provides results for the other source languages.

8 Conclusion

In this paper, we created MasakhaPOS, the largest POS dataset for 20 typologically diverse African languages. We showed that POS annotation of these languages based on the UD scheme can be quite challenging, especially with regard to word segmentation and POS ambiguities. We provide POS baseline models using CRF and by fine-tuning multilingual PLMs. We analyze cross-lingual transfer on the MasakhaPOS dataset in single-source and multi-source settings. An important finding that emerged from this study is that choosing the appropriate transfer languages substantially improves POS tagging for unseen languages. The transfer performance is particularly effective when pre-training includes a language that shares typological features with the target languages.

9 Limitations

Some language families in Africa are not covered For example, Khoisan and Austronesian (like Malagasy) are not covered. We performed extensive analysis and experiments on Niger-Congo languages, but we only covered one language each in the Afro-Asiatic (Hausa) and Nilo-Saharan (Dholuo) families.

News domain Our annotated dataset belongs to the news domain, which is a popular domain in UD. However, the POS dataset and models may not generalize to other domains, such as speech transcripts or conversational data.

Transfer results may not generalize to all NLP tasks We have only experimented with the POS task; the best transfer language, e.g. Wolof for non-Bantu Niger-Congo languages, may not be the same for other NLP tasks.
10 Ethics Statement or Broader Impact

Our work aims to understand the linguistic characteristics of African languages. We do not see any potential harm in using our POS datasets and models to train ML models: the annotated dataset is based on the news domain, the articles are publicly available, and we believe the dataset and POS annotation are unlikely to cause unintended harm. Also, we do not see any privacy risks in using our dataset and models, because they are based on the news domain.

Acknowledgements

This work was carried out with support from Lacuna Fund, an initiative co-founded by The Rockefeller Foundation, Google.org, and Canada's International Development Research Centre. We are grateful to Sascha Heyer for extending the ioAnnotator tool to meet our requirements for POS annotation. We appreciate the early advice from Graham Neubig, Kim Gerdes, and Sylvain Kahane on this project. David Adelani acknowledges the support of the DeepMind Academic Fellowship programme. We appreciate all the POS annotators who contributed to this dataset. Finally, we thank the Masakhane leadership, Melissa Omino, Davor Orlič and Knowledge4All for their administrative support throughout the project.

References

David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P.
Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu. 2022a. A few thousand translations go a long way! leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3053–3070, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.223

David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne, Allahsera Auguste Tapo, Tebogo Macucwa, Vukosi Marivate, Mboning Tchiaze Elvis, Tajuddeen Gwadabe, Tosin Adewumi, Orevaoghene Ahia, Joyce Nakatumba-Nabende, Neo Lerato Mokono, Ignatius Ezeani, Chiamaka Chukwuneke, Mofetoluwa Oluwaseun Adeyemi, Gilles Quentin Hacheme, Idris Abdulmumin, Odunayo Ogundepo, Oreen Yousuf, Tatiana Moteu, and Dietrich Klakow. 2022b. MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4488–4508, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. https://aclanthology.org/2022.emnlp-main.298

David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei. 2021. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131. https://doi.org/10.1162/tacl_a_00416

David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P.
Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Sabah al azzawi, Blessing K. Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Oluwaseyi Ajayi, Tatiana Moteu Ngoli, Brian Odhiambo, Abraham Toluwase Owodunni, Nnaemeka C. Obiefuna, Shamsuddeen Hassan Muhammad, Saheed Salahudeen Abdullahi, Mesay Gemeda Yigezu, Tajuddeen Gwadabe, Idris Abdulmumin, Mahlet Taye Bame, Oluwabusayo Olufunke Awoyomi, Iyanuoluwa Shode, Tolulope Anu Adelani, Habiba Abdulganiy Kailani, Abdul-Hakeem Omotayo, Adetola Adeeko, Afolabi Abeeb, Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari Kimotho, Onyekachi Raphael Ogbu, Chinedu E. Mbonu, Chiamaka I. Chukwuneke, Samuel Fanijo, Jessica Ojo, Oyinkansola F. Awosan, Tadesse Kebede Guge, Sakayo Toadoum Sari, Pamela Nyatsine, Freedmore Sidume, Oreen Yousuf, Mardiyyah Oduwole, Ussen Kimanuka, Kanda Patrick Tshinu, Thina Diko, Siyanda Nxakama, Abdulmejid Tuni Johar, Sinodos Gebre, Muhidin Mohamed, Shafie Abdi Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed, Evrard Ngabire, and Pontus Stenetorp. 2023. Masakhanews: News topic classification for african languages. http://arxiv.org/abs/2304.09972

Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. 2022. Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336–4349, Gyeongju, Republic of Korea. International Committee on Computational Linguistics. https://aclanthology.org/2022.coling-1.382

Alan Ansell, Edoardo Ponti, Anna Korhonen, and Ivan Vulić. 2022. Composable sparse fine-tuning for cross-lingual transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1778–1796, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.125

Ekaterina Aplonova and Francis Tyers. 2017. Towards a dependency-annotated treebank for bambara. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, pages 138–145.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.421

Cheikh Anta Babou and Michele Loporcaro. 2016. Noun classes and grammatical gender in wolof. Journal of African Languages and Linguistics, 37(1):1–57. https://doi.org/doi:10.1515/jall-2016-0001

Adams Bodomo and Charles Marfo. 2002. The morphophonology of noun classes in dagaare and akan.

Joan Bresnan and Sam A Mchombo. 1987. Topic, pronoun, and agreement in chicheŵa. Language, pages 741–782.

Ronald Cardenas, Ying Lin, Heng Ji, and Jonathan May. 2019. A grounded unsupervised universal part-of-speech tagger for low-resource languages. arXiv preprint arXiv:1904.05426.

Emmanuel Chabata. 2000. The shona corpus and the problem of tagging. Lexikos, 10(10):76–85.

Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2021. Rethinking embedding coupling in pre-trained language models. In International Conference on Learning Representations. https://openreview.net/forum?id=xpFFI_NtgpW

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.747

Guy De Pauw, Naomi Maajabu, and Peter Waiganjo Wagacha. 2010. A knowledge-light approach to luo machine translation and part-of-speech tagging. In Proceedings of the Second Workshop on African Language Technology (AfLaT 2010). Valletta, Malta: European Language Resources Association (ELRA), pages 15–20.

Wietse de Vries, Martijn Wieling, and Malvina Nissim. 2022. Make the best of cross-lingual transfer: Evidence from POS tagging with over 100 languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7676–7685, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.529

Xolani Delman. 2016. Development of Part-of-speech Tagger for Xhosa. Ph.D. thesis, University of Fort Hare.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

Cheikh M Bamba Dione. 2019. Developing universal dependencies for wolof. In Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019), pages 12–23.

Cheikh M Bamba Dione, Jonas Kuhn, and Sina Zarrieß. 2010. Design and development of part-of-speech-tagging resources for wolof (niger-congo, spoken in senegal).
In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10).

Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, and Chris C. Emezue. 2022. Afrolm: A self-active learning-based multilingual pretrained language model for 23 african languages. ArXiv, abs/2211.03263.

Tom Güldemann. 2008. Quotative Indexes in African Languages. A Synchronic and Diachronic Survey. De Gruyter Mouton, Berlin, New York. https://doi.org/doi:10.1515/9783110211450

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR. https://proceedings.mlr.press/v97/houlsby19a.html

Olájídé Ishola and Daniel Zeman. 2020. Yorùbá dependency treebank (YTB). In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5178–5186, Marseille, France. European Language Resources Association. https://aclanthology.org/2020.lrec-1.637

Mariya Koleva. 2013. Towards adaptation of nlp tools for closely-related bantu languages: Building a part-of-speech tagger for zulu. Master’s thesis, Saarland University, Germany.

Adenike Lawal. 1991. Yoruba pe and ki verbs or complementizers. Studies in African Linguistics, 22(1):74–84.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1301

Gabofetswe Malema, Boago Okgetheng, and Moffat Motlhanka. 2017. Setswana part of speech tagging. International Journal on Natural Language Computing, 6(6):15–20.

Gabofetswe Malema, Boago Okgetheng, Bopaki Tebalo, Moffat Motlhanka, and Goaletsa Rammidi. 2020. Complex setswana parts of speech tagging. In Proceedings of the first workshop on Resources for African Indigenous Languages, pages 21–24.

Fiona McLaughlin. 2004. Is there an adjective class in wolof. Adjective classes: A cross-linguistic typology, 1:242–262.

Josh Meyer, David Adelani, Edresson Casanova, Alp Öktem, Daniel Whitenack, Julian Weber, Salomon KABONGO KABENAMUALU, Elizabeth Salesky, Iroro Orife, Colin Leong, Perez Ogayo, Chris Chinenye Emezue, Jonathan Mukiibi, Salomey Osei, Apelete AGBOLO, Victor Akinode, Bernard Opoku, Olanrewaju Samuel, Jesujoba Alabi, and Shamsuddeen Hassan Muhammad. 2022. BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus. In Proc. Interspeech 2022, pages 2383–2387. https://doi.org/10.21437/Interspeech.2022-10850

Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa’id Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, et al. 2023. Afrisenti: A twitter sentiment analysis benchmark for african languages. arXiv preprint arXiv:2302.08956. https://arxiv.org/abs/2302.08956

Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Sa’id Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Salahudeen Abdullahi, Anuoluwapo Aremu, Alípio Jorge, and Pavel Brazdil. 2022. NaijaSenti: A Nigerian Twitter sentiment corpus for multilingual sentiment analysis. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 590–602, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.63

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666.

Rubungo Andre Niyongabo, Qu Hong, Julia Kreutzer, and Li Huang. 2020. KINNEWS and KIRNEWS: Benchmarking cross-lingual text classification for Kinyarwanda and Kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5507–5521, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.480

NLLB-Team, Marta Ruiz Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Alison Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon L. Spruit, C. Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation. ArXiv, abs/2207.04672.
Derek Nurse and Gerard Philippson, editors. 2006. The Bantu Languages. Routledge Language Family Series. Routledge, London, England.

Perez Ogayo, Graham Neubig, and Alan W Black. 2022. Building African Voices. In Proc. Interspeech 2022, pages 1263–1267. https://doi.org/10.21437/Interspeech.2022-152

Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. 2021. Small data? no problem! exploring the viability of pretrained multilingual language models for low-resourced languages. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 116–126, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.mrl-1.11

Ikechukwu E Onyenwe, Chinedu Uchechukwu, and Mark Hepple. 2014. Part-of-speech tagset and corpus development for igbo, an african. In Proceedings of LAW VIII-The 8th Linguistic Annotation Workshop, pages 93–98. Association for Computational Linguistics and Dublin City University.

Olasope O Oyelaran. 1982. On the scope of the serial verb construction in yoruba. Studies in African Linguistics, 13(2):109.

Chester Palen-Michel, June Kim, and Constantine Lignos. 2022. Multilingual open text release 1: Public domain news in 44 languages. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2080–2089, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.224

Doris L. Payne, Sara Pacchiarotti, and Mokaya Bosire, editors. 2017. Diversity in African languages. Number 1 in Contemporary African Linguistics. Language Science Press, Berlin.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2089–2096, Istanbul, Turkey. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.617

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2021. UNKs everywhere: Adapting multilingual language models to new scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10186–10203, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.800

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.185

Sandy Ritchie, You-Chi Cheng, Mingqing Chen, Rajiv Mathews, Daan van Esch, Bo Li, and Khe Chai Sim. 2022. Large vocabulary speech recognition for languages of africa: multilingual modeling and self-supervised learning. ArXiv, abs/2208.03067.

Adedjouma A. Sèmiyou, John OR Aoga, and Mamoud A Igue. 2012. Part-of-speech tagging of yoruba standard, language of niger-congo family. Research Journal of Computer and Information Technology Sciences, 1:2–5.

Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Z. Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Bhakta Neupane, David Ifeoluwa Adelani, Amelia Taylor, Jamiil Toure Ali, Kevin Degila, Momboladji Balogoun, Thierno Ibrahima Diop, Davis David, Chayma Fourati, Hatem Haddad, and Malek Naski. 2021. Ai4d - african language program. ArXiv, abs/2104.02516.

Aminu Tukur, Kabir Umar, and SAS Muhammad. 2020. Parts-of-speech tagging of hausa-based texts using hidden markov model. vol, 6:303–313.

Valentin Vydrin. 2018. Where corpus methods hit their limits: the case of separable adjectives in bambara. Rhema, (4):34–48.

Wm E Welmers. 2018. African language structures. University of California Press.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6

A Language Characteristics

Table 5 provides the details about the language characteristics.

B Annotation Agreement

Table 6 provides POS annotation agreements at the sentence level for 13 out of the 20 focus languages.
of Latin Letters Morphological Inflectional Noun Language Letters Omitted Letters added Tonality diacritics Word Order typology Morphology (WALS) Classes Bambara (bam) 27 q,v,x E, O, ñ, N yes, 2 tones yes SVO & SOV isolating strong suffixing absent Ghomálá’ (bbj) 40 q, w, x, y bv, dz, @, a@, E, gh, ny, nt, N, Nk, O, pf, mpf, sh, ts, 0, zh, ’ yes, 5 tones yes SVO agglutinative strong prefixing active, 6 Éwé (ewe) 35 c, j, q ã, dz, E, ƒ, gb, G, kp, ny, N, O, ts, V yes, 3 tones yes SVO isolating equal prefixing and suffixing vestigial Fon (fon) 33 q ã, E,gb, hw, kp, ny, O, xw yes, 3 tones yes SVO isolating little affixation vestigial Hausa (hau) 44 p,q,v,x á, â, Î, ¯, kw, Îw, gw, ky, Îy, gy, sh, ts yes, 2 tones no SVO agglutinative little affixation absent Igbo (ibo) 34 c, q, x ch, gb, gh, gw, kp, kw, nw, ny, o. , ȯ, sh, u. yes, 2 tones yes SVO agglutinative little affixation vestigial Kinyarwanda (kin) 30 q, x cy, jy, nk, nt, ny, sh yes, 2 tones no SVO agglutinative strong prefixing active, 16 Luganda (lug) 25 h, q, x N, ny yes, 3 tones no SVO agglutinative strong prefixing active, 20 Luo (luo) 31 c, q, x, v, z ch, dh, mb, nd, ng’, ng, ny, nj, th, sh yes, 4 tones no SVO agglutinative equal prefixing and suffixing absent Mossi (mos) 26 c, j, q, x ’, E, Ì, V yes, 2 tones yes SVO isolating strongly suffixing active, 11 Chichewa (nya) 31 q, x, y ch, kh, ng, N, ph, tch, th, ŵ yes, 2 tones no SVO agglutinative strong prefixing active, 17 Naija (pcm) 26 – – no no SVO mostly analytic strongly suffixing absent chiShona (sna) 29 c, l, q, x bh, ch, dh, nh, sh, vh, zh yes, 2 tones no SVO agglutinative strong prefixing active, 20 Swahili (swa) 33 x, q ch, dh, gh, kh, ng’, ny, sh, th, ts no yes SVO agglutinative strong suffixing active, 18 Setswana (tsn) 36 c, q, v, x, z ê, kg, kh, ng, ny, ô, ph, š, th, tl, tlh, ts, tsh, tš, tšh yes, 2 tones no SVO agglutinative strong prefixing active, 18 Akan/Twi (twi) 22 c,j,q,v,x,z E, O yes, 5 tones no SVO isolating strong prefixing active, 6 
Wolof (wol) 29 h,v,z N, à, é, ë, ó, ñ no yes SVO agglutinative strong suffixing active, 10 isiXhosa (xho) 68 – bh, ch, dl, dy, dz, gc, gq, gr, gx, hh, hl, kh, kr, lh, mh, ng, ngc, ngh, ngq, ngx, nkq, nkx, nh, nkc, nx, ny, nyh, ph, qh, rh, sh, th, ths, thsh, ts, tsh, ty, tyh, wh, xh, yh, zh yes, 2 tones no SVO agglutinative strong prefixing active, 17 Yorùbá (yor) 25 c, q, v, x, z e. , gb, s. , o. yes, 3 tones yes SVO isolating little affixation vestigial, 2 isiZulu (zul) 55 – nx, ts, nq, ph, hh, ny, gq, hl, bh, nj, ch, ngc, ngq, th, ngx, kl, ntsh, sh, kh, tsh, ng, nk, gx, xh, gc, mb, dl, nc, qh yes, 3 tones no SVO agglutinative strong prefixing active, 17 Table 5: Linguistic Characteristics of the Languages No. agreed agreed No. agreed agreed Lang. annotation annotation (%) Lang. annotation annotation (%) bam 1,091 77.9 pcm 1,073 76.6 ewe 616 44.0 tsn 1,058 24.4 hau 1,079 77.1 twi 1,306 93.2 kin 1,127 80.5 xho 1,378 98.4 lug 937 66.9 yor 1,059 75.6 luo 564 40.3 zul 905 64.6 mos 829 49.2 Table 6: Number of sentences with agreed annotations and their percentages Language Data Source # Train/# dev/ # test Afrikaans (afr) UD_Afrikaans-AfriBooms 1,315/ 194/ 425 Arabic (ara) UD_Arabic-PADT 6,075/ 909/ 680 English (eng) UD_English-EWT 12,544/ 2001/ 2077 French (fra) UD_French-GSD 14,450/ 1,476/ 416 Naija (pcm) UD_Naija-NSC 7,279/ 991/ 972 Romanian (ron) UD_Romanian-RRT 8,043/ 752/ 729 Wolof (wol) UD_Wolof-WTB 1,188/ 449/ 470 Table 7: Data Splits for UD POS datasets used as source languages for cross-lingual transfer. C UD POS data split Table 7 provides the UD POS corpus found online that we make use for determining the best transfer languages D Hyper-parameters for Experiments Hyper-parameters for Baseline Models The PLMs were trained for 20 epochs with a learning rate of 5e-5 using huggingface transformers (Wolf et al., 2020). 
We use a batch size of 16.
Hyper-parameters for adapters: We train the task adapter using the following hyper-parameters: batch size of 8, 20 epochs, “pfeiffer” adapter config, adapter reduction factor of 4 (except for Wolof, where we use an adapter reduction factor of 1), and learning rate of 5e-5. For the language adapters, we train for 100 epochs or a maximum of 100K steps, with a minimum of 30K steps, batch size of 8, “pfeiffer+inv” adapter config, adapter reduction factor of 2, learning rate of 5e-5, and maximum sequence length of 256.
Hyper-parameters for LT-SFT: We use the default settings of Ansell et al. (2022).

E Monolingual data for Adapter/SFTs language adaptation
Table 8 lists the monolingual corpora, with their sources and sizes, that we use to train the language adapters and SFTs.

F MAD-X multi-source fine-tuning
Figure 2 provides the result of MAD-X with different source languages, and multi-source fine-tuning using either eng, ron or wol as language adapter for task adaptation prior to zero-shot transfer.
Our results show that using wol as the language adapter leads to slightly better accuracy (69.1%) than eng (68.7%) or ron (67.8%). In general, however, any of the three can be used, and all of them improve over LT-SFT, as shown in Table 9.

| Language | Source | Size (MB) |
|---|---|---|
| Bambara (bam) | MAFAND-MT (Adelani et al., 2022a) | 0.8MB |
| Ghomálá’ (bbj) | MAFAND-MT (Adelani et al., 2022a) | 0.4MB |
| Éwé (ewe) | MAFAND-MT (Adelani et al., 2022a) | 0.5MB |
| Fon (fon) | MAFAND-MT (Adelani et al., 2022a) | 1.0MB |
| Hausa (hau) | VOA (Palen-Michel et al., 2022) | 46.1MB |
| Igbo (ibo) | BBC Igbo (Ogueji et al., 2021) | 16.6MB |
| Kinyarwanda (kin) | KINNEWS (Niyongabo et al., 2020) | 35.8MB |
| Luganda (lug) | Bukedde (Alabi et al., 2022) | 7.9MB |
| Luo (luo) | Ramogi FM news (Adelani et al., 2021) and MAFAND-MT (Adelani et al., 2022a) | 1.4MB |
| Mossi (mos) | MAFAND-MT (Adelani et al., 2022a) | 0.7MB |
| Naija (pcm) | BBC (Alabi et al., 2022) | 50.2MB |
| Chichewa (nya) | Nation Online Malawi (Siminyu et al., 2021) | 4.5MB |
| chiShona (sna) | VOA (Palen-Michel et al., 2022) | 28.5MB |
| Kiswahili (swa) | VOA (Palen-Michel et al., 2022) | 17.1MB |
| Setswana (tsn) | Daily News (Adelani et al., 2021), MAFAND-MT (Adelani et al., 2022a) | 1.9MB |
| Twi (twi) | MAFAND-MT (Adelani et al., 2022a) | 0.8KB |
| Wolof (wol) | Lu Defu Waxu, Saabal, Wolof Online, and MAFAND-MT (Adelani et al., 2022a) | 2.3MB |
| isiXhosa (xho) | Isolezwe Newspaper | 17.3MB |
| Yorùbá (yor) | BBC Yorùbá (Alabi et al., 2022) | 15.0MB |
| isiZulu (zul) | Isolezwe Newspaper | 34.3MB |
| Romanian (ron) | Wikipedia | 500MB |
| French (fra) | Wikipedia (a subset) | 500MB |

Table 8: Monolingual News Corpora used for language adapter and SFT training, and their sources and size (MB)

Heat Map of MAD-X Transfer Accuracy (rows: source language; columns: target language)

| Source | bam | bbj | ewe | fon | hau | ibo | kin | lug | luo | mos | nya | pcm | sna | swa | tsn | twi | wol | xho | yor | zul | ave | ave* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| afr | 61 | 55 | 64 | 52 | 64 | 74 | 63 | 56 | 68 | 66 | 61 | 80 | 60 | 58 | 53 | 52 | 76 | 40 | 66 | 41 | 60 | 58 |
| ara | 41 | 32 | 46 | 43 | 48 | 53 | 53 | 50 | 58 | 46 | 51 | 67 | 54 | 62 | 39 | 35 | 52 | 39 | 50 | 43 | 48 | 47 |
| eng | 66 | 58 | 67 | 54 | 67 | 77 | 71 | 68 | 76 | 72 | 69 | 83 | 71 | 63 | 54 | 52 | 75 | 58 | 66 | 62 | 66 | 65 |
| fra | 63 | 56 | 66 | 58 | 68 | 79 | 71 | 66 | 73 | 68 | 71 | 83 | 72 | 66 | 52 | 48 | 78 | 47 | 69 | 51 | 65 | 64 |
| pcm | 46 | 44 | 52 | 37 | 58 | 68 | 54 | 60 | 60 | 54 | 59 | 76 | 60 | 54 | 44 | 35 | 58 | 33 | 53 | 40 | 52 | 51 |
| ron | 64 | 62 | 68 | 59 | 65 | 78 | 72 | 61 | 75 | 68 | 65 | 83 | 70 | 64 | 62 | 54 | 78 | 47 | 70 | 56 | 66 | 64 |
| wol | 62 | 63 | 70 | 66 | 67 | 78 | 63 | 65 | 73 | 72 | 62 | 77 | 70 | 54 | 52 | 56 | 91 | 43 | 70 | 45 | 65 | 63 |
| eng-ron-wol (wo) | 66 | 65 | 70 | 62 | 69 | 82 | 71 | 68 | 74 | 72 | 69 | 84 | 72 | 62 | 52 | 58 | 90 | 52 | 70 | 56 | 68 | 66 |
| eng-ron-wol (en) | 62 | 61 | 69 | 61 | 70 | 82 | 72 | 70 | 75 | 72 | 68 | 84 | 71 | 63 | 58 | 58 | 87 | 53 | 72 | 58 | 68 | 66 |
| eng-ron-wol (ro) | 65 | 63 | 71 | 62 | 70 | 81 | 70 | 62 | 75 | 73 | 65 | 85 | 69 | 63 | 60 | 58 | 88 | 46 | 72 | 51 | 67 | 65 |
| wol_news | 47 | 42 | 47 | 38 | 54 | 52 | 41 | 39 | 46 | 44 | 38 | 40 | 44 | 39 | 45 | 40 | 86 | 39 | 46 | 36 | 45 | 43 |

Figure 2: MAD-X: Cross-lingual Experiments on MasakhaPOS. Zero-shot evaluation using afr, ara, eng, fra, ron, pcm and wol as source languages. Experiments based on AfroXLMR-base. ave* excludes pcm and wol from the average since they are also source languages.

G Cross-lingual transfer from all source languages
Table 9 shows the result of cross-lingual transfer from each source language (afr, ara, eng, fra, pcm, ron, and wol) to each of the African languages. We extended the evaluation to include sna, using the newly created POS corpus, since Adelani et al. (2022b) recommended it as the best transfer language for a related task, named entity recognition. We also tried other Bantu languages such as kin and swa, but their performance was worse than that of sna. Our evaluation shows that sna results in better transfer to Bantu languages because of its rich morphology. We achieved the best average result across all languages using multi-source transfer from eng, ron, wol, and sna.
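The claim that sna transfers better to Bantu languages can be checked directly against the single-source MAD-X rows of Table 9. The following is a minimal sketch, not part of the original experiments: the `best_source` helper is hypothetical, and the choice of kin, lug, nya, swa, tsn, xho and zul as the Bantu targets (sna itself is left out as a target) is an assumption based on standard classification.

```python
# MAD-X accuracies copied from the single-source rows of Table 9.
# Assumption: these seven targets are the Bantu languages other than sna.
MADX = {
    "eng": {"kin": 70.9, "lug": 65.7, "nya": 70.1, "swa": 61.2,
            "tsn": 49.8, "xho": 57.1, "zul": 60.9},
    "ron": {"kin": 73.5, "lug": 62.7, "nya": 66.0, "swa": 64.5,
            "tsn": 61.2, "xho": 48.6, "zul": 57.8},
    "sna": {"kin": 75.0, "lug": 79.2, "nya": 70.6, "swa": 63.2,
            "tsn": 52.7, "xho": 61.8, "zul": 69.8},
}


def best_source(results):
    """Return, for each target language, the source with the highest accuracy."""
    targets = next(iter(results.values())).keys()
    return {t: max(results, key=lambda src: results[src][t]) for t in targets}


best = best_source(MADX)
sna_wins = sum(1 for src in best.values() if src == "sna")
print(best)
print(f"sna is the best single source for {sna_wins} of {len(best)} targets")
```

Under these assumptions, sna is the best single source for five of the seven targets, while ron is ahead for swa and tsn.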
| Source | Method | bam | bbj | ewe | fon | hau | ibo | kin | lug | luo | mos | nya | pcm | sna | swa | tsn | twi | wol | xho | yor | zul | AVG | AVG* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ara | FT-Eval | 26.4 | 10.0 | 16.0 | 14.2 | 47.7 | 62.5 | 57.1 | 35.4 | 15.3 | 17.0 | 53.7 | 66.4 | 56.0 | 58.4 | 42.9 | 14.1 | 13.5 | 39.0 | 46.9 | 44.8 | 36.9 | 37.1 |
| ara | LT-SFT | 41.0 | 30.7 | 41.2 | 45.0 | 47.3 | 62.9 | 54.0 | 48.7 | 56.2 | 43.2 | 54.4 | 63.3 | 53.6 | 59.4 | 44.8 | 39.9 | 51.0 | 36.8 | 50.6 | 44.8 | 48.4 | 48.0 |
| ara | MAD-X | 44.5 | 36.5 | 50.9 | 45.9 | 48.5 | 59.5 | 55.5 | 51.1 | 60.5 | 46.7 | 53.4 | 66.8 | 53.8 | 59.1 | 40.4 | 37.9 | 52.3 | 40.3 | 52.3 | 44.6 | 50.0 | 49.7 |
| pcm | FT-Eval | 16.0 | 8.6 | 14.3 | 4.9 | 58.0 | 64.9 | 48.9 | 35.9 | 13.0 | 11.0 | 47.5 | 74.6 | 51.9 | 50.9 | 32.8 | 5.3 | 7.3 | 25.9 | 46.9 | 30.9 | 32.8 | 33.2 |
| pcm | LT-SFT | 44.4 | 39.4 | 51.1 | 38.1 | 59.2 | 66.6 | 47.9 | 53.5 | 61.3 | 52.3 | 49.3 | 75.3 | 48.9 | 50.6 | 40.8 | 35.3 | 63.9 | 25.1 | 58.3 | 30.6 | 49.6 | 48.8 |
| pcm | MAD-X | 42.1 | 43.6 | 53.5 | 39.4 | 57.3 | 68.2 | 55.7 | 58.1 | 60.1 | 51.9 | 59.6 | 75.8 | 57.5 | 55.7 | 44.8 | 36.9 | 58.9 | 32.9 | 57.1 | 40.6 | 52.5 | 51.8 |
| afr | FT-Eval | 54.8 | 25.4 | 38.3 | 31.3 | 61.4 | 73.6 | 67.1 | 48.6 | 29.4 | 35.2 | 56.1 | 77.3 | 56.0 | 57.5 | 49.0 | 32.9 | 32.5 | 43.8 | 63.8 | 44.3 | 48.9 | 49.4 |
| afr | LT-SFT | 69.2 | 55.6 | 64.0 | 52.5 | 62.8 | 74.7 | 66.1 | 59.0 | 69.4 | 63.4 | 54.4 | 79.7 | 58.4 | 57.1 | 48.5 | 49.0 | 79.3 | 41.0 | 64.3 | 41.5 | 60.5 | 59.6 |
| afr | MAD-X | 61.9 | 56.1 | 63.9 | 53.0 | 63.0 | 75.2 | 68.2 | 60.2 | 68.1 | 63.4 | 62.0 | 80.8 | 61.1 | 60.6 | 50.4 | 48.6 | 75.7 | 43.8 | 65.2 | 46.0 | 61.4 | 60.6 |
| fra | FT-Eval | 41.0 | 15.2 | 27.5 | 16.1 | 64.1 | 73.0 | 67.7 | 53.4 | 21.9 | 21.3 | 65.2 | 77.9 | 64.4 | 62.2 | 51.8 | 16.8 | 17.7 | 45.8 | 61.6 | 46.5 | 45.6 | 46.1 |
| fra | LT-SFT | 60.6 | 52.2 | 63.3 | 60.2 | 63.9 | 75.6 | 63.4 | 57.6 | 69.0 | 65.2 | 66.4 | 79.7 | 63.0 | 61.2 | 52.4 | 48.6 | 78.3 | 43.9 | 64.7 | 44.3 | 61.7 | 60.7 |
| fra | MAD-X | 62.0 | 57.9 | 64.2 | 59.4 | 66.9 | 78.7 | 71.3 | 64.1 | 74.0 | 67.7 | 70.2 | 83.4 | 68.6 | 65.4 | 53.0 | 48.1 | 78.3 | 46.0 | 67.8 | 50.2 | 64.9 | 63.9 |
| eng | FT-Eval | 52.1 | 31.9 | 47.8 | 32.5 | 67.1 | 74.5 | 63.9 | 57.8 | 38.4 | 45.3 | 59.0 | 82.1 | 63.7 | 56.9 | 52.6 | 35.9 | 35.9 | 45.9 | 63.3 | 48.8 | 52.6 | 52.9 |
| eng | LT-SFT | 67.9 | 57.6 | 67.9 | 55.5 | 69.0 | 76.3 | 64.2 | 61.0 | 74.5 | 70.3 | 59.4 | 82.4 | 64.6 | 56.9 | 49.5 | 52.1 | 78.2 | 45.9 | 65.3 | 49.8 | 63.4 | 62.5 |
| eng | MAD-X | 62.9 | 58.5 | 68.7 | 55.8 | 67.0 | 77.8 | 70.9 | 65.7 | 73.0 | 71.8 | 70.1 | 83.2 | 69.8 | 61.2 | 49.8 | 53.0 | 75.2 | 57.1 | 66.9 | 60.9 | 66.0 | 65.2 |
| ron | FT-Eval | 46.5 | 30.5 | 37.6 | 30.9 | 67.3 | 77.7 | 73.3 | 56.9 | 36.7 | 40.6 | 62.2 | 78.9 | 66.3 | 61.0 | 55.8 | 35.7 | 33.8 | 49.6 | 63.5 | 56.3 | 53.1 | 53.4 |
| ron | LT-SFT | 60.6 | 57.0 | 64.9 | 60.4 | 67.5 | 77.4 | 68.2 | 58.5 | 70.2 | 67.9 | 58.2 | 78.1 | 64.6 | 59.7 | 57.4 | 55.7 | 81.9 | 46.3 | 64.8 | 51.2 | 63.5 | 62.4 |
| ron | MAD-X | 63.5 | 62.2 | 66.6 | 61.8 | 66.5 | 80.0 | 73.5 | 62.7 | 76.5 | 71.8 | 66.0 | 83.7 | 71.1 | 64.5 | 61.2 | 53.5 | 79.5 | 48.6 | 69.5 | 57.8 | 67.0 | 66.1 |
| wol | FT-Eval | 40.8 | 36.5 | 39.8 | 37.4 | 55.1 | 58.6 | 49.2 | 51.8 | 35.1 | 44.9 | 49.0 | 51.6 | 53.8 | 42.9 | 45.0 | 38.4 | 88.6 | 46.0 | 52.5 | 45.5 | 48.1 | 45.6 |
| wol | LT-SFT (N) | 64.4 | 64.3 | 69.8 | 63.0 | 67.0 | 79.7 | 63.7 | 64.0 | 74.1 | 72.2 | 56.5 | 72.7 | 67.7 | 53.0 | 51.3 | 56.2 | 92.5 | 46.0 | 69.8 | 47.7 | 64.8 | 63.1 |
| wol | MAD-X (N) | 46.6 | 41.8 | 47.2 | 37.8 | 53.9 | 51.8 | 41.0 | 39.0 | 46.5 | 44.0 | 38.3 | 40.2 | 44.3 | 38.8 | 44.6 | 40.1 | 85.6 | 39.2 | 46.4 | 45.2 | 43.0 | 43.3 |
| wol | MAD-X (N+W) | 61.7 | 63.6 | 68.9 | 63.1 | 66.8 | 77.0 | 67.8 | 69.1 | 73.7 | 71.3 | 63.2 | 75.1 | 68.9 | 55.8 | 50.7 | 54.9 | 90.4 | 49.6 | 70.0 | 51.7 | 65.7 | 64.1 |
| sna | FT-Eval | 42.6 | 26.2 | 41.7 | 29.5 | 60.5 | 68.2 | 73.7 | 75.0 | 42.2 | 34.9 | 69.3 | 65.7 | 89.2 | 63.4 | 48.9 | 33.3 | 35.8 | 59.5 | 59.2 | 67.9 | 54.3 | 53.4 |
| sna | LT-SFT | 52.2 | 57.5 | 66.0 | 55.4 | 60.5 | 71.9 | 69.0 | 80.1 | 75.7 | 58.1 | 70.4 | 60.2 | 89.9 | 63.5 | 50.6 | 65.8 | 71.6 | 62.7 | 62.2 | 72.9 | 65.8 | 64.2 |
| sna | MAD-X | 50.3 | 57.0 | 65.3 | 56.3 | 64.1 | 71.9 | 75.0 | 79.2 | 75.9 | 59.8 | 70.6 | 68.6 | 89.7 | 63.2 | 52.7 | 61.0 | 75.3 | 61.8 | 57.8 | 69.8 | 66.3 | 64.5 |
| multi-source: eng-ron-wol | FT-Eval | 44.2 | 36.3 | 39.3 | 39.3 | 69.4 | 78.5 | 70.6 | 59.2 | 35.5 | 46.8 | 60.9 | 81.4 | 65.8 | 58.5 | 53.8 | 38.8 | 89.1 | 48.8 | 65.2 | 53.5 | 56.7 | 54.4 |
| multi-source: eng-ron-wol | LT-SFT | 67.4 | 64.6 | 70.0 | 64.2 | 70.4 | 81.1 | 68.7 | 63.9 | 76.4 | 73.9 | 58.8 | 83.0 | 69.6 | 57.3 | 52.7 | 57.2 | 93.1 | 45.8 | 69.8 | 48.3 | 66.8 | 65.2 |
| multi-source: eng-ron-wol | MAD-X | 66.2 | 65.5 | 70.3 | 64.9 | 69.1 | 82.3 | 73.1 | 68.0 | 75.1 | 74.2 | 69.2 | 83.9 | 69.4 | 62.6 | 53.6 | 55.2 | 90.1 | 52.3 | 70.8 | 59.4 | 68.8 | 67.5 |
| multi-source: eng-ron-wol-sna | FT-Eval | 45.1 | 35.9 | 39.6 | 41.0 | 69.5 | 78.7 | 76.9 | 71.7 | 37.4 | 46.8 | 71.9 | 82.4 | 88.9 | 63.8 | 51.7 | 38.8 | 89.2 | 59.6 | 65.6 | 67.3 | 61.1 | 58.0 |
| multi-source: eng-ron-wol-sna | LT-SFT | 66.7 | 64.7 | 68.5 | 65.1 | 71.0 | 81.2 | 75.3 | 80.2 | 79.3 | 73.5 | 73.6 | 83.6 | 89.1 | 64.3 | 51.1 | 60.9 | 93.2 | 61.8 | 69.1 | 70.2 | 72.1 | 70.0 |
| multi-source: eng-ron-wol-sna | MAD-X | 59.0 | 64.3 | 70.9 | 64.3 | 69.8 | 82.5 | 76.9 | 80.9 | 78.8 | 70.1 | 74.2 | 85.1 | 89.1 | 65.7 | 55.0 | 60.7 | 86.5 | 60.7 | 71.0 | 69.6 | 71.8 | 70.0 |

Table 9: Cross-lingual transfer to MasakhaPOS. Zero-shot evaluation using FT-Eval, LT-SFT, and MAD-X, with ron, eng, wol and sna as source languages. Experiments are based on AfroXLMR-base. Non-Bantu Niger-Congo languages are highlighted in gray (except for Bambara, which is often assigned to a different language family, Mande), while Bantu Niger-Congo languages are highlighted in cyan. AVG* excludes sna and wol from the average since they are source languages.
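
The AVG and AVG* columns can be reproduced from a single row of Table 9. The sketch below is illustrative only: it assumes AVG is the unweighted mean over all 20 target languages and AVG* the unweighted mean after dropping sna and wol, and the `average` helper is a hypothetical name. It uses the MAD-X row with eng as the source language.

```python
# Target languages in the column order of Table 9.
TARGETS = ["bam", "bbj", "ewe", "fon", "hau", "ibo", "kin", "lug", "luo", "mos",
           "nya", "pcm", "sna", "swa", "tsn", "twi", "wol", "xho", "yor", "zul"]

# MAD-X accuracies with eng as the source language, copied from Table 9.
ENG_MADX = [62.9, 58.5, 68.7, 55.8, 67.0, 77.8, 70.9, 65.7, 73.0, 71.8,
            70.1, 83.2, 69.8, 61.2, 49.8, 53.0, 75.2, 57.1, 66.9, 60.9]


def average(scores, exclude=()):
    """Unweighted mean accuracy over target languages, skipping those in `exclude`."""
    kept = [acc for lang, acc in scores.items() if lang not in exclude]
    return sum(kept) / len(kept)


scores = dict(zip(TARGETS, ENG_MADX))
avg = average(scores)                                # all 20 targets
avg_star = average(scores, exclude={"sna", "wol"})   # drop the source languages
print(f"AVG = {avg:.1f}, AVG* = {avg_star:.1f}")     # AVG = 66.0, AVG* = 65.2
```

Under these assumptions the computed values match the 66.0 and 65.2 reported for this row.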