bonadossou/afrolm_active_learning_dataset
Dataset Overview
Name: AfroLM Dataset
Description: This dataset was created for the paper "AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages" and supports research on pretrained language models for 23 African languages.
Languages: The dataset covers Amharic, Afan Oromo, Bambara, Ghomalá, Éwé, Fon, Hausa, Ìgbò, Kinyarwanda, Lingala, Luganda, Luo, Mooré, Chewa, Naija, Shona, Swahili, Setswana, Twi, Wolof, Xhosa, Yorùbá, and Zulu.
License: CC-BY-4.0
Multilinguality: monolingual
Size: 1M<n<10M
Source data: original
Tags: afrolm, active learning, language modeling, research papers, natural language processing, self-active learning
Task categories: fill-mask
Task IDs: masked-language-modeling
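For readers unfamiliar with the masked-language-modeling task named above, the following is a minimal, simplified sketch of how training inputs for such a task are typically built. It is an illustration only, not AfroLM's actual preprocessing: the whitespace tokenization, the `<mask>` string, the masking probability, and the helper name `mask_tokens` are all assumptions made for this example.

```python
import random

# Assumption: an XLM-R style mask string; check tokenizer.mask_token for the real value.
MASK_TOKEN = "<mask>"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hypothetical helper: randomly replace tokens with MASK_TOKEN.

    Returns (masked_tokens, labels). labels[i] holds the original token
    wherever a mask was applied and None elsewhere, so a model would be
    trained to predict only the masked positions.
    """
    rng = random.Random(seed)  # seeded for reproducible masking
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Placeholder sentence; real inputs would come from the dataset's text field.
sentence = "a short example sentence".split()
masked, labels = mask_tokens(sentence, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

Real pipelines usually operate on subword IDs rather than whitespace tokens and often mix in random or unchanged replacements; this sketch keeps only the core idea.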
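Dataset Usage
Model: AfroLM-Large
Dataset access: AfroLM Dataset
Usage example:

    from transformers import XLMRobertaModel, XLMRobertaTokenizer

    # Load the pretrained AfroLM model and its tokenizer from the Hugging Face Hub
    model = XLMRobertaModel.from_pretrained("bonadossou/afrolm_active_learning")
    tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")

    # Cap the tokenizer's maximum sequence length at 256 tokens
    tokenizer.model_max_length = 256
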
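Evaluation Results
AfroLM outperforms AfriBERTa, mBERT, and XLMR-base on the MasakhaNER 1.0 and MasakhaNER 2.0 datasets and is highly competitive with AfroXLMR. AfroLM is also data-efficient: its pretraining dataset is only 1/14 the size of its competitors' datasets.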
| Model | MasakhaNER | MasakhaNER2.0* | Text Classification (Yoruba/Hausa) | Sentiment Analysis (YOSM) | OOD Sentiment Analysis (Twitter -> YOSM) |
|---|---|---|---|---|---|
| AfroLM-Large | 80.13 | 83.26 | 82.90/91.00 | 85.40 | 68.70 |
| AfriBERTa | 79.10 | 81.31 | 83.22/90.86 | 82.70 | 65.90 |
| mBERT | 71.55 | 80.68 | --- | --- | --- |
| XLMR-base | 79.16 | 83.09 | --- | --- | --- |
| AfroXLMR-base | 81.90 | 84.55 | --- | --- | --- |

(*): Evaluation was performed on the 11 additional languages of the dataset.
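The comparison claims above can be checked directly against the two NER columns of the results table. The sketch below copies those scores and computes AfroLM-Large's margin over each other model; the dictionary layout and the helper name `margins_vs` are illustrative choices, not part of any AfroLM tooling.

```python
# Scores copied from the results table: (MasakhaNER, MasakhaNER2.0)
scores = {
    "AfroLM-Large": (80.13, 83.26),
    "AfriBERTa": (79.10, 81.31),
    "mBERT": (71.55, 80.68),
    "XLMR-base": (79.16, 83.09),
    "AfroXLMR-base": (81.90, 84.55),
}


def margins_vs(reference, table):
    """Hypothetical helper: score difference (reference minus other) per column."""
    ref = table[reference]
    return {
        name: tuple(round(r - s, 2) for r, s in zip(ref, vals))
        for name, vals in table.items()
        if name != reference
    }


margins = margins_vs("AfroLM-Large", scores)
for name, (ner1, ner2) in margins.items():
    print(f"{name:14s} MasakhaNER {ner1:+.2f}  MasakhaNER2.0 {ner2:+.2f}")
```

The margins are positive against AfriBERTa, mBERT, and XLMR-base on both benchmarks, and negative against AfroXLMR-base (by 1.77 and 1.29 points), which matches the description of AfroLM as competitive with, but not ahead of, AfroXLMR.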
Citation Information
@inproceedings{dossou-etal-2022-afrolm,
    title = "{A}fro{LM}: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 {A}frican Languages",
    author = "Dossou, Bonaventure F. P. and
      Tonja, Atnafu Lambebo and
      Yousuf, Oreen and
      Osei, Salomey and
      Oppong, Abigail and
      Shode, Iyanuoluwa and
      Awoyomi, Oluwabusayo Olufunke and
      Emezue, Chris",
    booktitle = "Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.sustainlp-1.11",
    pages = "52--64",
}



