
Hyperparameters used for our models.

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://figshare.com/articles/dataset/Hyperparameters_used_for_our_models_/29248522
Description:
While sentiment analysis systems excel in high-resource languages, most African languages, which face limited resources, remain under-represented. This gap leaves a significant portion of the world’s population without access to technologies in their native languages. However, multilingual pre-trained language models (PLMs) offer a promising approach for sentiment analysis in low-resource languages. Although the absence of large datasets in African languages poses a challenge for developing PLMs, fine-tuning and task adaptation of existing multilingual PLMs offer an alternative solution. This paper explores the use of multilingual PLMs for sentiment analysis in five Southern African languages: Sepedi, Sesotho, Setswana, isiXhosa, and isiZulu. We leverage existing PLMs and fine-tune them for this specific task, avoiding training the models from scratch. Our work expands on the SAfriSenti corpus, a Twitter sentiment dataset for these languages. We employ various annotation techniques to create a labelled dataset and perform benchmark experiments utilising various multilingual PLMs. Our findings demonstrate the effectiveness of multilingual PLMs, particularly for the closely related Sotho-Tswana languages, where the ensemble PLM method achieved an average weighted F1 score above 63%. The closely related Nguni languages achieved an even higher average weighted F1 score, exceeding 77%, highlighting the potential of PLMs for sentiment analysis in South African languages.
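
The description above reports results as weighted F1 scores. As a minimal sketch of how such a score is commonly computed (per-class F1 averaged with weights proportional to each class's share of the true labels), the following pure-Python function may help. The function name `weighted_f1` and the example labels are illustrative assumptions, not taken from the dataset or the authors' code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Illustrative sketch only; not the authors' evaluation code.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class's F1 by its share of the true labels.
        score += (n / total) * f1
    return score


# Hypothetical three-class sentiment labels, for illustration only.
gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]
print(weighted_f1(gold, pred))
```

Because the weights follow class frequency, frequent sentiment classes dominate the score. That is one reason a weighted F1 can differ noticeably from a macro-averaged F1 on imbalanced data.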
Created:
2025-06-05