shmuhammad/AfriSenti-twitter-sentiment

Name: shmuhammad/AfriSenti-twitter-sentiment
Creator: shmuhammad
Published: 2023-09-03 09:59:15
License: No description available

Source: Hugging Face · Updated: 2023-09-03 · Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/shmuhammad/AfriSenti-twitter-sentiment

Resource description:

---
task_categories:
- text-classification
task_ids:
- sentiment-analysis
- sentiment-classification
- sentiment-scoring
- semantic-similarity-classification
- semantic-similarity-scoring
tags:
- sentiment analysis, Twitter, tweets
- sentiment
multilinguality:
- monolingual
- multilingual
size_categories:
- 100K<n<1M
language:
- amh
- ary
- ar
- arq
- hau
- ibo
- kin
- por
- pcm
- eng
- oro
- swa
- tir
- twi
- tso
- yor
pretty_name: AfriSenti
---

# Dataset Card for AfriSenti Dataset

<p align="center">
  <img src="https://raw.githubusercontent.com/afrisenti-semeval/afrisent-semeval-2023/main/images/afrisenti-twitter.png" width="700" height="500">
</p>

--------------------------------------------------------------------------------

## Dataset Description

- **Homepage:** https://github.com/afrisenti-semeval/afrisent-semeval-2023
- **Repository:** [GitHub](https://github.com/afrisenti-semeval/afrisent-semeval-2023)
- **Paper:** [AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages](https://arxiv.org/pdf/2302.08956.pdf)
- **Paper:** [SemEval-2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval)](https://arxiv.org/pdf/2304.06845.pdf)
- **Paper:** [NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis](https://arxiv.org/pdf/2201.08277.pdf)
- **Leaderboard:** N/A
- **Point of Contact:** [Shamsuddeen Muhammad](mailto:shamsuddeen2004@gmail.com)

### Dataset Summary

AfriSenti is the largest sentiment analysis dataset for under-represented African languages, covering 110,000+ annotated tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yoruba). The datasets were used in the first Afrocentric SemEval shared task, SemEval 2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval).
AfriSenti allows the research community to build sentiment analysis systems for various African languages and enables the study of sentiment and contemporary language use in African languages.

### Supported Tasks and Leaderboards

AfriSenti can be used for a wide range of sentiment analysis tasks in African languages, such as sentiment classification, sentiment intensity analysis, and emotion detection. The dataset is suitable for training and evaluating machine learning models for various NLP tasks related to sentiment analysis in African languages.

[SemEval 2023 Task 12: Sentiment Analysis for African Languages](https://codalab.lisn.upsaclay.fr/competitions/7320)

### Languages

14 African languages: Amharic (amh), Algerian Arabic (arq), Hausa (hau), Igbo (ibo), Kinyarwanda (kin), Moroccan Arabic/Darija (ary), Mozambican Portuguese (por), Nigerian Pidgin (pcm), Oromo (orm), Swahili (swa), Tigrinya (tir), Twi (twi), Xitsonga (tso), and Yoruba (yor).

## Dataset Structure

### Data Instances

For each instance, there is a string for the tweet and a string for the label. See the AfriSenti [dataset viewer](https://huggingface.co/datasets/shmuhammad/AfriSenti/viewer/shmuhammad--AfriSenti/train) to explore more examples.

```
{
  "tweet": "string",
  "label": "string"
}
```

### Data Fields

The data fields are:

```
tweet: a string feature.
label: a classification label, with possible values including positive, negative and neutral.
```

### Data Splits

The AfriSenti dataset has 3 splits: train, validation, and test. Below are the statistics for Version 1.0.0 of the dataset.

| Split | amh | arq | hau | ibo | ary | orm | pcm | pt-MZ | kin | swa | tir | tso | twi | yo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| train | 5,982 | 1,652 | 14,173 | 10,193 | 5,584 | - | 5,122 | 3,064 | 3,303 | 1,811 | - | 805 | 3,482 | 8,523 |
| dev | 1,498 | 415 | 2,678 | 1,842 | 1,216 | 397 | 1,282 | 768 | 828 | 454 | 399 | 204 | 389 | 2,091 |
| test | 2,000 | 959 | 5,304 | 3,683 | 2,962 | 2,097 | 4,155 | 3,663 | 1,027 | 749 | 2,001 | 255 | 950 | 4,516 |
| total | 9,483 | 3,062 | 22,155 | 15,718 | 9,762 | 2,494 | 10,559 | 7,495 | 5,158 | 3,014 | 2,400 | 1,264 | 4,821 | 15,130 |

### How to use it

```python
from datasets import load_dataset

# Load a specific language (e.g., Amharic).
# This downloads the train, validation, and test sets.
ds = load_dataset("shmuhammad/AfriSenti-twitter-sentiment", "amh")

# Train set only
ds = load_dataset("shmuhammad/AfriSenti-twitter-sentiment", "amh", split="train")

# Test set only
ds = load_dataset("shmuhammad/AfriSenti-twitter-sentiment", "amh", split="test")

# Validation set only
ds = load_dataset("shmuhammad/AfriSenti-twitter-sentiment", "amh", split="validation")
```

## Dataset Creation

### Curation Rationale

AfriSenti Version 1.0.0 was created for use in the first Afrocentric SemEval shared task, **[SemEval 2023 Task 12: Sentiment analysis for African languages (AfriSenti-SemEval)](https://afrisenti-semeval.github.io)**.

### Source Data

Twitter

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

We anonymized the tweets by replacing all *@mentions* with *@user* and removing all URLs.
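
The anonymization step described above can be approximated with two regular expressions. This is only a minimal sketch: the card does not give the authors' actual rules, so the patterns `MENTION_RE` and `URL_RE` and the helper `anonymize_tweet` are assumptions made for illustration.

```python
import re

# Assumed patterns for illustration; the card does not specify the exact rules used.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def anonymize_tweet(text: str) -> str:
    """Replace @mentions with @user, drop URLs, and collapse leftover whitespace."""
    text = URL_RE.sub("", text)           # remove URLs first
    text = MENTION_RE.sub("@user", text)  # then mask user handles
    return " ".join(text.split())         # tidy gaps left by removed URLs


print(anonymize_tweet("@jane_doe check this out https://example.com/page now"))
```

Applying the same normalization to new tweets before inference may help keep inputs closer to the released data, although an exact match with the original preprocessing is not guaranteed.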

## Considerations for Using the Data

### Social Impact of Dataset

The AfriSenti dataset has the potential to improve sentiment analysis for African languages, which is essential for understanding and analyzing the diverse perspectives of people on the African continent. This dataset can enable researchers and developers to create sentiment analysis models that are specific to African languages, which can be used to gain insights into the social, cultural, and political views of people in African countries. Furthermore, this dataset can help address the underrepresentation of African languages in natural language processing, paving the way for more equitable and inclusive AI technologies.

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

AfriSenti is an extension of NaijaSenti, a dataset consisting of four Nigerian languages: Hausa, Yoruba, Igbo, and Nigerian-Pidgin.
This dataset has been expanded to include 10 other African languages and was curated with the help of the following:

| Language | Dataset Curators |
|---|---|
| Algerian Arabic (arq) | Nedjma Ousidhoum, Meriem Beloucif |
| Amharic (amh) | Abinew Ali Ayele, Seid Muhie Yimam |
| Hausa (hau) | Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Ibrahim Said, Bello Shehu Bello |
| Igbo (ibo) | Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Ibrahim Said, Bello Shehu Bello |
| Kinyarwanda (kin) | Samuel Rutunda |
| Moroccan Arabic/Darija (ary) | Oumaima Hourrane |
| Mozambican Portuguese (pt-MZ) | Felermino Dário Mário António Ali |
| Nigerian Pidgin (pcm) | Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Ibrahim Said, Bello Shehu Bello |
| Oromo (orm) | Abinew Ali Ayele, Seid Muhie Yimam, Hagos Tesfahun Gebremichael, Sisay Adugna Chala, Hailu Beshada Balcha, Wendimu Baye Messelle, Tadesse Belay |
| Swahili (swa) | Davis Davis |
| Tigrinya (tir) | Abinew Ali Ayele, Seid Muhie Yimam, Hagos Tesfahun Gebremichael, Sisay Adugna Chala, Hailu Beshada Balcha, Wendimu Baye Messelle, Tadesse Belay |
| Twi (twi) | Salomey Osei, Bernard Opoku, Steven Arthur |
| Xitsonga (tso) | Felermino Dário Mário António Ali |
| Yoruba (yor) | Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Ibrahim Said, Bello Shehu Bello |

### Licensing Information

AfriSenti is licensed under a Creative Commons Attribution 4.0 International License.

### Citation Information

```
@inproceedings{Muhammad2023AfriSentiAT,
  title={AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages},
  author={Shamsuddeen Hassan Muhammad and Idris Abdulmumin and Abinew Ali Ayele and Nedjma Ousidhoum and David Ifeoluwa Adelani and Seid Muhie Yimam and Ibrahim Sa'id Ahmad and Meriem Beloucif and Saif Mohammad and Sebastian Ruder and Oumaima Hourrane and Pavel Brazdil and Felermino Dário Mário António Ali and Davis Davis and Salomey Osei and Bello Shehu Bello and Falalu Ibrahim and Tajuddeen Gwadabe and Samuel Rutunda and Tadesse Belay and Wendimu Baye Messelle and Hailu Beshada Balcha and Sisay Adugna Chala and Hagos Tesfahun Gebremichael and Bernard Opoku and Steven Arthur},
  year={2023}
}
```

```
@article{muhammad2023semeval,
  title={SemEval-2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval)},
  author={Muhammad, Shamsuddeen Hassan and Abdulmumin, Idris and Yimam, Seid Muhie and Adelani, David Ifeoluwa and Ahmad, Ibrahim Sa'id and Ousidhoum, Nedjma and Ayele, Abinew and Mohammad, Saif M and Beloucif, Meriem},
  journal={arXiv preprint arXiv:2304.06845},
  year={2023}
}
```

### Contributions

[More Information Needed]

Provider:

shmuhammad

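Summary of Original Information

AfriSenti Dataset Overview

Dataset Description

Name: AfriSenti
Type: Sentiment analysis dataset
Size: More than 110,000 annotated tweets
Languages: 14 African languages: Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, Yoruba
Purpose: Building sentiment analysis systems for African languages; supports tasks such as sentiment classification, sentiment intensity analysis, and emotion detection
Related tasks: Sentiment analysis, semantic similarity classification
Related event: Used in SemEval 2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval)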

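Dataset Structure

Data instances: Each instance contains a tweet and its sentiment label (positive, negative, or neutral)
Data fields:
- tweet: the tweet text, a string
- label: the sentiment label, a string; possible values include positive, negative, neutral
Data splits: The dataset is divided into training, validation, and test sets, with the following statistics:

Language	Train	Validation	Test	Total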
amh	5,982	1,498	2,000	9,483
arq	1,652	415	959	3,062
hau	14,173	2,678	5,304	22,155
ibo	10,193	1,842	3,683	15,718
...	...	...	...	...
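
Records shaped like the `tweet`/`label` pairs described above usually need their string labels converted to integer ids before training a classifier. The following is a minimal sketch under stated assumptions: the `LABEL2ID` ordering and the helper names are hypothetical (the card lists the label values but no integer encoding), and the sample records are placeholders rather than rows from the dataset.

```python
from collections import Counter

# Assumed encoding for illustration; the card defines no integer ids for labels.
LABEL2ID = {"negative": 0, "neutral": 1, "positive": 2}


def encode_labels(records):
    """Return (tweet, label_id) pairs for records shaped like {"tweet": ..., "label": ...}."""
    return [(r["tweet"], LABEL2ID[r["label"]]) for r in records]


def label_distribution(records):
    """Count how many records carry each label string."""
    return Counter(r["label"] for r in records)


# Hypothetical placeholder records, not taken from AfriSenti.
sample = [
    {"tweet": "@user placeholder text one", "label": "positive"},
    {"tweet": "placeholder text two", "label": "negative"},
    {"tweet": "placeholder text three", "label": "neutral"},
    {"tweet": "placeholder text four", "label": "positive"},
]

print(encode_labels(sample))
print(label_distribution(sample))
```

Checking the label distribution per language and split before training is a cheap way to spot class imbalance, which matters for the smaller languages in the table.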

Usage

The dataset can be loaded with the Hugging Face load_dataset function, either by language or by split (train, validation, or test).

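Dataset Creation

Purpose: Supporting sentiment analysis research for African languages, in particular as part of the SemEval 2023 Task 12 shared task.
Data source: Twitter
Data processing: Tweets were anonymized: @mentions were replaced with @user and URLs were removed.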

License

The dataset is released under the Creative Commons Attribution 4.0 International License.

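Citation Information

Citation information for the dataset is included in the provided README file, covering the related academic papers and conference reports.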
