
HausaNLP/Naija-Stopwords

Hugging Face, updated 2023-06-18, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/HausaNLP/Naija-Stopwords
Resource description:
---
license: cc-by-nc-sa-4.0
tags:
- sentiment analysis, Twitter, tweets
- stopwords
multilinguality:
- monolingual
- multilingual
language:
- hau
- ibo
- pcm
- yor
pretty_name: NaijaStopwords
---

# Naija-Stopwords

Naija-Stopwords is part of the [Naija-Senti](https://huggingface.co/datasets/HausaNLP/NaijaSenti-Twitter) project. It is a collection of stopword lists for the four most widely spoken languages in Nigeria: Hausa, Igbo, Nigerian-Pidgin, and Yorùbá.

--------------------------------------------------------------------------------

## Dataset Description

- **Homepage:** https://github.com/hausanlp/NaijaSenti/tree/main/data/stopwords
- **Repository:** [GitHub](https://github.com/hausanlp/NaijaSenti/tree/main/data/stopwords)
- **Paper:** [NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis](https://aclanthology.org/2022.lrec-1.63/)
- **Leaderboard:** N/A
- **Point of Contact:** [Shamsuddeen Hassan Muhammad](mailto:shamsuddeen2004@gmail.com)

### Languages

The four most widely spoken Nigerian languages:

* Hausa (hau)
* Igbo (ibo)
* Nigerian Pidgin (pcm)
* Yoruba (yor)

## Dataset Structure

### Data Instances

Each instance is a single stopword in one of the four languages.

```
{
  "word": "string"
}
```

### How to use it

```python
from datasets import load_dataset

# You can load a specific language (e.g., Hausa).
# This downloads the train, validation, and test sets.
ds = load_dataset("HausaNLP/Naija-Stopwords", "hau")
```

## Additional Information

### Dataset Curators

* Shamsuddeen Hassan Muhammad
* Idris Abdulmumin
* Ibrahim Said Ahmad
* Bello Shehu Bello

### Licensing Information

The Naija-Stopwords dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

### Citation Information

```
@inproceedings{muhammad-etal-2022-naijasenti,
    title = "{N}aija{S}enti: A {N}igerian {T}witter Sentiment Corpus for Multilingual Sentiment Analysis",
    author = "Muhammad, Shamsuddeen Hassan and
      Adelani, David Ifeoluwa and
      Ruder, Sebastian and
      Ahmad, Ibrahim Sa{'}id and
      Abdulmumin, Idris and
      Bello, Bello Shehu and
      Choudhury, Monojit and
      Emezue, Chris Chinenye and
      Abdullahi, Saheed Salahudeen and
      Aremu, Anuoluwapo and
      Jorge, Al{\'\i}pio and
      Brazdil, Pavel",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.63",
    pages = "590--602",
}
```

### Contributions

> This work was carried out with support from Lacuna Fund, an initiative co-founded by The Rockefeller Foundation, Google.org, and Canada’s International Development Research Centre. The views expressed herein do not necessarily represent those of Lacuna Fund, its Steering Committee, its funders, or Meridian Institute.
Provider:
HausaNLP
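Summary of original information

Dataset overview

Dataset name

  • Naija-Stopwords

Dataset description

  • Naija-Stopwords is part of the Naija-Senti project. It collects stopwords from the four most widely spoken languages in Nigeria: Hausa, Igbo, Nigerian-Pidgin, and Yorùbá.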

Dataset details

  • License: cc-by-nc-sa-4.0
  • Tags: sentiment analysis, Twitter, tweets, stopwords
  • Multilinguality: monolingual, multilingual
  • Languages: hau, ibo, pcm, yor
  • Pretty name: NaijaStopwords

Dataset structure

  • Data instances: lists of stopwords for each of the four languages.
  • Data format: JSON, { "word": "string" }

How to use

from datasets import load_dataset

# Load the dataset for a specific language (e.g., Hausa)

ds = load_dataset("HausaNLP/Naija-Stopwords", "hau")
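
Once a language configuration is loaded, the records can be turned into a lookup set and used to filter tokens. The following is a minimal sketch under stated assumptions: the helper names `build_stopword_set` and `remove_stopwords` are hypothetical, and the sample records are illustrative placeholders, not actual entries from the dataset. It relies only on the `{ "word": "string" }` record shape described above.

```python
from typing import Iterable, List, Mapping, Set


def build_stopword_set(records: Iterable[Mapping[str, str]]) -> Set[str]:
    """Collect the "word" field of each record into a lowercase set."""
    return {r["word"].strip().lower() for r in records if r.get("word")}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]


# Placeholder records in the documented shape (NOT taken from the dataset).
sample_records = [{"word": "da"}, {"word": "a"}, {"word": "ne"}]
stopwords = build_stopword_set(sample_records)

# Naive whitespace tokenization, used here only for illustration.
tokens = "ina son karatu da rubutu".split()
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

With real data, the rows of whichever split `load_dataset` returns could be passed to `build_stopword_set` in place of `sample_records`. A set is used so that each membership check stays constant-time regardless of list size.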

Dataset languages

  • Hausa (hau)
  • Igbo (ibo)
  • Nigerian Pidgin (pcm)
  • Yoruba (yor)
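
The language codes above can double as a small validation table before loading. This is a sketch only: it assumes each code is also the configuration name accepted by `load_dataset`, which the source confirms for `hau` but not explicitly for the other three. The function `load_stopwords` and its `loader` parameter are hypothetical names introduced for illustration.

```python
# Language codes listed above, mapped to their language names.
LANGUAGE_CONFIGS = {
    "hau": "Hausa",
    "ibo": "Igbo",
    "pcm": "Nigerian Pidgin",
    "yor": "Yoruba",
}


def load_stopwords(code, loader=None):
    """Load one language configuration, rejecting unknown codes early.

    `loader` defaults to `datasets.load_dataset`; passing a different
    callable makes the function usable without network access.
    """
    if code not in LANGUAGE_CONFIGS:
        known = ", ".join(sorted(LANGUAGE_CONFIGS))
        raise ValueError(f"Unknown language code {code!r}; expected one of: {known}")
    if loader is None:
        # Imported lazily so the validation logic works without the library.
        from datasets import load_dataset
        loader = load_dataset
    return loader("HausaNLP/Naija-Stopwords", code)
```

Calling `load_stopwords("hau")` would then behave like the usage line shown earlier, while a code such as `"eng"` fails with a clear error before any download is attempted.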

License information

  • The Naija-Stopwords dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Citation information

@inproceedings{muhammad-etal-2022-naijasenti,
    title = "{N}aija{S}enti: A {N}igerian {T}witter Sentiment Corpus for Multilingual Sentiment Analysis",
    author = "Muhammad, Shamsuddeen Hassan and
      Adelani, David Ifeoluwa and
      Ruder, Sebastian and
      Ahmad, Ibrahim Sa{'}id and
      Abdulmumin, Idris and
      Bello, Bello Shehu and
      Choudhury, Monojit and
      Emezue, Chris Chinenye and
      Abdullahi, Saheed Salahudeen and
      Aremu, Anuoluwapo and
      Jorge, Al{\'\i}pio and
      Brazdil, Pavel",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.63",
    pages = "590--602",
}
