ccosme/SentiTaglishProductsAndServices
Hugging Face | Updated 2024-05-08 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/ccosme/SentiTaglishProductsAndServices
Resource description:
---
license: cc-by-4.0
task_categories:
- text-classification
- zero-shot-classification
language:
- en
- tl
tags:
- code-switching
- sentiment analysis
- low-resource languages
- Taglish
- Tagalog
- Filipino
size_categories:
- 10K<n<100K
---
# Dataset Card for Sentiment-Annotated Taglish Product and Service Reviews (SentiTaglish: Products and Services)
## Dataset Summary
Sentiment-Annotated Taglish Product and Service Reviews (SentiTaglish: Products and Services) is a gold-standard, sentiment-annotated corpus for the Tagalog-English language pair. It contains 10,510 product and service reviews that were manually labeled by three human annotators according to four sentiment classes: Positive, Negative, Neutral, and Mixed.
## Supported Tasks and Leaderboards
Sentiment classification of multilingual text with code-switching / code-mixing.
## Languages
- Tagalog
- English
## Dataset Structure
### Data Fields
* `review`: a string containing the body of the review
* `sentiment`: an integer containing the label encoding of the gold-truth label provided by the human annotators
### Label encoding
* 1 - Negative
* 2 - Neutral
* 3 - Positive
* 4 - Mixed
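
The label encoding above can be turned into a small lookup table. The following is a minimal sketch in Python, not part of the original card: the names `ID2LABEL`, `LABEL2ID`, and `decode_sentiment` are hypothetical, and the sample rows are invented for illustration rather than taken from the dataset.

```python
from collections import Counter

# Label encoding as listed in the dataset card.
ID2LABEL = {1: "Negative", 2: "Neutral", 3: "Positive", 4: "Mixed"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode_sentiment(record):
    """Return a copy of the record with a readable `sentiment_name` field added."""
    return {**record, "sentiment_name": ID2LABEL[record["sentiment"]]}


# Hypothetical rows for illustration only; these are NOT taken from the dataset.
sample_rows = [
    {"review": "Ang bilis ng delivery, very satisfied!", "sentiment": 3},
    {"review": "Sira yung item pagdating, so disappointing.", "sentiment": 1},
    {"review": "Okay yung quality pero late dumating.", "sentiment": 4},
]

decoded = [decode_sentiment(row) for row in sample_rows]

# Count how often each sentiment class occurs in the sample.
label_counts = Counter(row["sentiment_name"] for row in decoded)
print(label_counts)
```

Note that the encoding starts at 1. Training code that expects class indices starting at 0 would need to subtract 1 from `sentiment` before use.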
## Dataset Creation and Annotation
The dataset was created from publicly available online service and product reviews on Google Maps Reviews and Shopee Philippines. Only the rating and review fields were collected and stored.
Three annotators were tasked with manually labeling the dataset. The first two annotators labeled the same full set of reviews. Any disagreements were sent to a third annotator.
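
The three-annotator workflow can be sketched as a simple adjudication rule. This is an illustrative assumption, not the authors' documented procedure: the card does not spell out how the final label was derived, so the sketch assumes the third annotator's label is taken as final whenever the first two disagree. The function name `resolve_label` and the example annotations are hypothetical.

```python
def resolve_label(first, second, third=None):
    """Return the agreed label, or the third annotator's label on disagreement.

    Assumption: the third annotator's decision is final. The dataset card only
    states that disagreements were sent to a third annotator.
    """
    if first == second:
        return first
    if third is None:
        raise ValueError("a disagreement needs a third annotation to be resolved")
    return third


# Hypothetical (first, second, third) annotations using the card's label encoding.
annotations = [
    (3, 3, None),  # both annotators agree: Positive
    (1, 2, 1),     # disagreement, third annotator chooses Negative
    (4, 2, 4),     # disagreement, third annotator chooses Mixed
]

final_labels = [resolve_label(a, b, c) for a, b, c in annotations]
print(final_labels)
```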
The validity and reliability of the dataset are supported by strong agreement among the human annotators. Fleiss' kappa was used to measure the agreement between the annotations of the first two taggers, while Krippendorff's alpha was used to measure the agreement among all three annotators. The two measures indicated strong agreement, at 0.82 and 0.83, respectively. In addition, a strong positive correlation between user ratings and manual annotations was noted.
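
For readers unfamiliar with the agreement statistic, here is a minimal sketch of the textbook Fleiss' kappa formula in plain Python. It is not the authors' evaluation code. The function name `fleiss_kappa` and the toy ratings are hypothetical, and the toy data will not reproduce the reported 0.82. Krippendorff's alpha is not shown.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Compute Fleiss' kappa.

    ratings: list of items, each a sequence holding one label per rater.
             Every item must have the same number of raters (at least two).
    categories: all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Per-item counts of how many raters chose each category.
    counts = [Counter(item) for item in ratings]

    # Observed agreement per item, then averaged over all items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement by chance, from the overall category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)

    # Undefined (division by zero) when all ratings fall in a single category.
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels from two annotators on six reviews (card's label encoding).
toy_ratings = [(1, 1), (2, 2), (3, 3), (4, 4), (1, 2), (3, 3)]
kappa = fleiss_kappa(toy_ratings, categories=[1, 2, 3, 4])
print(round(kappa, 3))
```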
## Personal and Sensitive Information
No personal information was collected or stored.
## Licensing Information
The SentiTaglish: Products and Services data set version 1.0 is released under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
## Citation Information
TBA
Provider:
ccosme
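Summary of original information
Dataset overview
Dataset name
Sentiment-Annotated Taglish Product and Service Reviews (SentiTaglish: Products and Services)
Dataset description
SentiTaglish: Products and Services is a gold-standard, sentiment-annotated corpus for the Tagalog-English language pair. It contains 10,510 product and service reviews, manually labeled by three human annotators according to four sentiment classes (Positive, Negative, Neutral, Mixed).
Supported tasks
- Sentiment classification of multilingual text, particularly text with code-switching / code-mixing.
Languages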
- Tagalog
- English
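Dataset structure
Data fields
- review: a string containing the body of the review
- sentiment: an integer containing the encoding of the gold-standard label provided by the human annotators
Label encoding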
- 1 - Negative
- 2 - Neutral
- 3 - Positive
- 4 - Mixed
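Dataset creation and annotation
The dataset was created by collecting publicly available online service and product reviews from Google Maps Reviews and Shopee Philippines. Three annotators manually labeled the dataset: the first two labeled the same full set of reviews, and any disagreements were resolved by the third annotator. Inter-annotator agreement was measured with Fleiss' kappa and Krippendorff's alpha, which indicated strong agreement at 0.82 and 0.83, respectively.
Licensing information
Version 1.0 of the dataset is released under the CC-BY-4.0 license.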



