fewshot-goes-multilingual/cs_mall-product-reviews
Hugging Face | Updated 2022-12-20 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/fewshot-goes-multilingual/cs_mall-product-reviews
Resource description:
---
annotations_creators:
- found
language:
- cs
language_creators:
- found
license:
- cc-by-nc-sa-3.0
multilinguality:
- monolingual
pretty_name: Mall.cz Product Reviews
size_categories:
- 10K<n<100K
source_datasets:
- original
tags: []
task_categories:
- text-classification
task_ids:
- sentiment-classification
---
# Dataset Card for Mall.cz Product Reviews (Czech)
## Dataset Description
The dataset contains user reviews from the Czech e-shop <mall.cz>.
Each review contains text, a sentiment label (positive/negative/neutral), and a language tag (mostly Czech, occasionally Slovak) detected automatically with [lingua-py](https://github.com/pemistahl/lingua-py).
The dataset contains 30,000 reviews in total (train + validation + test) and is label-balanced.
The train set has 8,000 positive, 8,000 neutral and 8,000 negative reviews.
The validation and test sets each have 1,000 positive, 1,000 neutral and 1,000 negative reviews.
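As a quick sanity check, the split sizes stated above can be summed to confirm the 30,000 total. This is a minimal sketch using only the numbers from this card. The split keys (`train`, `validation`, `test`) and the helper names are assumptions made for illustration, not identifiers confirmed by the dataset itself.

```python
# Per-split, per-label counts exactly as stated in this dataset card.
# The split key names are assumed for illustration.
SPLIT_LABEL_COUNTS = {
    "train": {"positive": 8000, "neutral": 8000, "negative": 8000},
    "validation": {"positive": 1000, "neutral": 1000, "negative": 1000},
    "test": {"positive": 1000, "neutral": 1000, "negative": 1000},
}


def split_sizes(counts):
    """Return the number of reviews in each split."""
    return {split: sum(labels.values()) for split, labels in counts.items()}


def is_balanced(labels):
    """A split is balanced when every label has the same count."""
    return len(set(labels.values())) == 1


sizes = split_sizes(SPLIT_LABEL_COUNTS)
total = sum(sizes.values())

print(sizes)
print(total)
```

The same two helpers could be pointed at label counts computed from the downloaded data to confirm that the published files match the card.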
## Dataset Features
Each sample contains:
- `review_id`: unique string identifier of the review.
- `rating_str`: string representation of the rating - "pozitivní" / "neutrální" / "negativní"
- `rating_int`: integer representation of the rating (1=positive, 0=neutral, -1=negative)
- `comment_language`: language of the review (mostly "cs", occasionally "sk")
- `comment`: text of the review
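The field list above can be turned into a small consistency check. The sketch below assumes each sample is a plain Python dict carrying the five documented fields. The helper names (`is_consistent`, `czech_only`, `load_split`) and the example record are hypothetical. `load_split` relies on the Hugging Face `datasets` library and assumes the split names `train`, `validation` and `test`.

```python
# Label mapping as documented in the feature list above.
RATING_INT_TO_STR = {1: "pozitivní", 0: "neutrální", -1: "negativní"}
EXPECTED_FIELDS = {"review_id", "rating_str", "rating_int",
                   "comment_language", "comment"}


def is_consistent(sample):
    """True if all documented fields exist and the two rating fields agree."""
    if not EXPECTED_FIELDS <= set(sample):
        return False
    return RATING_INT_TO_STR.get(sample["rating_int"]) == sample["rating_str"]


def czech_only(samples):
    """Keep only the reviews whose detected language is Czech ("cs")."""
    return [s for s in samples if s["comment_language"] == "cs"]


def load_split(split="train"):
    """Load one split from the Hub (needs `datasets` and network access).

    The split names are an assumption; adjust them if the repository differs.
    """
    from datasets import load_dataset  # imported lazily, optional dependency
    return load_dataset(
        "fewshot-goes-multilingual/cs_mall-product-reviews", split=split
    )


# Hypothetical record for illustration only; NOT taken from the dataset.
example = {
    "review_id": "example-0001",
    "rating_str": "pozitivní",
    "rating_int": 1,
    "comment_language": "cs",
    "comment": "Ukázkový text recenze.",
}

print(is_consistent(example))
```

Filtering on `comment_language` may be useful when a strictly Czech corpus is needed, since the card notes that a few reviews are in Slovak.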
## Dataset Source
The data is a processed adaptation of the [Mall CZ corpus](https://liks.fav.zcu.cz/sentiment/).
The adaptation is label-balanced and adds an automatically detected language tag to each review.
Provider:
fewshot-goes-multilingual
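Summary of original information
Dataset overview
Basic information
- Name: Mall.cz Product Reviews
- Language: Czech (mainly), Slovak (occasionally)
- License: CC-BY-NC-SA-3.0
- Multilinguality: monolingual
- Size: 10K<n<100K
Dataset description
- Source: user reviews from the Czech e-shop <mall.cz>
- Content: text, sentiment (positive/negative/neutral), and automatically detected language (mainly Czech, occasionally Slovak)
- Total: 30,000 reviews, label-balanced
- Splits:
  - Train: 8,000 positive, 8,000 neutral, 8,000 negative
  - Validation: 1,000 positive, 1,000 neutral, 1,000 negative
  - Test: 1,000 positive, 1,000 neutral, 1,000 negative
Dataset features
- Each sample contains:
  - `review_id`: unique string identifier of the review
  - `rating_str`: string representation of the rating: "pozitivní" / "neutrální" / "negativní"
  - `rating_int`: integer representation of the rating (1=positive, 0=neutral, -1=negative)
  - `comment_language`: language of the review (mostly "cs", occasionally "sk")
  - `comment`: text of the review
Dataset source
- Original data: Mall CZ corpus
- Processing: the data was label-balanced and an automatically detected language field was added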



