
fewshot-goes-multilingual/cs_mall-product-reviews

Hugging Face | Updated 2022-12-20 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/fewshot-goes-multilingual/cs_mall-product-reviews
Resource description:
---
annotations_creators:
- found
language:
- cs
language_creators:
- found
license:
- cc-by-nc-sa-3.0
multilinguality:
- monolingual
pretty_name: Mall.cz Product Reviews
size_categories:
- 10K<n<100K
source_datasets:
- original
tags: []
task_categories:
- text-classification
task_ids:
- sentiment-classification
---

# Dataset Card for Mall.cz Product Reviews (Czech)

## Dataset Description

The dataset contains user reviews from the Czech e-shop <mall.cz>. Each review contains text, sentiment (positive/negative/neutral), and a language label (mostly Czech, occasionally Slovak) detected automatically using [lingua-py](https://github.com/pemistahl/lingua-py).

The dataset has 30,000 reviews in total (train + validation + test). The data is balanced: the train set has 8000 positive, 8000 neutral and 8000 negative reviews, and the validation and test sets each have 1000 positive, 1000 neutral and 1000 negative reviews.

## Dataset Features

Each sample contains:
- `review_id`: unique string identifier of the review.
- `rating_str`: string representation of the rating - "pozitivní" / "neutrální" / "negativní"
- `rating_int`: integer representation of the rating (1=positive, 0=neutral, -1=negative)
- `comment_language`: language of the review (mostly "cs", occasionally "sk")
- `comment`: the string of the review

## Dataset Source

The data is a processed adaptation of the [Mall CZ corpus](https://liks.fav.zcu.cz/sentiment/). The adaptation is label-balanced and adds automatically-detected language.
Provided by:
fewshot-goes-multilingual
Summary of original information

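Dataset overview

Basic information

  • Name: Mall.cz Product Reviews
  • Language: Czech (mainly), Slovak (occasionally)
  • License: CC-BY-NC-SA-3.0
  • Multilinguality: monolingual
  • Size: 10K<n<100K

Dataset description

  • Source: user reviews from the Czech e-commerce site <mall.cz>
  • Content: text, sentiment (positive/negative/neutral), and automatically detected language (mainly Czech, occasionally Slovak)
  • Total: 30,000 reviews across all splits; the data is label-balanced
  • Splits:
    • Training set: 8000 positive, 8000 neutral, 8000 negative
    • Validation set: 1000 positive, 1000 neutral, 1000 negative
    • Test set: 1000 positive, 1000 neutral, 1000 negative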
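
The split sizes listed above can be cross-checked against the stated total of 30,000 with a little arithmetic. The following is a minimal sketch: the names `SPLIT_COUNTS`, `split_totals` and `is_balanced` are illustrative, and the split keys (`train`, `validation`, `test`) are assumed rather than confirmed by this page.

```python
# Per-class review counts for each split, as stated in the dataset description.
# The split keys ("train", "validation", "test") are assumed names.
SPLIT_COUNTS = {
    "train": {"positive": 8000, "neutral": 8000, "negative": 8000},
    "validation": {"positive": 1000, "neutral": 1000, "negative": 1000},
    "test": {"positive": 1000, "neutral": 1000, "negative": 1000},
}


def split_totals(counts):
    """Return the number of reviews in each split."""
    return {split: sum(per_class.values()) for split, per_class in counts.items()}


def is_balanced(per_class):
    """A split is balanced when every class has the same number of reviews."""
    return len(set(per_class.values())) == 1


totals = split_totals(SPLIT_COUNTS)
grand_total = sum(totals.values())
print(totals)       # reviews per split
print(grand_total)  # reviews across all splits
```

With these counts the training split holds 24,000 reviews and the validation and test splits hold 3,000 each, which adds up to the 30,000 reviews reported.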

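Dataset features

  • Each sample contains:
    • review_id: unique string identifier of the review
    • rating_str: string representation of the rating - "pozitivní" / "neutrální" / "negativní"
    • rating_int: integer representation of the rating (1=positive, 0=neutral, -1=negative)
    • comment_language: language of the review (mainly "cs", occasionally "sk")
    • comment: text of the review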
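
The field list above implies a fixed correspondence between `rating_str` and `rating_int`. The sketch below checks one sample for that consistency. The helper `validate_review`, the constant names, and the example record are hypothetical and written only for illustration; the record is not a real row from the dataset. The language check assumes only "cs" and "sk" occur, which the description suggests but does not guarantee.

```python
# Mapping between the two rating fields, taken from the field descriptions.
RATING_STR_TO_INT = {"pozitivní": 1, "neutrální": 0, "negativní": -1}

# Language codes mentioned for `comment_language` (assumed to be exhaustive).
EXPECTED_LANGUAGES = {"cs", "sk"}

REQUIRED_FIELDS = ("review_id", "rating_str", "rating_int", "comment_language", "comment")


def validate_review(sample):
    """Return a list of problems found in one sample (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in sample:
            problems.append(f"missing field: {field}")
    if problems:
        return problems
    expected_int = RATING_STR_TO_INT.get(sample["rating_str"])
    if expected_int is None:
        problems.append(f"unknown rating_str: {sample['rating_str']!r}")
    elif expected_int != sample["rating_int"]:
        problems.append("rating_str and rating_int disagree")
    if sample["comment_language"] not in EXPECTED_LANGUAGES:
        problems.append(f"unexpected language: {sample['comment_language']!r}")
    return problems


# Made-up sample for illustration only; it is not a real row from the dataset.
example = {
    "review_id": "example-0001",
    "rating_str": "pozitivní",
    "rating_int": 1,
    "comment_language": "cs",
    "comment": "Výborný produkt.",
}
print(validate_review(example))
```

A check like this could be run over every row after loading the data, for instance to confirm that the string and integer labels never disagree before training a sentiment classifier.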

Dataset source

  • Original data: Mall CZ corpus
  • Processing: the data was label-balanced, and automatically detected language information was added