katossky/multi-domain-sentiment

Source: Hugging Face. Updated 2022-11-11. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/katossky/multi-domain-sentiment
Description:
License: unknown

This sentiment dataset was used in the paper:

John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association for Computational Linguistics (ACL), 2007.

If you use this data for research or a publication, the author asks that you cite the above paper as the reference for the data and inform him of the reuse.

The Multi-Domain Sentiment Dataset contains product reviews from Amazon.com in 4 product types (domains): Kitchen, Books, DVDs, and Electronics. Each domain has several thousand reviews, but the exact number varies by domain. Reviews carry star ratings (1 to 5 stars) that can be converted into binary labels if needed.

The directory contains 3 files: positive.review, negative.review and unlabeled.review. Although the positive and negative files contain positive and negative reviews respectively, they are not necessarily the splits the authors used in the experiments: the authors drew randomly from the three files, ignoring the file names.

Each file uses a pseudo-XML scheme to encode the reviews. Most of the fields are self-explanatory. The reviews have a "unique ID" field that is not actually unique. If a review has two unique ID fields, ignore the one containing only a number.
Provider:
katossky
Summary of original information

Dataset overview

Dataset name

Multi-Domain Sentiment Dataset

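Dataset contents

The dataset contains product reviews from Amazon.com covering four product types (domains): Kitchen, Books, DVDs, and Electronics. Each domain contains several thousand reviews; the exact number varies by domain. Reviews carry ratings of 1 to 5 stars, which can be converted into binary labels if needed.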
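
The rating-to-label conversion mentioned above could look like the sketch below. The threshold is an assumption, not something this page states: ratings above 3 stars are treated as positive, ratings below 3 as negative, and 3-star reviews as ambiguous. The function name `rating_to_label` is hypothetical.

```python
from typing import Optional


def rating_to_label(rating: float) -> Optional[int]:
    """Map a 1-5 star rating to a binary label.

    Assumed convention (not stated on this page): more than 3 stars is
    positive (1), fewer than 3 stars is negative (0), and exactly 3 stars
    is ambiguous (None).
    """
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    if rating > 3:
        return 1
    if rating < 3:
        return 0
    return None


ratings = [5.0, 4.0, 3.0, 2.0, 1.0]
labels = [rating_to_label(r) for r in ratings]
```

Returning None for 3-star reviews leaves it to the caller to decide whether to drop them or handle them separately.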

Dataset files

The dataset directory contains three files:

  • positive.review: contains positive reviews.
  • negative.review: contains negative reviews.
  • unlabeled.review: contains unlabeled reviews.
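
Because the file names do not necessarily match the splits used in the paper's experiments, one option is to pool the reviews from all three files and draw a random split. This is a minimal sketch that assumes the reviews have already been parsed into a list. The function name `random_split`, the 80/20 ratio, and the seed are illustrative choices, not the authors' procedure.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(items: Sequence[T], train_fraction: float = 0.8,
                 seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a pooled list of reviews and cut it into two parts."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(items)                 # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for reviews parsed from the three files.
pooled = [f"review-{i}" for i in range(10)]
train, held_out = random_split(pooled)
```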

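Dataset usage

The author asks that, if this data is used in research or a publication, the paper above be cited as the source of the data and the author be informed of the reuse.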

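Dataset format

The reviews in each file are encoded in a pseudo-XML format. Most fields are self-explanatory. Each review contains a "unique ID" field, but it may not be unique. If a review has two unique ID fields, ignore the one that contains only a number.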
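
A minimal parsing sketch for the pseudo-XML described above. The tag names (`<review>`, `<unique_id>`, `<rating>`, `<review_text>`) and the sample text are assumptions for illustration. This page does not list the actual fields, so check them against the files. Because the format is not guaranteed to be well-formed XML, the sketch uses regular expressions rather than an XML parser, and it keeps the unique ID that is not purely numeric.

```python
import re
from typing import Dict, List, Optional

# Assumed layout: each review is wrapped in <review>...</review> and
# each field in <name>...</name>. These tag names are hypothetical.
REVIEW_RE = re.compile(r"<review>(.*?)</review>", re.DOTALL)
FIELD_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def pick_unique_id(candidates: List[str]) -> Optional[str]:
    """Prefer an ID that is not purely numeric; fall back to the first one."""
    for candidate in candidates:
        if not candidate.isdigit():
            return candidate
    return candidates[0] if candidates else None


def parse_reviews(text: str) -> List[Dict[str, object]]:
    """Extract the ID, rating, and text of every review block in `text`."""
    reviews = []
    for block in REVIEW_RE.findall(text):
        fields: Dict[str, List[str]] = {}
        for name, value in FIELD_RE.findall(block):
            fields.setdefault(name, []).append(value.strip())
        ratings = fields.get("rating", [])
        texts = fields.get("review_text", [])
        reviews.append({
            "unique_id": pick_unique_id(fields.get("unique_id", [])),
            "rating": float(ratings[0]) if ratings else None,
            "text": texts[0] if texts else "",
        })
    return reviews


# Made-up sample in the assumed layout, with two unique ID fields.
sample = """
<review>
<unique_id>
12345
</unique_id>
<unique_id>
B000EXAMPLE:great_blender:reviewer
</unique_id>
<rating>
5.0
</rating>
<review_text>
Works well.
</review_text>
</review>
"""
parsed = parse_reviews(sample)
```

To apply it to a real file, read the file's contents into a string and pass that string to `parse_reviews`.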
