
Yelp/yelp_review_full

Source: Hugging Face · Updated 2024-01-04 · Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/Yelp/yelp_review_full
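
A minimal sketch of fetching the data programmatically, assuming the Hugging Face `datasets` library is installed and that the repository ID matches the page title (`Yelp/yelp_review_full`). The helper name `load_yelp_split` is hypothetical. If you go through the mirror linked above, setting the `HF_ENDPOINT` environment variable before importing the library is the usual approach, but treat that as an assumption to verify for your setup.

```python
# Sketch only: calling load_yelp_split() needs `pip install datasets`
# and network access. Nothing is downloaded at import time.
DATASET_ID = "Yelp/yelp_review_full"  # repository ID taken from the page title
SPLITS = ("train", "test")            # split names listed in the dataset card


def load_yelp_split(split: str = "train"):
    """Download (or reuse the cached copy of) one split of the dataset."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so this module can be imported without the library.
    from datasets import load_dataset
    return load_dataset(DATASET_ID, split=split)
```

Calling `load_yelp_split("test")` should return a dataset whose rows expose the `label` and `text` fields described in the card; the split name is validated first so a typo fails before any download starts.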
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- other
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- sentiment-classification
pretty_name: YelpReviewFull
license_details: yelp-licence
dataset_info:
  config_name: yelp_review_full
  features:
  - name: label
    dtype:
      class_label:
        names:
          '0': 1 star
          '1': 2 star
          '2': 3 stars
          '3': 4 stars
          '4': 5 stars
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 483811554
    num_examples: 650000
  - name: test
    num_bytes: 37271188
    num_examples: 50000
  download_size: 322952369
  dataset_size: 521082742
configs:
- config_name: yelp_review_full
  data_files:
  - split: train
    path: yelp_review_full/train-*
  - split: test
    path: yelp_review_full/test-*
  default: true
train-eval-index:
- config: yelp_review_full
  task: text-classification
  task_id: multi_class_classification
  splits:
    train_split: train
    eval_split: test
  col_mapping:
    text: text
    label: target
  metrics:
  - type: accuracy
    name: Accuracy
  - type: f1
    name: F1 macro
    args:
      average: macro
  - type: f1
    name: F1 micro
    args:
      average: micro
  - type: f1
    name: F1 weighted
    args:
      average: weighted
  - type: precision
    name: Precision macro
    args:
      average: macro
  - type: precision
    name: Precision micro
    args:
      average: micro
  - type: precision
    name: Precision weighted
    args:
      average: weighted
  - type: recall
    name: Recall macro
    args:
      average: macro
  - type: recall
    name: Recall micro
    args:
      average: micro
  - type: recall
    name: Recall weighted
    args:
      average: weighted
---

# Dataset Card for YelpReviewFull

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Yelp](https://www.yelp.com/dataset)
- **Repository:** [Crepe](https://github.com/zhangxiangxiao/Crepe)
- **Paper:** [Character-level Convolutional Networks for Text Classification](https://arxiv.org/abs/1509.01626)
- **Point of Contact:** [Xiang Zhang](mailto:xiang.zhang@nyu.edu)

### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards

- `text-classification`, `sentiment-classification`: The dataset is mainly used for text classification: given the text, predict the sentiment.

### Languages

The reviews were mainly written in English.

## Dataset Structure

### Data Instances

A typical data point comprises a text and the corresponding label.

An example from the YelpReviewFull test set looks as follows:

```
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

### Data Fields

- 'text': The review texts are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
- 'label': Corresponds to the score associated with the review (between 1 and 5).

### Data Splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 testing samples for each review star from 1 to 5. In total there are 650,000 training samples and 50,000 testing samples.

## Dataset Creation

### Curation Rationale

The Yelp reviews full star dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the Yelp Dataset Challenge 2015. It was first used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

You can check the official [yelp-dataset-agreement](https://s3-media3.fl.yelpcdn.com/assets/srv0/engineering_pages/bea5c1e92bf3/assets/vendor/yelp-dataset-agreement.pdf).

### Citation Information

Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

### Contributions

Thanks to [@hfawaz](https://github.com/hfawaz) for adding this dataset.

Provider:
Yelp
Summary of original information

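Dataset card for YelpReviewFull

Dataset description

Dataset summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

Supported tasks and leaderboards

  • text-classification, sentiment-classification: The dataset is mainly used for text classification: given the text, predict the sentiment.

Languages

The reviews are mainly written in English.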

Dataset structure

Data instances

A typical data point comprises a text and the corresponding label.

An example from the YelpReviewFull test set looks as follows:
{ 'label': 0, 'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!' }
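
The sample above carries `label: 0` although it reads as a one-star review. That is consistent with the card metadata, which declares five class names indexed 0 to 4. A small sketch of that mapping follows; the helper names `label_to_stars` and `describe` are hypothetical, and the class names are copied from the metadata as written there.

```python
# Class names copied from the `class_label` block of the dataset card metadata.
LABEL_NAMES = ["1 star", "2 star", "3 stars", "4 stars", "5 stars"]


def label_to_stars(label: int) -> int:
    """Convert a 0-based class index into a 1-5 star rating."""
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError(f"label must be in 0..{len(LABEL_NAMES) - 1}, got {label}")
    return label + 1


def describe(example: dict) -> str:
    """Return a short human-readable summary of one data point."""
    stars = label_to_stars(example["label"])
    preview = example["text"][:40]
    return f"{stars}/5: {preview}"
```

Keeping the index-to-star conversion in one place avoids off-by-one mistakes when reporting predictions as star ratings.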

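Data fields

  • text: The review texts are escaped using double quotes ("), and any internal double quote is escaped by two double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
  • label: Corresponds to the score associated with the review (between 1 and 5).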
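
The escaping rules for `text` can be undone with plain string replacement. The sketch below is an assumption-laden illustration, not the dataset authors' code: the function name `unescape_review` is hypothetical, and it also handles the backslash-quote form that appears in the sample instance, on the assumption that both quote styles may occur in practice.

```python
def unescape_review(raw: str) -> str:
    """Undo the quote and newline escaping described for the `text` field."""
    # A doubled double quote stands for one literal double quote.
    text = raw.replace('""', '"')
    # Assumption: the sample instance also shows backslash-escaped quotes.
    text = text.replace('\\"', '"')
    # A literal backslash followed by "n" stands for a real line break.
    text = text.replace("\\n", "\n")
    return text
```

Note that a review that legitimately contains two adjacent double quotes would be altered by this sketch, so inspect a few rows before applying it to the whole dataset.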

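Data splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 test samples for each review star from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.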
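
The split totals follow directly from the per-class counts, and because every star class has the same number of samples, a classifier that always predicts one class would score 20% accuracy on the test set. A short arithmetic check (the constant names are illustrative only):

```python
STAR_CLASSES = 5            # ratings from 1 to 5 stars
TRAIN_PER_CLASS = 130_000   # training samples drawn per star rating
TEST_PER_CLASS = 10_000     # test samples drawn per star rating

train_total = STAR_CLASSES * TRAIN_PER_CLASS   # 650,000
test_total = STAR_CLASSES * TEST_PER_CLASS     # 50,000

# With balanced classes, always predicting one class is right 1 time in 5.
constant_baseline_accuracy = 1 / STAR_CLASSES

# Share of all samples that is held out for testing.
test_fraction = test_total / (train_total + test_total)

print(train_total, test_total, constant_baseline_accuracy, round(test_fraction, 4))
```

The 20% figure is a useful floor when reading reported accuracies on this benchmark.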

Dataset creation

Curation rationale

The Yelp reviews full star dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the Yelp Dataset Challenge 2015. It was first used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

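Considerations for using the data

Social impact of dataset

[More Information Needed]

Discussion of biases

[More Information Needed]

Other known limitations

[More Information Needed]

Additional information

Dataset curators

[More Information Needed]

Licensing information

You can check the official yelp-dataset-agreement.

Citation information

Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

Contributions

Thanks to @hfawaz for adding this dataset.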

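AI-collected summary
Dataset introduction
Construction
The YelpReviewFull dataset was constructed by Xiang Zhang, a researcher at New York University, by randomly selecting 130,000 training samples and 10,000 test samples for each star rating from the Yelp Dataset Challenge 2015 data. In total it contains 650,000 training samples and 50,000 test samples. It was first used as a text classification benchmark for character-level convolutional networks, with the goal of predicting the sentiment of a review from its text.
Characteristics
The dataset focuses on sentiment classification of review text and covers the full rating scale from 1 to 5 stars. The review text is escaped to fit the data format. The dataset is monolingual (English), which keeps the language consistent and simplifies processing.
Usage
Users of the YelpReviewFull dataset must follow the official usage agreement provided by Yelp. The dataset provides a training split and a test split, which can be used to train and evaluate standard text classification models. For evaluation, the dataset metadata lists accuracy together with macro, micro, and weighted averages of F1, precision, and recall, so that model performance can be measured from several angles.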
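
To make the averaging modes concrete, here is a dependency-free sketch of accuracy and the three F1 averages for single-label, multi-class predictions. It is an illustration rather than the evaluation code used by any leaderboard; the function names are hypothetical, and treating F1 as 0.0 when a class has no true or predicted samples is an assumed convention that libraries may handle differently.

```python
from collections import Counter


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); taken as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def score_predictions(y_true, y_pred, labels=(0, 1, 2, 3, 4)):
    """Return accuracy plus macro, micro, and weighted F1."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    per_class = {c: f1_from_counts(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)  # number of true samples per class
    n = len(y_true)
    return {
        "accuracy": sum(tp.values()) / n,
        # Macro: unweighted mean of per-class F1.
        "f1_macro": sum(per_class.values()) / len(labels),
        # Micro: F1 computed from counts pooled over all classes.
        "f1_micro": f1_from_counts(
            sum(tp.values()), sum(fp.values()), sum(fn.values())
        ),
        # Weighted: per-class F1 weighted by class support.
        "f1_weighted": sum(per_class[c] * support[c] for c in labels) / n,
    }
```

For single-label classification, micro F1 equals accuracy, and on a perfectly balanced test set such as this one, macro and weighted F1 coincide, so the three averages differ less here than they would on skewed data.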
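Background and challenges
Background overview
The Yelp/yelp_review_full dataset originates from the 2015 Yelp Dataset Challenge. It was constructed by New York University researcher Xiang Zhang and first used in the paper "Character-level Convolutional Networks for Text Classification". It contains consumer reviews collected from the Yelp website and is intended for text classification, in particular sentiment analysis: predicting the star rating a user gave from the review text. It comprises 650,000 training samples and 50,000 test samples covering all ratings from 1 to 5 stars, which makes it a useful reference for research and practice in natural language processing.
Current challenges
The main challenges in constructing the dataset include data quality control and the handling of private information. Label quality directly affects model training, and sensitive personal information in reviews must be handled properly to protect user privacy. On the research side, users of the dataset need to improve the accuracy and robustness of sentiment classification and reduce model sensitivity to noisy data and outliers.
Common scenarios
Classic use case
In natural language processing, the classic use of the Yelp/yelp_review_full dataset is text classification, especially sentiment analysis. It provides a large volume of user review text with corresponding star labels, on which researchers can train models to identify the sentiment expressed in text and so automate sentiment analysis.
Derived and related work
Many well-known works build on the Yelp/yelp_review_full dataset, such as research on applying character-level convolutional neural networks to text classification. This work advanced deep learning for text processing and supported theoretical and technical progress in related areas such as sentiment analysis and natural language understanding.
Recent research on the dataset
Latest research directions
As a foundational dataset in text classification, Yelp/yelp_review_full has recently been used mainly in research on fine-tuning deep learning models and on multimodal fusion. Researchers aim at fine-grained sentiment analysis that not only identifies positive or negative reviews but also distinguishes subtle differences in sentiment. Analyses that combine user profiles with review time series also offer new perspectives for sentiment prediction. This research has practical value for improving the evaluation of online service quality and for more targeted user services.
The content above was collected and summarized by AI.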