hzang97/tomtat

Name: hzang97/tomtat
Creator: hzang97
Published: 2024-06-04 17:54:25
License: No description available

Source: Hugging Face. Updated: 2024-06-04. Indexed: 2024-06-12.

Download link:

https://hf-mirror.com/datasets/hzang97/tomtat


Resource description:

---
license: unknown
---

---
annotations_creators:
- found
language_creators:
- found
language:
- vi
license:
- other
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
- 1M<n<10M
source_datasets:
- original
task_categories:
- summarization
- text-generation
- fill-mask
- text-classification
task_ids:
- text-scoring
- language-modeling
- masked-language-modeling
- sentiment-classification
- sentiment-scoring
- topic-classification
paperswithcode_id: null
pretty_name: The Vietnamese Amazon Reviews Corpus
dataset_info:
- config_name: vi
  features:
  - name: review_id
    dtype: string
  - name: product_id
    dtype: string
  - name: reviewer_id
    dtype: string
  - name: stars
    dtype: int32
  - name: review_body
    dtype: string
  - name: review_title
    dtype: string
  - name: language
    dtype: string
  - name: product_category
    dtype: string
  splits:
  - name: train
    num_bytes: 364405048
    num_examples: 1200000
  - name: validation
    num_bytes: 9047533
    num_examples: 30000
  - name: test
    num_bytes: 9099141
    num_examples: 30000
  download_size: 640320386
  dataset_size: 382551722
config_names:
- vi
viewer: false
---

# Dataset Card for The Vietnamese Amazon Reviews Corpus

## Table of Contents

- [Dataset Card for The Vietnamese Amazon Reviews Corpus](#dataset-card-for-the-vietnamese-amazon-reviews-corpus)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
      - [plain_text](#plain_text)
    - [Data Fields](#data-fields)
      - [plain_text](#plain_text-1)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Webpage:** https://registry.opendata.aws/amazon-reviews-ml/
- **Paper:** https://arxiv.org/abs/2010.02573
- **Point of Contact:** [multilingual-reviews-dataset@amazon.com](mailto:multilingual-reviews-dataset@amazon.com)

### Dataset Summary

We provide an Amazon product reviews dataset for text classification in Vietnamese. The dataset contains reviews in Vietnamese, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews.

For Vietnamese, there are 1,200,000, 30,000 and 30,000 reviews in the training, development, and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset contains reviews in Vietnamese.

## Dataset Structure

### Data Instances

Each data instance corresponds to a review. The original JSON for an instance looks like so (Vietnamese example):

```json
{
  "review_id": "vi_1234567",
  "product_id": "product_vi_1234567",
  "reviewer_id": "reviewer_vi_1234567",
  "stars": "5",
  "review_body": "Sản phẩm rất tốt, tôi rất hài lòng với chất lượng.",
  "review_title": "Rất hài lòng",
  "language": "vi",
  "product_category": "electronics"
}
```
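The record layout and the length limits stated in the summary can be checked with a small parser. The following is a minimal sketch, not part of the dataset's tooling: the function name `parse_review` and the constants are hypothetical, the 1 to 5 star range is an assumption inferred from "each star rating constitutes 20%", and lengths are counted in Python code points, which may not match how the corpus authors counted characters. Note that the JSON example stores `stars` as a string while the card declares `int32`, so the sketch converts it.

```python
import json

# Field names taken from the card's feature list.
EXPECTED_FIELDS = {
    "review_id", "product_id", "reviewer_id", "stars",
    "review_body", "review_title", "language", "product_category",
}
MIN_BODY_CHARS = 20    # card: reviews are at least 20 characters long
MAX_BODY_CHARS = 2000  # card: reviews are truncated after 2,000 characters


def parse_review(raw: str) -> dict:
    """Parse one JSON review and check it against the documented limits."""
    record = json.loads(raw)
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # The raw JSON carries stars as a string; the card declares int32.
    record["stars"] = int(record["stars"])
    if not 1 <= record["stars"] <= 5:  # assumed range, see lead-in
        raise ValueError(f"stars out of range: {record['stars']}")
    body_len = len(record["review_body"])
    if not MIN_BODY_CHARS <= body_len <= MAX_BODY_CHARS:
        raise ValueError(f"review_body length {body_len} outside documented limits")
    return record


raw_example = """
{
  "review_id": "vi_1234567",
  "product_id": "product_vi_1234567",
  "reviewer_id": "reviewer_vi_1234567",
  "stars": "5",
  "review_body": "Sản phẩm rất tốt, tôi rất hài lòng với chất lượng.",
  "review_title": "Rất hài lòng",
  "language": "vi",
  "product_category": "electronics"
}
"""

review = parse_review(raw_example)
print(review["stars"], review["product_category"])
```

Raising `ValueError` on the first violation keeps the sketch short; a real loader might prefer to collect all problems per record before reporting.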

Provided by:

hzang97

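Summary of original information

Dataset overview

Dataset name

Name: The Vietnamese Amazon Reviews Corpus
Alias (Chinese): 越南语亚马逊评论语料库

Dataset description

Language: Vietnamese
Content: Amazon product reviews in Vietnamese, collected between November 1, 2015 and November 1, 2019. Each record includes the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category.
Data balance: each star rating accounts for 20% of the reviews.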
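The 20% balance translates directly into expected review counts per star rating. The sketch below works this out for the split sizes given on this page. It rests on two assumptions that the card does not state explicitly: that there are five star levels, and that each split is balanced on its own rather than only the corpus as a whole. The names `SPLIT_SIZES` and `per_star_counts` are illustrative.

```python
# Split sizes as listed on this page.
SPLIT_SIZES = {"train": 1_200_000, "validation": 30_000, "test": 30_000}
NUM_STAR_LEVELS = 5  # assumption: 20% per rating implies five ratings


def per_star_counts(split_sizes: dict, levels: int = NUM_STAR_LEVELS) -> dict:
    """Return the expected number of reviews per star rating in each split."""
    counts = {}
    for name, size in split_sizes.items():
        if size % levels != 0:
            raise ValueError(f"{name}: {size} is not divisible by {levels}")
        counts[name] = size // levels
    return counts


expected = per_star_counts(SPLIT_SIZES)
for name, count in expected.items():
    print(f"{name}: {count:,} reviews per star rating")
```

Under these assumptions a classifier that always predicts one rating would score 20% accuracy, which is a useful floor when reading results on this corpus.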

Dataset structure

Data instances: each instance corresponds to one review and contains the following fields:
- review_id: string
- product_id: string
- reviewer_id: string
- stars: integer
- review_body: string
- review_title: string
- language: string
- product_category: string
Data splits:
- train: 1,200,000 reviews, 364,405,048 bytes
- validation: 30,000 reviews, 9,047,533 bytes
- test: 30,000 reviews, 9,099,141 bytes
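The split figures can be cross-checked against the dataset size the card declares (382,551,722 bytes). The following sketch sums the per-split counts and byte sizes and derives an average record size. The variable names are illustrative, and the average is only a rough indicator, since the byte counts presumably include all fields and not just the review body.

```python
# (num_examples, num_bytes) per split, as listed on this page.
SPLITS = {
    "train": (1_200_000, 364_405_048),
    "validation": (30_000, 9_047_533),
    "test": (30_000, 9_099_141),
}
DECLARED_DATASET_SIZE = 382_551_722  # bytes, from the card metadata

total_examples = sum(examples for examples, _ in SPLITS.values())
total_bytes = sum(num_bytes for _, num_bytes in SPLITS.values())

# The per-split byte counts should add up to the declared dataset size.
sizes_agree = total_bytes == DECLARED_DATASET_SIZE

print(f"total reviews: {total_examples:,}")
print(f"total bytes:   {total_bytes:,} (matches declared size: {sizes_agree})")
for name, (examples, num_bytes) in SPLITS.items():
    print(f"{name}: about {num_bytes / examples:.0f} bytes per review")
```

The three splits hold 1,260,000 reviews in total, and their byte sizes add up exactly to the declared dataset size.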

Dataset size

Download size: 640,320,386 bytes
Dataset size: 382,551,722 bytes

Licensing information

License: other

Multilinguality

Multilinguality: monolingual (Vietnamese)

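Task categories

Tasks:
- Summarization
- Text generation
- Fill-mask
- Text classification
Task IDs:
- Text scoring
- Language modeling
- Masked language modeling
- Sentiment classification
- Sentiment scoring
- Topic classification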

Dataset creation

Source data: original
Annotation creators: found
Language creators: found
