hzang97/tomtat
Source: Hugging Face | Updated: 2024-06-04 | Indexed: 2024-06-12
Download link:
https://hf-mirror.com/datasets/hzang97/tomtat
Resource description:
---
license: unknown
---
---
annotations_creators:
- found
language_creators:
- found
language:
- vi
license:
- other
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
- 1M<n<10M
source_datasets:
- original
task_categories:
- summarization
- text-generation
- fill-mask
- text-classification
task_ids:
- text-scoring
- language-modeling
- masked-language-modeling
- sentiment-classification
- sentiment-scoring
- topic-classification
paperswithcode_id: null
pretty_name: The Vietnamese Amazon Reviews Corpus
dataset_info:
- config_name: vi
  features:
  - name: review_id
    dtype: string
  - name: product_id
    dtype: string
  - name: reviewer_id
    dtype: string
  - name: stars
    dtype: int32
  - name: review_body
    dtype: string
  - name: review_title
    dtype: string
  - name: language
    dtype: string
  - name: product_category
    dtype: string
  splits:
  - name: train
    num_bytes: 364405048
    num_examples: 1200000
  - name: validation
    num_bytes: 9047533
    num_examples: 30000
  - name: test
    num_bytes: 9099141
    num_examples: 30000
  download_size: 640320386
  dataset_size: 382551722
config_names:
- vi
viewer: false
---
# Dataset Card for The Vietnamese Amazon Reviews Corpus
## Table of Contents
- [Dataset Card for The Vietnamese Amazon Reviews Corpus](#dataset-card-for-the-vietnamese-amazon-reviews-corpus)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
      - [plain_text](#plain_text)
    - [Data Fields](#data-fields)
      - [plain_text](#plain_text-1)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)
## Dataset Description
- **Webpage:** https://registry.opendata.aws/amazon-reviews-ml/
- **Paper:** https://arxiv.org/abs/2010.02573
- **Point of Contact:** [multilingual-reviews-dataset@amazon.com](mailto:multilingual-reviews-dataset@amazon.com)
### Dataset Summary
We provide an Amazon product reviews dataset for text classification in Vietnamese. The reviews were collected between November 1, 2015 and November 1, 2019. Each record contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g. ‘books’, ‘appliances’). The corpus is balanced across stars, so each star rating constitutes 20% of the reviews.
For Vietnamese, there are 1,200,000, 30,000, and 30,000 reviews in the training, development (validation), and test sets respectively. Each reviewer contributes at most 20 reviews, and each product has at most 20 reviews. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.
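The constraints stated above lend themselves to a quick sanity check. The following is a minimal sketch and not part of the dataset's own tooling: the helper names `expected_per_star`, `check_review`, and `reviewers_over_limit` are hypothetical, and it assumes that the star balance holds within each split, that ratings run from 1 to 5, and that the length limits apply to the review body.

```python
from collections import Counter

# Split sizes as stated in the summary.
SPLIT_SIZES = {"train": 1_200_000, "validation": 30_000, "test": 30_000}
NUM_STAR_LEVELS = 5      # assumption: ratings run from 1 to 5
MIN_CHARS = 20           # minimum review length stated in the summary
MAX_CHARS = 2_000        # reviews are truncated after this many characters
MAX_PER_REVIEWER = 20    # maximum reviews per reviewer stated in the summary


def expected_per_star(split):
    """Reviews per star rating if the split is perfectly balanced (20% each)."""
    return SPLIT_SIZES[split] // NUM_STAR_LEVELS


def check_review(body, stars):
    """Return the names of violated constraints for one review (empty if none).

    len() counts Unicode code points, which may differ from how the
    corpus authors counted characters.
    """
    problems = []
    if not MIN_CHARS <= len(body) <= MAX_CHARS:
        problems.append("length")
    if not 1 <= stars <= NUM_STAR_LEVELS:
        problems.append("stars")
    return problems


def reviewers_over_limit(reviewer_ids):
    """Return the reviewer IDs that appear more often than the stated maximum."""
    counts = Counter(reviewer_ids)
    return {rid for rid, n in counts.items() if n > MAX_PER_REVIEWER}
```

Under the balance assumption, `expected_per_star("train")` gives 240,000 reviews per rating, and the validation and test sets each give 6,000.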
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
The dataset contains reviews in Vietnamese.
## Dataset Structure
### Data Instances
Each data instance corresponds to a review. The original JSON for an instance looks like this (Vietnamese example):
```json
{
  "review_id": "vi_1234567",
  "product_id": "product_vi_1234567",
  "reviewer_id": "reviewer_vi_1234567",
  "stars": "5",
  "review_body": "Sản phẩm rất tốt, tôi rất hài lòng với chất lượng.",
  "review_title": "Rất hài lòng",
  "language": "vi",
  "product_category": "electronics"
}
```
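Provider:
hzang97
Summary of original information
Dataset overview
Dataset name
- Name: The Vietnamese Amazon Reviews Corpus
- Alias (Chinese): 越南语亚马逊评论语料库
Dataset description
- Language: Vietnamese
- Content: Amazon product reviews in Vietnamese, collected between November 1, 2015 and November 1, 2019. Each record includes the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category.
- Data balance: each star rating accounts for 20% of the reviews.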
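Dataset structure
- Data instances: each instance corresponds to one review and contains the following fields:
  - review_id: string
  - product_id: string
  - reviewer_id: string
  - stars: integer
  - review_body: string
  - review_title: string
  - language: string
  - product_category: string
- Data splits:
  - train: 1,200,000 reviews, 364,405,048 bytes
  - validation: 30,000 reviews, 9,047,533 bytes
  - test: 30,000 reviews, 9,099,141 bytes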
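The field list can be mirrored in a small typed record. This is an illustrative sketch only: the `Review` class and the `parse_review` helper are hypothetical names, not part of any published loader, and converting `stars` from a JSON string to an integer is an assumption based on the example instance, where the rating appears as `"5"` while the declared type is an integer.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """One review, with the eight fields listed in the dataset structure."""
    review_id: str
    product_id: str
    reviewer_id: str
    stars: int
    review_body: str
    review_title: str
    language: str
    product_category: str


def parse_review(raw):
    """Parse one JSON-encoded review, converting `stars` to an integer."""
    data = json.loads(raw)
    data["stars"] = int(data["stars"])  # accepts "5" as well as 5
    return Review(**data)


# Usage with the example instance from the dataset card.
raw = json.dumps({
    "review_id": "vi_1234567",
    "product_id": "product_vi_1234567",
    "reviewer_id": "reviewer_vi_1234567",
    "stars": "5",
    "review_body": "Sản phẩm rất tốt, tôi rất hài lòng với chất lượng.",
    "review_title": "Rất hài lòng",
    "language": "vi",
    "product_category": "electronics",
}, ensure_ascii=False)
review = parse_review(raw)
```

Because `Review(**data)` rejects unknown or missing keys with a `TypeError`, records that do not match the listed schema fail early rather than passing through silently.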
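Dataset size
- Download size: 640,320,386 bytes
- Dataset size: 382,551,722 bytes
License information
- License: other
Multilinguality
- Multilinguality: monolingual (Vietnamese)
Task categories
- Tasks:
  - summarization
  - text generation
  - fill-mask
  - text classification
- Task IDs:
  - text scoring
  - language modeling
  - masked language modeling
  - sentiment classification
  - sentiment scoring
  - topic classification
Dataset creation
- Source data: original
- Annotation creators: found
- Language creators: found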



