smartcat/Amazon_Luxury_Beauty_2018
Source: Hugging Face. Updated 2024-10-22; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/smartcat/Amazon_Luxury_Beauty_2018
Resource description:
---
dataset_info:
- config_name: metadata
  features:
  - name: asin
    dtype: string
  - name: title
    dtype: string
  - name: description
    dtype: string
  - name: brand
    dtype: string
  - name: main_cat
    dtype: string
  - name: category
    sequence: 'null'
  - name: also_buy
    sequence: string
  - name: also_view
    sequence: string
  - name: imageURL
    sequence: string
  - name: imageURLHighRes
    sequence: string
  splits:
  - name: train
    num_bytes: 24898313
    num_examples: 12299
  download_size: 9490355
  dataset_size: 24898313
- config_name: reviews
  features:
  - name: reviewerID
    dtype: string
  - name: reviewerName
    dtype: string
  - name: overall
    sequence: int64
  - name: reviewTime
    sequence: timestamp[us]
  - name: asin
    sequence: string
  - name: reviewText
    sequence: string
  - name: summary
    sequence: string
  splits:
  - name: train
    num_bytes: 24663121
    num_examples: 6107
  download_size: 11955594
  dataset_size: 24663121
configs:
- config_name: metadata
  data_files:
  - split: train
    path: metadata/train-*
- config_name: reviews
  data_files:
  - split: train
    path: reviews/train-*
---
# Amazon Luxury Beauty Dataset
## Directory Structure
- **metadata**: Contains product information.
- **reviews**: Contains user reviews about the products.
- **filtered**:
  - **e5-base-v2_embeddings.jsonl**: Contains "asin" and "embeddings" created with [e5-base-v2](https://huggingface.co/intfloat/e5-base-v2).
  - **metadata.jsonl**: Contains "asin" and "text", where text is created from the title, description, brand, main category, and category.
  - **reviews.jsonl**: Contains "reviewerID", "reviewTime", and "asin". Reviews are filtered to include only perfect 5-star ratings with a minimum of 5 ratings.
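The "text" field in `filtered/metadata.jsonl` is described as being built from the title, description, brand, main category, and category. The exact recipe is not documented here, so the following is only a minimal sketch of one way such a field could be assembled. The helper name `build_text`, the space separator, the field order, and the sample product are all assumptions made for illustration.

```python
def build_text(product, sep=" "):
    """Join the descriptive fields of one metadata row into a single string.

    Hypothetical helper: the separator and field order are assumptions,
    not the dataset authors' documented procedure.
    """
    parts = [
        product.get("title"),
        product.get("description"),
        product.get("brand"),
        product.get("main_cat"),
    ]
    # "category" is a sequence in the schema; it may be empty or hold nulls.
    parts.extend(product.get("category") or [])
    # Keep only non-empty strings so nulls and blanks do not leak into the text.
    return sep.join(p.strip() for p in parts if isinstance(p, str) and p.strip())


# Invented sample row, shaped like the "metadata" config.
product = {
    "asin": "B000000000",
    "title": "Hydrating Face Cream",
    "description": "A rich moisturizer.",
    "brand": "ExampleBrand",
    "main_cat": "Luxury Beauty",
    "category": [None],
}

record = {"asin": product["asin"], "text": build_text(product)}
print(record["text"])
```

A record of this shape ("asin" plus "text") is what an embedding model such as e5-base-v2 would then consume, one text per product.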
## Usage
### Download metadata
```python
from datasets import load_dataset

metadata = load_dataset(path="smartcat/Amazon_Luxury_Beauty_2018", name="metadata", split="train")
```
### Download reviews
```python
from datasets import load_dataset

reviews = load_dataset(path="smartcat/Amazon_Luxury_Beauty_2018", name="reviews", split="train")
```
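In the dataset schema, `reviewerID` and `reviewerName` are plain strings, while `overall`, `reviewTime`, `asin`, `reviewText`, and `summary` are sequences. This suggests that each row groups all reviews by one reviewer. The sketch below flattens such a row into one record per review, assuming the sequences are parallel (same length, aligned by index). The helper name `flatten_reviewer_row` and the sample row are hypothetical, and timestamps are shown as strings for brevity.

```python
def flatten_reviewer_row(row):
    """Turn one reviewer row (parallel sequences) into a list of per-review dicts.

    Assumes the sequence fields are index-aligned; this is inferred from the
    schema, not confirmed by the dataset authors.
    """
    fields = ("overall", "reviewTime", "asin", "reviewText", "summary")
    columns = [row[name] for name in fields]
    return [
        {
            "reviewerID": row["reviewerID"],
            "reviewerName": row["reviewerName"],
            **dict(zip(fields, values)),
        }
        for values in zip(*columns)
    ]


# Invented sample row, shaped like the "reviews" config.
example_row = {
    "reviewerID": "A0000000000000",
    "reviewerName": "Example Reviewer",
    "overall": [5, 4],
    "reviewTime": ["2018-01-02", "2018-03-04"],
    "asin": ["B000000001", "B000000002"],
    "reviewText": ["Great cream.", "Nice scent."],
    "summary": ["Love it", "Good"],
}

flat_reviews = flatten_reviewer_row(example_row)
for review in flat_reviews:
    print(review["asin"], review["overall"])
```

Flat records like these are easier to filter by rating or join against the metadata config on `asin`.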
### Download filtered files
```python
from datasets import load_dataset

filtered_reviews = load_dataset(
    path="smartcat/Amazon_Luxury_Beauty_2018",
    data_files="filtered/reviews.jsonl",
    split="train",
)
```
**📎 Note:** You can set any file or list of files from the "filtered" directory as the "data_files" argument.
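As a minimal sketch of the note above, the helper below builds the `load_dataset` arguments for one file name or a list of file names from the "filtered" directory. The names `filtered_kwargs` and `load_filtered` are hypothetical. Loading files with different columns in a single call may not work, so loading each file separately is the safer assumption.

```python
REPO = "smartcat/Amazon_Luxury_Beauty_2018"


def filtered_kwargs(files):
    """Build load_dataset keyword arguments for files under "filtered/".

    Accepts a single file name or a list of file names (hypothetical helper).
    """
    if isinstance(files, str):
        files = [files]
    return {
        "path": REPO,
        "data_files": [f"filtered/{name}" for name in files],
        "split": "train",
    }


def load_filtered(files):
    """Download the requested filtered files (requires network access)."""
    from datasets import load_dataset  # imported lazily so the helper above works offline

    return load_dataset(**filtered_kwargs(files))


print(filtered_kwargs("e5-base-v2_embeddings.jsonl")["data_files"])
```

For example, `load_filtered("metadata.jsonl")` would fetch the "asin" and "text" records, and `load_filtered("e5-base-v2_embeddings.jsonl")` the matching embeddings.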
Provider:
smartcat



