ryan1232100/Amazon-C4
Source: Hugging Face. Updated 2026-03-31; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ryan1232100/Amazon-C4
Resource description:
---
language:
- en
tags:
- instruction-following
- recommendation
- product search
size_categories:
- 10K<n<100K
---
# Amazon-C4
A **complex product search** dataset built on the [Amazon Reviews 2023 dataset](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023).
C4 is short for **C**omplex **C**ontexts **C**reated by **C**hatGPT.
## Quick Start
### Loading Queries
```python
from datasets import load_dataset
dataset = load_dataset('McAuley-Lab/Amazon-C4')['test']
```
```python
>>> dataset
Dataset({
    features: ['qid', 'query', 'item_id', 'user_id', 'ori_rating', 'ori_review'],
    num_rows: 21223
})
```
```python
>>> dataset[288]
{'qid': 288, 'query': 'I need something that can entertain my kids during bath time. It should be able to get messy, like smearing peanut butter on it.', 'item_id': 'B07DKNN87F', 'user_id': 'AEIDF5SU5ZJIQYDAYKYKNJBBOOFQ', 'ori_rating': 5, 'ori_review': 'Really helps in the bathtub. Smear some pb on there and let them go to town. A great distraction during bath time.'}
```
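The field notes in this card say that `ori_rating` and `ori_review` are kept for reference only and should not be used for solving the task. A minimal sketch of separating model input from the ground-truth label is shown here; the helper name `split_query_row` and the constant `REFERENCE_ONLY_FIELDS` are hypothetical, and the row is the `dataset[288]` example from above written out as a plain dict.

```python
from typing import Dict, Tuple

# Fields this card marks as reference-only (hypothetical constant name).
REFERENCE_ONLY_FIELDS = ("ori_rating", "ori_review")


def split_query_row(row: dict) -> Tuple[Dict, str]:
    """Return (model_input, ground_truth_item_id) for one query row.

    The ground-truth `item_id` and the reference-only fields are
    removed from the model input so they cannot leak into retrieval.
    """
    model_input = {
        key: value
        for key, value in row.items()
        if key not in REFERENCE_ONLY_FIELDS and key != "item_id"
    }
    return model_input, row["item_id"]


# The example row shown above, as a plain dict.
row = {
    "qid": 288,
    "query": "I need something that can entertain my kids during bath time. "
             "It should be able to get messy, like smearing peanut butter on it.",
    "item_id": "B07DKNN87F",
    "user_id": "AEIDF5SU5ZJIQYDAYKYKNJBBOOFQ",
    "ori_rating": 5,
    "ori_review": "Really helps in the bathtub. Smear some pb on there and let "
                  "them go to town. A great distraction during bath time.",
}

model_input, target_item_id = split_query_row(row)
print(sorted(model_input))
print(target_item_id)
```

Whether `user_id` belongs in the model input depends on the method being evaluated; this sketch keeps it only because the card does not mark it as reference-only.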
### Loading Item Pool
If you would like to use the same item pool used for our [BLaIR](https://arxiv.org/abs/2403.03952) paper, you can follow these steps:
```python
import json

from huggingface_hub import hf_hub_download

# Download (or reuse the cached copy of) the sampled item pool file.
filepath = hf_hub_download(
    repo_id='McAuley-Lab/Amazon-C4',
    filename='sampled_item_metadata_1M.jsonl',
    repo_type='dataset'
)

# Each line of the JSONL file is one item record.
item_pool = []
with open(filepath, 'r', encoding='utf-8') as file:
    for line in file:
        item_pool.append(json.loads(line.strip()))
```
```
```python
>>> len(item_pool)
1058417
```
```python
>>> item_pool[0]
{'item_id': 'B0778XR2QM', 'category': 'Care', 'metadata': 'Supergoop! Super Power Sunscreen Mousse SPF 50, 7.1 Fl Oz. Product Description Kids, moms, and savvy sun-seekers will flip for this whip! Formulated with nourishing Shea butter and antioxidant packed Blue Sea Kale, this one-of-a kind mousse formula is making sunscreen super FUN! The refreshing light essence of cucumber and citrus has become an instant hit at Super goop! HQ where we’ve been known to apply gobs of it just for the uplifting scent. Water resistant for up to 80 minutes too! Brand Story Supergoop! is the first and only prestige skincare brand completely dedicated to sun protection. Supergoop! has Super Broad Spectrum protection, which means it protects skin from UVA rays, UVB rays and IRA rays.'}
```
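Since `item_pool` is a list of about one million records, finding an item by scanning the list is slow. One possible approach, sketched here, is to build a dictionary keyed by `item_id`. The helper names `build_item_index` and `lookup_item` are hypothetical, and the records are made-up toy data in the same shape as the real ones. The lookup returns `None` for unknown IDs because this card does not state whether every ground-truth item appears in the pool.

```python
from typing import Dict, List, Optional


def build_item_index(item_pool: List[dict]) -> Dict[str, dict]:
    """Map each item_id to its record for constant-time lookup."""
    return {item["item_id"]: item for item in item_pool}


def lookup_item(item_index: Dict[str, dict], item_id: str) -> Optional[dict]:
    """Return the item record, or None if the ID is not in the pool."""
    return item_index.get(item_id)


# Toy records (invented for illustration, not taken from the dataset).
toy_item_pool = [
    {"item_id": "ITEM_A", "category": "Pet", "metadata": "Toy record A"},
    {"item_id": "ITEM_B", "category": "Care", "metadata": "Toy record B"},
]

item_index = build_item_index(toy_item_pool)
found = lookup_item(item_index, "ITEM_B")
missing = lookup_item(item_index, "ITEM_Z")
print(found)
print(missing)
```

With the real data, the same index can be built from the `item_pool` list loaded above and queried with the `item_id` of a query row.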
## Dataset Description
- **Repository:** https://github.com/hyp1231/AmazonReviews2023
- **Paper:** https://arxiv.org/abs/2403.03952
- **Point of Contact:** Yupeng Hou @ [yphou@ucsd.edu](mailto:yphou@ucsd.edu)
### Dataset Summary
Amazon-C4 is designed to assess a model's ability to comprehend complex language contexts and retrieve relevant items.
In conventional product search, users typically enter short, straightforward keywords to retrieve the items they want. In the new product search task with complex contexts, the input is longer and more detailed, but not always directly relevant to the item metadata. Examples of such input include multi-round dialogues and complex user instructions.
### Dataset Processing
Amazon-C4 is created by prompting ChatGPT to generate complex contexts as queries.
During data construction:
* 5-star user reviews are treated as satisfactory interactions.
* Reviews with at least 100 characters are considered long enough to carry the information needed for rewriting into complex contextual queries.
We uniformly sample around 22,000 user reviews from the test set of the [Amazon Reviews 2023 dataset](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023) that meet the rating and review-length requirements. ChatGPT then rephrases each sampled review as a complex context in a first-person tone, and the result serves as a query in the Amazon-C4 dataset.
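The selection step described above (5-star rating, at least 100 characters, uniform sampling) could be sketched as follows. This is an illustration under stated assumptions, not the authors' actual pipeline: the function name `select_reviews`, the record keys `rating` and `text`, the seed, and the toy reviews are all assumptions made for the example, and the ChatGPT rewriting step is not shown.

```python
import random
from typing import List


def select_reviews(reviews: List[dict], n_samples: int,
                   min_chars: int = 100, seed: int = 0) -> List[dict]:
    """Keep 5-star reviews with at least `min_chars` characters,
    then draw a uniform sample without replacement."""
    eligible = [
        review for review in reviews
        if review["rating"] == 5 and len(review["text"]) >= min_chars
    ]
    if n_samples >= len(eligible):
        return eligible
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(eligible, n_samples)


# Toy reviews (invented for illustration).
toy_reviews = [
    {"rating": 5.0, "text": "x" * 120},  # eligible
    {"rating": 5.0, "text": "short"},    # too short
    {"rating": 4.0, "text": "y" * 150},  # not 5 stars
    {"rating": 5.0, "text": "z" * 100},  # eligible (exactly 100 chars)
]

selected = select_reviews(toy_reviews, n_samples=10)
one = select_reviews(toy_reviews, n_samples=1)
print(len(selected), len(one))
```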
## Dataset Structure
### Data Fields
- `test.csv` contains query-item pairs for evaluating the complex product search task. The file has 6 columns:
- `qid (int64)`: Query ID. Unique ID for each query, ranging from 0 to 21222. An example of `qid` is:
```
288
```
- `query (string)`: Complex query. For example:
```
I need something that can entertain my kids during bath time. It should be able to get messy, like smearing peanut butter on it.
```
- `item_id (string)`: Unique ID for the ground truth item. This ID corresponds to `parent_asin` in the original Amazon Reviews 2023 dataset. For example:
```
B07DKNN87F
```
- `user_id (string)`: The unique user ID. For example:
```
AEIDF5SU5ZJIQYDAYKYKNJBBOOFQ
```
- `ori_rating (float)`: Rating score of the original user review before it was rewritten by ChatGPT. This field should not be used for solving the task; it is kept only for reference. For example:
```
5
```
- `ori_review (string)`: Original review text before it was rewritten by ChatGPT. This field should not be used for solving the task; it is kept only for reference. For example:
```
Really helps in the bathtub. Smear some pb on there and let them go to town. A great distraction during bath time.
```
- `sampled_item_metadata_1M.jsonl` contains ~1M items sampled from the Amazon Reviews 2023 dataset. For each <query, item> pair, we randomly sample 50 items from the domain of the ground-truth item. This sampled item pool is used for evaluation in the [BLaIR paper](https://arxiv.org/abs/2403.03952). Each line is a JSON object with the following fields:
- `item_id (string)`: Unique ID for the item. This ID corresponds to `parent_asin` in the original Amazon Reviews 2023 dataset. For example:
```
B07DKNN87F
```
- `category (string)`: Category of this item. This attribute can be used to evaluate model performance within a specific category. For example:
```
Pet
```
- `metadata (string)`: The `title` and `description` from the original item metadata of the Amazon Reviews 2023 dataset, concatenated into a single string.
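Because each query has one ground-truth `item_id` and each item has a `category`, retrieval quality can be reported per category. The sketch below uses Recall@k and NDCG@k as example metrics; the choice of metrics, the value of `k`, the function names, and the toy rankings are assumptions for illustration and are not specified by this card. With a single relevant item the ideal DCG is 1, so NDCG@k reduces to `1 / log2(rank + 1)`.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def recall_at_k(ranked_ids: List[str], target_id: str, k: int) -> float:
    """1.0 if the ground-truth item is in the top k, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def ndcg_at_k(ranked_ids: List[str], target_id: str, k: int) -> float:
    """NDCG@k with exactly one relevant item (ideal DCG is 1)."""
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id == target_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def evaluate_by_category(
    results: Iterable[Tuple[str, List[str], str]], k: int
) -> Dict[str, Dict[str, float]]:
    """Average both metrics per category.

    Each result is (category, ranked_item_ids, ground_truth_item_id).
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # recall sum, ndcg sum, count
    for category, ranked_ids, target_id in results:
        entry = totals[category]
        entry[0] += recall_at_k(ranked_ids, target_id, k)
        entry[1] += ndcg_at_k(ranked_ids, target_id, k)
        entry[2] += 1
    return {
        category: {"recall": r / n, "ndcg": d / n}
        for category, (r, d, n) in totals.items()
    }


# Toy rankings (invented for illustration).
toy_results = [
    ("Pet", ["a", "b", "c"], "b"),   # hit at rank 2
    ("Pet", ["d", "e", "f"], "x"),   # miss
    ("Care", ["g", "h"], "g"),       # hit at rank 1
]

by_category = evaluate_by_category(toy_results, k=3)
print(by_category)
```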
### Data Statistics
|#Queries|#Items|Avg.Len.q|Avg.Len.t|
|-|-|-|-|
|21,223|1,058,417|229.89|538.97|
Here `Avg.Len.q` denotes the average number of characters in the queries, and `Avg.Len.t` denotes the average number of characters in the item metadata.
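The two average-length columns are character counts, which can be recomputed with a few lines of Python. The helper name `average_char_length` is hypothetical, and the strings below are toy inputs rather than real queries.

```python
from typing import Iterable


def average_char_length(texts: Iterable[str]) -> float:
    """Mean number of characters per string (0.0 for an empty input)."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(len(text) for text in texts) / len(texts)


toy_texts = ["abcd", "ab"]
toy_average = average_char_length(toy_texts)
print(toy_average)  # 3.0
```

Applied to the `query` column of the loaded test split, or to the `metadata` field of each record in the item pool, this should give values comparable to the table, assuming the statistics were computed on the raw strings without further preprocessing.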
### Citation
Please cite the following paper if you use this dataset, thanks!
```bibtex
@article{hou2024bridging,
title={Bridging Language and Items for Retrieval and Recommendation},
author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
journal={arXiv preprint arXiv:2403.03952},
year={2024}
}
```
If you have any questions or suggestions, please [raise an issue](https://github.com/hyp1231/AmazonReviews2023/issues/new) at our GitHub repo, [start a discussion here](https://huggingface.co/datasets/McAuley-Lab/Amazon-C4/discussions/new), or contact Yupeng Hou directly @ [yphou@ucsd.edu](mailto:yphou@ucsd.edu).
Provider: ryan1232100