Amazon-C4
ModelScope Community. Updated 2025-12-05. Indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/McAuley-Lab/Amazon-C4
# Amazon-C4
A **complex product search** dataset built on the [Amazon Reviews 2023 dataset](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023).
C4 is short for **C**omplex **C**ontexts **C**reated by **C**hatGPT.
## Quick Start
### Loading Queries
```python
from datasets import load_dataset
dataset = load_dataset('McAuley-Lab/Amazon-C4')['test']
```
```python
>>> dataset
Dataset({
    features: ['qid', 'query', 'item_id', 'user_id', 'ori_rating', 'ori_review'],
    num_rows: 21223
})
```
```python
>>> dataset[288]
{'qid': 288, 'query': 'I need something that can entertain my kids during bath time. It should be able to get messy, like smearing peanut butter on it.', 'item_id': 'B07DKNN87F', 'user_id': 'AEIDF5SU5ZJIQYDAYKYKNJBBOOFQ', 'ori_rating': 5, 'ori_review': 'Really helps in the bathtub. Smear some pb on there and let them go to town. A great distraction during bath time.'}
```
### Loading Item Pool
To use the same item pool as in our [BLaIR](https://arxiv.org/abs/2403.03952) paper, follow these steps:
```python
import json
from huggingface_hub import hf_hub_download

# Download the sampled item pool from the dataset repo (cached locally).
filepath = hf_hub_download(
    repo_id='McAuley-Lab/Amazon-C4',
    filename='sampled_item_metadata_1M.jsonl',
    repo_type='dataset'
)

# Each line of the JSONL file is one item record.
item_pool = []
with open(filepath, 'r') as file:
    for line in file:
        item_pool.append(json.loads(line.strip()))
```
```python
>>> len(item_pool)
1058417
```
```python
>>> item_pool[0]
{'item_id': 'B0778XR2QM', 'category': 'Care', 'metadata': 'Supergoop! Super Power Sunscreen Mousse SPF 50, 7.1 Fl Oz. Product Description Kids, moms, and savvy sun-seekers will flip for this whip! Formulated with nourishing Shea butter and antioxidant packed Blue Sea Kale, this one-of-a kind mousse formula is making sunscreen super FUN! The refreshing light essence of cucumber and citrus has become an instant hit at Super goop! HQ where we’ve been known to apply gobs of it just for the uplifting scent. Water resistant for up to 80 minutes too! Brand Story Supergoop! is the first and only prestige skincare brand completely dedicated to sun protection. Supergoop! has Super Broad Spectrum protection, which means it protects skin from UVA rays, UVB rays and IRA rays.'}
```
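
Once queries and the item pool are loaded, a common next step is to locate each query's ground-truth item inside the pool. The following is a minimal sketch, assuming every ground-truth `item_id` appears in the pool. The helper name `build_item_index` is hypothetical, and the toy records only mimic the shape of the real ones (the category and metadata of the second item are placeholders, not real values).

```python
# Toy records shaped like the examples above; real data comes from the loading code.
queries = [
    {"qid": 288, "query": "I need something that can entertain my kids during bath time.", "item_id": "B07DKNN87F"},
]
item_pool = [
    {"item_id": "B0778XR2QM", "category": "Care", "metadata": "Supergoop! Super Power Sunscreen Mousse SPF 50, 7.1 Fl Oz."},
    {"item_id": "B07DKNN87F", "category": "Pet", "metadata": "(placeholder metadata text)"},
]


def build_item_index(pool):
    """Map each item_id to its position in the pool (hypothetical helper)."""
    return {item["item_id"]: position for position, item in enumerate(pool)}


item_index = build_item_index(item_pool)

# Position of the ground-truth item for every query, in query order.
target_positions = [item_index[q["item_id"]] for q in queries]
print(target_positions)
```

With the full pool of roughly one million items, a dictionary lookup like this avoids scanning the list once per query.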
## Dataset Description
- **Repository:** https://github.com/hyp1231/AmazonReviews2023
- **Paper:** https://arxiv.org/abs/2403.03952
- **Point of Contact:** Yupeng Hou @ [yphou@ucsd.edu](mailto:yphou@ucsd.edu)
### Dataset Summary
Amazon-C4 is designed to assess a model's ability to comprehend complex language contexts and retrieve relevant items.
In conventional product search, users may input short, straightforward keywords to retrieve desired items. In the new product search task with complex contexts, the input is longer and more detailed, but not always directly relevant to the item metadata. Examples of such input include multi-round dialogues and complex user instructions.
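
Because each query has exactly one ground-truth item, retrieval quality can be summarized by where that item lands in the ranked pool. The sketch below is an illustration under assumptions: the choice of Recall@k and NDCG@k, the function names, and the optimistic tie handling are ours and are not stated by the dataset card.

```python
import math


def rank_of_target(scores, target_index):
    """1-based rank of the target item when items are sorted by descending score."""
    target_score = scores[target_index]
    # Count items scored strictly higher; ties are resolved in favour of the target.
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(rank, k):
    """1.0 if the single relevant item is within the top k, else 0.0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k):
    """With one relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


# Toy similarity scores between one query and four candidate items.
scores = [0.1, 0.7, 0.4, 0.9]
rank = rank_of_target(scores, target_index=1)
print(rank, recall_at_k(rank, 1), recall_at_k(rank, 2), ndcg_at_k(rank, 10))
```

Averaging these per-query values over all queries gives a dataset-level score.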
### Dataset Processing
Amazon-C4 is created by prompting ChatGPT to generate complex contexts as queries.
During data construction:
* 5-star-rated user reviews on items are treated as satisfactory interactions.
* Reviews with at least 100 characters are considered valid for conveying sufficient information to be rewritten as complex contextual queries.
We uniformly sample around 22,000 user reviews from the test set of the [Amazon Reviews 2023 dataset](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023) that meet the rating and review-length requirements. ChatGPT then rephrases each review as a complex context in a first-person tone, and these rephrased contexts serve as the queries in the constructed Amazon-C4 dataset.
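
The filtering and sampling step described above can be sketched as follows. This is an illustration only, not the authors' code: the review field names `rating` and `text`, the function names, and the fixed seed are assumptions, and the ChatGPT rewriting step is not shown.

```python
import random


def is_eligible(review, min_chars=100):
    """Keep 5-star reviews whose text has at least `min_chars` characters."""
    return review["rating"] == 5 and len(review["text"]) >= min_chars


def sample_reviews(reviews, n, seed=0):
    """Uniformly sample up to `n` eligible reviews without replacement."""
    eligible = [r for r in reviews if is_eligible(r)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(eligible, min(n, len(eligible)))


# Toy reviews: only the first and last satisfy both requirements.
reviews = [
    {"rating": 5.0, "text": "x" * 120},
    {"rating": 5.0, "text": "too short"},
    {"rating": 4.0, "text": "y" * 150},
    {"rating": 5.0, "text": "z" * 100},
]
sampled = sample_reviews(reviews, n=2)
print(len(sampled))
```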
## Dataset Structure
### Data Fields
- `test.csv` contains query-item pairs that can be used to evaluate the complex product search task. The file has 6 columns:
  - `qid (int64)`: Query ID. Unique ID for each query, ranging from 0 to 21222. An example of `qid` is:
    ```
    288
    ```
  - `query (string)`: Complex query. For example:
    ```
    I need something that can entertain my kids during bath time. It should be able to get messy, like smearing peanut butter on it.
    ```
  - `item_id (string)`: Unique ID for the ground-truth item. This ID corresponds to `parent_asin` in the original Amazon Reviews 2023 dataset. For example:
    ```
    B07DKNN87F
    ```
  - `user_id (string)`: The unique user ID. For example:
    ```
    AEIDF5SU5ZJIQYDAYKYKNJBBOOFQ
    ```
  - `ori_rating (float)`: Rating score of the original user review before it was rewritten by ChatGPT. Note that this field should not be used for solving the task; it is retained only for reference. For example:
    ```
    5
    ```
  - `ori_review (string)`: Original review text before it was rewritten by ChatGPT. Note that this field should not be used for solving the task; it is retained only for reference. For example:
    ```
    Really helps in the bathtub. Smear some pb on there and let them go to town. A great distraction during bath time.
    ```
- `sampled_item_metadata_1M.jsonl` contains ~1M items sampled from the Amazon Reviews 2023 dataset. For each <query, item> pair, we randomly sample 50 items from the domain of the ground-truth item. This sampled item pool is used for evaluation in the [BLaIR paper](https://arxiv.org/abs/2403.03952). Each line is a JSON object with the following fields:
  - `item_id (string)`: Unique ID for the item. This ID corresponds to `parent_asin` in the original Amazon Reviews 2023 dataset. For example:
    ```
    B07DKNN87F
    ```
  - `category (string)`: Category of this item. This attribute can be used to evaluate model performance within a specific category. For example:
    ```
    Pet
    ```
  - `metadata (string)`: The `title` and `description` from the original item metadata of the Amazon Reviews 2023 dataset, concatenated into a single string.
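
Since `category` is meant for per-category evaluation, one way to use it is to join each query to its ground-truth item's category and average a per-query metric within each group. The sketch below assumes every ground-truth `item_id` is present in the pool; the function name `per_category_mean`, the `metric_by_qid` mapping, and the toy item IDs are hypothetical.

```python
from collections import defaultdict


def per_category_mean(queries, item_pool, metric_by_qid):
    """Average a per-query metric within the category of each ground-truth item."""
    category_of = {item["item_id"]: item["category"] for item in item_pool}
    buckets = defaultdict(list)
    for q in queries:
        buckets[category_of[q["item_id"]]].append(metric_by_qid[q["qid"]])
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}


# Toy data with made-up item IDs; only the field names follow the dataset.
queries = [
    {"qid": 0, "item_id": "A"},
    {"qid": 1, "item_id": "B"},
    {"qid": 2, "item_id": "C"},
]
item_pool = [
    {"item_id": "A", "category": "Pet"},
    {"item_id": "B", "category": "Pet"},
    {"item_id": "C", "category": "Care"},
]
metric_by_qid = {0: 1.0, 1: 0.0, 2: 1.0}  # e.g. a per-query recall value

result = per_category_mean(queries, item_pool, metric_by_qid)
print(result)
```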
### Data Statistics
|#Queries|#Items|Avg.Len.q|Avg.Len.t|
|-|-|-|-|
|21,223|1,058,417|229.89|538.97|
Here, `Avg.Len.q` denotes the average number of characters in the queries, and `Avg.Len.t` denotes the average number of characters in the item metadata.
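
Character-length averages such as `Avg.Len.q` and `Avg.Len.t` can be recomputed with a few lines. This sketch uses Python's `len`, which counts Unicode code points; whether the reported figures were computed the same way is an assumption. The helper name `average_length` is hypothetical.

```python
def average_length(texts):
    """Mean number of characters over an iterable of strings."""
    texts = list(texts)
    return sum(len(t) for t in texts) / len(texts)


# Toy strings standing in for queries and item metadata.
toy_queries = ["abcd", "ab"]
toy_metadata = ["abcdef", "abcd", "ab"]

avg_len_q = average_length(toy_queries)
avg_len_t = average_length(toy_metadata)
print(avg_len_q, avg_len_t)
```

On the real data, the inputs would be `dataset['query']` and `(item['metadata'] for item in item_pool)`.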
### Citation
Please cite the following paper if you use this dataset, thanks!
```bibtex
@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}
```
If you have any questions or suggestions, please [raise an issue](https://github.com/hyp1231/AmazonReviews2023/issues/new) at our GitHub repo, [start a discussion here](https://huggingface.co/datasets/McAuley-Lab/Amazon-C4/discussions/new), or contact Yupeng Hou directly @ [yphou@ucsd.edu](mailto:yphou@ucsd.edu).
Provider: maas
Created: 2025-09-22



