NingLab/MMECInstruct
Source: Hugging Face. Updated 2024-10-28; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/NingLab/MMECInstruct
---
license: cc-by-4.0
language:
- en
size_categories:
- 10K<n<100K
---
## Introduction
MMECInstruct comprises seven tasks: answerability prediction, category classification,
product relation prediction, product substitute identification, multiclass product classification,
sentiment analysis, and sequential recommendation.
The dataset is split into training, validation, IND test, and OOD test sets.
## Dataset Sources
- **Repository:** [GitHub](https://github.com/ninglab/CASLIE)
- **Homepage:** [CASLIE](https://ninglab.github.io/CASLIE/)
## Data Split
The statistics for the MMECInstruct Dataset are shown in the table below.
| Split | Size |
| --- | --- |
| Train | 56,000 |
| Validation | 7,000 |
| IND Test | 7,000 |
| OOD Test | 5,000 |
| Total | 75,000 |
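As a quick sanity check on the table, the sketch below sums the split sizes and prints each split's share of the total. The numbers are copied from the table; the variable names are illustrative only and are not part of the dataset.

```python
# Split sizes copied from the table above.
split_sizes = {
    "Train": 56_000,
    "Validation": 7_000,
    "IND Test": 7_000,
    "OOD Test": 5_000,
}

# The total should match the "Total" row of the table.
total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name:<12}{size:>8,}{fractions[name]:>8.1%}")
print(f"{'Total':<12}{total:>8,}")
```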
## Quick Start
Run the following code to load the data:
```python
from datasets import load_dataset
dataset = load_dataset("NingLab/MMECInstruct")
```
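Once loaded, it can help to confirm which splits are present and how large each one is. The helper below is a minimal sketch: `summarize_splits` is a hypothetical name, and the split names in the toy example are made up, since the actual names come from the dataset itself. It works on any mapping from split name to a sized collection, which includes the dict-like object that `load_dataset` returns when no split is specified.

```python
from collections.abc import Mapping, Sized


def summarize_splits(splits: Mapping[str, Sized]) -> dict[str, int]:
    """Return the number of examples in each split of a mapping-like dataset."""
    return {name: len(split) for name, split in splits.items()}


# Toy stand-in with made-up split names, so the sketch runs offline.
toy = {"train": [1, 2, 3], "validation": [4]}
print(summarize_splits(toy))

# With the real data (requires network access):
#   from datasets import load_dataset
#   print(summarize_splits(load_dataset("NingLab/MMECInstruct")))
```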
Note that the `caption_info` field contains generated captions for the `images` field. The captions correspond, one by one and in order, to the products appearing in `input`.
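The positional correspondence between captions and products can be sketched as follows. This is an illustration under assumptions, not the dataset's documented schema: it assumes the products have already been extracted from `input` into a list and that the captions are available as a list of strings in the same order. `pair_captions` and the sample values are hypothetical.

```python
def pair_captions(products: list[str], captions: list[str]) -> list[tuple[str, str]]:
    """Pair each product with the caption at the same position."""
    # The order-based correspondence only makes sense if the counts match.
    if len(products) != len(captions):
        raise ValueError(
            f"expected one caption per product, got {len(products)} products "
            f"and {len(captions)} captions"
        )
    return list(zip(products, captions))


# Made-up values standing in for one record's products and captions.
products = ["product A", "product B"]
captions = ["caption for A", "caption for B"]
for product, caption in pair_captions(products, captions):
    print(f"{product}: {caption}")
```

Raising on a length mismatch, rather than silently truncating as a bare `zip` would, makes misaligned records visible early.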
## License
Please check the license of each subset in our curated dataset ECInstruct.
| Dataset | License Type |
| --- | --- |
| [Amazon Review](https://amazon-reviews-2023.github.io/) | Not listed |
| [AmazonQA](https://github.com/amazonqa/amazonqa) | Not listed |
| [MAVE](https://github.com/google-research-datasets/MAVE) | CC-by-4.0 |
| [Shopping Queries Dataset](https://github.com/amazon-science/esci-data) | Apache License 2.0 |
## Citation
```bibtex
@article{ling2024captions,
title={Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data},
author={Ling, Xinyi and Peng, Bo and Du, Hanwen and Zhu, Zhihui and Ning, Xia},
journal={arXiv preprint arXiv:2410.17337},
year={2024}
}
```
Provider: NingLab



