NingLab/MMECInstruct

Name: NingLab/MMECInstruct
Creator: NingLab
Published: 2024-10-28 21:07:58
License: No description available

Source: Hugging Face. Updated: 2024-10-28. Indexed: 2025-04-12.

Download link:

https://hf-mirror.com/datasets/NingLab/MMECInstruct


Resource description:

---
license: cc-by-4.0
language:
- en
size_categories:
- 10K<n<100K
---

## Introduction

MMECInstruct comprises 7 tasks: answerability prediction, category classification, product relation prediction, product substitute identification, multiclass product classification, sentiment analysis, and sequential recommendation. MMECInstruct is split into training sets, validation sets, in-domain (IND) test sets, and out-of-domain (OOD) test sets.

## Dataset Sources

- **Repository:** [GitHub](https://github.com/ninglab/CASLIE)
- **Homepage:** [CASLIE](https://ninglab.github.io/CASLIE/)

## Data Split

The statistics for the MMECInstruct dataset are shown in the table below.

| Split | Size |
| --- | --- |
| Train | 56,000 |
| Validation | 7,000 |
| IND Test | 7,000 |
| OOD Test | 5,000 |
| Total | 75,000 |

## Quick Start

Run the following code to get the data:

```python
from datasets import load_dataset

dataset = load_dataset("NingLab/MMECInstruct")
```

Note that the `caption_info` field in the dataset contains the captions generated for `images`. The captions correspond, one by one and in order, to the products appearing in `input`.

## License

Please check the license of each subset in our curated dataset ECInstruct.

| Dataset | License Type |
| --- | --- |
| [Amazon Review](https://amazon-reviews-2023.github.io/) | Not listed |
| [AmazonQA](https://github.com/amazonqa/amazonqa) | Not listed |
| [MAVE](https://github.com/google-research-datasets/MAVE) | CC-by-4.0 |
| [Shopping Queries Dataset](https://github.com/amazon-science/esci-data) | Apache License 2.0 |

## Citation

```bibtex
@article{ling2024captions,
  title={Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data},
  author={Ling, Xinyi and Peng, Bo and Du, Hanwen and Zhu, Zhihui and Ning, Xia},
  journal={arXiv preprint arXiv:2410.17337},
  year={2024}
}
```
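The Quick Start note says that the captions in `caption_info` line up, in order, with the products in `input`. The following is a minimal sketch of that positional pairing. It assumes both fields have already been parsed into Python lists of strings; the dataset card does not specify the stored format, so the helper `pair_products_with_captions` and the sample record are hypothetical and only illustrate the ordering rule.

```python
from typing import List, Tuple


def pair_products_with_captions(
    products: List[str], captions: List[str]
) -> List[Tuple[str, str]]:
    """Pair each product with the caption at the same position.

    Hypothetical helper: assumes one caption per product, in the same order.
    """
    if len(products) != len(captions):
        raise ValueError(
            f"expected one caption per product, got "
            f"{len(products)} products and {len(captions)} captions"
        )
    return list(zip(products, captions))


# Made-up record for illustration; the real field layout may differ.
example = {
    "input": ["Product A: wireless mouse", "Product B: mouse pad"],
    "caption_info": [
        "a black wireless mouse on a desk",
        "a gray rectangular mouse pad",
    ],
}

pairs = pair_products_with_captions(example["input"], example["caption_info"])
for product, caption in pairs:
    print(f"{product} -> {caption}")
```

Raising an error on a length mismatch, rather than silently truncating as a bare `zip` would, makes a misaligned record visible early.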


Provider:

NingLab
