Chinese Multimodal Product Attribute Extraction Dataset (中文多模态商品属性抽取数据集)

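Name: Chinese Multimodal Product Attribute Extraction Dataset (中文多模态商品属性抽取数据集)
Creator: Beijing Jingdong Shangke Information Technology Co., Ltd. (北京京东尚科信息技术有限公司)
License: not specified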

Indexed by the National Basic Science Data Center (国家基础学科公共科学数据中心) on 2024-03-05

Download link:

https://www.nbsdc.cn/general/dataDetail?id=64ef84b1bb16e0591d02535f&type=1

Resource description:

Product attribute values are critical in e-commerce: customer service chatbots, product search, and recommendation systems all rely on them. In practice, however, attribute values are often incomplete and change over time, which significantly limits how well these applications perform. To study the extraction of product attributes and attribute values from Chinese multimodal data, and to support the conversion of unstructured data into structured data, the project collected text-image pairs of product descriptions from e-commerce platforms, marked the starting position of each attribute value mentioned in the unstructured text, and labeled its attribute category. The dataset contains 87,194 text-image instances (the source page gives 87,149, which is inconsistent with the split sizes that follow). It covers 7 product categories (clothing, shoes, bags, suitcases, dresses, boots, and trousers) and 26 attribute categories, such as "material", "color", and "style". The data was randomly split into a training set of 71,194 instances, a validation set of 8,000 instances, and a test set of 8,000 instances.
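The annotation scheme and the random split described above can be illustrated with a minimal Python sketch. The page does not document the dataset's actual file format, so the class names (`AttributeSpan`, `Instance`), all field names, the example record, and the helper functions are hypothetical and chosen only for illustration; only the three split sizes come from the description.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

# Split sizes as stated in the dataset description.
TRAIN_SIZE, VALID_SIZE, TEST_SIZE = 71_194, 8_000, 8_000


@dataclass
class AttributeSpan:
    """Hypothetical annotation: one attribute value marked in the text."""
    category: str  # attribute category, e.g. "material", "color", "style"
    start: int     # character offset where the value begins
    value: str     # surface form of the attribute value


@dataclass
class Instance:
    """Hypothetical text-image instance; field names are assumptions."""
    text: str
    image_path: str
    product_category: str
    spans: List[AttributeSpan] = field(default_factory=list)


def extract_values(instance: Instance) -> List[Tuple[str, str]]:
    """Turn span annotations into structured (category, value) pairs."""
    return [
        (s.category, instance.text[s.start:s.start + len(s.value)])
        for s in instance.spans
    ]


def random_split(items, sizes, seed=0):
    """Shuffle a copy of items and cut it into consecutive chunks of the given sizes."""
    if sum(sizes) != len(items):
        raise ValueError("split sizes must sum to the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    parts, offset = [], 0
    for size in sizes:
        parts.append(shuffled[offset:offset + size])
        offset += size
    return parts


# Made-up example record (not taken from the dataset): "red pure-cotton dress".
example = Instance(
    text="红色纯棉连衣裙",
    image_path="images/example.jpg",  # hypothetical path
    product_category="dress",
    spans=[
        AttributeSpan(category="color", start=0, value="红色"),
        AttributeSpan(category="material", start=2, value="纯棉"),
    ],
)
```

Calling `extract_values(example)` yields the structured pairs for the made-up record, and `random_split(ids, [TRAIN_SIZE, VALID_SIZE, TEST_SIZE])` partitions a list of instance identifiers into the three subsets. Whether the original split used a fixed seed is not stated on the page.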

Provided by:

Beijing Jingdong Shangke Information Technology Co., Ltd. (北京京东尚科信息技术有限公司)

The content above was collected and summarized by 遇见数据集.