Taobao Live Multimodal Video Product Retrieval Dataset
Source: OpenXLab (indexed 2026-04-18)
Download link:
https://openxlab.org.cn/datasets/OpenDataLab/淘宝直播多模态视频商品检索数据集
Description:
Live streaming is an important channel through which Taobao connects products with consumers. Identifying and recommending the products shown in a live video in real time can make purchase conversion more efficient. A single live broadcast is usually associated with hundreds of products that closely resemble one another, and the broadcast footage contains cluttered backgrounds and changing lighting, which makes matching and recognizing on-screen products very challenging. To improve product matching and recognition in live broadcasts, we drew on Taobao Live's massive data to build the industry's largest multimodal video product retrieval dataset: Watch and Buy (WAB). The dataset consists of matching pairs, each linking a live video clip to the product being explained in it. The video side includes clips with a fixed frame rate and fixed duration, bounding-box annotations on key frames sampled two seconds apart, and the speech transcript of each clip. The product side includes multiple images of the product, bounding-box annotations for all of those images, and the product title text.
Compared with other open-source datasets in the industry, this dataset has the following characteristics:
Large scale: The dataset includes 70,000 video-product matching pairs, 1,042,178 annotated images, 1,654,780 annotated bounding-box instances, and 70,000 transcribed and annotated video text segments.
Multimodal: The dataset targets real live-video scenarios. The video side includes both the video footage and the text of the anchor's explanation; the product side includes two modalities, product images and product title text.
Diversity: The box-level annotations are rich and varied, covering product bounding boxes, categories, viewpoints, display methods, instance numbers, and more. Within a video-product matching pair, the instance number identifies which annotated boxes across the images show the same style of item.
Multi-purpose: The data is annotated with 23 clothing detection categories and bounding-box positions, which supports research on object detection algorithms. The box-level instance numbers yield approximately 80,000 sets of same-style product sequences, which supports research on object retrieval and recognition algorithms. In addition, the dataset provides the transcript for each clip and the product title text, which supports research on visual-text multimodal retrieval algorithms.
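The role of the instance number can be illustrated with a minimal Python sketch. It is not the dataset's actual annotation schema: the `Box` class, its field names, and the `group_same_style` helper are hypothetical and chosen only for illustration. The sketch assumes each annotated box records which side it came from and an instance number, then groups the boxes of one matching pair by instance number and keeps the groups that appear on both the video side and the product side.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """One annotated bounding box (hypothetical schema, not the official one)."""
    side: str                         # "video" (key frame) or "product" (product image)
    image_id: str                     # identifier of the key frame or product image
    category: str                     # one of the clothing detection categories
    xyxy: Tuple[int, int, int, int]   # box corners: left, top, right, bottom
    instance_id: int                  # shared by boxes showing the same style


def group_same_style(boxes: List[Box]) -> Dict[int, List[Box]]:
    """Group boxes by instance number, keeping groups seen on both sides."""
    groups: Dict[int, List[Box]] = defaultdict(list)
    for box in boxes:
        groups[box.instance_id].append(box)
    return {
        iid: members
        for iid, members in groups.items()
        if {b.side for b in members} == {"video", "product"}
    }


if __name__ == "__main__":
    # Made-up boxes for one video-product matching pair.
    example = [
        Box("video", "frame_000", "skirt", (10, 20, 110, 220), 1),
        Box("video", "frame_000", "vest", (120, 30, 200, 180), 2),
        Box("product", "item_img_0", "skirt", (5, 5, 300, 400), 1),
    ]
    for iid, members in group_same_style(example).items():
        print(iid, [m.image_id for m in members])
```

Groups built this way correspond to the same-style sequences described above: video-side boxes can serve as queries and product-side boxes as retrieval targets.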
Provider:
OpenDataLab
Created:
2024-05-14



