viscon-1m

ModelScope community | Updated 2025-12-05 | Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/tiiuae/viscon-1m
Resource description:
# VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models

## Overview

**VisCon-100K** is a dataset specially designed to facilitate fine-tuning of vision-language models (VLMs) by leveraging interleaved image-text web documents. Derived from 45K web documents of the OBELICS dataset, this release contains 100K image conversation samples. GPT-4V is used to generate image-contextual captions, while OpenChat 3.5 converts these captions into diverse free-form and multiple-choice Q&A pairs. This approach not only focuses on fine-grained visual content but also incorporates the accompanying web context to yield superior performance. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger **VisCon-1M** dataset.

The dataset has been shown to improve performance for:

- Text-only large language models aligned with vision encoders using only image captions (e.g., ShareGPT4V-7b)
- Multimodally pretrained language models (e.g., IDEFICS2-8b) using interleaved image-text data

Furthermore, our experiments reveal that a "leaky modality mix" (where conversation samples contain questions answerable from both the image and its contextual caption) outperforms non-leaky configurations.

## Dataset Structure

The dataset contains two primary columns:

- **image:** File path to the corresponding image. Images are provided in a compressed ZIP archive stored in the repository.
- **conversation:** The conversation data (captions and Q&A pairs) associated with the image.
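
Because the **image** column stores a file path rather than image bytes, each record has to be matched to a file from the extracted ZIP archive. The following is a minimal sketch of that lookup, not the authors' code. It assumes the stored path is relative to the archive root and that the archive was extracted into a local `images` directory. The helper name `resolve_image_path` and the sample record are hypothetical, and the real format of the `conversation` value is not specified here, so a placeholder string stands in for it.

```python
from pathlib import Path


def resolve_image_path(record: dict, images_root: str = "images") -> Path:
    """Join a record's image path onto the directory holding the extracted archive."""
    # Assumption: record["image"] is a path relative to the archive root.
    relative = Path(record["image"])
    if relative.is_absolute():
        # Nothing to join if the stored path is already absolute.
        return relative
    return Path(images_root) / relative


# Hypothetical record mirroring the two documented columns; the values are made up.
sample = {"image": "0001.jpg", "conversation": "placeholder"}
resolved = resolve_image_path(sample)
print(resolved)
```

If the stored paths turn out to already include the archive's top-level folder, pass the parent directory as `images_root` instead.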
## How to Load the Dataset

You can load the dataset using Hugging Face's `datasets` library as shown below:

```python
from datasets import load_dataset

# Load the training split
train_ds = load_dataset("tiiuae/viscon-100k", split="train")

# Load the test split
test_ds = load_dataset("tiiuae/viscon-100k", split="test")

# Example: Viewing a sample record
print(train_ds[0]['image'])
print(train_ds[0]['conversation'])
```

The images are provided as a ZIP archive (images.zip) in this repository. To work with the images locally:

```bash
git lfs clone https://huggingface.co/datasets/tiiuae/viscon-100k
cd viscon-100k
unzip images.zip -d images
```

## Citation

If you use this dataset in your research, please cite [our paper](https://arxiv.org/abs/2502.10250):

```
"VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models", Gokul Karthik Kumar, Iheb Chaabane & Kebin Wu, PAKDD 2025.
```

Provider: maas
Created: 2025-10-15