viscon-1m
ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/tiiuae/viscon-1m
Resource description:
# VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models
## Overview
**VisCon-100K** is a dataset designed for fine-tuning vision-language models (VLMs) on interleaved image-text web documents. Derived from 45K web documents of the OBELICS dataset, this release contains 100K image conversation samples. GPT-4V generates image-contextual captions, and OpenChat 3.5 converts these captions into diverse free-form and multiple-choice Q&A pairs. The approach covers fine-grained visual content and also incorporates the accompanying web context, which improves downstream performance. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger **VisCon-1M** dataset.
The dataset has been shown to improve performance for:
- Text-only large language models aligned with vision encoders using only image captions (e.g., ShareGPT4V-7b)
- Multimodally pretrained language models (e.g., IDEFICS2-8b) using interleaved image-text data
Furthermore, our experiments reveal that a "leaky modality mix" (where conversation samples contain questions answerable from both the image and its contextual caption) outperforms non-leaky configurations.
## Dataset Structure
The dataset contains two primary columns:
- **image:** File path to the corresponding image. Images are provided in a compressed ZIP archive stored in the repository.
- **conversation:** The conversation data (captions and Q&A pairs) associated with the image.
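As a minimal sketch of the two-column layout described above, the helper below checks that a record exposes both columns and reports the Python type of each value. The function name `check_record` and the sample record are hypothetical and only for illustration; the internal schema of the `conversation` field is not specified here, so the sketch does not assume one.

```python
# Columns named in the dataset description.
EXPECTED_COLUMNS = ("image", "conversation")


def check_record(record: dict) -> dict:
    """Return a mapping of column name to the type name of its value.

    Raises KeyError if either expected column is missing.
    `check_record` is a hypothetical helper, not part of the dataset tooling.
    """
    missing = [col for col in EXPECTED_COLUMNS if col not in record]
    if missing:
        raise KeyError(f"record is missing columns: {missing}")
    return {col: type(record[col]).__name__ for col in EXPECTED_COLUMNS}
```

Calling `check_record(train_ds[0])` on a loaded split is one way to see how the `conversation` value is actually stored before writing any parsing code.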
## How to Load the Dataset
You can load the dataset using Hugging Face's `datasets` library as shown below:
```python
from datasets import load_dataset
# Load the training split
train_ds = load_dataset("tiiuae/viscon-100k", split="train")
# Load the test split
test_ds = load_dataset("tiiuae/viscon-100k", split="test")
# Example: Viewing a sample record
print(train_ds[0]['image'])
print(train_ds[0]['conversation'])
```
The images are provided as a ZIP archive (`images.zip`) in this repository. To work with the images locally:
```bash
git lfs install
git clone https://huggingface.co/datasets/tiiuae/viscon-100k
cd viscon-100k
unzip images.zip -d images
```
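Once the archive is extracted, each record's `image` value still has to be mapped to a file on disk. The sketch below shows one way to do that, assuming the `image` column holds a path relative to the extraction directory (`images` in the command above). That layout is an assumption, not something the card states, and `resolve_image_path` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_image_path(image_field: str, images_root: str = "images") -> Path:
    """Join a record's `image` value onto the extraction directory.

    Assumption: `image_field` is relative to `images_root`. If the archive
    unpacks with an extra top-level folder, adjust `images_root` accordingly.
    Raises FileNotFoundError when the resulting path is not an existing file.
    """
    candidate = Path(images_root) / image_field
    if not candidate.is_file():
        raise FileNotFoundError(f"image not found: {candidate}")
    return candidate
```

For example, `resolve_image_path(train_ds[0]['image'])` returns a `Path` that can be passed to an image library. A `FileNotFoundError` on the first record usually means the relative-path assumption does not match how the archive was unpacked.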
## Citation
If you use this dataset in your research, please cite [our paper](https://arxiv.org/abs/2502.10250):
```
"VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models", Gokul Karthik Kumar, Iheb Chaabane & Kebin Wu, PAKDD 2025.
```
Provider:
maas
Created:
2025-10-15



