Artanic30/Wiki_R1_Train
Source: Hugging Face · updated 2026-04-07 · indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Artanic30/Wiki_R1_Train
# Wiki-R1 Training Dataset
**Dataset Link**: [https://huggingface.co/datasets/Artanic30/Wiki_R1_Train](https://huggingface.co/datasets/Artanic30/Wiki_R1_Train)
## Overview
This dataset contains the training annotations and auxiliary files for Wiki-R1, a reinforcement learning framework for Knowledge-Intensive Visual Question Answering (KIVQA). It combines Infoseek and EVQA (Encyclopedic-VQA) training samples with retrieved knowledge base candidates and the similarity matrices used for label propagation.
## Dataset Structure
The dataset consists of the following JSON annotation files:
### 📁 Main Training Data
| File | Size | Description |
|------|------|-------------|
| `merge_infoseek_train_filtered_balance_sample20k_top5_and_evqa_train_sample20k_top5_w_I2T.json` | 330 MB | **Primary training set** combining 20k Infoseek samples + 20k EVQA samples. Each entry contains: <br>- Question and answer pairs<br>- Top-5 retrieved KB candidates<br>- Image metadata (OVEN IDs)<br>- Entity information for curriculum learning |
### 📁 Knowledge Base & Entity Mapping
| File | Size | Description |
|------|------|-------------|
| `final_related_KB_reflectiVA_v2.json` | 5.9 GB | **Knowledge base entries** containing: <br>- Wikipedia URLs<br>- Entity descriptions and text<br>- Visual-textual alignment metadata<br>Used for entity grounding and answer verification |
| `oven_id2path.json` | 299 MB | **Image path mapping** that maps OVEN dataset image IDs to local file paths. Required for loading visual data during training. |
### 📁 Label Propagation & Similarity Matrices
| File | Size | Description |
|------|------|-------------|
| `final_data_v2_kb_sim.json` | 1.2 GB | **KB entity similarity matrix** (primary). Pre-computed pairwise similarities between knowledge base entities using CLIP embeddings. Used for label propagation to improve curriculum learning. |
## Data Format
### Training Data Format (`merge_...json`)
The training data is a **JSON array** where each entry represents a visual question-answering sample. Example structure:
```json
{
  "data_id": "infoseek_train_00734350",
  "question": "Which country does this sport come from?",
  "question_original": "Which country does this sport come from?",
  "question_type": "infoseek",
  "answer": "SCT|UK-SC|Alba|Scot|UK-SCT|SC|Scotland|Scotland, United Kingdom|Caledonia",
  "dataset_name": "infoseek",
  "dataset_image_ids": "oven_00404497",
  "wikipedia_url": "https://en.wikipedia.org/wiki/Hammer_throw",
  "wikipedia_title": "Hammer throw",
  "wikipedia_url_used_in_train": "0",
  "encyclopedic_vqa_split": "0",
  "dataset_category_id": "0",
  "evidence": "0",
  "evidence_section_id": "0",
  "evidence_section_title": "0",
  "ret": {
    "0": {
      "text": "...",
      "url": "Q34357085"
    },
    "1": {
      "text": "...",
      "url": "Q24951623"
    },
    "2": {
      "text": "...",
      "url": "https://en.wikipedia.org/wiki/Hammer_throw"
    },
    // ... up to top-10 retrieved candidates
    "gt_url": {
      "text": "...",
      "url": "https://en.wikipedia.org/wiki/Hammer_throw"
    }
  }
}
```
**Key Fields:**
- `data_id`: Unique identifier (format: `{dataset}_{split}_{id}`)
- `question`: Question text
- `answer`: Answer string (may contain multiple acceptable forms separated by `|`)
- `dataset_image_ids`: OVEN image ID (used to look up the image path via `oven_id2path.json`)
- `wikipedia_url`: Ground truth Wikipedia entity URL
- `ret`: Dictionary of retrieved candidate entities with text snippets and URLs
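As a minimal sketch of how one entry might be consumed, the snippet below splits the `|`-separated `answer` string into aliases and separates the numbered candidates in `ret` from the `gt_url` entry. The helper names (`answer_aliases`, `ranked_candidates`, `is_correct`), the reading of the numeric keys as retrieval ranks, and the case-insensitive exact-match rule are illustrative assumptions, not the project's evaluation code. The inline `sample` is a trimmed copy of the entry above.

```python
import json

# Trimmed copy of the example entry above; in practice, load the full JSON array with json.load().
sample = json.loads("""
{
  "data_id": "infoseek_train_00734350",
  "answer": "SCT|UK-SC|Alba|Scot|UK-SCT|SC|Scotland|Scotland, United Kingdom|Caledonia",
  "dataset_image_ids": "oven_00404497",
  "wikipedia_url": "https://en.wikipedia.org/wiki/Hammer_throw",
  "ret": {
    "0": {"text": "...", "url": "Q34357085"},
    "1": {"text": "...", "url": "Q24951623"},
    "2": {"text": "...", "url": "https://en.wikipedia.org/wiki/Hammer_throw"},
    "gt_url": {"text": "...", "url": "https://en.wikipedia.org/wiki/Hammer_throw"}
  }
}
""")


def answer_aliases(entry):
    """Split the `|`-separated answer string into individual acceptable forms."""
    return [a.strip() for a in entry["answer"].split("|") if a.strip()]


def ranked_candidates(entry):
    """Return retrieved candidates ordered by numeric key, skipping `gt_url`.

    Assumption: the numeric keys ("0", "1", ...) reflect retrieval rank.
    """
    ret = entry["ret"]
    keys = sorted((k for k in ret if k.isdigit()), key=int)
    return [ret[k] for k in keys]


def is_correct(prediction, entry):
    """Hypothetical matching rule: case-insensitive exact match against any alias."""
    pred = prediction.strip().lower()
    return any(pred == alias.lower() for alias in answer_aliases(entry))


candidates = ranked_candidates(sample)
# 0-based positions at which the ground-truth page appears among the candidates.
gt_ranks = [i for i, c in enumerate(candidates) if c["url"] == sample["wikipedia_url"]]
```

Note that candidate `url` values are not uniform in the example (some look like Wikidata-style IDs such as `Q34357085`, others are Wikipedia URLs), so comparing against `wikipedia_url` only matches the latter.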
### KB Entity Format (`final_related_KB_reflectiVA_v2.json`)
The KB file is a **JSON array** of entity objects. Example structure:
```json
{
  "kb_id": 992707,
  "url": "https://en.wikipedia.org/wiki/Khindsi_Lake",
  "title": "Khindsi Lake",
  "cate": "unseen_kb",
  "KB_text": "Khindsi Lake\nKhindsi Lake is a lake near the city of Ramtek...",
  "section_titles": ["Khindsi Lake", "References"],
  "section_texts": ["Khindsi Lake is a lake near...", "\"Rajkamal Resorts\"."],
  "image_reference_descriptions": ["Khindsi Lake"],
  "image_section_indices": [0],
  "image_urls": ["/inspurfs/group/.../image/2e7a095c-604a-3819-af9c-d58b20287a52.jpg"],
  "image_path": "/inspurfs/group/.../image/2e7a095c-604a-3819-af9c-d58b20287a52.jpg"
}
```
**Key Fields:**
- `url`: Wikipedia URL (used as primary key: `kb_dict = {_['url']: _ for _ in kb}`)
- `title`: Wikipedia page title
- `KB_text`: Concatenated text from all sections
- `section_titles` / `section_texts`: Structured content by section
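- `image_reference_descriptions`: Image captions from Wikipedia
- `image_section_indices`: Index of the section each image belongs to
- `image_urls` / `image_path`: Local image paths
- `cate`: Category label (e.g., "unseen_kb")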
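Since `url` serves as the primary key, a typical first step is to index the KB array by URL and then look up the page for a training entry's `wikipedia_url`. The sketch below does this on a one-entry stand-in for the real file. The helper names `build_kb_index` and `sections_for` are hypothetical, and it assumes `section_titles` and `section_texts` are parallel lists, as the example suggests. Loading the full file with `json.load()` will likely need substantial memory given its listed size.

```python
# One-entry stand-in for the KB array (fields abridged from the example above).
kb = [
    {
        "kb_id": 992707,
        "url": "https://en.wikipedia.org/wiki/Khindsi_Lake",
        "title": "Khindsi Lake",
        "cate": "unseen_kb",
        "section_titles": ["Khindsi Lake", "References"],
        "section_texts": ["Khindsi Lake is a lake near...", '"Rajkamal Resorts".'],
    }
]


def build_kb_index(entries):
    """Index KB entries by Wikipedia URL, mirroring `{_['url']: _ for _ in kb}`."""
    return {entry["url"]: entry for entry in entries}


def sections_for(url, kb_index):
    """Return (section title, section text) pairs, or [] if the URL is not in the KB."""
    entry = kb_index.get(url)
    if entry is None:
        return []
    # Assumption: the two lists are aligned index by index.
    return list(zip(entry["section_titles"], entry["section_texts"]))


kb_dict = build_kb_index(kb)
sections = sections_for("https://en.wikipedia.org/wiki/Khindsi_Lake", kb_dict)
missing = sections_for("https://en.wikipedia.org/wiki/Hammer_throw", kb_dict)
```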
### Similarity Matrix Format (`*_kb_sim.json`)
The similarity matrices are **nested dictionaries** mapping entity URLs to similar entities:
```json
{
  "https://en.wikipedia.org/wiki/Khindsi_Lake": {
    "https://en.wikipedia.org/wiki/Boating_lake": 0.32811707,
    "https://en.wikipedia.org/wiki/Ramtek": 0.31777696,
    "https://en.wikipedia.org/wiki/Little_Lake_(Peterborough)": 0.24621595,
    "https://en.wikipedia.org/wiki/Khindsi_Lake": 1.0,
    // ... more similar entities with cosine similarity scores
  },
  // ... more entities
}
```
**Structure:**
- Outer key: Source entity URL
- Inner keys: Related entity URLs
- Values: Similarity scores (0-1, with 1.0 for self-similarity)
These matrices are used for **label propagation** during curriculum learning to spread confidence scores among similar entities.
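To make the idea concrete, here is a hedged sketch of one possible propagation step over this nested-dictionary layout: each entity's score is replaced by the similarity-weighted average of the scores in its row. This is a generic formulation chosen for illustration and is not confirmed to be the update rule Wiki-R1 uses. The names `top_k_neighbors`, `propagate_once`, and `scores` are hypothetical, and the second similarity row is made up so that the example is self-contained.

```python
A = "https://en.wikipedia.org/wiki/Khindsi_Lake"
B = "https://en.wikipedia.org/wiki/Ramtek"

# Abridged rows in the layout shown above. The A -> B value is copied from the
# example; the row for B is an assumption added for illustration.
kb_sim = {
    A: {A: 1.0, B: 0.31777696},
    B: {B: 1.0, A: 0.31777696},
}


def top_k_neighbors(url, sim, k=5):
    """Most similar entities to `url`, excluding the entity itself."""
    row = sim.get(url, {})
    others = [(u, s) for u, s in row.items() if u != url]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]


def propagate_once(scores, sim, default=0.0):
    """One similarity-weighted averaging step over each entity's row."""
    updated = {}
    for url, row in sim.items():
        total = sum(row.values())
        if total <= 0:
            # No usable weights: keep the current score unchanged.
            updated[url] = scores.get(url, default)
            continue
        weighted = sum(w * scores.get(other, default) for other, w in row.items())
        updated[url] = weighted / total
    return updated


scores = {A: 1.0, B: 0.0}  # hypothetical per-entity confidence scores
new_scores = propagate_once(scores, kb_sim)
```

After one step, part of the confidence of `A` has moved to its neighbor `B`, in proportion to their similarity.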
### Image Path Mapping (`oven_id2path.json`)
A **flat dictionary** mapping OVEN image IDs to relative paths:
```json
{
  "oven_00041623": "oven_images/00/oven_00041623.jpg",
  "oven_00041624": "oven_images/00/oven_00041624.jpg",
  "oven_00404497": "oven_images/04/oven_00404497.jpg",
  // ... ~2 million entries
}
```
**Usage in code:**
```python
import os

# `oven_id2path` is the dict loaded from oven_id2path.json; `item` is one training entry
img_path = oven_id2path[item['dataset_image_ids']]
full_path = os.path.join('data/source', img_path)
# e.g. data/source/oven_images/00/oven_00041623.jpg
```
## Usage
### Download Instructions
```bash
# Install huggingface_hub
pip install huggingface_hub
# Download all JSON files
huggingface-cli download Artanic30/Wiki_R1_Train --repo-type dataset --local-dir ./data/annotation
```
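The same download can also be scripted from Python. The following is a minimal sketch using `snapshot_download` from `huggingface_hub`; the wrapper `download_annotations` and its `patterns` parameter are illustrative names, not part of the project.

```python
REPO_ID = "Artanic30/Wiki_R1_Train"


def download_annotations(local_dir="./data/annotation", patterns=None):
    """Download the dataset repo (or only files matching `patterns`); return the local path."""
    # Imported lazily so this module loads even when huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=patterns,  # None downloads every file in the repo
    )


# Example (requires network access): fetch only the image path mapping.
# path = download_annotations(patterns=["oven_id2path.json"])
```

Restricting `patterns` can be useful here because the knowledge base file alone is listed at 5.9 GB.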
### Directory Structure After Download
```
data/annotation/
├── merge_infoseek_train_filtered_balance_sample20k_top5_and_evqa_train_sample20k_top5_w_I2T.json
├── final_related_KB_reflectiVA_v2.json
├── oven_id2path.json
├── final_data_v2_kb_sim.json
├── final_data_sentbert_kb_sim.json
├── final_data_v2_question_sim.json
├── KB_sim_42b8eabda30759e781b4b0b4a3842fc2.json
└── KB_sim_5863b7489960bcc4d270c6399b4fe819.json
```
**Note**: The actual image files are not included in this dataset. You need to download OVEN, Infoseek, and EVQA image datasets separately and organize them according to `oven_id2path.json`.
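Once the images are in place, it may help to check that every mapped path actually exists before training. The sketch below is one way to do that; `missing_images` is a hypothetical helper, and the temporary directory merely stands in for an image root such as `data/source`.

```python
import os
import tempfile


def missing_images(oven_id2path, image_root):
    """Return the sorted image IDs whose mapped file does not exist under `image_root`."""
    return sorted(
        image_id
        for image_id, rel_path in oven_id2path.items()
        if not os.path.isfile(os.path.join(image_root, rel_path))
    )


# Self-contained demo: two mapping entries, only one of which exists on disk.
mapping = {
    "oven_00041623": "oven_images/00/oven_00041623.jpg",
    "oven_00404497": "oven_images/04/oven_00404497.jpg",
}
with tempfile.TemporaryDirectory() as root:
    present = os.path.join(root, mapping["oven_00041623"])
    os.makedirs(os.path.dirname(present))
    open(present, "wb").close()  # empty placeholder file, not a real image
    missing = missing_images(mapping, root)
```

With the real mapping (about 2 million entries), the same function can be called on the dict loaded from `oven_id2path.json`.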
## Citation
If you use this dataset, please cite the Wiki-R1 project:
```bibtex
@article{ning2026wiki,
title={Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum},
author={Ning, Shan and Qiu, Longtian and He, Xuming},
journal={arXiv preprint arXiv:2603.05256},
year={2026}
}
```
## License
Apache-2.0