Artanic30/Wiki_R1_Train

Name: Artanic30/Wiki_R1_Train
Creator: Artanic30
Published: 2026-04-07 16:31:01
License: apache-2.0 (as stated in the dataset card below)

Source: Hugging Face. Updated: 2026-04-07. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/Artanic30/Wiki_R1_Train

# Wiki-R1 Training Dataset

**Dataset Link**: [https://huggingface.co/datasets/Artanic30/Wiki_R1_Train](https://huggingface.co/datasets/Artanic30/Wiki_R1_Train)

## Overview

This dataset contains training annotations and auxiliary files for the Wiki-R1 project, a reinforcement learning framework for Knowledge-Intensive Visual Question Answering (KIVQA). The dataset combines Infoseek and EVQA data with knowledge base retrieval and label propagation support.

## Dataset Structure

The dataset consists of the following JSON annotation files:

### 📁 Main Training Data

| File | Size | Description |
|------|------|-------------|
| `merge_infoseek_train_filtered_balance_sample20k_top5_and_evqa_train_sample20k_top5_w_I2T.json` | 330 MB | **Primary training set** combining 20k Infoseek samples + 20k EVQA samples. Each entry contains question and answer pairs, top-5 retrieved KB candidates, image metadata (OVEN IDs), and entity information for curriculum learning. |

### 📁 Knowledge Base & Entity Mapping

| File | Size | Description |
|------|------|-------------|
| `final_related_KB_reflectiVA_v2.json` | 5.9 GB | **Knowledge base entries** containing Wikipedia URLs, entity descriptions and text, and visual-textual alignment metadata. Used for entity grounding and answer verification. |
| `oven_id2path.json` | 299 MB | **Image path mapping** that maps OVEN dataset image IDs to local file paths. Required for loading visual data during training. |

### 📁 Label Propagation & Similarity Matrices

| File | Size | Description |
|------|------|-------------|
| `final_data_v2_kb_sim.json` | 1.2 GB | **KB entity similarity matrix** (primary). Pre-computed pairwise similarities between knowledge base entities using CLIP embeddings. Used for label propagation to improve curriculum learning. |

## Data Format

### Training Data Format (`merge_...json`)

The training data is a **JSON array** where each entry represents a visual question-answering sample.
Example structure:

```json
{
  "data_id": "infoseek_train_00734350",
  "question": "Which country does this sport come from?",
  "question_original": "Which country does this sport come from?",
  "question_type": "infoseek",
  "answer": "SCT|UK-SC|Alba|Scot|UK-SCT|SC|Scotland|Scotland, United Kingdom|Caledonia",
  "dataset_name": "infoseek",
  "dataset_image_ids": "oven_00404497",
  "wikipedia_url": "https://en.wikipedia.org/wiki/Hammer_throw",
  "wikipedia_title": "Hammer throw",
  "wikipedia_url_used_in_train": "0",
  "encyclopedic_vqa_split": "0",
  "dataset_category_id": "0",
  "evidence": "0",
  "evidence_section_id": "0",
  "evidence_section_title": "0",
  "ret": {
    "0": { "text": "...", "url": "Q34357085" },
    "1": { "text": "...", "url": "Q24951623" },
    "2": { "text": "...", "url": "https://en.wikipedia.org/wiki/Hammer_throw" },
    // ... up to top-10 retrieved candidates
    "gt_url": { "text": "...", "url": "https://en.wikipedia.org/wiki/Hammer_throw" }
  }
}
```

**Key Fields:**

- `data_id`: Unique identifier (format: `{dataset}_{split}_{id}`)
- `question`: Question text
- `answer`: Answer string (may contain multiple acceptable forms separated by `|`)
- `dataset_image_ids`: OVEN image ID (used to look up the path via `oven_id2path.json`)
- `wikipedia_url`: Ground truth Wikipedia entity URL
- `ret`: Dictionary of retrieved candidate entities with text snippets and URLs

### KB Entity Format (`final_related_KB_reflectiVA_v2.json`)

The KB file is a **JSON array** of entity objects.
Example structure:

```json
{
  "kb_id": 992707,
  "url": "https://en.wikipedia.org/wiki/Khindsi_Lake",
  "title": "Khindsi Lake",
  "cate": "unseen_kb",
  "KB_text": "Khindsi Lake\nKhindsi Lake is a lake near the city of Ramtek...",
  "section_titles": ["Khindsi Lake", "References"],
  "section_texts": ["Khindsi Lake is a lake near...", "\"Rajkamal Resorts\"."],
  "image_reference_descriptions": ["Khindsi Lake"],
  "image_section_indices": [0],
  "image_urls": ["/inspurfs/group/.../image/2e7a095c-604a-3819-af9c-d58b20287a52.jpg"],
  "image_path": "/inspurfs/group/.../image/2e7a095c-604a-3819-af9c-d58b20287a52.jpg"
}
```

**Key Fields:**

- `url`: Wikipedia URL (used as primary key: `kb_dict = {_['url']: _ for _ in kb}`)
- `title`: Wikipedia page title
- `KB_text`: Concatenated text from all sections
- `section_titles` / `section_texts`: Structured content by section
- `image_reference_descriptions`: Image captions from Wikipedia
- `image_section_indices`: Index of the section each image belongs to
- `image_urls` / `image_path`: Local image paths
- `cate`: Category label (e.g., "unseen_kb")

### Similarity Matrix Format (`*_kb_sim.json`)

The similarity matrices are **nested dictionaries** mapping entity URLs to similar entities:

```json
{
  "https://en.wikipedia.org/wiki/Khindsi_Lake": {
    "https://en.wikipedia.org/wiki/Boating_lake": 0.32811707,
    "https://en.wikipedia.org/wiki/Ramtek": 0.31777696,
    "https://en.wikipedia.org/wiki/Little_Lake_(Peterborough)": 0.24621595,
    "https://en.wikipedia.org/wiki/Khindsi_Lake": 1.0,
    // ... more similar entities with cosine similarity scores
  },
  // ... more entities
}
```

**Structure:**

- Outer key: Source entity URL
- Inner keys: Related entity URLs
- Values: Similarity scores (0-1, with 1.0 for self-similarity)

These matrices are used for **label propagation** during curriculum learning to spread confidence scores among similar entities.
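The nested-dictionary layout described above supports direct lookups by entity URL. The following is a minimal sketch, not the Wiki-R1 implementation: `top_k_neighbors` and `propagate_once` are hypothetical helper names, and the similarity-weighted averaging rule is an assumption, since this card does not specify how confidence scores are propagated. The toy matrix reuses the values from the example above.

```python
from typing import Dict, List, Tuple

# Outer key: source entity URL; inner dict: related entity URL -> similarity score.
SimMatrix = Dict[str, Dict[str, float]]


def top_k_neighbors(sim: SimMatrix, url: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most similar entities to `url`, excluding the entity itself."""
    neighbors = [(other, score) for other, score in sim.get(url, {}).items() if other != url]
    return sorted(neighbors, key=lambda pair: pair[1], reverse=True)[:k]


def propagate_once(
    sim: SimMatrix, scores: Dict[str, float], default: float = 0.0
) -> Dict[str, float]:
    """One similarity-weighted averaging step over each entity's row.

    Hypothetical rule for illustration only; entities without a known
    score contribute `default`.
    """
    propagated: Dict[str, float] = {}
    for url, row in sim.items():
        total_weight = sum(row.values())
        if total_weight <= 0:
            # No usable neighbors: keep the entity's own score unchanged.
            propagated[url] = scores.get(url, default)
            continue
        weighted = sum(weight * scores.get(other, default) for other, weight in row.items())
        propagated[url] = weighted / total_weight
    return propagated


KHINDSI = "https://en.wikipedia.org/wiki/Khindsi_Lake"

# Toy matrix built from the example entry shown in this section.
sim: SimMatrix = {
    KHINDSI: {
        "https://en.wikipedia.org/wiki/Boating_lake": 0.32811707,
        "https://en.wikipedia.org/wiki/Ramtek": 0.31777696,
        "https://en.wikipedia.org/wiki/Little_Lake_(Peterborough)": 0.24621595,
        KHINDSI: 1.0,
    }
}

if __name__ == "__main__":
    print(top_k_neighbors(sim, KHINDSI, k=2))
    print(propagate_once(sim, {KHINDSI: 1.0}))
```

For the real 1.2 GB file, the same functions would apply to the dictionary returned by `json.load`, assuming the file on disk is plain JSON without the `// ...` placeholder comments used in the example.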
### Image Path Mapping (`oven_id2path.json`)

A **flat dictionary** mapping OVEN image IDs to relative paths:

```json
{
  "oven_00041623": "oven_images/00/oven_00041623.jpg",
  "oven_00041624": "oven_images/00/oven_00041624.jpg",
  "oven_00404497": "oven_images/04/oven_00404497.jpg",
  // ... ~2 million entries
}
```

**Usage in code:**

```python
import os

# `oven_id2path` is the dict loaded from oven_id2path.json; `item` is one training entry
img_path = oven_id2path[item['dataset_image_ids']]
full_path = os.path.join('data/source', img_path)
# Results in, e.g.: data/source/oven_images/00/oven_00041623.jpg
```

## Usage

### Download Instructions

```bash
# Install huggingface_hub
pip install huggingface_hub

# Download all JSON files
huggingface-cli download Artanic30/Wiki_R1_Train --repo-type dataset --local-dir ./data/annotation
```

### Directory Structure After Download

```
data/annotation/
├── merge_infoseek_train_filtered_balance_sample20k_top5_and_evqa_train_sample20k_top5_w_I2T.json
├── final_related_KB_reflectiVA_v2.json
├── oven_id2path.json
├── final_data_v2_kb_sim.json
├── final_data_sentbert_kb_sim.json
├── final_data_v2_question_sim.json
├── KB_sim_42b8eabda30759e781b4b0b4a3842fc2.json
└── KB_sim_5863b7489960bcc4d270c6399b4fe819.json
```

**Note**: The actual image files are not included in this dataset. You need to download the OVEN, Infoseek, and EVQA image datasets separately and organize them according to `oven_id2path.json`.

## Citation

If you use this dataset, please cite the Wiki-R1 project:

```bibtex
@article{ning2026wiki,
  title={Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum},
  author={Ning, Shan and Qiu, Longtian and He, Xuming},
  journal={arXiv preprint arXiv:2603.05256},
  year={2026}
}
```

## License

License: apache-2.0

