
Artanic30/Wiki_R1_Train

Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Artanic30/Wiki_R1_Train
Resource description:
# Wiki-R1 Training Dataset

**Dataset Link**: [https://huggingface.co/datasets/Artanic30/Wiki_R1_Train](https://huggingface.co/datasets/Artanic30/Wiki_R1_Train)

## Overview

This dataset contains training annotations and auxiliary files for the Wiki-R1 project, a reinforcement learning framework for Knowledge-Intensive Visual Question Answering (KIVQA). The dataset combines Infoseek and EVQA data with knowledge base retrieval and label propagation support.

## Dataset Structure

The dataset consists of the following JSON annotation files:

### 📁 Main Training Data

| File | Size | Description |
|------|------|-------------|
| `merge_infoseek_train_filtered_balance_sample20k_top5_and_evqa_train_sample20k_top5_w_I2T.json` | 330 MB | **Primary training set** combining 20k Infoseek samples + 20k EVQA samples. Each entry contains: <br>- Question and answer pairs<br>- Top-5 retrieved KB candidates<br>- Image metadata (OVEN IDs)<br>- Entity information for curriculum learning |

### 📁 Knowledge Base & Entity Mapping

| File | Size | Description |
|------|------|-------------|
| `final_related_KB_reflectiVA_v2.json` | 5.9 GB | **Knowledge base entries** containing: <br>- Wikipedia URLs<br>- Entity descriptions and text<br>- Visual-textual alignment metadata<br>Used for entity grounding and answer verification |
| `oven_id2path.json` | 299 MB | **Image path mapping** that maps OVEN dataset image IDs to local file paths. Required for loading visual data during training. |

### 📁 Label Propagation & Similarity Matrices

| File | Size | Description |
|------|------|-------------|
| `final_data_v2_kb_sim.json` | 1.2 GB | **KB entity similarity matrix** (primary). Pre-computed pairwise similarities between knowledge base entities using CLIP embeddings. Used for label propagation to improve curriculum learning. |

## Data Format

### Training Data Format (`merge_...json`)

The training data is a **JSON array** where each entry represents a visual question-answering sample.
Example structure:

```json
{
  "data_id": "infoseek_train_00734350",
  "question": "Which country does this sport come from?",
  "question_original": "Which country does this sport come from?",
  "question_type": "infoseek",
  "answer": "SCT|UK-SC|Alba|Scot|UK-SCT|SC|Scotland|Scotland, United Kingdom|Caledonia",
  "dataset_name": "infoseek",
  "dataset_image_ids": "oven_00404497",
  "wikipedia_url": "https://en.wikipedia.org/wiki/Hammer_throw",
  "wikipedia_title": "Hammer throw",
  "wikipedia_url_used_in_train": "0",
  "encyclopedic_vqa_split": "0",
  "dataset_category_id": "0",
  "evidence": "0",
  "evidence_section_id": "0",
  "evidence_section_title": "0",
  "ret": {
    "0": { "text": "...", "url": "Q34357085" },
    "1": { "text": "...", "url": "Q24951623" },
    "2": { "text": "...", "url": "https://en.wikipedia.org/wiki/Hammer_throw" },
    // ... up to top-10 retrieved candidates
    "gt_url": { "text": "...", "url": "https://en.wikipedia.org/wiki/Hammer_throw" }
  }
}
```

**Key Fields:**

- `data_id`: Unique identifier (format: `{dataset}_{split}_{id}`)
- `question`: Question text
- `answer`: Answer string (may contain multiple acceptable forms separated by `|`)
- `dataset_image_ids`: OVEN image ID (used to look up the image path via `oven_id2path.json`)
- `wikipedia_url`: Ground truth Wikipedia entity URL
- `ret`: Dictionary of retrieved candidate entities with text snippets and URLs

### KB Entity Format (`final_related_KB_reflectiVA_v2.json`)

The KB file is a **JSON array** of entity objects.
Example structure:

```json
{
  "kb_id": 992707,
  "url": "https://en.wikipedia.org/wiki/Khindsi_Lake",
  "title": "Khindsi Lake",
  "cate": "unseen_kb",
  "KB_text": "Khindsi Lake\nKhindsi Lake is a lake near the city of Ramtek...",
  "section_titles": ["Khindsi Lake", "References"],
  "section_texts": ["Khindsi Lake is a lake near...", "\"Rajkamal Resorts\"."],
  "image_reference_descriptions": ["Khindsi Lake"],
  "image_section_indices": [0],
  "image_urls": ["/inspurfs/group/.../image/2e7a095c-604a-3819-af9c-d58b20287a52.jpg"],
  "image_path": "/inspurfs/group/.../image/2e7a095c-604a-3819-af9c-d58b20287a52.jpg"
}
```

**Key Fields:**

- `url`: Wikipedia URL (used as primary key: `kb_dict = {_['url']: _ for _ in kb}`)
- `title`: Wikipedia page title
- `KB_text`: Concatenated text from all sections
- `section_titles` / `section_texts`: Structured content by section
- `image_reference_descriptions`: Image captions from Wikipedia
- `image_section_indices`: Index of the section each image belongs to
- `image_urls` / `image_path`: Local image paths
- `cate`: Category label (e.g., "unseen_kb")

### Similarity Matrix Format (`*_kb_sim.json`)

The similarity matrices are **nested dictionaries** mapping entity URLs to similar entities:

```json
{
  "https://en.wikipedia.org/wiki/Khindsi_Lake": {
    "https://en.wikipedia.org/wiki/Boating_lake": 0.32811707,
    "https://en.wikipedia.org/wiki/Ramtek": 0.31777696,
    "https://en.wikipedia.org/wiki/Little_Lake_(Peterborough)": 0.24621595,
    "https://en.wikipedia.org/wiki/Khindsi_Lake": 1.0,
    // ... more similar entities with cosine similarity scores
  },
  // ... more entities
}
```

**Structure:**

- Outer key: Source entity URL
- Inner keys: Related entity URLs
- Values: Similarity scores (0-1, with 1.0 for self-similarity)

These matrices are used for **label propagation** during curriculum learning to spread confidence scores among similar entities.
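As a minimal sketch of how the nested similarity dictionary might be queried, the snippet below ranks an entity's neighbors by score and then spreads per-entity confidence values with a similarity-weighted average. The helper names (`top_neighbors`, `propagate`), the `min_sim` threshold, and the weighted-average rule are illustrative assumptions; the dataset card does not specify how Wiki-R1 actually performs label propagation.

```python
from typing import Dict, List, Tuple

SimMatrix = Dict[str, Dict[str, float]]

WIKI = "https://en.wikipedia.org/wiki/"

# Tiny in-memory stand-in for a *_kb_sim.json file (scores copied from the example).
sim: SimMatrix = {
    WIKI + "Khindsi_Lake": {
        WIKI + "Boating_lake": 0.32811707,
        WIKI + "Ramtek": 0.31777696,
        WIKI + "Little_Lake_(Peterborough)": 0.24621595,
        WIKI + "Khindsi_Lake": 1.0,
    }
}


def top_neighbors(sim: SimMatrix, url: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most similar entities to `url`, excluding `url` itself."""
    row = sim.get(url, {})
    others = [(u, s) for u, s in row.items() if u != url]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]


def propagate(sim: SimMatrix, conf: Dict[str, float], min_sim: float = 0.0) -> Dict[str, float]:
    """Hypothetical rule: similarity-weighted average of known confidences.

    Entities without any scored, known neighbor keep their own value (or 0.0).
    """
    out: Dict[str, float] = {}
    for url, row in sim.items():
        num = den = 0.0
        for other, score in row.items():
            if other in conf and score >= min_sim:
                num += score * conf[other]
                den += score
        out[url] = num / den if den > 0 else conf.get(url, 0.0)
    return out


neighbors = top_neighbors(sim, WIKI + "Khindsi_Lake", k=2)
smoothed = propagate(sim, {WIKI + "Khindsi_Lake": 1.0, WIKI + "Ramtek": 0.0})
print(neighbors)
print(smoothed)
```

Because each row includes the entity's own self-similarity of 1.0, this particular rule keeps an entity's own confidence as the dominant term; a real pipeline could just as well exclude the self entry or normalize differently.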
### Image Path Mapping (`oven_id2path.json`)

A **flat dictionary** mapping OVEN image IDs to relative paths:

```json
{
  "oven_00041623": "oven_images/00/oven_00041623.jpg",
  "oven_00041624": "oven_images/00/oven_00041624.jpg",
  "oven_00404497": "oven_images/04/oven_00404497.jpg",
  // ... ~2 million entries
}
```

**Usage in code:**

```python
import os

# oven_id2path: dict loaded from oven_id2path.json; item: one training entry
img_path = oven_id2path[item['dataset_image_ids']]
full_path = os.path.join('data/source', img_path)
# Results in: data/source/oven_images/00/oven_00041623.jpg
```

## Usage

### Download Instructions

```bash
# Install huggingface_hub
pip install huggingface_hub

# Download all JSON files
huggingface-cli download Artanic30/Wiki_R1_Train --repo-type dataset --local-dir ./data/annotation
```

### Directory Structure After Download

```
data/annotation/
├── merge_infoseek_train_filtered_balance_sample20k_top5_and_evqa_train_sample20k_top5_w_I2T.json
├── final_related_KB_reflectiVA_v2.json
├── oven_id2path.json
├── final_data_v2_kb_sim.json
├── final_data_sentbert_kb_sim.json
├── final_data_v2_question_sim.json
├── KB_sim_42b8eabda30759e781b4b0b4a3842fc2.json
└── KB_sim_5863b7489960bcc4d270c6399b4fe819.json
```

**Note**: The actual image files are not included in this dataset. You need to download the OVEN, Infoseek, and EVQA image datasets separately and organize them according to `oven_id2path.json`.

## Citation

If you use this dataset, please cite the Wiki-R1 project:

```bibtex
@article{ning2026wiki,
  title={Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum},
  author={Ning, Shan and Qiu, Longtian and He, Xuming},
  journal={arXiv preprint arXiv:2603.05256},
  year={2026}
}
```

## License

Apache-2.0
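## Example: Unpacking a Training Entry

Returning to the data formats described earlier, the sketch below shows one way a training entry might be unpacked: splitting the `|`-separated answer string, listing the numbered `ret` candidates in rank order, locating the ground-truth URL among them, and resolving the image path. The helper names (`split_answers`, `ranked_candidate_urls`, `resolve_image`) are hypothetical, the `data/source` root is taken from the usage snippet, and the code assumes `dataset_image_ids` holds a single ID string as in the example entry.

```python
import os
from typing import Dict, List, Optional

# One training entry, abridged from the example structure in this card.
entry = {
    "data_id": "infoseek_train_00734350",
    "answer": "SCT|UK-SC|Alba|Scot|UK-SCT|SC|Scotland|Scotland, United Kingdom|Caledonia",
    "dataset_image_ids": "oven_00404497",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Hammer_throw",
    "ret": {
        "0": {"text": "...", "url": "Q34357085"},
        "1": {"text": "...", "url": "Q24951623"},
        "2": {"text": "...", "url": "https://en.wikipedia.org/wiki/Hammer_throw"},
        "gt_url": {"text": "...", "url": "https://en.wikipedia.org/wiki/Hammer_throw"},
    },
}

# Stand-in for the full oven_id2path.json mapping.
oven_id2path: Dict[str, str] = {"oven_00404497": "oven_images/04/oven_00404497.jpg"}


def split_answers(answer: str) -> List[str]:
    """Split the `|`-separated answer string into its acceptable forms."""
    return [a.strip() for a in answer.split("|") if a.strip()]


def ranked_candidate_urls(ret: Dict[str, Dict[str, str]]) -> List[str]:
    """URLs of the numbered candidates in rank order, skipping the `gt_url` entry."""
    ranks = sorted((key for key in ret if key.isdigit()), key=int)
    return [ret[key]["url"] for key in ranks]


def resolve_image(image_id: str, id2path: Dict[str, str], root: str = "data/source") -> str:
    """Join the assumed image root with the relative path from the mapping."""
    return os.path.join(root, id2path[image_id])


answers = split_answers(entry["answer"])
urls = ranked_candidate_urls(entry["ret"])
# Rank of the ground-truth page among the retrieved candidates, or None if absent.
gt_rank: Optional[int] = (
    urls.index(entry["wikipedia_url"]) if entry["wikipedia_url"] in urls else None
)
image_path = resolve_image(entry["dataset_image_ids"], oven_id2path)
print(answers, gt_rank, image_path)
```

Knowing whether the ground-truth page appears among the retrieved candidates, and at which rank, is one plausible signal for ordering samples by difficulty, though the card does not say whether Wiki-R1 uses it that way.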

Provided by:
Artanic30