PassionPrc/hotpot_RL

Name: PassionPrc/hotpot_RL
Creator: PassionPrc
Published: 2026-04-15 16:12:04
License: No description provided

Source: Hugging Face. Updated 2026-04-15. Indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/PassionPrc/hotpot_RL


Resource description:

---
license: apache-2.0
task_categories:
- question-answering
- visual-question-answering
language:
- en
tags:
- hotpotqa
- multi-hop-reasoning
- vlm
- reinforcement-learning
- document-understanding
size_categories:
- 10K<n<100K
---

# HotpotQA-RL: Merged-Context Dataset for Vision-Language Models

A reformatted version of [HotpotQA](https://hotpotqa.github.io/) designed for Reinforcement Learning with Vision-Language Models. Every **two** adjacent questions share one merged context, doubling the document length and increasing retrieval difficulty.

## What's Different from Original HotpotQA

- **Merged contexts**: Every 2 consecutive items share the same context (20 paragraphs, ~1,800 words), simulating longer documents
- **Image format**: Context text is rendered as PNG images (Verdana 9pt, A4 layout) for VLM input
- **Embedded binary**: Images are stored directly as `bytes` inside the parquet files, so the dataset is fully self-contained and needs no external files

## Dataset Statistics

| Split | Items | Shared-Context Pairs | Avg Words/Context | Avg Image Size | File Size |
|-------|-------|---------------------|-------------------|----------------|-----------|
| `train` | 90,447 | 45,223 | 1,834 | 365 KB | 31.2 GB |
| `dev_distractor` | 7,405 | 3,702 | 1,855 | 369 KB | 2.5 GB |
| `dev_fullwiki` | 7,405 | 3,702 | 1,904 | 377 KB | 2.5 GB |

### Context Length Distribution

| Metric | train | dev_distractor | dev_fullwiki |
|--------|-------|---------------|--------------|
| Min words | 250 | 709 | 616 |
| Max words | 3,994 | 3,703 | 3,838 |
| Median words | 1,814 | 1,840 | 1,884 |
| Min chars | 1,598 | 4,448 | 3,871 |
| Max chars | 24,987 | 22,878 | 23,620 |

### Answer Distribution

- ~93.9% span answers (entity names, dates, etc.)
- ~6.1% yes/no answers

### Evidence (Golden Supporting Facts)

| Metric | train | dev_distractor | dev_fullwiki |
|--------|-------|---------------|--------------|
| Min evidence sentences | 2 | 2 | 0 |
| Max evidence sentences | 12 | 8 | 7 |
| Mean evidence sentences | 2.4 | 2.4 | 1.4 |

> Note: `dev_fullwiki` has some empty evidence lists because the supporting facts reference paragraphs that may not exist in the fullwiki-retrieved context.

## Data Format

Each parquet file contains the following columns:

| Column | Type | Description |
|--------|------|-------------|
| `id` | `string` | Unique question ID (from original HotpotQA) |
| `question` | `string` | The question text |
| `context` | `string` | Full context text (title + sentences from all paragraphs, separated by newlines) |
| `context_img` | `bytes` | PNG image of the rendered context text (binary) |
| `evidence` | `list[string]` | Golden evidence sentences (supporting facts) |
| `answer` | `string` | Ground-truth answer |

### Shared Context

Items are paired: every two consecutive rows (index 0&1, 2&3, ...) share the **same** `context` and `context_img`. Each item has its own `question`, `answer`, and `evidence`.

### Example

```python
import io

import pandas as pd
from PIL import Image

# Load one split; each row is a single question with its (shared) context.
df = pd.read_parquet("hotpot_train.parquet")
row = df.iloc[0]

print(row['question'])   # "Which magazine was started first..."
print(row['answer'])     # "Arthur's Magazine"
print(row['evidence'])   # ["Arthur's Magazine (1844–1846)...", "First for Women is..."]

# View the context image (PNG bytes decoded with Pillow)
img = Image.open(io.BytesIO(row['context_img']))
img.show()
```

## Files

```
├── hotpot_train.parquet              # Training set (90,447 items)
├── hotpot_dev_distractor.parquet     # Dev set - distractor setting (7,405 items)
└── hotpot_dev_fullwiki.parquet       # Dev set - fullwiki setting (7,405 items)
```

## Source

- Original dataset: [HotpotQA](https://hotpotqa.github.io/) (Yang et al., 2018)
- Distractor vs Fullwiki: Both dev sets share the same questions but differ in context: distractor provides 10 paragraphs (2 gold + 8 distractors), while fullwiki retrieves paragraphs from Wikipedia

## Citation

```bibtex
@inproceedings{yang2018hotpotqa,
  title={HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  year={2018}
}
```
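
## Iterating Over Shared-Context Pairs (Sketch)

The row pairing described under Shared Context (index 0&1, 2&3, ...) can be walked with a small helper. This is a minimal sketch, not part of the dataset's tooling: the function name `iter_context_pairs` and the toy rows are hypothetical, and it uses only the standard library so it runs without the parquet files. The `train` split lists an odd item count (90,447), and the card does not say how the final row is handled; treating it as an unpaired trailing row is an assumption made here.

```python
from typing import Iterator, Mapping, Optional, Sequence, Tuple

Row = Mapping[str, object]


def iter_context_pairs(rows: Sequence[Row]) -> Iterator[Tuple[Row, Optional[Row]]]:
    """Yield (first, second) for rows 0&1, 2&3, ...

    `second` is None when the sequence has an odd length and the last
    row has no partner (an assumption, see the note above).
    Raises ValueError if the two rows of a pair have different contexts.
    """
    for i in range(0, len(rows), 2):
        first = rows[i]
        second = rows[i + 1] if i + 1 < len(rows) else None
        if second is not None and first["context"] != second["context"]:
            raise ValueError(f"rows {i} and {i + 1} do not share a context")
        yield first, second


# Toy rows standing in for parquet records (hypothetical values).
toy_rows = [
    {"id": "a", "question": "Q1?", "context": "ctx-0", "answer": "yes"},
    {"id": "b", "question": "Q2?", "context": "ctx-0", "answer": "Paris"},
    {"id": "c", "question": "Q3?", "context": "ctx-1", "answer": "1844"},
]

pairs = list(iter_context_pairs(toy_rows))
for first, second in pairs:
    partner_id = second["id"] if second is not None else None
    print(first["id"], partner_id)
```

With real data, rows in this shape can be obtained from pandas via `df[["id", "question", "context", "answer"]].to_dict("records")`. Leaving out `context_img` keeps the records small, since the image bytes are identical within a pair.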

Provider:

PassionPrc
