Orsta-Data-47k
收藏魔搭社区2026-05-15 更新2025-06-07 收录
下载链接:
https://modelscope.cn/datasets/One-RL-to-See-Them-All/Orsta-Data-47k
下载链接
链接失效反馈官方服务:
资源简介:
# Orsta-Data-47k Dataset
* 🐙 **GitHub Repo:** [MiniMax-AI/One-RL-to-See-Them-All](https://github.com/MiniMax-AI/One-RL-to-See-Them-All)
* 📜 **Paper (arXiv):** [V-Triune: One RL to See Them All (arXiv:2505.18129)](https://arxiv.org/abs/2505.18129)
## Dataset Description 📖
**Orsta-Data-47k** is a specialized dataset curated for the post-training of Vision-Language Models (VLMs) using our [V-Triune](https://github.com/MiniMax-AI/One-RL-to-See-Them-All) unified reinforcement learning system. Its primary purpose is to enable robust joint training across a diverse spectrum of both visual reasoning and visual perception tasks, powering models like [Orsta](https://huggingface.co/collections/One-RL-to-See-Them-All/one-rl-to-see-them-all-6833d27abce23898b2f9815a) to achieve advanced multimodal capabilities.
This dataset is a carefully selected aggregation from 18 publicly available datasets, refined through a rigorous filtering process to ensure high quality and suitability for RL-based fine-tuning.
## Tasks Covered 🎯
The dataset is structured to cover eight principal task categories, balanced between reasoning and perception:
* **Visual Reasoning Tasks 🤔:**
* Mathematics (Math QA)
* Puzzle Solving (Visual Puzzles)
* Science Question Answering (Science QA)
* Chart Understanding (Chart QA)
* **Visual Perception Tasks 👁️:**
* Object Detection
* Visual Grounding
* Object Counting
* Optical Character Recognition (OCR)
## Data Curation Process 🛠️
To create a high-quality corpus for effective RL post-training, we implemented a comprehensive two-stage data curation pipeline:
1. **Rule-based Filtering:** An initial filtering stage applied a set of predefined rules to the source datasets. These rules were tailored differently for reasoning and perception tasks, aiming to remove noisy samples, questions prone to "hacking" (e.g., certain multiple-choice formats), and problematic answer formats. For perception tasks, this also involved standardizing coordinate systems and filtering based on object size or count.
2. **Difficulty-based Filtering:** Following rule-based cleaning, a difficulty filter was applied. This stage removed samples deemed too easy (e.g., already solvable by baseline models) or excessively hard, ensuring that the remaining data provides a meaningful and efficient learning signal for the models.
This meticulous process resulted in a high-quality collection of approximately **47,700 samples**. To address potential dataset imbalances, data for certain tasks (e.g., puzzles) was strategically duplicated to ensure adequate representation.
## Dataset Composition & Structure 📊
* **Total Samples:** ~47.7K
* **Task Categories:** 8 (4 reasoning, 4 perception)
* **Aggregated From:** 18 distinct public datasets
* **Content Breakdown:**
* Visual Perception Samples: ~20.6K
* Visual Reasoning Samples: ~27.1K
* **Interaction Format:** The data primarily consists of single-image, single-turn conversational interactions (e.g., an image paired with a question and its corresponding answer/grounding).
* **Storage Format:** All curated data is stored in the efficient Parquet format.
## Intended Use & Training 🚀
This dataset is designed for use with the [V-Triune](https://github.com/MiniMax-AI/One-RL-to-See-Them-All) framework for reinforcement learning-based post-training of VLMs. In the training of [Orsta](https://huggingface.co/collections/One-RL-to-See-Them-All/one-rl-to-see-them-all-6833d27abce23898b2f9815a) models, all samples from this dataset were uniformly mixed and utilized.
## Dataset Usage
This section outlines how to download and use the Orsta-Data-47k dataset.
### Downloading the Dataset
You can download the dataset directly from the Hugging Face Hub using the `huggingface-cli` tool. Make sure you have `huggingface_hub` installed (`pip install huggingface_hub`).
Execute the following command in your terminal:
```bash
huggingface-cli download --repo-type dataset --resume-download One-RL-to-See-Them-All/Orsta-Data-47k --local-dir Orsta-Data-47k
```
This command will download all dataset files into a local directory named `Orsta-Data-47k`. The `--resume-download` flag is useful for resuming downloads if interrupted.
### Dataset Structure
Once downloaded, the dataset will have the following structure within the `Orsta-Data-47k` directory. All data files are in the Parquet (`.parquet`) format.
```
Orsta-Data-47k/
├── test/
│ ├── test_chart_megabench_176.parquet
......
│ └── test_science_megabench_91.parquet
└── train/
├── train_chart_chartqapro_498.parquet
......
└── train_science_virl39k_2539.parquet
```
### File Naming Convention
The files within the `train/` and `test/` directories follow this naming convention:
`{split}_{task_name}_{source_name}_{num}.parquet`
Where:
* `{split}`: Indicates the dataset split, either `train` or `test`.
* `{task_name}`: Specifies the general task category.
* `{source_name}`: Denotes the specific benchmark or origin of the data.
* `{num}`: Represents the number of samples contained within that Parquet file.
### Purpose of Each Split
* **`train/` directory**: These files constitute the training corpus for the Orsta model.
* **`test/` directory**: These files contain samples specifically curated for online evaluation of the model's performance on different tasks *during* the training process. Analyzing performance on these samples helps in diagnosing the training status and understanding the model's learning progression for each task category.
### Data Format
```python
{
'data_source': Value(dtype='string', id=None),
'images': Sequence(feature=Image(mode=None, decode=True, id=None), length=-1, id=None),
'prompt': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}],
'ability': Value(dtype='string', id=None),
'reward_model': {
'answer': Value(dtype='string', id=None),
'ground_truth': Value(dtype='string', id=None),
'accuracy_ratio': Value(dtype='float32', id=None),
'format_ratio': Value(dtype='float32', id=None),
'verifier': Value(dtype='string', id=None),
'verifier_parm': {
'det_verifier_normalized': Value(dtype='bool', id=None),
'det_reward_ratio': {
'iou_max_label_first': Value(dtype='float32', id=None),
'iou_max_iou_first': Value(dtype='float32', id=None),
'iou_completeness': Value(dtype='float32', id=None),
'map': Value(dtype='float32', id=None),
'map50': Value(dtype='float32', id=None),
'map75': Value(dtype='float32', id=None)
}
}
},
'extra_info': {'id': Value(dtype='string', id=None), 'image_path': Value(dtype='string', id=None)}
}
```
## 📊 Data Sources and Composition
The **Orsta-Data-47k** dataset is constructed entirely from publicly available, open-source datasets. These have been aggregated and curated to create a collection suitable for VLM post-training on both visual reasoning and perception tasks.
Orsta-Data-47k is compiled from 18 distinct public datasets. The primary contributing sources for each task category are as follows:
* **Math**: [mm_math](https://huggingface.co/datasets/THU-KEG/MM_Math), [geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k), [mmk12](https://huggingface.co/datasets/FanqingM/MMK12)
* **Puzzle**: [PuzzleVQA](https://huggingface.co/datasets/declare-lab/PuzzleVQA), [AlgoPuzzleVQA](https://huggingface.co/datasets/declare-lab/AlgoPuzzleVQA), [VisualPuzzles](https://huggingface.co/datasets/neulab/VisualPuzzles)
* **Science**: [ScienceQA](https://huggingface.co/datasets/lmms-lab/ScienceQA), [SciVQA](https://huggingface.co/datasets/katebor/SciVQA), [ViRL39K-Science](https://huggingface.co/datasets/TIGER-Lab/ViRL39K)
* **Chart**: [ChartQAPro](https://huggingface.co/datasets/ahmed-masry/ChartQAPro), [ChartX](https://huggingface.co/datasets/U4R/ChartX), [Table-VQA-Bench](https://huggingface.co/datasets/terryoo/TableVQA-Bench), [ViRL39K-Chart](https://huggingface.co/datasets/TIGER-Lab/ViRL39K)
* **Detection**: [V3Det](https://arxiv.org/abs/2307.12813), [Object365](https://www.objects365.org/overview.html)
* **Grounding**: [D^3](https://arxiv.org/abs/2307.12813)
* **Counting**: [CLEVR](https://arxiv.org/abs/1612.06890)
* **OCR**: [LLaVA-OV Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [EST-VQA](https://arxiv.org/abs/2002.10215)
For detailed information and licensing for each source dataset, please refer to their original publications and repositories. Our specific aggregation and curation methodology for Orsta-Data-47k is described in our paper: [V-Triune: One RL to See Them All (arXiv:2505.18129)](https://arxiv.org/abs/2505.18129).
## Citation Information 📜
If you use the Orsta-Data-47k dataset or our V-Triune framework in your research, please cite our accompanying paper:
```bibtex
@article{ma2025one,
title={One RL to See Them All: Visual Triple Unified Reinforcement Learning},
author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
journal={arXiv preprint arXiv:2505.18129},
year={2025}
}
```
# Orsta-Data-47k 数据集
* 🐙 **GitHub 仓库:** [MiniMax-AI/One-RL-to-See-Them-All](https://github.com/MiniMax-AI/One-RL-to-See-Them-All)
* 📜 **论文(arXiv):** [V-Triune: One RL to See Them All(arXiv:2505.18129)](https://arxiv.org/abs/2505.18129)
## 数据集描述 📖
**Orsta-Data-47k** 是一款专为视觉语言模型(Vision-Language Models, VLMs)后训练打造的专用数据集,依托我们的[V-Triune](https://github.com/MiniMax-AI/One-RL-to-See-Them-All) 统一强化学习(Reinforcement Learning, RL)系统构建。其核心目标是支持跨多样化视觉推理与视觉感知任务的稳健联合训练,助力[Orsta](https://huggingface.co/collections/One-RL-to-See-Them-All/one-rl-to-see-them-all-6833d27abce23898b2f9815a) 等模型实现顶尖的多模态能力。
本数据集从18个公开数据集精心筛选聚合而来,经过严格的过滤流程以确保数据高质量且适配基于强化学习的微调任务。
## 覆盖任务 🎯
本数据集涵盖8大类核心任务,在推理与感知任务间保持均衡:
* **视觉推理任务 🤔:**
* 数学(Math QA)
* 谜题求解(Visual Puzzles)
* 科学问答(Science QA)
* 图表理解(Chart QA)
* **视觉感知任务 👁️:**
* 目标检测(Object Detection)
* 视觉接地(Visual Grounding)
* 目标计数(Object Counting)
* 光学字符识别(Optical Character Recognition, OCR)
## 数据构建流程 🛠️
为打造适配高效强化学习后训练的高质量语料,我们采用了两阶段的全面数据构建流水线:
1. **基于规则的过滤:** 初始过滤阶段对源数据集应用预设规则。针对推理与感知任务,我们定制了差异化的规则,旨在移除噪声样本、易被“投机取巧”的问题(如特定多项选择题格式)以及存在问题的答案格式。对于感知任务,该流程还包含坐标系统标准化以及基于目标尺寸或数量的过滤步骤。
2. **基于难度的过滤:** 在基于规则的清洗完成后,我们引入了难度过滤器。该阶段会移除过于简单(例如基线模型已可轻松解决)或过于复杂的样本,确保剩余数据能够为模型提供兼具意义与效率的学习信号。
经过上述精细化流程,最终得到约**47700条**高质量样本。为缓解潜在的数据集分布不均衡问题,我们对部分任务(如谜题类任务)的数据进行了针对性复制,以保障其充足的代表性。
## 数据集组成与结构 📊
* **总样本量:** 约47.7K
* **任务类别:** 8类(4类推理任务,4类感知任务)
* **聚合来源:** 18个独立公开数据集
* **内容拆分:**
* 视觉感知样本:约20.6K
* 视觉推理样本:约27.1K
* **交互格式:** 数据主要为单图像、单轮对话交互形式(例如单图像搭配问题与对应答案/标注)。
* **存储格式:** 所有整理后的数据均采用高效的Parquet格式存储。
## 预期用途与训练 🚀
本数据集专为配合[V-Triune](https://github.com/MiniMax-AI/One-RL-to-See-Them-All) 框架,用于视觉语言模型的强化学习后训练。在[Orsta](https://huggingface.co/collections/One-RL-to-See-Them-All/one-rl-to-see-them-all-6833d27abce23898b2f9815a) 模型的训练过程中,本数据集的所有样本均被均匀混合并投入使用。
## 数据集使用说明
本节介绍如何下载并使用Orsta-Data-47k数据集。
### 下载数据集
您可以通过`huggingface-cli`工具直接从Hugging Face Hub下载本数据集。请确保已安装`huggingface_hub`库(执行`pip install huggingface_hub`)。
在终端中执行以下命令:
bash
huggingface-cli download --repo-type dataset --resume-download One-RL-to-See-Them-All/Orsta-Data-47k --local-dir Orsta-Data-47k
该命令会将所有数据集文件下载至名为`Orsta-Data-47k`的本地目录。`--resume-download`参数可在下载中断时恢复任务。
### 数据集结构
下载完成后,`Orsta-Data-47k`目录下的数据集结构如下,所有数据文件均为Parquet(`.parquet`)格式。
Orsta-Data-47k/
├── test/
│ ├── test_chart_megabench_176.parquet
......
│ └── test_science_megabench_91.parquet
└── train/
├── train_chart_chartqapro_498.parquet
......
└── train_science_virl39k_2539.parquet
### 文件命名规则
`train/`与`test/`目录下的文件遵循以下命名规则:
`{split}_{task_name}_{source_name}_{num}.parquet`
其中:
* `{split}`:表示数据集拆分,可选`train`(训练集)或`test`(测试集)。
* `{task_name}`:指定通用任务类别。
* `{source_name}`:表示数据的具体基准测试或来源。
* `{num}`:表示该Parquet文件包含的样本数量。
### 各拆分文件的用途
* **`train/`目录:** 该目录下的文件构成Orsta模型的训练语料。
* **`test/`目录:** 该目录下的样本专为训练过程中对模型在不同任务上的性能进行在线评估而准备。通过分析这些样本上的模型表现,可辅助诊断训练状态并理解模型在各任务类别上的学习进展。
### 数据格式
python
{
'data_source': Value(dtype='string', id=None),
'images': Sequence(feature=Image(mode=None, decode=True, id=None), length=-1, id=None),
'prompt': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}],
'ability': Value(dtype='string', id=None),
'reward_model': {
'answer': Value(dtype='string', id=None),
'ground_truth': Value(dtype='string', id=None),
'accuracy_ratio': Value(dtype='float32', id=None),
'format_ratio': Value(dtype='float32', id=None),
'verifier': Value(dtype='string', id=None),
'verifier_parm': {
'det_verifier_normalized': Value(dtype='bool', id=None),
'det_reward_ratio': {
'iou_max_label_first': Value(dtype='float32', id=None),
'iou_max_iou_first': Value(dtype='float32', id=None),
'iou_completeness': Value(dtype='float32', id=None),
'map': Value(dtype='float32', id=None),
'map50': Value(dtype='float32', id=None),
'map75': Value(dtype='float32', id=None)
}
}
},
'extra_info': {'id': Value(dtype='string', id=None), 'image_path': Value(dtype='string', id=None)}
}
## 📊 数据来源与组成
**Orsta-Data-47k** 数据集完全由公开的开源数据集构建而成。我们对这些数据集进行了聚合与整理,以打造适配视觉语言模型在视觉推理与感知任务上的后训练语料。
Orsta-Data-47k 由18个独立公开数据集整合而成,各任务类别的主要贡献来源如下:
* **数学:** [mm_math](https://huggingface.co/datasets/THU-KEG/MM_Math), [geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k), [mmk12](https://huggingface.co/datasets/FanqingM/MMK12)
* **谜题:** [PuzzleVQA](https://huggingface.co/datasets/declare-lab/PuzzleVQA), [AlgoPuzzleVQA](https://huggingface.co/datasets/declare-lab/AlgoPuzzleVQA), [VisualPuzzles](https://huggingface.co/datasets/neulab/VisualPuzzles)
* **科学:** [ScienceQA](https://huggingface.co/datasets/lmms-lab/ScienceQA), [SciVQA](https://huggingface.co/datasets/katebor/SciVQA), [ViRL39K-Science](https://huggingface.co/datasets/TIGER-Lab/ViRL39K)
* **图表:** [ChartQAPro](https://huggingface.co/datasets/ahmed-masry/ChartQAPro), [ChartX](https://huggingface.co/datasets/U4R/ChartX), [Table-VQA-Bench](https://huggingface.co/datasets/terryoo/TableVQA-Bench), [ViRL39K-Chart](https://huggingface.co/datasets/TIGER-Lab/ViRL39K)
* **检测:** [V3Det](https://arxiv.org/abs/2307.12813), [Object365](https://www.objects365.org/overview.html)
* **接地:** [D^3](https://arxiv.org/abs/2307.12813)
* **计数:** [CLEVR](https://arxiv.org/abs/1612.06890)
* **OCR:** [LLaVA-OV Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [EST-VQA](https://arxiv.org/abs/2002.10215)
如需了解各源数据集的详细信息与授权协议,请参阅其原始论文与仓库。我们针对Orsta-Data-47k的具体聚合与整理方法已在论文[V-Triune: One RL to See Them All(arXiv:2505.18129)](https://arxiv.org/abs/2505.18129)中进行了阐述。
## 引用信息 📜
若您在研究中使用Orsta-Data-47k数据集或V-Triune框架,请引用我们的配套论文:
bibtex
@article{ma2025one,
title={One RL to See Them All: Visual Triple Unified Reinforcement Learning},
author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
journal={arXiv preprint arXiv:2505.18129},
year={2025}
}
提供机构:
maas
创建时间:
2025-05-27



