MiroVerse-v0.1
Source: ModelScope community (魔搭社区) · Updated 2026-05-17 · Indexed 2025-08-16
Download link:
https://modelscope.cn/datasets/okwinds/MiroVerse-v0.1
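Resource description:
This dataset is mirrored from Hugging Face: [miromind-ai](https://huggingface.co/miromind-ai)
#### 📖 For research related to this project, see the article from the WeChat official account “觉察流” 👇
[MiroMind-M1: How to Build an Efficient and Reproducible Full-Stack Open-Source Reasoning Model with the CAMPO Algorithm](https://mp.weixin.qq.com/s/REPzzgsUjDMikg4jIo9KRg)
#### _Scan the QR code below 👇🏻 to reach the author of this repository_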
<img src="https://www.modelscope.cn/models/okwinds/GPT-2/resolve/master/qrcode_for_jcl_258.jpg" />
---
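For the dataset file metadata and the data files, see the “Dataset Files” page.
You can download the dataset with the git clone command or the ModelScope SDK shown below.
#### How to download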
:modelscope-code[]{type="sdk"}
:modelscope-code[]{type="git"}
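As a rough sketch of the two download routes mentioned above, the snippet below builds a git clone command and wraps a ModelScope SDK call. The dataset ID comes from the download link on this page. The `.git` URL pattern, the helper names, and the assumption that `MsDataset.load` can parse this repository's file layout are unverified.

```python
"""Sketch: two ways to fetch the dataset from ModelScope (assumptions noted inline)."""

# Dataset ID taken from the download link on this page.
DATASET_ID = "okwinds/MiroVerse-v0.1"


def git_clone_command(dataset_id: str = DATASET_ID) -> str:
    """Return a git clone command for the dataset repo.

    Assumption: ModelScope dataset repos can be cloned from
    https://www.modelscope.cn/datasets/<id>.git
    """
    return f"git clone https://www.modelscope.cn/datasets/{dataset_id}.git"


def load_with_sdk(dataset_id: str = DATASET_ID):
    """Load the dataset through the ModelScope SDK.

    Needs `pip install modelscope` and network access. Whether the SDK
    loader handles this repo's files out of the box is an assumption.
    """
    from modelscope.msdatasets import MsDataset  # lazy import: optional dependency

    return MsDataset.load(dataset_id)


if __name__ == "__main__":
    print(git_clone_command())
```

If the SDK loader does not recognise the file layout, cloning the repository and reading the files directly is the fallback.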
# MiroVerse-v0.1 Official Introduction
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/68525b342230a897a65cc1c0/nfFd3PLPgjXjmhkdNp6_0.png" width="55%" alt="MiroThinker" />
</div>
<div style="text-align: center;">
<h1>MiroVerse: A Reproducible, Full-Trajectory, Ever-Growing Deep Research Dataset</h1>
</div>
<div align="center">
[Demo](https://dr.miromind.ai/)
[Models](https://huggingface.co/collections/miromind-ai/mirothinker-v01-689301b6d0563321862d44a1)
[Dataset](https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1)
[Blog](https://miromind.ai/blog/miromind-open-deep-research)
[GitHub](https://github.com/MiroMindAI/MiroThinker)
[Discord](https://discord.com/invite/GPqEnkzQZd)
[WeChat](https://cdn-uploads.huggingface.co/production/uploads/68525b342230a897a65cc1c0/SGK70isvVpeJwk_fny9sb.png)
[RedNote](https://www.xiaohongshu.com/user/profile/663098830000000003033edc)
[Website](https://miromind.ai/)
</div>
## **🔥 News & Updates**
- The initial release of **MiroVerse (v0.1)** is coming very soon—stay tuned!
## **🔥 First Batch of MiroVerse**
✨ **What makes this release special:**
- 📚 **Diverse Verified Open Source Data:** Carefully curated and validated community datasets
- 🧠 **Fresh Large-Scale Deep Research Data:** Generated by our proprietary data engine
- 🔄 **Complete Trajectory Coverage:** Every single sample includes full rollout trajectories
- ✅ **Quality Assurance:** Each trajectory has been verified, ensuring high-quality training data for your models
- 🌱 **Always Growing, Always Open:** Regular updates, powered by collaboration with the community
---
## **📦 Dataset Overview**

**MiroVerse-v0.1** is a large-scale agent dataset with **147K+** samples featuring **full rollout trajectories** across diverse AI agent tasks, including multi-hop QA, web navigation, and scientific reasoning. Every sample includes a complete execution trace; together they total **1.9B+** tokens and **602K+** tool interactions, providing comprehensive training data for tool-using and web-browsing AI agents.
| **Split** | **#Sample** | **#Main Trace** | **#Browse Trace** | **#Token** | **#Turns** | **#Tools** | **License** |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **MiroVerse-Voyager1.0** | 59097 | 19115 | 39982 | 1129113893 | 444723 | 325537 | CC-BY-NC-4.0 |
| MiroVerse-MuSiQue | 29572 | 10422 | 19150 | 294351053 | 143080 | 90486 | CC-BY-4.0 |
| MiroVerse-HotpotQA | 12942 | 6553 | 6389 | 67352039 | 46320 | 20524 | CC-BY-SA-4.0 |
| MiroVerse-WebWalkerQA-Silver | 10817 | 4961 | 5856 | 107650324 | 67846 | 46215 | Apache 2.0 |
| MiroVerse-MegaScience | 10615 | 8270 | 2345 | 111120264 | 63594 | 42443 | CC-BY-NC-SA-4.0 |
| MiroVerse-TaskCraft | 8890 | 4277 | 4613 | 95518109 | 35013 | 17236 | MIT |
| MiroVerse-QA-Expert-Multi-Hop-V1.0 | 6187 | 2091 | 4096 | 63983151 | 31957 | 19585 | Apache 2.0 |
| MiroVerse-OneGen-TrainDataset-MultiHopQA | 3289 | 1347 | 1942 | 33214386 | 17187 | 11449 | MIT |
| MiroVerse-2WikiMultihopQA | 3001 | 1410 | 1591 | 28977451 | 13982 | 7981 | Apache 2.0 |
| MiroVerse-WikiTables | 1606 | 1288 | 318 | 16461870 | 12089 | 8877 | MIT |
| MiroVerse-WebShaper | 1514 | 486 | 1028 | 31240265 | 12126 | 9578 | MIT |
| MiroVerse-WebDancer | 455 | 192 | 263 | 7817689 | 3170 | 2268 | MIT |
| **MiroVerse-v0.1** | **147985** | **60412** | **87573** | **1993099086** | **891087** | **602179** | / |
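The counts in the table can be cross-checked with a few lines of Python. The sketch below copies the sample, trace, turn, and tool-call columns from the table and tests two things: that each split's sample count equals its main-trace count plus its browse-trace count, and that the columns add up to the total row. The helper names are illustrative only.

```python
"""Sketch: sanity-check the per-split counts against the total row."""

# (split, samples, main traces, browse traces, turns, tool calls), copied from the table.
ROWS = [
    ("MiroVerse-Voyager1.0", 59097, 19115, 39982, 444723, 325537),
    ("MiroVerse-MuSiQue", 29572, 10422, 19150, 143080, 90486),
    ("MiroVerse-HotpotQA", 12942, 6553, 6389, 46320, 20524),
    ("MiroVerse-WebWalkerQA-Silver", 10817, 4961, 5856, 67846, 46215),
    ("MiroVerse-MegaScience", 10615, 8270, 2345, 63594, 42443),
    ("MiroVerse-TaskCraft", 8890, 4277, 4613, 35013, 17236),
    ("MiroVerse-QA-Expert-Multi-Hop-V1.0", 6187, 2091, 4096, 31957, 19585),
    ("MiroVerse-OneGen-TrainDataset-MultiHopQA", 3289, 1347, 1942, 17187, 11449),
    ("MiroVerse-2WikiMultihopQA", 3001, 1410, 1591, 13982, 7981),
    ("MiroVerse-WikiTables", 1606, 1288, 318, 12089, 8877),
    ("MiroVerse-WebShaper", 1514, 486, 1028, 12126, 9578),
    ("MiroVerse-WebDancer", 455, 192, 263, 3170, 2268),
]
TOTAL = ("MiroVerse-v0.1", 147985, 60412, 87573, 891087, 602179)


def column_sums(rows):
    """Sum each numeric column across all splits."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 6))


def check_table(rows, total):
    """True if samples == main + browse per split and the columns match the total row."""
    per_split_ok = all(s == m + b for _, s, m, b, _, _ in rows)
    return per_split_ok and column_sums(rows) == total[1:]


if __name__ == "__main__":
    print("table consistent:", check_table(ROWS, TOTAL))
    for name, samples, *_ in ROWS:
        print(f"{name}: {samples / TOTAL[1]:.1%} of samples")
```

Because every split satisfies samples = main traces + browse traces, it seems likely that each sample corresponds to exactly one trace of either kind, although the README does not state this explicitly.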
> Every sample includes successful MiroFlow rollout trajectories that reached the verified answer: one JSON line, zero secrets.
>
> The MiroVerse-v0.1 dataset follows a hybrid licensing model: query and answer data retain their original source licenses, while all trace data is licensed under CC-BY-NC-4.0. For commercial use, please contact us to request a commercial license.
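Since each sample is described as one JSON line, the files can be streamed without loading everything into memory. The following is a minimal sketch using only the standard library. The file name is hypothetical, and no field names are assumed, because the README does not document the record schema.

```python
"""Sketch: stream samples from a JSON Lines file, one sample per line."""
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line.

    Accepts any iterable of strings, including an open text file.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: not valid JSON") from exc


if __name__ == "__main__":
    # Hypothetical file name; substitute a real file from the dataset.
    with open("example_split.jsonl", encoding="utf-8") as fh:
        first = next(iter_jsonl(fh))
        print(sorted(first))  # inspect the top-level keys before relying on any of them
```

Printing the top-level keys of the first record is a cheap way to discover the actual schema before writing any field-specific code.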
---
## **🆚 Why We're Different**
While high-quality data is essential for training advanced models and often kept private, we believe that the path to truly general-purpose agents is still long. That’s why we’re committed to open-sourcing as much of our data as possible—including raw samples and exploration traces—to support and accelerate progress across the community.
| **Org** | **Work** | **Samples** | **Trace Data** | **Reproducible?** |
| --- | --- | --- | --- | --- |
| OpenAI | Deep Research | — | ❌ | ❌ |
| Gemini | Gemini Deep Research | — | ❌ | ❌ |
| Tencent | Cognitive Kernel-Pro | 7K | ❌ | ❌ |
| Tongyi | WebShaper | 500 | ❌ | ❌ |
| **MiroMind** (ours) | *this repo* | **147K+** | ✅ | ✅ |
---
## **🧩 Examples**
Below are two QA examples synthesized by our data engine (MiroVerse-Voyager1.0).
### **Case 1**
**Q:** A female lead actress received her first major annual Hindi film performance award for best actress for her role in a late-2000s comedy-drama, directed by the filmmaker who later created a sports-themed drama released in 2023 starring an actress known for completing an athletic triathlon event in Berlin. What is the title of the film for which this actress first won that award?
**A:** Paa
### **Case 2**
**Q:** Identify the agricultural practice, unique to a mountain range that forms a border including an independent principality and known for spectacular geologic landforms, that was one of the key reasons for part of the range's inscription as a UNESCO World Heritage Site in the decade before the 21st century. This region's history features a brief early-1800s reorganization of provincial boundaries after a liberal revolution in the southern country, and the northern country is globally recognized as the leading tourist destination with the fourth-largest number of heritage sites. What is this traditional agricultural system called?
**A:** transhumance
## **🛠️ Free Trace Rollout: Let Us Help You Train**
Generating high-quality training trajectories is expensive — on average, **$1.50 per sample** using top-tier commercial models.
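To put the quoted average in perspective, the arithmetic can be sketched as below. The $1.50 figure is the average stated above; applying it uniformly to other sample counts is a back-of-the-envelope assumption, and the function name is illustrative.

```python
"""Sketch: back-of-the-envelope rollout cost from an average per-sample price."""

AVG_COST_PER_SAMPLE_USD = 1.50  # average quoted in the README


def estimate_rollout_cost(
    n_samples: int, cost_per_sample: float = AVG_COST_PER_SAMPLE_USD
) -> float:
    """Estimated cost in USD, assuming every sample costs the average."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return n_samples * cost_per_sample


if __name__ == "__main__":
    for n in (100, 10_000, 147_985):
        print(f"{n:>7} samples -> ${estimate_rollout_cost(n):,.2f}")
```

At that rate a 100-row pilot comes to about $150, and a corpus the size of MiroVerse-v0.1 (147,985 samples) to roughly $222K. This is an extrapolation from the quoted average, not a reported figure.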
To empower the community, we’re offering **free rollout services** for qualifying seed data:
### **How It Works:**
1. **Submit a Request**
Open a ticket via [this template](https://docs.google.com/forms/d/e/1FAIpQLSfN_DjJohfuMls3IjqFbFRX7BSGMHjgbwucspHIw9-ZgA2djQ/viewform?usp=header) and provide the basic info, rollout requirements, and up to 100 sample rows in one go.
2. **Review & Rollout**
We’ll review your submission within 48 hours. Once approved, we’ll reach out to you for the full dataset and then launch the complete trace rollout using top-tier commercial models.
3. **Delivery & Recognition**
Upon completion, we’ll send the augmented dataset to you via email.
With your **explicit consent**, we’ll also publish it publicly and credit you as a **Community Contributor** — with a permanent badge in this README.
## **🤝 License**
This project is released under the CC BY-NC 4.0 license. Parts of this project contain code and models from other sources, which are subject to their respective licenses. For **commercial use cases**, please contact us at: [talent@miromind.ai](mailto:talent@miromind.ai).
## **📜 Citation**
If you find this project useful in your research, please consider citing it:
```bibtex
@misc{miromind2024opendata,
title={MiroVerse V0.1: Reproducible, Full-Trajectory, Ever-Expanding — A Living Dataset for the Community},
author={Miromind Data Team},
year={2025},
url={https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1}
}
```
---
## Contact Us
MiroVerse is developed by the MiroMind Data Team.
If you would like to leave us a message, feel free to get in touch.
In addition to [GitHub](https://github.com/MiroMindAI/),
[Discord](https://discord.com/invite/GPqEnkzQZd),
[WeChat](https://cdn-uploads.huggingface.co/production/uploads/68525b342230a897a65cc1c0/SGK70isvVpeJwk_fny9sb.png),
and [RedNote](https://www.xiaohongshu.com/user/profile/663098830000000003033edc),
you can also reach us via email at talent@miromind.ai.
Provider: maas
Created: 2025-08-10
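Dataset introduction

Background overview
MiroVerse-v0.1 is a large-scale agent dataset with 147K+ samples covering diverse tasks such as multi-hop QA, web navigation, and scientific reasoning. Every sample includes a complete execution trajectory (1.9B+ tokens and 602K+ tool interactions in total), which makes it suitable for training tool-using and web-browsing AI agents. The dataset uses a hybrid licensing model: query and answer data keep their source licenses, and trace data is licensed under CC-BY-NC-4.0 for non-commercial use.
The summary above was collected and generated by 遇见数据集.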



