orbit-ai/orbit-seeds
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/orbit-ai/orbit-seeds
Description:
---
license: apache-2.0
task_categories:
- question-answering
- text-retrieval
language:
- en
tags:
- orbit
- seeds
- wikipedia
- multi-hop
- qa-generation
pretty_name: Orbit Seeds
size_categories:
- 10K<n<100K
configs:
- config_name: art
  data_files: data/art/train.jsonl
- config_name: code
  data_files: data/code/train.jsonl
- config_name: finance
  data_files: data/finance/train.jsonl
- config_name: geography
  data_files: data/geography/train.jsonl
- config_name: history
  data_files: data/history/train.jsonl
- config_name: law
  data_files: data/law/train.jsonl
- config_name: mathematics
  data_files: data/mathematics/train.jsonl
- config_name: medicine
  data_files: data/medicine/train.jsonl
- config_name: music
  data_files: data/music/train.jsonl
- config_name: politics
  data_files: data/politics/train.jsonl
- config_name: puzzles
  data_files: data/puzzles/train.jsonl
- config_name: science_and_technology
  data_files: data/science_and_technology/train.jsonl
- config_name: sports
  data_files: data/sports/train.jsonl
- config_name: tv_shows_and_movies
  data_files: data/tv_shows_and_movies/train.jsonl
- config_name: video_games
  data_files: data/video_games/train.jsonl
---
> [!NOTE]
> For more information on the ORBIT dataset, see the preprint at [arxiv.org/abs/2604.01195](https://arxiv.org/abs/2604.01195).
<img src="https://huggingface.co/orbit-ai/orbit-4b-v0.1/resolve/main/orbit-with-name-logo.png" alt="Figure 1" width="500"/>
# ORBIT: A Synthetic Training Dataset for Search Agents
[Paper](https://arxiv.org/abs/2604.01195)
[Dataset](https://huggingface.co/datasets/orbit-ai/orbit-20k)
[Model](https://huggingface.co/orbit-ai/orbit-4b-v0.1)
[GitHub](https://github.com/castorini/orbit)
[License: CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
> **ORBIT** is a reasoning-intensive synthetic dataset with complex queries used for training search agents, generated without relying on any paid API services or manual annotation.
---
# Orbit Seeds
Seed entities collected from English Wikipedia, organised by domain. Each record is a Wikipedia page that was used as a seed for reasoning-intensive question generation in the Orbit project.
## Schema
| Column | Type | Description |
|---|---|---|
| `_id` | string | Unique MD5 hash identifier |
| `seed` | string | Wikipedia page title (seed entity) |
| `seed_url` | string | Full Wikipedia URL |
| `category` | string | Wikipedia category the page belongs to |
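As a rough illustration of the schema, the sketch below checks that a record carries the four string columns. The helper name `check_seed_record`, the sample record, and the two format checks (a 32-character hexadecimal `_id`, a `wikipedia.org` host in `seed_url`) are assumptions made for illustration; the card itself only states that `_id` is an MD5 hash and that `seed_url` is a full Wikipedia URL.

```python
import re
from urllib.parse import urlparse

# Column names taken from the schema table; all four are strings.
EXPECTED_COLUMNS = ("_id", "seed", "seed_url", "category")

# Assumption: the MD5 identifier is stored as a 32-character hex digest.
MD5_HEX = re.compile(r"[0-9a-fA-F]{32}")


def check_seed_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    for column in EXPECTED_COLUMNS:
        if not isinstance(record.get(column), str):
            problems.append(f"{column}: missing or not a string")
    if problems:
        return problems
    if not MD5_HEX.fullmatch(record["_id"]):
        problems.append("_id: not a 32-character hex digest")
    # Assumption: seed URLs point at a wikipedia.org host such as en.wikipedia.org.
    host = urlparse(record["seed_url"]).netloc
    if not host.endswith("wikipedia.org"):
        problems.append("seed_url: host is not wikipedia.org")
    return problems


# Hypothetical record, invented for illustration (not taken from the dataset).
example = {
    "_id": "0123456789abcdef0123456789abcdef",
    "seed": "Graph theory",
    "seed_url": "https://en.wikipedia.org/wiki/Graph_theory",
    "category": "Mathematics",
}
print(check_seed_record(example))
```

If the real identifiers turn out to use a different encoding, only the `MD5_HEX` pattern would need to change.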
## Domains
| Domain | Seeds |
|---|---|
| art | 3,717 |
| code | 3,425 |
| finance | 3,599 |
| geography | 2,737 |
| history | 3,549 |
| law | 3,857 |
| mathematics | 3,997 |
| medicine | 4,253 |
| music | 2,567 |
| politics | 3,703 |
| puzzles | 1,844 |
| science_and_technology | 4,549 |
| sports | 1,965 |
| tv_shows_and_movies | 2,250 |
| video_games | 4,243 |
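The per-domain counts can be tallied with a few lines of Python. This is a small sketch that copies the numbers from the table above; the variable names are illustrative only.

```python
# Seed counts copied from the Domains table.
SEED_COUNTS = {
    "art": 3717,
    "code": 3425,
    "finance": 3599,
    "geography": 2737,
    "history": 3549,
    "law": 3857,
    "mathematics": 3997,
    "medicine": 4253,
    "music": 2567,
    "politics": 3703,
    "puzzles": 1844,
    "science_and_technology": 4549,
    "sports": 1965,
    "tv_shows_and_movies": 2250,
    "video_games": 4243,
}

total = sum(SEED_COUNTS.values())
largest = max(SEED_COUNTS, key=SEED_COUNTS.get)
smallest = min(SEED_COUNTS, key=SEED_COUNTS.get)

print(f"total seeds: {total:,}")
print(f"largest domain: {largest} ({SEED_COUNTS[largest]:,})")
print(f"smallest domain: {smallest} ({SEED_COUNTS[smallest]:,})")

# Share of each domain, largest first.
for domain, count in sorted(SEED_COUNTS.items(), key=lambda item: -item[1]):
    print(f"{domain:<24}{count:>6,}{count / total:>8.1%}")
```

The counts sum to 50,255 seeds across the 15 domains, which is consistent with the `10K<n<100K` size category declared in the card metadata.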
## Usage
```python
from datasets import load_dataset

# Load a specific domain
ds = load_dataset("orbit-ai/orbit-seeds", "mathematics", split="train")
print(ds[0])

# Load all domains
domains = [
    "art", "code", "finance", "geography", "history", "law",
    "mathematics", "medicine", "music", "politics", "puzzles",
    "science_and_technology", "sports", "tv_shows_and_movies", "video_games",
]
for domain in domains:
    ds = load_dataset("orbit-ai/orbit-seeds", domain, split="train")
    print(f"{domain}: {len(ds)} seeds")
```
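The card metadata maps each config to a file at `data/<domain>/train.jsonl`. For offline work on a local copy of the repository files, the records could also be read with the standard library alone. The following is a minimal sketch under that assumption: the helper names `read_seeds` and `count_by_category` are hypothetical, and the files are assumed to hold one JSON object per line.

```python
import json
from collections import Counter
from pathlib import Path


def read_seeds(root, domain):
    """Yield seed records from <root>/data/<domain>/train.jsonl.

    Assumes one JSON object per line; blank lines are skipped.
    """
    path = Path(root) / "data" / domain / "train.jsonl"
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_category(records):
    """Count how many seed records fall under each Wikipedia category."""
    return dict(Counter(record["category"] for record in records))
```

For example, `count_by_category(read_seeds("path/to/orbit-seeds", "mathematics"))` would summarise the categories in one domain, where `path/to/orbit-seeds` stands for wherever the local copy lives.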
## Citation
If you use ORBIT in your work, please cite our preprint on arXiv:
```
@misc{thakur2026orbit,
      title={ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget},
      author={Nandan Thakur and Zijian Chen and Xueguang Ma and Jimmy Lin},
      year={2026},
      eprint={2604.01195},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.01195},
}
```
---
## Links
| Resource | URL |
|---|---|
| Paper | https://arxiv.org/abs/2604.01195 |
| Dataset | https://huggingface.co/datasets/orbit-ai/orbit-20k |
| Model | https://huggingface.co/orbit-ai/orbit-4b-v0.1 |
| Hugging Face | https://huggingface.co/orbit-ai |
| GitHub | https://github.com/castorini/orbit |
Provided by:
orbit-ai