2796gauravc/agentic-search-data
Source: Hugging Face. Updated 2026-03-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/2796gauravc/agentic-search-data
Dataset description:
---
language:
- en
license: apache-2.0
pretty_name: Agentic Search Multi-hop Retrieval
size_categories:
- 1K<n<10K
task_categories:
- text-retrieval
tags:
- retrieval
- multi-hop
- synthetic
- agentic
---
# Agentic search — synthetic multi-hop retrieval dataset
JSONL artifacts for training and evaluating a retrieval agent across five domains: **web**, **finance**, **legal**, **code**, and **science**.
## Files
| File | Description |
|------|-------------|
| `corpus.jsonl` | Unique `doc_id` passages (supporting + distractor docs) with `domain` labels |
| `sft_dataset.jsonl` | Supervised fine-tuning tasks (~60% of tasks) |
| `rl_dataset.jsonl` | RL / GRPO-style prompts (~25%) |
| `eval_dataset.jsonl` | Held-out evaluation (~15%) |

Splits are **disjoint by `task_id`**: no eval task appears in SFT or RL.
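
The disjointness claim above can be checked locally. The following is a minimal sketch, assuming the three task files sit in one local directory and that every row carries a `task_id` key as listed in the schema; the helper names (`read_jsonl`, `split_overlaps`, `check_splits`) are illustrative and not part of the dataset.

```python
import json
from itertools import combinations
from pathlib import Path

# File names come from the table above; the short split names are illustrative.
SPLIT_FILES = {
    "sft": "sft_dataset.jsonl",
    "rl": "rl_dataset.jsonl",
    "eval": "eval_dataset.jsonl",
}


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_overlaps(task_ids_by_split):
    """Return {(split_a, split_b): shared_ids} for each pair sharing a task_id."""
    overlaps = {}
    for a, b in combinations(sorted(task_ids_by_split), 2):
        shared = set(task_ids_by_split[a]) & set(task_ids_by_split[b])
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


def check_splits(data_dir="."):
    """Read the task_id sets from disk and report any cross-split overlap."""
    ids = {
        name: {row["task_id"] for row in read_jsonl(Path(data_dir) / fname)}
        for name, fname in SPLIT_FILES.items()
    }
    return split_overlaps(ids)
```

An empty dictionary from `check_splits("path/to/data")` means no `task_id` is shared between any two splits.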
## Schema (task rows)
- `task_id`, `domain`, `hops`, `difficulty`, `verified`
- `query`, `clues`, `answer`, `metadata`
- `supporting_docs`, `distractors` (each a list of objects with `doc_id`, `content`, `source`, …)
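
As an illustration of how these fields might be consumed, the sketch below checks a task row for the listed keys and scores a ranked list of retrieved `doc_id`s against the row's `supporting_docs`. Recall@k is a common retrieval metric used here as an assumption; the card does not state which metric the author uses, and the function names are hypothetical.

```python
# Field names are taken from the schema list above.
REQUIRED_FIELDS = (
    "task_id", "domain", "hops", "difficulty", "verified",
    "query", "clues", "answer", "metadata",
    "supporting_docs", "distractors",
)


def missing_fields(row):
    """Return the schema fields that are absent from a task row."""
    return [name for name in REQUIRED_FIELDS if name not in row]


def gold_doc_ids(row):
    """Collect the doc_id of every supporting document, in file order."""
    return [doc["doc_id"] for doc in row["supporting_docs"]]


def recall_at_k(retrieved_ids, row, k):
    """Fraction of supporting docs found among the top-k retrieved doc_ids."""
    gold = set(gold_doc_ids(row))
    if not gold:
        return 0.0
    return len(gold & set(retrieved_ids[:k])) / len(gold)
```

For a two-hop task, retrieving one of the two supporting documents within the top k gives a recall of 0.5.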
## Hub Parquet view
The Hugging Face **Datasets** server may expose a single Parquet split (often aligned with `eval_dataset.jsonl`). Full training data is always in the JSONL files above.
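
Because the Parquet view may cover only one split, one option is to download the JSONL files directly. This is a minimal sketch, assuming the `huggingface_hub` package is installed and that the files sit at the root of the dataset repository under the names listed in the Files table; `fetch_jsonl` is an illustrative helper, not something the dataset provides.

```python
REPO_ID = "2796gauravc/agentic-search-data"

# File names from the Files table; their location at the repo root is an assumption.
JSONL_FILES = (
    "corpus.jsonl",
    "sft_dataset.jsonl",
    "rl_dataset.jsonl",
    "eval_dataset.jsonl",
)


def fetch_jsonl(filename, repo_id=REPO_ID):
    """Download one JSONL file from the dataset repo and return its local cache path."""
    if filename not in JSONL_FILES:
        raise ValueError(f"unknown file {filename!r}; expected one of {JSONL_FILES}")
    # Imported lazily so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
```

Calling `fetch_jsonl("sft_dataset.jsonl")` returns a local path that can then be read line by line with `json.loads`.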
## License / use
Synthetic data generated for research. Before use, verify that it complies with your model and deployment policies.
Provider:
2796gauravc



