
2796gauravc/agentic-search-data

Hugging Face, updated 2026-03-28, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/2796gauravc/agentic-search-data
Resource description:
---
language:
- en
license: apache-2.0
pretty_name: Agentic Search Multi-hop Retrieval
size_categories:
- 1K<n<10K
task_categories:
- text-retrieval
tags:
- retrieval
- multi-hop
- synthetic
- agentic
---

# Agentic search: synthetic multi-hop retrieval dataset

JSONL artifacts for training and evaluating a retrieval agent across **web**, **finance**, **legal**, **code**, and **science**.

## Files

| File | Description |
|------|-------------|
| `corpus.jsonl` | Unique `doc_id` passages (supporting + distractor docs) with `domain` labels |
| `sft_dataset.jsonl` | Supervised fine-tuning tasks (~60% of tasks) |
| `rl_dataset.jsonl` | RL / GRPO-style prompts (~25%) |
| `eval_dataset.jsonl` | Held-out evaluation (~15%) |

Splits are **disjoint by `task_id`**: no eval task appears in SFT or RL.

## Schema (task rows)

- `task_id`, `domain`, `hops`, `difficulty`, `verified`
- `query`, `clues`, `answer`, `metadata`
- `supporting_docs`, `distractors` (list of objects with `doc_id`, `content`, `source`, …)

## Hub Parquet view

The Hugging Face **Datasets** server may expose a single Parquet split (often aligned with `eval_dataset.jsonl`). Full training data is always in the JSONL files above.

## License / use

Synthetic data generated for research; verify compliance with your model and deployment policies.
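
The split claims in the Files section (disjoint `task_id` values, roughly 60/25/15 proportions) can be checked locally with a short script. The following is a minimal sketch using only the Python standard library. The helper names `read_jsonl`, `overlapping_task_ids`, and `split_shares` are hypothetical and not part of the dataset; the only assumption about the data is that every task row carries a `task_id` field, as listed in the schema.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def overlapping_task_ids(splits):
    """Map each task_id found in more than one split to the sorted split names.

    `splits` is a dict of split name -> list of task rows (dicts).
    An empty result means the splits are disjoint by task_id.
    """
    seen = {}
    for name, rows in splits.items():
        for row in rows:
            seen.setdefault(row["task_id"], set()).add(name)
    return {tid: sorted(names) for tid, names in seen.items() if len(names) > 1}


def split_shares(splits):
    """Return the fraction of all task rows that falls in each split."""
    counts = {name: len(rows) for name, rows in splits.items()}
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}
```

Assuming the three task files have been downloaded into the working directory, building `splits = {"sft": list(read_jsonl("sft_dataset.jsonl")), "rl": list(read_jsonl("rl_dataset.jsonl")), "eval": list(read_jsonl("eval_dataset.jsonl"))}` and calling `overlapping_task_ids(splits)` should return an empty dict if the disjointness statement holds, and `split_shares(splits)` gives the observed proportions to compare against the approximate figures in the table.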
Provided by:
2796gauravc