
WithinUsAI/CitationGround-1M

Source: Hugging Face | Updated: 2025-12-31 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/WithinUsAI/CitationGround-1M
Resource description:
---
pretty_name: CitationGround-1M (Platinum)
language:
- en
license: apache-2.0
task_categories:
- question-answering
- text-generation
tags:
- rag
- grounding
- citations
- retrieval
- hallucination-reduction
- hard-negatives
size_categories:
- n<1K  # sample pack; replace after scaling
dataset_info:
  creator: "Within US AI"
  contact: "Within US AI"
  created: "2025-12-30T16:53:41Z"
  schema: "See Features section below"
---

# CitationGround-1M (Platinum)

**Developer/Publisher:** Within US AI

**Version:** 0.1.0 (sample pack)

**Created:** 2025-12-30T16:53:41Z

## What this dataset is

`CitationGround-1M` is a **citation-locked** grounded QA/RAG dataset:

- Answer using only the provided `contexts`
- Provide **span-level citations** (doc_id + offsets)
- Includes `answerable=false` hard negatives for abstention behavior

## Features / schema (JSONL)

- `example_id` (string)
- `question` (string)
- `contexts` (list of docs)
- `answer` (string)
- `citations` (list of spans)
- `answerable` (bool)
- `difficulty` (int; 1–5)
- `reason` (string)
- `language` (string)
- `created_utc` (string)
- `license_note` (string)

### Context doc format

- `doc_id`, `title`, `text`, `source_type`, `provenance`

### Citation span format

- `doc_id`, `start`, `end` (character offsets in `text`)

## Splits

- `data/train.jsonl`
- `data/validation.jsonl`
- `data/test.jsonl`

## How to load

```python
from datasets import load_dataset

# Load the three JSONL splits from a local copy of the repository.
ds = load_dataset(
    "json",
    data_files={
        "train": "data/train.jsonl",
        "validation": "data/validation.jsonl",
        "test": "data/test.jsonl",
    },
)
print(ds["train"][0])
```
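
## Resolving citation spans (illustrative sketch)

The span format above (`doc_id`, `start`, `end`) can be checked against the `contexts` of each record. The following is a minimal sketch, not part of the dataset's own tooling. It assumes offsets are 0-based and end-exclusive (Python slice convention), which the card does not state; adjust the bounds check if the real data uses a different convention. The helper name `resolve_citations` and the `toy_example` record are hypothetical and were made up for illustration.

```python
from typing import Any


def resolve_citations(example: dict[str, Any]) -> list[str]:
    """Return the text that each citation span points to.

    Assumption: `start`/`end` are 0-based, end-exclusive character
    offsets into the cited document's `text` (not confirmed by the card).
    Raises ValueError for an unknown doc_id or an out-of-range span.
    """
    # Index the context documents by doc_id for direct lookup.
    docs = {doc["doc_id"]: doc["text"] for doc in example["contexts"]}
    spans = []
    for cit in example["citations"]:
        doc_id, start, end = cit["doc_id"], cit["start"], cit["end"]
        if doc_id not in docs:
            raise ValueError(f"citation refers to unknown doc_id {doc_id!r}")
        text = docs[doc_id]
        if not (0 <= start < end <= len(text)):
            raise ValueError(
                f"span [{start}, {end}) is out of range for doc {doc_id!r}"
            )
        spans.append(text[start:end])
    return spans


# Hypothetical record following the schema fields listed above;
# it is NOT taken from the dataset.
toy_example = {
    "example_id": "toy-0001",
    "question": "What colour is the sky in the toy document?",
    "contexts": [
        {
            "doc_id": "d1",
            "title": "Toy",
            "text": "The sky is blue.",
            "source_type": "synthetic",
            "provenance": "illustration",
        }
    ],
    "answer": "Blue.",
    "citations": [{"doc_id": "d1", "start": 11, "end": 15}],
    "answerable": True,
}

print(resolve_citations(toy_example))  # → ['blue']
```

Running the same helper over a loaded split (for example, `for ex in ds["train"]: resolve_citations(ex)`) is one way to spot malformed spans before training or evaluation. How `citations` is populated for `answerable=false` records is not specified in the card, so that case is left unchecked here.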
Provider:
WithinUsAI