WithinUsAI/CitationGround-1M

Name: WithinUsAI/CitationGround-1M
Creator: WithinUsAI
Published: 2025-12-31 09:04:03
License: No description available

Source: Hugging Face. Updated 2025-12-31; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/WithinUsAI/CitationGround-1M

Resource description:

---
pretty_name: CitationGround-1M (Platinum)
language:
- en
license: apache-2.0
task_categories:
- question-answering
- text-generation
tags:
- rag
- grounding
- citations
- retrieval
- hallucination-reduction
- hard-negatives
size_categories:
- n<1K  # sample pack; replace after scaling
dataset_info:
  creator: "Within US AI"
  contact: "Within US AI"
  created: "2025-12-30T16:53:41Z"
  schema: "See Features section below"
---

# CitationGround-1M (Platinum)

**Developer/Publisher:** Within US AI

**Version:** 0.1.0 (sample pack)

**Created:** 2025-12-30T16:53:41Z

## What this dataset is

`CitationGround-1M` is a **citation-locked** grounded QA/RAG dataset:

- Answer using only the provided `contexts`
- Provide **span-level citations** (doc_id + offsets)
- Includes `answerable=false` hard negatives for abstention behavior

## Features / schema (JSONL)

- `example_id` (string)
- `question` (string)
- `contexts` (list of docs)
- `answer` (string)
- `citations` (list of spans)
- `answerable` (bool)
- `difficulty` (int; 1–5)
- `reason` (string)
- `language` (string)
- `created_utc` (string)
- `license_note` (string)

### Context doc format

- `doc_id`, `title`, `text`, `source_type`, `provenance`

### Citation span format

- `doc_id`, `start`, `end` (character offsets in `text`)

## Splits

- `data/train.jsonl`
- `data/validation.jsonl`
- `data/test.jsonl`

## How to load

```python
from datasets import load_dataset

# Load the three JSONL splits with the generic "json" builder.
ds = load_dataset("json", data_files={"train": "data/train.jsonl", "validation": "data/validation.jsonl", "test": "data/test.jsonl"})
print(ds["train"][0])
```
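Because each citation is a `doc_id` plus character offsets into that document's `text`, a loaded record can be checked by slicing the cited spans back out of its `contexts`. The following is a minimal sketch, not part of the dataset's own tooling. It assumes offsets are 0-based with an exclusive `end` (Python slice convention), which the card does not state; `resolve_citations` and the `sample` record are hypothetical and invented for illustration.

```python
from typing import Any


def resolve_citations(example: dict[str, Any]) -> list[str]:
    """Return the cited text for each span; raise ValueError on a bad span.

    Assumption: `start` is 0-based and `end` is exclusive.
    """
    # Index the context documents by doc_id for direct lookup.
    docs = {doc["doc_id"]: doc["text"] for doc in example["contexts"]}
    cited = []
    for span in example["citations"]:
        doc_id, start, end = span["doc_id"], span["start"], span["end"]
        if doc_id not in docs:
            raise ValueError(f"unknown doc_id: {doc_id!r}")
        text = docs[doc_id]
        # Reject empty, reversed, or out-of-bounds spans.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span {start}:{end} out of range for {doc_id!r}")
        cited.append(text[start:end])
    return cited


# Hypothetical record following the schema above; not taken from the dataset.
sample = {
    "example_id": "demo-0001",
    "question": "When was the bridge opened?",
    "contexts": [
        {
            "doc_id": "d1",
            "title": "Bridge history",
            "text": "The bridge opened in 1932.",
            "source_type": "synthetic",
            "provenance": "hand-written example",
        }
    ],
    "answer": "1932",
    "citations": [{"doc_id": "d1", "start": 21, "end": 25}],
    "answerable": True,
}

print(resolve_citations(sample))
```

If the real files use a different offset convention (for example an inclusive `end`), the slice and the bounds check would need to be adjusted accordingly.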

Provided by:

WithinUsAI
