mukunda1729/hallucination-risk-cases

Name: mukunda1729/hallucination-risk-cases
Creator: mukunda1729
Published: 2026-04-27 16:20:10
License: No description available

Hugging Face · Updated 2026-04-27 · Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/mukunda1729/hallucination-risk-cases


Description:

---
license: mit
language:
- en
tags:
- hallucination
- llm
- evaluation
- factuality
- testing
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: train
    path: data.jsonl
---

# hallucination-risk-cases

20 hand-labeled (prompt → response → ground-truth) tuples covering common LLM hallucination failure modes. Each case is rated for hallucination risk so you can evaluate whether your detector / scorer / judge correctly distinguishes the safe responses from the fabricated ones.

## Categories

| Category | Count | What it tests |
|---|---|---|
| `factual` | 4 | Straightforward verifiable facts |
| `fabricated-citation` | 1 | Invented academic citations |
| `fabricated-api` | 1 | Invented standard-library functions |
| `fabricated-place` | 1 | Invented cities / locations |
| `fabricated-event` | 1 | Invented historical meetings |
| `fabricated-fact` | 1 | Invented chemistry / physics facts |
| `fabricated-work` | 1 | Invented books / papers |
| `fabricated-quote` | 1 | Invented page-specific quotes |
| `arithmetic` | 2 | Simple vs. long-form math |
| `statistical` | 2 | Population / GDP figures |
| `summary` | 1 | Plot summaries |
| `technical` | 2 | Standard-library and protocol facts |
| `negative-claim` | 1 | Correctly says "no evidence" |
| `future-event` | 1 | Pretends to know future events |

## Schema

```jsonc
{
  "id": "string",
  "prompt": "string",
  "response": "string",        // what the model said
  "ground_truth": "string",    // the truth (or "does not exist")
  "hallucination_risk": "low | medium | high",
  "category": "string",
  "notes": "string"
}
```

## Suggested use

Run your hallucination scorer over `prompt` + `response` and compare its label against `hallucination_risk`.
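The comparison described above can be sketched as a small evaluation loop. This is a minimal sketch, not something shipped with the dataset: the `Scorer` interface (a function that takes a prompt and a response and returns `"low"`, `"medium"`, or `"high"`) and the helper names `validate_record` and `evaluate` are hypothetical, so adapt them to whatever your detector actually outputs.

```python
from collections import Counter
from typing import Callable, Iterable, Mapping

# Labels used by the dataset's `hallucination_risk` field.
RISK_LEVELS = ("low", "medium", "high")

# Fields listed in the dataset schema.
REQUIRED_FIELDS = ("id", "prompt", "response", "ground_truth",
                   "hallucination_risk", "category", "notes")

# Hypothetical scorer interface: takes (prompt, response) and returns one of
# RISK_LEVELS. Wrap your own detector / judge to match this signature.
Scorer = Callable[[str, str], str]


def validate_record(record: Mapping[str, str]) -> None:
    """Raise ValueError if a record does not match the documented schema."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"record {record.get('id')!r} is missing {missing}")
    if record["hallucination_risk"] not in RISK_LEVELS:
        raise ValueError(
            f"record {record['id']!r} has unknown risk "
            f"{record['hallucination_risk']!r}"
        )


def evaluate(records: Iterable[Mapping[str, str]], scorer: Scorer) -> dict:
    """Compare scorer labels against the hand-labeled `hallucination_risk`.

    Returns overall accuracy, per-level accuracy (how often each gold
    label was reproduced), and confusion counts keyed by (gold, predicted).
    """
    confusion: Counter = Counter()
    seen: Counter = Counter()  # number of records per gold label
    hits: Counter = Counter()  # correct predictions per gold label
    for record in records:
        validate_record(record)
        gold = record["hallucination_risk"]
        predicted = scorer(record["prompt"], record["response"])
        confusion[(gold, predicted)] += 1
        seen[gold] += 1
        if predicted == gold:
            hits[gold] += 1
    total = sum(seen.values())
    return {
        "total": total,
        "accuracy": sum(hits.values()) / total if total else 0.0,
        "per_level": {level: hits[level] / seen[level] for level in seen},
        "confusion": dict(confusion),
    }
```

Iterating a `datasets` split yields plain dicts (the Quickstart relies on this), so the loaded `ds` can be passed straight in as `evaluate(ds, my_scorer)`, where `my_scorer` stands for your own detector. Exact-match accuracy is deliberately simple; the confusion counts show where the errors land, for example whether any `high` case was labeled `low`.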
A good detector should:

- Mark `low` cases as safe
- Flag `high` cases as risky
- Be conservative on `medium` (slight numeric drift)

## Quickstart

```python
from datasets import load_dataset

# Load the single `train` split (backed by data.jsonl).
ds = load_dataset("mukunda1729/hallucination-risk-cases", split="train")

# Keep only the cases labeled as high hallucination risk.
risky = [r for r in ds if r["hallucination_risk"] == "high"]
print(f"{len(risky)} high-risk cases")
```

## Related

- [The Agent Reliability Stack](https://mukundakatta.github.io/agent-stack/)

## License

MIT.

Provider:

mukunda1729
