
mukunda1729/hallucination-risk-cases

Source: Hugging Face. Updated 2026-04-27. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/mukunda1729/hallucination-risk-cases
Resource description:
---
license: mit
language:
- en
tags:
- hallucination
- llm
- evaluation
- factuality
- testing
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: train
    path: data.jsonl
---

# hallucination-risk-cases

20 hand-labeled (prompt → response → ground-truth) tuples covering common LLM hallucination failure modes. Each case is rated for hallucination risk so you can evaluate whether your detector / scorer / judge correctly distinguishes the safe responses from the fabricated ones.

## Categories

| Category | Count | What it tests |
|---|---|---|
| `factual` | 4 | Straightforward verifiable facts |
| `fabricated-citation` | 1 | Invented academic citations |
| `fabricated-api` | 1 | Invented standard-library functions |
| `fabricated-place` | 1 | Invented cities / locations |
| `fabricated-event` | 1 | Invented historical meetings |
| `fabricated-fact` | 1 | Invented chemistry / physics facts |
| `fabricated-work` | 1 | Invented books / papers |
| `fabricated-quote` | 1 | Invented page-specific quotes |
| `arithmetic` | 2 | Simple vs. long-form math |
| `statistical` | 2 | Population / GDP figures |
| `summary` | 1 | Plot summaries |
| `technical` | 2 | Standard-library and protocol facts |
| `negative-claim` | 1 | Correctly says "no evidence" |
| `future-event` | 1 | Pretends to know future events |

## Schema

```jsonc
{
  "id": "string",
  "prompt": "string",
  "response": "string",        // what the model said
  "ground_truth": "string",    // the truth (or "does not exist")
  "hallucination_risk": "low | medium | high",
  "category": "string",
  "notes": "string"
}
```

## Suggested use

Run your hallucination scorer over `prompt` + `response`, compare its label against `hallucination_risk`.

A good detector should:

- Mark `low` cases as safe
- Flag `high` cases as risky
- Be conservative on `medium` (slight numeric drift)

## Quickstart

```python
from datasets import load_dataset

# Load the single "train" split declared in the dataset config.
ds = load_dataset("mukunda1729/hallucination-risk-cases", split="train")

# Keep only the cases labeled as high hallucination risk.
risky = [r for r in ds if r["hallucination_risk"] == "high"]
print(f"{len(risky)} high-risk cases")
```

## Related

- [The Agent Reliability Stack](https://mukundakatta.github.io/agent-stack/)

## License

MIT.
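The comparison described under "Suggested use" can be sketched roughly as follows. This is a minimal illustration, not part of the dataset: `evaluate`, `toy_scorer`, and the `SAMPLE` rows are hypothetical names and placeholder data invented here, and the sketch assumes your scorer returns one of the three labels `low`, `medium`, or `high`. Only the field names (`prompt`, `response`, `hallucination_risk`) come from the schema above.

```python
from collections import Counter
from typing import Callable, Iterable, Mapping

LABELS = ("low", "medium", "high")


def evaluate(cases: Iterable[Mapping[str, str]],
             scorer: Callable[[str, str], str]) -> dict:
    """Compare a scorer's labels against the hand-assigned `hallucination_risk`."""
    confusion = Counter()  # (expected, predicted) -> count
    for case in cases:
        predicted = scorer(case["prompt"], case["response"])
        confusion[(case["hallucination_risk"], predicted)] += 1

    total = sum(confusion.values())
    correct = sum(confusion[(label, label)] for label in LABELS)
    high_total = sum(n for (expected, _), n in confusion.items()
                     if expected == "high")
    return {
        "accuracy": correct / total if total else 0.0,
        # Share of `high` cases the scorer actually flagged as `high`.
        "high_recall": confusion[("high", "high")] / high_total if high_total else 0.0,
        # Safe responses wrongly flagged as risky.
        "low_flagged_high": confusion[("low", "high")],
        "confusion": dict(confusion),
    }


def toy_scorer(prompt: str, response: str) -> str:
    """Hypothetical stand-in for a real detector: a crude keyword heuristic."""
    return "high" if "according to" in response.lower() else "low"


# Placeholder rows for illustration only; these are NOT rows from the dataset.
SAMPLE = [
    {"prompt": "p1", "response": "plain answer", "hallucination_risk": "low"},
    {"prompt": "p2", "response": "According to Smith et al.", "hallucination_risk": "high"},
    {"prompt": "p3", "response": "roughly 8 million", "hallucination_risk": "medium"},
    {"prompt": "p4", "response": "a made-up function", "hallucination_risk": "high"},
]

report = evaluate(SAMPLE, toy_scorer)
print(report["accuracy"], report["high_recall"])
```

To run this against the real data, pass the `ds` object from the Quickstart in place of `SAMPLE` and your own scorer in place of `toy_scorer`. Reporting recall on `high` separately from overall accuracy reflects the guidance that missing a fabricated response matters more than being cautious on `medium`.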
Provided by:
mukunda1729