legal-eval

Name: legal-eval
Creator: maas
Published: 2026-04-28 17:02:45
License: No description available

ModelScope community, updated 2026-04-28, indexed 2026-05-03

Download link:

https://modelscope.cn/datasets/AI-ModelScope/legal-eval

Resource description:

# legal-eval

A unified evaluation dataset aggregating multiple legal reasoning benchmarks into a single flat schema for cost-efficient LLM evaluation. All samples are pre-formatted as zero-shot prompts, ready to send directly to a model.

Total: **~9,769 samples across 202 tasks** from 5 source benchmarks.

## Schema

Each row has 5 columns:

| Column | Description |
|--------|-------------|
| `benchmark` | Source benchmark: `legalbench`, `barexam`, `lexam`, `housingqa`, or `legal_hallucinations` |
| `task_name` | Specific task within that benchmark |
| `input` | The full prompt, ready to send to a model (with all placeholders already filled in) |
| `answer` | The gold answer |
| `eval_method` | How to score the response: `contained_in_output`, `all_in_output`, `any_in_output`, or `numeric_within_1pct` |

The dataset has a single `train` split (it is eval-only).

Note: `input` does **not** include a system prompt. At inference time, you may want to prepend something like `"Answer with 'The answer is ' followed by your answer."` to make responses easier to parse.

## Benchmarks

### LegalBench

159 tasks covering contract analysis, privacy policies, statutory reasoning, case law, and more. Sourced from [nguha/legalbench-staging](https://huggingface.co/datasets/nguha/legalbench-staging).

- **Sampling:** 50 samples per task (or all if fewer), random seed 42
- **Prompt construction:** Uses the zero-shot `instruction` from `task_metadata.json` and fills `{{placeholders}}` with row columns. MAUD tasks get an "Option A/B/..." suffix. Yes/No tasks have `Answer with "Yes" or "No".` injected. SSLA tasks convert answers to JSON arrays.
- **Excluded tasks:** `rule_qa`, `citation_prediction_classification`, `citation_prediction_open`

### BarExam

All 117 MBE (Multistate Bar Examination) multiple-choice questions from the test split. Sourced from [reglab/barexam_qa](https://huggingface.co/datasets/reglab/barexam_qa).

- **Sampling:** All 117 questions (no subsampling)
- **Prompt construction:** Question + four lettered choices + `Answer with A, B, C, or D.` + `Answer:`
- **Answer format:** Single letter (A, B, C, or D)

### LEXam

229 English-language multiple-choice questions on Swiss law from the `mcq_32_choices` config. Sourced from [LEXam-Benchmark/LEXam](https://huggingface.co/datasets/LEXam-Benchmark/LEXam).

- **Sampling:** All 229 English rows (German questions filtered out)
- **Prompt construction:** Question + 32 lettered choices (A through AF) + `Answer with one of: A, B, C, ..., AF.` + `Answer:`
- **Answer format:** Letter (A through AF)

### HousingQA

US housing and eviction law questions with statutory excerpts. This is a statutory reasoning regime: the model receives the relevant statute text alongside each question. Sourced from [reglab/housing_qa](https://huggingface.co/datasets/reglab/housing_qa).

- **Sampling:** 50 per question type × 33 question types = **~1,200 samples**
- **Prompt construction:** Relevant statutes (citations + excerpts) + question + `Answer with "Yes" or "No".` + `Answer:`
- **Answer format:** Yes / No
- **Filtering:** 8 question types in the source data have 100% Yes answers (e.g., "Is there a state/territory law regulating residential evictions?", because every state has one). These degenerate tasks are excluded since a model that always predicts "Yes" would score 100%.

### Legal Hallucinations

Factual recall tasks about US federal court cases. Sourced from [reglab/legal_hallucinations](https://huggingface.co/datasets/reglab/legal_hallucinations).

- **Sampling:** 100 random samples per task × 4 tasks = **400 samples**
- **Included tasks:**
  - `affirm_reverse`: Did the court affirm or reverse the lower court's decision?
  - `case_existence`: Is this a real case? (50 real + 50 fake cases mixed)
  - `citation_retrieval`: What is the correct citation for this case?
  - `year_overruled`: What year was this case overruled?
- **Prompt construction:** Uses the dataset's `query` field verbatim + `Answer:`
- **Note on `case_existence`:** Mixes 50 questions about real cases (answer: Yes) with 50 questions about fabricated cases (answer: No, sourced from `fake_case_existence` in the upstream dataset). This prevents the task from being trivially all-Yes.

## Usage

```python
from datasets import load_dataset

ds = load_dataset("nguha/legal-eval", split="train")

# Filter to a specific benchmark
legalbench = ds.filter(lambda x: x["benchmark"] == "legalbench")
barexam = ds.filter(lambda x: x["benchmark"] == "barexam")
lexam = ds.filter(lambda x: x["benchmark"] == "lexam")
housingqa = ds.filter(lambda x: x["benchmark"] == "housingqa")
hallucinations = ds.filter(lambda x: x["benchmark"] == "legal_hallucinations")

# Filter to a specific task
hearsay = ds.filter(lambda x: x["task_name"] == "hearsay")

# Example: run a model
# NOTE: `model` is a placeholder for your own inference client; it is not defined here.
for row in ds:
    prompt = row["input"]
    # Optionally prepend a system message like
    # "Answer with 'The answer is ' followed by your answer."
    response = model.generate(prompt)
    # Score using row["eval_method"] against row["answer"]
```

## Evaluation Methods

| Method | Description |
|--------|-------------|
| `contained_in_output` | Pass if `answer` appears as a substring of the response |
| `all_in_output` | `answer` is a JSON array; pass if all items appear in the response |
| `any_in_output` | `answer` is a JSON array; pass if any item appears in the response |
| `numeric_within_1pct` | Extract a number from the response; pass if within 1% of `answer` |

## Regenerating the Dataset

```bash
python create_dataset.py                              # all benchmarks
python create_dataset.py --benchmarks legalbench      # just LegalBench
python create_dataset.py --benchmarks barexam lexam   # specific subset
python create_dataset.py --dry-run                    # preview without pushing
```

## Changelog

Notable changes to the dataset over time. Older versions are available in the dataset's git history on the HuggingFace Hub.

### 2026-04: HousingQA degenerate task filter + case_existence balance fix

- **HousingQA:** Filtered out 8 question types where 100% of source examples have the same answer (always "Yes"). A model that always predicted "Yes" would score 100% on these tasks, so they don't measure anything useful. Examples removed: `is_there_a_state_territory_law_regulating_residential_evictions`, `does_the_law_require_the_landlord_to_give_the_tenant_notice_to_vacate_...`, etc. HousingQA shrank from 41 to 33 tasks.
- **Legal Hallucinations / `case_existence`:** Previously sampled only from real cases (all answers "Yes"). Now samples 50 real cases ("Yes") + 50 fake cases ("No", from the upstream `fake_case_existence` task) and interleaves them. The task now meaningfully tests whether models can distinguish real cases from fabricated ones.

## Citation

If you use this dataset, please cite the individual source benchmarks:

```
@misc{guha2023legalbenchcollaborativelybuiltbenchmark,
  title={LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models},
  author={Neel Guha and Julian Nyarko and Daniel E. Ho and Christopher Ré and Adam Chilton and Aditya Narayana and Alex Chohlas-Wood and Austin Peters and Brandon Waldon and Daniel N. Rockmore and Diego Zambrano and Dmitry Talisman and Enam Hoque and Faiz Surani and Frank Fagan and Galit Sarfaty and Gregory M. Dickinson and Haggai Porat and Jason Hegland and Jessica Wu and Joe Nudell and Joel Niklaus and John Nay and Jonathan H. Choi and Kevin Tobia and Margaret Hagan and Megan Ma and Michael Livermore and Nikon Rasumov-Rahe and Nils Holzenberger and Noam Kolt and Peter Henderson and Sean Rehaag and Sharad Goel and Shang Gao and Spencer Williams and Sunny Gandhi and Tom Zur and Varun Iyer and Zehua Li},
  year={2023},
  eprint={2308.11462},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2308.11462},
}

@inproceedings{Zheng_2025,
  series={CSLAW ’25},
  title={A Reasoning-Focused Legal Retrieval Benchmark},
  url={http://dx.doi.org/10.1145/3709025.3712219},
  DOI={10.1145/3709025.3712219},
  booktitle={Proceedings of the Symposium on Computer Science and Law},
  publisher={ACM},
  author={Zheng, Lucia and Guha, Neel and Arifov, Javokhir and Zhang, Sarah and Skreta, Michal and Manning, Christopher D. and Henderson, Peter and Ho, Daniel E.},
  year={2025},
  month=mar,
  pages={169–193},
  collection={CSLAW ’25}
}

@misc{fan2026lexambenchmarkinglegalreasoning,
  title={LEXam: Benchmarking Legal Reasoning on 340 Law Exams},
  author={Yu Fan and Jingwei Ni and Jakob Merane and Yang Tian and Yoan Hermstrüwer and Yinya Huang and Mubashara Akhtar and Etienne Salimbeni and Florian Geering and Oliver Dreyer and Daniel Brunner and Markus Leippold and Mrinmaya Sachan and Alexander Stremitzer and Christoph Engel and Elliott Ash and Joel Niklaus},
  year={2026},
  eprint={2505.12864},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.12864},
}

@article{Dahl_2024,
  title={Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models},
  volume={16},
  ISSN={1946-5319},
  url={http://dx.doi.org/10.1093/jla/laae003},
  DOI={10.1093/jla/laae003},
  number={1},
  journal={Journal of Legal Analysis},
  publisher={Oxford University Press (OUP)},
  author={Dahl, Matthew and Magesh, Varun and Suzgun, Mirac and Ho, Daniel E},
  year={2024},
  month=jan,
  pages={64–93}
}
```
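
## Scoring sketch

The four `eval_method` values listed under Evaluation Methods can be turned into a single scoring function. The following is a minimal sketch and not the dataset authors' scorer. It rests on several assumptions that the README does not confirm: matching is case-sensitive, `answer` is always stored as a string, the first number found in the response is the one compared, and thousands separators are commas. The function name `score` is hypothetical.

```python
import json
import re


def score(response: str, answer: str, eval_method: str) -> bool:
    """Return True if `response` passes under `eval_method` (illustrative only)."""
    if eval_method == "contained_in_output":
        # Assumption: plain, case-sensitive substring match.
        return answer in response

    if eval_method in ("all_in_output", "any_in_output"):
        # `answer` is a JSON array; compare every item as a string.
        items = [str(item) for item in json.loads(answer)]
        hits = [item in response for item in items]
        return all(hits) if eval_method == "all_in_output" else any(hits)

    if eval_method == "numeric_within_1pct":
        # Assumption: take the first number in the response, allowing
        # comma thousands separators such as "1,005".
        match = re.search(r"-?\d[\d,]*(?:\.\d+)?", response)
        if match is None:
            return False
        predicted = float(match.group().replace(",", ""))
        gold = float(answer.replace(",", ""))
        if gold == 0:
            return predicted == 0
        return abs(predicted - gold) <= 0.01 * abs(gold)

    raise ValueError(f"unknown eval_method: {eval_method}")
```

Inside the loop from the Usage section this would be called as `score(response, row["answer"], row["eval_method"])`. Substring matching is sensitive to stray mentions of the gold answer, which is one reason to prepend the suggested "The answer is" instruction and keep responses short.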

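## Choice-label sketch

The LEXam prompt construction labels 32 choices from A through AF, which means the labels continue past Z in spreadsheet-column style (Z, AA, AB, ...). The sketch below shows one way to generate such labels and assemble a prompt of the described shape. The helper names `choice_label` and `build_mcq_prompt`, the `A. choice` line format, the newline separators, and the fully spelled-out label list in the instruction line are all assumptions; the README does not show the exact layout that `create_dataset.py` produces.

```python
def choice_label(index: int) -> str:
    """Spreadsheet-style label for a 0-based index: 0 is A, 25 is Z, 26 is AA."""
    label = ""
    n = index + 1
    while n > 0:
        # Bijective base-26: shift by one so that 26 maps to Z, not to "A0".
        n, remainder = divmod(n - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


def build_mcq_prompt(question: str, choices: list[str]) -> str:
    """Assemble question + lettered choices + answer instruction + 'Answer:'."""
    labels = [choice_label(i) for i in range(len(choices))]
    lines = [question, ""]
    # Assumed layout: one "LABEL. choice text" line per option.
    lines += [f"{label}. {choice}" for label, choice in zip(labels, choices)]
    lines.append(f"Answer with one of: {', '.join(labels)}.")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_mcq_prompt("Which option applies?", ["First", "Second", "Third"]))
```

With 32 choices the last label is `AF`, which matches the answer range stated for LEXam.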

Provider:

maas

Created:

2026-04-04
