nguha/legal-eval
收藏Hugging Face2026-04-07 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/nguha/legal-eval
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
license: mit
task_categories:
- question-answering
- text-classification
tags:
- legal
- law
- benchmark
pretty_name: Legal Eval
size_categories:
- 1K<n<10K
---
# legal-eval
A unified evaluation dataset aggregating multiple legal reasoning benchmarks into a single flat schema for cost-efficient LLM evaluation. All samples are pre-formatted as zero-shot prompts — ready to send directly to a model.
Total: **~9,769 samples across 202 tasks** from 5 source benchmarks.
## Schema
Each row has 5 columns:
| Column | Description |
|--------|-------------|
| `benchmark` | Source benchmark: `legalbench`, `barexam`, `lexam`, `housingqa`, or `legal_hallucinations` |
| `task_name` | Specific task within that benchmark |
| `input` | The full prompt, ready to send to a model (with all placeholders already filled in) |
| `answer` | The gold answer |
| `eval_method` | How to score the response: `contained_in_output`, `all_in_output`, `any_in_output`, or `numeric_within_1pct` |
The dataset has a single `train` split (it is eval-only).
Note: `input` does **not** include a system prompt. At inference time, you may want to prepend something like `"Answer with 'The answer is ' followed by your answer."` to make responses easier to parse.
## Benchmarks
### LegalBench
159 tasks covering contract analysis, privacy policies, statutory reasoning, case law, and more. Sourced from [nguha/legalbench-staging](https://huggingface.co/datasets/nguha/legalbench-staging).
- **Sampling:** 50 samples per task (or all if fewer), random seed 42
- **Prompt construction:** Uses the zero-shot `instruction` from `task_metadata.json` and fills `{{placeholders}}` with row columns. MAUD tasks get an "Option A/B/..." suffix. Yes/No tasks have `Answer with "Yes" or "No".` injected. SSLA tasks convert answers to JSON arrays.
- **Excluded tasks:** `rule_qa`, `citation_prediction_classification`, `citation_prediction_open`
### BarExam
All 117 MBE (Multistate Bar Examination) multiple-choice questions from the test split. Sourced from [reglab/barexam_qa](https://huggingface.co/datasets/reglab/barexam_qa).
- **Sampling:** All 117 questions (no subsampling)
- **Prompt construction:** Question + four lettered choices + `Answer with A, B, C, or D.` + `Answer:`
- **Answer format:** Single letter (A, B, C, or D)
### LEXam
229 English-language multiple-choice questions on Swiss law from the `mcq_32_choices` config. Sourced from [LEXam-Benchmark/LEXam](https://huggingface.co/datasets/LEXam-Benchmark/LEXam).
- **Sampling:** All 229 English rows (German questions filtered out)
- **Prompt construction:** Question + 32 lettered choices (A through AF) + `Answer with one of: A, B, C, ..., AF.` + `Answer:`
- **Answer format:** Letter (A through AF)
### HousingQA
US housing and eviction law questions with statutory excerpts. Statutory reasoning regime — the model receives the relevant statute text alongside each question. Sourced from [reglab/housing_qa](https://huggingface.co/datasets/reglab/housing_qa).
- **Sampling:** 50 per question type × 41 question types = **1,473 samples**
- **Prompt construction:** Relevant statutes (citations + excerpts) + question + `Answer with "Yes" or "No".` + `Answer:`
- **Answer format:** Yes / No
### Legal Hallucinations
Factual recall tasks about US federal court cases. Sourced from [reglab/legal_hallucinations](https://huggingface.co/datasets/reglab/legal_hallucinations).
- **Sampling:** 100 random samples per task × 4 tasks = **400 samples**
- **Included tasks:**
- `affirm_reverse` — Did the court affirm or reverse the lower court's decision?
- `case_existence` — Is this a real case?
- `citation_retrieval` — What is the correct citation for this case?
- `year_overruled` — What year was this case overruled?
- **Prompt construction:** Uses the dataset's `query` field verbatim + `Answer:`
## Usage
```python
from datasets import load_dataset
ds = load_dataset("nguha/legal-eval", split="train")
# Filter to a specific benchmark
legalbench = ds.filter(lambda x: x["benchmark"] == "legalbench")
barexam = ds.filter(lambda x: x["benchmark"] == "barexam")
lexam = ds.filter(lambda x: x["benchmark"] == "lexam")
housingqa = ds.filter(lambda x: x["benchmark"] == "housingqa")
hallucinations = ds.filter(lambda x: x["benchmark"] == "legal_hallucinations")
# Filter to a specific task
hearsay = ds.filter(lambda x: x["task_name"] == "hearsay")
# Example: run a model
for row in ds:
prompt = row["input"]
# Optionally prepend a system message like
# "Answer with 'The answer is ' followed by your answer."
response = model.generate(prompt)
# Score using row["eval_method"] against row["answer"]
```
## Evaluation Methods
| Method | Description |
|--------|-------------|
| `contained_in_output` | Pass if `answer` appears as a substring of the response |
| `all_in_output` | `answer` is a JSON array; pass if all items appear in the response |
| `any_in_output` | `answer` is a JSON array; pass if any item appears in the response |
| `numeric_within_1pct` | Extract a number from the response; pass if within 1% of `answer` |
## Regenerating the Dataset
```bash
python create_dataset.py # all benchmarks
python create_dataset.py --benchmarks legalbench # just LegalBench
python create_dataset.py --benchmarks barexam lexam # specific subset
python create_dataset.py --dry-run # preview without pushing
```
## Citation
If you use this dataset, please cite the individual source benchmarks:
```
@misc{guha2023legalbenchcollaborativelybuiltbenchmark,
title={LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models},
author={Neel Guha and Julian Nyarko and Daniel E. Ho and Christopher Ré and Adam Chilton and Aditya Narayana and Alex Chohlas-Wood and Austin Peters and Brandon Waldon and Daniel N. Rockmore and Diego Zambrano and Dmitry Talisman and Enam Hoque and Faiz Surani and Frank Fagan and Galit Sarfaty and Gregory M. Dickinson and Haggai Porat and Jason Hegland and Jessica Wu and Joe Nudell and Joel Niklaus and John Nay and Jonathan H. Choi and Kevin Tobia and Margaret Hagan and Megan Ma and Michael Livermore and Nikon Rasumov-Rahe and Nils Holzenberger and Noam Kolt and Peter Henderson and Sean Rehaag and Sharad Goel and Shang Gao and Spencer Williams and Sunny Gandhi and Tom Zur and Varun Iyer and Zehua Li},
year={2023},
eprint={2308.11462},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2308.11462},
}
@inproceedings{Zheng_2025, series={CSLAW ’25},
title={A Reasoning-Focused Legal Retrieval Benchmark},
url={http://dx.doi.org/10.1145/3709025.3712219},
DOI={10.1145/3709025.3712219},
booktitle={Proceedings of the Symposium on Computer Science and Law on ZZZ},
publisher={ACM},
author={Zheng, Lucia and Guha, Neel and Arifov, Javokhir and Zhang, Sarah and Skreta, Michal and Manning, Christopher D. and Henderson, Peter and Ho, Daniel E.},
year={2025},
month=mar, pages={169–193},
collection={CSLAW ’25}
}
@misc{fan2026lexambenchmarkinglegalreasoning,
title={LEXam: Benchmarking Legal Reasoning on 340 Law Exams},
author={Yu Fan and Jingwei Ni and Jakob Merane and Yang Tian and Yoan Hermstrüwer and Yinya Huang and Mubashara Akhtar and Etienne Salimbeni and Florian Geering and Oliver Dreyer and Daniel Brunner and Markus Leippold and Mrinmaya Sachan and Alexander Stremitzer and Christoph Engel and Elliott Ash and Joel Niklaus},
year={2026},
eprint={2505.12864},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.12864},
}
@article{Dahl_2024,
title={Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models},
volume={16},
ISSN={1946-5319},
url={http://dx.doi.org/10.1093/jla/laae003},
DOI={10.1093/jla/laae003},
number={1},
journal={Journal of Legal Analysis},
publisher={Oxford University Press (OUP)},
author={Dahl, Matthew and Magesh, Varun and Suzgun, Mirac and Ho, Daniel E},
year={2024},
month=jan, pages={64–93} }
```
提供机构:
nguha



