nguha/legal-eval

Name: nguha/legal-eval
Creator: nguha
Published: 2026-04-07 14:00:18
License: MIT

Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/nguha/legal-eval

Description:

---
language:
- en
license: mit
task_categories:
- question-answering
- text-classification
tags:
- legal
- law
- benchmark
pretty_name: Legal Eval
size_categories:
- 1K<n<10K
---

# legal-eval

A unified evaluation dataset aggregating multiple legal reasoning benchmarks into a single flat schema for cost-efficient LLM evaluation. All samples are pre-formatted as zero-shot prompts, ready to send directly to a model.

Total: **~9,769 samples across 202 tasks** from 5 source benchmarks.

## Schema

Each row has 5 columns:

| Column | Description |
|--------|-------------|
| `benchmark` | Source benchmark: `legalbench`, `barexam`, `lexam`, `housingqa`, or `legal_hallucinations` |
| `task_name` | Specific task within that benchmark |
| `input` | The full prompt, ready to send to a model (with all placeholders already filled in) |
| `answer` | The gold answer |
| `eval_method` | How to score the response: `contained_in_output`, `all_in_output`, `any_in_output`, or `numeric_within_1pct` |

The dataset has a single `train` split (it is eval-only).

Note: `input` does **not** include a system prompt. At inference time, you may want to prepend something like `"Answer with 'The answer is ' followed by your answer."` to make responses easier to parse.

## Benchmarks

### LegalBench

159 tasks covering contract analysis, privacy policies, statutory reasoning, case law, and more. Sourced from [nguha/legalbench-staging](https://huggingface.co/datasets/nguha/legalbench-staging).

- **Sampling:** 50 samples per task (or all if fewer), random seed 42
- **Prompt construction:** Uses the zero-shot `instruction` from `task_metadata.json` and fills `{{placeholders}}` with row columns. MAUD tasks get an "Option A/B/..." suffix. Yes/No tasks have `Answer with "Yes" or "No".` injected. SSLA tasks convert answers to JSON arrays.
- **Excluded tasks:** `rule_qa`, `citation_prediction_classification`, `citation_prediction_open`

### BarExam

All 117 MBE (Multistate Bar Examination) multiple-choice questions from the test split. Sourced from [reglab/barexam_qa](https://huggingface.co/datasets/reglab/barexam_qa).

- **Sampling:** All 117 questions (no subsampling)
- **Prompt construction:** Question + four lettered choices + `Answer with A, B, C, or D.` + `Answer:`
- **Answer format:** Single letter (A, B, C, or D)

### LEXam

229 English-language multiple-choice questions on Swiss law from the `mcq_32_choices` config. Sourced from [LEXam-Benchmark/LEXam](https://huggingface.co/datasets/LEXam-Benchmark/LEXam).

- **Sampling:** All 229 English rows (German questions filtered out)
- **Prompt construction:** Question + 32 lettered choices (A through AF) + `Answer with one of: A, B, C, ..., AF.` + `Answer:`
- **Answer format:** Letter (A through AF)

### HousingQA

US housing and eviction law questions with statutory excerpts. This is a statutory reasoning setting: the model receives the relevant statute text alongside each question. Sourced from [reglab/housing_qa](https://huggingface.co/datasets/reglab/housing_qa).

- **Sampling:** Up to 50 per question type across 41 question types, for **1,473 samples** in total (fewer than the 2,050 a full 50 × 41 draw would give, so some question types contribute fewer than 50)
- **Prompt construction:** Relevant statutes (citations + excerpts) + question + `Answer with "Yes" or "No".` + `Answer:`
- **Answer format:** Yes / No

### Legal Hallucinations

Factual recall tasks about US federal court cases. Sourced from [reglab/legal_hallucinations](https://huggingface.co/datasets/reglab/legal_hallucinations).

- **Sampling:** 100 random samples per task × 4 tasks = **400 samples**
- **Included tasks:**
  - `affirm_reverse`: Did the court affirm or reverse the lower court's decision?
  - `case_existence`: Is this a real case?
  - `citation_retrieval`: What is the correct citation for this case?
  - `year_overruled`: What year was this case overruled?
- **Prompt construction:** Uses the dataset's `query` field verbatim + `Answer:`

## Usage

```python
from datasets import load_dataset

ds = load_dataset("nguha/legal-eval", split="train")

# Filter to a specific benchmark
legalbench = ds.filter(lambda x: x["benchmark"] == "legalbench")
barexam = ds.filter(lambda x: x["benchmark"] == "barexam")
lexam = ds.filter(lambda x: x["benchmark"] == "lexam")
housingqa = ds.filter(lambda x: x["benchmark"] == "housingqa")
hallucinations = ds.filter(lambda x: x["benchmark"] == "legal_hallucinations")

# Filter to a specific task
hearsay = ds.filter(lambda x: x["task_name"] == "hearsay")

# Example: run a model
# NOTE: `model` is a placeholder for your own inference client; it is not
# defined by this dataset or by the `datasets` library.
for row in ds:
    prompt = row["input"]
    # Optionally prepend a system message like
    # "Answer with 'The answer is ' followed by your answer."
    response = model.generate(prompt)
    # Score using row["eval_method"] against row["answer"]
```

## Evaluation Methods

| Method | Description |
|--------|-------------|
| `contained_in_output` | Pass if `answer` appears as a substring of the response |
| `all_in_output` | `answer` is a JSON array; pass if all items appear in the response |
| `any_in_output` | `answer` is a JSON array; pass if any item appears in the response |
| `numeric_within_1pct` | Extract a number from the response; pass if within 1% of `answer` |

## Regenerating the Dataset

```bash
python create_dataset.py                            # all benchmarks
python create_dataset.py --benchmarks legalbench    # just LegalBench
python create_dataset.py --benchmarks barexam lexam # specific subset
python create_dataset.py --dry-run                  # preview without pushing
```

## Citation

If you use this dataset, please cite the individual source benchmarks:

```
@misc{guha2023legalbenchcollaborativelybuiltbenchmark,
  title={LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models},
  author={Neel Guha and Julian Nyarko and Daniel E. Ho and Christopher Ré and Adam Chilton and Aditya Narayana and Alex Chohlas-Wood and Austin Peters and Brandon Waldon and Daniel N. Rockmore and Diego Zambrano and Dmitry Talisman and Enam Hoque and Faiz Surani and Frank Fagan and Galit Sarfaty and Gregory M. Dickinson and Haggai Porat and Jason Hegland and Jessica Wu and Joe Nudell and Joel Niklaus and John Nay and Jonathan H. Choi and Kevin Tobia and Margaret Hagan and Megan Ma and Michael Livermore and Nikon Rasumov-Rahe and Nils Holzenberger and Noam Kolt and Peter Henderson and Sean Rehaag and Sharad Goel and Shang Gao and Spencer Williams and Sunny Gandhi and Tom Zur and Varun Iyer and Zehua Li},
  year={2023},
  eprint={2308.11462},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2308.11462},
}

@inproceedings{Zheng_2025,
  series={CSLAW ’25},
  title={A Reasoning-Focused Legal Retrieval Benchmark},
  url={http://dx.doi.org/10.1145/3709025.3712219},
  DOI={10.1145/3709025.3712219},
  booktitle={Proceedings of the Symposium on Computer Science and Law},
  publisher={ACM},
  author={Zheng, Lucia and Guha, Neel and Arifov, Javokhir and Zhang, Sarah and Skreta, Michal and Manning, Christopher D. and Henderson, Peter and Ho, Daniel E.},
  year={2025},
  month=mar,
  pages={169–193},
  collection={CSLAW ’25}
}

@misc{fan2026lexambenchmarkinglegalreasoning,
  title={LEXam: Benchmarking Legal Reasoning on 340 Law Exams},
  author={Yu Fan and Jingwei Ni and Jakob Merane and Yang Tian and Yoan Hermstrüwer and Yinya Huang and Mubashara Akhtar and Etienne Salimbeni and Florian Geering and Oliver Dreyer and Daniel Brunner and Markus Leippold and Mrinmaya Sachan and Alexander Stremitzer and Christoph Engel and Elliott Ash and Joel Niklaus},
  year={2026},
  eprint={2505.12864},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.12864},
}

@article{Dahl_2024,
  title={Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models},
  volume={16},
  ISSN={1946-5319},
  url={http://dx.doi.org/10.1093/jla/laae003},
  DOI={10.1093/jla/laae003},
  number={1},
  journal={Journal of Legal Analysis},
  publisher={Oxford University Press (OUP)},
  author={Dahl, Matthew and Magesh, Varun and Suzgun, Mirac and Ho, Daniel E},
  year={2024},
  month=jan,
  pages={64–93}
}
```
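
The four `eval_method` values listed under Evaluation Methods can be turned into a scorer along the following lines. This is a minimal sketch, not the dataset's reference implementation: the function name `score_response` is hypothetical, and several details are assumptions the dataset card does not specify, namely case-insensitive matching, stripping thousands separators, and using the first number found in the response.

```python
import json
import re


def score_response(response: str, answer: str, eval_method: str) -> bool:
    """Return True if `response` passes under `eval_method` (hypothetical helper)."""
    # Assumption: matching is case-insensitive; the card does not say either way.
    resp = response.strip().lower()

    if eval_method == "contained_in_output":
        return answer.strip().lower() in resp

    if eval_method in ("all_in_output", "any_in_output"):
        # For these methods `answer` is a JSON array encoded as a string.
        items = [str(item).strip().lower() for item in json.loads(answer)]
        hits = [item in resp for item in items]
        if eval_method == "all_in_output":
            # Guard against an empty array passing vacuously.
            return bool(hits) and all(hits)
        return any(hits)

    if eval_method == "numeric_within_1pct":
        gold = float(answer)
        # Assumption: drop thousands separators and take the first number found.
        match = re.search(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
        if match is None:
            return False
        return abs(float(match.group()) - gold) <= 0.01 * abs(gold)

    raise ValueError(f"Unknown eval_method: {eval_method!r}")
```

Inside the loop from the Usage section, this would be called as `score_response(response, row["answer"], row["eval_method"])`. Substring matching is lenient by design: a gold answer of `No` also matches a response such as "I do not know", so prompting the model for a fixed answer format (as suggested in the Schema note) helps keep scores meaningful.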
