reglab/legal_rag_hallucinations
Source: Hugging Face. Updated 2024-11-14; indexed 2025-09-13.
Download link:
https://hf-mirror.com/datasets/reglab/legal_rag_hallucinations
Resource description:
---
license: cc-by-4.0
---
# Dataset Card for _Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools_
This data release contains the queries and raw model outputs we analyze in Magesh, Surani, Dahl, Suzgun, Manning, and Ho, [Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools](https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf), Journal of Empirical Legal Studies (2024, forthcoming).
Consistent with the emerging understanding of AI benchmarking and leaderboards, we withhold a random sample of 50% of the dataset to guard against potential model memorization
and to ensure that a proper test set remains available for future evaluation ([Haimes et al., 2024](https://arxiv.org/abs/2410.09247); [Li & Flanigan, 2024](https://arxiv.org/abs/2312.16337)).
If you use this data, please cite the paper as follows:
```
@misc{magesh2024hallucinationfreeassessingreliabilityleading,
title={Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools},
author={Varun Magesh and Faiz Surani and Matthew Dahl and Mirac Suzgun and Christopher D. Manning and Daniel E. Ho},
year={2024},
eprint={2405.20362},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2405.20362},
}
```
Each line represents a query made to a commercially available legal research tool (or GPT-4), its response, and the label of the response.
This is the public dataset, so it does not contain information about all queries made. This public release includes 50% of the data analyzed in our work: 400 responses to 100 different queries.
## Dataset Details
Each line represents a query made to a tested tool, its response, and our evaluation of the response. There are 15 different kinds of queries, each created in a different way. Appendix A of the [paper](https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf) contains a detailed description of the method used to construct each kind of query.
Appendix D details our labeling protocol used to evaluate responses.
This public dataset contains 50% of the data analyzed in our work: 400 responses to 100 different queries. We withhold the remaining half.
Questions are stratified by category, with half of each category's data released. Additionally, questions are stratified by hallucination status. A question was marked as “hallucinated” if at least one of the four models tested provided a hallucinated answer.
By this measure, the full dataset includes 135 hallucinated questions, 66 of which are released here.
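
The question-level rule above (a question counts as hallucinated if any model's response to it was labeled a hallucination) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the CSV headers match the field names listed in this card (`Question ID`, `Label`) and that the hallucination label is spelled exactly `Hallucination`. The sample rows are made up.

```python
import csv
import io


def hallucinated_questions(rows, id_col="Question ID", label_col="Label",
                           hallucination_value="Hallucination"):
    """Return the set of question IDs with at least one hallucinated response.

    Column names and the label spelling are assumptions taken from the
    field list in this card; adjust them to the actual CSV headers.
    """
    flagged = set()
    for row in rows:
        if row[label_col].strip() == hallucination_value:
            flagged.add(row[id_col])
    return flagged


# Made-up sample in the assumed layout; NOT real rows from the release.
SAMPLE = """Question ID,Model,Label
q1,tool_a,Accurate
q1,tool_b,Hallucination
q2,tool_a,Accurate
q2,tool_b,Accurate
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flagged = hallucinated_questions(sample_rows)
print(sorted(flagged))
```

Applied to the rows of dataset.csv, the size of the returned set could be compared against the 66 hallucinated questions reported for this release, provided the assumed column names and label spelling hold.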
- *Created by*: Varun Magesh, Faiz Surani, Matt Dahl, Mirac Suzgun, Christopher Manning, and Daniel E. Ho.
- *Languages*: English
- *Paper*: Magesh et al., Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, Journal of Empirical Legal Studies (2024, forthcoming)
- *Preprint*: https://arxiv.org/abs/2405.20362
## Uses
The statistics reported in the paper can be partially reproduced from this release; only partially, because half of the data is withheld. The dataset could also be adapted for use as an evaluation benchmark.
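
As one hedged example of such a partial reproduction, the sketch below computes the share of each label per model. It assumes columns named `Model` and `Label`, as listed in this card. The tool names and rows are invented for illustration. Because only half of the data is public, rates computed this way need not match the paper's figures.

```python
from collections import Counter, defaultdict


def label_rates_by_model(rows, model_col="Model", label_col="Label"):
    """Map each model to {label: fraction of that model's responses}.

    `rows` is any iterable of dicts, e.g. from csv.DictReader.
    Column names are assumptions based on this card's field list.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[model_col]][row[label_col]] += 1
    rates = {}
    for model, label_counts in counts.items():
        total = sum(label_counts.values())
        rates[model] = {label: n / total for label, n in label_counts.items()}
    return rates


# Invented rows; NOT real data from the release.
example_rows = [
    {"Model": "tool_a", "Label": "Accurate"},
    {"Model": "tool_a", "Label": "Accurate"},
    {"Model": "tool_a", "Label": "Accurate"},
    {"Model": "tool_a", "Label": "Hallucination"},
    {"Model": "tool_b", "Label": "Hallucination"},
    {"Model": "tool_b", "Label": "Accurate"},
]

rates = label_rates_by_model(example_rows)
for model, by_label in sorted(rates.items()):
    print(model, by_label)
```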
## Dataset Structure
The dataset includes two files:
dataset.csv, which has 400 lines structured as follows:
- Question ID: Unique identifier for each question.
- Category: Type of question (e.g., General Legal Research, Jurisdiction-Specific).
- Model: The model generating the response (e.g., Lexis+ AI, Westlaw).
- Question: Legal question posed to the model.
- Response: Model’s response to the question.
- Correctness: Label indicating if the response is factually correct.
- Groundedness: Label indicating if the response is supported by legal authority.
- Label: Final classification (e.g., "Accurate," "Hallucination").
questions.csv, which has 100 lines structured as follows:
- Question ID: Unique identifier for each question.
- Category: Type of question (e.g., General Legal Research, Jurisdiction-Specific).
- Question: Legal question posed to the model.
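
A minimal sketch of loading the two files and checking them against the structure described above. It assumes the CSV header names are exactly the field names in the lists (this card does not confirm the literal headers), and the helper names `read_rows` and `orphan_question_ids` are hypothetical. The demo uses in-memory stand-ins rather than the real files.

```python
import csv
import io

# Assumed header names, copied from the field lists in this card.
DATASET_COLUMNS = ["Question ID", "Category", "Model", "Question",
                   "Response", "Correctness", "Groundedness", "Label"]
QUESTION_COLUMNS = ["Question ID", "Category", "Question"]


def read_rows(handle, expected_columns):
    """Read a CSV from an open text handle, failing if columns are missing."""
    reader = csv.DictReader(handle)
    present = reader.fieldnames or []
    missing = [c for c in expected_columns if c not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def orphan_question_ids(dataset_rows, question_rows):
    """Question IDs used in the responses file but absent from the questions file."""
    known = {row["Question ID"] for row in question_rows}
    used = {row["Question ID"] for row in dataset_rows}
    return sorted(used - known)


# In-memory stand-ins with invented content; NOT real rows from the release.
demo_dataset = io.StringIO(
    "Question ID,Category,Model,Question,Response,Correctness,Groundedness,Label\n"
    "q1,cat_x,tool_a,Example question?,Example answer.,Correct,Grounded,Accurate\n"
    "q2,cat_x,tool_a,Another question?,Another answer.,Incorrect,Grounded,Hallucination\n"
)
demo_questions = io.StringIO(
    "Question ID,Category,Question\n"
    "q1,cat_x,Example question?\n"
)

dataset_rows = read_rows(demo_dataset, DATASET_COLUMNS)
question_rows = read_rows(demo_questions, QUESTION_COLUMNS)
print(len(dataset_rows), len(question_rows),
      orphan_question_ids(dataset_rows, question_rows))
```

With the real files, the handles would come from `open("dataset.csv", newline="", encoding="utf-8")` and `open("questions.csv", newline="", encoding="utf-8")`; passing `newline=""` lets the `csv` module handle line breaks inside quoted response text.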
## Dataset Creation
The dataset was created for the paper cited above, with methods described in detail in the Methodology section and Appendix A.
## License
The questions in the dataset are distributed under the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/deed.en).
We believe that release of a small number (10) of historical bar exam prep questions (without multiple-choice options or answers) is a
transformative fair use, since it involves a very small number and proportion of the source material, serves public-interest educational purposes, and is
unlikely to affect markets for exams (since the questions are older and no longer for sale, and do not contain the multiple-choice options and answers necessary for use in test prep).
Much of the data can also already be found in other fair use compilations such as the [MMLU professional law auxiliary training set](https://arxiv.org/pdf/2009.03300), Common Crawl, and others.
## Source Data
The queries were written and curated using several legal sources: CourtListener; Shepard's citator data, courtesy of Jim Spriggs; LegalBench; public legal news articles by Bloomberg; and others.
## Personal and Sensitive Information
All queries draw on publicly available legal information. No personal or sensitive information is present.
Provider:
reglab



