isaacus/legal-rag-qa
Source: Hugging Face. Updated 2026-03-22; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/isaacus/legal-rag-qa
Resource description:
---
pretty_name: Legal RAG QA
task_categories:
- text-retrieval
- question-answering
tags:
- legal
- law
- criminal
language:
- en
language_details: en-US
annotations_creators:
- found
language_creators:
- found
license: cc-by-nc-sa-3.0
size_categories:
- n<1K
configs:
  - config_name: corpus
    data_files:
      - split: test
        path: corpus.jsonl
    default: true
  - config_name: qa
    data_files:
      - split: test
        path: qa.jsonl
---
# Legal RAG QA 🔍
**Legal RAG QA** by [Isaacus](https://isaacus.com/) is a challenging benchmark for evaluating the end-to-end performance of real-world legal RAG applications.
Legal RAG QA consists of 190 passages and external materials and 138 question–answer–relevant-passages triplets sourced from LibreTexts' [_Introduction to Criminal Law_](https://biz.libretexts.org/Bookshelves/Criminal_Law/Introduction_to_Criminal_Law) textbook.
This dataset was originally intended to serve as the [Legal RAG Bench](https://huggingface.co/datasets/isaacus/legal-rag-bench) benchmark but was subsequently replaced with a much larger dataset constructed from the Victorian Criminal Charge Book. Both datasets are of equally high quality, however. Accordingly, in the interests of supporting open AI evaluation, we have publicly released Legal RAG QA under the same license as the _Introduction to Criminal Law_ textbook.
## Usage 👩‍💻
Legal RAG QA may be loaded like so using the Hugging Face 🤗 [`datasets`](https://huggingface.co/docs/datasets/en/index) Python library:
```python
import datasets
# Load passages in Legal RAG QA.
corpus = datasets.load_dataset("isaacus/legal-rag-qa", name="corpus", split="test")
# Load question-answer-passage triplets from Legal RAG QA.
qa = datasets.load_dataset("isaacus/legal-rag-qa", name="qa", split="test")
```
## Structure 🗂️
Passages in the Legal RAG QA corpus are stored in the `corpus` subset, with each entry having the following fields:
- `id (string)`: a unique identifier for the passage.
- `section (string)`: a unique identifier for the section of the textbook from which the passage originates.
- `title (string)`: the title of the section of the textbook from which the passage originates.
- `text (string)`: the text of the passage, formatted in Markdown.
- `is_supplemental (boolean)`: whether the passage is supplementary material (`true`) rather than part of the textbook itself (`false`).
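As a minimal sketch of how these fields might be used, the snippet below types a `corpus` record and separates textbook passages from supplementary ones. The `Passage` class, the `split_by_source` helper, and the sample records are hypothetical and for illustration only; it also assumes that `is_supplemental` is `True` for supplementary material.

```python
from typing import Iterable, List, Tuple, TypedDict


class Passage(TypedDict):
    """One `corpus` record, mirroring the fields listed above."""

    id: str
    section: str
    title: str
    text: str
    is_supplemental: bool


def split_by_source(passages: Iterable[Passage]) -> Tuple[List[Passage], List[Passage]]:
    """Return (textbook_passages, supplemental_passages)."""
    textbook: List[Passage] = []
    supplemental: List[Passage] = []
    for passage in passages:
        # Assumption: True marks supplementary material, False the textbook itself.
        if passage["is_supplemental"]:
            supplemental.append(passage)
        else:
            textbook.append(passage)
    return textbook, supplemental


# Hypothetical records for illustration only; they are not taken from the dataset.
sample_corpus: List[Passage] = [
    {"id": "p1", "section": "s1", "title": "Sample section",
     "text": "Sample textbook passage.", "is_supplemental": False},
    {"id": "p2", "section": "s1", "title": "Sample section",
     "text": "Sample external material.", "is_supplemental": True},
]

textbook, supplemental = split_by_source(sample_corpus)
print(len(textbook), len(supplemental))
```

Iterating over a loaded `datasets.Dataset` yields one dictionary per row, so passing the `corpus` subset to `split_by_source` should work the same way as the in-memory list used here.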
Questions, answers, and the IDs of relevant passages are stored in the `qa` subset, with each entry having the following fields:
- `id (string)`: a unique identifier for the question.
- `question (string)`: the text of the question.
- `answer (string)`: the text of the answer to the question.
- `requires_supplemental (boolean)`: whether the question depends on supplementary material.
- `relevant_documents (string)`: the unique identifiers of passages in the `corpus` subset that are most relevant to the question.
The `corpus` and `qa` subsets of Legal RAG QA both currently have only a single split, `test`.
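The link between the two subsets can be sketched as follows: index the corpus by `id`, resolve each question's `relevant_documents` to passages, and score a retriever's ranking against them. This is an illustrative sketch, not the authors' evaluation code. The helper names and sample records are hypothetical, recall@k is just one common retrieval metric, and the code assumes `relevant_documents` holds either a single ID string or a sequence of ID strings.

```python
from typing import Any, Iterable, List, Mapping, Sequence, Union


def as_id_list(relevant_documents: Union[str, Sequence[str]]) -> List[str]:
    """Normalize `relevant_documents` to a list of passage IDs.

    Assumption: the field is a single ID string or a sequence of ID strings.
    """
    if isinstance(relevant_documents, str):
        return [relevant_documents]
    return [str(doc_id) for doc_id in relevant_documents]


def resolve_passages(
    qa_row: Mapping[str, Any], corpus_by_id: Mapping[str, Mapping[str, Any]]
) -> List[Mapping[str, Any]]:
    """Look up the corpus passages referenced by one `qa` row, skipping unknown IDs."""
    ids = as_id_list(qa_row["relevant_documents"])
    return [corpus_by_id[doc_id] for doc_id in ids if doc_id in corpus_by_id]


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of relevant passage IDs that appear in the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(relevant & top_k) / len(relevant)


# Hypothetical records for illustration only; they are not taken from the dataset.
sample_corpus = [
    {"id": "p1", "text": "Sample passage one."},
    {"id": "p2", "text": "Sample passage two."},
    {"id": "p3", "text": "Sample passage three."},
]
sample_qa = {
    "id": "q1",
    "question": "Sample question?",
    "answer": "Sample answer.",
    "relevant_documents": ["p1", "p3"],
}

corpus_by_id = {row["id"]: row for row in sample_corpus}
gold_passages = resolve_passages(sample_qa, corpus_by_id)

# Stand-in for a retriever's ranked output, best match first.
ranking = ["p3", "p2", "p1"]
score = recall_at_k(ranking, as_id_list(sample_qa["relevant_documents"]), k=2)
print([p["id"] for p in gold_passages], score)
```

Building `corpus_by_id` once keeps each lookup constant-time, which is cheap for a corpus of this size and avoids rescanning the passages for every question.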
## Methodology 🧪
Legal RAG QA was constructed by downloading the [_Introduction to Criminal Law_](https://biz.libretexts.org/Bookshelves/Criminal_Law/Introduction_to_Criminal_Law) textbook from LibreTexts, extracting question-answer pairs, and then downloading any relevant passages from the textbook or relevant external materials to serve as a corpus.
## License 📜
This dataset is licensed under [CC BY-NC-SA 3.0](https://creativecommons.org/licenses/by-nc-sa/3.0/).
## Citation 🔖
If you've relied on Legal RAG QA for your work, please cite it alongside Legal RAG Bench:
```bibtex
@misc{butler2026legalragqa,
title={Legal RAG QA},
author={Abdur-Rahman Butler and Umar Butler},
year={2026},
publisher = {Isaacus},
url={https://huggingface.co/datasets/isaacus/legal-rag-qa}
}
@misc{butler2026legalragbench,
title={Legal RAG Bench: an end-to-end benchmark for legal RAG},
author={Abdur-Rahman Butler and Umar Butler},
year={2026},
eprint={2603.01710},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.01710},
}
```
Provider: isaacus



