souvickdascmsa019/GDPR_QA_dataset
Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/souvickdascmsa019/GDPR_QA_dataset
Resource description:
---
license: cc-by-4.0
language:
- en
task_categories:
- question-answering
- text-classification
tags:
- GDPR
- civil-law
- legal-ai
- RAG
- hallucination-detection
- retrieval-augmented-generation
- llm-evaluation
- compliance
- multi-domain
libraries:
- mlcroissant
pretty_name: Legal QA Evaluation Dataset (GDPR + Civil Law)
size_categories:
- n<1K
configs:
- config_name: gdpr
  data_files:
  - split: train
    path: GDPR_QA_dataset.json
- config_name: civil
  data_files:
  - split: train
    path: Civil_QA_dataset.json
- config_name: default
  data_files:
  - split: gdpr
    path: GDPR_QA_dataset.json
  - split: civil
    path: Civil_QA_dataset.json
---
# Legal QA Evaluation Dataset (GDPR + Civil Law)
A **multi-domain Retrieval-Augmented Generation (RAG) evaluation dataset** covering
GDPR and Civil Law provisions, developed at the **University of Luxembourg**.
## Dataset Description
This repository contains two evaluation datasets sharing the same schema:
| Config | File | Domain |
|---|---|---|
| `gdpr` | `GDPR_QA_dataset.json` | European General Data Protection Regulation |
| `civil` | `Civil_QA_dataset.json` | Civil Law provisions |
Each record in both datasets contains:
- **query** — A natural language question grounded in a specific legal provision
- **relevant_chunk** — The retrieved passage from the source legal document that
provides the factual basis for answering the query
- **gt_answer** — An expert-authored ground-truth answer used as the evaluation reference
- **answer_correctness** — A categorical label: `Correct`, `Partially Correct`, or `Incorrect`
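
As a minimal sketch of how the four fields above could be checked before running an evaluation, the helper below reports records with missing fields or unexpected labels. The function name `validate_record` and the two constants are illustrative and not part of the dataset; the sketch also assumes every field is stored as a non-empty string.

```python
from typing import Any

# Field names and labels as listed in the dataset description.
REQUIRED_FIELDS = ("query", "relevant_chunk", "gt_answer", "answer_correctness")
ALLOWED_LABELS = {"Correct", "Partially Correct", "Incorrect"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Assumption: every field is expected to be a non-empty string.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    label = record.get("answer_correctness")
    if isinstance(label, str) and label.strip() and label not in ALLOWED_LABELS:
        problems.append(f"unexpected label: {label!r}")
    return problems
```

Running this over every record and collecting the non-empty results gives a quick sanity check that a downloaded copy matches the documented schema.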
## Intended Use
This dataset is designed to:
- Benchmark LLM-based RAG pipelines on multi-domain legal texts
- Evaluate answer correctness and detect hallucinations in legal Q&A systems
- Support regulatory compliance checking research for GDPR and Civil Law
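
One possible way to use the labels for benchmarking is to compare them with the labels produced by an automated judge or hallucination detector. The sketch below is an assumption about such a workflow, not the authors' evaluation protocol; `label_agreement` and the `predicted` list are hypothetical names.

```python
from collections import Counter


def label_agreement(gold: list[str], predicted: list[str]) -> dict:
    """Compare dataset labels (gold) with labels from a hypothetical judge."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Count every (gold label, predicted label) pair.
    confusion = Counter(zip(gold, predicted))
    matches = sum(n for (g, p), n in confusion.items() if g == p)
    accuracy = matches / len(gold) if gold else 0.0
    return {"accuracy": accuracy, "confusion": confusion}


# Toy example with made-up labels, not taken from the dataset.
gold = ["Correct", "Incorrect", "Partially Correct", "Correct"]
predicted = ["Correct", "Correct", "Partially Correct", "Correct"]
result = label_agreement(gold, predicted)
print(result["accuracy"])
print(result["confusion"][("Incorrect", "Correct")])
```

The confusion counts show which label pairs the judge mixes up. For hallucination detection, an `Incorrect` answer judged as `Correct` is likely the most important case to track.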
## Dataset Structure
```json
[
  {
    "query": "...",
    "relevant_chunk": "...",
    "gt_answer": "...",
    "answer_correctness": "Correct | Partially Correct | Incorrect"
  }
]
```
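
Because each file is a single top-level JSON array, it can also be read with the standard library alone. The following is a sketch under that assumption; `label_distribution` is a hypothetical helper, and the sample text is made up rather than taken from the dataset.

```python
import json
from collections import Counter


def label_distribution(json_text: str) -> Counter:
    """Count answer_correctness labels in a JSON array of records."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array of records")
    return Counter(r.get("answer_correctness") for r in records)


# Made-up sample in the documented shape.
sample = json.dumps([
    {"query": "q1", "relevant_chunk": "c1", "gt_answer": "a1", "answer_correctness": "Correct"},
    {"query": "q2", "relevant_chunk": "c2", "gt_answer": "a2", "answer_correctness": "Incorrect"},
    {"query": "q3", "relevant_chunk": "c3", "gt_answer": "a3", "answer_correctness": "Correct"},
])
distribution = label_distribution(sample)
print(distribution)
```

For a local copy of `GDPR_QA_dataset.json`, the same function could be called on the file's text, for example via `pathlib.Path(...).read_text(encoding="utf-8")`.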
## Loading the Dataset
```python
from datasets import load_dataset

# Load GDPR subset only
gdpr_data = load_dataset("souvickdascmsa019/GDPR_QA_dataset", name="gdpr")

# Load Civil Law subset only
civil_data = load_dataset("souvickdascmsa019/GDPR_QA_dataset", name="civil")

# Load both subsets together (default config, with "gdpr" and "civil" splits)
all_data = load_dataset("souvickdascmsa019/GDPR_QA_dataset")

# Load with mlcroissant
import mlcroissant as mlc

ds = mlc.Dataset("https://huggingface.co/api/datasets/souvickdascmsa019/GDPR_QA_dataset/croissant")
for record in ds.records(record_set="gdpr_qa_records"):
    print(record)
```
## Source Data
Queries are synthetically generated from GDPR and Civil Law article text.
Ground-truth answers were authored by legal-AI domain experts at the University
of Luxembourg. Correctness labels were assigned through manual annotation against
the source provisions.
## Limitations and Biases
- Coverage may not be uniform across all articles within each legal domain
- Correctness labels reflect expert judgment and may embed interpretive biases
inherent to legal annotation
- Not intended for use as legal advice
## Croissant Metadata
This dataset includes a [Croissant](https://mlcommons.org/croissant/)
machine-readable metadata file (`metadata.json`) at the repository root,
compliant with the MLCommons Croissant 1.0 specification. It covers both
datasets and includes core and Responsible AI (RAI) fields.
## License
This dataset is released under
[Creative Commons Attribution 4.0 (CC-BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{das2026legalqa,
author = {Das, Souvick},
title = {{Legal QA Evaluation Dataset (GDPR and Civil Law)}},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/souvickdascmsa019/GDPR_QA_dataset},
institution = {University of Luxembourg}
}
```
## Contact
**Souvick Das** — University of Luxembourg
Provided by:
souvickdascmsa019



