
souvickdascmsa019/GDPR_QA_dataset

Source: Hugging Face. Updated 2026-03-27. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/souvickdascmsa019/GDPR_QA_dataset
Dataset description:
---
license: cc-by-4.0
language:
- en
task_categories:
- question-answering
- text-classification
tags:
- GDPR
- civil-law
- legal-ai
- RAG
- hallucination-detection
- retrieval-augmented-generation
- llm-evaluation
- compliance
- multi-domain
libraries:
- mlcroissant
pretty_name: Legal QA Evaluation Dataset (GDPR + Civil Law)
size_categories:
- n<1K
configs:
- config_name: gdpr
  data_files:
  - split: train
    path: GDPR_QA_dataset.json
- config_name: civil
  data_files:
  - split: train
    path: Civil_QA_dataset.json
- config_name: default
  data_files:
  - split: gdpr
    path: GDPR_QA_dataset.json
  - split: civil
    path: Civil_QA_dataset.json
---

# Legal QA Evaluation Dataset (GDPR + Civil Law)

A **multi-domain Retrieval-Augmented Generation (RAG) evaluation dataset** covering GDPR and Civil Law provisions, developed at the **University of Luxembourg**.

## Dataset Description

This repository contains two evaluation datasets sharing the same schema:

| Config | File | Domain |
|---|---|---|
| `gdpr` | `GDPR_QA_dataset.json` | European General Data Protection Regulation |
| `civil` | `Civil_QA_dataset.json` | Civil Law provisions |

Each record in both datasets contains:

- **query**: A natural language question grounded in a specific legal provision
- **relevant_chunk**: The retrieved passage from the source legal document that provides the factual basis for answering the query
- **gt_answer**: An expert-authored ground-truth answer used as the evaluation reference
- **answer_correctness**: A categorical label: `Correct`, `Partially Correct`, or `Incorrect`

## Intended Use

This dataset is designed to:

- Benchmark LLM-based RAG pipelines on multi-domain legal texts
- Evaluate answer correctness and detect hallucinations in legal Q&A systems
- Support regulatory compliance checking research for GDPR and Civil Law

## Dataset Structure

```json
[
  {
    "query": "...",
    "relevant_chunk": "...",
    "gt_answer": "...",
    "answer_correctness": "Correct | Partially Correct | Incorrect"
  }
]
```

## Loading the Dataset

```python
from datasets import load_dataset
import mlcroissant as mlc

# Load GDPR subset only
gdpr_data = load_dataset("souvickdascmsa019/GDPR_QA_dataset", name="gdpr")

# Load Civil Law subset only
civil_data = load_dataset("souvickdascmsa019/GDPR_QA_dataset", name="civil")

# Load both subsets together (default config)
all_data = load_dataset("souvickdascmsa019/GDPR_QA_dataset")

# Load with mlcroissant
ds = mlc.Dataset("https://huggingface.co/api/datasets/souvickdascmsa019/GDPR_QA_dataset/croissant")
for record in ds.records(record_set="gdpr_qa_records"):
    print(record)
```

## Source Data

Queries are synthetically generated from GDPR and Civil Law article text. Ground-truth answers were authored by legal-AI domain experts at the University of Luxembourg. Correctness labels were assigned through manual annotation against the source provisions.

## Limitations and Biases

- Coverage may not be uniform across all articles within each legal domain
- Correctness labels reflect expert judgment and may embed interpretive biases inherent to legal annotation
- Not intended for use as legal advice

## Croissant Metadata

This dataset includes a [Croissant](https://mlcommons.org/croissant/) machine-readable metadata file (`metadata.json`) at the repository root, compliant with the MLCommons Croissant 1.0 specification. It covers both datasets and includes core and Responsible AI (RAI) fields.

## License

This dataset is released under [Creative Commons Attribution 4.0 (CC-BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{das2026legalqa,
  author      = {Das, Souvick},
  title       = {{Legal QA Evaluation Dataset (GDPR and Civil Law)}},
  year        = {2026},
  publisher   = {Hugging Face},
  url         = {https://huggingface.co/datasets/souvickdascmsa019/GDPR_QA_dataset},
  institution = {University of Luxembourg}
}
```

## Contact

**Souvick Das**, University of Luxembourg
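
## Example: Checking Records Against the Schema

The four-field record schema and the three `answer_correctness` labels described under Dataset Structure can be checked with a small script before running an evaluation. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `validate_record` and `label_distribution` are hypothetical, and `sample_records` holds made-up placeholder records rather than real dataset entries.

```python
from collections import Counter

# Field names and label values as listed in the dataset card.
REQUIRED_FIELDS = ("query", "relevant_chunk", "gt_answer", "answer_correctness")
ALLOWED_LABELS = {"Correct", "Partially Correct", "Incorrect"}


def validate_record(record):
    """Return a list of problems found in one record (empty list if valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    label = record.get("answer_correctness")
    if isinstance(label, str) and label.strip() and label not in ALLOWED_LABELS:
        problems.append(f"unexpected label: {label}")
    return problems


def label_distribution(records):
    """Count correctness labels over the records that pass validation."""
    return Counter(
        r["answer_correctness"] for r in records if not validate_record(r)
    )


# Hypothetical placeholder records, for illustration only.
sample_records = [
    {
        "query": "placeholder question 1",
        "relevant_chunk": "placeholder passage 1",
        "gt_answer": "placeholder answer 1",
        "answer_correctness": "Correct",
    },
    {
        "query": "placeholder question 2",
        "relevant_chunk": "placeholder passage 2",
        "gt_answer": "placeholder answer 2",
        "answer_correctness": "Incorrect",
    },
    {
        "query": "placeholder question 3",
        "relevant_chunk": "placeholder passage 3",
        "answer_correctness": "Maybe",  # invalid label, and gt_answer is absent
    },
]

if __name__ == "__main__":
    for index, record in enumerate(sample_records):
        for problem in validate_record(record):
            print(f"record {index}: {problem}")
    print(dict(label_distribution(sample_records)))
```

To run the same check on the real data, the placeholder list could be replaced with the rows returned by `load_dataset`, assuming each row exposes these four string fields.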
Provider:
souvickdascmsa019