cycloevan/gdpr-sft-2277-combined
Source: Hugging Face. Updated 2026-04-19. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/cycloevan/gdpr-sft-2277-combined
Resource description:
---
license: apache-2.0
language:
- en
task_categories:
- text-generation
- question-answering
tags:
- gdpr
- compliance
- legal
- privacy
- instruction-tuning
- sft
size_categories:
- 1K<n<10K
pretty_name: GDPR SFT Combined (2,277 pairs)
---
# GDPR SFT Combined Dataset (2,277 instructions)
A supervised fine-tuning (SFT) dataset of English question-answer pairs on
GDPR (General Data Protection Regulation) compliance. It was produced during
the `gdpr-gemma2` Phase 5 experiment.
## Composition
| Source | Count | Method |
|---|---|---|
| `sims2k/GDPR_QA_instruct_dataset` | 316 | Original expert-authored pairs |
| Upstage Solar-Pro synthetic | 1,961 | LLM-generated, deduplicated (2,812 raw → 1,961 unique) |
| **Total** | **2,277** | |
The synthetic portion covers **101 GDPR topics** across 9 categories:
- Chapter I–IX (core articles, 46 topics)
- SCENARIO (business scenarios, 10 topics)
- SECTOR (healthcare / fintech / HR, 12 topics)
- ADVANCED (sub-processor / BCR / one-stop-shop, 8 topics)
- SPECIAL_DATA (biometric / genetic / children, 6 topics)
- WORKFLOW (DPIA / ROPA / breach, 7 topics)
- ENFORCEMENT (fines / case law, 5 topics)
- CROSS_REG (ePrivacy / AI Act / NIS2, 4 topics)
- INTL (EU-US DPF / UK GDPR / LGPD, 3 topics)
Three question types were rotated during generation: `article_focused`,
`scenario_based`, and `comparative`.
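As a quick consistency check, the counts stated above can be summed in a few lines of Python. This is only a sketch: the dictionary keys are shortened labels for the nine categories, and every number is copied from the table and list in this section rather than computed from the data itself.

```python
# Topic counts per category, copied from the category list above.
topics_per_category = {
    "Chapter I-IX": 46,
    "SCENARIO": 10,
    "SECTOR": 12,
    "ADVANCED": 8,
    "SPECIAL_DATA": 6,
    "WORKFLOW": 7,
    "ENFORCEMENT": 5,
    "CROSS_REG": 4,
    "INTL": 3,
}

# Row counts copied from the composition table.
original_pairs = 316
synthetic_raw = 2812
synthetic_unique = 1961

total_topics = sum(topics_per_category.values())
total_pairs = original_pairs + synthetic_unique
# Share of the final dataset that is synthetic.
synthetic_share = synthetic_unique / total_pairs
# Share of raw synthetic generations removed by deduplication.
dropped_share = (synthetic_raw - synthetic_unique) / synthetic_raw

print(f"categories: {len(topics_per_category)}")
print(f"topics: {total_topics}")
print(f"total pairs: {total_pairs}")
print(f"synthetic share: {synthetic_share:.1%}")
print(f"dropped by dedup: {dropped_share:.1%}")
```

The sums reproduce the stated 101 topics and 2,277 pairs, and the dropped share is consistent with the roughly 30% duplicate rate reported for the raw generations.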
## Schema
```json
{
  "instruction": "How does the GDPR distinguish ...",
  "input": "Differentiate between ...",
  "output": "The GDPR distinguishes between ..."
}
```
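The three fields match the common instruction / input / output layout, so a record can be flattened into a single training string. The sketch below uses an Alpaca-style template as an assumption; the `format_example` helper and the `### ...` section markers are hypothetical and are not taken from the source project, which may have used the Gemma chat template instead.

```python
def format_example(record: dict) -> str:
    """Render one record as a single prompt/response string.

    The section markers are an assumed Alpaca-style layout, not the
    template used by the source project.
    """
    instruction = record["instruction"].strip()
    context = (record.get("input") or "").strip()
    output = record["output"].strip()

    parts = [f"### Instruction:\n{instruction}"]
    # Only emit the input section when the record actually carries one.
    if context:
        parts.append(f"### Input:\n{context}")
    parts.append(f"### Response:\n{output}")
    return "\n\n".join(parts)


# Placeholder record mirroring the schema example above.
sample = {
    "instruction": "How does the GDPR distinguish ...",
    "input": "Differentiate between ...",
    "output": "The GDPR distinguishes between ...",
}
print(format_example(sample))
```

Skipping the input section for empty values keeps records without extra context from carrying a dangling header.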
## Quick load
```python
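from datasets import load_dataset

# Download the dataset from the Hugging Face Hub and select the train split.
ds = load_dataset("cycloevan/gdpr-sft-2277-combined", split="train")
# Each record is a dict with "instruction", "input", and "output" keys.
print(ds[0])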
```
## Known limitations
- **Synthetic duplicate rate**: ~30% of raw Solar-Pro generations overlapped
in instruction prefix. These were deduplicated at the instruction level,
keeping the longest output for each instruction.
- **English only**.
- **Not a retrieval substitute**: article citations inside outputs are
model-generated and should be verified against the official GDPR text.
- **Empirically does not lift a 2B SFT above Base**: in the source project,
training `gemma-2-2b-it` on this data produced SFT v2 ≈ Base on all four
GPT-4o qualitative criteria (see Phase 5 findings).
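The instruction-level deduplication mentioned in the first limitation can be sketched as follows. The exact matching rule used in the source project is not documented here, so the key function is an assumption: `normalize`, `dedupe_keep_longest`, and the 80-character `prefix_chars` default are hypothetical names and values chosen for illustration.

```python
import re


def normalize(instruction: str, prefix_chars: int = 80) -> str:
    """Assumed dedup key: lowercased, whitespace-collapsed instruction prefix."""
    collapsed = re.sub(r"\s+", " ", instruction.strip().lower())
    return collapsed[:prefix_chars]


def dedupe_keep_longest(records: list[dict], prefix_chars: int = 80) -> list[dict]:
    """Keep one record per instruction key, preferring the longest output."""
    best: dict[str, dict] = {}
    for record in records:
        key = normalize(record["instruction"], prefix_chars)
        kept = best.get(key)
        # Replace the stored record only if this one has a longer output.
        if kept is None or len(record["output"]) > len(kept["output"]):
            best[key] = record
    return list(best.values())


# Toy records: the first two differ only in casing and spacing.
raw = [
    {"instruction": "What is a DPIA?", "input": "", "output": "Short answer."},
    {"instruction": "what is a  DPIA?", "input": "", "output": "A longer, more detailed answer."},
    {"instruction": "Who appoints a DPO?", "input": "", "output": "Another answer."},
]
unique = dedupe_keep_longest(raw)
print(len(raw), "->", len(unique))
```

A shorter `prefix_chars` merges more aggressively, which may be worth varying when studying how duplication behaves under rotated topic prompts.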
## Intended use
- SFT / instruction-tuning for GDPR compliance assistants.
- As a starting point for larger base models (9B+) where data quantity
may translate to measurable quality lift.
- For studying synthetic-data duplication behaviour under rotated topic
prompts at temperature 0.7.
## Citation
```bibtex
@misc{gdpr_sft_combined_2026,
  title = {GDPR SFT Combined: 2,277 GDPR Instruct Pairs (Original + Upstage Solar Synthetic)},
  author = {seok-hee97},
  year = {2026},
  howpublished = {Hugging Face Datasets},
  url = {https://huggingface.co/datasets/cycloevan/gdpr-sft-2277-combined}
}
```
## License
Apache-2.0. Downstream users are responsible for ensuring compliance with
the license terms of the seed dataset (`sims2k/GDPR_QA_instruct_dataset`)
and of the Upstage API outputs used in the synthetic portion.
Provider:
cycloevan



