
cycloevan/gdpr-sft-2277-combined

Hugging Face · Updated 2026-04-19 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/cycloevan/gdpr-sft-2277-combined
Resource description:
---
license: apache-2.0
language:
- en
task_categories:
- text-generation
- question-answering
tags:
- gdpr
- compliance
- legal
- privacy
- instruction-tuning
- sft
size_categories:
- 1K<n<10K
pretty_name: GDPR SFT Combined (2,277 pairs)
---

# GDPR SFT Combined Dataset (2,277 instructions)

Supervised fine-tuning (SFT) dataset for GDPR (General Data Protection Regulation) compliance Q&A in English. Produced during the `gdpr-gemma2` Phase 5 experiment.

## Composition

| Source | Count | Method |
|---|---|---|
| `sims2k/GDPR_QA_instruct_dataset` | 316 | Original expert-authored pairs |
| Upstage Solar-Pro synthetic | 1,961 | LLM-generated, deduplicated (from 2,812 raw → 1,961 unique) |
| **Total** | **2,277** | |

The synthetic portion covers **101 GDPR topics** across 9 categories:

- Chapter I–IX (core articles, 46 topics)
- SCENARIO (business scenarios, 10 topics)
- SECTOR (healthcare / fintech / HR, 12 topics)
- ADVANCED (sub-processor / BCR / one-stop-shop, 8 topics)
- SPECIAL_DATA (biometric / genetic / children, 6 topics)
- WORKFLOW (DPIA / ROPA / breach, 7 topics)
- ENFORCEMENT (fines / case law, 5 topics)
- CROSS_REG (ePrivacy / AI Act / NIS2, 4 topics)
- INTL (EU-US DPF / UK GDPR / LGPD, 3 topics)

Three question types were rotated during generation: `article_focused`, `scenario_based`, `comparative`.

## Schema

```json
{
  "instruction": "How does the GDPR distinguish ...",
  "input": "Differentiate between ...",
  "output": "The GDPR distinguishes between ..."
}
```

## Quick load

```python
from datasets import load_dataset

# Load the single "train" split and inspect the first record.
ds = load_dataset("cycloevan/gdpr-sft-2277-combined", split="train")
print(ds[0])
```

## Known limitations

- **Synthetic duplicate rate**: ~30% of raw Solar-Pro generations overlapped in instruction prefix; they were deduplicated at the instruction level (longest output kept).
- **English only**.
- **Not a retrieval substitute**: article citations inside outputs are model-generated and should be verified against the official GDPR text.
- **Empirically does not lift a 2B SFT above Base**: in the source project, training `gemma-2-2b-it` on this data produced SFT v2 ≈ Base on all four GPT-4o qualitative criteria (see Phase 5 findings).

## Intended use

- SFT / instruction-tuning for GDPR compliance assistants.
- As a starting point for larger base models (9B+), where data quantity may translate to a measurable quality lift.
- For studying synthetic-data duplication behaviour under rotated topic prompts at temperature 0.7.

## Citation

```bibtex
@misc{gdpr_sft_combined_2026,
  title        = {GDPR SFT Combined: 2,277 GDPR Instruct Pairs (Original + Upstage Solar Synthetic)},
  author       = {seok-hee97},
  year         = {2026},
  howpublished = {Hugging Face Datasets},
  url          = {https://huggingface.co/datasets/cycloevan/gdpr-sft-2277-combined}
}
```

## License

Apache-2.0. Downstream users are responsible for ensuring compliance with the license terms of the seed dataset (`sims2k/GDPR_QA_instruct_dataset`) and of the Upstage API outputs used in the synthetic portion.
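
## Deduplication sketch

The instruction-level deduplication noted under Known limitations (one record per instruction, longest output kept) could look roughly like the sketch below. This is an illustration only, not the project's actual script: the `normalize` key (lowercasing and whitespace collapsing) and the function names are assumptions, and the card does not say how an instruction prefix was matched.

```python
from typing import Dict, Iterable, List


def normalize(instruction: str) -> str:
    """Hypothetical dedup key: lowercase and collapse runs of whitespace."""
    return " ".join(instruction.lower().split())


def dedupe_keep_longest(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one record per normalized instruction, preferring the longest output."""
    best: Dict[str, Dict[str, str]] = {}
    for rec in records:
        key = normalize(rec["instruction"])
        kept = best.get(key)
        # Replace the stored record only if this one has a strictly longer output.
        if kept is None or len(rec["output"]) > len(kept["output"]):
            best[key] = rec
    return list(best.values())


# Toy records in the dataset's instruction/input/output schema (made-up content).
raw = [
    {"instruction": "What is a DPIA?", "input": "", "output": "Short answer."},
    {"instruction": "what is a  DPIA?", "input": "", "output": "A longer, more detailed answer."},
    {"instruction": "Who appoints a DPO?", "input": "", "output": "The controller or processor."},
]
unique = dedupe_keep_longest(raw)
print(len(unique))  # 2
```

Keying on the full normalized instruction is the simplest choice; matching on a fixed-length prefix instead would merge more near-duplicates at the cost of occasionally collapsing distinct questions.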
Provider:
cycloevan