danielnoumon/eu-regulations-nl-queries

Name: danielnoumon/eu-regulations-nl-queries
Creator: danielnoumon
Published: 2026-03-25 21:06:13
License: cc-by-4.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated: 2026-03-25. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/danielnoumon/eu-regulations-nl-queries


Description:

---
language:
- nl
license: cc-by-4.0
task_categories:
- sentence-similarity
- feature-extraction
tags:
- sentence-transformers
- embedding
- retrieval
- legal
- dutch
- eu-ai-act
- gdpr
- uavg
- rag
- synthetic
size_categories:
- 1K<n<10K
---

# Dutch EU Regulations - Synthetic Query-Chunk Pairs

## Dataset Description

### Dataset Summary

This dataset contains **5,732** synthetic Dutch query-chunk pairs derived from the Dutch-language texts of three core EU and Dutch regulations:

- **EU AI Act** (Verordening Artificiële Intelligentie): 3,210 pairs from 574 chunks
- **AVG/GDPR** (Algemene Verordening Gegevensbescherming): 2,262 pairs from 379 chunks
- **UAVG** (Uitvoeringswet Algemene verordening gegevensbescherming): 342 pairs from 57 chunks

The per-document counts sum to 5,814, which equals the 5,732 total plus the 82 removed duplicates, so they appear to be pre-deduplication figures.

Each pair consists of a realistic user query and the regulation text chunk that answers it. The dataset is designed for fine-tuning embedding models for semantic search and retrieval-augmented generation (RAG) in the Dutch legal/regulatory domain.

| Statistic | Value |
|-----------|-------|
| Total pairs | 5,732 |
| Source documents | 3 |
| Total chunks | 1,010 |
| Query types | 5 (factual, definition, procedural, scenario, keyword) |
| Duplicates removed | 82 |

### Supported Tasks

- **Embedding fine-tuning**: Train or fine-tune sentence/document embedding models using Multiple Negatives Ranking Loss (MNRL)
- **Semantic search**: Build search systems for Dutch legal/regulatory documents
- **Retrieval-augmented generation (RAG)**: Create question-answering systems for EU regulations
- **Information retrieval evaluation**: Benchmark embedding models on Dutch legal text

### Languages

Dutch (nl)

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `question_id` | int | Unique identifier per query pair |
| `query` | string | Synthetic Dutch search query |
| `chunk` | string | Relevant regulation text chunk |
| `document_name` | string | Source document: "EU AI Act (NL)", "AVG/GDPR (NL)", or "UAVG (NL)" |
| `chunk_id` | int | Unique chunk ID within the source document |
| `section_type` | string | Legal structure type (overweging, artikel, bijlage, preambule) |
| `hierarchy_path` | string | Full structural path, e.g. "Hoofdstuk 2 > Artikel 6 > Lid 3" |

Note: `chunk_id` is unique only within a document, so `(document_name, chunk_id)` together form the globally unique chunk identifier.

### Data Splits

This dataset does not include predefined splits. Users should create their own train/validation/test splits.

**Recommended**: Split at the chunk level (not the pair level) to prevent data leakage, because multiple queries reference the same chunk.

### Examples

**EU AI Act (scenario query):**

```json
{
  "question_id": 42,
  "query": "Een gezondheidsorganisatie wil AI gebruiken voor patiëntgegevens, welke voorbeelden van hoog-risico zijn opgenomen?",
  "chunk": "5.\nDe Commissie verstrekt na raadpleging van de Europese raad voor artificiële intelligentie...",
  "document_name": "EU AI Act (NL)",
  "chunk_id": 254,
  "section_type": "artikel",
  "hierarchy_path": "HOOFDSTUK III — AI-SYSTEMEN MET EEN HOOG RISICO > Artikel 6 > Lid 5"
}
```

**AVG/GDPR (keyword query):**

```json
{
  "question_id": 3200,
  "query": "gegevensbeschermingseffectbeoordeling inhoud",
  "chunk": "7.\nDe beoordeling bevat ten minste:\na) een systematische beschrijving van de beoogde verwerkingen...",
  "document_name": "AVG/GDPR (NL)",
  "chunk_id": 253,
  "section_type": "artikel",
  "hierarchy_path": "Afdeling 3 — Gegevensbeschermingseffectbeoordeling > Artikel 35 > Lid 7"
}
```

**UAVG (procedural query):**

```json
{
  "question_id": 5500,
  "query": "Hoe legt de Autoriteit persoonsgegevens een bestuurlijke boete op?",
  "chunk": "1.\tDe Autoriteit persoonsgegevens kan een bestuurlijke boete opleggen...",
  "document_name": "UAVG (NL)",
  "chunk_id": 40,
  "section_type": "artikel",
  "hierarchy_path": "Hoofdstuk 5 — Handhaving > Artikel 14 — Bestuurlijke boetes"
}
```

## Dataset Creation

### Source Data

The sources are three official texts: the Dutch-language versions of two EU regulations and one Dutch national law.

1. **EU AI Act**: Regulation on Artificial Intelligence, establishing rules for AI systems in the EU
2. **AVG/GDPR**: General Data Protection Regulation, the EU's data protection framework
3. **UAVG**: Uitvoeringswet AVG, the Dutch national implementation law for the GDPR

The EU AI Act and GDPR are publicly available EU legal texts (PDF). The UAVG is a Dutch national law available from wetten.overheid.nl (plain text).

### Text Preprocessing

Documents were processed using a semantic hierarchical chunking strategy that preserves legal structure:

1. **Text extraction**: PDF extraction using PyMuPDF (EU AI Act, GDPR); plain text loading (UAVG)
2. **Structure parsing**: Documents parsed into recitals, chapters, articles, and annexes using format-specific regex patterns
3. **Semantic chunking**: Text split respecting legal boundaries (max ~1000 tokens, min 50 tokens)
4. **Deduplication**: Footnote references that match recital numbering patterns were deduplicated
5. **Metadata tagging**: Each chunk tagged with section type, hierarchy path, and unique ID

| Document | Chunks |
|----------|--------|
| EU AI Act | 574 |
| AVG/GDPR | 379 |
| UAVG | 57 |

### Query Generation

Queries were generated using **Qwen3-30B-A3B** (open-source MoE model, 30B total / 3B active parameters) via an OpenAI-compatible endpoint.

**Generation parameters:**

- Queries per chunk: 6
- Temperature: 0.7
- Post-processing: 82 exact-duplicate queries removed

**Five query types were enforced per chunk:**

| Type | Description | Example |
|------|-------------|---------|
| Factual | Direct knowledge questions | "Welke AI-systemen zijn verboden?" |
| Definition | Concept explanations | "Wat wordt bedoeld met hoog-risico AI?" |
| Procedural | How-to questions | "Hoe voer ik een DPIA uit?" |
| Scenario | Practical situations | "Een bedrijf wil gezichtsherkenning inzetten, welke regels gelden?" |
| Keyword | Search-style queries | "verboden AI-systemen" |

### Quality Control

- All queries verified to be in Dutch (zero English or mixed-language)
- Query-chunk relevance spot-checked across all three documents
- Article number references avoided in queries
- Length variation enforced: keyword queries (20-50 chars), medium questions (60-100 chars), long queries (100-150+ chars)
- Exact-duplicate queries removed during post-processing

## Considerations for Using the Data

### Intended Use

- Fine-tuning embedding models for Dutch legal/regulatory text retrieval
- Building semantic search and RAG systems for EU regulations
- Benchmarking retrieval models on Dutch legal text

### Limitations

1. **Synthetic queries**: LLM-generated, so they may not fully represent real user information needs
2. **Three documents only**: Limited to the EU AI Act, GDPR, and UAVG, so results may not generalize to other legal domains
3. **No hard negatives**: Contains only positive pairs; hard negatives must be mined separately
4. **Temporal scope**: Based on specific versions of the regulations
5. **UAVG underrepresented**: Only 57 chunks and 342 pairs, versus 574 chunks for the EU AI Act and 379 for the GDPR

### Evaluation Recommendations

- Use chunk-level splits to avoid data leakage
- Evaluate on IR metrics: NDCG@k, MRR@k, Recall@k
- Benchmark against baselines (e.g., multilingual-e5-large, text-embedding-3-large)

## Additional Information

### Licensing

Released under CC BY 4.0. Source documents are official EU regulations in the public domain under EU law, and Dutch national legislation in the public domain.

### Citation

```bibtex
@dataset{noumon2026euregulations,
  title={Dutch EU Regulations - Synthetic Query-Chunk Pairs},
  author={Noumon, Daniel},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/danielnoumon/eu-regulations-nl-queries}}
}
```

### Contact

For questions, issues, or feedback, please open an issue on the dataset repository.

### Acknowledgments

- **Source documents**: European Union (EU AI Act and AVG/GDPR, Dutch translations), Dutch Government (UAVG)
- **Query generation**: Qwen3-30B-A3B (open-source)
- **Chunking and processing**: Custom semantic hierarchical chunking pipeline
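
The chunk-level split recommended under Data Splits can be sketched as follows. This is a minimal illustration using only the standard library, not the author's own tooling: the function name `chunk_level_split`, the split fractions, and the toy rows are assumptions made for the example. Only the field names `document_name` and `chunk_id` come from the card.

```python
import random
from collections import defaultdict


def chunk_level_split(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Split query-chunk pairs so that no chunk lands in more than one split."""
    # Group pairs by the globally unique chunk key (document_name, chunk_id).
    groups = defaultdict(list)
    for row in rows:
        groups[(row["document_name"], row["chunk_id"])].append(row)

    # Sort before shuffling so the result is reproducible for a given seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = int(len(keys) * test_frac)
    n_val = int(len(keys) * val_frac)
    assignment = {
        "test": keys[:n_test],
        "validation": keys[n_test:n_test + n_val],
        "train": keys[n_test + n_val:],
    }
    # Expand each split's chunk keys back into the pairs that belong to them.
    return {
        name: [row for key in split_keys for row in groups[key]]
        for name, split_keys in assignment.items()
    }


# Hypothetical toy rows standing in for the real data: 2 documents x 10 chunks x 6 queries.
rows = []
for document_name in ("EU AI Act (NL)", "UAVG (NL)"):
    for chunk_id in range(10):
        for _ in range(6):
            rows.append({
                "question_id": len(rows),
                "query": f"voorbeeldvraag {len(rows)}",
                "document_name": document_name,
                "chunk_id": chunk_id,
            })

splits = chunk_level_split(rows)
print({name: len(part) for name, part in splits.items()})
```

Grouping on the pair `(document_name, chunk_id)` rather than on `chunk_id` alone matters here, because chunk IDs repeat across documents.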
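
The metrics named under Evaluation Recommendations simplify when each query has exactly one relevant chunk, as is the case for the pairs in this dataset. The sketch below assumes that setting; the function names and the toy ranking lists are hypothetical. With a single relevant item, the ideal DCG is 1, so NDCG@k reduces to `1 / log2(rank + 1)` for a hit at position `rank`.

```python
import math


def rank_of(relevant, ranked):
    """Return the 1-based rank of the relevant chunk key, or None if it is absent."""
    try:
        return ranked.index(relevant) + 1
    except ValueError:
        return None


def evaluate(results, k=10):
    """Average Recall@k, MRR@k and NDCG@k over (relevant_key, ranked_keys) tuples."""
    recall = mrr = ndcg = 0.0
    for relevant, ranked in results:
        rank = rank_of(relevant, ranked[:k])
        if rank is not None:
            recall += 1.0                      # the single relevant chunk was retrieved
            mrr += 1.0 / rank                  # reciprocal rank of the hit
            ndcg += 1.0 / math.log2(rank + 1)  # DCG of one hit; the ideal DCG is 1
    n = len(results)
    return {"recall": recall / n, "mrr": mrr / n, "ndcg": ndcg / n}


# Hypothetical retrieval output for three queries, using (document_name, chunk_id) keys.
A, B, C, D = ("UAVG (NL)", 1), ("UAVG (NL)", 2), ("UAVG (NL)", 3), ("UAVG (NL)", 4)
results = [
    (A, [A, B, C]),  # hit at rank 1
    (B, [A, B, C]),  # hit at rank 2
    (D, [A, B, C]),  # miss within the top 3
]
metrics = evaluate(results, k=3)
print(metrics)
```

If an evaluation set is later extended with several relevant chunks per query, these shortcuts no longer hold and the general definitions of the metrics are needed.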
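
To make the Multiple Negatives Ranking Loss mentioned under Supported Tasks concrete, here is a plain-Python sketch of the objective: for each query, the paired chunk is the positive and every other chunk in the batch serves as a negative, scored with a softmax cross-entropy over scaled cosine similarities. The scale of 20 and the use of cosine similarity are assumptions chosen for illustration, and the two-dimensional vectors are toy values rather than real embeddings. For actual training, a library implementation would be used instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(query_vecs, chunk_vecs, scale=20.0):
    """Mean cross-entropy where chunk i is the positive for query i
    and all other chunks in the batch act as negatives."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, chunk) for chunk in chunk_vecs]
        # Numerically stable log-sum-exp over the in-batch candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each query points in the same direction as its own chunk.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_chunks = [[1.0, 0.0], [0.0, 1.0]]
swapped_chunks = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = mnrl_loss(queries, aligned_chunks)
swapped_loss = mnrl_loss(queries, swapped_chunks)
print(aligned_loss, swapped_loss)
```

Because several queries in this dataset reference the same chunk, a batch that contains two such queries would treat a true positive as a negative under this objective. Sampling batches so that each chunk appears at most once is one way to avoid that.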

