danielnoumon/eu-regulations-nl-queries
Hugging Face | Updated 2026-03-25 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/danielnoumon/eu-regulations-nl-queries
---
language:
- nl
license: cc-by-4.0
task_categories:
- sentence-similarity
- feature-extraction
tags:
- sentence-transformers
- embedding
- retrieval
- legal
- dutch
- eu-ai-act
- gdpr
- uavg
- rag
- synthetic
size_categories:
- 1K<n<10K
---
# Dutch EU Regulations - Synthetic Query-Chunk Pairs
## Dataset Description
### Dataset Summary
This dataset contains **5,732** synthetic Dutch query-chunk pairs derived from the Dutch-language texts of three regulations, two EU-wide and one Dutch national:
- **EU AI Act** (Verordening Artificiële Intelligentie) — 3,210 pairs from 574 chunks
- **AVG/GDPR** (Algemene Verordening Gegevensbescherming) — 2,262 pairs from 379 chunks
- **UAVG** (Uitvoeringswet Algemene verordening gegevensbescherming) — 342 pairs from 57 chunks
The per-document counts sum to 5,814, which equals the 5,732 total plus the 82 removed duplicates, so they appear to be pre-deduplication figures.
Each pair consists of a realistic user query and the regulation text chunk that answers it. The dataset is designed for fine-tuning embedding models for semantic search and retrieval-augmented generation (RAG) in the Dutch legal/regulatory domain.
| Statistic | Value |
|-----------|-------|
| Total pairs | 5,732 |
| Source documents | 3 |
| Total chunks | 1,010 |
| Query types | 5 (factual, definition, procedural, scenario, keyword) |
| Duplicates removed | 82 |
### Supported Tasks
- **Embedding fine-tuning**: Train or fine-tune sentence/document embedding models using Multiple Negatives Ranking Loss (MNRL)
- **Semantic search**: Build search systems for Dutch legal/regulatory documents
- **Retrieval-augmented generation (RAG)**: Create question-answering systems for EU regulations
- **Information retrieval evaluation**: Benchmark embedding models on Dutch legal text
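As a rough illustration of what Multiple Negatives Ranking Loss computes, the sketch below implements it in plain Python over toy vectors: for each query the paired chunk is the positive, and every other chunk in the batch acts as a negative. The function names, the `scale` value of 20.0, and the toy vectors are assumptions for illustration only; real training would use a library implementation over model embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnrl_loss(query_vecs, chunk_vecs, scale=20.0):
    """Mean cross-entropy where chunk i is the positive for query i
    and all other chunks in the batch serve as negatives."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, c) for c in chunk_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-d "embeddings" (made up): aligned pairs should give a low loss,
# swapped pairs a high one.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = mnrl_loss(queries, aligned)
loss_swapped = mnrl_loss(queries, swapped)
```

Because several queries in this dataset point to the same chunk, a batch can contain the same chunk twice, in which case it would be counted as a negative for its sibling query. A batch sampler that avoids duplicate chunks is one way to sidestep this.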
### Languages
Dutch (nl)
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `question_id` | int | Unique identifier per query pair |
| `query` | string | Synthetic Dutch search query |
| `chunk` | string | Relevant regulation text chunk |
| `document_name` | string | Source document: "EU AI Act (NL)", "AVG/GDPR (NL)", or "UAVG (NL)" |
| `chunk_id` | int | Unique chunk ID within the source document |
| `section_type` | string | Legal structure type (overweging, artikel, bijlage, preambule) |
| `hierarchy_path` | string | Full structural path, e.g. "Hoofdstuk 2 > Artikel 6 > Lid 3" |
Note: `(document_name, chunk_id)` together form a globally unique chunk identifier.
### Data Splits
This dataset does not include predefined splits. Users should create their own train/validation/test splits.
**Recommended**: Split at the chunk level (not pair level) to prevent data leakage, as multiple queries reference the same chunk.
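A minimal sketch of such a chunk-level split is shown below. It groups rows by the `(document_name, chunk_id)` key and assigns whole groups to one split, so no chunk appears in more than one split. The function name `chunk_level_split`, the 80/10/10 proportions, the seed, and the toy rows are assumptions for illustration, not part of the dataset.

```python
import random
from collections import defaultdict

def chunk_level_split(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Assign whole chunks, keyed by (document_name, chunk_id), to one split."""
    by_chunk = defaultdict(list)
    for row in rows:
        by_chunk[(row["document_name"], row["chunk_id"])].append(row)

    keys = sorted(by_chunk)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    n_test = int(len(keys) * test_frac)
    n_val = int(len(keys) * val_frac)
    groups = {
        "test": keys[:n_test],
        "validation": keys[n_test:n_test + n_val],
        "train": keys[n_test + n_val:],
    }
    return {name: [r for k in ks for r in by_chunk[k]] for name, ks in groups.items()}

# Toy rows (made up): 2 documents x 10 chunks x 6 queries each.
rows = [
    {"question_id": d * 60 + c * 6 + q, "query": f"vraag {q}",
     "document_name": doc, "chunk_id": c}
    for d, doc in enumerate(["EU AI Act (NL)", "UAVG (NL)"])
    for c in range(10)
    for q in range(6)
]
splits = chunk_level_split(rows)
```

Note that the grouping key includes `document_name`, because `chunk_id` is only unique within a source document.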
### Examples
**EU AI Act — Scenario query:**
```json
{
"question_id": 42,
"query": "Een gezondheidsorganisatie wil AI gebruiken voor patiëntgegevens, welke voorbeelden van hoog-risico zijn opgenomen?",
"chunk": "5.\nDe Commissie verstrekt na raadpleging van de Europese raad voor artificiële intelligentie...",
"document_name": "EU AI Act (NL)",
"chunk_id": 254,
"section_type": "artikel",
"hierarchy_path": "HOOFDSTUK III — AI-SYSTEMEN MET EEN HOOG RISICO > Artikel 6 > Lid 5"
}
```
**AVG/GDPR — Keyword query:**
```json
{
"question_id": 3200,
"query": "gegevensbeschermingseffectbeoordeling inhoud",
"chunk": "7.\nDe beoordeling bevat ten minste:\na) een systematische beschrijving van de beoogde verwerkingen...",
"document_name": "AVG/GDPR (NL)",
"chunk_id": 253,
"section_type": "artikel",
"hierarchy_path": "Afdeling 3 — Gegevensbeschermingseffectbeoordeling > Artikel 35 > Lid 7"
}
```
**UAVG — Procedural query:**
```json
{
"question_id": 5500,
"query": "Hoe legt de Autoriteit persoonsgegevens een bestuurlijke boete op?",
"chunk": "1.\tDe Autoriteit persoonsgegevens kan een bestuurlijke boete opleggen...",
"document_name": "UAVG (NL)",
"chunk_id": 40,
"section_type": "artikel",
"hierarchy_path": "Hoofdstuk 5 — Handhaving > Artikel 14 — Bestuurlijke boetes"
}
```
## Dataset Creation
### Source Data
Three official texts, namely the Dutch translations of two EU regulations and one Dutch national law:
1. **EU AI Act** — Regulation on Artificial Intelligence, establishing rules for AI systems in the EU
2. **AVG/GDPR** — General Data Protection Regulation, the EU's data protection framework
3. **UAVG** — Uitvoeringswet AVG, the Dutch national implementation law for the GDPR
The EU AI Act and GDPR are publicly available EU legal texts (PDF). The UAVG is a Dutch national law available from wetten.overheid.nl (plain text).
### Text Preprocessing
Documents were processed using a semantic hierarchical chunking strategy that preserves legal structure:
1. **Text extraction**: PDF extraction using PyMuPDF (EU AI Act, GDPR); plain text loading (UAVG)
2. **Structure parsing**: Documents parsed into recitals, chapters, articles, and annexes using format-specific regex patterns
3. **Semantic chunking**: Text split respecting legal boundaries (max ~1000 tokens, min 50 tokens)
4. **Deduplication**: Footnote references that match recital numbering patterns were deduplicated
5. **Metadata tagging**: Each chunk tagged with section type, hierarchy path, and unique ID
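The structure-parsing and metadata-tagging steps can be pictured with a small sketch like the one below. It is not the pipeline used to build the dataset: the regex `ARTICLE_HEADING`, the function `split_articles`, and the sample text are all hypothetical, and the sketch handles only article headings, without the token limits or the recital and annex handling described above.

```python
import re

# Hypothetical heading pattern: a line that starts with "Artikel <number>".
ARTICLE_HEADING = re.compile(r"^Artikel (\d+[a-z]?)\b.*$", re.MULTILINE)

def split_articles(text):
    """Return one chunk per article, tagged with section type and path."""
    matches = list(ARTICLE_HEADING.finditer(text))
    chunks = []
    for idx, m in enumerate(matches):
        # An article runs until the next article heading or the end of the text.
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        chunks.append({
            "chunk_id": idx,
            "section_type": "artikel",
            "hierarchy_path": f"Artikel {m.group(1)}",
            "chunk": text[m.start():end].strip(),
        })
    return chunks

# Made-up placeholder text, not a quotation from any regulation.
sample = (
    "Artikel 1\nVoorbeeldtitel\n1. Voorbeeldtekst van het eerste lid.\n\n"
    "Artikel 2\nNog een titel\n1. Voorbeeldtekst van een ander lid.\n"
)
chunks = split_articles(sample)
```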
| Document | Chunks |
|----------|--------|
| EU AI Act | 574 |
| AVG/GDPR | 379 |
| UAVG | 57 |
### Query Generation
Queries were generated using **Qwen3-30B-A3B** (open-source MoE model, 30B total / 3B active parameters) via an OpenAI-compatible endpoint.
**Generation parameters:**
- Queries per chunk: 6
- Temperature: 0.7
- Post-processing: 82 exact-duplicate queries removed
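The post-processing step could look roughly like the sketch below. It assumes that "exact duplicate" means an identical `query` string and that the first occurrence is kept; whether the original pipeline also normalized case or whitespace is not stated, so none is applied here. The function name and toy pairs are illustrative.

```python
def drop_exact_duplicate_queries(pairs):
    """Keep the first pair for each distinct query string, in original order."""
    seen = set()
    kept = []
    for pair in pairs:
        if pair["query"] in seen:
            continue
        seen.add(pair["query"])
        kept.append(pair)
    return kept

# Toy pairs (made up): the second and third rows share a query string.
pairs = [
    {"question_id": 1, "query": "verboden AI-systemen", "chunk_id": 10},
    {"question_id": 2, "query": "Hoe voer ik een DPIA uit?", "chunk_id": 11},
    {"question_id": 3, "query": "Hoe voer ik een DPIA uit?", "chunk_id": 12},
]
deduplicated = drop_exact_duplicate_queries(pairs)
removed = len(pairs) - len(deduplicated)
```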
**Five query types were enforced per chunk:**
| Type | Description | Example |
|------|-------------|---------|
| Factual | Direct knowledge questions | "Welke AI-systemen zijn verboden?" |
| Definition | Concept explanations | "Wat wordt bedoeld met hoog-risico AI?" |
| Procedural | How-to questions | "Hoe voer ik een DPIA uit?" |
| Scenario | Practical situations | "Een bedrijf wil gezichtsherkenning inzetten, welke regels gelden?" |
| Keyword | Search-style queries | "verboden AI-systemen" |
### Quality Control
- All queries verified to be in Dutch (zero English or mixed-language)
- Query-chunk relevance spot-checked across all three documents
- Article number references avoided in queries
- Length variation enforced: keyword queries (20-50 chars), medium questions (60-100 chars), long queries (100-150+ chars)
- Exact-duplicate queries removed during post-processing
## Considerations for Using the Data
### Intended Use
- Fine-tuning embedding models for Dutch legal/regulatory text retrieval
- Building semantic search and RAG systems for EU regulations
- Benchmarking retrieval models on Dutch legal text
### Limitations
1. **Synthetic queries**: Queries are LLM-generated and may not fully represent real user information needs
2. **Three documents only**: Limited to EU AI Act, GDPR, and UAVG — may not generalize to all legal domains
3. **No hard negatives**: Contains only positive pairs; hard negatives must be mined separately
4. **Temporal scope**: Based on specific versions of the regulations
5. **UAVG underrepresented**: Only 57 chunks / 342 pairs, versus 574 chunks for the EU AI Act and 379 chunks for the AVG/GDPR
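For the hard-negatives limitation, one simple starting point is lexical mining: for each query, pick chunks that share many words with it but are not its positive. The sketch below uses Jaccard overlap on whitespace tokens as a stand-in; embedding-based or BM25-based mining is more common in practice. The function names and toy chunk texts are assumptions for illustration.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())

def mine_hard_negatives(query, positive_key, chunks, k=2):
    """Rank non-positive chunks by Jaccard overlap with the query, highest first."""
    q = tokenize(query)
    scored = []
    for key, text in chunks.items():
        if key == positive_key:
            continue  # never return the known positive as a negative
        c = tokenize(text)
        union = q | c
        score = len(q & c) / len(union) if union else 0.0
        scored.append((score, key))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [key for _, key in scored[:k]]

# Toy chunks (made-up text), keyed by (document_name, chunk_id).
chunks = {
    ("UAVG (NL)", 40): "de autoriteit kan een bestuurlijke boete opleggen",
    ("UAVG (NL)", 41): "de hoogte van de bestuurlijke boete wordt bepaald",
    ("EU AI Act (NL)", 254): "de commissie verstrekt richtsnoeren",
}
negatives = mine_hard_negatives(
    "bestuurlijke boete opleggen", ("UAVG (NL)", 40), chunks, k=1
)
```

Mined negatives are worth spot-checking, since a lexically similar chunk may also answer the query and would then be a false negative.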
### Evaluation Recommendations
- Use chunk-level splits to avoid data leakage
- Evaluate on IR metrics: NDCG@k, MRR@k, Recall@k
- Benchmark against baselines (e.g., multilingual-e5-large, text-embedding-3-large)
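Since each pair here has exactly one relevant chunk, the three metrics reduce to simple forms: Recall@k is 1 if the chunk is in the top k, MRR@k is the reciprocal of its rank, and NDCG@k is `1 / log2(rank + 1)` because the ideal DCG is 1. The sketch below computes them under that single-relevant assumption; the function names and the toy rankings are illustrative, and real runs would use `(document_name, chunk_id)` keys rather than bare integers.

```python
import math

def metrics_at_k(ranked_ids, relevant_id, k):
    """Recall@k, MRR@k and NDCG@k for a query with one relevant chunk."""
    top_k = ranked_ids[:k]
    if relevant_id not in top_k:
        return {"recall": 0.0, "mrr": 0.0, "ndcg": 0.0}
    rank = top_k.index(relevant_id) + 1  # 1-based rank of the relevant chunk
    return {"recall": 1.0, "mrr": 1.0 / rank, "ndcg": 1.0 / math.log2(rank + 1)}

def mean_metrics(runs, k):
    """Average the per-query metrics over (ranked_ids, relevant_id) pairs."""
    per_query = [metrics_at_k(ranked, rel, k) for ranked, rel in runs]
    return {
        name: sum(m[name] for m in per_query) / len(per_query)
        for name in ("recall", "mrr", "ndcg")
    }

# Toy retrieval results (made up): relevant chunk at rank 1, rank 2, and missing.
runs = [
    ([254, 12, 7], 254),
    ([3, 40, 9], 40),
    ([5, 6, 8], 99),
]
scores = mean_metrics(runs, k=3)
```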
## Additional Information
### Licensing
Released under CC BY 4.0. Source documents are official EU regulations in the public domain under EU law, and Dutch national legislation in the public domain.
### Citation
```bibtex
@dataset{noumon2026euregulations,
title={Dutch EU Regulations - Synthetic Query-Chunk Pairs},
author={Noumon, Daniel},
year={2026},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/datasets/danielnoumon/eu-regulations-nl-queries}}
}
```
### Contact
For questions, issues, or feedback, please open an issue on the dataset repository.
### Acknowledgments
- **Source documents**: European Union (EU AI Act + AVG/GDPR — Dutch translations), Dutch Government (UAVG)
- **Query generation**: Qwen3-30B-A3B (open-source)
- **Chunking and processing**: Custom semantic hierarchical chunking pipeline
Provider:
danielnoumon



