oliverkinch/eur-lex-sum
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/oliverkinch/eur-lex-sum
Resource description:
---
language:
- da
- en
license: cc-by-4.0
task_categories:
- summarization
pretty_name: EUR-Lex-Sum
size_categories:
- 1K<n<10K
tags:
- legal
- eur-lex
- legislation
- eu
configs:
- config_name: all
  default: true
  data_files:
  - split: train
    path: data/all/train.jsonl
- config_name: da
  data_files:
  - split: train
    path: data/da/train.jsonl
- config_name: en
  data_files:
  - split: train
    path: data/en/train.jsonl
---
# EUR-Lex-Sum
A dataset of EU legislation paired with legislative summaries from EUR-Lex, covering Danish and English.
Built from the [EU Publications Office CELLAR](https://publications.europa.eu/) repository using SPARQL-based discovery and XHTML content extraction.
## Dataset Description
Each record pairs a full EU legislative document with its official plain-language summary from the [EUR-Lex Summaries of EU Legislation](https://eur-lex.europa.eu/browse/summaries.html) collection.
### Configs
| Config | Records | Description |
|--------|---------|-------------|
| `all` (default) | 1,605 | Bilingual DA+EN — every record has all four text fields |
| `da` | 1,632 | Danish only — `celex_id`, `da_document`, `da_summary` |
| `en` | 1,619 | English only — `celex_id`, `en_document`, `en_summary` |
The `all` config is the intersection of `da` and `en`: it contains the 1,605 records that are present in both single-language configs.
### Usage
```python
from datasets import load_dataset
# Load default (bilingual) config
ds = load_dataset("oliverkinch/eur-lex-sum")
# Load Danish only
ds_da = load_dataset("oliverkinch/eur-lex-sum", "da")
# Load English only
ds_en = load_dataset("oliverkinch/eur-lex-sum", "en")
```
### Fields
**`all` config:**
| Field | Type | Description |
|-------|------|-------------|
| `celex_id` | `string` | CELEX identifier for the legislation |
| `da_document` | `string` | Full legislative text in Danish |
| `da_summary` | `string` | Plain-language summary in Danish |
| `en_document` | `string` | Full legislative text in English |
| `en_summary` | `string` | Plain-language summary in English |
**`da` config:** `celex_id`, `da_document`, `da_summary`
**`en` config:** `celex_id`, `en_document`, `en_summary`
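
Because the field names share a language prefix, one small helper can turn a record from any config into a (document, summary) pair for summarization. The sketch below is illustrative only: `to_pair` is a hypothetical helper, not part of the dataset or of the `datasets` library, and the toy record holds placeholder strings rather than real data.

```python
from typing import Dict, Tuple


def to_pair(record: Dict[str, str], lang: str) -> Tuple[str, str]:
    """Return (document, summary) for one language of a record.

    `lang` must be "da" or "en"; the record must come from a config
    that contains the matching `<lang>_document` / `<lang>_summary` fields.
    """
    if lang not in ("da", "en"):
        raise ValueError(f"unsupported language: {lang!r}")
    return record[f"{lang}_document"], record[f"{lang}_summary"]


# Toy record following the `all` schema (placeholder values, not real data).
example = {
    "celex_id": "EXAMPLE-ID",
    "da_document": "dansk lovtekst",
    "da_summary": "dansk resumé",
    "en_document": "english legal text",
    "en_summary": "english summary",
}

source, target = to_pair(example, "en")
```

With a loaded split, the same helper could be applied record by record, for example inside a `map` call.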
### Statistics
| | DA | EN |
|--|----|----|
| Records | 1,632 | 1,619 |
| Mean document tokens | 22,706 | 24,758 |
| Median document tokens | 8,304 | 9,187 |
| Mean summary tokens | 910 | 986 |
| Median compression ratio | 10.1x | 10.4x |
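
The compression ratio is the document length divided by the summary length. The card does not say which tokenizer produced the numbers in the table, so the sketch below assumes whitespace tokens, the same unit the short-document filter uses. The function names are hypothetical.

```python
from statistics import median
from typing import Dict, Iterable


def whitespace_tokens(text: str) -> int:
    """Count tokens by splitting on runs of whitespace (an assumption)."""
    return len(text.split())


def compression_ratio(document: str, summary: str) -> float:
    """Document length over summary length, in whitespace tokens."""
    return whitespace_tokens(document) / whitespace_tokens(summary)


def median_compression_ratio(records: Iterable[Dict[str, str]], lang: str) -> float:
    """Median ratio over records that carry `<lang>_document` and `<lang>_summary`."""
    return median(
        compression_ratio(r[f"{lang}_document"], r[f"{lang}_summary"])
        for r in records
    )
```

Empty summaries would cause a division by zero here; in the published configs they should not occur, since the availability filter removes records with missing content.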
## Dataset Construction
### Source
Documents and summaries were obtained from the EU Publications Office [CELLAR](https://publications.europa.eu/) repository via:
1. **SPARQL discovery** — querying the CELLAR endpoint to find legislation with linked legislative summaries available as XHTML in Danish (and optionally English).
2. **Content extraction** — fetching XHTML manifestations and extracting body text.
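
The extraction step can be approximated with the standard library. This is a minimal sketch, not the pipeline's actual code: it assumes that "body text" means all text nodes inside `<body>`, excluding `<script>` and `<style>` content. The SPARQL query and the HTTP fetch are left out because the card does not specify them. `BodyTextExtractor` and `extract_body_text` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class BodyTextExtractor(HTMLParser):
    """Collect text nodes that appear inside <body>, skipping script/style."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._in_body = False
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._in_body and not self._skip_depth and text:
            self.chunks.append(text)


def extract_body_text(xhtml: str) -> str:
    """Return the body text of an XHTML string, one text node per line."""
    parser = BodyTextExtractor()
    parser.feed(xhtml)
    parser.close()
    return "\n".join(parser.chunks)
```

A strict XML parser would also work on well-formed XHTML. `html.parser` is used here because it tolerates malformed markup.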
### Filtering
The following filters were applied per language:
1. **Availability** — remove records where document or summary content is missing (HTTP 404 from CELLAR).
2. **PDF scan artifacts** — remove records containing `[NEW PAGE]` markers (residual from scanned PDFs).
3. **Deduplication** — when multiple CELEX IDs share an identical summary, keep the record with the longest document.
4. **Short document removal** — remove records where the document is shorter than or equal to the summary (by whitespace token count).
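
The four filters above can be sketched as a single pass over the records of one language. Several details are assumptions: the generic field names `document` and `summary`, checking both fields for the `[NEW PAGE]` marker, and measuring "longest document" in whitespace tokens. The card specifies whitespace tokens only for the short-document filter.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]  # keys: celex_id, document, summary


def n_tokens(text: str) -> int:
    """Whitespace token count."""
    return len(text.split())


def apply_filters(records: List[Record]) -> List[Record]:
    # 1. Availability: drop records with missing document or summary.
    kept = [r for r in records if r.get("document") and r.get("summary")]

    # 2. PDF scan artifacts: drop records containing the [NEW PAGE] marker.
    marker = "[NEW PAGE]"
    kept = [
        r for r in kept
        if marker not in r["document"] and marker not in r["summary"]
    ]

    # 3. Deduplication: for identical summaries, keep the longest document.
    best: Dict[str, Record] = {}
    for r in kept:
        current = best.get(r["summary"])
        if current is None or n_tokens(r["document"]) > n_tokens(current["document"]):
            best[r["summary"]] = r
    kept = list(best.values())

    # 4. Short documents: keep only documents strictly longer than the summary.
    return [r for r in kept if n_tokens(r["document"]) > n_tokens(r["summary"])]
```

The filters run in the listed order, which matches the step order in the pipeline table.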
### Text Cleaning
- Non-breaking spaces (`\xa0`) normalized to regular spaces.
- Consecutive newlines collapsed into paragraph boundaries.
- Blank lines and `.xml` identifier lines removed.
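
A minimal sketch of these cleaning rules follows. The card does not show what a `.xml` identifier line looks like, so the code assumes it is a line consisting of a single whitespace-free token ending in `.xml`. `clean_text` is a hypothetical name.

```python
import re

# Assumed shape of an identifier line: one token with no spaces, ending in ".xml".
_XML_ID_LINE = re.compile(r"\S+\.xml")


def clean_text(text: str) -> str:
    # Normalize non-breaking spaces to regular spaces.
    text = text.replace("\xa0", " ")

    lines = []
    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped:
            continue  # drop blank lines, which also collapses repeated newlines
        if _XML_ID_LINE.fullmatch(stripped):
            continue  # drop `.xml` identifier lines
        lines.append(stripped)

    # Each surviving line is separated by exactly one newline.
    return "\n".join(lines)
```

Stripping each line's leading and trailing whitespace is an extra step not stated in the card; remove it if indentation matters downstream.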
### Pipeline
Starting from 2,838 discovered CELEX IDs:
| Step | DA | EN |
|------|----|----|
| After availability filter | 2,536 | 2,521 |
| After scan filter | 2,536 | 2,521 |
| After deduplication | 1,701 | 1,684 |
| After short-document filter | 1,632 | 1,619 |
| Bilingual intersection | 1,605 | 1,605 |
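
The last step joins the two filtered language sets on `celex_id`. The sketch below assumes each language holds at most one record per CELEX ID and uses the field names from the `da` and `en` configs. `bilingual_intersection` is a hypothetical name.

```python
from typing import Dict, List


def bilingual_intersection(
    da_records: List[Dict[str, str]],
    en_records: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Merge records whose celex_id appears in both language sets."""
    en_by_id = {r["celex_id"]: r for r in en_records}
    merged = []
    for da in da_records:
        en = en_by_id.get(da["celex_id"])
        if en is None:
            continue  # no English counterpart survived filtering
        merged.append({
            "celex_id": da["celex_id"],
            "da_document": da["da_document"],
            "da_summary": da["da_summary"],
            "en_document": en["en_document"],
            "en_summary": en["en_summary"],
        })
    return merged
```

By the counts in the table, 27 Danish records (1,632 - 1,605) and 14 English records (1,619 - 1,605) have no counterpart in the other language, so they appear only in the single-language configs.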
## Related Datasets
- [dennlinger/eur-lex-sum](https://huggingface.co/datasets/dennlinger/eur-lex-sum) — the original EUR-Lex-Sum dataset covering 24 EU languages. This dataset uses a similar methodology but sources data from CELLAR (the original EUR-Lex scraping endpoint is now behind WAF bot protection).
## License
The legislative documents and summaries are sourced from EUR-Lex. EU legal documents are available under the [reuse policy of the European Commission](https://commission.europa.eu/legal-notice_en) (CC BY 4.0 compatible).
Provider:
oliverkinch



