oliverkinch/eur-lex-sum

Name: oliverkinch/eur-lex-sum
Creator: oliverkinch
Published: 2026-04-01 11:45:11
License: cc-by-4.0

Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/oliverkinch/eur-lex-sum


Description:

---
language:
- da
- en
license: cc-by-4.0
task_categories:
- summarization
pretty_name: EUR-Lex-Sum
size_categories:
- 1K<n<10K
tags:
- legal
- eur-lex
- legislation
- eu
configs:
- config_name: all
  default: true
  data_files:
  - split: train
    path: data/all/train.jsonl
- config_name: da
  data_files:
  - split: train
    path: data/da/train.jsonl
- config_name: en
  data_files:
  - split: train
    path: data/en/train.jsonl
---

# EUR-Lex-Sum

A dataset of EU legislation paired with legislative summaries from EUR-Lex, covering Danish and English. Built from the [EU Publications Office CELLAR](https://publications.europa.eu/) repository using SPARQL-based discovery and XHTML content extraction.

## Dataset Description

Each record pairs a full EU legislative document with its official plain-language summary from the [EUR-Lex Summaries of EU Legislation](https://eur-lex.europa.eu/browse/summaries.html) collection.

### Configs

| Config | Records | Description |
|--------|---------|-------------|
| `all` (default) | 1,605 | Bilingual DA+EN: every record has all four text fields |
| `da` | 1,632 | Danish only: `celex_id`, `da_document`, `da_summary` |
| `en` | 1,619 | English only: `celex_id`, `en_document`, `en_summary` |

The `all` config is the intersection of `da` and `en`.
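To make the relationship between the three configs concrete, here is a minimal sketch of how a bilingual intersection could be computed from the two monolingual configs. It assumes the join key is `celex_id`, which the card does not state explicitly, and the function name `bilingual_intersection` is hypothetical rather than part of the dataset's tooling.

```python
def bilingual_intersection(da_records, en_records):
    """Join DA and EN records on celex_id, keeping only IDs present in both.

    Assumption: celex_id is the join key (the card only says that `all`
    is the intersection of `da` and `en`).
    """
    # Index the English records by CELEX ID for constant-time lookup.
    en_by_id = {rec["celex_id"]: rec for rec in en_records}

    merged = []
    for da in da_records:
        en = en_by_id.get(da["celex_id"])
        if en is None:
            continue  # no English counterpart, so not part of the intersection
        merged.append(
            {
                "celex_id": da["celex_id"],
                "da_document": da["da_document"],
                "da_summary": da["da_summary"],
                "en_document": en["en_document"],
                "en_summary": en["en_summary"],
            }
        )
    return merged
```

Any iterable of dict-like rows should work as input, including the `train` splits of the `da` and `en` configs. If the join-key assumption holds, the result should have the same row count as the `all` config.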
### Usage

```python
from datasets import load_dataset

# Load default (bilingual) config
ds = load_dataset("oliverkinch/eur-lex-sum")

# Load Danish only
ds_da = load_dataset("oliverkinch/eur-lex-sum", "da")

# Load English only
ds_en = load_dataset("oliverkinch/eur-lex-sum", "en")
```

### Fields

**`all` config:**

| Field | Type | Description |
|-------|------|-------------|
| `celex_id` | `string` | CELEX identifier for the legislation |
| `da_document` | `string` | Full legislative text in Danish |
| `da_summary` | `string` | Plain-language summary in Danish |
| `en_document` | `string` | Full legislative text in English |
| `en_summary` | `string` | Plain-language summary in English |

**`da` config:** `celex_id`, `da_document`, `da_summary`

**`en` config:** `celex_id`, `en_document`, `en_summary`

### Statistics

| | DA | EN |
|--|----|----|
| Records | 1,632 | 1,619 |
| Mean document tokens | 22,706 | 24,758 |
| Median document tokens | 8,304 | 9,187 |
| Mean summary tokens | 910 | 986 |
| Median compression ratio | 10.1x | 10.4x |

## Dataset Construction

### Source

Documents and summaries were obtained from the EU Publications Office [CELLAR](https://publications.europa.eu/) repository via:

1. **SPARQL discovery**: querying the CELLAR endpoint to find legislation with linked legislative summaries available as XHTML in Danish (and optionally English).
2. **Content extraction**: fetching XHTML manifestations and extracting body text.

### Filtering

The following filters were applied per language:

1. **Availability**: remove records where document or summary content is missing (HTTP 404 from CELLAR).
2. **PDF scan artifacts**: remove records containing `[NEW PAGE]` markers (residual from scanned PDFs).
3. **Deduplication**: when multiple CELEX IDs share an identical summary, keep the record with the longest document.
4. **Short document removal**: remove records where the document is shorter than or equal to the summary (by whitespace token count).
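The four filters above can be sketched as a single pass over one language's records. This is an illustrative reconstruction, not the author's actual pipeline. The key names `document` and `summary` are hypothetical. Missing content is assumed to arrive as `None` or an empty string. The `[NEW PAGE]` marker is checked in both fields. "Longest document" is assumed to mean the most whitespace tokens, since the card gives no unit for the deduplication step.

```python
def n_tokens(text):
    """Whitespace token count, the length measure named in the filtering rules."""
    return len(text.split())


def apply_filters(records):
    """Apply the four per-language filters to a list of record dicts.

    Each record is assumed to have the hypothetical keys
    'celex_id', 'document' and 'summary'.
    """
    # 1. Availability: drop records whose document or summary is missing.
    kept = [r for r in records if r.get("document") and r.get("summary")]

    # 2. PDF scan artifacts: drop records containing the [NEW PAGE] marker.
    marker = "[NEW PAGE]"
    kept = [
        r for r in kept
        if marker not in r["document"] and marker not in r["summary"]
    ]

    # 3. Deduplication: among records sharing an identical summary,
    #    keep the one with the longest document (first one wins on ties).
    best_by_summary = {}
    for r in kept:
        current = best_by_summary.get(r["summary"])
        if current is None or n_tokens(r["document"]) > n_tokens(current["document"]):
            best_by_summary[r["summary"]] = r
    kept = list(best_by_summary.values())

    # 4. Short document removal: the document must be strictly longer
    #    than its summary in whitespace tokens.
    return [r for r in kept if n_tokens(r["document"]) > n_tokens(r["summary"])]
```

Running the steps in this order mirrors the numbered list. Deduplicating before the short-document check means a duplicate group is represented by its longest document when that check runs.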
### Text Cleaning

- Non-breaking spaces (`\xa0`) normalized to regular spaces.
- Consecutive newlines collapsed into paragraph boundaries.
- Blank lines and `.xml` identifier lines removed.

### Pipeline

Starting from 2,838 discovered CELEX IDs:

| Step | DA | EN |
|------|----|----|
| After availability filter | 2,536 | 2,521 |
| After scan filter | 2,536 | 2,521 |
| After deduplication | 1,701 | 1,684 |
| After short-document filter | 1,632 | 1,619 |
| Bilingual intersection | 1,605 | 1,605 |

## Related Datasets

- [dennlinger/eur-lex-sum](https://huggingface.co/datasets/dennlinger/eur-lex-sum): the original EUR-Lex-Sum dataset covering 24 EU languages. This dataset uses a similar methodology but sources data from CELLAR (the original EUR-Lex scraping endpoint is now behind WAF bot protection).

## License

The legislative documents and summaries are sourced from EUR-Lex. EU legal documents are available under the [reuse policy of the European Commission](https://commission.europa.eu/legal-notice_en) (CC BY 4.0 compatible).

