
oliverkinch/eur-lex-sum

Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/oliverkinch/eur-lex-sum
Resource overview:
---
language:
- da
- en
license: cc-by-4.0
task_categories:
- summarization
pretty_name: EUR-Lex-Sum
size_categories:
- 1K<n<10K
tags:
- legal
- eur-lex
- legislation
- eu
configs:
- config_name: all
  default: true
  data_files:
  - split: train
    path: data/all/train.jsonl
- config_name: da
  data_files:
  - split: train
    path: data/da/train.jsonl
- config_name: en
  data_files:
  - split: train
    path: data/en/train.jsonl
---

# EUR-Lex-Sum

A dataset of EU legislation paired with legislative summaries from EUR-Lex, covering Danish and English. Built from the [EU Publications Office CELLAR](https://publications.europa.eu/) repository using SPARQL-based discovery and XHTML content extraction.

## Dataset Description

Each record pairs a full EU legislative document with its official plain-language summary from the [EUR-Lex Summaries of EU Legislation](https://eur-lex.europa.eu/browse/summaries.html) collection.

### Configs

| Config | Records | Description |
|--------|---------|-------------|
| `all` (default) | 1,605 | Bilingual DA+EN: every record has all four text fields |
| `da` | 1,632 | Danish only: `celex_id`, `da_document`, `da_summary` |
| `en` | 1,619 | English only: `celex_id`, `en_document`, `en_summary` |

The `all` config is the intersection of `da` and `en`.

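As a rough illustration of what "intersection" means here, the sketch below joins Danish and English records on `celex_id` and keeps only the IDs present in both. It is not the author's build code: the helper name `bilingual_intersection` and the toy records are hypothetical, and it assumes the join key is `celex_id`, the one field the `da` and `en` configs share.

```python
from typing import Dict, Iterable, List


def bilingual_intersection(
    da_rows: Iterable[Dict[str, str]],
    en_rows: Iterable[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Join DA and EN records on celex_id, keeping only IDs present in both."""
    en_by_id = {row["celex_id"]: row for row in en_rows}
    merged = []
    for da in da_rows:
        en = en_by_id.get(da["celex_id"])
        if en is None:
            continue  # Danish-only record: not part of the bilingual set
        merged.append({**da, **en})
    return merged


# Toy records (hypothetical CELEX IDs and text, not taken from the dataset).
da_rows = [
    {"celex_id": "A1", "da_document": "dansk tekst", "da_summary": "resume"},
    {"celex_id": "A2", "da_document": "kun dansk", "da_summary": "resume"},
]
en_rows = [
    {"celex_id": "A1", "en_document": "english text", "en_summary": "summary"},
    {"celex_id": "A3", "en_document": "english only", "en_summary": "summary"},
]

both = bilingual_intersection(da_rows, en_rows)
print([row["celex_id"] for row in both])
```

Only `A1` survives, and the merged record carries `celex_id` plus all four text fields, which matches the shape of the `all` config.
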
### Usage

```python
from datasets import load_dataset

# Load default (bilingual) config
ds = load_dataset("oliverkinch/eur-lex-sum")

# Load Danish only
ds_da = load_dataset("oliverkinch/eur-lex-sum", "da")

# Load English only
ds_en = load_dataset("oliverkinch/eur-lex-sum", "en")
```

### Fields

**`all` config:**

| Field | Type | Description |
|-------|------|-------------|
| `celex_id` | `string` | CELEX identifier for the legislation |
| `da_document` | `string` | Full legislative text in Danish |
| `da_summary` | `string` | Plain-language summary in Danish |
| `en_document` | `string` | Full legislative text in English |
| `en_summary` | `string` | Plain-language summary in English |

**`da` config:** `celex_id`, `da_document`, `da_summary`

**`en` config:** `celex_id`, `en_document`, `en_summary`

### Statistics

| Metric | DA | EN |
|--------|----|----|
| Records | 1,632 | 1,619 |
| Mean document tokens | 22,706 | 24,758 |
| Median document tokens | 8,304 | 9,187 |
| Mean summary tokens | 910 | 986 |
| Median compression ratio | 10.1x | 10.4x |

## Dataset Construction

### Source

Documents and summaries were obtained from the EU Publications Office [CELLAR](https://publications.europa.eu/) repository via:

1. **SPARQL discovery**: querying the CELLAR endpoint to find legislation with linked legislative summaries available as XHTML in Danish (and optionally English).
2. **Content extraction**: fetching XHTML manifestations and extracting body text.

### Filtering

The following filters were applied per language:

1. **Availability**: remove records where document or summary content is missing (HTTP 404 from CELLAR).
2. **PDF scan artifacts**: remove records containing `[NEW PAGE]` markers (residual from scanned PDFs).
3. **Deduplication**: when multiple CELEX IDs share an identical summary, keep the record with the longest document.
4. **Short document removal**: remove records where the document is shorter than or equal to the summary (by whitespace token count).

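The four filters above can be sketched as a single pass over in-memory records. This is a minimal sketch under stated assumptions, not the pipeline actually used: the field names `document` and `summary`, the helper names `n_tokens` and `apply_filters`, and the toy records are hypothetical. It also assumes that missing content shows up as an empty or absent string (rather than an HTTP 404 at fetch time), that the `[NEW PAGE]` check covers both document and summary, and that "longest document" is measured in whitespace tokens.

```python
from typing import Dict, List


def n_tokens(text: str) -> int:
    """Whitespace token count, the length measure named in filter 4."""
    return len(text.split())


def apply_filters(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    # 1. Availability: drop records with missing document or summary text.
    kept = [r for r in records if r.get("document") and r.get("summary")]
    # 2. Scan artifacts: drop records containing a [NEW PAGE] marker.
    kept = [
        r for r in kept
        if "[NEW PAGE]" not in r["document"] and "[NEW PAGE]" not in r["summary"]
    ]
    # 3. Deduplication: among identical summaries, keep the longest document.
    best: Dict[str, Dict[str, str]] = {}
    for r in kept:
        current = best.get(r["summary"])
        if current is None or n_tokens(r["document"]) > n_tokens(current["document"]):
            best[r["summary"]] = r
    # 4. Short documents: keep only documents strictly longer than their summary.
    return [r for r in best.values() if n_tokens(r["document"]) > n_tokens(r["summary"])]


# Toy records (hypothetical IDs and text).
records = [
    {"celex_id": "X1", "document": "a b c d e", "summary": "s one"},
    {"celex_id": "X2", "document": "a b c d e f g", "summary": "s one"},
    {"celex_id": "X3", "document": "", "summary": "s two"},
    {"celex_id": "X4", "document": "text [NEW PAGE] more text here", "summary": "s three"},
    {"celex_id": "X5", "document": "tiny", "summary": "longer than the document"},
]

filtered = apply_filters(records)
print([r["celex_id"] for r in filtered])
```

In the toy data, `X3` fails the availability check, `X4` carries a scan marker, `X1` loses the deduplication tie to the longer `X2`, and `X5` is no longer than its summary, so only `X2` remains.
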
### Text Cleaning

- Non-breaking spaces (`\xa0`) normalized to regular spaces.
- Consecutive newlines collapsed into paragraph boundaries.
- Blank lines and `.xml` identifier lines removed.

### Pipeline

Starting from 2,838 discovered CELEX IDs:

| Step | DA | EN |
|------|----|----|
| After availability filter | 2,536 | 2,521 |
| After scan filter | 2,536 | 2,521 |
| After deduplication | 1,701 | 1,684 |
| After short-document filter | 1,632 | 1,619 |
| Bilingual intersection | 1,605 | 1,605 |

## Related Datasets

- [dennlinger/eur-lex-sum](https://huggingface.co/datasets/dennlinger/eur-lex-sum): the original EUR-Lex-Sum dataset covering 24 EU languages. This dataset uses a similar methodology but sources data from CELLAR (the original EUR-Lex scraping endpoint is now behind WAF bot protection).

## License

The legislative documents and summaries are sourced from EUR-Lex. EU legal documents are available under the [reuse policy of the European Commission](https://commission.europa.eu/legal-notice_en) (CC BY 4.0 compatible).

Provider:
oliverkinch