NLP-AUEB/eurlex

Name: NLP-AUEB/eurlex
Creator: NLP-AUEB
Published: 2024-01-18 11:03:22
License: no description available

Source: Hugging Face. Updated: 2024-01-18. Indexed: 2024-06-15.

Download link:

https://hf-mirror.com/datasets/NLP-AUEB/eurlex

Resource overview:

The EUR-Lex dataset is a text classification dataset of 57,000 English legislative documents. Each document is 727 words long on average and contains three main sections: the title, the legal background citations, and the main body. The documents are annotated with multiple labels by the Publications Office of the European Union using EUROVOC concepts. The dataset supports multi-label text classification, and its labels fall into three groups: frequent labels, few-shot labels, and zero-shot labels. Each record has four fields (celex_id, title, text, and eurovoc_concepts), and the dataset is split into training, development, and test sets.

Provider:

NLP-AUEB

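Original information summary

Dataset card for the EUR-Lex dataset

Dataset description

Dataset overview

EURLEX57K is a dataset of 57,000 English legislative documents from the EUR-Lex website (https://eur-lex.europa.eu), with an average length of 727 words per document. Each document contains four main parts: the title, the name of the legal body, the legal background references, and the main body (usually organized into articles). All documents are annotated by the Publications Office of the European Union (https://publications.europa.eu/en) with multiple concepts from EUROVOC (http://eurovoc.europa.eu/).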

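Supported tasks and leaderboards

The dataset supports the following tasks:

Multi-label text classification: predict the relevant EUROVOC concepts from the document text.
Few-shot and zero-shot learning: labels are divided into frequent (746 labels), few-shot (3,362), and zero-shot (163), according to how often they are assigned to training documents.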
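
The three label groups above can be derived by counting how many training documents carry each label. The following is a minimal sketch, not the dataset authors' code. The function name `bucket_labels` and the cutoff `frequent_min` are hypothetical, and the default of 50 is an assumption that this card does not state.

```python
from collections import Counter


def bucket_labels(train_label_sets, all_labels, frequent_min=50):
    """Group labels into frequent, few-shot, and zero-shot sets.

    frequent_min is an assumed cutoff (not given in this card): a label is
    "frequent" if more than frequent_min training documents carry it.
    """
    # Count each label once per training document.
    counts = Counter(label for labels in train_label_sets for label in set(labels))
    buckets = {"frequent": set(), "few_shot": set(), "zero_shot": set()}
    for label in all_labels:
        n = counts[label]
        if n == 0:
            buckets["zero_shot"].add(label)  # never seen in training
        elif n > frequent_min:
            buckets["frequent"].add(label)
        else:
            buckets["few_shot"].add(label)
    return buckets


# Toy data: "192" is common, "862" is rare, "9999" never occurs in training.
train = [["192", "862"], ["192"], ["192"]]
buckets = bucket_labels(train, all_labels={"192", "862", "9999"}, frequent_min=2)
print(buckets)
```

Zero-shot labels can only be found by passing the full label inventory (`all_labels`), because they never occur in the training split.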

Languages

All documents are in English.

Dataset structure

Data instances

{
  "celex_id": "31979D0509",
  "title": "79/509/EEC: Council Decision of 24 May 1979 on financial aid from the Community for the eradication of African swine fever in Spain",
  "text": "COUNCIL DECISION of 24 May 1979 on financial aid from the Community for the eradication of African swine fever in Spain (79/509/EEC) THE COUNCIL OF THE EUROPEAN COMMUNITIES Having regard to the Treaty establishing the European Economic Community, and in particular Article 43 thereof, Having regard to the proposal from the Commission (1), Having regard to the opinion of the European Parliament (2), Whereas the Community should take all appropriate measures to protect itself against the appearance of African swine fever on its territory; Whereas to this end the Community has undertaken, and continues to undertake, action designed to contain outbreaks of this type of disease far from its frontiers by helping countries affected to reinforce their preventive measures ; whereas for this purpose Community subsidies have already been granted to Spain; Whereas these measures have unquestionably made an effective contribution to the protection of Community livestock, especially through the creation and maintenance of a buffer zone north of the river Ebro; Whereas, however, in the opinion of the Spanish authorities themselves, the measures so far implemented must be reinforced if the fundamental objective of eradicating the disease from the entire country is to be achieved; Whereas the Spanish authorities have asked the Community to contribute to the expenses necessary for the efficient implementation of a total eradication programme; Whereas a favourable response should be given to this request by granting aid to Spain, having regard to the undertaking given by that country to protect the Community against African swine fever and to eliminate completely this disease by the end of a five-year eradication plan; Whereas this eradication plan must include certain measures which guarantee the effectiveness of the action taken, and it must be possible to adapt these measures to developments in the situation by means of a procedure establishing close cooperation between the Member States and the Commission; Whereas it is necessary to keep the Member States regularly informed as to the progress of the action undertaken,",
  "eurovoc_concepts": ["192", "2356", "2560", "862", "863"]
}

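Data fields

The dataset contains the following fields:

celex_id: (string) the official ID of the document.
title: (string) the title of the document.
text: (string) the full content of the document, including the title, background, and main body.
eurovoc_concepts: (list of strings) the relevant EUROVOC concepts (labels).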
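
The four fields listed above map directly onto a JSON record. The sketch below checks a record against those field types and turns its labels into a multi-hot vector, which is a common input format for multi-label classifiers. The helper names `validate_record` and `multi_hot` are hypothetical, the sample `text` is shortened for illustration, and the label index is built from the single sample record rather than from the full EUROVOC inventory.

```python
import json

# Field names and types as listed in the data fields section.
REQUIRED_FIELDS = {"celex_id": str, "title": str, "text": str, "eurovoc_concepts": list}


def validate_record(record):
    """Raise ValueError if a record does not match the documented fields."""
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(name), expected):
            raise ValueError(f"field {name!r} is missing or not a {expected.__name__}")
    if not all(isinstance(c, str) for c in record["eurovoc_concepts"]):
        raise ValueError("eurovoc_concepts must be a list of strings")
    return record


def multi_hot(concepts, label_index):
    """Encode a list of concept IDs as a 0/1 vector over label_index."""
    vector = [0] * len(label_index)
    for concept in concepts:
        if concept in label_index:  # labels outside the index are ignored
            vector[label_index[concept]] = 1
    return vector


# Sample record; the text field is shortened for the example.
raw = json.dumps({
    "celex_id": "31979D0509",
    "title": "79/509/EEC: Council Decision of 24 May 1979 on financial aid from "
             "the Community for the eradication of African swine fever in Spain",
    "text": "COUNCIL DECISION of 24 May 1979 on financial aid from the Community",
    "eurovoc_concepts": ["192", "2356", "2560", "862", "863"],
})
record = validate_record(json.loads(raw))

# Illustrative index built from this one record only.
label_index = {label: i for i, label in enumerate(sorted(record["eurovoc_concepts"]))}
encoded = multi_hot(["192", "863"], label_index)
print(encoded)
```

Note that the concept IDs are strings, so sorting them is lexicographic ("2560" sorts before "862").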

Data splits

Split	Documents	Avg. words	Avg. labels
Training	45,000	729	5
Development	6,000	714	5
Test	6,000	725	5
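
As a quick consistency check, the per-split figures in the table can be combined into the overall document count and the document-weighted average length. This is a sketch using only the numbers in the table. The per-split averages are rounded, so the result is approximate.

```python
# (documents, average words) per split, copied from the table above.
splits = {
    "train": (45_000, 729),
    "dev": (6_000, 714),
    "test": (6_000, 725),
}

total_docs = sum(n_docs for n_docs, _ in splits.values())

# Weight each split's average length by its number of documents.
weighted_avg_words = sum(n_docs * avg for n_docs, avg in splits.values()) / total_docs

print(total_docs)          # 57000
print(weighted_avg_words)  # 727.0
```

The result agrees with the 57,000 documents and 727-word average quoted in the dataset overview.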

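Dataset creation

Curation rationale

The dataset was curated by Chalkidis et al. (2019). The documents were annotated by the Publications Office of the European Union (https://publications.europa.eu/en).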

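Source data

Initial data collection and normalization

The raw data comes from the EUR-Lex portal (https://eur-lex.europa.eu), where it is published as unprocessed HTML. The documents were downloaded from the portal in HTML format, and the associated metadata and EUROVOC concepts were downloaded from the SPARQL endpoint of the Publications Office of the European Union (http://publications.europa.eu/webapi/rdf/sparql).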

Source language producers

[More information needed]

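Annotations

Annotation process

The original documents are available on the EUR-Lex portal (https://eur-lex.europa.eu) as unprocessed HTML. The HTML code was stripped and each document was split into sections.
The documents were annotated by the Publications Office of the European Union (https://publications.europa.eu/en).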
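
The preprocessing described above (strip the HTML, then split the text into sections) can be illustrated with the Python standard library. This is only a sketch under stated assumptions, not the pipeline the dataset authors used. The class `TextExtractor`, the functions `strip_html` and `split_sections`, the `body_marker` heuristic, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and drop all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes
            self.chunks.append(text)


def strip_html(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def split_sections(text, body_marker="Article 1"):
    """Split text into a header and a main body.

    Splitting at the first occurrence of body_marker is an assumed heuristic
    for illustration; real documents need more careful rules.
    """
    head, marker, rest = text.partition(body_marker)
    return {"header": head.strip(), "main_body": (marker + rest).strip()}


# Made-up HTML, not an actual EUR-Lex page.
sample = ("<html><body><p>COUNCIL DECISION of 24 May 1979</p>"
          "<p>Article 1</p><p>Example provision.</p></body></html>")
plain = strip_html(sample)
sections = split_sections(plain)
print(sections)
```

If the marker is absent, `str.partition` leaves the whole text in the header and the main body comes back empty, so callers should check for that case.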

Annotators

Publications Office of the European Union (https://publications.europa.eu/en)

Personal and sensitive information

The dataset does not contain personal or sensitive information.

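Considerations for using the data

Social impact of the dataset

[More information needed]

Discussion of biases

[More information needed]

Other known limitations

[More information needed]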

Additional information

Dataset curators

Chalkidis et al. (2019)

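Licensing information

The Commission's document reuse policy is based on Decision 2011/833/EU. Unless otherwise specified, you may reuse the legal documents published on EUR-Lex for commercial or non-commercial purposes.

The copyright for the editorial content of this website, the summaries of EU legislation, and the consolidated texts is owned by the EU, and this material is licensed under the Creative Commons Attribution 4.0 International licence. This means that you may reuse the content provided that you acknowledge the source and indicate any changes you made.

Source: https://eur-lex.europa.eu/content/legal-notice/legal-notice.html
More information: https://eur-lex.europa.eu/content/help/faq/reuse-contents-eurlex.html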

Citation information

@inproceedings{chalkidis-etal-2019-large,
    title = "Large-Scale Multi-Label Text Classification on {EU} Legislation",
    author = "Chalkidis, Ilias and Fergadiotis, Manos and Malakasiotis, Prodromos and Androutsopoulos, Ion",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1636",
    doi = "10.18653/v1/P19-1636",
    pages = "6314--6322"
}

Contributions

Thanks to @iliaschalkidis for adding this dataset.

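Collected summary

Dataset introduction

Construction

The EURLEX57K dataset was built by Chalkidis et al. (2019). The raw documents were downloaded in HTML format from the EUR-Lex portal (https://eur-lex.europa.eu), then preprocessed by stripping the HTML code and splitting each document into sections. The associated metadata and EUROVOC concepts were downloaded from the SPARQL endpoint of the Publications Office of the EU (http://publications.europa.eu/webapi/rdf/sparql). All documents were annotated by the Publications Office of the EU, which gives the labels an authoritative provenance.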

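Features

The defining feature of EURLEX57K is its support for large-scale multi-label text classification: it covers 57,000 English legislative documents averaging 727 words each. Each document is divided into the title, the name of the legal body, the legal background references, and the main articles, and is labeled with concepts drawn from a set of 4,271 EUROVOC concepts. These labels are further divided into frequent, few-shot, and zero-shot groups, which gives models varied training and evaluation settings.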

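Usage

When using EURLEX57K, users work with data instances that contain the fields celex_id, title, text, and eurovoc_concepts. These fields support tasks such as multi-label text classification, few-shot learning, and zero-shot learning. Users can also load a JSONL file containing descriptions of the EUROVOC concepts, which gives models additional information about the labels. The training, validation, and test sets contain 45,000, 6,000, and 6,000 documents respectively.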

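Background and challenges

Background overview

EURLEX57K was created by Ilias Chalkidis et al. in 2019 to improve on and extend the earlier dataset released by Mencía and Fürnkranz (2007). It contains 57,000 legislative documents from EUR-Lex covering a broad range of EU law. The documents were annotated for multi-label classification by the Publications Office of the EU using EUROVOC concepts, 4,271 labels in total. EURLEX57K is substantially larger than its predecessor and has become an important resource for multi-label text classification and few-shot learning on legal text.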

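Current challenges

EURLEX57K poses several challenges. First, legal text is complex and specialized, so annotating it requires expert knowledge. Second, the label distribution is highly uneven, with a large number of few-shot and zero-shot labels, which tests how well models generalize and how efficiently they learn. Third, the scale and diversity of the dataset increase the computational cost of data processing and model training. These challenges affect dataset quality and raise the technical bar for research that builds on the dataset.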

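Common scenarios

Classic use cases

In legal text classification, EURLEX57K is widely used for multi-label text classification. Because it provides a large number of EU legislative documents together with their EUROVOC concept labels, researchers can train and evaluate models on multi-label classification of legal text. Concretely, a model reads the title and body of a document and predicts the multiple legal concepts that apply to it.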
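
One common way to score such multi-label predictions is micro-averaged F1 over the gold and predicted concept sets. The sketch below is offered as an illustration. This card does not say which metric the dataset authors report, and the function name `micro_f1` and the toy predictions are hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-document label sets.

    gold and predicted are parallel lists; each item is an iterable of labels.
    """
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        g, p = set(gold_labels), set(pred_labels)
        tp += len(g & p)  # labels correctly predicted
        fp += len(p - g)  # predicted but not in gold
        fn += len(g - p)  # in gold but missed
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correct
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with two documents and EUROVOC-style concept IDs.
gold = [["192", "2356", "862"], ["863"]]
pred = [["192", "862", "2560"], ["863"]]
score = micro_f1(gold, pred)
print(score)
```

Micro-averaging pools counts across all labels, so frequent labels dominate the score. Reporting the frequent, few-shot, and zero-shot groups separately gives a clearer picture of performance on rare labels.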

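Derived and related work

Researchers have developed a variety of multi-label text classification models on EURLEX57K and refined them in follow-up studies. For example, Chalkidis et al. (2019) used the dataset to evaluate and compare the performance of different models, advancing legal text classification. The dataset has also been used to explore few-shot and zero-shot learning for legal text classification.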

Recent research on the dataset