NLP-AUEB/eurlex
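Dataset Card for the EUR-Lex dataset
Dataset Description
Dataset Summary
EURLEX57K is a dataset of 57,000 English legislative documents from the EUR-Lex portal (https://eur-lex.europa.eu), with an average length of 727 words per document. Each document contains four main parts: the title, the name of the legal body, the legal background references, and the main body (usually organized in articles). All documents have been annotated by the Publications Office of the EU (https://publications.europa.eu/en) with multiple concepts from EUROVOC (http://eurovoc.europa.eu/).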
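Supported Tasks and Leaderboards
The dataset supports the following tasks:
- Multi-label text classification: given the text of a document, predict the relevant EUROVOC concepts.
- Few-shot and zero-shot learning: labels are divided into frequent (746 labels), few-shot (3,362) and zero-shot (163), depending on how often they are assigned to training documents.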
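The frequent / few-shot / zero-shot grouping can be illustrated with a minimal sketch. The function name `bucket_labels`, the toy data, and the `frequent_threshold` parameter are all hypothetical; in particular, the default cut-off of 50 training documents is an assumption and is not stated in this card.

```python
from collections import Counter

def bucket_labels(train_label_lists, all_labels, frequent_threshold=50):
    """Partition labels by how many training documents carry them.

    frequent_threshold is an ASSUMED cut-off (not given in this card):
    labels seen in more than this many training documents count as frequent.
    """
    # Count each label at most once per document.
    counts = Counter(
        label for labels in train_label_lists for label in set(labels)
    )
    buckets = {"frequent": set(), "few_shot": set(), "zero_shot": set()}
    for label in all_labels:
        n = counts.get(label, 0)
        if n == 0:
            buckets["zero_shot"].add(label)   # never seen in training
        elif n > frequent_threshold:
            buckets["frequent"].add(label)
        else:
            buckets["few_shot"].add(label)    # seen, but rarely
    return buckets

# Toy example with a tiny threshold so every bucket is populated.
toy_train = [["192", "2356"], ["192"], ["192", "862"]]
toy_labels = ["192", "2356", "862", "863"]
toy_buckets = bucket_labels(toy_train, toy_labels, frequent_threshold=2)
```

Zero-shot labels have to be supplied through `all_labels`, since by definition they never appear in the training annotations.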
Languages
All documents are in English.
Dataset Structure
Data Instances
```json
{
  "celex_id": "31979D0509",
  "title": "79/509/EEC: Council Decision of 24 May 1979 on financial aid from the Community for the eradication of African swine fever in Spain",
  "text": "COUNCIL DECISION of 24 May 1979 on financial aid from the Community for the eradication of African swine fever in Spain (79/509/EEC) THE COUNCIL OF THE EUROPEAN COMMUNITIES Having regard to the Treaty establishing the European Economic Community, and in particular Article 43 thereof, Having regard to the proposal from the Commission (1), Having regard to the opinion of the European Parliament (2), Whereas the Community should take all appropriate measures to protect itself against the appearance of African swine fever on its territory; Whereas to this end the Community has undertaken, and continues to undertake, action designed to contain outbreaks of this type of disease far from its frontiers by helping countries affected to reinforce their preventive measures ; whereas for this purpose Community subsidies have already been granted to Spain; Whereas these measures have unquestionably made an effective contribution to the protection of Community livestock, especially through the creation and maintenance of a buffer zone north of the river Ebro; Whereas, however, in the opinion of the Spanish authorities themselves, the measures so far implemented must be reinforced if the fundamental objective of eradicating the disease from the entire country is to be achieved; Whereas the Spanish authorities have asked the Community to contribute to the expenses necessary for the efficient implementation of a total eradication programme; Whereas a favourable response should be given to this request by granting aid to Spain, having regard to the undertaking given by that country to protect the Community against African swine fever and to eliminate completely this disease by the end of a five-year eradication plan; Whereas this eradication plan must include certain measures which guarantee the effectiveness of the action taken, and it must be possible to adapt these measures to developments in the situation by means of a procedure establishing close cooperation between the Member States and the Commission; Whereas it is necessary to keep the Member States regularly informed as to the progress of the action undertaken,",
  "eurovoc_concepts": ["192", "2356", "2560", "862", "863"]
}
```
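Data Fields
The dataset contains the following fields:
- celex_id: (string) the official ID of the document.
- title: (string) the title of the document.
- text: (string) the full content of the document, including the title, the background references, and the main body.
- eurovoc_concepts: (list of strings) the relevant EUROVOC concepts (labels).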
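As a rough illustration of how these fields might be consumed, the sketch below parses one JSON record into a typed object and turns its labels into a multi-hot vector for multi-label classification. The names `EurlexRecord`, `parse_record`, and `to_multi_hot` are hypothetical helpers, the `text` value is shortened for brevity, and the label `"100"` in the vocabulary is an invented placeholder.

```python
import json
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EurlexRecord:
    celex_id: str
    title: str
    text: str
    eurovoc_concepts: List[str]

def parse_record(line: str) -> EurlexRecord:
    """Parse one JSON object with the four fields described in the card."""
    raw = json.loads(line)
    concepts = raw["eurovoc_concepts"]
    if not all(isinstance(c, str) for c in concepts):
        raise ValueError("eurovoc_concepts must be a list of strings")
    return EurlexRecord(
        celex_id=raw["celex_id"],
        title=raw["title"],
        text=raw["text"],
        eurovoc_concepts=list(concepts),
    )

def to_multi_hot(concepts: List[str], label_index: Dict[str, int]) -> List[int]:
    """Encode a label list as a 0/1 vector; unknown labels are ignored."""
    vector = [0] * len(label_index)
    for concept in concepts:
        if concept in label_index:
            vector[label_index[concept]] = 1
    return vector

# Shortened version of the instance shown in the card (text truncated here).
sample_line = json.dumps({
    "celex_id": "31979D0509",
    "title": "79/509/EEC: Council Decision of 24 May 1979 on financial aid "
             "from the Community for the eradication of African swine fever in Spain",
    "text": "COUNCIL DECISION of 24 May 1979 ...",
    "eurovoc_concepts": ["192", "2356", "2560", "862", "863"],
})
record = parse_record(sample_line)

# Hypothetical label vocabulary; "100" is a placeholder not taken from the card.
vocabulary = ["100", "192", "2356", "2560", "862", "863"]
label_index = {label: i for i, label in enumerate(vocabulary)}
multi_hot = to_multi_hot(record.eurovoc_concepts, label_index)
```

Note that the concept IDs are strings, not integers, so they should be mapped through an explicit vocabulary rather than used as array indices directly.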
Data Splits
| Split | Documents | Avg. words | Avg. labels |
|---|---|---|---|
| Train | 45,000 | 729 | 5 |
| Development | 6,000 | 714 | 5 |
| Test | 6,000 | 725 | 5 |
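The split statistics can be cross-checked against the overall figures in the summary (57,000 documents, 727 words on average). The short calculation below uses only the numbers from the table; because the per-split averages are themselves rounded, the word total is an approximation.

```python
# Figures copied from the Data Splits table.
splits = {
    "train": {"documents": 45_000, "avg_words": 729},
    "development": {"documents": 6_000, "avg_words": 714},
    "test": {"documents": 6_000, "avg_words": 725},
}

total_documents = sum(s["documents"] for s in splits.values())

# Approximate word total: per-split averages are rounded in the table.
total_words = sum(s["documents"] * s["avg_words"] for s in splits.values())

# Document-weighted average length across all splits.
overall_avg_words = total_words / total_documents

print(total_documents)     # 57000
print(overall_avg_words)   # 727.0
```

The weighted average matches the 727-word figure quoted in the summary, so the table and the summary are consistent with each other.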
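Dataset Creation
Curation Rationale
The dataset was curated by Chalkidis et al. (2019). The documents were annotated by the Publications Office of the EU (https://publications.europa.eu/en).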
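Source Data
Initial Data Collection and Normalization
The original data are available on the EUR-Lex portal (https://eur-lex.europa.eu) in unprocessed HTML format. The documents were downloaded from the portal as HTML, and the relevant metadata and EUROVOC concepts were downloaded from the SPARQL endpoint of the Publications Office of the EU (http://publications.europa.eu/webapi/rdf/sparql).
Who are the source language producers?
[More Information Needed]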
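Annotations
Annotation process
- The original documents are available on the EUR-Lex portal (https://eur-lex.europa.eu) in unprocessed HTML format. The HTML code was stripped and the documents were split into sections.
- The documents were annotated by the Publications Office of the EU (https://publications.europa.eu/en).
Who are the annotators?
Publications Office of the EU (https://publications.europa.eu/en)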
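The HTML-stripping step mentioned above could look roughly like the following. This is only a sketch using Python's standard `html.parser`; the card does not say which tool the curators used, and the splitting of documents into sections is not reproduced here because its rules are not described. The class `_TextExtractor`, the function `strip_html`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks and collapse all runs of whitespace to single spaces.
        return " ".join(" ".join(self._chunks).split())

def strip_html(html: str) -> str:
    """Return the plain text of an HTML string with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

# Invented markup for illustration; real EUR-Lex pages are more complex.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>COUNCIL DECISION</p><p>of 24 May 1979</p></body></html>"
)
plain_text = strip_html(sample_html)
print(plain_text)  # COUNCIL DECISION of 24 May 1979
```

Collapsing whitespace produces single-line text similar in shape to the `text` field of the example instance.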
Personal and Sensitive Information
The dataset does not include personal or sensitive information.
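Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]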
Additional Information
Dataset Curators
Chalkidis et al. (2019)
Licensing Information
© European Union, 1998-2021
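The Commission's document reuse policy is based on Decision 2011/833/EU. Unless otherwise specified, you can reuse the legal documents published on EUR-Lex for commercial or non-commercial purposes.
The copyright for the editorial content of the website, the summaries of EU legislation, and the consolidated texts is owned by the EU and licensed under the Creative Commons Attribution 4.0 International licence. This means that you can reuse the content provided that you acknowledge the source and indicate any changes you have made.
Source: https://eur-lex.europa.eu/content/legal-notice/legal-notice.html
More information: https://eur-lex.europa.eu/content/help/faq/reuse-contents-eurlex.html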
Citation Information
@inproceedings{chalkidis-etal-2019-large,
    title = "Large-Scale Multi-Label Text Classification on {EU} Legislation",
    author = "Chalkidis, Ilias and Fergadiotis, Manos and Malakasiotis, Prodromos and Androutsopoulos, Ion",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1636",
    doi = "10.18653/v1/P19-1636",
    pages = "6314--6322"
}
Contributions
Thanks to @iliaschalkidis for adding this dataset.




