NCube/europa-random-split
Source: Hugging Face. Updated 2024-06-03; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/NCube/europa-random-split
Resource description:
---
language:
- fr
- de
- en
- it
- nl
- el
- da
- pt
- es
- sv
- fi
- lt
- et
- cs
- hu
- lv
- sl
- pl
- mt
- sk
- ro
- bg
- hr
- ga
license: mit
size_categories:
- 100K<n<1M
pretty_name: Europa Random Split
dataset_info:
  features:
  - name: celex_id
    dtype: string
  - name: lang
    dtype: string
  - name: input_text
    dtype: string
  - name: keyphrases
    sequence: string
  splits:
  - name: train
    num_bytes: 6405779590
    num_examples: 159306
  - name: valid
    num_bytes: 2182262528
    num_examples: 53943
  - name: test
    num_bytes: 2853853947
    num_examples: 71708
  download_size: 5183354316
  dataset_size: 11441896065
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
tags:
- keyphrase-generation
- text-to-text
- legal
---
# Dataset Card for EUROPA
## Dataset Details
### Dataset Description
EUROPA is a dataset designed for training and evaluating multilingual keyphrase generation models in the legal domain. It consists of legal judgments from the Court of Justice of the European Union (CJEU) and includes instances in all 24 official EU languages.
**Key Features:**
- **Multilingual:** Covers 24 official EU languages.
- **Domain-Specific:** Focuses on legal documents.
- **Source:** Derived from Court of Justice of the European Union judgments.

- **Curated by:** N3 team
- **Languages:** French, German, English, Italian, Dutch, Greek, Danish, Portuguese, Spanish, Swedish, Finnish, Lithuanian, Estonian, Czech, Hungarian, Latvian, Slovenian, Polish, Maltese, Slovak, Romanian, Bulgarian, Croatian, Irish
- **License:** MIT License
### Dataset Sources
- **Paper:** https://arxiv.org/abs/2403.00252
## Dataset Structure
- **celex_id:** CELEX identifier inherited from CJEU. Different translated versions of the same judgment share the same celex_id. If you wish to have a unique identifier for each instance, you can concatenate `lang` and `celex_id` values;
- **lang:** ISO 639-1 language code;
- **input_text:** judgment transcription or translation;
- **keyphrases:** reference keyphrases drafted by the CJEU.
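Since translations of one judgment share a `celex_id`, a per-instance key needs the language as well. The following is a minimal sketch of that concatenation. The helper names `instance_id` and `group_translations`, the `_` separator, and the sample rows are illustrative assumptions and are not part of the dataset.

```python
from collections import defaultdict


def instance_id(row: dict) -> str:
    """Build a per-instance key by joining `lang` and `celex_id`.

    The "_" separator is an arbitrary choice made for this sketch.
    """
    return f"{row['lang']}_{row['celex_id']}"


def group_translations(rows) -> dict:
    """Map each celex_id to the set of languages in which it appears."""
    by_judgment = defaultdict(set)
    for row in rows:
        by_judgment[row["celex_id"]].add(row["lang"])
    return dict(by_judgment)


# Placeholder rows that mimic the schema; every value here is made up.
rows = [
    {"celex_id": "EXAMPLE-0001", "lang": "fr", "input_text": "...", "keyphrases": ["..."]},
    {"celex_id": "EXAMPLE-0001", "lang": "en", "input_text": "...", "keyphrases": ["..."]},
    {"celex_id": "EXAMPLE-0002", "lang": "fr", "input_text": "...", "keyphrases": ["..."]},
]

ids = [instance_id(r) for r in rows]
translations = group_translations(rows)
```

Grouping by `celex_id` as in `group_translations` is one way to check which language versions of a given judgment are present in a split.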
This page presents a randomly split version of the dataset, so that all sets share the same distribution in terms of vocabulary and languages. More details can be found in Appendix H of our paper.
- **training set**: 159 306 instances;
- **validation set**: 53 943 instances;
- **test set**: 71 708 instances.
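The split sizes listed above can be turned into proportions with a few lines of arithmetic. The sketch below also includes a loader, `stream_split`, which is a hypothetical helper: it assumes the Hugging Face `datasets` library is installed and that the repository id is `NCube/europa-random-split`, as the page title suggests. Streaming is used there only to avoid fetching the full download up front.

```python
# Instance counts as stated in the dataset card.
SPLIT_SIZES = {"train": 159_306, "valid": 53_943, "test": 71_708}


def split_shares(sizes: dict) -> dict:
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def stream_split(split: str):
    """Lazily iterate over one split (assumes `datasets` is installed)."""
    # Imported here so the arithmetic above runs without the library.
    from datasets import load_dataset

    return load_dataset("NCube/europa-random-split", split=split, streaming=True)


shares = split_shares(SPLIT_SIZES)
for name, share in shares.items():
    print(f"{name}: {SPLIT_SIZES[name]} instances ({share:.1%})")
```

Note that, according to the card metadata, the validation split is named `valid` rather than `validation`, so a call would look like `stream_split("valid")`.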
## Citation
```
@article{salaun2024europa,
title={EUROPA: A Legal Multilingual Keyphrase Generation Dataset},
author={Sala{\"u}n, Olivier and Piedboeuf, Fr{\'e}d{\'e}ric and Le Berre, Guillaume and Hermelo, David Alfonso and Langlais, Philippe},
journal={arXiv preprint arXiv:2403.00252},
year={2024}
}
```
Provider:
NCube
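Summary of original information
Dataset overview
Dataset name
- Name: Europa Random Split
- Alias: EUROPA
Dataset description
- Purpose: training and evaluating multilingual keyphrase generation models in the legal domain.
- Content: legal judgments from the Court of Justice of the European Union (CJEU), covering all 24 official EU languages.
- Features:
  - Multilingual: covers all 24 official EU languages.
  - Domain-specific: focuses on legal documents.
  - Source: derived from CJEU judgments.
Dataset structure
- Fields:
  - celex_id: string; CELEX identifier.
  - lang: string; ISO 639-1 language code.
  - input_text: string; judgment transcription or translation.
  - keyphrases: sequence of strings; reference keyphrases drafted by the CJEU.
- Splits:
  - Training set: 159,306 instances.
  - Validation set: 53,943 instances.
  - Test set: 71,708 instances.
Dataset size
- Download size: 5,183,354,316 bytes.
- Dataset size: 11,441,896,065 bytes.
License
- License: MIT License
Languages
- Languages: French, German, English, Italian, Dutch, Greek, Danish, Portuguese, Spanish, Swedish, Finnish, Lithuanian, Estonian, Czech, Hungarian, Latvian, Slovenian, Polish, Maltese, Slovak, Romanian, Bulgarian, Croatian, Irish.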



