NCube/europa-random-split

Name: NCube/europa-random-split
Creator: NCube
Published: 2024-06-03 02:58:36
License: MIT

Source: Hugging Face. Updated 2024-06-03; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/NCube/europa-random-split


Resource overview:

---
language:
- fr
- de
- en
- it
- nl
- el
- da
- pt
- es
- sv
- fi
- lt
- et
- cs
- hu
- lv
- sl
- pl
- mt
- sk
- ro
- bg
- hr
- ga
license: mit
size_categories:
- 100K<n<1M
pretty_name: Europa Random Split
dataset_info:
  features:
  - name: celex_id
    dtype: string
  - name: lang
    dtype: string
  - name: input_text
    dtype: string
  - name: keyphrases
    sequence: string
  splits:
  - name: train
    num_bytes: 6405779590
    num_examples: 159306
  - name: valid
    num_bytes: 2182262528
    num_examples: 53943
  - name: test
    num_bytes: 2853853947
    num_examples: 71708
  download_size: 5183354316
  dataset_size: 11441896065
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
tags:
- keyphrase-generation
- text-to-text
- legal
---

# Dataset Card for EUROPA

## Dataset Details

### Dataset Description

EUROPA is a dataset designed for training and evaluating multilingual keyphrase generation models in the legal domain. It consists of legal judgments from the Court of Justice of the European Union (CJEU) and includes instances in all 24 official EU languages.

**Key Features**:

- **Multilingual:** Covers 24 official EU languages.
- **Domain-Specific:** Focuses on legal documents.
- **Source:** Derived from Court of Justice of the European Union judgments.

- **Curated by:** N3 team
- **Languages:** French, German, English, Italian, Dutch, Greek, Danish, Portuguese, Spanish, Swedish, Finnish, Lithuanian, Estonian, Czech, Hungarian, Latvian, Slovenian, Polish, Maltese, Slovak, Romanian, Bulgarian, Croatian, Irish
- **License:** MIT License

### Dataset Sources

- **Paper:** https://arxiv.org/abs/2403.00252

## Dataset Structure

- **celex_id:** CELEX identifier inherited from the CJEU. Different translated versions of the same judgment share the same celex_id. If you need a unique identifier for each instance, you can concatenate the `lang` and `celex_id` values;
- **lang:** ISO 639-1 language code;
- **input_text:** judgment transcription or translation;
- **keyphrases:** reference keyphrases drafted by the CJEU.

This page presents a randomly split version of the dataset, so that all sets have the same distribution in terms of vocabulary and languages. More details can be found in Appendix H of our paper.

- **training set**: 159 306 instances;
- **validation set**: 53 943 instances;
- **test set**: 71 708 instances.

## Citation

```
@article{salaun2024europa,
  title={EUROPA: A Legal Multilingual Keyphrase Generation Dataset},
  author={Sala{\"u}n, Olivier and Piedboeuf, Fr{\'e}d{\'e}ric and Le Berre, Guillaume and Hermelo, David Alfonso and Langlais, Philippe},
  journal={arXiv preprint arXiv:2403.00252},
  year={2024}
}
```

Provider:

NCube

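Summary of original information

Dataset overview

Dataset name

Name: Europa Random Split
Alias: EUROPA

Dataset description

Purpose: training and evaluating multilingual keyphrase generation models in the legal domain.
Content: legal judgments from the Court of Justice of the European Union (CJEU), covering all 24 official EU languages.
Features:
- Multilingual: covers the 24 official EU languages.
- Domain-specific: focuses on legal documents.
- Source: derived from judgments of the Court of Justice of the European Union.

Dataset structure

Fields:
- celex_id: string; CELEX identifier.
- lang: string; ISO 639-1 language code.
- input_text: string; transcription or translation of the judgment.
- keyphrases: sequence of strings; reference keyphrases drafted by the CJEU.
Splits:
- Training set: 159,306 instances.
- Validation set: 53,943 instances.
- Test set: 71,708 instances.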
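
The dataset card notes that translations of the same judgment share a `celex_id` and suggests concatenating `lang` and `celex_id` to obtain a unique key. A minimal sketch of that idea follows. The helper names `instance_id` and `iter_ids`, the `:` separator, and the sample rows (including the placeholder identifier) are assumptions made for illustration and do not come from the dataset. The remote read uses `load_dataset` from the Hugging Face `datasets` library in streaming mode and is only executed if `iter_ids` is called.

```python
from typing import Iterator, Mapping


def instance_id(row: Mapping[str, object], sep: str = ":") -> str:
    """Build a per-instance key from `lang` and `celex_id` (separator is an assumption)."""
    return f"{row['lang']}{sep}{row['celex_id']}"


def iter_ids(split: str = "valid", limit: int = 5) -> Iterator[str]:
    """Stream a few rows from the Hub and yield their keys (needs network access)."""
    from datasets import load_dataset  # imported lazily: only needed for the remote read

    rows = load_dataset("NCube/europa-random-split", split=split, streaming=True)
    for i, row in enumerate(rows):
        if i >= limit:
            break
        yield instance_id(row)


# Hypothetical rows: two translations of one judgment share a placeholder CELEX id.
sample = [
    {"celex_id": "EXAMPLE-0001", "lang": "en"},
    {"celex_id": "EXAMPLE-0001", "lang": "fr"},
]
ids = [instance_id(r) for r in sample]
```

Streaming avoids downloading the full archive before inspecting a handful of rows; drop `streaming=True` if a local cached copy is preferred.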

Dataset size

Download size: 5,183,354,316 bytes.
Dataset size: 11,441,896,065 bytes.
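
To put the split counts and byte sizes in perspective, the short sketch below derives each split's share of the instances and converts the two sizes to GiB. All figures are copied from the card metadata; the variable names and the `to_gib` helper are illustrative assumptions.

```python
# Instance counts and per-split byte sizes as listed in the card metadata.
SPLIT_EXAMPLES = {"train": 159_306, "valid": 53_943, "test": 71_708}
SPLIT_BYTES = {"train": 6_405_779_590, "valid": 2_182_262_528, "test": 2_853_853_947}
DOWNLOAD_BYTES = 5_183_354_316
DATASET_BYTES = 11_441_896_065


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


total_examples = sum(SPLIT_EXAMPLES.values())
shares = {name: n / total_examples for name, n in SPLIT_EXAMPLES.items()}

for name, share in shares.items():
    print(f"{name}: {SPLIT_EXAMPLES[name]} instances ({share:.1%})")
print(f"download: {to_gib(DOWNLOAD_BYTES):.2f} GiB, on disk: {to_gib(DATASET_BYTES):.2f} GiB")
```

The per-split byte sizes add up to the reported dataset size, which is a quick sanity check on the metadata.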

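License information

License: MIT License

Language support

Languages: French, German, English, Italian, Dutch, Greek, Danish, Portuguese, Spanish, Swedish, Finnish, Lithuanian, Estonian, Czech, Hungarian, Latvian, Slovenian, Polish, Maltese, Slovak, Romanian, Bulgarian, Croatian, Irish.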
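
The `lang` field holds ISO 639-1 codes, and the card metadata lists them in the same order as the language names above. The sketch below pairs the two lists and defines a row predicate for selecting a subset of languages. The names `LANGUAGES` and `keep_langs` are hypothetical; the predicate is written so that it could be passed to a filtering call such as `Dataset.filter` in the `datasets` library, which is an assumption about usage rather than something the card prescribes.

```python
from typing import Iterable, Mapping

# ISO 639-1 codes from the card metadata, paired with the listed language names.
LANGUAGES = {
    "fr": "French", "de": "German", "en": "English", "it": "Italian",
    "nl": "Dutch", "el": "Greek", "da": "Danish", "pt": "Portuguese",
    "es": "Spanish", "sv": "Swedish", "fi": "Finnish", "lt": "Lithuanian",
    "et": "Estonian", "cs": "Czech", "hu": "Hungarian", "lv": "Latvian",
    "sl": "Slovenian", "pl": "Polish", "mt": "Maltese", "sk": "Slovak",
    "ro": "Romanian", "bg": "Bulgarian", "hr": "Croatian", "ga": "Irish",
}


def keep_langs(row: Mapping[str, object], langs: Iterable[str]) -> bool:
    """Return True if the row's `lang` code is one of the requested codes."""
    wanted = set(langs)
    unknown = wanted - LANGUAGES.keys()
    if unknown:
        raise ValueError(f"codes not listed in the card: {sorted(unknown)}")
    return row["lang"] in wanted


# Hypothetical rows, used only to exercise the predicate.
rows = [{"lang": "ga"}, {"lang": "en"}, {"lang": "mt"}]
selected = [r for r in rows if keep_langs(r, ["ga", "mt"])]
```

Rejecting unknown codes up front guards against typos such as `gr` for Greek, whose ISO 639-1 code is `el`.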
