
NCube/europa-random-split

Source: Hugging Face. Updated 2024-06-03; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/NCube/europa-random-split
Resource description:

---
language:
- fr
- de
- en
- it
- nl
- el
- da
- pt
- es
- sv
- fi
- lt
- et
- cs
- hu
- lv
- sl
- pl
- mt
- sk
- ro
- bg
- hr
- ga
license: mit
size_categories:
- 100K<n<1M
pretty_name: Europa Random Split
dataset_info:
  features:
  - name: celex_id
    dtype: string
  - name: lang
    dtype: string
  - name: input_text
    dtype: string
  - name: keyphrases
    sequence: string
  splits:
  - name: train
    num_bytes: 6405779590
    num_examples: 159306
  - name: valid
    num_bytes: 2182262528
    num_examples: 53943
  - name: test
    num_bytes: 2853853947
    num_examples: 71708
  download_size: 5183354316
  dataset_size: 11441896065
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
tags:
- keyphrase-generation
- text-to-text
- legal
---

# Dataset Card for EUROPA

## Dataset Details

### Dataset Description

EUROPA is a dataset designed for training and evaluating multilingual keyphrase generation models in the legal domain. It consists of legal judgments from the Court of Justice of the European Union (CJEU) and includes instances in all 24 official EU languages.

**Key Features**:

- **Multilingual:** Covers 24 official EU languages.
- **Domain-Specific:** Focuses on legal documents.
- **Source:** Derived from Court of Justice of the European Union judgments.

- **Curated by:** N3 team
- **Languages:** French, German, English, Italian, Dutch, Greek, Danish, Portuguese, Spanish, Swedish, Finnish, Lithuanian, Estonian, Czech, Hungarian, Latvian, Slovenian, Polish, Maltese, Slovak, Romanian, Bulgarian, Croatian, Irish
- **License:** MIT License

### Dataset Sources

- **Paper:** https://arxiv.org/abs/2403.00252

## Dataset Structure

- **celex_id:** CELEX identifier inherited from the CJEU. Different translated versions of the same judgment share the same celex_id. If you need a unique identifier for each instance, you can concatenate the `lang` and `celex_id` values;
- **lang:** ISO 639-1 language code;
- **input_text:** judgment transcription or translation;
- **keyphrases:** reference keyphrases drafted by the CJEU.

This page presents a randomly split version of the dataset, so that all sets share the same distribution in terms of vocabulary and languages. More details can be found in Appendix H of our paper.

- **training set**: 159 306 instances;
- **validation set**: 53 943 instances;
- **test set**: 71 708 instances.

## Citation

```
@article{salaun2024europa,
  title={EUROPA: A Legal Multilingual Keyphrase Generation Dataset},
  author={Sala{\"u}n, Olivier and Piedboeuf, Fr{\'e}d{\'e}ric and Le Berre, Guillaume and Hermelo, David Alfonso and Langlais, Philippe},
  journal={arXiv preprint arXiv:2403.00252},
  year={2024}
}
```
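The card's Dataset Structure section notes that translations of one judgment share a `celex_id` and suggests concatenating `lang` and `celex_id` to get a unique key. The following is a minimal sketch of that idea, not the authors' code. The underscore separator, the helper names (`make_instance_id`, `group_translations`, `load_split`), and the sample CELEX IDs are assumptions made for illustration; the card only specifies which two fields to combine. `load_split` assumes the Hugging Face `datasets` package is installed and is defined but not called here, so the rest of the example runs offline.

```python
from collections import defaultdict


def make_instance_id(record: dict) -> str:
    """Join `lang` and `celex_id` into one key (the '_' separator is an assumption)."""
    return f"{record['lang']}_{record['celex_id']}"


def group_translations(records) -> dict:
    """Map each celex_id to the sorted list of languages it appears in."""
    by_judgment = defaultdict(list)
    for record in records:
        by_judgment[record["celex_id"]].append(record["lang"])
    return {celex_id: sorted(langs) for celex_id, langs in by_judgment.items()}


def load_split(split: str = "valid"):
    """Stream one split from the Hub (requires `pip install datasets` and network access)."""
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset

    return load_dataset("NCube/europa-random-split", split=split, streaming=True)


# Placeholder records: these CELEX IDs and texts are made up for illustration.
sample = [
    {"celex_id": "62019CJ0001", "lang": "fr", "input_text": "...", "keyphrases": ["..."]},
    {"celex_id": "62019CJ0001", "lang": "en", "input_text": "...", "keyphrases": ["..."]},
    {"celex_id": "62020CJ0002", "lang": "en", "input_text": "...", "keyphrases": ["..."]},
]

ids = [make_instance_id(record) for record in sample]
translations = group_translations(sample)
```

Streaming avoids downloading the full multi-gigabyte archive before inspecting the first records; whether that trade-off suits a given training setup is left to the reader.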
Provider:
NCube
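Summary of original information

Dataset overview

Dataset name

  • Name: Europa Random Split
  • Alias: EUROPA

Dataset description

  • Purpose: training and evaluating multilingual keyphrase generation models in the legal domain.
  • Content: legal judgments from the Court of Justice of the European Union (CJEU), covering the 24 official EU languages.
  • Features:
    • Multilingual: covers the 24 official EU languages.
    • Domain-specific: focuses on legal documents.
    • Source: derived from judgments of the Court of Justice of the European Union.

Dataset structure

  • Fields:
    • celex_id: string; CELEX identifier.
    • lang: string; ISO 639-1 language code.
    • input_text: string; judgment transcription or translation.
    • keyphrases: sequence of strings; reference keyphrases drafted by the CJEU.
  • Splits:
    • Training set: 159,306 instances.
    • Validation set: 53,943 instances.
    • Test set: 71,708 instances.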
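As a quick check on the split sizes listed above, the sketch below computes the total number of instances and each split's share. The counts come from the dataset card; the variable names are illustrative only.

```python
# Instance counts per split, as stated in the dataset card.
SPLIT_SIZES = {"train": 159_306, "valid": 53_943, "test": 71_708}

total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

# Print one row per split: name, instance count, share of the whole dataset.
for name, count in SPLIT_SIZES.items():
    print(f"{name:<5} {count:>8,} {shares[name]:6.1%}")
print(f"total {total:>8,}")
```

The three splits add up to 284,957 instances, split roughly 56% / 19% / 25% between training, validation, and test.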

Dataset size

  • Download size: 5,183,354,316 bytes.
  • Dataset size: 11,441,896,065 bytes.
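To put these byte counts in more familiar units, the sketch below converts them to GiB and checks that the per-split `num_bytes` values from the card's metadata add up to the stated dataset size. The figures are taken from the card; the helper name `to_gib` is illustrative.

```python
# Byte counts from the dataset card metadata.
SPLIT_BYTES = {"train": 6_405_779_590, "valid": 2_182_262_528, "test": 2_853_853_947}
DOWNLOAD_SIZE = 5_183_354_316
DATASET_SIZE = 11_441_896_065


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


# The per-split sizes should account for the whole dataset.
splits_total = sum(SPLIT_BYTES.values())

# Ratio of the download to the loaded dataset.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"download: {to_gib(DOWNLOAD_SIZE):.2f} GiB")
print(f"dataset:  {to_gib(DATASET_SIZE):.2f} GiB")
print(f"download / dataset: {download_ratio:.2f}")
```

The download is a little under half the size of the loaded dataset, so plan for roughly twice the download size in local disk space once the data is prepared.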

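License information

  • License: MIT License

Language support

  • Languages: French, German, English, Italian, Dutch, Greek, Danish, Portuguese, Spanish, Swedish, Finnish, Lithuanian, Estonian, Czech, Hungarian, Latvian, Slovenian, Polish, Maltese, Slovak, Romanian, Bulgarian, Croatian, Irish.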