disi-unibo-nlp/COMMA

Hugging Face · Updated 2024-05-21 · Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/disi-unibo-nlp/COMMA
Resource description:
---
dataset_info:
- config_name: constitution
  features:
  - name: article_n
    dtype: int64
  - name: article_commas
    struct:
    - name: '1'
      dtype: string
    - name: '10'
      dtype: string
    - name: '11'
      dtype: string
    - name: '12'
      dtype: string
    - name: '2'
      dtype: string
    - name: '3'
      dtype: string
    - name: '4'
      dtype: string
    - name: '5'
      dtype: string
    - name: '6'
      dtype: string
    - name: '7'
      dtype: string
    - name: '8'
      dtype: string
    - name: '9'
      dtype: string
  splits:
  - name: it
    num_bytes: 70944
    num_examples: 139
  - name: en
    num_bytes: 69749
    num_examples: 139
  - name: fr
    num_bytes: 77000
    num_examples: 139
  - name: es
    num_bytes: 76072
    num_examples: 139
  download_size: 225640
  dataset_size: 293765
- config_name: en
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 555145830
    num_examples: 12600
  - name: test
    num_bytes: 30737608
    num_examples: 700
  - name: validation
    num_bytes: 31671019
    num_examples: 700
  download_size: 278441383
  dataset_size: 617554457
- config_name: es
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 575679719
    num_examples: 12600
  - name: test
    num_bytes: 31896832
    num_examples: 700
  - name: validation
    num_bytes: 32827830
    num_examples: 700
  download_size: 300803577
  dataset_size: 640404381
- config_name: fr
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 580985816
    num_examples: 12600
  - name: test
    num_bytes: 32177379
    num_examples: 700
  - name: validation
    num_bytes: 33152939
    num_examples: 700
  download_size: 306338176
  dataset_size: 646316134
- config_name: it
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 557553146
    num_examples: 12600
  - name: test
    num_bytes: 30850184
    num_examples: 700
  - name: validation
    num_bytes: 31775341
    num_examples: 700
  download_size: 293523614
  dataset_size: 620178671
configs:
- config_name: constitution
  data_files:
  - split: it
    path: constitution/it-*
  - split: en
    path: constitution/en-*
  - split: fr
    path: constitution/fr-*
  - split: es
    path: constitution/es-*
- config_name: en
  data_files:
  - split: train
    path: en/train-*
  - split: test
    path: en/test-*
  - split: validation
    path: en/validation-*
- config_name: es
  data_files:
  - split: train
    path: es/train-*
  - split: test
    path: es/test-*
  - split: validation
    path: es/validation-*
- config_name: fr
  data_files:
  - split: train
    path: fr/train-*
  - split: test
    path: fr/test-*
  - split: validation
    path: fr/validation-*
- config_name: it
  data_files:
  - split: train
    path: it/train-*
  - split: test
    path: it/test-*
  - split: validation
    path: it/validation-*
---

### Dataset Summary

COMMA is a constitutional multi-task and multilingual archive consisting of 14K CCIR rulings with expert-authored annotations. It embodies distinctive features that render it a valuable object of study for broader NLP research.

### Languages

Italian, English, Spanish, French

## Dataset

### Data Fields

The dataset contains a list of instances (rulings); each instance contains the following data:

|                     Field |                                                                        Description |
|--------------------------:|-----------------------------------------------------------------------------------:|
|                        id |                                                              `(str)` The ruling ID |
|               ruling_type |                                                            `(int)` The ruling type |
|                  epigraph |                                                        `(str)` The ruling epigraph |
|                      body |                                                            `(str)` The ruling text |
|                  decision |                                                        `(str)` The ruling decision |
|               maxims_text |                                              `(str)` The text of the ruling maxims |
|              maxims_title |                                             `(str)` The title of the ruling maxims |
|                 full_text |                                                `(str)` The full text of the ruling |
|                num_maxims |                                                       `(int)` The number of maxims |
|                maxims_len |                                                   `(int)` The length of the maxims |
|             full_text_len |                                                `(int)` The length of the full text |
|             judgment_type |                                                          `(int)` The judgment type |
| constitutional_parameters | `(List[List[str]])` The constitutional parameters (schema dtype is `string`)       |
|                    maxims | `(dict)` The maxims' numbers, texts, and titles (schema dtype is `string`)         |

Please check the exemplar usage below for loading the data:

```python
from datasets import load_dataset

# Download the English configuration locally and load it as a DatasetDict
# with "train", "test", and "validation" splits.
comma_en = load_dataset("disi-unibo-nlp/COMMA", "en")

example = comma_en["validation"][0]  # The first instance of the dev set
print(example["full_text"])     # The full text (i.e., epigraph + body + decision) of the ruling
print(example["maxims_title"])  # The corresponding maxims title for the ruling
```

### Data Splits

The four language configurations share the same split sizes. The dev set is exposed as the `validation` split.

|       Split |     IT |     EN |     ES |     FR |
| ----------: | -----: | -----: | -----: | -----: |
| Train (90%) | 12,600 | 12,600 | 12,600 | 12,600 |
|   Test (5%) |    700 |    700 |    700 |    700 |
|    Dev (5%) |    700 |    700 |    700 |    700 |
Provided by:
disi-unibo-nlp
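Summary of original information

Dataset overview

Dataset name

COMMA

Dataset description

COMMA is a constitutional, multi-task, multilingual archive of 14K CCIR rulings with expert-authored annotations. Its distinctive features make it a valuable object of study for broader NLP research.

Supported languages

  • Italian
  • English
  • Spanish
  • French

Dataset structure

Data fields

  • id: ruling ID (string)
  • ruling_type: ruling type (integer)
  • epigraph: ruling epigraph (string)
  • body: ruling text (string)
  • decision: ruling decision (string)
  • maxims_text: text of the ruling maxims (string)
  • maxims_title: title of the ruling maxims (string)
  • full_text: full text of the ruling (string)
  • num_maxims: number of maxims (integer)
  • maxims_len: length of the maxims (integer)
  • full_text_len: length of the full text (integer)
  • judgment_type: judgment type (integer)
  • constitutional_parameters: constitutional parameters (nested list of strings; the schema dtype is string)
  • maxims: maxims' numbers, texts, and titles (dictionary; the schema dtype is string)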
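The field list above can be turned into a quick sanity check on loaded records. The following is a minimal sketch, not part of the dataset's own tooling: the names `RULING_SCHEMA` and `schema_problems` are hypothetical, and the Python types are assumed from the dtypes in the dataset card (`string` mapped to `str`, `int64` mapped to `int`).

```python
# Expected field names and Python types, assumed from the dataset card dtypes
# (string -> str, int64 -> int). RULING_SCHEMA is a hypothetical helper name.
RULING_SCHEMA = {
    "id": str,
    "ruling_type": int,
    "epigraph": str,
    "body": str,
    "decision": str,
    "maxims_text": str,
    "maxims_title": str,
    "full_text": str,
    "num_maxims": int,
    "maxims_len": int,
    "full_text_len": int,
    "judgment_type": int,
    "constitutional_parameters": str,
    "maxims": str,
}


def schema_problems(record: dict) -> list[str]:
    """Return human-readable schema problems; an empty list means the record matches."""
    problems = []
    # Check every expected field for presence and type.
    for name, expected in RULING_SCHEMA.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Report fields that the schema does not know about, in a stable order.
    for name in sorted(record.keys() - RULING_SCHEMA.keys()):
        problems.append(f"unexpected field: {name}")
    return problems
```

Calling `schema_problems(example)` on a record from any language configuration should return an empty list if the record follows the card's schema; a record that still uses a `text` key instead of `body` would be reported as one missing and one unexpected field.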

Data splits

  • Train: training set (90%, 12,600 instances)
  • Test: test set (5%, 700 instances)
  • Validation: validation set (5%, 700 instances)
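The percentages above follow directly from the instance counts. As a small worked check (the variable names are illustrative only), the shares can be recomputed like this:

```python
# Instance counts per split, as listed for each language configuration.
SPLIT_SIZES = {"train": 12_600, "test": 700, "validation": 700}

# Total rulings per language configuration.
total = sum(SPLIT_SIZES.values())

# Fraction of the total held by each split.
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

for name, share in shares.items():
    print(f"{name:<10} {SPLIT_SIZES[name]:>6,}  {share:.0%}")
```

The total of 14,000 per configuration matches the "14K rulings" figure in the dataset description.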

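Dataset size

  • Download size: varies across the four language configurations, from 278,441,383 bytes (en) to 306,338,176 bytes (fr).
  • Dataset size: varies across the four language configurations, from 617,554,457 bytes (en) to 646,316,134 bytes (fr).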
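To make the byte counts easier to compare, the sketch below converts the per-configuration figures from the dataset card metadata into megabytes and computes the download-to-loaded ratio. The helper `to_mb` and the `SIZES` table are illustrative names, and 1 MB is taken as 10**6 bytes.

```python
# (download_size, dataset_size) in bytes, copied from the dataset card metadata.
SIZES = {
    "en": (278_441_383, 617_554_457),
    "es": (300_803_577, 640_404_381),
    "fr": (306_338_176, 646_316_134),
    "it": (293_523_614, 620_178_671),
}


def to_mb(n_bytes: int) -> float:
    """Convert bytes to megabytes, assuming 1 MB = 10**6 bytes."""
    return n_bytes / 1e6


for config, (download, dataset) in SIZES.items():
    ratio = download / dataset  # share of the loaded size that must be downloaded
    print(f"{config}: download {to_mb(download):.1f} MB, "
          f"loaded {to_mb(dataset):.1f} MB, ratio {ratio:.2f}")
```

The `constitution` configuration is much smaller (225,640 bytes to download, 293,765 bytes loaded) and is left out of the ranges above.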

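Dataset configurations

  • constitution: contains article numbers (article_n) and each article's clauses (article_commas; in Italian legal usage a "comma" is a numbered paragraph of an article), with one split each for Italian, English, Spanish, and French.
  • en: English configuration with the full ruling records.
  • es: Spanish configuration with the full ruling records.
  • fr: French configuration with the full ruling records.
  • it: Italian configuration with the full ruling records.

Each configuration has its own data file paths, which are used to load the data for a specific language and split.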
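The mapping from configuration and split to data file path can be sketched as below. This is an illustration of the path layout declared in the dataset card (`constitution/it-*`, `en/train-*`, and so on); the function `data_file_pattern` is a hypothetical helper, not part of the `datasets` library.

```python
# Splits declared in the dataset card for each kind of configuration.
LANGUAGES = ("it", "en", "fr", "es")
RULING_SPLITS = ("train", "test", "validation")


def data_file_pattern(config: str, split: str) -> str:
    """Return the glob pattern the card declares for a given config and split."""
    if config == "constitution":
        # The constitution config has one split per language.
        if split not in LANGUAGES:
            raise ValueError(f"unknown constitution split: {split!r}")
    elif config in LANGUAGES:
        # Each language config has train, test, and validation splits.
        if split not in RULING_SPLITS:
            raise ValueError(f"unknown ruling split: {split!r}")
    else:
        raise ValueError(f"unknown config: {config!r}")
    return f"{config}/{split}-*"


print(data_file_pattern("constitution", "it"))
print(data_file_pattern("en", "validation"))
```

In practice the `datasets` library resolves these paths itself; for instance, `load_dataset("disi-unibo-nlp/COMMA", "constitution", split="it")` should select the Italian split of the `constitution` configuration without any manual path handling.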
