disi-unibo-nlp/COMMA

Name: disi-unibo-nlp/COMMA
Creator: disi-unibo-nlp
Published: 2024-05-21 16:32:14
License: not specified

Source: Hugging Face. Updated 2024-05-21; indexed 2024-06-11.

Download link:

https://hf-mirror.com/datasets/disi-unibo-nlp/COMMA

Resource description:

---
dataset_info:
- config_name: constitution
  features:
  - name: article_n
    dtype: int64
  - name: article_commas
    struct:
    - name: '1'
      dtype: string
    - name: '10'
      dtype: string
    - name: '11'
      dtype: string
    - name: '12'
      dtype: string
    - name: '2'
      dtype: string
    - name: '3'
      dtype: string
    - name: '4'
      dtype: string
    - name: '5'
      dtype: string
    - name: '6'
      dtype: string
    - name: '7'
      dtype: string
    - name: '8'
      dtype: string
    - name: '9'
      dtype: string
  splits:
  - name: it
    num_bytes: 70944
    num_examples: 139
  - name: en
    num_bytes: 69749
    num_examples: 139
  - name: fr
    num_bytes: 77000
    num_examples: 139
  - name: es
    num_bytes: 76072
    num_examples: 139
  download_size: 225640
  dataset_size: 293765
- config_name: en
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 555145830
    num_examples: 12600
  - name: test
    num_bytes: 30737608
    num_examples: 700
  - name: validation
    num_bytes: 31671019
    num_examples: 700
  download_size: 278441383
  dataset_size: 617554457
- config_name: es
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 575679719
    num_examples: 12600
  - name: test
    num_bytes: 31896832
    num_examples: 700
  - name: validation
    num_bytes: 32827830
    num_examples: 700
  download_size: 300803577
  dataset_size: 640404381
- config_name: fr
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 580985816
    num_examples: 12600
  - name: test
    num_bytes: 32177379
    num_examples: 700
  - name: validation
    num_bytes: 33152939
    num_examples: 700
  download_size: 306338176
  dataset_size: 646316134
- config_name: it
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 557553146
    num_examples: 12600
  - name: test
    num_bytes: 30850184
    num_examples: 700
  - name: validation
    num_bytes: 31775341
    num_examples: 700
  download_size: 293523614
  dataset_size: 620178671
configs:
- config_name: constitution
  data_files:
  - split: it
    path: constitution/it-*
  - split: en
    path: constitution/en-*
  - split: fr
    path: constitution/fr-*
  - split: es
    path: constitution/es-*
- config_name: en
  data_files:
  - split: train
    path: en/train-*
  - split: test
    path: en/test-*
  - split: validation
    path: en/validation-*
- config_name: es
  data_files:
  - split: train
    path: es/train-*
  - split: test
    path: es/test-*
  - split: validation
    path: es/validation-*
- config_name: fr
  data_files:
  - split: train
    path: fr/train-*
  - split: test
    path: fr/test-*
  - split: validation
    path: fr/validation-*
- config_name: it
  data_files:
  - split: train
    path: it/train-*
  - split: test
    path: it/test-*
  - split: validation
    path: it/validation-*
---

### Dataset Summary

COMMA is a constitutional multi-task and multi-lingual archive consisting of 14K CCIR rulings with expert-authored annotations. Its distinctive features make it a valuable object of study for broader NLP research.

### Languages

Italian, English, Spanish, French

## Dataset

### Data Fields

The dataset contains a list of instances (rulings); each instance contains the following data:

|                     Field |                                       Description |
| ------------------------: | ------------------------------------------------: |
|                        id |                             `(str)` The ruling ID |
|               ruling_type |                           `(int)` The ruling type |
|                  epigraph |                       `(str)` The ruling epigraph |
|                      body |                           `(str)` The ruling text |
|                  decision |                       `(str)` The ruling decision |
|               maxims_text |                 `(str)` The text of ruling maxims |
|              maxims_title |                `(str)` The title of ruling maxims |
|                 full_text |                      `(str)` The ruling full text |
|                num_maxims |                      `(int)` The number of maxims |
|                maxims_len |                      `(int)` The length of maxims |
|             full_text_len |               `(int)` The length of the full text |
|             judgment_type |                         `(int)` The judgment type |
| constitutional_parameters | `(List[List[str]])` The constitutional parameters |
|                    maxims |   `(dict)` The maxims' numbers, texts, and titles |

The example below shows how to load the data:

```python
from datasets import load_dataset

# Download the English configuration and load it as a DatasetDict (one Dataset per split).
comma_en = load_dataset("disi-unibo-nlp/COMMA", "en")

example = comma_en["validation"][0]  # The first instance of the dev set
print(example["full_text"])          # The full text (i.e., epigraph + body + decision) of the ruling
print(example["maxims_title"])       # The corresponding maxims title for the ruling
```

### Data Splits

|          IT | Instances |
| ----------: | --------: |
| Train (90%) |    12,600 |
|   Test (5%) |       700 |
|    Dev (5%) |       700 |

|          EN | Instances |
| ----------: | --------: |
| Train (90%) |    12,600 |
|   Test (5%) |       700 |
|    Dev (5%) |       700 |

|          ES | Instances |
| ----------: | --------: |
| Train (90%) |    12,600 |
|   Test (5%) |       700 |
|    Dev (5%) |       700 |

|          FR | Instances |
| ----------: | --------: |
| Train (90%) |    12,600 |
|   Test (5%) |       700 |
|    Dev (5%) |       700 |
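The card's usage example covers only the ruling configurations. The `constitution` configuration declared in the metadata stores each article as an `article_n` number plus an `article_commas` struct whose keys `'1'` to `'12'` are strings, so they sort as `'1', '10', '11', ...` rather than numerically. The sketch below shows one way to read a record in numeric order. The helper names `ordered_commas` and `load_constitution` are hypothetical, and it is an assumption (not stated in the card) that unused slots are `None` or empty strings.

```python
from typing import Mapping, Optional


def ordered_commas(article_commas: Mapping[str, Optional[str]]) -> list[tuple[int, str]]:
    """Return (comma_number, text) pairs in numeric order, skipping empty slots.

    ASSUMPTION: unused slots among the twelve keys hold None or "".
    """
    pairs = []
    for key, text in article_commas.items():
        if text:  # skips both None and the empty string
            pairs.append((int(key), text))
    return sorted(pairs)


def load_constitution(language: str = "en"):
    """Load one language split of the `constitution` config (needs network access)."""
    # Imported lazily so that ordered_commas() has no third-party dependency.
    from datasets import load_dataset

    return load_dataset("disi-unibo-nlp/COMMA", "constitution", split=language)


# Synthetic record for illustration only; this is not real dataset content.
sample = {
    "article_n": 1,
    "article_commas": {"1": "First clause.", "10": None, "2": "Second clause."},
}
print(ordered_commas(sample["article_commas"]))
```

With real data, `ordered_commas(load_constitution("it")[0]["article_commas"])` would apply the same logic to the first Italian record, provided the assumption about empty slots holds.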

Provider:

disi-unibo-nlp

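Summary of original information

Dataset overview

Dataset name

COMMA

Dataset description

COMMA is a multi-task, multilingual archive of 14K CCIR rulings with expert-authored annotations. Its distinctive features make it a valuable object of study for broader NLP research.

Supported languages

Italian
English
Spanish
French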

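Dataset structure

Data fields

id: the ruling ID (string)
ruling_type: the ruling type (integer)
epigraph: the ruling epigraph (string)
body: the ruling body text (string)
decision: the ruling decision (string)
maxims_text: the text of the ruling maxims (string)
maxims_title: the title of the ruling maxims (string)
full_text: the full text of the ruling (string)
num_maxims: the number of maxims (integer)
maxims_len: the length of the maxims (integer)
full_text_len: the length of the full text (integer)
judgment_type: the judgment type (integer)
constitutional_parameters: the constitutional parameters (described as a list of strings; the metadata declares the dtype as string)
maxims: the maxims' numbers, texts, and titles (described as a dictionary; the metadata declares the dtype as string)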
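The field descriptions present `constitutional_parameters` and `maxims` as structured values, while the metadata declares both with dtype `string`. If the strings turn out to be serialized lists or dictionaries (an assumption the source does not confirm), a tolerant decoder along these lines could recover the structure. The helper `parse_serialized` is hypothetical, and the sample values are invented for illustration.

```python
import ast
import json
from typing import Any


def parse_serialized(value: str) -> Any:
    """Decode a string as a JSON or Python-literal list/dict, else return it unchanged.

    ASSUMPTION: the field holds a serialized container. Anything that does not
    decode to a list or dict is handed back as the original string.
    """
    for decode in (json.loads, ast.literal_eval):
        try:
            decoded = decode(value)
        except (ValueError, SyntaxError, TypeError):
            continue  # not valid for this decoder; try the next one
        if isinstance(decoded, (list, dict)):
            return decoded
    return value


# Invented sample values, not taken from the dataset.
print(parse_serialized('[["Art. 3", "Cost."]]'))   # JSON-style list
print(parse_serialized("{'numbers': ['1', '2']}"))  # Python-literal dict
print(parse_serialized("plain text"))               # left as-is
```

Returning the input unchanged on failure keeps the helper safe to map over a whole column before the actual storage format has been inspected.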

Data splits (per language configuration)

Train: training set (90%, 12,600 instances)
Test: test set (5%, 700 instances)
Validation: validation set (5%, 700 instances)
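As a quick consistency check, the split percentages can be recomputed from the instance counts; the total also matches the "14K rulings" figure in the description. This is only arithmetic on the numbers listed above, and the constant name `SPLIT_SIZES` is illustrative.

```python
# Instance counts per split, as listed for each language configuration.
SPLIT_SIZES = {"train": 12_600, "test": 700, "validation": 700}

total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

print(total)  # 14000
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```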

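Dataset size

Download size: varies across the four language configurations, from 278,441,383 bytes (en) to 306,338,176 bytes (fr).
Dataset size: varies across the four language configurations, from 617,554,457 bytes (en) to 646,316,134 bytes (fr).
The constitution configuration is far smaller: 225,640 bytes to download and 293,765 bytes as a dataset.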
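To put the byte counts in more readable terms, the sketch below converts them to MiB and computes the download-to-dataset ratio for each language configuration. The figures are copied from the metadata; the names `SIZES` and `summarize` are illustrative, and reading the ratio as a compression factor rests on the usual Hugging Face convention that `download_size` covers the hosted files and `dataset_size` the loaded tables.

```python
# (download_size, dataset_size) in bytes, copied from the dataset metadata.
SIZES = {
    "en": (278_441_383, 617_554_457),
    "es": (300_803_577, 640_404_381),
    "fr": (306_338_176, 646_316_134),
    "it": (293_523_614, 620_178_671),
}

MIB = 1024 ** 2


def summarize(sizes):
    """Return per-config sizes in MiB and the download/dataset ratio."""
    rows = {}
    for config, (download, dataset) in sizes.items():
        rows[config] = {
            "download_mib": download / MIB,
            "dataset_mib": dataset / MIB,
            "ratio": download / dataset,
        }
    return rows


summary = summarize(SIZES)
for config, row in summary.items():
    print(
        f"{config}: download {row['download_mib']:.1f} MiB, "
        f"dataset {row['dataset_mib']:.1f} MiB, ratio {row['ratio']:.2f}"
    )
```

Each download comes out at a little under half of the corresponding dataset size.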

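Dataset configurations

constitution: contains article numbers (article_n) and the text of each article's numbered clauses (article_commas, up to 12 per article), with one split each for Italian, English, French, and Spanish.
en: English configuration, containing the detailed ruling information.
es: Spanish configuration, containing the detailed ruling information.
fr: French configuration, containing the detailed ruling information.
it: Italian configuration, containing the detailed ruling information.

Each configuration maps to its own data file paths, which are used to load a specific language and split.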
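The mapping from configuration and split to data files follows a single `<config>/<split>-*` pattern in the metadata. The sketch below encodes that mapping and validates a request before loading. `CONFIG_SPLITS`, `data_file_pattern`, and `load_split` are hypothetical names introduced here; only `load_dataset` is part of the `datasets` library.

```python
# Splits available in each configuration, as declared in the metadata.
CONFIG_SPLITS = {
    "constitution": ["it", "en", "fr", "es"],
    "en": ["train", "test", "validation"],
    "es": ["train", "test", "validation"],
    "fr": ["train", "test", "validation"],
    "it": ["train", "test", "validation"],
}


def data_file_pattern(config: str, split: str) -> str:
    """Return the glob pattern for a config/split pair, or raise ValueError."""
    if split not in CONFIG_SPLITS.get(config, []):
        raise ValueError(f"unknown config/split combination: {config!r}/{split!r}")
    return f"{config}/{split}-*"


def load_split(config: str, split: str):
    """Validate the request, then load it from the Hub (needs network access)."""
    data_file_pattern(config, split)  # raises early on a bad combination
    from datasets import load_dataset  # lazy import keeps the helpers dependency-free

    return load_dataset("disi-unibo-nlp/COMMA", config, split=split)


print(data_file_pattern("fr", "validation"))  # fr/validation-*
print(data_file_pattern("constitution", "it"))  # constitution/it-*
```

Note that the `constitution` configuration uses language codes as split names, so a request such as `load_split("constitution", "train")` is rejected before any download starts.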
