disi-unibo-nlp/COMMA
Source: Hugging Face. Updated 2024-05-21; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/disi-unibo-nlp/COMMA
Resource description:
---
dataset_info:
- config_name: constitution
  features:
  - name: article_n
    dtype: int64
  - name: article_commas
    struct:
    - name: '1'
      dtype: string
    - name: '10'
      dtype: string
    - name: '11'
      dtype: string
    - name: '12'
      dtype: string
    - name: '2'
      dtype: string
    - name: '3'
      dtype: string
    - name: '4'
      dtype: string
    - name: '5'
      dtype: string
    - name: '6'
      dtype: string
    - name: '7'
      dtype: string
    - name: '8'
      dtype: string
    - name: '9'
      dtype: string
  splits:
  - name: it
    num_bytes: 70944
    num_examples: 139
  - name: en
    num_bytes: 69749
    num_examples: 139
  - name: fr
    num_bytes: 77000
    num_examples: 139
  - name: es
    num_bytes: 76072
    num_examples: 139
  download_size: 225640
  dataset_size: 293765
- config_name: en
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 555145830
    num_examples: 12600
  - name: test
    num_bytes: 30737608
    num_examples: 700
  - name: validation
    num_bytes: 31671019
    num_examples: 700
  download_size: 278441383
  dataset_size: 617554457
- config_name: es
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 575679719
    num_examples: 12600
  - name: test
    num_bytes: 31896832
    num_examples: 700
  - name: validation
    num_bytes: 32827830
    num_examples: 700
  download_size: 300803577
  dataset_size: 640404381
- config_name: fr
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 580985816
    num_examples: 12600
  - name: test
    num_bytes: 32177379
    num_examples: 700
  - name: validation
    num_bytes: 33152939
    num_examples: 700
  download_size: 306338176
  dataset_size: 646316134
- config_name: it
  features:
  - name: id
    dtype: string
  - name: ruling_type
    dtype: int64
  - name: epigraph
    dtype: string
  - name: body
    dtype: string
  - name: decision
    dtype: string
  - name: maxims_text
    dtype: string
  - name: maxims_title
    dtype: string
  - name: full_text
    dtype: string
  - name: num_maxims
    dtype: int64
  - name: maxims_len
    dtype: int64
  - name: full_text_len
    dtype: int64
  - name: judgment_type
    dtype: int64
  - name: constitutional_parameters
    dtype: string
  - name: maxims
    dtype: string
  splits:
  - name: train
    num_bytes: 557553146
    num_examples: 12600
  - name: test
    num_bytes: 30850184
    num_examples: 700
  - name: validation
    num_bytes: 31775341
    num_examples: 700
  download_size: 293523614
  dataset_size: 620178671
configs:
- config_name: constitution
  data_files:
  - split: it
    path: constitution/it-*
  - split: en
    path: constitution/en-*
  - split: fr
    path: constitution/fr-*
  - split: es
    path: constitution/es-*
- config_name: en
  data_files:
  - split: train
    path: en/train-*
  - split: test
    path: en/test-*
  - split: validation
    path: en/validation-*
- config_name: es
  data_files:
  - split: train
    path: es/train-*
  - split: test
    path: es/test-*
  - split: validation
    path: es/validation-*
- config_name: fr
  data_files:
  - split: train
    path: fr/train-*
  - split: test
    path: fr/test-*
  - split: validation
    path: fr/validation-*
- config_name: it
  data_files:
  - split: train
    path: it/train-*
  - split: test
    path: it/test-*
  - split: validation
    path: it/validation-*
---
### Dataset Summary
COMMA is a multi-task, multilingual constitutional archive of 14K CCIR rulings with expert-authored annotations. Its distinctive features make it a valuable object of study for broader NLP research.
### Languages
Italian, English, Spanish, French
## Dataset
### Data Fields
The dataset contains a list of instances (rulings); each instance contains the following data:
| Field | Description |
|-------------------------: | ------------------------------------------------: |
| id | `(str)` The ruling ID |
| ruling_type | `(int)` The ruling type |
| epigraph | `(str)` The ruling epigraph |
| body | `(str)` The ruling body text |
| decision | `(str)` The ruling decision |
| maxims_text | `(str)` The text of ruling maxims |
| maxims_title | `(str)` The title of ruling maxims |
| full_text | `(str)` The full text of the ruling (epigraph + body + decision) |
| num_maxims | `(int)` The number of maxims |
| maxims_len | `(int)` The length of maxims |
| full_text_len | `(int)` The length of the full text |
| judgment_type | `(int)` The judgment type |
| constitutional_parameters | `(List[List[str]])` The constitutional parameters |
| maxims | `(dict)` The maxims' numbers, texts, and titles |
The example below shows how to load the data:
```python
from datasets import load_dataset

# Download the English configuration and load it as a DatasetDict
# with "train", "test", and "validation" splits.
comma_en = load_dataset("disi-unibo-nlp/COMMA", "en")

example = comma_en["validation"][0]  # The first instance of the dev set
print(example["full_text"])  # The full text (i.e., epigraph + body + decision) of the ruling
print(example["maxims_title"])  # The corresponding maxims title for the ruling
```
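The front-matter metadata declares `constitutional_parameters` and `maxims` with `dtype: string`, while the table above describes them as a nested list and a dict. One possible reading is that these fields are stored as serialized strings. The sketch below decodes such a value under that assumption, trying JSON first and a Python literal second. The helper name `parse_serialized`, the sample record, and the keys inside the sample `maxims` value are hypothetical and not taken from the dataset.

```python
import ast
import json
from typing import Any


def parse_serialized(value: Any) -> Any:
    """Decode a field that may be stored as a serialized string.

    Assumption: the string is either JSON or a Python literal. Anything
    that cannot be decoded is returned unchanged.
    """
    if not isinstance(value, str):
        return value  # Already a list/dict (or None): nothing to decode.
    for decoder in (json.loads, ast.literal_eval):
        try:
            return decoder(value)
        except (ValueError, SyntaxError):
            continue  # Try the next decoder.
    return value


# Hypothetical record, made up for illustration (not real dataset content).
record = {
    "constitutional_parameters": '[["Constitution", "art. 3"]]',
    "maxims": "{'numbers': ['1'], 'titles': ['Example title']}",
}

params = parse_serialized(record["constitutional_parameters"])
maxims = parse_serialized(record["maxims"])
print(type(params).__name__, type(maxims).__name__)
```

If the fields already arrive as lists or dicts, the helper passes them through untouched, so it is safe to apply either way.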
### Data Splits
| IT | Instances |
| ----------: | --------: |
| Train (90%) | 12,600 |
| Test (5%) | 700 |
| Dev (5%) | 700 |

| EN | Instances |
| ----------: | --------: |
| Train (90%) | 12,600 |
| Test (5%) | 700 |
| Dev (5%) | 700 |

| ES | Instances |
| ----------: | --------: |
| Train (90%) | 12,600 |
| Test (5%) | 700 |
| Dev (5%) | 700 |

| FR | Instances |
| ----------: | --------: |
| Train (90%) | 12,600 |
| Test (5%) | 700 |
| Dev (5%) | 700 |
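As a quick sanity check, the percentages in the split tables can be recomputed from the instance counts. The short sketch below does that arithmetic; the variable names are illustrative only, and the counts are copied from the tables above.

```python
# Split sizes as reported in the tables above (identical for IT, EN, ES, FR).
split_sizes = {"train": 12_600, "test": 700, "validation": 700}

total = sum(split_sizes.values())
shares = {name: 100 * n / total for name, n in split_sizes.items()}

for name, share in shares.items():
    print(f"{name}: {split_sizes[name]:,} rulings ({share:.0f}%)")
print(f"total: {total:,} rulings per language")
```

The total of 14,000 rulings per language agrees with the "14K" figure in the dataset summary.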
Provider:
disi-unibo-nlp
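Summary of original information
Dataset overview
Dataset name
COMMA
Dataset description
COMMA is a multi-task, multilingual archive of 14K CCIR rulings with expert-authored annotations. Its distinctive features make it a valuable object of study for broader NLP research.
Supported languages
- Italian
- English
- Spanish
- French
Dataset structure
Data fields
- id: ruling ID (string)
- ruling_type: ruling type (integer)
- epigraph: ruling epigraph (string)
- body: ruling body text (string)
- decision: ruling decision (string)
- maxims_text: text of the ruling maxims (string)
- maxims_title: title of the ruling maxims (string)
- full_text: full text of the ruling (string)
- num_maxims: number of maxims (integer)
- maxims_len: length of the maxims (integer)
- full_text_len: length of the full text (integer)
- judgment_type: judgment type (integer)
- constitutional_parameters: constitutional parameters (list of strings)
- maxims: maxim information (dict)
Data splits
- Train: training set (90%, 12,600 instances)
- Test: test set (5%, 700 instances)
- Validation: validation set (5%, 700 instances)
Dataset size
- Download size: varies across the per-language ruling configurations, from 278441383 bytes (en) to 306338176 bytes (fr).
- Dataset size: varies across the per-language ruling configurations, from 617554457 bytes (en) to 646316134 bytes (fr).
Dataset configurations
- constitution: contains article numbers (`article_n`) and the numbered clauses of each article (`article_commas`); available in Italian, English, Spanish, and French.
- en: English configuration with detailed ruling information.
- es: Spanish configuration with detailed ruling information.
- fr: French configuration with detailed ruling information.
- it: Italian configuration with detailed ruling information.
Each configuration has its own data file paths, used to load the data for a specific language and split.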
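The `constitution` configuration is organized differently from the ruling configurations: its splits are the language codes (`it`, `en`, `fr`, `es`), and each record holds an `article_n` plus an `article_commas` struct keyed by the strings `'1'` to `'12'`. The sketch below shows one way to load a language split and read an article's clauses in numeric order. The helper names and the sample article are hypothetical, and it is an assumption that unused clause slots hold `None` or an empty string.

```python
from typing import Dict, List, Optional


def ordered_commas(article_commas: Dict[str, Optional[str]]) -> List[str]:
    """Return the non-empty clauses of one article in numeric order.

    The struct keys are strings ('1' ... '12'), so they are sorted as
    integers; plain string sorting would put '10' before '2'.
    Assumption: unused slots hold None or an empty string.
    """
    return [
        text
        for _, text in sorted(article_commas.items(), key=lambda kv: int(kv[0]))
        if text
    ]


def load_constitution(lang: str = "en"):
    """Load one language split of the `constitution` configuration.

    Needs the `datasets` package and network access, so the import is
    kept inside the function.
    """
    from datasets import load_dataset

    return load_dataset("disi-unibo-nlp/COMMA", "constitution", split=lang)


# Hypothetical article, made up for illustration (not real dataset text).
sample = {"1": "First clause.", "10": None, "2": "Second clause.", "3": ""}
print(ordered_commas(sample))
```

With network access, `ordered_commas(load_constitution("it")[0]["article_commas"])` would apply the same helper to the first Italian record.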



