taln-ls2n/ARRContributions
Hugging Face · Updated 2025-11-04 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/taln-ls2n/ARRContributions
Resource description:
---
license: cc-by-nc-4.0
language:
- en
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
  - split: test_annotated
    path: data/test_annotated-*
dataset_info:
  features:
  - name: acl_id
    dtype: string
  - name: title
    dtype: string
  - name: abstract
    dtype: string
  - name: conference_name
    dtype: string
  - name: conference_track
    dtype: string
  - name: year
    dtype: int64
  - name: url
    dtype: string
  - name: contribution_types
    sequence: string
  - name: openreview_id
    dtype: string
  - name: openreview_cycle
    dtype: string
  - name: openreview_history
    list:
    - name: contribution_types
      sequence: string
    - name: contribution_types_has_changed
      dtype: bool
    - name: cycle
      dtype: string
    - name: id
      dtype: string
  - name: article_content
    dtype: string
  splits:
  - name: train
    num_bytes: 110792374
    num_examples: 1621
  - name: validation
    num_bytes: 15470469
    num_examples: 222
  - name: test
    num_bytes: 13985449
    num_examples: 207
  - name: test_annotated
    num_bytes: 13984522
    num_examples: 207
  download_size: 75156873
  dataset_size: 154232814
---
# ARRContributions: A Dataset of Contribution Types from ARR Papers
## About
ARRContributions is a dataset of more than 2,000 papers submitted to ACL Rolling Review (ARR) through [OpenReview](https://openreview.net/group?id=aclweb.org/ACL/ARR), each with the contribution types declared at submission.
Authors must specify [contribution types](https://aclrollingreview.org/cfp#scope-of-submissions) when submitting to ARR.
The ARR typology [(Rogers et al., 2023)](https://aclanthology.org/2023.acl-long.911/) defines 11 contribution types that authors can select from to best characterize their work:
(1) NLP engineering experiment (e.g., methods improving state-of-the-art results),
(2) approaches for low-compute settings and efficiency,
(3) approaches for low-resource settings,
(4) data resources,
(5) data analysis,
(6) model analysis and interpretability,
(7) reproduction studies,
(8) position papers,
(9) surveys,
(10) theory, and
(11) publicly available software and pre-trained models.
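
Because a paper can carry several of these types at once, the task is naturally multi-label. The sketch below shows one way to turn a list of types into an 11-dimensional 0/1 vector. The label strings are copied from the list above as an assumption; the strings actually stored in `contribution_types` may be spelled differently, so check them before reusing this mapping.

```python
from typing import Iterable, List

# Assumption: label strings mirror the typology list above.
# The dataset's actual strings may differ and should be verified.
CONTRIBUTION_TYPES: List[str] = [
    "NLP engineering experiment",
    "approaches for low-compute settings and efficiency",
    "approaches for low-resource settings",
    "data resources",
    "data analysis",
    "model analysis and interpretability",
    "reproduction studies",
    "position papers",
    "surveys",
    "theory",
    "publicly available software and pre-trained models",
]


def to_multi_hot(labels: Iterable[str]) -> List[int]:
    """Encode a set of contribution types as a 0/1 vector in typology order."""
    selected = set(labels)
    unknown = selected - set(CONTRIBUTION_TYPES)
    if unknown:
        # Fail loudly rather than silently dropping an unexpected label.
        raise ValueError(f"Unknown contribution types: {sorted(unknown)}")
    return [1 if name in selected else 0 for name in CONTRIBUTION_TYPES]


vector = to_multi_hot(["data resources", "theory"])
```

Raising on unknown labels is a deliberate choice here: it surfaces any mismatch between this assumed list and the real label strings early.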
## Content
The following data fields are available:
| **Feature** | **Type** | **Description** |
| -------------------- | -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `acl_id` | `string` | Unique identifier of the paper in the ACL Anthology. |
| `title` | `string` | Title of the paper. |
| `abstract` | `string` | Abstract of the paper. |
| `conference_name` | `string` | Name of the conference (e.g., *acl*, *emnlp*, *eacl*). |
| `conference_track` | `string` | Track or submission category within the conference. |
| `year` | `int64` | Year of publication. |
| `url` | `string` | ACL Anthology link to the paper. |
| `contribution_types` | `list[string]` | List of contribution types selected according to the ARR typology (Rogers et al., 2023), e.g., *data resources*, *model analysis*, *theory*. |
| `openreview_id` | `string` | Unique OpenReview submission ID. |
| `openreview_cycle` | `string` | Review cycle or round associated with the OpenReview submission. |
| `openreview_history` | `list[object]` | List of previous submission records for the same paper when available. Each record includes: <br>• `contribution_types` (`list[string]`): Contribution types selected in that cycle. <br>• `contribution_types_has_changed` (`bool`): Whether the contribution types differ from the previous cycle. <br>• `cycle` (`string`): The OpenReview cycle name. <br>• `id` (`string`): The OpenReview submission ID. |
| `article_content` | `string` | Full text of the paper (extracted using [nougat](https://github.com/facebookresearch/nougat)). |
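
To make the nested `openreview_history` field more concrete, here is a minimal sketch that lists the cycles in which a paper's contribution types changed. The record is hypothetical: every value, including the cycle names and IDs, is a placeholder invented for illustration, and the helper `changed_cycles` is not part of the dataset.

```python
from typing import Any, Dict, List

# Hypothetical record: all values are placeholders, not real data.
record: Dict[str, Any] = {
    "acl_id": "example-acl-id",
    "title": "An Example Paper",
    "contribution_types": ["data resources", "data analysis"],
    "openreview_id": "example-id-2",
    "openreview_cycle": "cycle-B",
    "openreview_history": [
        {
            "contribution_types": ["data resources"],
            "contribution_types_has_changed": False,
            "cycle": "cycle-A",
            "id": "example-id-1",
        },
        {
            "contribution_types": ["data resources", "data analysis"],
            "contribution_types_has_changed": True,
            "cycle": "cycle-B",
            "id": "example-id-2",
        },
    ],
}


def changed_cycles(rec: Dict[str, Any]) -> List[str]:
    """Return the cycles whose entry is flagged as a change in contribution types."""
    history = rec.get("openreview_history") or []  # tolerate a missing or empty history
    return [entry["cycle"] for entry in history if entry["contribution_types_has_changed"]]


print(changed_cycles(record))
```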
We split the dataset into training, validation, and test sets using an 80-10-10 ratio, applying multi-label stratification to keep the label distribution balanced across splits.
The test set was manually annotated by three independent annotators to establish an additional gold-standard labeling.
We provide both the original test annotations from the dataset authors and the consensus annotations from the three annotators as separate splits.
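
As a rough illustration of how the splits can be used, the sketch below first recomputes the split shares from the example counts in the card metadata, then compares two label sets per paper with exact match and Jaccard overlap. It assumes that `test` and `test_annotated` cover the same papers and can be aligned on `acl_id`; the card does not state this explicitly. The toy dictionaries and the helper names `jaccard` and `compare_labels` are hypothetical.

```python
from typing import Dict, Iterable, List

# Example counts from the card metadata (test_annotated is left out of the total
# on the assumption that it re-labels the same test papers).
SPLIT_SIZES = {"train": 1621, "validation": 222, "test": 207}
total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard overlap of two label sets; two empty sets count as identical."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def compare_labels(
    original: Dict[str, List[str]], consensus: Dict[str, List[str]]
) -> Dict[str, float]:
    """Compare label sets for papers present in both mappings (keyed by paper ID)."""
    shared = sorted(original.keys() & consensus.keys())
    if not shared:
        raise ValueError("No shared paper IDs to compare.")
    scores = [jaccard(original[key], consensus[key]) for key in shared]
    return {
        "exact_match": sum(1 for score in scores if score == 1.0) / len(scores),
        "mean_jaccard": sum(scores) / len(scores),
    }


# Toy data, invented for illustration only.
original = {"p1": ["data resources", "theory"], "p2": ["surveys"]}
consensus = {"p1": ["data resources"], "p2": ["surveys"]}
result = compare_labels(original, consensus)
```

With real data, the two mappings could be built from the `acl_id` and `contribution_types` fields of each split, provided the alignment assumption holds.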
## Licence
**Dataset:** CC BY-NC 4.0
**Original papers:** CC BY 4.0 (retain attribution)
If you use this dataset:
- You may use, share, and adapt the dataset for **non-commercial research or educational purposes only**.
- You must attribute both the dataset creators and the original ACL Anthology authors for any content used.
## Citation
```
@misc{,
title={},
author={},
year={},
eprint={},
archivePrefix={},
primaryClass={}
}
```
Provided by:
taln-ls2n



