
launch/ampere

Source: Hugging Face. Updated: 2022-11-09. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/launch/ampere
Resource description:
---
annotations_creators:
- expert-generated
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
task_categories:
- text-classification
task_ids: []
pretty_name: AMPERE
---

# Dataset Card for AMPERE

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Dataset Creation](#dataset-creation)

## Dataset Description

This dataset is released together with our NAACL 2019 paper "[`Argument Mining for Understanding Peer Reviews`](https://aclanthology.org/N19-1219/)". If you find our work useful, please cite:

```
@inproceedings{hua-etal-2019-argument,
    title = "Argument Mining for Understanding Peer Reviews",
    author = "Hua, Xinyu and Nikolov, Mitko and Badugu, Nikhil and Wang, Lu",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N19-1219",
    doi = "10.18653/v1/N19-1219",
    pages = "2131--2137",
}
```

This dataset includes 400 scientific peer reviews collected from ICLR 2018, hosted on the OpenReview platform. Each review is segmented into multiple propositions, and the original untokenized text of each proposition is included. Each proposition is labeled as one of the following types:

- **evaluation**: a proposition that is not objectively verifiable and does not require any action to be performed, such as a qualitative judgement or an interpretation of the paper, e.g. "The paper shows nice results on a number of small tasks."
- **request**: a proposition that is not objectively verifiable and suggests a course of action to be taken, such as a recommendation or a suggestion for new experiments, e.g. "I would really like to see how the method performs without this hack."
- **fact**: a proposition that is verifiable with objective evidence, such as a mathematical conclusion or common knowledge of the field, e.g. "This work proposes a dynamic weight update scheme."
- **quote**: a quote from the paper or another source, e.g. "The author wrote 'where r is lower bound of feature norm'."
- **reference**: a proposition that refers to objective evidence, such as a URL or a citation, e.g. "see MuseGAN (Dong et al), MidiNet (Yang et al), etc."
- **non-arg**: a non-argumentative discourse unit that does not contribute to the overall agenda of the review, such as greetings, metadata, and clarification questions, e.g. "Aha, now I understand."

## Dataset Structure

The dataset is partitioned into train/val/test sets. Each set is uploaded as a JSONL file, in which each line contains the following elements:

- `doc_id` (str): a unique id for the review document
- `text` (list[str]): a list of segmented propositions
- `labels` (list[str]): a list of labels corresponding to the propositions

An example looks as follows (pretty-printed here; the review text is reproduced verbatim, including the reviewer's own spelling).

```
{
    "doc_id": "H1WORsdlG",
    "text": [
        "This paper addresses the important problem of understanding mathematically how GANs work.",
        "The approach taken here is to look at GAN through the lense of the scattering transform.",
        "Unfortunately the manuscrit submitted is very poorly written.",
        "Introduction and flow of thoughts is really hard to follow.",
        "In method sections, the text jumps from one concept to the next without proper definitions.",
        "Sorry I stopped reading on page 3.",
        "I suggest to rewrite this work before sending it to review.",
        "Among many things: - For citations use citep and not citet to have () at the right places.",
        "- Why does it seems -> Why does it seem etc."
    ],
    "labels": [
        "fact",
        "fact",
        "evaluation",
        "evaluation",
        "evaluation",
        "evaluation",
        "request",
        "request",
        "request"
    ]
}
```

## Dataset Creation

Human annotators are first asked to read the above definitions and the controversial cases carefully. The dataset to be annotated consists of 400 reviews partitioned into 20 batches. Each annotator follows these steps:

- Step 1: Open a review file with a text editor. The unannotated review file contains only one line; separate it into multiple lines, with each line corresponding to a single proposition. Repeat this for all 400 reviews.
- Step 2: Based on the segmented units, label the type of each proposition. Start labeling at the end of each file with the marker "## Labels:". Indicate the line number of the proposition first, then annotate the type, e.g. "1. evaluation" for the first proposition. Repeat this for all 400 reviews.

A third annotator then resolves the disagreements between the two annotators on both segmentation and proposition type.
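The annotation file layout described in the two steps above can be read back with a few lines of Python. This is only a sketch under assumptions: the card does not publish the raw annotation files, so the exact layout (one proposition per line, then a final "## Labels:" marker, then lines such as "1. evaluation") is inferred from the description, and the function name `parse_annotated_review` is hypothetical.

```python
def parse_annotated_review(content):
    """Split an annotated review into (propositions, labels).

    Assumed layout (inferred from the card, not confirmed by it):
    one proposition per line, then "## Labels:", then "N. type" lines.
    """
    # The marker is placed at the end of the file, so split on its last occurrence.
    body, marker, label_part = content.rpartition("## Labels:")
    if not marker:
        raise ValueError('missing "## Labels:" marker')

    propositions = [ln.strip() for ln in body.splitlines() if ln.strip()]

    # Map each 1-based proposition number to its annotated type.
    labels_by_number = {}
    for ln in label_part.splitlines():
        ln = ln.strip()
        if not ln:
            continue
        number, _, label = ln.partition(".")
        labels_by_number[int(number)] = label.strip()

    expected = set(range(1, len(propositions) + 1))
    if set(labels_by_number) != expected:
        raise ValueError("label numbers do not match the proposition lines")

    labels = [labels_by_number[i] for i in range(1, len(propositions) + 1)]
    return propositions, labels


# Small in-memory example built from two propositions in the card's sample.
SAMPLE_FILE = (
    "Unfortunately the manuscrit submitted is very poorly written.\n"
    "I suggest to rewrite this work before sending it to review.\n"
    "## Labels:\n"
    "1. evaluation\n"
    "2. request\n"
)

if __name__ == "__main__":
    props, labs = parse_annotated_review(SAMPLE_FILE)
    for prop, lab in zip(props, labs):
        print(f"{lab}: {prop}")
```

Checking that the label numbers match the proposition lines catches the most likely manual error, a segmentation change made after the labels were written.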
Provided by:
launch
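Summary of original information

Dataset overview

Dataset name

  • Name: AMPERE

Dataset attributes

  • Language: English (en)
  • License: CC-BY-4.0
  • Multilinguality: monolingual
  • Task category: text classification

Dataset content

  • Source: 400 scientific peer reviews collected from ICLR 2018.
  • Structure: each review is segmented into multiple propositions, and each proposition is labeled as one of the following types:
    • evaluation
    • request
    • fact
    • quote
    • reference
    • non-arg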

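Dataset structure

  • Format: JSONL
  • Components:
    • doc_id: unique ID of the review document
    • text: list of segmented propositions
    • labels: list of labels corresponding to the propositions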
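As an illustration of the JSONL structure above, the following sketch parses records line by line, checks that `text` and `labels` have the same length, and counts label frequencies. The helper names `parse_records` and `label_counts` and the file name `train.jsonl` in the comment are assumptions for illustration, not part of the dataset release; the six label strings come from the dataset card.

```python
import json
from collections import Counter

# The six proposition types named in the dataset card.
LABEL_TYPES = {"evaluation", "request", "fact", "quote", "reference", "non-arg"}


def parse_records(lines):
    """Parse an iterable of JSONL lines into validated record dicts."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Every proposition must have exactly one label.
        if len(record["text"]) != len(record["labels"]):
            raise ValueError(f"line {line_no}: text/labels length mismatch")
        unknown = set(record["labels"]) - LABEL_TYPES
        if unknown:
            raise ValueError(f"line {line_no}: unknown labels {sorted(unknown)}")
        records.append(record)
    return records


def label_counts(records):
    """Count how often each proposition type occurs across all records."""
    return Counter(label for rec in records for label in rec["labels"])


# In-memory sample: two propositions taken from the card's example record.
SAMPLE_LINES = [
    json.dumps({
        "doc_id": "H1WORsdlG",
        "text": [
            "Sorry I stopped reading on page 3.",
            "I suggest to rewrite this work before sending it to review.",
        ],
        "labels": ["evaluation", "request"],
    })
]

if __name__ == "__main__":
    # For a real split, pass an open file instead, e.g.
    # open("train.jsonl", encoding="utf-8")  (file name is an assumption).
    records = parse_records(SAMPLE_LINES)
    print(label_counts(records))
```

Because a file object iterates over its lines, the same `parse_records` call works unchanged on a downloaded split.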

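Dataset creation

  • Annotation process:
    • Annotators first read the definitions and the controversial cases.
    • They split each unannotated review file into propositions and label the type of each proposition.
    • A third annotator resolves disagreements on segmentation and proposition type.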