ReviewRobot Dataset

Name: ReviewRobot Dataset
Creator: OpenDataLab
License: No description provided

Indexed by OpenXLab on 2026-04-18

Download link:

https://openxlab.org.cn/datasets/OpenDataLab/ReviewRobot_Dataset


Resource overview:


ReviewRobot Dataset Overview

This repository contains the data for the paper *ReviewRobot: Explainable Paper Review Generation via Knowledge Synthesis*.

Dataset

The repository includes three folders: Raw_data, IE_result, and KGs.

Raw_data folder

The Raw_data folder has two parts: the background corpus and the paper-review corpus. We created the background corpus by selecting machine-learning papers from the Semantic Scholar Open Research Corpus. It contains the titles and abstracts of papers published from 1965 to 2019 (inclusive). The paper-review corpus contains parsed paper PDFs and their corresponding reviews. The paper-review pairs in the acl_2017 and iclr_2017 folders come from the PeerRead dataset. We obtained the remaining pairs from OpenReview and NeurIPS. We parsed the PDFs with GROBID. In each folder, metadata.txt contains all human-written reviews, and the txt/ folder contains all processed papers.

IE_result folder

The IE_result folder contains the information extraction results from SciIE. In each group, the *_json/ folder holds the tokenized text, and the *_output/ folder holds the IE results for that tokenized text.

* Background_IE contains one group (two subfolders) covering the abstracts of all papers published between 1965 and 2019.
* Paper-review_IE contains two groups (four subfolders). In the first group, iclrnipsabs_json and iclrnipsabs_output hold the IE results for the abstracts of the paper-review corpus. In the second group, iclrnips_json and iclrnips_output hold the IE results for the rest of each paper in the paper-review corpus.

KGs folder

The KGs folder contains the knowledge graphs (KGs) built from IE_result.

back_kg

The back_kg folder contains the background KGs built up to a given year. Each year has three files. Taking 2012 as an example:

* 2012.pkl stores the background knowledge graph up to and including 2012. It is a dictionary with 6 fields:
  * num_doc: the number of papers up to that year
  * cluster2entity: the mapping from entities to their mentions
  * entity2cluster: the mapping from mentions to their corresponding entities
  * cluster2type: the mapping from entities to their types
  * entity: all mentions in the current KG
  * relationship: all relationships in the current KG
* 2012_key.pkl contains the mapping from knowledge elements to paper IDs. It has two fields: cluster maps entities to their corresponding paper IDs, and relation maps relationships to their corresponding paper IDs.
* 2012_paper contains the mapping from paper IDs to paper titles.

idea_kg

The idea_kg folder contains the idea KGs constructed from paper abstracts and conclusions. Each line corresponds to one paper in the corpus and has the following fields:

* id: the paper ID
* abs_num: the number of abstract sentences
* send: all sentences related to the idea_kg
* entity: all mentions in the current KG
* cluster2sent: the mapping from entities to their corresponding sentence IDs
* entity2num: the number of occurrences of each mention
* relation2num: the number of occurrences of each relationship
* cluster2entity: the mapping from entities to their mentions
* entity2type: the mapping from mentions to their types
* relationship: all relationships in the current KG
* relation2sent: the mapping from relationships to their corresponding sentence IDs
* entity2cluster: the mapping from mentions to their corresponding entities

related_kg

The related_kg folder contains the related-work KGs built from the related work section of each paper. It has the same structure as idea_kg.

contribute_kg

The contribute_kg folder contains the contribution KGs constructed from the contribution part of each paper (under the introduction section) and its experimental section. It is a dictionary with 4 fields:

* id: the paper ID
* total: the number of entities covered in the contribution section
* covered: the number of entities covered in the experimental section
* send: the relevant sentences from both sections that cover these entities

future_kg

The future_kg folder contains the future-work KGs built from the future work section of each paper. It has the same structure as idea_kg.

Review-annotation

The Review-annotation folder contains manual annotations of review categories and of paper-review sentence pairs.

review.txt contains the review category annotations:

* "SUMMARY": 236 sentences
* "NOVELTY": 33 sentences
* "SOUNDNESS_CORRECTNESS": 174 sentences
* "MEANINGFUL_COMPARISON": 16 sentences
* "IMPACT": 14 sentences

pair.txt contains 2,535 paper-review pairs. For each pair, the first slot is the review sentence, the second slot is the paper sentence, and the third slot is the label, where 0 means the two sentences are unrelated and 1 means they are related.

License: Creative Commons Attribution 4.0 International (CC BY 4.0)
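As an illustration of how the six fields of a back_kg file such as 2012.pkl relate to one another, here is a minimal Python sketch. The field names come from the description above; everything else is an assumption made for the example: the value types (plain dicts and lists), the toy entity and its "Method" type, and the helper names load_kg and describe_mention. The real files may be shaped differently, so inspect one before relying on this.

```python
import io
import pickle

# Toy stand-in for a file such as 2012.pkl. The six keys come from the
# dataset description; the value types and contents are assumptions.
toy_kg = {
    "num_doc": 2,
    "cluster2entity": {"neural network": ["neural network", "neural networks", "NN"]},
    "entity2cluster": {
        "neural network": "neural network",
        "neural networks": "neural network",
        "NN": "neural network",
    },
    "cluster2type": {"neural network": "Method"},
    "entity": ["neural network", "neural networks", "NN"],
    "relationship": [],
}


def load_kg(fileobj):
    """Unpickle a background KG from a binary file object (hypothetical helper)."""
    return pickle.load(fileobj)


def describe_mention(kg, mention):
    """Resolve a mention to (entity, type, all mentions), or None if unknown."""
    entity = kg["entity2cluster"].get(mention)
    if entity is None:
        return None
    entity_type = kg["cluster2type"].get(entity)
    mentions = kg["cluster2entity"].get(entity, [])
    return entity, entity_type, mentions


# Round-trip through pickle in memory instead of reading a real file.
kg = load_kg(io.BytesIO(pickle.dumps(toy_kg)))
result = describe_mention(kg, "NN")
print(result)
```

With the real data, the file object would come from something like open("2012.pkl", "rb"). Since pickle can execute arbitrary code while loading, only unpickle files obtained from a source you trust.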
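The three-slot layout of pair.txt (review sentence, paper sentence, 0/1 label) can be read with a few lines of Python. The description does not state how the slots are separated, so this sketch assumes one pair per line with tab-separated slots; the separator is a parameter for that reason. The Pair type, the parse_pair_line helper, and the sample sentences are all made up for illustration and are not taken from the dataset.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    review_sentence: str
    paper_sentence: str
    relevant: bool  # True for label 1 (related), False for label 0 (unrelated)


def parse_pair_line(line, sep="\t"):
    """Parse one line into a Pair. The tab separator is an assumption."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 3:
        raise ValueError(f"expected 3 slots, got {len(parts)}")
    review, paper, label = parts
    if label not in ("0", "1"):
        raise ValueError(f"label must be 0 or 1, got {label!r}")
    return Pair(review, paper, label == "1")


# Invented sample lines in the assumed format, not real dataset content.
sample = [
    "The comparison to prior work is thin.\tWe compare against two baselines.\t1\n",
    "The paper is well written.\tWe use a learning rate of 0.01.\t0\n",
]
pairs = [parse_pair_line(line) for line in sample]
num_relevant = sum(p.relevant for p in pairs)
print(num_relevant, len(pairs))
```

Validating the slot count and the label up front means a wrong separator guess fails loudly on the first line rather than silently producing mislabeled pairs.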

Provider:

OpenDataLab

Created:

2022-06-07
