LimitGen

Name: LimitGen
Creator: maas
Published: 2025-12-05 12:11:53
License: No description available

ModelScope Community: updated 2025-12-05, indexed 2025-07-12

Download link:

https://modelscope.cn/datasets/yale-nlp/LimitGen


Description:

# LimitGen Benchmark

While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. **LimitGen** is the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. The benchmark consists of two subsets: **LimitGen-Syn**, a synthetic dataset carefully created through controlled perturbations of papers, and **LimitGen-Human**, a collection of real human-written limitations.

## LimitGen-Syn

The **LimitGen-Syn** subset includes 11 human-designed limitation subtypes that simulate common issues found in real-world papers.

1. **Low Data Quality (data)** The data collection method is unreliable, potentially introducing bias and lacking adequate preprocessing.
2. **Inappropriate Method (inappropriate)** Some methods in the paper are unsuitable for addressing the research question and may lead to errors or oversimplifications.
3. **Insufficient Baselines (baseline)** The paper fails to evaluate the proposed approach against a broad range of well-established methods.
4. **Limited Datasets (dataset)** The paper relies on limited datasets, which may hinder the generalizability and robustness of the proposed approach.
5. **Inappropriate Datasets (replace)** The paper uses inappropriate datasets, which may not accurately reflect the target task or real-world scenarios.
6. **Lack of Ablation Studies (ablation)** The paper fails to perform an ablation study, leaving the contribution of a certain component to the model's performance unclear.
7. **Limited Analysis (analysis)** The paper offers insufficient insights into the model's behavior and failure cases.
8. **Insufficient Metrics (metric)** The paper relies on insufficient evaluation metrics, which may provide an incomplete assessment of the model's overall performance.
9. **Limited Scope (review)** The review may focus on a very specific subset of literature or methods, leaving out important studies or novel perspectives.
10. **Irrelevant Citations (citation)** The paper includes irrelevant references or outdated methods, which distract from the main points and undermine the strength of its conclusions.
11. **Inaccurate Description (description)** The paper provides an inaccurate description of existing methods, which can hinder readers' understanding of the context and relevance of the proposed approach.

In the `syn/annotated` folder, each file contains a paper's title, abstract, and full body text extracted from the parsed PDF. The `syn/sections` folder contains the ground-truth limitation corresponding to each paper.

## LimitGen-Human

The **LimitGen-Human** subset contains 1,000 papers from ICLR 2025 submissions, along with human-written limitation comments derived from their official reviews.

In the `human/paper` directory, each file includes the full text of a paper extracted from its parsed PDF. The file `human/classified_limitations.json` stores the corresponding limitations for each paper, organized by predefined categories: `methodology`, `experimental design`, `result analysis`, and `literature review`. Each entry includes the paper's ID, title, abstract, and a dictionary of categorized limitation comments. For example:

```json
"rpR9fDZw3D": {
  "title": "Don’t Throw Away Data: Better Sequence Knowledge Distillation",
  "abstract": "...",
  "limitations": {
    "methodology": ["..."],
    "experimental design": ["..."],
    "result analysis": ["..."],
    "literature review": ["..."]
  }
}
```
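The entry layout described for `human/classified_limitations.json` can be walked with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes the file's top level is a JSON object keyed by paper ID (as the example entry suggests), and the helper names `load_limitations` and `flatten_limitations`, as well as the `sample` data, are hypothetical and introduced only for illustration.

```python
import json
from collections import Counter
from pathlib import Path

# Category names as listed in the dataset description.
CATEGORIES = (
    "methodology",
    "experimental design",
    "result analysis",
    "literature review",
)


def load_limitations(path):
    """Read the JSON file; assumes the top level is an object keyed by paper ID."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def flatten_limitations(entries):
    """Yield (paper_id, category, comment) for every limitation comment."""
    for paper_id, entry in entries.items():
        limitations = entry.get("limitations", {})
        for category in CATEGORIES:
            for comment in limitations.get(category, []):
                yield paper_id, category, comment


# Hypothetical stand-in data that mirrors the documented entry shape.
sample = {
    "paper-a": {
        "title": "Example title",
        "abstract": "...",
        "limitations": {
            "methodology": ["comment 1", "comment 2"],
            "experimental design": ["comment 3"],
            "result analysis": [],
            "literature review": ["comment 4"],
        },
    }
}

rows = list(flatten_limitations(sample))
counts = Counter(category for _, category, _ in rows)
print(counts)
```

With a local copy of the dataset, `flatten_limitations(load_limitations("human/classified_limitations.json"))` would produce the same kind of rows, provided the top-level assumption above holds for the real file.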


Provider:

maas

Created:

2025-07-11
