
LimitGen

ModelScope community · Updated 2025-12-05 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/yale-nlp/LimitGen
Resource description:
# LimitGen Benchmark

While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. **LimitGen** is the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. The benchmark consists of two subsets: **LimitGen-Syn**, a synthetic dataset carefully created through controlled perturbations of papers, and **LimitGen-Human**, a collection of real human-written limitations.

## LimitGen-Syn

The **LimitGen-Syn** subset includes 11 human-designed limitation subtypes that simulate common issues found in real-world papers. The identifier in parentheses is the short name of each subtype.

1. **Low Data Quality (data)** The data collection method is unreliable, potentially introducing bias and lacking adequate preprocessing.
2. **Inappropriate Method (inappropriate)** Some methods in the paper are unsuitable for addressing the research question and may lead to errors or oversimplifications.
3. **Insufficient Baselines (baseline)** Fails to evaluate the proposed approach against a broad range of well-established methods.
4. **Limited Datasets (dataset)** Relies on limited datasets, which may hinder the generalizability and robustness of the proposed approach.
5. **Inappropriate Datasets (replace)** Uses inappropriate datasets, which may not accurately reflect the target task or real-world scenarios.
6. **Lack of Ablation Studies (ablation)** Fails to perform an ablation study, leaving the contribution of a certain component to the model's performance unclear.
7. **Limited Analysis (analysis)** Offers insufficient insights into the model's behavior and failure cases.
8. **Insufficient Metrics (metric)** Relies on insufficient evaluation metrics, which may provide an incomplete assessment of the model's overall performance.
9. **Limited Scope (review)** The review may focus on a very specific subset of literature or methods, leaving out important studies or novel perspectives.
10. **Irrelevant Citations (citation)** Includes irrelevant references or outdated methods, which distract from the main points and undermine the strength of conclusions.
11. **Inaccurate Description (description)** Provides an inaccurate description of existing methods, which can hinder readers' understanding of the context and relevance of the proposed approach.

In the `syn/annotated` folder, each file contains a paper's title, abstract, and full body text extracted from the parsed PDF. The `syn/sections` folder contains the ground-truth limitation corresponding to each paper.

## LimitGen-Human

The **LimitGen-Human** subset contains 1,000 papers from ICLR 2025 submissions, along with human-written limitation comments derived from their official reviews.

In the `human/paper` directory, each file includes the full text of a paper extracted from its parsed PDF. The file `human/classified_limitations.json` stores the corresponding limitations for each paper, organized by predefined categories including `methodology`, `experimental design`, `result analysis`, and `literature review`. Each entry includes the paper's ID, title, abstract, and a dictionary of categorized limitation comments. For example:

```json
"rpR9fDZw3D": {
  "title": "Don’t Throw Away Data: Better Sequence Knowledge Distillation",
  "abstract": "...",
  "limitations": {
    "methodology": ["..."],
    "experimental design": ["..."],
    "result analysis": ["..."],
    "literature review": ["..."]
  }
}
```
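
The entry layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes that `human/classified_limitations.json` is a single JSON object whose keys are paper IDs (the example entry suggests this, but the description does not state it), and the helper names `load_limitations`, `count_by_category`, and `iter_comments` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

# The four predefined categories named in the dataset description.
CATEGORIES = (
    "methodology",
    "experimental design",
    "result analysis",
    "literature review",
)


def load_limitations(path):
    """Read the limitations file into a dict keyed by paper ID.

    Assumption: the file's top level is one JSON object mapping
    paper IDs to entries with "title", "abstract", and "limitations".
    """
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_by_category(entries):
    """Count limitation comments per category across all papers."""
    counts = Counter({category: 0 for category in CATEGORIES})
    for entry in entries.values():
        limitations = entry.get("limitations", {})
        for category in CATEGORIES:
            # A missing category is treated as having no comments.
            counts[category] += len(limitations.get(category, []))
    return counts


def iter_comments(entries):
    """Yield (paper_id, category, comment) for every limitation comment."""
    for paper_id, entry in entries.items():
        limitations = entry.get("limitations", {})
        for category in CATEGORIES:
            for comment in limitations.get(category, []):
                yield paper_id, category, comment
```

With a local copy of the dataset, `count_by_category(load_limitations("human/classified_limitations.json"))` would give the number of comments in each category, and `iter_comments` flattens the nested dictionary into rows that are easier to feed into an evaluation loop.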

Provider: maas
Created: 2025-07-11