LimitGen
Source: ModelScope community · Updated 2025-12-05 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/yale-nlp/LimitGen
Description:
# LimitGen Benchmark
While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We introduce **LimitGen**, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: **LimitGen-Syn**, a synthetic dataset carefully created through controlled perturbations of papers, and **LimitGen-Human**, a collection of real human-written limitations.
## LimitGen-Syn
The **LimitGen-Syn** subset includes 11 human-designed limitation subtypes that simulate common issues found in real-world papers.
1. **Low Data Quality (data)**
The data collection method is unreliable, potentially introducing bias and lacking adequate preprocessing.
2. **Inappropriate Method (inappropriate)**
Some methods in the paper are unsuitable for addressing this research question and may lead to errors or oversimplifications.
3. **Insufficient Baselines (baseline)**
Fails to evaluate the proposed approach against a broad range of well-established methods.
4. **Limited Datasets (dataset)**
Relies on limited datasets, which may hinder the generalizability and robustness of the proposed approach.
5. **Inappropriate Datasets (replace)**
Uses inappropriate datasets, which may not accurately reflect the target task or real-world scenarios.
6. **Lack of Ablation Studies (ablation)**
Fails to perform an ablation study, leaving the contribution of a certain component to the model’s performance unclear.
7. **Limited Analysis (analysis)**
Offers insufficient insights into the model’s behavior and failure cases.
8. **Insufficient Metrics (metric)**
Relies on insufficient evaluation metrics, which may provide an incomplete assessment of the model’s overall performance.
9. **Limited Scope (review)**
The review may focus on a very specific subset of literature or methods, leaving out important studies or novel perspectives.
10. **Irrelevant Citations (citation)**
Includes irrelevant references or outdated methods, which distract from the main points and undermine the strength of conclusions.
11. **Inaccurate Description (description)**
Provides an inaccurate description of existing methods, which can hinder readers’ understanding of the context and relevance of the proposed approach.
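The short keys in parentheses above can be kept in a small lookup table when working with the subset programmatically. The following is a minimal sketch; the names `LIMITATION_SUBTYPES` and `subtype_name` are hypothetical helpers for illustration and are not part of the dataset.

```python
# Subtype keys and display names, copied from the list above.
# The dictionary and helper names are illustrative, not part of the dataset.
LIMITATION_SUBTYPES = {
    "data": "Low Data Quality",
    "inappropriate": "Inappropriate Method",
    "baseline": "Insufficient Baselines",
    "dataset": "Limited Datasets",
    "replace": "Inappropriate Datasets",
    "ablation": "Lack of Ablation Studies",
    "analysis": "Limited Analysis",
    "metric": "Insufficient Metrics",
    "review": "Limited Scope",
    "citation": "Irrelevant Citations",
    "description": "Inaccurate Description",
}


def subtype_name(key: str) -> str:
    """Return the display name for a subtype key, or raise a clear error."""
    try:
        return LIMITATION_SUBTYPES[key]
    except KeyError:
        raise ValueError(f"unknown LimitGen-Syn subtype key: {key!r}") from None
```

For example, `subtype_name("ablation")` returns `"Lack of Ablation Studies"`, while an unrecognized key raises `ValueError`.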
In the `syn/annotated` folder, each file contains a paper's title, abstract, and full body text extracted from the parsed PDF.
The `syn/sections` folder contains the ground-truth limitation corresponding to each paper.
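One way to line up each paper with its ground-truth limitation is to match files across the two folders. The sketch below assumes that a paper file and its ground-truth file share the same relative path under `syn/annotated` and `syn/sections`; that layout is an assumption, not something the description above confirms, and `pair_syn_files` is a hypothetical helper name. File contents are not parsed, since the file format is not stated here.

```python
from pathlib import Path


def pair_syn_files(root):
    """Pair each file under syn/annotated with its counterpart under syn/sections.

    Assumption: both folders use identical relative file paths.
    Returns (pairs, missing), where `missing` lists paper files
    that have no counterpart under syn/sections.
    """
    annotated_dir = Path(root) / "syn" / "annotated"
    sections_dir = Path(root) / "syn" / "sections"
    pairs, missing = [], []
    for paper in sorted(p for p in annotated_dir.rglob("*") if p.is_file()):
        # Look for a file at the same relative path in the other folder.
        truth = sections_dir / paper.relative_to(annotated_dir)
        if truth.is_file():
            pairs.append((paper, truth))
        else:
            missing.append(paper)
    return pairs, missing
```

Checking `missing` first is a cheap way to find out whether the same-path assumption holds for a local copy of the dataset.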
## LimitGen-Human
The **LimitGen-Human** subset contains 1,000 papers from ICLR 2025 submissions, along with human-written limitation comments derived from their official reviews.
In the `human/paper` directory, each file includes the full text of a paper extracted from its parsed PDF.
The file `human/classified_limitations.json` stores the corresponding limitations for each paper, organized by predefined categories including `methodology`, `experimental design`, `result analysis`, and `literature review`.
Each entry includes the paper’s ID, title, abstract, and a dictionary of categorized limitation comments. For example:
```json
"rpR9fDZw3D": {
  "title": "Don’t Throw Away Data: Better Sequence Knowledge Distillation",
  "abstract": "...",
  "limitations": {
    "methodology": ["..."],
    "experimental design": ["..."],
    "result analysis": ["..."],
    "literature review": ["..."]
  }
}
```
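The entry layout shown above can be read with the standard `json` module. This is a minimal sketch, assuming the file's top level is a single JSON object keyed by paper ID, as the example entry suggests; the function names `load_human_limitations` and `count_by_category` are hypothetical.

```python
import json
from pathlib import Path

# Category names as listed in the description above.
CATEGORIES = (
    "methodology",
    "experimental design",
    "result analysis",
    "literature review",
)


def load_human_limitations(path):
    """Load the file as a dict mapping paper ID -> entry.

    Assumption: the top level is one JSON object keyed by paper ID.
    """
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_by_category(entries):
    """Count limitation comments per category across all papers."""
    counts = {category: 0 for category in CATEGORIES}
    for entry in entries.values():
        limitations = entry.get("limitations", {})
        for category in CATEGORIES:
            # A category missing from an entry counts as zero comments.
            counts[category] += len(limitations.get(category, []))
    return counts
```

A typical call would be `count_by_category(load_human_limitations("human/classified_limitations.json"))`, which gives a quick view of how the comments are distributed over the four categories.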
Provider:
maas
Created:
2025-07-11



