griffin/ChemSum
Dataset Card: ChemSum
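ChemSum Description
ChemSum Overview
We introduce a dataset focused on the chemistry domain, built by compiling a collection of open-access articles from chemistry academic journals. For each journal, we downloaded full-text article PDFs from its open-access portion, either through an available API or by scraping with Selenium Chrome WebDriver. Each PDF was then processed with a locally installed Grobid client to extract free-text paragraphs together with their sections.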
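The extraction step can be illustrated with a small sketch. Grobid emits TEI XML, in which body sections are `div` elements containing a `head` and one or more `p` elements. The code below parses a tiny, invented TEI fragment with the standard library; `extract_sections` is a hypothetical helper written for this card, not the authors' pipeline, and real Grobid output is considerably richer.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents, including those produced by Grobid.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_sections(tei_xml):
    """Return (header, text) pairs for each top-level <div> in the TEI body."""
    root = ET.fromstring(tei_xml.strip())
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []
    results = []
    for div in body.findall("tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        header = "".join(head.itertext()).strip() if head is not None else ""
        # Flatten inline markup and collapse whitespace inside each paragraph.
        paragraphs = [
            " ".join("".join(p.itertext()).split())
            for p in div.findall("tei:p", TEI_NS)
        ]
        results.append((header, " ".join(paragraphs)))
    return results


# Invented fragment for illustration only; not taken from ChemSum.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div><head>Introduction</head>
        <p>First paragraph.</p>
        <p>Second <hi>paragraph</hi>.</p>
      </div>
      <div><head>Methods</head><p>We did things.</p></div>
    </body>
  </text>
</TEI>
"""

sections = extract_sections(SAMPLE_TEI)
for header, text in sections:
    print(header, "->", text)
```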
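The table below lists the journals from which open-access articles were sourced, along with the number of articles processed from each:
| Source | Number of Articles |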
|---|---|
| Beilstein | 1,829 |
| Chem Cell | 546 |
| ChemRxiv | 12,231 |
| Chemistry Open | 398 |
| Nature Communications Chemistry | 572 |
| PubMed Author Manuscript | 57,680 |
| PubMed Open Access | 29,540 |
| Royal Society of Chemistry (RSC) | 9,334 |
| Scientific Reports - Nature | 6,826 |
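For every source that also carries papers from other disciplines (e.g. PubMed), we kept only the papers whose provided topic is Chemistry.
Languages
English
Dataset Structure
Data Fields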
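| Column | Description |
|---|---|
| `uuid` | Unique identifier for the example |
| `title` | Title of the article |
| `article_source` | Open-access source journal (see the table above) |
| `abstract` | Abstract (the summary reference) |
| `sections` | Full-text sections from the main body of the article (`<!>` marks section boundaries) |
| `headers` | Section headers corresponding to the `sections` field (`<!>`-delimited) |
| `source_toks` | Total number of tokens across `sections` |
| `target_toks` | Number of tokens in `abstract` |
| `compression` | Ratio of `source_toks` to `target_toks` |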
See the `load_chemistry()` function in the preprocessing script: the inputs are `sections` and `headers`, and the target is `abstract`.
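As a rough illustration of how these fields fit together, the sketch below splits `sections` and `headers` on the `<!>` delimiter, pairs them up, and assembles an input/target pair. It is not the authors' `load_chemistry()`: `split_field`, `build_example`, the header-then-body layout, and the sample record (including its token counts) are all assumptions made for this example.

```python
DELIM = "<!>"  # boundary marker used in the sections and headers fields


def split_field(value):
    """Split a delimited field into whitespace-stripped parts."""
    return [part.strip() for part in value.split(DELIM)]


def build_example(record):
    """Turn one record into an input/target pair for summarization."""
    headers = split_field(record["headers"])
    sections = split_field(record["sections"])
    # zip() would silently drop extras, so fail loudly on a mismatch.
    if len(headers) != len(sections):
        raise ValueError(
            f"{len(headers)} headers but {len(sections)} sections"
        )
    # Assumed layout: each header on its own line, followed by its body.
    blocks = [f"{h}\n{s}" for h, s in zip(headers, sections)]
    return {
        "input": "\n\n".join(blocks),
        "target": record["abstract"].strip(),
    }


# Invented record for illustration only; not taken from ChemSum.
record = {
    "headers": "Introduction<!>Methods",
    "sections": "Catalysts matter.<!>We ran three reactions.",
    "abstract": "Three reactions were run.",
    "source_toks": 6,
    "target_toks": 4,
}

example = build_example(record)
# compression is defined in the field table as source_toks / target_toks.
compression = record["source_toks"] / record["target_toks"]

print(example["input"])
print("compression:", compression)
```

Raising on a header/section count mismatch is a deliberate choice here, since a silent misalignment would attach bodies to the wrong headers.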
Data Splits
| Split | Count |
|---|---|
| train | 115,956 |
| validation | 1,000 |
| test | 2,000 |
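A quick arithmetic check shows that the split sizes and the per-source article counts listed earlier in this card add up to the same total. The sketch below hard-codes both tables; `load_chemsum` is a hypothetical convenience wrapper that assumes the Hugging Face `datasets` package is installed and that the repository ID at the top of this card is reachable, and it is not called here.

```python
# Counts copied from the split table and the source table in this card.
SPLIT_SIZES = {"train": 115_956, "validation": 1_000, "test": 2_000}
SOURCE_COUNTS = {
    "Beilstein": 1_829,
    "Chem Cell": 546,
    "ChemRxiv": 12_231,
    "Chemistry Open": 398,
    "Nature Communications Chemistry": 572,
    "PubMed Author Manuscript": 57_680,
    "PubMed Open Access": 29_540,
    "Royal Society of Chemistry (RSC)": 9_334,
    "Scientific Reports - Nature": 6_826,
}

total_by_split = sum(SPLIT_SIZES.values())
total_by_source = sum(SOURCE_COUNTS.values())

# Share of each split, as a percentage of all examples.
split_share = {
    name: 100 * count / total_by_split for name, count in SPLIT_SIZES.items()
}


def load_chemsum():
    """Hypothetical loader: needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the check above runs offline

    return load_dataset("griffin/ChemSum")


print(total_by_split, total_by_source)  # 118956 118956
for name, share in split_share.items():
    print(f"{name}: {share:.2f}%")
```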
Citation Information
@inproceedings{adams-etal-2023-desired,
    title = "What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization",
    author = "Adams, Griffin and Nguyen, Bichlien and Smith, Jake and Xia, Yingce and Xie, Shufang and Ostropolets, Anna and Deb, Budhaditya and Chen, Yuan-Jyue and Naumann, Tristan and Elhadad, No{\'e}mie",
    editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.587",
    doi = "10.18653/v1/2023.acl-long.587",
    pages = "10520--10542",
    abstract = "Summarization models often generate text that is poorly calibrated to quality metrics because they are trained to maximize the likelihood of a single reference (MLE). To address this, recent work has added a calibration step, which exposes a model to its own ranked outputs to improve relevance or, in a separate line of work, contrasts positive and negative sets to improve faithfulness. While effective, much of this work has focused on \textit{how} to generate and optimize these sets. Less is known about \textit{why} one setup is more effective than another. In this work, we uncover the underlying characteristics of effective sets. For each training instance, we form a large, diverse pool of candidates and systematically vary the subsets used for calibration fine-tuning. Each selection strategy targets distinct aspects of the sets, such as lexical diversity or the size of the gap between positive and negatives. On three diverse scientific long-form summarization datasets (spanning biomedical, clinical, and chemical domains), we find, among others, that faithfulness calibration is optimal when the negative sets are extractive and more likely to be generated, whereas for relevance calibration, the metric margin between candidates should be maximized and surprise{--}the disagreement between model and metric defined candidate rankings{--}minimized.",
}




