ncbi/MedFact-Synth

Name: ncbi/MedFact-Synth
Creator: ncbi
Published: 2026-03-12 19:27:05
License: not specified

Hugging Face | Updated 2026-03-12 | Indexed 2026-02-07

Download link:

https://hf-mirror.com/datasets/ncbi/MedFact-Synth

Description:

To train Med-V1, we construct MedFact-Synth, a large-scale synthetic training set of 1.5 million instances. Each instance contains a synthetic claim to be verified, a source article serving as evidence, a rationale explaining the verification, and a verdict on a 5-point Likert scale: strong contradiction (-2), partial contradiction (-1), neutral (0), partial agreement (+1), or strong agreement (+2). To build the dataset, we first sample one million articles from the PubMed 2025 baseline. For each article, GPT-4o-mini is prompted to generate two claims: one that the article may support and one that it may refute. To collect diverse claim-article pairs for verification, we use MedCPT to retrieve the 10 most relevant PubMed articles for each claim. A panel of frontier LLMs then verifies each pair, producing a rationale and a verdict through a voting-based mechanism.
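
The instance layout and verdict scale described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dataset's published schema: the field names (`claim`, `article`, `rationale`, `verdict`) are hypothetical, and the description does not specify the voting rule, so `aggregate_votes` uses a plurality vote with a low-median tie-break purely as one plausible example.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median_low

# Verdict scale as given in the dataset description.
VERDICT_LABELS = {
    -2: "strong contradiction",
    -1: "partial contradiction",
    0: "neutral",
    1: "partial agreement",
    2: "strong agreement",
}


@dataclass(frozen=True)
class Instance:
    # Hypothetical field names; the real column names may differ.
    claim: str      # synthetic claim to be verified
    article: str    # source article serving as evidence
    rationale: str  # explanation of the verification
    verdict: int    # Likert verdict in the range -2..+2

    def __post_init__(self):
        if self.verdict not in VERDICT_LABELS:
            raise ValueError(f"verdict must be in -2..2, got {self.verdict}")


def aggregate_votes(votes):
    """Combine several verdicts into one.

    Assumed rule (not documented by the source): take the most common
    verdict; if several verdicts tie for most common, return the low
    median of all votes.
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    top = max(counts.values())
    winners = [v for v, c in counts.items() if c == top]
    if len(winners) == 1:
        return winners[0]
    return median_low(votes)
```

For example, `aggregate_votes([2, 2, 1])` returns `2`, and `VERDICT_LABELS[2]` maps that back to "strong agreement". A different tie-break (for instance, defaulting to neutral) would be equally defensible given what the description states.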

Provided by:

ncbi
