almanach/Biomed-Enriched

Name: almanach/Biomed-Enriched
Creator: almanach
Published: 2025-06-27 10:29:46
License: No description available

Source: Hugging Face. Updated: 2025-06-27. Indexed: 2025-07-05.

Download link:

https://hf-mirror.com/datasets/almanach/Biomed-Enriched

Description:

Biomed-Enriched is a PubMed-derived biomedical dataset annotated with large language models (LLMs), intended for pretraining and for surfacing rare or otherwise hidden content. It is structured into two splits: Commercial and Non-Commercial. The Commercial split includes the textual content, path, license_url, and authors, while the Non-Commercial split does not include textual content due to licensing restrictions. The dataset was created using a two-stage annotation process: an initial annotation by a large language model, followed by scaling the annotations to more data via model distillation. Biomed-Enriched aims to improve the efficiency of biomedical pretraining and to support new biomedical subsets, built by document type and domain, that are tailored to specific research needs.
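Because only one of the two splits carries textual content, downstream code may want to guard against records without text. The following is a minimal sketch under stated assumptions, not the dataset's documented API: the column name `text`, the configuration name passed to `stream_split`, and the `train` split are all hypothetical and should be checked against the dataset card. Only the repository ID and the field names path, license_url, and authors come from the description above.

```python
from typing import Any, Dict, Iterable, Iterator

# Assumed column name for the textual content; verify against the dataset card.
TEXT_FIELD = "text"
# Metadata fields named in the description above.
METADATA_FIELDS = ("path", "license_url", "authors")


def records_with_text(records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield only the records that carry non-empty textual content."""
    for record in records:
        text = record.get(TEXT_FIELD)
        if isinstance(text, str) and text.strip():
            yield record


def stream_split(config_name: str, split: str = "train"):
    """Stream one configuration from the Hugging Face Hub.

    Requires the `datasets` package and network access. `config_name` and
    `split` are assumptions; the real names must come from the dataset card.
    """
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    return load_dataset(
        "almanach/Biomed-Enriched", config_name, split=split, streaming=True
    )
```

For example, `records_with_text(stream_split(name))` would iterate over the text-bearing records of whichever configuration `name` refers to, and would yield nothing for a split that ships metadata only.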

Provided by:

almanach
