
almanach/Biomed-Enriched

Source: Hugging Face. Updated 2025-06-27; indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/almanach/Biomed-Enriched
Description:
Biomed-Enriched is a PubMed-derived biomedical dataset enriched with large language models (LLMs) to support pretraining and to extract rare and hidden content. It has two splits, Commercial and Non-Commercial. The Commercial split includes the textual content along with path, license_url, and authors; the Non-Commercial split omits the textual content because of licensing restrictions. The dataset was built with a two-stage annotation process: a large language model produced the initial annotations, and the annotation was then scaled up via model distillation. Biomed-Enriched aims to make biomedical pretraining more efficient and to enable new biomedical subsets, selected by document type and domain, tailored to specific research needs.
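Because only the Commercial split carries textual content, code that consumes both splits has to tolerate records with no text. The following is a minimal sketch of that filtering step, not the dataset's own tooling. It assumes the textual content lives in a field named `text` (the description does not give the field name), and the sample rows, paths, and author values are hypothetical.

```python
from typing import Any, Iterable, Iterator, Mapping


def iter_with_text(
    records: Iterable[Mapping[str, Any]],
    text_field: str = "text",  # assumed field name, not confirmed by the source
) -> Iterator[Mapping[str, Any]]:
    """Yield only the records that carry non-empty textual content."""
    for record in records:
        text = record.get(text_field)
        # Non-Commercial records are expected to lack text entirely;
        # blank strings are skipped as well.
        if isinstance(text, str) and text.strip():
            yield record


# Hypothetical sample rows, not real dataset content.
sample = [
    {
        "text": "A short clinical case description.",
        "path": "commercial/a.xml",
        "license_url": "https://creativecommons.org/licenses/by/4.0/",
        "authors": "A. Author",
    },
    {"path": "noncommercial/b.xml"},            # no textual content
    {"text": "   ", "path": "commercial/c.xml"},  # blank text
]

kept_paths = [record["path"] for record in iter_with_text(sample)]
print(kept_paths)
```

In practice the records would probably come from the Hugging Face `datasets` library via `load_dataset("almanach/Biomed-Enriched", ...)`. The exact configuration and split names should be checked on the dataset card before relying on them.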
Provider:
almanach