almanach/Biomed-Enriched
Source: Hugging Face | Updated: 2025-06-27 | Indexed: 2025-07-05
Download link:
https://hf-mirror.com/datasets/almanach/Biomed-Enriched
Description:
Biomed-Enriched is a PubMed-derived biomedical dataset annotated with large language models (LLMs), intended for pretraining and for surfacing rare or otherwise hard-to-find content. It is structured into two splits, Commercial and Non-Commercial. The Commercial split includes the textual content along with path, license_url, and authors; the Non-Commercial split omits the textual content because of licensing restrictions. The dataset was created with a two-stage annotation process: an initial annotation pass by a large language model, followed by scaling the annotations through model distillation. Biomed-Enriched aims to make biomedical pretraining more efficient and to support new biomedical subsets, organized by document type and domain, tailored to specific research needs.
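The difference between the two splits can be illustrated with a minimal sketch. It assumes each row is a plain mapping; the field names path, license_url, and authors come from the description above, while "text" is an assumed name for the textual content column, and the example rows are hypothetical values made up for illustration. The keys kept on the non-commercial row are also an assumption, since the description only states that it lacks textual content.

```python
from typing import Any, Iterable, Iterator, List, Mapping

# "path", "license_url" and "authors" are named in the description;
# "text" is an ASSUMED column name for the textual content.
COMMERCIAL_FIELDS = ("text", "path", "license_url", "authors")


def has_text(record: Mapping[str, Any]) -> bool:
    """Return True when the record carries non-empty textual content."""
    value = record.get("text")
    return isinstance(value, str) and bool(value.strip())


def missing_commercial_fields(record: Mapping[str, Any]) -> List[str]:
    """List the Commercial-split fields that are absent from a record."""
    return [name for name in COMMERCIAL_FIELDS if name not in record]


def iter_pretraining_text(records: Iterable[Mapping[str, Any]]) -> Iterator[str]:
    """Yield text only from records that actually include it."""
    for record in records:
        if has_text(record):
            yield record["text"]


# Hypothetical rows, invented for illustration only.
commercial_row = {
    "text": "Example article body.",
    "path": "example/article_1.txt",
    "license_url": "https://example.org/license",
    "authors": ["A. Author"],
}
non_commercial_row = {
    "path": "example/article_2.txt",
    "license_url": "https://example.org/license",
    "authors": ["B. Author"],
}

if __name__ == "__main__":
    rows = [commercial_row, non_commercial_row]
    for row in rows:
        print(row["path"], "missing:", missing_commercial_fields(row))
    print(list(iter_pretraining_text(rows)))
```

Rows obtained from the Hugging Face `datasets` library could be passed to `iter_pretraining_text` in the same way; the actual configuration and column names should be checked against the dataset card before relying on this sketch.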
Provider:
almanach



