EssentialAI/eai-taxonomy-med-w-dclm
Source: Hugging Face | Updated: 2025-06-22 | Indexed: 2025-07-05
Download link:
https://hf-mirror.com/datasets/EssentialAI/eai-taxonomy-med-w-dclm
Resource description:
This is a high-quality medical dataset curated from web data by the Essential-Web project, containing 205 billion tokens of medical content. It is selected with a taxonomy-based filter intended to retain scientific medical content that is technically correct and exhibits reasoning, and further filtered with the DCLM classifier to keep instruction-dense documents. The README reports the dataset's performance on several medical evaluation benchmarks, where it shows significant improvements over existing datasets. It also documents the dataset schema: core fields, EAI Taxonomy Classification, Document Characteristics, Content Quality Dimensions, and Metadata Structure.
Provider:
EssentialAI



