EssentialAI/eai-taxonomy-stem-w-dclm
Source: Hugging Face. Updated: 2025-06-22. Indexed: 2025-07-05.
Download link:
https://hf-mirror.com/datasets/EssentialAI/eai-taxonomy-stem-w-dclm
Overview:
EAI-Taxonomy STEM w/ DCLM is a high-quality STEM dataset curated from web data using taxonomy-based filtering. It contains 1742 billion tokens of science, technology, engineering, and mathematics content. The dataset is part of the Essential-Web project, which introduces a new paradigm for dataset curation based on expressive metadata and simple semantic filters. Unlike traditional STEM datasets, which require complex domain-specific pipelines, our approach uses a 12-category taxonomy to efficiently identify and extract high-quality STEM content.
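To illustrate what a "simple semantic filter" over per-document metadata might look like, here is a minimal sketch. It is not the project's actual pipeline: the field names `topic` and `quality`, the label set, and the threshold are all hypothetical and chosen only for illustration; the real dataset's schema may differ.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical label set: illustrative names, not the dataset's real taxonomy values.
STEM_LABELS = {"science", "technology", "engineering", "mathematics"}


def filter_stem(
    records: Iterable[Dict[str, Any]], min_quality: float = 0.5
) -> Iterator[Dict[str, Any]]:
    """Yield records whose (assumed) metadata marks them as STEM content.

    Assumes each record carries a `metadata` dict with a `topic` label and a
    `quality` score in [0, 1]; both fields are assumptions for this sketch.
    """
    for record in records:
        meta = record.get("metadata", {})
        if meta.get("topic") in STEM_LABELS and meta.get("quality", 0.0) >= min_quality:
            yield record


# Toy in-memory records standing in for web documents with taxonomy metadata.
sample = [
    {"text": "Proof of the triangle inequality", "metadata": {"topic": "mathematics", "quality": 0.9}},
    {"text": "Celebrity gossip roundup", "metadata": {"topic": "entertainment", "quality": 0.8}},
    {"text": "Low-effort physics spam", "metadata": {"topic": "science", "quality": 0.1}},
]

kept = [r["text"] for r in filter_stem(sample)]
```

The point of the sketch is that, once every document carries taxonomy labels, selecting a domain reduces to a metadata predicate rather than a dedicated domain-specific pipeline.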
Provider:
EssentialAI



