EssentialAI/eai-taxonomy-stem-w-dclm-100b-sample
Source: Hugging Face | Updated: 2025-06-22 | Indexed: 2025-07-05
Download link:
https://hf-mirror.com/datasets/EssentialAI/eai-taxonomy-stem-w-dclm-100b-sample
Description:
EAI-Taxonomy STEM w/ DCLM is a curated dataset of 100 billion tokens of science, technology, engineering, and mathematics content. It is selected with taxonomy-based filtering, which identifies and extracts high-quality STEM documents. The dataset is part of the Essential-Web project, which introduces an approach to dataset curation built on expressive metadata and simple semantic filters. It is reported to outperform baseline and educational datasets without requiring complex domain-specific pipelines. It covers science, engineering, medical, and computer science domains, and its filters select high-quality document types and reasoning content.

The dataset schema includes comprehensive metadata, quality signals, and taxonomic classifications. Each record represents a document extracted from web archives, with detailed provenance tracking and quality assessment metrics. The EAI Taxonomy Classification is a hierarchical classification system that assigns each document a primary and a secondary label. The schema also includes Bloom's Taxonomy Integration for educational content analysis and Document Characteristics for document classification. The Content Quality Dimensions assess the complexity and sophistication of logical reasoning, the accuracy and precision of technical information, and the educational background required to comprehend the content. The Metadata field contains a nested structure with web archive information.
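As a rough illustration of how primary and secondary taxonomy labels can drive a simple semantic filter, the sketch below keeps records whose label for one dimension matches a wanted set. The field names (`eai_taxonomy`, `document_type`, `primary`, `secondary`, `name`, `metadata`, `url`), the label values, and the sample records are all assumptions made for this example and may not match the published schema; check the dataset card before relying on them.

```python
from typing import Any, Dict, Iterable, Iterator, Set

Record = Dict[str, Any]


def label_names(record: Record, dimension: str) -> Set[str]:
    """Collect the primary and secondary label names for one taxonomy dimension.

    The nested layout used here is hypothetical, not the confirmed schema.
    """
    entry = record.get("eai_taxonomy", {}).get(dimension, {})
    names: Set[str] = set()
    for slot in ("primary", "secondary"):
        label = entry.get(slot) or {}  # tolerate a missing or null slot
        name = label.get("name")
        if name:
            names.add(name)
    return names


def filter_by_label(
    records: Iterable[Record], dimension: str, wanted: Set[str]
) -> Iterator[Record]:
    """Yield records whose primary or secondary label is in `wanted`."""
    for record in records:
        if label_names(record, dimension) & wanted:
            yield record


# Made-up records that mimic the assumed nested structure.
sample_records = [
    {
        "text": "A derivation of the heat equation.",
        "eai_taxonomy": {
            "document_type": {
                "primary": {"name": "Academic/Research"},
                "secondary": {"name": "Tutorial"},
            }
        },
        "metadata": {"url": "https://example.org/heat"},
    },
    {
        "text": "Limited-time offer on lab equipment.",
        "eai_taxonomy": {
            "document_type": {
                "primary": {"name": "Product Page"},
                "secondary": None,
            }
        },
        "metadata": {"url": "https://example.org/shop"},
    },
]

kept = list(filter_by_label(sample_records, "document_type", {"Tutorial"}))
kept_urls = [record["metadata"]["url"] for record in kept]
```

Because `filter_by_label` accepts any iterable, the same function could in principle be applied to a streamed copy of the dataset rather than an in-memory list, once the field names are adjusted to the real schema.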
Provider:
EssentialAI



