EssentialAI/eai-taxonomy-code-w-dclm
Source: Hugging Face. Updated 2025-06-22. Indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/EssentialAI/eai-taxonomy-code-w-dclm
Description:
EAI-Taxonomy Code is a high-quality code dataset containing 564 billion tokens, curated from web data using taxonomy-based filtering. The dataset is part of the Essential-Web project, which introduces a new paradigm for dataset curation using expressive metadata and simple semantic filters. Unlike traditional code datasets that require complex domain-specific pipelines, EAI-Taxonomy Code leverages a 12-category taxonomy to efficiently identify and extract high-quality code data. The dataset also includes mathematics content to match the scope of existing code datasets.
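To make the idea of "simple semantic filters" over per-document metadata concrete, here is a minimal sketch in Python. It is an illustration only, not the Essential-Web pipeline: the field names (`metadata`, `topic`, `quality`), the label values, and the threshold are all hypothetical and do not reflect the dataset's actual schema.

```python
from typing import Any, Dict, Iterable, Iterator, Set

# Hypothetical labels: the real taxonomy's category names are not listed here.
CODE_TOPICS: Set[str] = {"software_development", "mathematics"}


def taxonomy_filter(
    docs: Iterable[Dict[str, Any]],
    allowed_topics: Set[str],
    min_quality: int,
) -> Iterator[Dict[str, Any]]:
    """Yield documents whose (hypothetical) metadata passes two simple checks."""
    for doc in docs:
        meta = doc.get("metadata", {})
        topic_ok = meta.get("topic") in allowed_topics
        quality_ok = meta.get("quality", 0) >= min_quality
        if topic_ok and quality_ok:
            yield doc


# Made-up sample records, used only to exercise the filter.
sample_docs = [
    {"text": "def add(a, b): return a + b",
     "metadata": {"topic": "software_development", "quality": 4}},
    {"text": "Celebrity gossip roundup",
     "metadata": {"topic": "entertainment", "quality": 4}},
    {"text": "Proof that sqrt(2) is irrational",
     "metadata": {"topic": "mathematics", "quality": 3}},
    {"text": "x = = 1",
     "metadata": {"topic": "software_development", "quality": 1}},
]

kept = list(taxonomy_filter(sample_docs, CODE_TOPICS, min_quality=3))
```

The point of the sketch is that, once each document carries descriptive labels, selecting a domain reduces to a few comparisons on those labels rather than a dedicated extraction pipeline. Including a mathematics label in the allowed set mirrors the description's note that math content is kept alongside code.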
Provider:
EssentialAI



