Essential-Web v1.0
Source: arXiv. Indexed: 2025-09-30
Download link:
https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
Description:
Essential-Web v1.0 contains 24 trillion tokens of web documents, each labeled across twelve categories that cover topic, format, content complexity, and quality. The labels are produced by a fine-tuned model with high annotator agreement. Filtering on these labels yields competitive curated web datasets in several domains. The dataset is intended for training language models, with performance evaluated on math, web code, STEM (science, technology, engineering, and mathematics), and medical tasks.
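As a rough illustration of how per-document labels can be used to carve out a domain subset, here is a minimal sketch in Python. The record layout (`Doc`), its field names, the integer label scales, and the helper `select_docs` are all hypothetical and chosen for illustration; the actual column names and label values in the dataset may differ, so check the dataset page before adapting this.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Doc:
    # Hypothetical record layout; the real dataset's columns may differ.
    text: str
    topic: str
    doc_format: str
    complexity: int  # assumed scale: higher means more advanced content
    quality: int     # assumed scale: higher means better quality


def select_docs(
    docs: Iterable[Doc],
    topic: str,
    min_complexity: int = 0,
    min_quality: int = 0,
) -> Iterator[Doc]:
    """Yield documents that match the topic and meet both thresholds."""
    for doc in docs:
        if (
            doc.topic == topic
            and doc.complexity >= min_complexity
            and doc.quality >= min_quality
        ):
            yield doc


# Tiny made-up sample standing in for labeled web documents.
sample = [
    Doc("Proof of the triangle inequality ...", "math", "tutorial", 3, 4),
    Doc("Top ten holiday recipes ...", "food", "listicle", 1, 2),
    Doc("Intro to fractions ...", "math", "tutorial", 1, 3),
]

math_subset = list(
    select_docs(sample, topic="math", min_complexity=2, min_quality=3)
)
print([d.text for d in math_subset])
```

Because `select_docs` is a generator, the same filter could in principle be applied lazily to a streamed corpus rather than an in-memory list, which matters at this scale.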
Provider:
EssentialAI



