Essential-Web v1.0

Source: arXiv, indexed 2025-09-30
Download link:
https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
Description:
Essential-Web v1.0 contains 24 trillion tokens of web documents, each labeled under a twelve-category taxonomy covering topic, format, content complexity, and quality. The labels are produced by a fine-tuned model with high annotator agreement. Filtering on these labels yields curated subsets that are competitive with existing web-curated datasets in several domains. The dataset is intended for training language models, with performance evaluated on math, web code, STEM (science, technology, engineering, and mathematics), and medical tasks.
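
Because every document carries category labels, a domain subset can be selected by filtering on them. The following is only a minimal sketch of that idea: the field names (`topic`, `quality`), the label values, and the sample records are hypothetical and do not reflect the dataset's actual schema.

```python
from typing import Iterable, Iterator

# Hypothetical records: the field names and values are illustrative
# assumptions, not the real schema of Essential-Web v1.0.
SAMPLE_DOCS = [
    {"id": 1, "topic": "mathematics", "quality": 4, "text": "Proof of the triangle inequality."},
    {"id": 2, "topic": "medicine", "quality": 2, "text": "Forum post about a headache."},
    {"id": 3, "topic": "mathematics", "quality": 1, "text": "Page of random digits."},
    {"id": 4, "topic": "medicine", "quality": 5, "text": "Clinical guideline summary."},
]


def filter_docs(docs: Iterable[dict], topic: str, min_quality: int) -> Iterator[dict]:
    """Yield documents with a matching topic label and quality >= min_quality."""
    for doc in docs:
        if doc["topic"] == topic and doc["quality"] >= min_quality:
            yield doc


# Select a small "math" subset from the sample records.
math_subset = list(filter_docs(SAMPLE_DOCS, topic="mathematics", min_quality=3))
print([doc["id"] for doc in math_subset])
```

The filter is written as a generator so that it could, in principle, be applied to a stream of records without loading them all into memory, which matters for a corpus of this size.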
Provider:
EssentialAI