EssentialAI/essential-web-v1.0
Source: Hugging Face. Updated 2025-10-02; indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/EssentialAI/essential-web-v1.0
Resource description:
Essential-Web is a 24-trillion-token web dataset with document-level metadata designed for flexible dataset curation. The dataset provides metadata including subject matter classification, web page type, content complexity, and document quality scores for each of the 23.6 billion documents. Researchers can filter and curate specialized datasets using the provided metadata, reducing the need for custom preprocessing pipelines and domain-specific classifiers.
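As a rough illustration of the metadata-based filtering described above, the following minimal Python sketch selects documents by subject and quality score. The field names (`subject`, `page_type`, `quality`), the score range, and the sample records are hypothetical placeholders chosen for illustration; they are not the dataset's actual schema, which should be checked on the dataset card before adapting this.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical sample records. The field names and values are assumptions
# made for this sketch, not the real Essential-Web metadata schema.
SAMPLE_DOCS = [
    {"id": "doc-1", "subject": "mathematics", "page_type": "tutorial", "quality": 0.9},
    {"id": "doc-2", "subject": "mathematics", "page_type": "forum", "quality": 0.3},
    {"id": "doc-3", "subject": "cooking", "page_type": "tutorial", "quality": 0.8},
]


def filter_docs(docs: Iterable[Dict], subject: str, min_quality: float) -> Iterator[Dict]:
    """Yield documents whose metadata matches the subject and quality threshold."""
    for doc in docs:
        if doc["subject"] == subject and doc["quality"] >= min_quality:
            yield doc


# Keep only mathematics documents with a quality score of at least 0.5.
selected_ids = [d["id"] for d in filter_docs(SAMPLE_DOCS, "mathematics", 0.5)]
print(selected_ids)
```

Because the function is a generator over any iterable of records, the same pattern could in principle be applied to a streamed copy of the data without loading it all into memory.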
Provider:
EssentialAI



