anandjh8/common-crawl-english-filtered

Name: anandjh8/common-crawl-english-filtered
Creator: anandjh8
Published: 2025-10-12 19:39:10
License: No description available

Source: Hugging Face. Updated: 2025-10-12. Indexed: 2025-10-25.

Download link:

https://hf-mirror.com/datasets/anandjh8/common-crawl-english-filtered

Description:

FineWeb-English-Filtered is a large-scale, cleaned, English-only text dataset derived from Common Crawl’s WET archives. It contains approximately 940 million publicly available web documents, converted into Apache Parquet format with a consistent schema for fast and efficient data loading. The dataset was generated using a custom AWS Glue pipeline that processed, filtered, and merged terabytes of Common Crawl data. This dataset is ideal for training large language models, retrieval research, and web-scale NLP tasks.
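Because the description mentions Parquet files meant for efficient loading, a streaming read is one plausible way to inspect a few records without downloading the whole corpus. The following is a minimal sketch, not a documented usage pattern: it assumes the Hugging Face `datasets` library is installed and that the repository exposes a split named "train". The listing confirms neither, and it does not document the column names. The helper names `stream_records` and `preview` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# Repository ID taken from the listing above.
REPO_ID = "anandjh8/common-crawl-english-filtered"


def stream_records(repo_id: str = REPO_ID, split: str = "train") -> Iterable[Dict[str, Any]]:
    """Lazily iterate over records instead of downloading every Parquet file.

    Assumption: a split named "train" exists; the listing does not say so.
    """
    # Imported here so that preview() below can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def preview(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n rows from any iterable of dict-like records."""
    return list(islice(records, n))


# Example usage (requires network access, so it is left commented out):
#   for row in preview(stream_records()):
#       print(sorted(row.keys()))  # column names are not documented in the listing
```

Printing the keys of the first few rows is a cheap way to discover the schema before committing to a full download of roughly 940 million documents.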

Provider:

anandjh8
