anandjh8/common-crawl-english-filtered
Source: Hugging Face. Updated: 2025-10-12. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/anandjh8/common-crawl-english-filtered
Description:
FineWeb-English-Filtered is a large-scale, cleaned, English-only text dataset derived from Common Crawl’s WET archives. It contains approximately 940 million publicly available web documents, converted into Apache Parquet format with a consistent schema for fast and efficient data loading. The dataset was generated using a custom AWS Glue pipeline that processed, filtered, and merged terabytes of Common Crawl data. This dataset is ideal for training large language models, retrieval research, and web-scale NLP tasks.
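Because the data is stored as Parquet with a consistent schema, it can likely be read lazily rather than downloaded in full. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository exposes a split named `train`; neither the split name nor the column names are stated in this listing, so the dataset card should be checked first. The helper names `stream_dataset` and `preview` are hypothetical and exist only for illustration.

```python
from itertools import islice
from typing import Iterable, List

# Repository id taken from the download link in this listing.
REPO_ID = "anandjh8/common-crawl-english-filtered"


def stream_dataset(split: str = "train") -> Iterable[dict]:
    """Return a lazily streamed view of the dataset.

    The split name "train" is an assumption, not something this listing confirms.
    """
    # Imported here so that preview() stays usable without the library installed.
    from datasets import load_dataset  # third-party: pip install datasets

    # streaming=True iterates over records on demand instead of
    # fetching every Parquet shard up front.
    return load_dataset(REPO_ID, split=split, streaming=True)


def preview(records: Iterable[dict], n: int = 3) -> List[dict]:
    """Collect the first n records from any iterable of records."""
    return list(islice(records, n))
```

Calling `preview(stream_dataset(), 3)` would fetch only the first three records, which is a cheap way to inspect the actual column names before committing to a full download.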
Provider:
anandjh8



