Shirova/Common-Crawl-2025-June

Name: Shirova/Common-Crawl-2025-June
Creator: Shirova
Published: 2025-11-09 10:57:54
License: No description provided

Hugging Face · Updated 2025-11-09 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/Shirova/Common-Crawl-2025-June


Resource description:

---
license: apache-2.0
task_categories:
- text-classification
- token-classification
tags:
- Web
- WebData
- Common_Crawl
pretty_name: Common Crawl 2025 June
size_categories:
- 100B<n<1T
language:
- en
dataset_name: Ujjwal-Tyagi/Common-Crawl-2025-June
---

# Common Crawl 2025 June

**Common-Crawl-2025-June** is a curated, processed, and filtered dataset built from the **June 2025 Common Crawl** web corpus. It contains data crawled between **June 1, 2025, and June 10, 2025**, processed with **Hugging Face's DataTrove** pipeline and several **AI-based content filters** that remove unsafe, harmful, or low-quality text.

---

## Dataset Summary

This dataset is one of the latest structured Common Crawl releases of high-quality web data. Extraction and filtering focused on text cleanliness, linguistic consistency, and ethical compliance. Every page has undergone language detection, PII filtering, and text normalization, which makes the dataset suitable for research, pretraining, and web-scale analysis.

---

## Key Details

- **Crawl period:** June 1, 2025 – June 10, 2025
- **Processed using:** Hugging Face DataTrove and AI model filters
- **Language coverage:** More than 1k languages
- **Data type:** Web text and metadata
- **License:** Apache 2.0
- **Total data size:** 153B tokens

---

## Dataset Structure

Each record represents one cleaned web document with the following fields:

| Field | Type | Description |
|--------|------|-------------|
| `text` | string | Extracted and fully cleaned text content of the crawled web page. |
| `id` | string | Unique identifier assigned to each record, derived from the source URL hash. |
| `dump` | string | Identifier of the Common Crawl dump, specifically `CC-MAIN-2025-06`. |
| `url` | string | Original URL of the crawled web page from which the text was extracted. |
| `date` | datetime | Timestamp indicating when the page was crawled, within the range `2025-06-12 11:29:07` to `2025-06-25 09:51:48`. |
| `file_path` | string | Path of the shard or data file that contains the record, one of 866 shard paths. |
| `language` | string | Language code detected for the page content. The dataset covers more than 1k languages. |
| `language_score` | float | Confidence score produced by the language identification model. |
| `token_count` | int | Number of tokens in the document's `text` field. |

---

# Data Processing Pipeline

- Crawl ingestion: Collected raw WARC files from Common Crawl (June 2025).
- Extraction: Extracted HTML, titles, and cleaned text content.
- Normalization: Applied Unicode normalization and boilerplate removal.
- Filtering: Removed or flagged:
  - PII and personal identifiers
  - Hateful, toxic, or explicit content
  - Spam and duplicated text
  - Unsafe or low-quality material
- Scoring: Assigned `language_score` and quality confidence metrics.
- Packaging: Stored the structured dataset in JSONL shards with manifests and metadata.

# Intended Uses

- Pretraining and fine-tuning large language models
- Research on large-scale web text data
- Data quality benchmarking
- Language modeling and token-level analysis

# Limitations

- Some residual harmful or biased content may remain despite filtering.
- The dataset covers only a 10-day snapshot of the web, not full temporal coverage.
- Users must ensure compliance with legal and ethical guidelines for downstream use.

# Ethical Considerations

Although multiple safety models were used, no automated filtering system is perfect. Users are responsible for performing additional checks before deploying models trained on this data in production or public-facing environments.

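As one possible form of the additional checks mentioned above, the sketch below validates records against the fields listed in the Dataset Structure table and keeps only documents with a confident language prediction. It is a minimal illustration, not part of the dataset's own tooling: the helper names `has_expected_schema` and `keep_confident` and the thresholds `min_score` and `min_tokens` are hypothetical, and the serialized form of `date` is not specified in the card, so only its presence is checked.

```python
from typing import Any, Iterable, Iterator

# Field names and types taken from the Dataset Structure table.
# `date` is handled separately because its serialized form is not specified.
EXPECTED_FIELDS = {
    "text": str,
    "id": str,
    "dump": str,
    "url": str,
    "file_path": str,
    "language": str,
    "language_score": float,
    "token_count": int,
}


def has_expected_schema(record: dict[str, Any]) -> bool:
    """Return True if the record carries every documented field with the documented type."""
    if "date" not in record:
        return False
    return all(
        isinstance(record.get(name), kind) for name, kind in EXPECTED_FIELDS.items()
    )


def keep_confident(
    records: Iterable[dict[str, Any]],
    min_score: float = 0.9,  # hypothetical threshold, tune per use case
    min_tokens: int = 50,    # hypothetical threshold, tune per use case
) -> Iterator[dict[str, Any]]:
    """Yield well-formed records whose language score and length clear the thresholds."""
    for record in records:
        if not has_expected_schema(record):
            continue
        if record["language_score"] >= min_score and record["token_count"] >= min_tokens:
            yield record
```

A filter like this only narrows the data by metadata; it does not replace content-level safety review of the `text` field.
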
# Citation

```
@dataset{common-crawl-2025-june,
  title = {Common-Crawl-2025-June: Curated Common Crawl Dataset (June 1–10, 2025)},
  author = {Ujjwal Tyagi},
  year = {2025},
  howpublished = {Hugging Face Datasets},
  url = {https://huggingface.co/datasets/Ujjwal-Tyagi/Common-Crawl-2025-June}
}
```

# Contact

- Maintainer: **Ujjwal Tyagi**
- Hugging Face: https://huggingface.co/Ujjwal-Tyagi
- Issues and feedback: Submit via the Hugging Face dataset page.
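
For readers who want to sample the data before committing to a full download, the sketch below streams records with the Hugging Face `datasets` library and tallies documents, tokens, and languages over a small prefix. It rests on assumptions: the repository id is taken from the download link on this page, the split name `train` is a guess that the card does not confirm, and `load_stream` and `summarize` are hypothetical helper names.

```python
from itertools import islice
from typing import Any, Iterable


def load_stream(repo_id: str = "Shirova/Common-Crawl-2025-June", split: str = "train"):
    """Open the dataset lazily; records are fetched only as they are iterated.

    The split name "train" is an assumption, not confirmed by the dataset card.
    """
    # Imported here so `summarize` stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def summarize(records: Iterable[dict[str, Any]], limit: int = 1000) -> dict[str, Any]:
    """Count documents, tokens, and per-language documents over the first `limit` records."""
    documents = 0
    tokens = 0
    languages: dict[str, int] = {}
    for record in islice(records, limit):
        documents += 1
        tokens += record["token_count"]
        code = record["language"]
        languages[code] = languages.get(code, 0) + 1
    return {"documents": documents, "tokens": tokens, "languages": languages}
```

Under these assumptions, `summarize(load_stream(), limit=100)` would inspect the first 100 streamed records without materializing the shards on disk.
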

Provider:

Shirova
