
mocatex/cc-news-sample

Source: Hugging Face. Updated 2025-12-17. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mocatex/cc-news-sample
Resource description:
---
language:
- en
pretty_name: YourDatasetName
tags:
- common-crawl
- web-crawl
- news
- corpus
- article
license: other
size_categories:
- 10M<n<100M
---

# YourDatasetName

## Summary

This dataset is a 10% random sample of English-language news articles derived from Common Crawl, provided for research and reproducibility. It was sampled from the [Geralt-Targaryen/CC-News](https://huggingface.co/datasets/Geralt-Targaryen/CC-News) dataset.

## Data origin & rights

Derived from Common Crawl. Copyright in the underlying content remains with the original authors/publishers. The [Common Crawl Terms of Use](https://commoncrawl.org/terms-of-use) apply.

## What's inside

- Format: Parquet shards (one file per year)
- Columns: year, source_domain, title, text

### Stats:

- Size: ~23GB
- ~13'399'600 articles

| year | number of articles |
|------|--------------------|
| 2016 | 122'760 |
| 2017 | 1'750'263 |
| 2018 | 1'700'226 |
| 2019 | 2'239'029 |
| 2020 | 3'690'418 |
| 2021 | 3'896'904 |

## Intended use

Research, benchmarking, and reproducibility.

## Privacy / PII notice

This dataset may contain personal data and sensitive content, as it originates from web crawl data.
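The per-year counts in the Stats table can be checked against the stated total of roughly 13'399'600 articles. The following is a minimal Python sketch of that arithmetic, using only the numbers from the card. The names `ARTICLES_PER_YEAR`, `STATED_TOTAL`, `total_articles`, and `yearly_share` are illustrative and are not part of the dataset or of any library.

```python
# Per-year article counts copied from the Stats table in the dataset card.
ARTICLES_PER_YEAR = {
    2016: 122_760,
    2017: 1_750_263,
    2018: 1_700_226,
    2019: 2_239_029,
    2020: 3_690_418,
    2021: 3_896_904,
}

# Total stated in the card ("~13'399'600 articles").
STATED_TOTAL = 13_399_600


def total_articles(counts):
    """Return the sum of all per-year article counts."""
    return sum(counts.values())


def yearly_share(counts):
    """Return each year's fraction of the total article count."""
    total = total_articles(counts)
    return {year: n / total for year, n in counts.items()}


if __name__ == "__main__":
    total = total_articles(ARTICLES_PER_YEAR)
    print(f"sum of per-year counts: {total:,}")
    print(f"matches stated total: {total == STATED_TOTAL}")
    for year, share in yearly_share(ARTICLES_PER_YEAR).items():
        print(f"{year}: {share:.1%}")
```

The shares show how unevenly the sample is distributed across years, which may matter when splitting the data by year for benchmarking.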
Provider:
mocatex