sumanthbhargava/cc-news-parallel

Name: sumanthbhargava/cc-news-parallel
Creator: sumanthbhargava
Published: 2026-03-04 09:58:05
License: No description available

Source: Hugging Face. Updated: 2026-03-04. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/sumanthbhargava/cc-news-parallel

Resource description:

---
license: other
task_categories:
- text2text-generation
language:
- en
tags:
- encoding-errors
- parallel-corpus
- denoising
- nlp-research
---

# CC-News Encoding Error Parallel Corpus

This is a research dataset created for a Master's thesis on denoising character-level encoding errors in text using Large Language Models.

## Dataset Description

The dataset is derived from [cc_news](https://huggingface.co/datasets/cc_news), an English news corpus based on the Common Crawl news crawl. Clean UTF-8 text was synthetically corrupted to simulate real-world encoding errors that occurred during the UTF-8 transition period (circa 2000-2015).

## Corruption Variants

| File | Description | Example |
|------|-------------|---------|
| `data/cc_news_utf8_read_as_ascii.parquet` | UTF-8 bytes decoded as ASCII: bytes > 127 become replacement chars | `ä -> ��` |
| `data/cc_news_utf8_read_as_latin1.parquet` | UTF-8 bytes decoded as Latin-1: mojibake (garbled text) | `ä -> Ã¤` |
| `data/cc_news_unicode_written_as_ascii.parquet` | Unicode written as ASCII: one `?` per unsupported char | `ä -> ?` |
| `data/cc_news_unicode_written_as_latin1.parquet` | Unicode written as Latin-1: `?` only for chars outside Latin-1 | `€ -> ?` |

## Schema

Each Parquet file contains the following fields:

| Field | Description |
|-------|-------------|
| `id` | Sequential record ID |
| `url` | Source URL |
| `date` | Crawl date |
| `title` | Article title |
| `text` | Clean UTF-8 text (ground truth) |
| `word_count` | Whitespace-tokenised word count of clean text |
| `corrupted_text` | Synthetically corrupted text |
| `corrupted_word_count` | Word count of corrupted text |
| `affected_clean_words` | Number of clean words changed by corruption |
| `corrupt_word_count` | Number of corrupted words that differ from clean |

## Source & License

- Source corpus: [cc_news](https://huggingface.co/datasets/cc_news) via Common Crawl
- Underlying data subject to [Common Crawl Terms of Use](https://commoncrawl.org/terms-of-use/)
- Intended for **non-commercial academic research only**
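
The four corruption variants listed in the table above can be approximated with Python's built-in codecs. This is a minimal sketch and not the dataset author's actual pipeline: the function names are hypothetical, and it assumes "Latin-1" means strict ISO-8859-1 (the dataset may have used a different mapping such as Windows-1252).

```python
# Hypothetical helpers that mimic the four corruption variants.
# Assumption: "Latin-1" is Python's "latin-1" codec (ISO-8859-1).


def utf8_read_as_ascii(text: str) -> str:
    """UTF-8 bytes decoded as ASCII: every byte > 127 becomes U+FFFD."""
    return text.encode("utf-8").decode("ascii", errors="replace")


def utf8_read_as_latin1(text: str) -> str:
    """UTF-8 bytes decoded as Latin-1: each byte maps to one character (mojibake)."""
    return text.encode("utf-8").decode("latin-1")


def unicode_written_as_ascii(text: str) -> str:
    """Unicode written as ASCII: one '?' per non-ASCII character."""
    return text.encode("ascii", errors="replace").decode("ascii")


def unicode_written_as_latin1(text: str) -> str:
    """Unicode written as Latin-1: '?' only for characters outside Latin-1."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


if __name__ == "__main__":
    sample = "Käse costs 5 €"
    for corrupt in (
        utf8_read_as_ascii,
        utf8_read_as_latin1,
        unicode_written_as_ascii,
        unicode_written_as_latin1,
    ):
        print(f"{corrupt.__name__}: {corrupt(sample)!r}")
```

The two "read as" variants corrupt text at decode time, so a multi-byte character expands into several wrong characters. The two "written as" variants lose information at encode time, so each unsupported character collapses into a single `?` and cannot be recovered from the bytes alone.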

Provider:

sumanthbhargava
