sumanthbhargava/cc-news-parallel
Hugging Face · Updated 2026-03-04 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sumanthbhargava/cc-news-parallel
Resource description:
---
license: other
task_categories:
- text2text-generation
language:
- en
tags:
- encoding-errors
- parallel-corpus
- denoising
- nlp-research
---
# CC-News Encoding Error Parallel Corpus
This is a research dataset created for a Master's thesis on denoising character-level
encoding errors in text using Large Language Models.
## Dataset Description
The dataset is derived from [cc_news](https://huggingface.co/datasets/cc_news),
an English news corpus based on the Common Crawl news crawl.
Clean UTF-8 text was synthetically corrupted to simulate real-world encoding errors
that occurred during the UTF-8 transition period (circa 2000-2015).
## Corruption Variants
| File | Description | Example |
|------|-------------|---------|
| `data/cc_news_utf8_read_as_ascii.parquet` | UTF-8 bytes decoded as ASCII; each byte > 127 becomes a replacement character | `ä -> ��` |
| `data/cc_news_utf8_read_as_latin1.parquet` | UTF-8 bytes decoded as Latin-1; produces mojibake (garbled text) | `ä -> Ã¤` |
| `data/cc_news_unicode_written_as_ascii.parquet` | Unicode written as ASCII; one `?` per unsupported character | `ä -> ?` |
| `data/cc_news_unicode_written_as_latin1.parquet` | Unicode written as Latin-1; `?` only for characters outside Latin-1 | `€ -> ?` |
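
The four variants in the table above can be approximated with Python's built-in codecs. This is a minimal sketch, not the script used to build the dataset: the function names are hypothetical, and the use of the `replace` error handler is an assumption inferred from the descriptions and examples in the table.

```python
# Sketch of the four corruption variants using only the standard library.
# Function names are illustrative; the dataset's actual generation code
# is not shown in this card.

def utf8_read_as_ascii(text: str) -> str:
    """Encode as UTF-8, then decode as ASCII; each byte > 127 becomes U+FFFD."""
    return text.encode("utf-8").decode("ascii", errors="replace")


def utf8_read_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode as Latin-1; multi-byte chars turn into mojibake."""
    return text.encode("utf-8").decode("latin-1")


def unicode_written_as_ascii(text: str) -> str:
    """Write as ASCII; every non-ASCII character becomes a single '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def unicode_written_as_latin1(text: str) -> str:
    """Write as Latin-1; only characters outside Latin-1 become '?'."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


if __name__ == "__main__":
    sample = "ä costs 5 €"
    for corrupt in (
        utf8_read_as_ascii,
        utf8_read_as_latin1,
        unicode_written_as_ascii,
        unicode_written_as_latin1,
    ):
        # repr() makes replacement and control characters visible.
        print(f"{corrupt.__name__}: {corrupt(sample)!r}")
```

Note that the two "read as" variants change the character count (one corrupted character per UTF-8 byte), while the two "written as" variants substitute one `?` per unsupported character.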
## Schema
Each Parquet file contains the following fields:
| Field | Description |
|-------|-------------|
| `id` | Sequential record ID |
| `url` | Source URL |
| `date` | Crawl date |
| `title` | Article title |
| `text` | Clean UTF-8 text (ground truth) |
| `word_count` | Whitespace-tokenised word count of clean text |
| `corrupted_text` | Synthetically corrupted text |
| `corrupted_word_count` | Word count of corrupted text |
| `affected_clean_words` | Number of clean words changed by corruption |
| `corrupt_word_count` | Number of corrupted words that differ from clean |
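
The word-level fields can be illustrated with a short sketch. The card does not state how clean and corrupted words are aligned, so the `difflib`-based alignment below is an assumption, and `word_stats` is a hypothetical helper rather than the dataset's own code. Only the whitespace tokenisation is taken from the schema.

```python
from difflib import SequenceMatcher


def word_stats(clean: str, corrupted: str) -> dict:
    """Compute schema-style word statistics for one clean/corrupted pair.

    Assumption: words are aligned with difflib.SequenceMatcher; any word
    outside an 'equal' block counts as changed.
    """
    clean_words = clean.split()          # whitespace tokenisation
    corrupted_words = corrupted.split()

    matcher = SequenceMatcher(a=clean_words, b=corrupted_words, autojunk=False)
    affected_clean = 0   # clean words that no longer appear unchanged
    corrupt = 0          # corrupted words that differ from the clean text
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            affected_clean += i2 - i1
            corrupt += j2 - j1

    return {
        "word_count": len(clean_words),
        "corrupted_word_count": len(corrupted_words),
        "affected_clean_words": affected_clean,
        "corrupt_word_count": corrupt,
    }


if __name__ == "__main__":
    clean = "Café société opens today"
    # Simulate the "UTF-8 read as Latin-1" variant for this example.
    corrupted = clean.encode("utf-8").decode("latin-1")
    print(word_stats(clean, corrupted))
```

One caveat with whitespace tokenisation: some mojibake sequences contain characters that Python treats as whitespace (for example U+00A0), so `corrupted_word_count` can differ from `word_count` even when no word was added or removed.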
## Source & License
- Source corpus: [cc_news](https://huggingface.co/datasets/cc_news) via Common Crawl
- Underlying data subject to [Common Crawl Terms of Use](https://commoncrawl.org/terms-of-use/)
- Intended for **non-commercial academic research only**
Provided by:
sumanthbhargava



