
sumanthbhargava/cc-news-parallel

Hugging Face · Updated 2026-03-04 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sumanthbhargava/cc-news-parallel
Resource overview:
---
license: other
task_categories:
- text2text-generation
language:
- en
tags:
- encoding-errors
- parallel-corpus
- denoising
- nlp-research
---

# CC-News Encoding Error Parallel Corpus

This is a research dataset created for a Master's thesis on denoising character-level encoding errors in text using Large Language Models.

## Dataset Description

The dataset is derived from [cc_news](https://huggingface.co/datasets/cc_news), an English news corpus based on the Common Crawl news crawl. Clean UTF-8 text was synthetically corrupted to simulate real-world encoding errors that occurred during the UTF-8 transition period (circa 2000-2015).

## Corruption Variants

| File | Description | Example |
|------|-------------|---------|
| `data/cc_news_utf8_read_as_ascii.parquet` | UTF-8 bytes decoded as ASCII: each byte > 127 becomes a replacement character | `ä -> ��` |
| `data/cc_news_utf8_read_as_latin1.parquet` | UTF-8 bytes decoded as Latin-1: mojibake (garbled text) | `ä -> Ã¤` |
| `data/cc_news_unicode_written_as_ascii.parquet` | Unicode written as ASCII: one `?` per unsupported character | `ä -> ?` |
| `data/cc_news_unicode_written_as_latin1.parquet` | Unicode written as Latin-1: `?` only for characters outside Latin-1 | `€ -> ?` |

## Schema

Each Parquet file contains the following fields:

| Field | Description |
|-------|-------------|
| `id` | Sequential record ID |
| `url` | Source URL |
| `date` | Crawl date |
| `title` | Article title |
| `text` | Clean UTF-8 text (ground truth) |
| `word_count` | Whitespace-tokenised word count of clean text |
| `corrupted_text` | Synthetically corrupted text |
| `corrupted_word_count` | Word count of corrupted text |
| `affected_clean_words` | Number of clean words changed by corruption |
| `corrupt_word_count` | Number of corrupted words that differ from clean |

## Source & License

- Source corpus: [cc_news](https://huggingface.co/datasets/cc_news) via Common Crawl
- Underlying data subject to [Common Crawl Terms of Use](https://commoncrawl.org/terms-of-use/)
- Intended for **non-commercial academic research only**
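
The four corruption variants listed under Corruption Variants can be approximated with Python's built-in codecs. The sketch below is an illustration only, not the thesis's actual corruption pipeline. The function names are hypothetical, chosen to mirror the Parquet file names. It assumes Python's `latin-1` codec (ISO-8859-1) stands in for "Latin-1"; real-world mojibake often involves Windows-1252 instead, which maps some bytes differently.

```python
# Hypothetical helpers mirroring the four corruption variants.
# Assumption: "Latin-1" means ISO-8859-1 as implemented by Python's codec.

def utf8_read_as_ascii(text: str) -> str:
    """Encode as UTF-8, then decode as ASCII.

    Every byte above 127 is invalid ASCII and is replaced by U+FFFD,
    so a two-byte character yields two replacement characters.
    """
    return text.encode("utf-8").decode("ascii", errors="replace")


def utf8_read_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode as Latin-1 (classic mojibake).

    Latin-1 maps every byte to a character, so no error handler is needed.
    """
    return text.encode("utf-8").decode("latin-1")


def unicode_written_as_ascii(text: str) -> str:
    """Write text with an ASCII encoder: one '?' per non-ASCII character."""
    return text.encode("ascii", errors="replace").decode("ascii")


def unicode_written_as_latin1(text: str) -> str:
    """Write text with a Latin-1 encoder: '?' only for characters it lacks."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


if __name__ == "__main__":
    sample = "Käse costs 5 €"
    for corrupt in (
        utf8_read_as_ascii,
        utf8_read_as_latin1,
        unicode_written_as_ascii,
        unicode_written_as_latin1,
    ):
        # repr() makes replacement and control characters visible.
        print(f"{corrupt.__name__}: {corrupt(sample)!r}")
```

The two "read as" variants change the length of the text because they operate on bytes, while the two "written as" variants substitute one `?` per character, which is why the dataset records separate word counts for clean and corrupted text.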
Provider:
sumanthbhargava