qikp/wordmix

Name: qikp/wordmix
Creator: qikp
Published: 2026-04-01 18:55:39
License: No description available

Hugging Face. Updated 2026-04-01; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/qikp/wordmix



Resource description:

---
task_categories:
- text-generation
language:
- en
tags:
- tokenization
- code
size_categories:
- 10K<n<100K
---

# wordmix

wordmix is an aggregate dataset containing a diverse selection of content, including essays, synthetic textbooks, code, and satire news articles. It is largely intended for tokenizers.

## Size

wordmix contains around 51 million GPT-2 tokens.

### Dataset Processing Summary*

* **Crownelius/Creative-Writing-Sonnet4.6-800x**
  * **Column:** `response`
  * **Amount:** 800 rows (Full train split)
  * **Changes:** None; raw response extraction.
* **Crownelius/Creative-Writing-Reasoning-KimiK2.5-600x**
  * **Column:** `response`
  * **Amount:** 600 rows (Full train split)
  * **Changes:** None; raw response extraction.
* **vietdata/fineweb-mini**
  * **Column:** `text`
  * **Amount:** Full train split
  * **Changes:** None; raw text extraction.
* **HuggingFaceTB/cosmopedia-20k**
  * **Column:** `text`
  * **Amount:** 5,000 rows (Sliced from train)
  * **Changes:** Subsampled to the first 5,000 entries.
* **qikp/digits**
  * **Column:** `text`
  * **Amount:** 10,000 rows (Sliced from train)
  * **Changes:** Subsampled to the first 10,000 entries.
* **jingjietan/essays-big5**
  * **Column:** `text`
  * **Amount:** Full train split
  * **Changes:** None; raw text extraction.
* **Biddls/Onion_News**
  * **Column:** `text`
  * **Amount:** 3,000 rows (Sliced from train)
  * **Changes:**
    * Limited to the first 3,000 entries.
    * **String Manipulation:** Each entry is split by the delimiter `#~#`.
    * **Selection:** Only the second element (index 1) is kept.
    * **Cleaning:** Leading whitespace is removed via `.lstrip()`.

*_List generated by a language model._
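
The Biddls/Onion_News steps listed above (keep the first 3,000 entries, split on `#~#`, keep index 1, apply `.lstrip()`) can be sketched in plain Python. This is a minimal illustration, not the script actually used to build wordmix: the function names `take_first`, `clean_onion_entry`, and `process_onion` are hypothetical, and the sample strings are made up. It also assumes every entry contains the delimiter at least once; the card does not say how entries without it were handled.

```python
from itertools import islice
from typing import Iterable, List

DELIMITER = "#~#"  # delimiter named in the dataset card


def take_first(rows: Iterable[str], limit: int) -> List[str]:
    """Return the first `limit` rows (hypothetical helper for the slicing step)."""
    return list(islice(rows, limit))


def clean_onion_entry(raw: str) -> str:
    """Split on the delimiter, keep the second element, strip leading whitespace.

    Assumption: the entry contains the delimiter at least once. An entry
    without it raises IndexError here.
    """
    parts = raw.split(DELIMITER)
    return parts[1].lstrip()


def process_onion(rows: Iterable[str], limit: int = 3000) -> List[str]:
    """Apply the slice-then-clean steps in the order the card lists them."""
    return [clean_onion_entry(row) for row in take_first(rows, limit)]


# Made-up sample rows, for illustration only.
sample_rows = [
    "first part #~# second part",
    "alpha #~#   beta #~# gamma",
    "ignored #~# never reached when limit is 2",
]
cleaned = process_onion(sample_rows, limit=2)
```

The same `take_first` helper would also cover the "first 5,000 entries" and "first 10,000 entries" subsampling described for HuggingFaceTB/cosmopedia-20k and qikp/digits. Note that `.lstrip()` removes only leading whitespace, so any trailing whitespace in the kept element is preserved.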

Provider:

qikp
