qikp/wordmix
Source: Hugging Face. Updated: 2026-04-01. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/qikp/wordmix
Resource summary:
---
task_categories:
- text-generation
language:
- en
tags:
- tokenization
- code
size_categories:
- 10K<n<100K
---
# wordmix
wordmix is an aggregate dataset drawing on a diverse selection of content, including essays, synthetic textbooks, code, and satirical news articles. It is intended primarily for tokenizer development.
## Size
wordmix contains around 51 million GPT-2 tokens.
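
A minimal sketch of how a token count like this could be reproduced. The card does not say which tool was used, so the encoder is passed in as a callable; for GPT-2 one plausible choice would be `tiktoken.get_encoding("gpt2").encode`, which is an assumption here. To stay dependency-free, the demo below uses a hypothetical whitespace splitter as a stand-in, so its count is not a GPT-2 token count.

```python
from typing import Callable, Iterable, List


def count_tokens(texts: Iterable[str], encode: Callable[[str], List]) -> int:
    """Sum the number of tokens the given encoder produces for each text."""
    return sum(len(encode(text)) for text in texts)


def whitespace_encode(text: str) -> List[str]:
    """Hypothetical stand-in encoder: splits on whitespace, NOT GPT-2 BPE."""
    return text.split()


# Made-up sample texts; the real input would be the dataset's text rows.
sample_texts = ["wordmix mixes essays and code", "def f(x): return x"]
total = count_tokens(sample_texts, whitespace_encode)
print(total)
```

Passing the encoder as an argument keeps the counting logic independent of any particular tokenizer library.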
## Dataset Processing Summary*
* **Crownelius/Creative-Writing-Sonnet4.6-800x**
  * **Column:** `response`
  * **Amount:** 800 rows (Full train split)
  * **Changes:** None; raw response extraction.
* **Crownelius/Creative-Writing-Reasoning-KimiK2.5-600x**
  * **Column:** `response`
  * **Amount:** 600 rows (Full train split)
  * **Changes:** None; raw response extraction.
* **vietdata/fineweb-mini**
  * **Column:** `text`
  * **Amount:** Full train split
  * **Changes:** None; raw text extraction.
* **HuggingFaceTB/cosmopedia-20k**
  * **Column:** `text`
  * **Amount:** 5,000 rows (Sliced from train)
  * **Changes:** Subsampled to the first 5,000 entries.
* **qikp/digits**
  * **Column:** `text`
  * **Amount:** 10,000 rows (Sliced from train)
  * **Changes:** Subsampled to the first 10,000 entries.
* **jingjietan/essays-big5**
  * **Column:** `text`
  * **Amount:** Full train split
  * **Changes:** None; raw text extraction.
* **Biddls/Onion_News**
  * **Column:** `text`
  * **Amount:** 3,000 rows (Sliced from train)
  * **Changes:**
    * Limited to the first 3,000 entries.
    * **String Manipulation:** Each entry is split by the delimiter `#~#`.
    * **Selection:** Only the second element (index 1) is kept.
    * **Cleaning:** Leading whitespace is removed via `.lstrip()`.
*_List generated by a language model._
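
The processing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's actual script: rows are modeled as plain dictionaries, the helper names `extract_column` and `clean_onion_entry` are hypothetical, and the sample rows are made up. The card also does not say what happens to an entry that lacks the `#~#` delimiter, so raising an error in that case is an assumption.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, List, Optional

DELIMITER = "#~#"  # delimiter named in the Biddls/Onion_News entry


def extract_column(
    rows: Iterable[Dict[str, str]],
    column: str,
    limit: Optional[int] = None,
    transform: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Take one column from the first `limit` rows (all rows if None)."""
    selected = islice(rows, limit)  # islice(rows, None) yields every row
    texts = [row[column] for row in selected]
    if transform is not None:
        texts = [transform(text) for text in texts]
    return texts


def clean_onion_entry(entry: str) -> str:
    """Split on the delimiter, keep index 1, strip leading whitespace."""
    parts = entry.split(DELIMITER)
    if len(parts) < 2:
        # Assumption: the card does not describe this case.
        raise ValueError("entry does not contain the '#~#' delimiter")
    return parts[1].lstrip()


# Made-up rows standing in for the real source datasets.
response_rows = [{"response": "A short story."}, {"response": "Another story."}]
onion_rows = [
    {"text": "first field #~#   second field"},
    {"text": "alpha#~#\tbeta#~#gamma"},
    {"text": "ignored#~#dropped by the limit"},
]

# "Full train split" sources: no limit, no transform.
full_split = extract_column(response_rows, "response")
# Sliced source with cleaning (the real limit for Onion_News is 3,000).
onion_texts = extract_column(onion_rows, "text", limit=2, transform=clean_onion_entry)

print(full_split)
print(onion_texts)
```

The same `extract_column` call covers the sliced sources (for example, `limit=5000` for cosmopedia-20k or `limit=10000` for qikp/digits) without a transform.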
Provider:
qikp



