
qikp/wordmix

Hugging Face | Updated 2026-04-01 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/qikp/wordmix
Resource description:
---
task_categories:
- text-generation
language:
- en
tags:
- tokenization
- code
size_categories:
- 10K<n<100K
---

# wordmix

wordmix is an aggregate dataset containing a diverse selection of content, including essays, synthetic textbooks, code, and satire news articles. It is largely intended for tokenizers.

## Size

wordmix contains around 51 million GPT-2 tokens.

### Dataset Processing Summary*

* **Crownelius/Creative-Writing-Sonnet4.6-800x**
  * **Column:** `response`
  * **Amount:** 800 rows (Full train split)
  * **Changes:** None; raw response extraction.
* **Crownelius/Creative-Writing-Reasoning-KimiK2.5-600x**
  * **Column:** `response`
  * **Amount:** 600 rows (Full train split)
  * **Changes:** None; raw response extraction.
* **vietdata/fineweb-mini**
  * **Column:** `text`
  * **Amount:** Full train split
  * **Changes:** None; raw text extraction.
* **HuggingFaceTB/cosmopedia-20k**
  * **Column:** `text`
  * **Amount:** 5,000 rows (Sliced from train)
  * **Changes:** Subsampled to the first 5,000 entries.
* **qikp/digits**
  * **Column:** `text`
  * **Amount:** 10,000 rows (Sliced from train)
  * **Changes:** Subsampled to the first 10,000 entries.
* **jingjietan/essays-big5**
  * **Column:** `text`
  * **Amount:** Full train split
  * **Changes:** None; raw text extraction.
* **Biddls/Onion_News**
  * **Column:** `text`
  * **Amount:** 3,000 rows (Sliced from train)
  * **Changes:**
    * Limited to the first 3,000 entries.
    * **String Manipulation:** Each entry is split by the delimiter `#~#`.
    * **Selection:** Only the second element (index 1) is kept.
    * **Cleaning:** Leading whitespace is removed via `.lstrip()`.

*_List generated by a language model._
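
The Biddls/Onion_News steps in the processing summary (take the first 3,000 entries, split on `#~#`, keep index 1, strip leading whitespace) could look roughly like the sketch below. This is not the author's actual script: the function names `clean_onion_entry` and `process_onion` and the sample strings are hypothetical, and the sketch assumes every entry contains the delimiter (an entry without it would raise `IndexError`).

```python
from itertools import islice
from typing import Iterable, List

DELIMITER = "#~#"  # delimiter named in the processing summary


def clean_onion_entry(entry: str) -> str:
    """Split on the delimiter, keep the second element, strip leading whitespace."""
    parts = entry.split(DELIMITER)
    # Assumption: the delimiter is always present, so index 1 exists.
    return parts[1].lstrip()


def process_onion(rows: Iterable[str], limit: int = 3000) -> List[str]:
    """Apply the cleaning step to the first `limit` entries only."""
    return [clean_onion_entry(row) for row in islice(rows, limit)]


# Hypothetical sample rows, only to illustrate the shape of the data.
sample = [
    "Headline one #~#  Body of the first article.",
    "Headline two #~# Body of the second article.",
]
print(process_onion(sample, limit=1))  # ['Body of the first article.']
```

Only leading whitespace is removed, matching the `.lstrip()` call named in the summary; trailing whitespace in an entry is left untouched.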
Provider:
qikp