Zoont/english-gutenberg-cleaned-tokenized
Hugging Face | Updated 2025-12-30 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Zoont/english-gutenberg-cleaned-tokenized
Description:
---
license: cc-by-nc-sa-4.0
language:
- en
pretty_name: Gutenberg
size_categories:
- 1M<n<10M
---
### Source dataset: [incredible45/Gutenberg-BookCorpus-Cleaned-Data-English](https://huggingface.co/datasets/incredible45/Gutenberg-BookCorpus-Cleaned-Data-English)
### Tokenizer: [microsoft/Phi-tiny-MoE-instruct](https://huggingface.co/microsoft/Phi-tiny-MoE-instruct)
- Each row contains exactly 2048 tokens.
- An EOS token is added after each book.
- All occurrences of "This eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever. You may copy it, give it away or" were removed; the source cleaned dataset had left this text in place.
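
The preprocessing described above can be illustrated with a minimal sketch. This is not the author's script: the function names `strip_boilerplate` and `pack_rows` are hypothetical, the `\n` in the quoted string is assumed to stand for a real newline, and dropping the final partial row is an assumption made so that every row has the same length. Token IDs are plain integers here; in practice they would come from the tokenizer linked above.

```python
from typing import Iterable, Iterator, List

# The leftover licence text quoted in the card.
# Assumption: "\n" in the card denotes a real newline character.
BOILERPLATE = (
    "This eBook is for the use of anyone anywhere at no cost and with\n"
    "almost no restrictions whatsoever. You may copy it, give it away or"
)


def strip_boilerplate(text: str) -> str:
    """Remove every occurrence of the leftover licence text from a book."""
    return text.replace(BOILERPLATE, "")


def pack_rows(
    books: Iterable[List[int]], eos_id: int, row_len: int = 2048
) -> Iterator[List[int]]:
    """Concatenate tokenized books, append EOS after each, emit fixed-size rows."""
    buffer: List[int] = []
    for tokens in books:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the end of this book
        # Emit as many full rows as the buffer currently holds.
        while len(buffer) >= row_len:
            yield buffer[:row_len]
            buffer = buffer[row_len:]
    # Assumption: tokens left in the buffer (a partial row) are discarded.


if __name__ == "__main__":
    # Toy token IDs and a tiny row length, purely for illustration.
    toy_books = [[1, 2, 3], [4, 5]]
    for row in pack_rows(toy_books, eos_id=0, row_len=3):
        print(row)
```

With this packing scheme a row can span a book boundary, so the EOS token is the only signal of where one book ends and the next begins.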
Provider:
Zoont



