
mhla/pre1900-training

Source: Hugging Face. Updated 2026-02-21. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mhla/pre1900-training
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
---

# Pre-1900 Training Corpus

Chunked and resharded pre-1900 English text corpus, ready for language model training.

## Format

- **266 parquet shards** (265 train + 1 validation)
- **12.8M documents** (chunks of ≤8,000 characters)
- **~22B tokens** estimated
- **Text-only**: single `text` column per row
- Row groups divisible by 8 for even DDP distribution across GPUs
- Last shard (`shard_00265`) is the validation split

## Processing Pipeline

Built from the full pre-1900 filtered corpus through:

1. **OCR cleanup**: removal of OCR artifacts, boilerplate, and unicode normalization
2. **Quality filtering**: token frequency prior-based filtering
3. **Anachronism detection**: three-tier post-1900 physics filter
4. **Document chunking**: long documents split at paragraph/sentence boundaries (max 8K chars, min 200 chars)
5. **Token balancing**: sort-by-length + round-robin distribution across shards for even token counts

## Usage

```python
from datasets import load_dataset

ds = load_dataset("mhla/pre1900-training")
```

## Related

- [`mhla/pre1900-corpus`](https://huggingface.co/datasets/mhla/pre1900-corpus): full documents with metadata (title, year, source, OCR scores)
- [`mhla/gpt1900-d26-8btok`](https://huggingface.co/mhla/gpt1900-d26-8btok): GPT-1900 model trained on this data
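
Steps 4 and 5 of the processing pipeline (document chunking and token balancing) can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_units`, `chunk_document`, and `balance_shards` are hypothetical, the regular expressions are a simple stand-in for whatever boundary detection the real pipeline uses, and character length is assumed here as a rough proxy for token count. Only the 8,000 and 200 character limits come from the dataset card.

```python
import re

MAX_CHARS = 8000  # maximum chunk size, from the dataset card
MIN_CHARS = 200   # minimum chunk size, from the dataset card


def split_units(text, max_chars=MAX_CHARS):
    """Break text into (separator, unit) pairs, each unit <= max_chars.

    Paragraphs are kept whole when they fit; otherwise they are split
    into sentences, and over-long sentences are hard-sliced.
    The regexes are assumptions, not the pipeline's actual rules.
    """
    units = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            units.append(("\n\n", para))
            continue
        sep = "\n\n"  # first piece of an over-long paragraph starts a paragraph
        for sent in re.split(r"(?<=[.!?])\s+", para):
            for i in range(0, len(sent), max_chars):
                units.append((sep, sent[i:i + max_chars]))
                sep = ""  # later slices continue the same sentence
            sep = " "  # next sentence in the same paragraph
    return units


def chunk_document(text, max_chars=MAX_CHARS, min_chars=MIN_CHARS):
    """Greedily pack units into chunks of at most max_chars characters,
    then drop chunks shorter than min_chars."""
    chunks, current = [], ""
    for sep, unit in split_units(text, max_chars):
        candidate = unit if not current else current + sep + unit
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return [c for c in chunks if len(c) >= min_chars]


def balance_shards(chunks, n_shards):
    """Sort chunks by length (longest first) and deal them round-robin,
    so each shard receives a similar total length."""
    shards = [[] for _ in range(n_shards)]
    for i, chunk in enumerate(sorted(chunks, key=len, reverse=True)):
        shards[i % n_shards].append(chunk)
    return shards
```

As a usage example, `balance_shards(chunk_document(long_text), n_shards=4)` would return four lists of chunks with roughly equal total character counts. A real pipeline would more likely balance on tokenizer counts rather than characters, and would also need to pad or trim shards to satisfy the row-group constraint described under Format.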
Provider:
mhla