
Abzalbek89/corpus_clean_tokenized

Source: Hugging Face. Updated: 2026-04-15. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Abzalbek89/corpus_clean_tokenized
Description:
---
language:
- kk
license: apache-2.0
task_categories:
- text-generation
size_categories:
- 100K<n<1M
tags:
- kazakh
- tokenized
- language-modeling
- pretraining
dataset_info:
  features:
  - name: input_ids
    sequence: int32
  - name: labels
    sequence: int32
  splits:
  - name: train
    num_examples: 236981
  - name: validation
    num_examples: 12473
---

# Kazakh Tokenized Corpus (2048 blocks)

Pre-tokenized Kazakh corpus ready for language model training. Built from [Abzalbek89/corpus_clean](https://huggingface.co/datasets/Abzalbek89/corpus_clean) using [Abzalbek89/kk-tokenizer-bpe-32k](https://huggingface.co/Abzalbek89/kk-tokenizer-bpe-32k).

## Dataset Summary

| Metric | Value |
|---|---|
| **Train blocks** | 236,981 |
| **Validation blocks** | 12,473 |
| **Total blocks** | 249,454 |
| **Block size** | 2,048 tokens |
| **Total tokens** | ~0.51B (510M) |
| **Tokenizer** | ByteLevel BPE, 32K vocab |
| **Val ratio** | 5% |

## Pipeline

1. **Source:** [Abzalbek89/corpus_clean](https://huggingface.co/datasets/Abzalbek89/corpus_clean), 1,502,583 cleaned Kazakh texts
2. **Chunking:** Long texts split into chunks of up to 20K characters at paragraph/word boundaries (1,548,725 chunks)
3. **Tokenization:** ByteLevel BPE tokenizer ([Abzalbek89/kk-tokenizer-bpe-32k](https://huggingface.co/Abzalbek89/kk-tokenizer-bpe-32k), vocab=32,000)
4. **Grouping:** All token sequences concatenated and split into fixed blocks of 2,048 tokens (no padding, no truncation)
5. **Split:** 95% train / 5% validation (seed=42)

## Fields

- **`input_ids`** (`list[int]`): token IDs, length 2,048
- **`labels`** (`list[int]`): copy of `input_ids` (for causal LM training)

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Abzalbek89/corpus_clean_tokenized")
train = ds["train"]
val = ds["validation"]

# Each example is a fixed-size block of 2048 tokens
print(len(train[0]["input_ids"]))  # 2048
print(len(train[0]["labels"]))     # 2048
```

### Training with HuggingFace Trainer

```python
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset

ds = load_dataset("Abzalbek89/corpus_clean_tokenized")
model = AutoModelForCausalLM.from_pretrained("your-model")

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-4,
    logging_steps=100,
    save_steps=1000,
    eval_strategy="steps",
    eval_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
)
trainer.train()
```

## Source Data

| Component | Link |
|---|---|
| Raw corpus | [Abzalbek89/corpus_clean](https://huggingface.co/datasets/Abzalbek89/corpus_clean) |
| Tokenizer | [Abzalbek89/kk-tokenizer-bpe-32k](https://huggingface.co/Abzalbek89/kk-tokenizer-bpe-32k) |

## License

Apache 2.0
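
The grouping and split steps of the pipeline (steps 4 and 5) can be illustrated with a small pure-Python sketch. This is not the author's actual build script, which the card does not show. The helper names `group_into_blocks` and `split_blocks` are hypothetical, and two details are assumptions: that the trailing remainder shorter than one block is dropped, and that the split is a seeded shuffle followed by a cut.

```python
import random

BLOCK_SIZE = 2048  # block size stated in the dataset summary


def group_into_blocks(token_seqs, block_size=BLOCK_SIZE):
    """Concatenate token sequences and cut the stream into full blocks.

    Assumption: a trailing remainder shorter than block_size is dropped,
    so no block is ever padded.
    """
    buffer = []
    blocks = []
    for seq in token_seqs:
        buffer.extend(seq)
        while len(buffer) >= block_size:
            chunk = buffer[:block_size]
            # labels is a copy of input_ids, as described under "Fields"
            blocks.append({"input_ids": chunk, "labels": list(chunk)})
            buffer = buffer[block_size:]
    return blocks


def split_blocks(blocks, val_ratio=0.05, seed=42):
    """Seeded shuffle, then cut off val_ratio of the blocks for validation.

    Assumption: the real pipeline may use a different shuffling routine;
    only the ratio and the seed value come from the dataset card.
    """
    shuffled = list(blocks)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]


# Toy example: 10 tokens in three sequences, block size 4
toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
toy_blocks = group_into_blocks(toy_sequences, block_size=4)
print(len(toy_blocks))             # 2 full blocks; tokens 9 and 10 are dropped
print(toy_blocks[0]["input_ids"])  # [1, 2, 3, 4]
```

The summary figures are consistent with this scheme: 236,981 + 12,473 = 249,454 blocks, 249,454 × 2,048 = 510,881,792 tokens (about 0.51B), and 12,473 / 249,454 is about 5%.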
Provided by:
Abzalbek89