Abzalbek89/corpus_clean_tokenized
Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Abzalbek89/corpus_clean_tokenized
Resource description:
---
language:
- kk
license: apache-2.0
task_categories:
- text-generation
size_categories:
- 100K<n<1M
tags:
- kazakh
- tokenized
- language-modeling
- pretraining
dataset_info:
  features:
  - name: input_ids
    sequence: int32
  - name: labels
    sequence: int32
  splits:
  - name: train
    num_examples: 236981
  - name: validation
    num_examples: 12473
---
# Kazakh Tokenized Corpus (2,048-token blocks)
Pre-tokenized Kazakh corpus ready for language model training. Built from [Abzalbek89/corpus_clean](https://huggingface.co/datasets/Abzalbek89/corpus_clean) using [Abzalbek89/kk-tokenizer-bpe-32k](https://huggingface.co/Abzalbek89/kk-tokenizer-bpe-32k).
## Dataset Summary
| Metric | Value |
|---|---|
| **Train blocks** | 236,981 |
| **Validation blocks** | 12,473 |
| **Total blocks** | 249,454 |
| **Block size** | 2,048 tokens |
| **Total tokens** | ~0.51B (249,454 × 2,048 = 510,881,792) |
| **Tokenizer** | ByteLevel BPE, 32K vocab |
| **Val ratio** | 5% |
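
The figures in the table follow from the block counts and the block size. A minimal sketch of that arithmetic, using only numbers copied from the table (the variable names are illustrative):

```python
# Figures copied from the summary table; variable names are illustrative.
train_blocks = 236_981
val_blocks = 12_473
block_size = 2_048

total_blocks = train_blocks + val_blocks
total_tokens = total_blocks * block_size
val_ratio = val_blocks / total_blocks

print(total_blocks)         # 249454
print(total_tokens)         # 510881792
print(round(val_ratio, 4))  # 0.05
```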
## Pipeline
1. **Source:** [Abzalbek89/corpus_clean](https://huggingface.co/datasets/Abzalbek89/corpus_clean) — 1,502,583 cleaned Kazakh texts
2. **Chunking:** Long texts split into chunks of up to 20K characters at paragraph/word boundaries (1,548,725 chunks)
3. **Tokenization:** ByteLevel BPE tokenizer ([Abzalbek89/kk-tokenizer-bpe-32k](https://huggingface.co/Abzalbek89/kk-tokenizer-bpe-32k), vocab=32,000)
4. **Grouping:** All token sequences concatenated and split into fixed blocks of 2,048 tokens (no padding, no truncation)
5. **Split:** 95% train / 5% validation (seed=42)
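
The grouping and split steps can be sketched in plain Python. This is not the script used to build the dataset: the helper names `group_into_blocks` and `split_blocks` are hypothetical, dropping the trailing partial block is an assumption (the card only says "no padding, no truncation"), and the shuffle-based split stands in for whatever splitting utility was actually used with `seed=42`.

```python
import random
from itertools import chain


def group_into_blocks(token_sequences, block_size):
    """Concatenate token sequences and cut them into equal-size blocks."""
    flat = list(chain.from_iterable(token_sequences))
    # Assumption: tokens that do not fill a whole block are dropped.
    usable = (len(flat) // block_size) * block_size
    blocks = []
    for start in range(0, usable, block_size):
        ids = flat[start:start + block_size]
        # `labels` is a copy of `input_ids`, as described under Fields.
        blocks.append({"input_ids": ids, "labels": list(ids)})
    return blocks


def split_blocks(blocks, val_ratio=0.05, seed=42):
    """Shuffle with a fixed seed and hold out `val_ratio` of the blocks."""
    shuffled = list(blocks)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]


# Toy data: 11 tokens in total, grouped into blocks of 4.
toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11]]
blocks = group_into_blocks(toy_sequences, block_size=4)
train, val = split_blocks(blocks, val_ratio=0.5)

print([b["input_ids"] for b in blocks])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(len(train), len(val))              # 1 1
```

Because sequences are concatenated before cutting, a single block can span the boundary between two source texts.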
## Fields
- **`input_ids`** (`list[int]`) — token IDs, length 2,048
- **`labels`** (`list[int]`) — copy of `input_ids` (for causal LM training)
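
Storing `labels` as an unshifted copy works because causal LM heads in `transformers` typically shift the labels internally, so the token at position `t` is trained to predict the token at position `t + 1`. The sketch below illustrates that pairing on a toy block; `next_token_pairs` is a hypothetical helper, not part of any library.

```python
def next_token_pairs(input_ids, labels):
    """Pair each context token with the label it is trained to predict."""
    # Position t predicts labels[t + 1]; the last token has no target.
    return list(zip(input_ids[:-1], labels[1:]))


toy_block = {"input_ids": [10, 11, 12, 13], "labels": [10, 11, 12, 13]}
pairs = next_token_pairs(toy_block["input_ids"], toy_block["labels"])

print(pairs)       # [(10, 11), (11, 12), (12, 13)]
print(len(pairs))  # 3
```

By the same logic, a 2,048-token block yields 2,047 prediction targets.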
## Usage
```python
from datasets import load_dataset
ds = load_dataset("Abzalbek89/corpus_clean_tokenized")
train = ds["train"]
val = ds["validation"]
# Each example is a fixed-size block of 2048 tokens
print(len(train[0]["input_ids"])) # 2048
print(len(train[0]["labels"])) # 2048
```
### Training with HuggingFace Trainer
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Blocks already carry `input_ids` and `labels`, so no tokenization step is needed.
ds = load_dataset("Abzalbek89/corpus_clean_tokenized")

# Placeholder: replace "your-model" with a causal LM whose vocabulary matches the 32K tokenizer.
model = AutoModelForCausalLM.from_pretrained("your-model")

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-4,
    logging_steps=100,
    save_steps=1000,
    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    eval_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
)
trainer.train()
```
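
To get a rough sense of the scale of the run configured above, the step counts can be estimated from the split sizes. This sketch assumes a single device, no gradient accumulation, and the `Trainer` defaults of keeping the last partial batch and an evaluation batch size of 8; multi-GPU setups or gradient accumulation change these figures.

```python
import math

# Figures from the dataset card and the TrainingArguments above.
train_blocks = 236_981
val_blocks = 12_473
block_size = 2_048
per_device_train_batch_size = 8
per_device_eval_batch_size = 8  # assumption: Trainer default
num_train_epochs = 3

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_blocks / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
tokens_per_step = per_device_train_batch_size * block_size
eval_batches = math.ceil(val_blocks / per_device_eval_batch_size)

print(steps_per_epoch)  # 29623
print(total_steps)      # 88869
print(tokens_per_step)  # 16384
print(eval_batches)     # 1560
```

With `eval_steps=500`, each evaluation pass walks all 1,560 validation batches, so a larger `eval_steps` value may be worth considering if evaluation time dominates.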
## Source Data
| Component | Link |
|---|---|
| Raw corpus | [Abzalbek89/corpus_clean](https://huggingface.co/datasets/Abzalbek89/corpus_clean) |
| Tokenizer | [Abzalbek89/kk-tokenizer-bpe-32k](https://huggingface.co/Abzalbek89/kk-tokenizer-bpe-32k) |
## License
Apache 2.0
Provider:
Abzalbek89



