tvu-vlinhd11/pretrain-dataset-T2048-13B
Source: Hugging Face | Updated: 2025-12-10 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/tvu-vlinhd11/pretrain-dataset-T2048-13B
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
- vi
tags:
- pretrain
- tokenized
- packed-sequences
size_categories:
- 1M<n<10M
---
# Pretrain Dataset (Tokenized)
This dataset contains tokenized and packed sequences ready for LLM pretraining.
## Dataset Details
| Property | Value |
|----------|-------|
| **Sequences** | 6,474,097 |
| **Sequence Length** | 2048 |
| **Tokenizer** | `./vn_spm_v3_fast2/` |
| **Total Tokens** | 13,258,950,332 |
| **Shards** | 13 |
| **Created** | 2025-12-10 |
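The figures in the table can be cross-checked with simple arithmetic. The sketch below multiplies the sequence count by the sequence length and compares the product with the reported token total. Reading the small gap as padding positions is an assumption, suggested by the `attention_mask` field of this dataset but not confirmed by the card.

```python
# Figures copied from the Dataset Details table.
num_sequences = 6_474_097
sequence_length = 2048
reported_total_tokens = 13_258_950_332

# Total token positions available across all packed sequences.
total_slots = num_sequences * sequence_length

# Positions not accounted for by the reported token count.
# Assumption: these are padding positions (attention_mask == 0).
unaccounted_slots = total_slots - reported_total_tokens

print(f"total slots:       {total_slots:,}")
print(f"reported tokens:   {reported_total_tokens:,}")
print(f"unaccounted slots: {unaccounted_slots:,}")
```

Under these numbers the packed sequences are almost entirely filled, so any padding should be negligible for training throughput.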
## Dataset Structure
Each sample contains:
- `input_ids`: List of token IDs (length: 2048)
- `attention_mask`: Attention mask (1 for real tokens, 0 for padding)
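As a rough illustration of how the two fields fit together, the sketch below validates a sample and derives loss labels that skip padding positions. The sample is synthetic, the helper names `validate_sample`, `count_real_tokens`, and `build_labels` are hypothetical, and the ignore value `-100` is an assumption borrowed from the default `ignore_index` of PyTorch's cross-entropy loss; the card does not prescribe any of these.

```python
from typing import Dict, List

SEQUENCE_LENGTH = 2048  # from the Dataset Details table
IGNORE_INDEX = -100     # assumption: PyTorch's default ignore_index, not stated by the card

Sample = Dict[str, List[int]]


def validate_sample(sample: Sample) -> None:
    """Raise ValueError if the sample does not match the documented structure."""
    for key in ("input_ids", "attention_mask"):
        if len(sample[key]) != SEQUENCE_LENGTH:
            raise ValueError(f"{key} has length {len(sample[key])}, expected {SEQUENCE_LENGTH}")
    if any(m not in (0, 1) for m in sample["attention_mask"]):
        raise ValueError("attention_mask must contain only 0 and 1")


def count_real_tokens(sample: Sample) -> int:
    """Count positions marked as real tokens (mask value 1)."""
    return sum(sample["attention_mask"])


def build_labels(sample: Sample) -> List[int]:
    """Copy input_ids, replacing padding positions with IGNORE_INDEX."""
    return [
        token if mask == 1 else IGNORE_INDEX
        for token, mask in zip(sample["input_ids"], sample["attention_mask"])
    ]


# Synthetic sample: 2044 real tokens followed by 4 padding positions.
sample: Sample = {
    "input_ids": list(range(1, 2045)) + [0] * 4,
    "attention_mask": [1] * 2044 + [0] * 4,
}

validate_sample(sample)
labels = build_labels(sample)
print(count_real_tokens(sample), labels[-1])
```

Masking labels this way keeps padding out of the loss; whether a given training loop also needs the labels shifted by one position depends on the framework in use.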
## Usage
```python
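from datasets import load_dataset

# Download the dataset (or reuse the local cache), then select the train split.
dataset = load_dataset("tvu-vlinhd11/pretrain-dataset-T2048-13B")
train_data = dataset["train"]

# Each sample is a dict holding `input_ids` and `attention_mask`.
sample = train_data[0]
input_ids = sample["input_ids"]
attention_mask = sample["attention_mask"]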
```
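With roughly 13 billion tokens, downloading everything before inspecting a sample may be impractical. A minimal sketch of an alternative, assuming the `streaming=True` option of `datasets.load_dataset` and the `train` split used in the snippet above; the helper names `stream_train_split` and `take` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

REPO_ID = "tvu-vlinhd11/pretrain-dataset-T2048-13B"


def stream_train_split(repo_id: str = REPO_ID) -> Iterator[Dict[str, Any]]:
    """Iterate over the train split without downloading it in full."""
    # Imported lazily so that `take` can be used without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches data on demand.
    return iter(load_dataset(repo_id, split="train", streaming=True))


def take(samples: Iterable[Any], n: int) -> List[Any]:
    """Collect at most the first n items from any iterable."""
    return list(islice(samples, n))
```

Calling `take(stream_train_split(), 2)` would fetch the first two samples; it requires network access and the `datasets` package.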
## License
Apache 2.0
Provider:
tvu-vlinhd11



