tvu-vlinhd11/pretrain-dataset-T4096-10M
Source: Hugging Face | Updated: 2025-12-10 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/tvu-vlinhd11/pretrain-dataset-T4096-10M
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
- vi
tags:
- pretrain
- tokenized
- packed-sequences
size_categories:
- 1M<n<10M
---
# Pretrain Dataset (Tokenized)
This dataset contains tokenized and packed sequences ready for LLM pretraining.
## Dataset Details
| Property | Value |
|----------|-------|
| **Sequences** | 3,237,049 |
| **Sequence Length** | 4096 |
| **Tokenizer** | `./vn_spm_v3_fast2/` |
| **Total Tokens** | 13,258,950,332 |
| **Shards** | 7 |
| **Created** | 2025-12-10 |
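
As a quick sanity check on the table, the sequence count times the sequence length can be compared with the reported token total. The sketch below only uses the figures from the table. Reading the small gap as padding positions is an assumption; the card does not say how the total was counted.

```python
# Figures copied from the Dataset Details table.
SEQUENCES = 3_237_049
SEQ_LEN = 4096
TOTAL_TOKENS = 13_258_950_332

# Number of token positions if every sequence were completely full.
capacity = SEQUENCES * SEQ_LEN

# Positions not accounted for by the reported total
# (assumed, not confirmed, to be padding).
shortfall = capacity - TOTAL_TOKENS

print(f"capacity:  {capacity:,}")   # capacity:  13,258,952,704
print(f"shortfall: {shortfall:,}")  # shortfall: 2,372
```

Under that reading, roughly 2,372 of about 13.26 billion positions would be padding, which is consistent with tightly packed sequences.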
## Dataset Structure
Each sample contains:
- `input_ids`: List of token IDs (length: 4096)
- `attention_mask`: Attention mask (1 for real tokens, 0 for padding)
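
The structure described above can be checked per sample with a few lines of plain Python. This is a minimal sketch: `check_sample` is a hypothetical helper, not part of the dataset or of the `datasets` library, and it assumes each sample behaves like a dict of two integer lists.

```python
SEQ_LEN = 4096  # sequence length stated in the dataset card


def check_sample(sample, seq_len=SEQ_LEN):
    """Validate one sample and return its number of real (non-padding) tokens."""
    input_ids = sample["input_ids"]
    attention_mask = sample["attention_mask"]

    # Both fields are expected to have exactly seq_len entries.
    if len(input_ids) != seq_len or len(attention_mask) != seq_len:
        raise ValueError(
            f"expected length {seq_len}, got "
            f"{len(input_ids)} ids and {len(attention_mask)} mask entries"
        )

    # The mask should contain only 1 (real token) or 0 (padding).
    if any(m not in (0, 1) for m in attention_mask):
        raise ValueError("attention_mask contains values other than 0 and 1")

    return sum(attention_mask)


# Usage with a synthetic sample (not real data from the dataset).
fake = {"input_ids": [5] * 4096, "attention_mask": [1] * 4000 + [0] * 96}
print(check_sample(fake))
```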
## Usage
```python
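from datasets import load_dataset

# Download (or reuse the cached copy of) the dataset from the Hugging Face Hub.
dataset = load_dataset("tvu-vlinhd11/pretrain-dataset-T4096-10M")
train_data = dataset["train"]

# Each sample is a dict holding two lists of length 4096.
sample = train_data[0]
input_ids = sample["input_ids"]
attention_mask = sample["attention_mask"]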
```
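
For causal language-model training, one common way to use the two fields is to copy `input_ids` into labels and mask out padding positions. The sketch below assumes a training setup that ignores label `-100` in the loss (the default `ignore_index` of PyTorch's `CrossEntropyLoss`). The card does not prescribe any training recipe, and `make_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # assumed convention: positions with this label are skipped by the loss


def make_labels(sample):
    """Return labels equal to input_ids, with padding positions set to IGNORE_INDEX."""
    return [
        token_id if keep == 1 else IGNORE_INDEX
        for token_id, keep in zip(sample["input_ids"], sample["attention_mask"])
    ]


# Usage with a tiny synthetic sample (real samples have 4096 entries).
tiny = {"input_ids": [7, 8, 9, 0], "attention_mask": [1, 1, 1, 0]}
print(make_labels(tiny))  # [7, 8, 9, -100]
```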
## License
Apache 2.0
Provider:
tvu-vlinhd11



