tvu-vlinhd11/pretrain-dataset-raw-10M
Source: Hugging Face. Updated: 2025-12-09. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/tvu-vlinhd11/pretrain-dataset-raw-10M
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
- vi
tags:
- pretrain
- text-corpus
- multilingual
size_categories:
- 1M<n<10M
---
# Pretrain Dataset (Text)
This dataset contains preprocessed text documents ready for LLM pretraining.
## Dataset Details
| Property | Value |
|----------|-------|
| **Documents** | 10,000,000 |
| **Processed** | 10,000,000 |
| **Shards** | 21 |
| **Created** | 2025-12-09 |
## Dataset Structure
Each sample contains:
- `text`: The document text
- `source`: Source dataset identifier
- `id`: Unique document ID
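
As a minimal sketch of the record layout above, the following checks that a record carries the three listed fields. Only the field names come from this card; the helper name `missing_fields` and the example values are hypothetical and used purely for illustration.

```python
# Field names come from the dataset card; everything else here is illustrative.
EXPECTED_FIELDS = ("text", "source", "id")


def missing_fields(record: dict) -> list[str]:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Hypothetical record; the values are made up, not taken from the dataset.
example = {"text": "Hello world.", "source": "example-source", "id": "doc-0"}

print(missing_fields(example))                # []
print(missing_fields({"text": "only text"}))  # ['source', 'id']
```

A check like this can run on the first few samples after loading, to catch schema surprises before a long preprocessing job starts.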
## Usage
```python
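from datasets import load_dataset

# Download the dataset from the Hugging Face Hub (cached after the first call)
dataset = load_dataset("tvu-vlinhd11/pretrain-dataset-raw-10M")
train_data = dataset["train"]

# Read the first document and its text
sample = train_data[0]
text = sample["text"]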
```
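
With 10,000,000 documents, downloading every shard before reading the first sample may be undesirable. The sketch below assumes the repository can be read through the `datasets` streaming mode (`streaming=True`) and that it exposes a `train` split, as in the snippet above. The helper names `stream_train_split` and `count_by_source` are hypothetical, and the demo records are made up.

```python
from collections import Counter
from itertools import islice
from typing import Iterable


def stream_train_split():
    """Open the train split lazily instead of downloading all shards first."""
    # Imported here so count_by_source below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "tvu-vlinhd11/pretrain-dataset-raw-10M", split="train", streaming=True
    )


def count_by_source(records: Iterable[dict], limit: int) -> Counter:
    """Count the `source` values over the first `limit` records."""
    return Counter(record["source"] for record in islice(records, limit))


# Works on any iterable of records; these two are made-up stand-ins.
demo = [
    {"text": "a", "source": "s1", "id": "0"},
    {"text": "b", "source": "s1", "id": "1"},
]
print(count_by_source(demo, limit=10))  # Counter({'s1': 2})
```

To run it against the real data, pass `stream_train_split()` as `records`, for example `count_by_source(stream_train_split(), limit=1000)`. This requires network access and the `datasets` package.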
## License
Apache 2.0
Provider:
tvu-vlinhd11



