hirine/wikipedia-vietnamese-1M296K-dataset
Hugging Face | Updated: 2025-12-24 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/hirine/wikipedia-vietnamese-1M296K-dataset
Resource overview:
---
license: cc-by-sa-4.0
language:
- vi
tags:
- wikipedia
- vietnamese
- text
size_categories:
- 1M<n<10M
---
# Vietnamese Wikipedia Dataset
Vietnamese Wikipedia articles extracted from the Vietnamese Wikipedia dump.
## Dataset Details
| Property | Value |
|----------|-------|
| **Records** | 1,296,303 |
| **Size** | ~1.5 GB |
| **Language** | Vietnamese |
| **Last Updated** | 16/12/2025 |
## Schema
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Article ID (e.g., `wiki_000001`) |
| `title` | string | Article title |
| `text` | string | Full article content (cleaned) |
| `source` | string | Always `wikipedia` |
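A quick sanity check against this schema can be sketched as follows. The helper name `schema_problems` and the ID pattern `wiki_` followed by digits are assumptions inferred from the single example `wiki_000001`; the card does not specify the exact ID format.

```python
import re
from typing import Any, List, Mapping

# Field names come from the schema table; all four are documented as strings.
EXPECTED_FIELDS = ("id", "title", "text", "source")

# Assumption: IDs look like "wiki_" plus digits, based on the example wiki_000001.
ID_PATTERN = re.compile(r"wiki_\d+")


def schema_problems(record: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field is not a string: {field}")

    record_id = record.get("id")
    if isinstance(record_id, str) and not ID_PATTERN.fullmatch(record_id):
        problems.append(f"unexpected id format: {record_id}")

    # The schema states that `source` is always "wikipedia".
    source = record.get("source")
    if isinstance(source, str) and source != "wikipedia":
        problems.append(f"unexpected source: {source}")
    return problems
```

Running the helper over a small sample of records is usually enough to catch a mismatched column name or a changed ID scheme before a long training job starts.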
## Usage
```python
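from datasets import load_dataset
# Repository ID taken from this dataset's Hugging Face page.
ds = load_dataset("hirine/wikipedia-vietnamese-1M296K-dataset")
# Print the first record of the "train" split (fields: id, title, text, source).
print(ds["train"][0])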
```
## Preprocessing
- Removed wiki template tags (`<templatestyles>`, `__NOEDITSECTION__`, etc.)
- Removed `<ref>` tags
- Cleaned excessive whitespace
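
The exact cleaning code is not published in this card, so the following is only a minimal sketch of how the three steps above could be implemented with regular expressions. The function name `clean_wiki_text` and every pattern are assumptions, as is the choice to drop the citation text enclosed in `<ref>...</ref>` along with the tags.

```python
import re

# Hypothetical patterns; the dataset author's actual rules may differ.
TEMPLATESTYLES_RE = re.compile(r"<templatestyles\b[^>]*/?>", re.IGNORECASE)
MAGIC_WORD_RE = re.compile(r"__[A-Z]+__")  # e.g. __NOEDITSECTION__
REF_SELF_CLOSING_RE = re.compile(r"<ref\b[^>]*/>", re.IGNORECASE)
REF_PAIR_RE = re.compile(r"<ref\b[^>/]*>.*?</ref>", re.IGNORECASE | re.DOTALL)


def clean_wiki_text(raw: str) -> str:
    """Strip template tags, <ref> citations, and excess whitespace."""
    # Remove self-closing refs first so they are not mistaken for openers.
    text = REF_SELF_CLOSING_RE.sub("", raw)
    # Assumption: the citation body inside <ref>...</ref> is dropped as well.
    text = REF_PAIR_RE.sub("", text)
    text = TEMPLATESTYLES_RE.sub("", text)
    text = MAGIC_WORD_RE.sub("", text)
    # Collapse runs of spaces/tabs, trim spaces around newlines,
    # and limit consecutive blank lines to one.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Because the `text` field is already cleaned, a function like this is mainly useful for processing a fresh Wikipedia dump in a comparable way.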
Provider:
hirine



