undertheseanlp/UVW-2026
Source: Hugging Face · Updated: 2026-01-31 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/undertheseanlp/UVW-2026
---
language:
- vi
license: cc-by-sa-4.0
task_categories:
- text-generation
- fill-mask
- text-classification
- feature-extraction
- sentence-similarity
tags:
- wikipedia
- vietnamese
- nlp
- underthesea
- wikidata
- pretraining
- language-modeling
pretty_name: UVW 2026 - Vietnamese Wikipedia Dataset
size_categories:
- 1M<n<10M
source_datasets:
- original
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: content
    dtype: string
  - name: num_chars
    dtype: int32
  - name: num_sentences
    dtype: int32
  - name: quality_score
    dtype: int32
  - name: wikidata_id
    dtype: string
  - name: main_category
    dtype: string
  splits:
  - name: train
    num_examples: 894579
  - name: validation
    num_examples: 111822
  - name: test
    num_examples: 111823
configs:
- config_name: default
  data_files:
  - split: train
    path: train.parquet
  - split: validation
    path: validation.parquet
  - split: test
    path: test.parquet
---
# UVW 2026: Underthesea Vietnamese Wikipedia Dataset
<div align="center">
[License: CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
[Vietnamese Wikipedia](https://vi.wikipedia.org)
[Wikidata](https://www.wikidata.org)
</div>
## Dataset Description
**UVW 2026** (Underthesea Vietnamese Wikipedia) is a cleaned dataset of Vietnamese Wikipedia articles enriched with Wikidata metadata. It is designed for Vietnamese NLP research, including language modeling, text generation, text classification, named entity recognition, and model pretraining.
### Key Features
- **Clean text**: Wikipedia markup, templates, references, and formatting removed
- **Wikidata integration**: Articles linked to Wikidata entities with semantic categories
- **Quality scoring**: Each article scored 1-10 based on content quality metrics
- **Unicode normalized**: NFC normalization applied for consistent text processing
- **Ready to use**: Pre-split into train/validation/test sets
### Dataset Summary
| Property | Value |
|----------|-------|
| **Language** | Vietnamese (vi) |
| **Source** | Vietnamese Wikipedia + Wikidata |
| **License** | CC BY-SA 4.0 |
| **Generated** | 2026-01-31 |
| **Total Articles** | 1,118,224 |
| **Wikidata Coverage** | 99.4% |
| **Category Coverage** | 97.0% |
| **Unique Categories** | 11,549 |
| **Avg. Characters** | 1,190 |
| **Avg. Sentences** | 10 |
## Quick Start
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("undertheseanlp/UVW-2026")
# Access splits
train = dataset["train"]
validation = dataset["validation"]
test = dataset["test"]
# View an example
print(train[0])
```
## Dataset Structure
### Data Splits
| Split | Examples | Description |
|-------|----------|-------------|
| `train` | 894,579 | Training set (80%) |
| `validation` | 111,822 | Validation set (10%) |
| `test` | 111,823 | Test set (10%) |
### Schema
```json
{
  "id": "Việt_Nam",
  "title": "Việt Nam",
  "content": "Việt Nam, tên chính thức là Cộng hòa Xã hội chủ nghĩa Việt Nam...",
  "num_chars": 45000,
  "num_sentences": 500,
  "quality_score": 9,
  "wikidata_id": "Q881",
  "main_category": "quốc gia có chủ quyền"
}
```
### Field Descriptions
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique article identifier (URL-safe title) |
| `title` | string | Human-readable article title |
| `content` | string | Cleaned article text content |
| `num_chars` | int32 | Character count of content |
| `num_sentences` | int32 | Estimated sentence count |
| `quality_score` | int32 | Quality score from 1 (lowest) to 10 (highest) |
| `wikidata_id` | string | Wikidata Q-identifier (e.g., "Q881" for Vietnam) |
| `main_category` | string | Primary category from Wikidata P31 (instance of) |
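
As a minimal sketch of the schema above, the following helper checks that a record carries the documented fields with the expected Python types. The names `EXPECTED_FIELDS`, `NULLABLE_FIELDS`, and `validate_record` are hypothetical, and the assumption that a missing Wikidata ID or category may appear as `None` is not confirmed by the dataset card.

```python
# Expected Python type for each documented field (int32 columns load as int).
EXPECTED_FIELDS = {
    "id": str,
    "title": str,
    "content": str,
    "num_chars": int,
    "num_sentences": int,
    "quality_score": int,
    "wikidata_id": str,
    "main_category": str,
}

# Assumption: these fields may be None for articles without Wikidata metadata.
NULLABLE_FIELDS = {"wikidata_id", "main_category"}


def validate_record(record):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        if value is None and name in NULLABLE_FIELDS:
            continue
        if not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    # The card documents quality_score as an integer from 1 to 10.
    score = record.get("quality_score")
    if isinstance(score, int) and not 1 <= score <= 10:
        problems.append(f"quality_score out of range: {score}")
    return problems
```

Calling `validate_record(train[0])` on a loaded split should return an empty list if the record matches the schema.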
## Usage Examples
### Filter High-Quality Articles
```python
# Get articles with quality score >= 7
high_quality = dataset["train"].filter(lambda x: x["quality_score"] >= 7)
print(f"High-quality articles: {len(high_quality):,}")
```
### Filter by Category
```python
# Get articles about people
people = dataset["train"].filter(lambda x: x["main_category"] == "người")
print(f"Articles about people: {len(people):,}")
# Get articles about locations
locations = dataset["train"].filter(
    lambda x: "khu định cư" in (x["main_category"] or "")
)
```
### Filter by Wikidata
```python
# Get articles with Wikidata links (treats both None and "" as missing)
with_wikidata = dataset["train"].filter(lambda x: bool(x["wikidata_id"]))
# Lookup specific entity
vietnam = dataset["train"].filter(lambda x: x["wikidata_id"] == "Q881")
```
### Use for Language Modeling
```python
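from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

def tokenize(examples):
    # PhoBERT accepts at most 256 tokens per sequence, so truncate to that limit.
    # PhoBERT was pretrained on word-segmented text; segment `content` first for best results.
    return tokenizer(examples["content"], truncation=True, max_length=256)

tokenized = dataset["train"].map(tokenize, batched=True)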
```
## Quality Score
Articles are scored 1-10 based on multiple factors:
| Component | Weight | Criteria |
|-----------|--------|----------|
| **Length** | 40% | Character count (200 - 100,000 optimal) |
| **Sentences** | 30% | Sentence count (3 - 1,000 optimal) |
| **Density** | 30% | Avg sentence length (80-150 chars optimal) |
| **Wikidata bonus** | +0.5 | Has wikidata_id |
| **Category bonus** | +0.5 | Has main_category |
| **Markup penalty** | -1 to -3 | Remaining Wikipedia markup |
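
The table gives the weights and optimal ranges but not how each component is graded or how the weighted sum maps onto the 1-10 scale. The sketch below shows one possible way to combine them, under three loud assumptions: a component earns full credit inside its optimal range and none outside, the weighted sum is scaled by 10, and the result is rounded and clamped to 1-10. The function names are hypothetical, and this will not reproduce the published scores.

```python
def in_range(value, low, high):
    """Assumed component grade: 1.0 inside the optimal range, 0.0 outside."""
    return 1.0 if low <= value <= high else 0.0


def sketch_quality_score(num_chars, num_sentences, has_wikidata, has_category,
                         markup_penalty=0):
    # Average sentence length in characters; guard against zero sentences.
    density = num_chars / num_sentences if num_sentences else 0.0

    # Weights and optimal ranges are taken from the table above.
    weighted = (
        0.40 * in_range(num_chars, 200, 100_000)
        + 0.30 * in_range(num_sentences, 3, 1_000)
        + 0.30 * in_range(density, 80, 150)
    )

    # Assumption: the weighted sum in [0, 1] is scaled to a 0-10 base score.
    score = 10 * weighted
    score += 0.5 if has_wikidata else 0.0
    score += 0.5 if has_category else 0.0
    score -= markup_penalty  # 1 to 3 when leftover markup is detected, else 0

    # Assumption: round to an integer and clamp to the documented 1-10 range.
    return max(1, min(10, round(score)))
```

A graded (rather than all-or-nothing) component function would likely be closer to the real pipeline, since the published distribution concentrates around scores 4 to 6.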
### Quality Distribution
| Score | Count | Percentage |
|-------|------:|----------:|
| 1 | 134 | 0.0% |
| 2 | 376 | 0.0% |
| 3 | 28,267 | 2.5% |
| 4 | 607,081 | 54.3% |
| 5 | 208,304 | 18.6% |
| 6 | 134,385 | 12.0% |
| 7 | 70,345 | 6.3% |
| 8 | 57,054 | 5.1% |
| 9 | 9,649 | 0.9% |
| 10 | 2,629 | 0.2% |
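
To judge how much data a quality threshold keeps, the counts above can be turned into cumulative shares. This is plain arithmetic on the table (which covers all splits together); the helper name `share_at_least` is illustrative.

```python
# Counts per quality score, copied from the table above.
counts = {1: 134, 2: 376, 3: 28_267, 4: 607_081, 5: 208_304,
          6: 134_385, 7: 70_345, 8: 57_054, 9: 9_649, 10: 2_629}

total = sum(counts.values())  # equals the 1,118,224 total articles


def share_at_least(threshold):
    """Fraction of all articles whose quality score is >= threshold."""
    kept = sum(n for score, n in counts.items() if score >= threshold)
    return kept / total


for threshold in (5, 7, 9):
    print(f"score >= {threshold}: {share_at_least(threshold):.1%}")
```

For example, a filter at `quality_score >= 7` keeps 139,677 articles, about 12.5% of the corpus.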
## Top Categories
| Category (Vietnamese) | Count | Percentage |
|----------------------|------:|----------:|
| đơn vị phân loại | 618,281 | 55.3% |
| người | 78,191 | 7.0% |
| xã của Pháp | 35,635 | 3.2% |
| khu định cư | 20,276 | 1.8% |
| village of Turkey | 18,540 | 1.7% |
| tiểu hành tinh | 17,891 | 1.6% |
| mahalle | 16,419 | 1.5% |
| xã của Việt Nam | 7,088 | 0.6% |
| đô thị của Ý | 6,700 | 0.6% |
| trang định hướng Wikimedia | 6,202 | 0.6% |
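
A table like the one above can be recomputed from any split with a simple counter. The sketch below works on any iterable of records; `top_categories` is a hypothetical helper, and treating a missing category as `None` or an empty string is an assumption.

```python
from collections import Counter


def top_categories(records, k=10):
    """Return the k most common main_category values as (name, count, share)."""
    counter = Counter()
    total = 0
    for record in records:
        total += 1
        # Assumption: articles without a category carry None or an empty string.
        category = record.get("main_category") or "(none)"
        counter[category] += 1
    return [(name, n, n / total) for name, n in counter.most_common(k)]
```

With the dataset loaded as in Quick Start, `top_categories(dataset["train"])` iterates over full records; counting only the `main_category` column is likely faster on a corpus of this size.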
## Data Processing
### Pipeline Steps
1. **Download**: Fetch Vietnamese Wikipedia XML dump from Wikimedia
2. **Extract**: Parse XML and extract article content
3. **Clean**: Remove Wikipedia markup (templates, refs, links, tables, categories)
4. **Normalize**: Apply Unicode NFC normalization
5. **Score**: Calculate quality metrics for each article
6. **Enrich**: Add Wikidata IDs and semantic categories via Wikidata API
7. **Filter**: Remove special pages, redirects, disambiguation, and short articles (<100 chars)
8. **Split**: Create train/validation/test splits (80/10/10) with seed=42
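
Steps 4 and 8 can be sketched with the standard library alone. The code below is not the project's script: the function names are hypothetical, and the shuffle method and rounding convention are assumptions. The rounding used here does reproduce the published split sizes for 1,118,224 articles.

```python
import random
import unicodedata


def normalize_nfc(text):
    """Apply Unicode NFC normalization (step 4)."""
    return unicodedata.normalize("NFC", text)


def split_sizes(n):
    """Sizes for an 80/10/10 split; the test split takes the remainder."""
    n_train = n * 8 // 10
    n_val = n // 10
    return n_train, n_val, n - n_train - n_val


def split_indices(n, seed=42):
    """Shuffle range(n) with a fixed seed and cut it 80/10/10 (step 8)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train, n_val, _ = split_sizes(n)
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, validation, test
```

NFC matters for Vietnamese because a letter such as "ệ" can be stored either as one code point or as a base letter plus combining marks; normalizing makes string comparison and tokenization consistent.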
### Removed Content
- Wikipedia templates (`{{...}}`)
- References and citations (`<ref>...</ref>`)
- HTML tags and comments
- Category links (`[[Thể loại:...]]`)
- File/image links (`[[Tập tin:...]]`, `[[File:...]]`)
- Interwiki links
- Tables (`{| ... |}`)
- Infoboxes and navigation templates
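
As a rough illustration of how the items above can be stripped, here is a regex-based sketch. It is not the project's cleaner: the patterns and the name `clean_wikitext` are assumptions, it also drops bold/italic quote markup, and it does not handle interwiki links or file links whose captions contain nested links.

```python
import re

# Innermost {{...}} templates; applied repeatedly to peel nested templates.
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
# Self-closing <ref ... /> first, then paired <ref>...</ref> citations.
REF = re.compile(r"<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
# Category and file links without nested brackets.
NAMESPACE_LINK = re.compile(r"\[\[(?:Thể loại|Tập tin|File):[^\[\]]*\]\]", re.IGNORECASE)
# [[target|label]] -> label, [[target]] -> target
WIKILINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]")
TAG = re.compile(r"<[^>]+>")


def clean_wikitext(text):
    """Strip common wiki markup and return plain text."""
    text = COMMENT.sub("", text)
    text = REF.sub("", text)
    # Remove templates from the inside out until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub("", text)
    text = TABLE.sub("", text)
    text = NAMESPACE_LINK.sub("", text)
    text = WIKILINK.sub(r"\1", text)
    text = TAG.sub("", text)
    text = re.sub(r"'{2,}", "", text)  # bold and italic quote markup
    return re.sub(r"[ \t]+", " ", text).strip()
```

Templates are removed before tables because a template ending in `|}}` would otherwise be mistaken for a table terminator.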
### Reproduction
```bash
git clone https://github.com/undertheseanlp/UVW-2026
cd UVW-2026
uv sync --extra huggingface
# Run full pipeline
uv run python scripts/build_dataset.py
# Or run individual steps
uv run python scripts/download_wikipedia.py
uv run python scripts/extract_articles.py
uv run python scripts/wikipedia_quality_score.py
uv run python scripts/add_wikidata.py
uv run python scripts/create_splits.py
uv run python scripts/prepare_huggingface.py --push
```
## Citation
```bibtex
@dataset{uvw2026,
title = {UVW 2026: Underthesea Vietnamese Wikipedia Dataset},
author = {Underthesea NLP},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/undertheseanlp/UVW-2026},
note = {Vietnamese Wikipedia articles enriched with Wikidata metadata}
}
```
## Related Resources
- [Underthesea](https://github.com/undertheseanlp/underthesea) - Vietnamese NLP Toolkit
- [PhoBERT](https://github.com/VinAIResearch/PhoBERT) - Pre-trained language models for Vietnamese
- [Vietnamese Wikipedia](https://vi.wikipedia.org)
- [Wikidata](https://www.wikidata.org)
## License
This dataset is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/), consistent with the Wikipedia content license.
---
<div align="center">
Made with ❤️ by <a href="https://github.com/undertheseanlp">Underthesea NLP</a>
</div>
Provider: undertheseanlp



