
undertheseanlp/UVW-2026

Source: Hugging Face. Updated 2026-01-31. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/undertheseanlp/UVW-2026
Resource description:
---
language:
- vi
license: cc-by-sa-4.0
task_categories:
- text-generation
- fill-mask
- text-classification
- feature-extraction
- sentence-similarity
tags:
- wikipedia
- vietnamese
- nlp
- underthesea
- wikidata
- pretraining
- language-modeling
pretty_name: UVW 2026 - Vietnamese Wikipedia Dataset
size_categories:
- 1M<n<10M
source_datasets:
- original
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: content
    dtype: string
  - name: num_chars
    dtype: int32
  - name: num_sentences
    dtype: int32
  - name: quality_score
    dtype: int32
  - name: wikidata_id
    dtype: string
  - name: main_category
    dtype: string
  splits:
  - name: train
    num_examples: 894579
  - name: validation
    num_examples: 111822
  - name: test
    num_examples: 111823
configs:
- config_name: default
  data_files:
  - split: train
    path: train.parquet
  - split: validation
    path: validation.parquet
  - split: test
    path: test.parquet
---

# UVW 2026: Underthesea Vietnamese Wikipedia Dataset

<div align="center">

[![License: CC BY-SA 4.0](https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-sa/4.0/)
[![Language: Vietnamese](https://img.shields.io/badge/Language-Vietnamese-blue.svg)](https://vi.wikipedia.org)
[![Wikidata Enriched](https://img.shields.io/badge/Wikidata-Enriched-green.svg)](https://www.wikidata.org)

</div>

## Dataset Description

**UVW 2026** (Underthesea Vietnamese Wikipedia) is a high-quality, cleaned dataset of Vietnamese Wikipedia articles enriched with Wikidata metadata. It is designed for Vietnamese NLP research, including language modeling, text generation, text classification, named entity recognition, and model pretraining.

### Key Features

- **Clean text**: Wikipedia markup, templates, references, and formatting removed
- **Wikidata integration**: Articles linked to Wikidata entities with semantic categories
- **Quality scoring**: Each article scored 1-10 based on content quality metrics
- **Unicode normalized**: NFC normalization applied for consistent text processing
- **Ready to use**: Pre-split into train/validation/test sets

### Dataset Summary

| Property | Value |
|----------|-------|
| **Language** | Vietnamese (vi) |
| **Source** | Vietnamese Wikipedia + Wikidata |
| **License** | CC BY-SA 4.0 |
| **Generated** | 2026-01-31 |
| **Total Articles** | 1,118,224 |
| **Wikidata Coverage** | 99.4% |
| **Category Coverage** | 97.0% |
| **Unique Categories** | 11,549 |
| **Avg. Characters** | 1,190 |
| **Avg. Sentences** | 10 |

## Quick Start

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("undertheseanlp/UVW-2026")

# Access splits
train = dataset["train"]
validation = dataset["validation"]
test = dataset["test"]

# View an example
print(train[0])
```

## Dataset Structure

### Data Splits

| Split | Examples | Description |
|-------|----------|-------------|
| `train` | 894,579 | Training set (80%) |
| `validation` | 111,822 | Validation set (10%) |
| `test` | 111,823 | Test set (10%) |

### Schema

```json
{
  "id": "Việt_Nam",
  "title": "Việt Nam",
  "content": "Việt Nam, tên chính thức là Cộng hòa Xã hội chủ nghĩa Việt Nam...",
  "num_chars": 45000,
  "num_sentences": 500,
  "quality_score": 9,
  "wikidata_id": "Q881",
  "main_category": "quốc gia có chủ quyền"
}
```

### Field Descriptions

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique article identifier (URL-safe title) |
| `title` | string | Human-readable article title |
| `content` | string | Cleaned article text content |
| `num_chars` | int32 | Character count of content |
| `num_sentences` | int32 | Estimated sentence count |
| `quality_score` | int32 | Quality score from 1 (lowest) to 10 (highest) |
| `wikidata_id` | string | Wikidata Q-identifier (e.g., "Q881" for Vietnam) |
| `main_category` | string | Primary category from Wikidata P31 (instance of) |

## Usage Examples

### Filter High-Quality Articles

```python
# Get articles with quality score >= 7
high_quality = dataset["train"].filter(lambda x: x["quality_score"] >= 7)
print(f"High-quality articles: {len(high_quality):,}")
```

### Filter by Category

```python
# Get articles about people
people = dataset["train"].filter(lambda x: x["main_category"] == "người")
print(f"Articles about people: {len(people):,}")

# Get articles about locations
locations = dataset["train"].filter(
    lambda x: "khu định cư" in (x["main_category"] or "")
)
```

### Filter by Wikidata

```python
# Get articles with Wikidata links
# (bool() treats both an empty string and None as "no link")
with_wikidata = dataset["train"].filter(lambda x: bool(x["wikidata_id"]))

# Look up a specific entity
vietnam = dataset["train"].filter(lambda x: x["wikidata_id"] == "Q881")
```

### Use for Language Modeling

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

def tokenize(examples):
    return tokenizer(examples["content"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True)
```

## Quality Score

Articles are scored 1-10 based on multiple factors:

| Component | Weight | Criteria |
|-----------|--------|----------|
| **Length** | 40% | Character count (200 - 100,000 optimal) |
| **Sentences** | 30% | Sentence count (3 - 1,000 optimal) |
| **Density** | 30% | Avg sentence length (80-150 chars optimal) |
| **Wikidata bonus** | +0.5 | Has wikidata_id |
| **Category bonus** | +0.5 | Has main_category |
| **Markup penalty** | -1 to -3 | Remaining Wikipedia markup |

### Quality Distribution

| Score | Count | Percentage |
|-------|------:|----------:|
| 1 | 134 | 0.0% |
| 2 | 376 | 0.0% |
| 3 | 28,267 | 2.5% |
| 4 | 607,081 | 54.3% |
| 5 | 208,304 | 18.6% |
| 6 | 134,385 | 12.0% |
| 7 | 70,345 | 6.3% |
| 8 | 57,054 | 5.1% |
| 9 | 9,649 | 0.9% |
| 10 | 2,629 | 0.2% |

## Top Categories

| Category (Vietnamese) | Count | Percentage |
|----------------------|------:|----------:|
| đơn vị phân loại | 618,281 | 55.3% |
| người | 78,191 | 7.0% |
| xã của Pháp | 35,635 | 3.2% |
| khu định cư | 20,276 | 1.8% |
| village of Turkey | 18,540 | 1.7% |
| tiểu hành tinh | 17,891 | 1.6% |
| mahalle | 16,419 | 1.5% |
| xã của Việt Nam | 7,088 | 0.6% |
| đô thị của Ý | 6,700 | 0.6% |
| trang định hướng Wikimedia | 6,202 | 0.6% |

## Data Processing

### Pipeline Steps

1. **Download**: Fetch Vietnamese Wikipedia XML dump from Wikimedia
2. **Extract**: Parse XML and extract article content
3. **Clean**: Remove Wikipedia markup (templates, refs, links, tables, categories)
4. **Normalize**: Apply Unicode NFC normalization
5. **Score**: Calculate quality metrics for each article
6. **Enrich**: Add Wikidata IDs and semantic categories via Wikidata API
7. **Filter**: Remove special pages, redirects, disambiguation, and short articles (<100 chars)
8. **Split**: Create train/validation/test splits (80/10/10) with seed=42

### Removed Content

- Wikipedia templates (`{{...}}`)
- References and citations (`<ref>...</ref>`)
- HTML tags and comments
- Category links (`[[Thể loại:...]]`)
- File/image links (`[[Tập tin:...]]`, `[[File:...]]`)
- Interwiki links
- Tables (`{| ... |}`)
- Infoboxes and navigation templates

### Reproduction

```bash
git clone https://github.com/undertheseanlp/UVW-2026
cd UVW-2026
uv sync --extra huggingface

# Run full pipeline
uv run python scripts/build_dataset.py

# Or run individual steps
uv run python scripts/download_wikipedia.py
uv run python scripts/extract_articles.py
uv run python scripts/wikipedia_quality_score.py
uv run python scripts/add_wikidata.py
uv run python scripts/create_splits.py
uv run python scripts/prepare_huggingface.py --push
```

## Citation

```bibtex
@dataset{uvw2026,
  title = {UVW 2026: Underthesea Vietnamese Wikipedia Dataset},
  author = {Underthesea NLP},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/undertheseanlp/UVW-2026},
  note = {Vietnamese Wikipedia articles enriched with Wikidata metadata}
}
```

## Related Resources

- [Underthesea](https://github.com/undertheseanlp/underthesea) - Vietnamese NLP Toolkit
- [PhoBERT](https://github.com/VinAIResearch/PhoBERT) - Pre-trained language models for Vietnamese
- [Vietnamese Wikipedia](https://vi.wikipedia.org)
- [Wikidata](https://www.wikidata.org)

## License

This dataset is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/), consistent with the Wikipedia content license.

---

<div align="center">
Made with ❤️ by <a href="https://github.com/undertheseanlp">Underthesea NLP</a>
</div>
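
The split step in the pipeline (80/10/10 with seed=42) can be illustrated with a small standard-library sketch. This is not the project's `scripts/create_splits.py`, whose logic is not shown here. The helper names `split_sizes` and `split_indices` are hypothetical, and the rounding rule (floor for train and validation, remainder to test) is an assumption, although it does reproduce the published split sizes for 1,118,224 articles.

```python
import random


def split_sizes(n):
    """Return (train, validation, test) sizes for an 80/10/10 split of n items.

    Assumption: train and validation are floored, test takes the remainder.
    """
    train = n * 8 // 10
    validation = n // 10
    test = n - train - validation
    return train, validation, test


def split_indices(n, seed=42):
    """Shuffle range(n) deterministically, then cut it into three parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    train, validation, _ = split_sizes(n)
    return (
        indices[:train],
        indices[train:train + validation],
        indices[train + validation:],
    )


print(split_sizes(1_118_224))  # (894579, 111822, 111823)
```

Because the shuffle uses its own seeded `random.Random` instance, repeated calls with the same `n` and `seed` give the same partition. The actual article-to-split assignment in the published dataset may still differ from this sketch.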
Provided by:
undertheseanlp