trungbb8/vietnamese-news-copus-segmented
Hugging Face; updated 2026-04-04; indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/trungbb8/vietnamese-news-copus-segmented
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 25992880995
    num_examples: 7770638
  download_size: 0
  dataset_size: 25992880995
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
task_categories:
- text-generation
- fill-mask
language:
- vi
tags:
- vietnamese
- vietnamese-news
pretty_name: Vietnamese News Corpus (Cleaned & Segmented)
size_categories:
- 1M<n<10M
---
# Dataset Card for vietnamese-news-copus-segmented
### Dataset Summary
This dataset is a refined collection of Vietnamese news articles sourced from `ademax/binhvq-news-corpus`. The articles were cleaned, normalized, and word-segmented by an automated pipeline. The corpus is intended for training Vietnamese language models, word embeddings, or text classifiers.
* **Original Source:** `ademax/binhvq-news-corpus`
* **Language:** Vietnamese (vi)
* **Format:** Word-segmented text (using VnCoreNLP)
### Data Processing Pipeline
The dataset was constructed using the following automated pipeline:
1. **HTML Cleaning:** Stripped all HTML tags and boilerplate using BeautifulSoup.
2. **Text Normalization:**
   * Applied **Unicode NFC** normalization.
   * Standardized punctuation (quotes, dashes, ellipses).
   * Removed control characters and normalized whitespace.
3. **Boilerplate & Signature Removal:**
   * Used regular expressions to remove common "See more", "Source", and "Photo by" patterns.
   * Applied heuristic rules to detect and strip journalist names, contributor info (CTV), and editorial signatures at the end of articles.
4. **Length Filtering:** Retained only articles between **500** and **20,000** characters long.
5. **Deduplication:** Removed exact duplicates using MD5 hashing. Near-duplicate articles are not detected by this step.
6. **Word Segmentation:** Segmented compound words using **VnCoreNLP** (e.g., `trí tuệ nhân tạo` becomes `trí_tuệ nhân_tạo`).
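
The normalization, length-filtering, and deduplication steps above can be sketched roughly as follows. This is a minimal illustration and not the pipeline's actual code: the function names `normalize` and `clean_corpus` are hypothetical, the sketch assumes the length filter is inclusive and runs on normalized text, and it assumes the MD5 hash is taken over the normalized UTF-8 text. HTML cleaning, punctuation standardization, signature removal, and word segmentation are left out.

```python
import hashlib
import re
import unicodedata

# Bounds taken from the length-filtering step; treating them as inclusive is an assumption.
MIN_CHARS, MAX_CHARS = 500, 20_000


def normalize(text: str) -> str:
    """NFC-normalize, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Drop Unicode control characters (category "Cc"), keeping newline, tab, and space.
    text = "".join(
        ch for ch in text
        if ch in "\n\t " or unicodedata.category(ch) != "Cc"
    )
    # Collapse any run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(articles):
    """Yield normalized articles that pass the length filter, skipping exact duplicates."""
    seen = set()
    for raw in articles:
        text = normalize(raw)
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):
            continue
        # Hashing the normalized text is an assumption about what the pipeline hashes.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Because only digests are kept in `seen`, memory grows with the number of unique articles rather than with their total length.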
### Dataset Structure
The dataset contains a single column:
* `text` (string): The final cleaned and word-segmented Vietnamese text.
### Usage
You can load this dataset using the Hugging Face `datasets` library:
```python
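from datasets import load_dataset

# Loads the full train split; the card metadata lists dataset_size as about 26 GB.
dataset = load_dataset("trungbb8/vietnamese-news-copus-segmented")
# Print the text of the first article.
print(dataset["train"][0]["text"])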
```
### Technical Specifications
* **Segmentation Tool:** `VnCoreNLP` (`wseg` annotator).
* **Storage:** Sharded into files of at most 500 MB each for efficient streaming and downloading.
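
Since the data is sharded, it may be more practical to stream the corpus than to download it in full. The sketch below uses the `streaming=True` option of `load_dataset` from the `datasets` library. The helper names `take` and `stream_train` are hypothetical and exist only for this illustration.

```python
from itertools import islice

REPO_ID = "trungbb8/vietnamese-news-copus-segmented"


def take(rows, n):
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(rows, n))


def stream_train(repo_id=REPO_ID):
    """Open the train split lazily; data is fetched as rows are read."""
    # Imported here so that take() stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)
```

With network access, `take(stream_train(), 3)` would return the first three rows as dictionaries, each with a `text` key.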
### Limitations
* **Heuristics:** The cleaning rules are heuristic. Unconventional author signatures may remain, and very short concluding sentences may be removed by mistake.
* **Segmentation:** If your model requires syllable-level input, simply replace underscores (`_`) with spaces.
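
The underscore replacement mentioned above can be done in one line. The helper name `to_syllables` is hypothetical. Note that any underscore present in the original article text (for example inside a URL or identifier) would also be turned into a space.

```python
def to_syllables(segmented: str) -> str:
    """Replace word-segmentation underscores with spaces."""
    return segmented.replace("_", " ")


print(to_syllables("trí_tuệ nhân_tạo"))  # trí tuệ nhân tạo
```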
### License
The card metadata declares `apache-2.0`. This dataset is derived from `ademax/binhvq-news-corpus`, so users should also refer to the original source's licensing terms.
**Note:** This dataset was uploaded using an automated pipeline. For any inquiries, please contact the repository owner.
Provider:
trungbb8



