trungbb8/vietnamese-news-copus-segmented

Name: trungbb8/vietnamese-news-copus-segmented
Creator: trungbb8
Published: 2026-04-04 10:08:21
License: No description provided

Source: Hugging Face. Updated: 2026-04-04. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/trungbb8/vietnamese-news-copus-segmented

Description:

---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 25992880995
    num_examples: 7770638
  download_size: 0
  dataset_size: 25992880995
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
task_categories:
- text-generation
- fill-mask
language:
- vi
tags:
- vietnamese
- vietnamese-news
pretty_name: Vietnamese News Corpus (Cleaned & Segmented)
size_categories:
- 1M<n<10M
---

# Dataset Card for vietnamese-news-copus-segmented

### Dataset Summary

This dataset is a refined collection of Vietnamese news articles, originally sourced from `ademax/binhvq-news-corpus`. It has been processed through a pipeline for cleaning, normalization, and word segmentation. It is suited to training Vietnamese language models (LLMs), word embeddings, or text classifiers.

* **Original Source:** `ademax/binhvq-news-corpus`
* **Language:** Vietnamese (vi)
* **Format:** Word-segmented text (using VnCoreNLP)

### Data Processing Pipeline

The dataset was constructed using the following automated pipeline:

1. **HTML Cleaning:** Stripped all HTML tags and boilerplate using BeautifulSoup.
2. **Text Normalization:**
   * Applied **Unicode NFC** normalization.
   * Standardized punctuation (quotes, dashes, ellipses).
   * Removed control characters and normalized whitespace.
3. **Boilerplate & Signature Removal:**
   * Used regular expressions to remove common "See more", "Source", and "Photo by" patterns.
   * Applied heuristic rules to detect and strip journalist names, contributor info (CTV), and editorial signatures at the end of articles.
4. **Length Filtering:** Retained only articles with character lengths between **500** and **20,000**.
5. **Deduplication:** Performed exact deduplication using MD5 hashing to ensure data uniqueness.
6. **Word Segmentation:** Segmented compound words using **VnCoreNLP** (e.g., `trí tuệ nhân tạo` becomes `trí_tuệ nhân_tạo`).
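
As a rough illustration of steps 2, 4, and 5 above, the following sketch uses only the Python standard library. It is not the author's pipeline code: the punctuation map, the function names `normalize` and `clean_corpus`, and the choice to hash the normalized text are assumptions made for this example. Only the NFC normalization, the 500 to 20,000 character bounds, and the use of MD5 for exact deduplication come from the card.

```python
import hashlib
import re
import unicodedata

# Length bounds stated in the dataset card.
MIN_CHARS = 500
MAX_CHARS = 20_000

# Hypothetical punctuation map: the card does not list the exact replacements.
PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis character
})


def normalize(text: str) -> str:
    """Apply NFC, standardize punctuation, drop control chars, tidy whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(PUNCT_MAP)
    # Remove control characters (category Cc) but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\s*\n\s*", "\n", text)   # collapse blank lines
    return text.strip()


def clean_corpus(articles):
    """Yield normalized articles that pass the length filter, without exact duplicates."""
    seen = set()
    for raw in articles:
        text = normalize(raw)
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):
            continue
        # Assumption: the hash is taken over the normalized text.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Hashing after normalization means that two copies of an article differing only in Unicode composition or quote style collapse to one entry. Whether the original pipeline hashed before or after normalization is not stated.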

### Dataset Structure

The dataset contains a single column:

* `text` (string): The final cleaned and word-segmented Vietnamese text.

### Usage

You can load this dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("trungbb8/vietnamese-news-copus-segmented")
print(dataset['train'][0]['text'])
```

### Technical Specifications

* **Segmentation Tool:** `VnCoreNLP` (`wseg` annotator).
* **Storage:** Sharded at max 500MB per file for efficient streaming and downloading.

### Limitations

* **Heuristics:** While the cleaning process is robust, some unconventional author signatures might remain or very short concluding sentences might be accidentally removed.
* **Segmentation:** If your model requires syllable-level input, simply replace underscores (`_`) with spaces.

### License

This dataset is derived from the `binhvq-news-corpus`. Users should refer to the original source's licensing terms.

**Note:** This dataset was uploaded using an automated pipeline. For any inquiries, please contact the repository owner.
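
The underscore convention mentioned under Limitations can be undone with a plain string replacement. The sketch below is illustrative only: the helper names `to_syllables` and `to_words` are hypothetical, and it assumes every underscore in the text was introduced by the segmenter. Literal underscores in the source articles (for example in URLs) would also become spaces.

```python
def to_syllables(segmented: str) -> str:
    """Convert word-segmented text back to syllable-level text."""
    return segmented.replace("_", " ")


def to_words(segmented: str) -> list[str]:
    """Split segmented text on whitespace so each compound stays one token."""
    return segmented.split()


example = "trí_tuệ nhân_tạo"
print(to_syllables(example))  # trí tuệ nhân tạo
print(to_words(example))      # ['trí_tuệ', 'nhân_tạo']
```

With the `datasets` library, a helper like this could be applied to every row through `Dataset.map`, for instance `dataset.map(lambda ex: {"text": to_syllables(ex["text"])})`.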

Provided by:

trungbb8
